cometadata/crossref-arxiv-citations

Name: cometadata/crossref-arxiv-citations
Creator: cometadata
Published: 2026-01-09 16:00:49
License: No description available

Hugging Face · Updated 2026-01-09 · Indexed 2026-03-29

Download link:

https://hf-mirror.com/datasets/cometadata/crossref-arxiv-citations


Resource overview:

---
license: cc0-1.0
task_categories:
- text-classification
- feature-extraction
language:
- en
tags:
- arxiv
- crossref
- citations
- preprints
- scholarly-communications
- bibliometrics
- doi
pretty_name: Crossref arXiv Citations
size_categories:
- 100K<n<1M
configs:
- config_name: all
  data_files:
  - split: train
    path: data/all.parquet
  default: true
- config_name: asserted
  data_files:
  - split: train
    path: data/asserted.parquet
- config_name: mined
  data_files:
  - split: train
    path: data/mined.parquet
---

# Crossref arXiv Citations

A dataset of arXiv preprints and their citations extracted from Crossref metadata, validated against DataCite records and DOI resolution.

## Dataset Description

This dataset maps arXiv works to the works in Crossref that cite them. Each record represents an arXiv preprint with all known citations from Crossref-registered works.

### Dataset Configurations

| Config | Description | arXiv Works | Citations |
|--------|-------------|-------------|-----------|
| `all` | All validated citations (default) | 923,302 | 5,121,642 |
| `asserted` | Citations where DOI was explicitly provided by publisher or matched by Crossref | 110,030 | 308,474 |
| `mined` | Citations where DOI was extracted from unstructured text | 891,035 | 4,813,168 |

### Loading the Dataset

```python
from datasets import load_dataset

# Load default (all citations)
dataset = load_dataset("cometadata/crossref-arxiv-citations")

# Load specific config
asserted = load_dataset("cometadata/crossref-arxiv-citations", "asserted")
mined = load_dataset("cometadata/crossref-arxiv-citations", "mined")
```

## Data Schema

Each record contains:

| Field | Type | Description |
|-------|------|-------------|
| `arxiv_doi` | string | The DOI for the arXiv work (format: `10.48550/arXiv.{id}`) |
| `arxiv_id` | string | The arXiv identifier (e.g., `1412.6980`) |
| `reference_count` | integer | Total number of reference instances |
| `citation_count` | integer | Number of unique citing works |
| `cited_by` | array | List of citing work objects |

Each citing work object in `cited_by` contains:

| Field | Type | Description |
|-------|------|-------------|
| `doi` | string | DOI of the citing work |
| `provenance` | string | How the citation was obtained (see below) |
| `matches` | array | List of reference matches from this citing work |

### Provenance Values

Each citation includes a `provenance` field indicating how the DOI was obtained:

- `publisher` - DOI was explicitly provided by the publisher in the reference metadata
- `crossref` - DOI was matched/validated by Crossref
- `mined` - DOI was extracted from unstructured text or other fields

The `asserted` config contains only `publisher` and `crossref` provenance citations (higher confidence). The `mined` config contains only `mined` provenance citations.

### Example Record

```json
{
  "arxiv_doi": "10.48550/arXiv.1412.6980",
  "arxiv_id": "1412.6980",
  "reference_count": 40040,
  "citation_count": 40040,
  "cited_by": [
    {
      "doi": "10.1016/j.jhydrol.2025.134709",
      "provenance": "mined",
      "matches": [
        {
          "provenance": "mined",
          "raw_match": "arXiv.1412.6980",
          "reference": {
            "key": "10.1016/j.jhydrol.2025.134709_b0175",
            "unstructured": "Kingma, D.P., & Ba, J. (2014), Adam: A Method for Stochastic Optimization..."
          }
        }
      ]
    }
  ]
}
```

## Extraction Process

### 1. Reference Extraction

arXiv references were extracted from the Crossref January 2025 Public Data File by scanning reference metadata for arXiv identifiers. The extraction detects multiple reference formats:

- arXiv IDs, including legacy formats: `arXiv:1412.6980`, `arXiv: 2206.15325`, `arXiv:cs.DM/9910013`
- arXiv DOIs: `10.48550/arXiv.1412.6980`
- arXiv URLs: `arxiv.org/abs/1412.6980`

Identifiers are normalized (lowercase, version suffixes removed) and deduplicated per citing work.

### 2. Provenance Tracking

For each extracted citation, the provenance is determined by checking:

1. If the reference has an explicit `DOI` field with `doi-asserted-by: publisher` -> `publisher`
2. If the reference has an explicit `DOI` field with `doi-asserted-by: crossref` -> `crossref`
3. Otherwise (DOI extracted from unstructured text) -> `mined`

### 3. Validation

The extracted arXiv DOIs were validated through a two-stage process:

1. DOIs were checked against the DataCite metadata for the arXiv record set (~2.8M records)
2. For DOIs not found in DataCite, HTTP HEAD requests to `doi.org` were used to verify resolution

Only arXiv works with valid, resolvable DOIs are included in this dataset.

## Data Sources

- January 2025 Crossref Public Data File
- January 2025 DataCite Public Data File
- Extraction tools: [cometadata/crossref-arxiv-citation-extraction](https://github.com/cometadata/crossref-arxiv-citation-extraction)

## Limitations

- Only includes citations registered in Crossref metadata
- Reference parsing depends on successful identifier extraction from unstructured references
- The `mined` subset may contain lower-quality matches from noisy reference text

## License

This dataset is released under [CC0 1.0 Universal](https://creativecommons.org/publicdomain/zero/1.0/). The underlying Crossref metadata is available under similar open terms.

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{crossref_arxiv_citations_2025,
  title = {Crossref arXiv Citations},
  author = {Cometadata},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/cometadata/crossref-arxiv-citations}
}
```
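
As a rough illustration of the Reference Extraction and Provenance Tracking steps described above, the following sketch shows one way the identifier normalization and provenance rules could be expressed. It is not the code from the extraction tools repository: the regular expression and the function names `extract_arxiv_ids` and `classify_provenance` are hypothetical, and the patterns only cover the reference formats listed in this card.

```python
import re

# Hypothetical patterns, written for illustration only; the actual
# extraction tool may recognise more (or different) formats.
_NEW_STYLE = r"\d{4}\.\d{4,5}"                    # e.g. 1412.6980
_OLD_STYLE = r"[a-z\-]+(?:\.[a-z]{2})?/\d{7}"     # e.g. cs.DM/9910013
_ARXIV_ID = re.compile(
    r"(?:arxiv\s*:\s*|10\.48550/arxiv\.|arxiv\.org/abs/)"
    rf"({_OLD_STYLE}|{_NEW_STYLE})(?:v\d+)?",
    re.IGNORECASE,
)


def extract_arxiv_ids(text):
    """Return arXiv IDs found in text: lowercased, version suffix
    dropped, and deduplicated while preserving first-seen order."""
    seen = []
    for match in _ARXIV_ID.finditer(text):
        arxiv_id = match.group(1).lower()
        if arxiv_id not in seen:
            seen.append(arxiv_id)
    return seen


def classify_provenance(reference):
    """Apply the three provenance rules to one Crossref reference dict."""
    if reference.get("DOI"):
        asserted_by = reference.get("doi-asserted-by")
        if asserted_by in ("publisher", "crossref"):
            return asserted_by
    # No explicit, asserted DOI: the identifier came from free text.
    return "mined"


if __name__ == "__main__":
    ref = {"unstructured": "Kingma & Ba (2014), arXiv:1412.6980v9"}
    print(extract_arxiv_ids(ref["unstructured"]), classify_provenance(ref))
```

Keeping the version suffix outside the capturing group means the normalized ID never includes it, so `arXiv:1412.6980v9` and `arxiv.org/abs/1412.6980` collapse to a single entry for a given citing work.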

Provided by:

cometadata
