cometadata/arxiv-author-affiliations-matched-ror-ids
Source: Hugging Face. Updated 2026-01-14; indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cometadata/arxiv-author-affiliations-matched-ror-ids
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
language:
- en
tags:
- arxiv
- affiliations
- ror
- research-organizations
- metadata
size_categories:
- 1M<n<10M
---
# arXiv Author Affiliations
This dataset contains author affiliation data extracted from arXiv works, matched to Research Organization Registry (ROR) identifiers.
## Dataset Description
This dataset was generated from all arXiv works available as of December 2025. The source PDFs were converted to Markdown using [markitdown](https://github.com/microsoft/markitdown), and author affiliations were then extracted using [cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air](https://huggingface.co/cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air). The extracted affiliations were matched to ROR IDs using the [single search matching strategy](https://doi.org/10.71938/zz90-g810) of the ROR API.
The dataset contains approximately 2.8 million arXiv papers with 12.1 million author-affiliation pairs, of which 75.8% were successfully matched to ROR identifiers. Approximately 10,000 works were excluded because they exceeded the context size of the extraction model.
## Data Format
Each line is a JSON object with the following structure:
```json
{
  "arxiv_id": "arXiv:1109.3791",
  "doi": "10.48550/arxiv.1109.3791",
  "version": "v1",
  "prediction": [
    {
      "name": "Author Name",
      "affiliations": [
        {
          "affiliation": "Department of Computer Science, Example University",
          "ror_id": "https://ror.org/example123"
        }
      ]
    }
  ]
}
```
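
As a minimal sketch of how the nested structure above might be consumed, the following Python snippet flattens JSON Lines input into one row per author-affiliation pair. The names `AffiliationRow` and `iter_rows` are hypothetical helpers introduced here for illustration, not part of the dataset or any library, and the sample record simply reuses the placeholder values from the example above.

```python
import json
from typing import Iterable, Iterator, NamedTuple, Optional


class AffiliationRow(NamedTuple):
    """One author-affiliation pair (hypothetical helper type)."""
    arxiv_id: str
    author: str
    affiliation: str
    ror_id: Optional[str]


def iter_rows(lines: Iterable[str]) -> Iterator[AffiliationRow]:
    """Yield one row per author-affiliation pair from JSON Lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for author in record.get("prediction") or []:
            for aff in author.get("affiliations") or []:
                yield AffiliationRow(
                    arxiv_id=record["arxiv_id"],
                    author=author["name"],
                    affiliation=aff["affiliation"],
                    ror_id=aff.get("ror_id"),  # None when no match was found
                )


# Sample record built from the placeholder values shown in the example above.
sample_line = json.dumps({
    "arxiv_id": "arXiv:1109.3791",
    "doi": "10.48550/arxiv.1109.3791",
    "version": "v1",
    "prediction": [{
        "name": "Author Name",
        "affiliations": [{
            "affiliation": "Department of Computer Science, Example University",
            "ror_id": "https://ror.org/example123",
        }],
    }],
})

rows = list(iter_rows([sample_line]))
```

To process a downloaded file, an open file handle can be passed in place of the list, since iterating over a text file yields one line at a time. The file name depends on how the data was downloaded and is not specified here.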
### Fields
| Field | Description |
|-------|-------------|
| `arxiv_id` | arXiv identifier with prefix (e.g., `arXiv:1109.3791`) |
| `doi` | DOI derived from arXiv ID (e.g., `10.48550/arxiv.1109.3791`) |
| `version` | Paper version from arXiv (e.g., `v1`, `v2`) |
| `prediction` | Array of authors with their affiliations |
| `prediction[].name` | Author name as extracted from the paper |
| `prediction[].affiliations` | Array of affiliation objects |
| `prediction[].affiliations[].affiliation` | Raw affiliation text |
| `prediction[].affiliations[].ror_id` | ROR identifier URL, or `null` if no match found |
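
Because `ror_id` is `null` when no match was found, the share of matched author-affiliation pairs can be computed directly from the parsed records. The snippet below is a sketch under that assumption; `match_rate` is a hypothetical helper name, and the two records are made-up placeholders rather than real dataset rows.

```python
from typing import Any, Iterable, Mapping


def match_rate(records: Iterable[Mapping[str, Any]]) -> float:
    """Return the fraction of author-affiliation pairs with a non-null ror_id."""
    total = 0
    matched = 0
    for record in records:
        for author in record.get("prediction") or []:
            for aff in author.get("affiliations") or []:
                total += 1
                if aff.get("ror_id") is not None:
                    matched += 1
    # Avoid division by zero when there are no pairs at all.
    return matched / total if total else 0.0


# Placeholder records for illustration only (not taken from the dataset).
records = [
    {
        "arxiv_id": "arXiv:1109.3791",
        "prediction": [{
            "name": "Author Name",
            "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
                {"affiliation": "Unrecognized Institute", "ror_id": None},
            ],
        }],
    },
    {"arxiv_id": "arXiv:0000.0000", "prediction": []},
]

rate = match_rate(records)
```

The function counts pairs rather than papers or authors, which follows the way the match percentage is reported in the dataset description.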
## License
This dataset is released under [CC0 1.0 Universal (Public Domain Dedication)](https://creativecommons.org/publicdomain/zero/1.0/).
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{arxiv_affiliations_ror_2025,
  title={arXiv Author Affiliations with ROR IDs},
  author={COMET},
  year={2025},
  url={https://huggingface.co/datasets/cometadata/arxiv-author-affiliations-ror}
}
```
Provider:
cometadata



