
cometadata/arxiv-author-affiliations-matched-ror-ids

Source: Hugging Face. Updated 2026-01-14. Indexed 2026-03-29.
Download link:
https://hf-mirror.com/datasets/cometadata/arxiv-author-affiliations-matched-ror-ids
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
language:
- en
tags:
- arxiv
- affiliations
- ror
- research-organizations
- metadata
size_categories:
- 1M<n<10M
---

# arXiv Author Affiliations

This dataset contains author affiliation data extracted from arXiv works, matched to Research Organization Registry (ROR) identifiers.

## Dataset Description

This dataset was generated from all arXiv works as of 2025/12. The source PDFs were converted to markdown using [markitdown](https://github.com/microsoft/markitdown), and author affiliations were then extracted using [cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air](https://huggingface.co/cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air). The extracted affiliations were matched to ROR IDs using the [single search matching strategy](https://doi.org/10.71938/zz90-g810) in the ROR API.

The dataset contains approximately 2.8 million arXiv papers with 12.1 million author-affiliation pairs, of which 75.8% were successfully matched to ROR identifiers. Approximately 10,000 works were excluded because they exceeded the context size of the model.
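The match rate quoted above can in principle be recomputed from the records themselves. The following is a minimal sketch, not the authors' method: it assumes the JSON Lines layout described under Data Format, and it assumes that each entry in an author's `affiliations` array counts as one author-affiliation pair. The function name `match_rate` and the sample records are hypothetical and exist only for illustration.

```python
import json
from typing import Iterable


def match_rate(lines: Iterable[str]) -> float:
    """Return the fraction of author-affiliation pairs with a non-null ror_id.

    Assumption: one affiliation object equals one author-affiliation pair.
    """
    total = 0
    matched = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        for author in record.get("prediction", []):
            for aff in author.get("affiliations", []):
                total += 1
                if aff.get("ror_id") is not None:
                    matched += 1
    return matched / total if total else 0.0


# Made-up records for illustration only (not real dataset rows).
sample_lines = [
    json.dumps({
        "arxiv_id": "arXiv:1109.3791",
        "doi": "10.48550/arxiv.1109.3791",
        "version": "v1",
        "prediction": [
            {"name": "Author A", "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
                {"affiliation": "Unmatched Institute", "ror_id": None},
            ]},
        ],
    }),
    json.dumps({
        "arxiv_id": "arXiv:1109.3792",
        "doi": "10.48550/arxiv.1109.3792",
        "version": "v2",
        "prediction": [
            {"name": "Author B", "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
            ]},
            {"name": "Author C", "affiliations": [
                {"affiliation": "Example University",
                 "ror_id": "https://ror.org/example123"},
            ]},
        ],
    }),
]

print(f"{match_rate(sample_lines):.1%}")
```

To run this against the real data, pass an open file handle for a downloaded data file instead of `sample_lines`; the function reads one line at a time, so it does not need to hold the full dataset in memory.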
## Data Format

Each line is a JSON object with the following structure:

```json
{
  "arxiv_id": "arXiv:1109.3791",
  "doi": "10.48550/arxiv.1109.3791",
  "version": "v1",
  "prediction": [
    {
      "name": "Author Name",
      "affiliations": [
        {
          "affiliation": "Department of Computer Science, Example University",
          "ror_id": "https://ror.org/example123"
        }
      ]
    }
  ]
}
```

### Fields

| Field | Description |
|-------|-------------|
| `arxiv_id` | arXiv identifier with prefix (e.g., `arXiv:1109.3791`) |
| `doi` | DOI derived from arXiv ID (e.g., `10.48550/arxiv.1109.3791`) |
| `version` | Paper version from arXiv (e.g., `v1`, `v2`) |
| `prediction` | Array of authors with their affiliations |
| `prediction[].name` | Author name as extracted from the paper |
| `prediction[].affiliations` | Array of affiliation objects |
| `prediction[].affiliations[].affiliation` | Raw affiliation text |
| `prediction[].affiliations[].ror_id` | ROR identifier URL, or `null` if no match found |

## License

This dataset is released under [CC0 1.0 Universal (Public Domain Dedication)](https://creativecommons.org/publicdomain/zero/1.0/).

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{arxiv_affiliations_ror_2025,
  title={arXiv Author Affiliations with ROR IDs},
  author={COMET},
  year={2025},
  url={https://huggingface.co/datasets/cometadata/arxiv-author-affiliations-ror}
}
```
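As a usage note on the record structure described under Data Format: the nested `prediction` array is often easier to work with once flattened into one row per author-affiliation pair. The sketch below shows one way to do that with the standard library only. The names `AffiliationRow` and `iter_rows` are hypothetical helpers, not part of the dataset or any library, and the input is the placeholder record from the dataset card rather than real data.

```python
import json
from typing import Iterator, NamedTuple, Optional


class AffiliationRow(NamedTuple):
    """One flattened author-affiliation pair (hypothetical helper type)."""
    arxiv_id: str
    version: str
    author: str
    affiliation: str
    ror_id: Optional[str]  # None when the record stores null (no ROR match)


def iter_rows(line: str) -> Iterator[AffiliationRow]:
    """Yield one row per affiliation found in a single JSON Lines record."""
    record = json.loads(line)
    for author in record["prediction"]:
        for aff in author["affiliations"]:
            yield AffiliationRow(
                arxiv_id=record["arxiv_id"],
                version=record["version"],
                author=author["name"],
                affiliation=aff["affiliation"],
                ror_id=aff["ror_id"],
            )


# The placeholder record from the dataset card, serialized as one line.
example_line = json.dumps({
    "arxiv_id": "arXiv:1109.3791",
    "doi": "10.48550/arxiv.1109.3791",
    "version": "v1",
    "prediction": [
        {
            "name": "Author Name",
            "affiliations": [
                {
                    "affiliation": "Department of Computer Science, Example University",
                    "ror_id": "https://ror.org/example123",
                }
            ],
        }
    ],
})

rows = list(iter_rows(example_line))
for row in rows:
    print(row.arxiv_id, row.author, row.ror_id)
```

Note that an author whose `affiliations` array is empty produces no rows in this sketch; if such authors matter for an analysis, they would need to be handled separately.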
Provider:
cometadata