
cometadata/datacite-affiliations-matched-ror

Source: Hugging Face | Updated: 2026-01-18 | Indexed: 2026-03-29
Download link:
https://hf-mirror.com/datasets/cometadata/datacite-affiliations-matched-ror
Resource description:
---
license: cc0-1.0
task_categories:
- text-classification
language:
- en
tags:
- research
- affiliations
- ror
- datacite
- metadata
- scholarly-infrastructure
pretty_name: DataCite Affiliations Matched to ROR
size_categories:
- 100M<n<1B
configs:
- config_name: doi_author_affiliations
  data_files:
  - split: train
    path: data/doi_author_affiliations/*.parquet
- config_name: enriched_records
  data_files:
  - split: train
    path: data/enriched_records/*.parquet
- config_name: ror_matches
  data_files:
  - split: train
    path: data/ror_matches/*.parquet
- config_name: ror_matches_failed
  data_files:
  - split: train
    path: data/ror_matches_failed/*.parquet
- config_name: unique_affiliations
  data_files:
  - split: train
    path: data/unique_affiliations/*.parquet
- config_name: existing_assignments
  data_files:
  - split: train
    path: data/existing_assignments/*.parquet
- config_name: existing_assignments_aggregated
  data_files:
  - split: train
    path: data/existing_assignments_aggregated/*.parquet
- config_name: disagreements
  data_files:
  - split: train
    path: data/disagreements/*.parquet
---

# DataCite Affiliations Matched to ROR

This dataset contains author affiliation data extracted from DataCite metadata records, matched against the Research Organization Registry (ROR).

## Dataset Description

- **Source:** [DataCite Public Data File](https://datacite.org/)
- **ROR Version:** [v2.1-2026-01-15-ror-data](https://doi.org/10.5281/zenodo.18260365)
- **Processing Tool:** [match-datacite-affiliations-to-ror-ids](https://github.com/cometadata/match-datacite-affiliations-to-ror-ids)
- **Total Records:** 294,704,394
- **Total Size:** 69.40 GB
- **Match Rate:** 52.78%

## Dataset Configurations

| Configuration | Records | Size |
|---------------|---------|------|
| `doi_author_affiliations` | 158,351,335 | 36.01 GB |
| `enriched_records` | 5,640,704 | 4.92 GB |
| `ror_matches` | 731,798 | 134.62 MB |
| `ror_matches_failed` | 654,780 | 91.80 MB |
| `unique_affiliations` | 1,386,578 | 113.57 MB |
| `existing_assignments` | 127,865,845 | 28.11 GB |
| `existing_assignments_aggregated` | 58,760 | 11.01 MB |
| `disagreements` | 14,594 | 4.94 MB |

## Configuration Details

### `doi_author_affiliations`

Flattened author-affiliation pairs extracted from DataCite records. Each row represents one author-affiliation relationship.

**Schema:**

- `doi` (string): The DOI of the work
- `author_idx` (int): Index of the author within the work
- `author_name` (string): Name of the author
- `affiliation_idx` (int): Index of the affiliation for this author
- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string

### `enriched_records`

Original DataCite records enriched with ROR IDs where matches were found.

**Schema:**

- `doi` (string): The DOI of the work
- `creators` (list): List of creator objects with nested affiliation data including matched ROR IDs

### `ror_matches`

Successful affiliation-to-ROR matches.

**Schema:**

- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string
- `ror_id` (string): Matched ROR ID

### `ror_matches_failed`

Affiliations that could not be matched to a ROR ID.

**Schema:**

- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string
- `error` (string): Reason for match failure

### `unique_affiliations`

List of all unique affiliation strings found in the dataset.

**Schema:**

- `affiliation` (string): Raw affiliation string

### `existing_assignments`

Pre-existing ROR assignments found in DataCite records. Each row represents one author-affiliation-ROR relationship that was already present in the source data.

**Schema:**

- `doi` (string): The DOI of the work
- `author_idx` (int): Index of the author within the work
- `author_name` (string): Name of the author
- `affiliation` (string): Raw affiliation string
- `ror_id` (string): Pre-existing ROR ID in the DataCite record
- `ror_name` (string): Name of the ROR organization

### `existing_assignments_aggregated`

Aggregated view of pre-existing ROR assignments, grouped by affiliation string and ROR ID.

**Schema:**

- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string
- `ror_id` (string): Pre-existing ROR ID
- `ror_name` (string): Name of the ROR organization
- `count` (int): Number of occurrences of this affiliation-ROR pair

### `disagreements`

Cases where the newly matched ROR ID differs from a pre-existing ROR assignment, or where multiple conflicting ROR IDs exist.

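The two kinds of disagreement described above can be illustrated with a small sketch. This is not the processing tool's actual logic: it assumes input rows shaped like `ror_matches` and `existing_assignments_aggregated`, joined on `affiliation_hash`. The function name `find_disagreements`, the placeholder hashes `h1` to `h3`, the toy counts, and the rule that more than one distinct pre-existing ROR ID makes a "user" conflict are all assumptions made for illustration.

```python
from collections import defaultdict


def find_disagreements(matches, existing):
    """Classify conflicting ROR assignments (hypothetical helper).

    matches:  rows with `affiliation_hash` and `ror_id`
    existing: rows with `affiliation_hash`, `ror_id`, and `count`
    """
    # Group pre-existing ROR IDs and their counts by affiliation hash.
    existing_by_hash = defaultdict(dict)
    for row in existing:
        ids = existing_by_hash[row["affiliation_hash"]]
        ids[row["ror_id"]] = ids.get(row["ror_id"], 0) + row["count"]

    matched_by_hash = {m["affiliation_hash"]: m["ror_id"] for m in matches}

    out = []
    for aff_hash, ids in existing_by_hash.items():
        if len(ids) > 1:
            # Assumed rule: several distinct existing IDs is a "user" conflict.
            out.append({
                "type": "user",
                "affiliation_hash": aff_hash,
                "ror_ids": [{"ror_id": r, "count": c} for r, c in sorted(ids.items())],
            })
        elif aff_hash in matched_by_hash:
            existing_id, count = next(iter(ids.items()))
            if existing_id != matched_by_hash[aff_hash]:
                # The new match differs from the single existing assignment.
                out.append({
                    "type": "match",
                    "affiliation_hash": aff_hash,
                    "existing_ror_id": existing_id,
                    "existing_count": count,
                    "matched_ror_id": matched_by_hash[aff_hash],
                })
    return out


# Toy rows; hashes and counts are placeholders, not real data.
matches = [
    {"affiliation_hash": "h1", "ror_id": "https://ror.org/01kzn7k21"},
    {"affiliation_hash": "h2", "ror_id": "https://ror.org/05vf56z40"},
]
existing = [
    {"affiliation_hash": "h1", "ror_id": "https://ror.org/04wxnsj81", "count": 3},
    {"affiliation_hash": "h2", "ror_id": "https://ror.org/05vf56z40", "count": 2},
    {"affiliation_hash": "h3", "ror_id": "https://ror.org/04wxnsj81", "count": 1},
    {"affiliation_hash": "h3", "ror_id": "https://ror.org/031699d98", "count": 4},
]

disagreements = find_disagreements(matches, existing)
for row in disagreements:
    print(row["type"], row["affiliation_hash"])
```

In this toy input, `h1` yields a match-type row, `h2` agrees and is skipped, and `h3` yields a user-type row. The output field names follow the card's `disagreements` schema, with the organization names left out for brevity.
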
**Schema (type="match"):**

- `type` (string): "match" - disagreement between new match and existing assignment
- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string
- `existing_ror_id` (string): Pre-existing ROR ID in DataCite
- `existing_ror_name` (string): Name of existing ROR organization
- `existing_count` (int): Occurrences of this existing assignment
- `matched_ror_id` (string): Newly matched ROR ID
- `matched_ror_name` (string): Name of newly matched organization

**Schema (type="user"):**

- `type` (string): "user" - multiple conflicting user-submitted ROR IDs
- `affiliation` (string): Raw affiliation string
- `affiliation_hash` (string): MD5 hash of the normalized affiliation string
- `ror_ids` (list): List of conflicting ROR assignments with counts

## Statistics

### Top 20 Most Common Matched ROR IDs

| ROR ID | Count |
|--------|-------|
| https://ror.org/01kzn7k21 | 8,097 |
| https://ror.org/034t30j35 | 4,024 |
| https://ror.org/036rp1748 | 2,798 |
| https://ror.org/04y75dx46 | 2,730 |
| https://ror.org/05qrfxd25 | 2,514 |
| https://ror.org/01tmp8f25 | 1,948 |
| https://ror.org/052gg0110 | 1,626 |
| https://ror.org/05591te55 | 1,553 |
| https://ror.org/02en5vm52 | 1,489 |
| https://ror.org/042aqky30 | 1,455 |
| https://ror.org/02kkvpp62 | 1,389 |
| https://ror.org/03490as77 | 1,350 |
| https://ror.org/04xfq0f34 | 1,322 |
| https://ror.org/00rcxh774 | 1,248 |
| https://ror.org/013meh722 | 1,211 |
| https://ror.org/00987cb86 | 1,201 |
| https://ror.org/011647w73 | 1,191 |
| https://ror.org/05qbk4x57 | 1,184 |
| https://ror.org/001w7jn25 | 1,179 |
| https://ror.org/019qzkn95 | 1,177 |

### Error Distribution (Failed Matches)

| Error Type | Count |
|------------|-------|
| No match found | 654,775 |
| HTTP 414 URI Too Long | 5 |

### Existing Assignment Coverage

| Metric | Value |
|--------|-------|
| Total records with existing ROR assignments | 127,865,845 |
| Unique affiliations with existing assignments | 57,016 |
| Overlap with new matches | 49,538 |
| Agreement rate | 70.54% |

### Disagreement Analysis

| Metric | Value |
|--------|-------|
| Total disagreements | 14,594 |
| Disagreement rate | 29.46% |
| Match-type disagreements | 13,318 |
| User-type disagreements | 1,276 |

### Top Disagreement Patterns

| Existing ROR | Matched ROR | Count |
|--------------|-------------|-------|
| DataCite (https://ror.org/04wxnsj81) | Islamic Azad University, Tehran (https://ror.org/01kzn7k21) | 5,766 |
| DataCite (https://ror.org/04wxnsj81) | Islamic Azad University, Isfahan (https://ror.org/039zhhm92) | 399 |
| DataCite (https://ror.org/04wxnsj81) | University of Tehran (https://ror.org/05vf56z40) | 245 |
| DataCite (https://ror.org/04wxnsj81) | Payame Noor University (https://ror.org/031699d98) | 154 |
| DataCite (https://ror.org/04wxnsj81) | Islamic Azad University, Karaj (https://ror.org/01y4xm534) | 131 |
| DataCite (https://ror.org/04wxnsj81) | Islamic Azad University, Yazd (https://ror.org/04mwvcn50) | 121 |
| DataCite (https://ror.org/04wxnsj81) | Islamic Azad University, Mashhad (https://ror.org/00bvysh61) | 113 |
| DataCite (https://ror.org/04wxnsj81) | University of Mohaghegh Ardabili (https://ror.org/045zrcm98) | 108 |
| DataCite (https://ror.org/04wxnsj81) | Urmia University (https://ror.org/032fk0x53) | 93 |
| DataCite (https://ror.org/04wxnsj81) | Tarbiat Modares University (https://ror.org/03mwgfy56) | 89 |

## Usage

```python
from datasets import load_dataset

# Load successful ROR matches
matches = load_dataset("cometadata/datacite-affiliations-matched-ror", "ror_matches")

# Load author-affiliation pairs (large dataset, use streaming)
affiliations = load_dataset(
    "cometadata/datacite-affiliations-matched-ror",
    "doi_author_affiliations",
    streaming=True
)

# Iterate over records
for record in affiliations["train"]:
    print(record["doi"], record["affiliation"])
    break
```

## License

This dataset is released under the [CC0 1.0 Universal (Public Domain Dedication)](https://creativecommons.org/publicdomain/zero/1.0/) license.

## Citation

If you use this dataset, please cite:

```bibtex
@dataset{datacite_affiliations_ror,
  title = {DataCite Affiliations Matched to ROR},
  author = {cometadata},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/cometadata/datacite-affiliations-matched-ror}
}
```

## Acknowledgments

- [DataCite](https://datacite.org/) for providing the source metadata
- [ROR](https://ror.org/) for the Research Organization Registry
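
The headline figures in this card can be cross-checked against its own per-configuration counts. The snippet below is a minimal sketch that uses only numbers quoted in the card. The card does not define how its rates were computed, so the two formulas (match rate as matched over unique affiliations, disagreement rate as disagreements over the overlap with new matches) are assumptions that happen to reproduce the reported values.

```python
# Record counts per configuration, copied from the card's table.
CONFIG_RECORDS = {
    "doi_author_affiliations": 158_351_335,
    "enriched_records": 5_640_704,
    "ror_matches": 731_798,
    "ror_matches_failed": 654_780,
    "unique_affiliations": 1_386_578,
    "existing_assignments": 127_865_845,
    "existing_assignments_aggregated": 58_760,
    "disagreements": 14_594,
}

total_records = sum(CONFIG_RECORDS.values())
matched = CONFIG_RECORDS["ror_matches"]
failed = CONFIG_RECORDS["ror_matches_failed"]
unique = CONFIG_RECORDS["unique_affiliations"]

# Assumption: match rate = matched affiliations / unique affiliations.
match_rate = matched / unique

# Assumption: disagreement rate = disagreements / overlap with new matches.
overlap_with_new_matches = 49_538
disagreement_rate = CONFIG_RECORDS["disagreements"] / overlap_with_new_matches

print(f"total records:     {total_records:,}")
print(f"matched + failed:  {matched + failed:,}")
print(f"match rate:        {match_rate:.2%}")
print(f"disagreement rate: {disagreement_rate:.2%}")
print(f"agreement rate:    {1 - disagreement_rate:.2%}")
```

Under these assumptions the script reproduces the reported 294,704,394 total records, the 52.78% match rate, and the 29.46% and 70.54% disagreement and agreement rates. It also shows that matched plus failed affiliations equals the 1,386,578 unique affiliations.
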
Provider: cometadata