Problematic ROR-Affiliation Names in Crossref 2024 Dump

Source: NIAID Data Ecosystem (indexed 2026-05-02)

Download link:

https://zenodo.org/record/14996457

Resource description:

Overview

This dataset contains 326 entries, each consisting of a DOI with a related Affiliation Name and ROR ID, for which a mismatch between the name and the ID was detected in the Crossref April 2024 data dump. The entries were manually checked, starting from a list of automatically pre-selected candidates. The dataset may be used as a benchmark for matching algorithms that work with affiliation data in Crossref.

Entries stemming from some particular issues (3 ROR IDs with multiple issues) were not included, as they were considered less useful for a benchmark of what wrong matches may look like (see the "Scripts and analytics" contents for details).

Note: each entry in the dataset is a ROR-Affiliation Name pair with an issue (sometimes referred to as a "false match"). The pipeline favored precision over recall, so the dataset is not comprehensive, and the 2024 dump likely contains other problematic entries that are not listed here.

Source datasets

The following CC0 datasets were used as sources for this dataset:

April 2024 Public Data File from Crossref (http://doi.org/10.13003/849J5WP), downloaded via torrent
ROR Release v1.59 (https://doi.org/10.5281/zenodo.14728473), downloaded manually via web browser
Wikidata, queried via QLever (https://qlever.cs.uni-freiburg.de/wikidata), full Wikidata dump from https://dumps.wikimedia.org/wikidatawiki/entities (latest-all.ttl.bz2 and latest-lexemes.ttl.bz2, version 29.01.2025)

Column meanings

In the main .tsv dataset, the column names are:

DOI - The Crossref DOI for the work
Affiliation_Name - An affiliation name string listed for some author of the work (DOI)
ROR_ID - The ROR ID provided by the publisher corresponding to this Affiliation Name for this DOI
ROR_Display - The display name for this ROR ID in ROR Release v1.59
Status - "manually curated false match" for every row; this is a sanity check for data reusers, reinforcing that these entries were manually curated as wrong

The .xlsx file contains extra information and some notes made during the curation process.

Scripts and analytics

Scripts and analytics for the baseline matching pipeline are available (as of March 2025) at https://github.com/lubianat/crossref_interview.

Manual curation was done in Google Sheets, available (as of March 2025) at https://docs.google.com/spreadsheets/d/1XX_v5sI_EYHtRUp69s5LjITJD7v2dp4JqdFvLolG23U/edit?gid=1978804245#gid=1978804245, with parts of the process live streamed at https://www.youtube.com/watch?v=-Jum8E3_cQs.
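
Example: reading the .tsv and using it as a benchmark

The following is a minimal sketch, not part of the original record, of how the documented columns could be checked and how a matching algorithm might be scored against the curated false matches. It assumes the .tsv has a header row with exactly the column names listed in this description and uses plain tab separation; the record does not state the file name or the exact ROR_ID format, so the sample rows, the placeholder DOIs and ROR IDs, and the names read_false_matches, count_reproduced_false_matches, and toy_matcher are all hypothetical.

```python
import csv
import io

# Column names and Status value as documented in the dataset description.
EXPECTED_COLUMNS = ["DOI", "Affiliation_Name", "ROR_ID", "ROR_Display", "Status"]
EXPECTED_STATUS = "manually curated false match"


def read_false_matches(handle):
    """Parse the tab-separated data and check it against the documented schema."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        # Status is described as a sanity check: every row carries the same value.
        if row["Status"] != EXPECTED_STATUS:
            raise ValueError(f"unexpected Status for DOI {row['DOI']}: {row['Status']!r}")
    return rows


def count_reproduced_false_matches(rows, matcher):
    """Count curated false matches that `matcher` reproduces.

    `matcher` is any callable mapping an affiliation string to a ROR ID (or None).
    A good matcher should reproduce few or none of these known-wrong pairs.
    """
    return sum(1 for row in rows if matcher(row["Affiliation_Name"]) == row["ROR_ID"])


# Made-up rows for illustration only; these are NOT taken from the dataset.
SAMPLE_TSV = (
    "DOI\tAffiliation_Name\tROR_ID\tROR_Display\tStatus\n"
    "10.0000/example.1\tExample University\thttps://ror.org/000000001\t"
    "Unrelated Institute\tmanually curated false match\n"
    "10.0000/example.2\tSample College\thttps://ror.org/000000002\t"
    "Another Organization\tmanually curated false match\n"
)

rows = read_false_matches(io.StringIO(SAMPLE_TSV))


def toy_matcher(name):
    """Hypothetical matcher that repeats the first known-wrong assignment."""
    return "https://ror.org/000000001" if name == "Example University" else None


print(len(rows), count_reproduced_false_matches(rows, toy_matcher))  # prints: 2 1
```

With the real file, the in-memory sample would be replaced by something like open(path, newline="", encoding="utf-8"), where path points to the downloaded .tsv. Because the dataset favors precision over recall, a count of zero shows only that a matcher avoids these specific known errors, not that it is free of false matches.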

Created:

2025-03-09
