Problematic ROR-Affiliation Names in Crossref 2024 Dump

NIAID Data Ecosystem, indexed 2026-05-02
Download link:
https://zenodo.org/record/14996457
Resource description:
Overview

This dataset contains 326 entries, each pairing a DOI with an affiliation name and a ROR ID for which a mismatch between the name and the ID was detected in the April 2024 Crossref data dump. The entries were manually checked after being drawn from a list of automatically pre-selected candidates. The dataset may be used as a benchmark for matching algorithms working with affiliation data in Crossref.

Entries stemming from some particular issues (3 ROR IDs with multiple issues) were not included, as they were considered less useful for a benchmark of what wrong matches may look like (see the "Scripts and analytics" contents for details).

Note: the entries in the dataset represent ROR-Affiliation Name pairs with issues (sometimes referred to as "false matches"). The pipeline favored precision over recall, so the dataset is not comprehensive, and it is likely that other problematic entries in the 2024 dump are not listed here.

Source datasets

The following CC0 datasets were used as sources for this dataset:

April 2024 Public Data File from Crossref (http://doi.org/10.13003/849J5WP), downloaded via torrent
ROR Release v1.59 (https://doi.org/10.5281/zenodo.14728473), downloaded manually via web browser
Wikidata, queried via QLever (https://qlever.cs.uni-freiburg.de/wikidata); full Wikidata dump from https://dumps.wikimedia.org/wikidatawiki/entities (latest-all.ttl.bz2 and latest-lexemes.ttl.bz2, version 29.01.2025)

Column meanings

In the main .tsv dataset, the column names are:

DOI - The Crossref DOI for the work
Affiliation_Name - An affiliation name string listed for some author of the work (DOI)
ROR_ID - The ROR ID provided by the publisher corresponding to this Affiliation Name for this DOI
ROR_Display - The display name for this ROR ID in ROR Release v1.59
Status - "manually curated false match" for all rows; this is just a sanity check for data reusers, reinforcing that these entries were manually curated as wrong

The .xlsx file contains extra information and some notes made during the curation process.

Scripts and analytics

Scripts and analytics for the baseline matching pipeline are available (as of March 2025) at https://github.com/lubianat/crossref_interview

Manual curation was done in Google Sheets, available (as of March 2025) at https://docs.google.com/spreadsheets/d/1XX_v5sI_EYHtRUp69s5LjITJD7v2dp4JqdFvLolG23U/edit?gid=1978804245#gid=1978804245 with parts of the process live streamed at https://www.youtube.com/watch?v=-Jum8E3_cQs
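
Example: using the .tsv file as a benchmark

The following is a minimal sketch, not part of the dataset's own scripts, of how the .tsv file might be used to check a matching algorithm. It assumes the file has a header row with the five column names listed above. The function names, the sample rows, the placeholder ROR IDs, and the toy matcher are all hypothetical and exist only for illustration. The idea is that every row is a known wrong pair, so a matcher that returns the same ROR ID as the publisher for a given affiliation name has reproduced a false match.

```python
import csv
import io

# Column names as listed in the dataset description.
EXPECTED_COLUMNS = ["DOI", "Affiliation_Name", "ROR_ID", "ROR_Display", "Status"]


def read_false_matches(handle):
    """Read the benchmark TSV from an open text handle into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def reproduced_false_matches(rows, matcher):
    """Return the rows where `matcher` assigns the same (known wrong) ROR ID
    that the publisher assigned to the affiliation name."""
    return [row for row in rows if matcher(row["Affiliation_Name"]) == row["ROR_ID"]]


# Invented sample rows; the DOIs, names, and ROR IDs are placeholders,
# not real entries from the dataset.
SAMPLE = (
    "DOI\tAffiliation_Name\tROR_ID\tROR_Display\tStatus\n"
    "10.0000/example.1\tExample University\tror-placeholder-a\t"
    "Unrelated Institute\tmanually curated false match\n"
    "10.0000/example.2\tSample College\tror-placeholder-b\t"
    "Another Organization\tmanually curated false match\n"
)


def toy_matcher(affiliation_name):
    """Hypothetical matcher: a fixed lookup standing in for a real algorithm."""
    lookup = {
        "Example University": "ror-placeholder-a",  # repeats the publisher's wrong ID
        "Sample College": "ror-placeholder-c",      # differs from the publisher's ID
    }
    return lookup.get(affiliation_name)


rows = read_false_matches(io.StringIO(SAMPLE))
bad = reproduced_false_matches(rows, toy_matcher)
print(f"{len(bad)} of {len(rows)} known false matches reproduced")
```

To run this against the real file, replace the io.StringIO(SAMPLE) handle with an open file handle, for example open(path, encoding="utf-8", newline=""), where path points to the downloaded .tsv. Because the dataset favors precision over recall, a low count here does not show that a matcher is free of errors on the full dump.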
Created:
2025-03-09