Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale

Name: Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale
Creator: figshare
Published: 2021-02-14 00:52:11
License: No description available

DataCite Commons. Updated: 2021-02-14. Indexed: 2024-08-18.

Download link:

https://figshare.com/articles/dataset/ORCID-Linked_Labeled_Data_for_Evaluating_Author_Name_Disambiguation_at_Scale/13404986

Resource description:

This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in press).

1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI ID.
3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
4. AUT-SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs.

Information about an author name instance in AUT_SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and byline position as a key. Please see the paper for details on how the datasets were created.

Kim, J., & Owen-Smith, J. (in press). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

The uploaded datasets were created by combining the data sources below.

1. ORCID data were downloaded from the link below for the 2018 version. Please refer to the policies on the use of ORCID data.
https://info.orcid.org/public-data-file-use-policy/

2. MEDLINE baseline data were downloaded from the link below for the 2016 version. Please refer to the policies on the use of MEDLINE data.
https://www.nlm.nih.gov/databases/download/pubmed_medline.html

3. Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below. Please refer to the policies on the use of those datasets.
https://databank.illinois.edu/datasets/IDB-9087546

Please cite the three papers below to give proper credit to the creators of the original datasets.

Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

Torvik, V. I., & Agarwal, S. (2016). Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

4. The dataset of NIH ID linked to Author-ity2009 was downloaded from the link below.
https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

Please cite the paper below to give proper credit to the creators of the original dataset.

Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
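The description above says that an author name instance in AUT_SCT_pairs can be connected to AUT-SCT_info through the combination of PMID and byline position. The sketch below illustrates that composite-key lookup in Python. It is only an illustration under stated assumptions: the field names (pmid, byline_position, citing_pmid, and so on), the record layout, and the toy rows are all hypothetical and are not taken from the released files, whose actual column layout should be checked in the paper or the archives themselves.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Tuple

# Composite key: (PMID, byline position). All field names are hypothetical.
Key = Tuple[str, int]


class InfoRow(NamedTuple):
    """One author name instance, as it might appear in AUT-SCT_info."""
    pmid: str
    byline_position: int
    name: str


class PairRow(NamedTuple):
    """One self-citation pair, as it might appear in AUT_SCT_pairs."""
    citing_pmid: str
    citing_position: int
    cited_pmid: str
    cited_position: int


def index_info(rows: Iterable[InfoRow]) -> Dict[Key, InfoRow]:
    """Index info records by the (PMID, byline position) key."""
    return {(r.pmid, r.byline_position): r for r in rows}


def join_pairs(
    pairs: Iterable[PairRow], info: Dict[Key, InfoRow]
) -> Iterator[Tuple[Optional[InfoRow], Optional[InfoRow]]]:
    """Yield (citing, cited) info records for each pair; None if a key is missing."""
    for p in pairs:
        yield (
            info.get((p.citing_pmid, p.citing_position)),
            info.get((p.cited_pmid, p.cited_position)),
        )


# Made-up toy rows, for illustration only.
INFO = [
    InfoRow("1001", 1, "Doe, J"),
    InfoRow("1001", 2, "Roe, R"),
    InfoRow("1002", 3, "Doe, Jane"),
]
PAIRS = [PairRow("1002", 3, "1001", 1)]

if __name__ == "__main__":
    for citing, cited in join_pairs(PAIRS, index_info(INFO)):
        print(citing, cited)
```

Using both PMID and byline position in the key matters because a PMID alone identifies a paper, not an author: the toy paper "1001" has two name instances that only the position tells apart.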

Provider:

figshare

Created:

2020-12-17
