Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale

Source: DataCite Commons. Updated: 2021-02-14. Indexed: 2024-08-18.
Download link:
https://figshare.com/articles/dataset/ORCID-Linked_Labeled_Data_for_Evaluating_Author_Name_Disambiguation_at_Scale/13404986
Description:
This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in print).

1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI ID.
3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
4. AUT-SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs. Information about an author name instance in AUT_SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and byline position as a key.

Please see the paper for details on how the datasets were created.

Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

The uploaded datasets were created by combining the data sources below.

1. ORCID data (2018 version) were downloaded from the link below. Please refer to the policies on the use of ORCID data.
   https://info.orcid.org/public-data-file-use-policy/

2. MEDLINE baseline data (2016 version) were downloaded from the link below. Please refer to the policies on the use of MEDLINE data.
   https://www.nlm.nih.gov/databases/download/pubmed_medline.html

3. Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below. Please refer to the policies on the use of those datasets.
   https://databank.illinois.edu/datasets/IDB-9087546

   Please cite the three papers below to give proper credit to the creators of the original datasets.

   Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

   Torvik, V. I., & Agarwal, S. (2016). Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

   Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2013), 199-208. doi:10.1145/2467696.2467720

4. The dataset of NIH ID linked to Author-ity2009 was downloaded from the link below.
   https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

   Please cite the paper below to give proper credit to the creators of the original dataset.

   Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
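The description states that an author name instance in the self-citation pairs file can be connected to AUT-SCT_info through the combination of PMID and byline position. A minimal sketch of that composite-key lookup follows. The field names (`pmid`, `byline_position`, `citing_pmid`, and so on) and the sample rows are hypothetical placeholders; the actual column layout of the files is not given here and should be checked against the paper and the downloaded data.

```python
# Hypothetical rows standing in for AUT-SCT_info; real column names may differ.
info_rows = [
    {"pmid": "100", "byline_position": 1, "name": "Doe, J"},
    {"pmid": "100", "byline_position": 2, "name": "Roe, R"},
    {"pmid": "200", "byline_position": 1, "name": "Doe, Jane"},
]

# Hypothetical rows standing in for AUT_SCT_pairs: each row points at two
# author name instances, one on the citing paper and one on the cited paper.
pair_rows = [
    {"citing_pmid": "200", "citing_position": 1,
     "cited_pmid": "100", "cited_position": 1},
]


def build_info_index(rows):
    """Index author name instances by the composite key (PMID, byline position)."""
    return {(r["pmid"], r["byline_position"]): r for r in rows}


def attach_info(pairs, index):
    """Look up both sides of each pair; a key missing from the index yields None."""
    joined = []
    for p in pairs:
        citing = index.get((p["citing_pmid"], p["citing_position"]))
        cited = index.get((p["cited_pmid"], p["cited_position"]))
        joined.append({"citing": citing, "cited": cited})
    return joined


index = build_info_index(info_rows)
joined = attach_info(pair_rows, index)
print(joined[0]["citing"]["name"], "->", joined[0]["cited"]["name"])
```

A tuple key is used because neither PMID nor byline position identifies an author name instance on its own: a paper has several authors, and the same position recurs across papers. For files of the stated size (millions of rows), the same idea carries over to a database join or a dataframe merge on both columns.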

Provider: figshare
Created: 2020-12-17