Dataset: ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale

Source: figshare.com | Updated: 2023-05-30 | Indexed: 2025-01-15
Download link:
https://figshare.com/articles/dataset/ORCID-Linked_Labeled_Data_for_Evaluating_Author_Name_Disambiguation_at_Scale/13404986/4
Description:
This page contains four datasets released for the paper "ORCID-Linked Labeled Data for Evaluating Author Name Disambiguation at Scale", to be published in Scientometrics (in print).

1. AUT_ORC.zip: a list of 3M author name instances in MEDLINE linked to Author-ity2009.
2. AUT_NIH.zip: a list of 313K author name instances in MEDLINE linked to NIH PI ID.
3. AUT_SCT_pairs.zip: a list of 6.2M paper pairs and author byline positions in self-citation relation.
4. AUT-SCT_info.zip: a list of 4.7M author name instances in self-citation relation as recorded in AUT_SCT_pairs.

Information about an author name instance in AUT-SCT_pairs can be connected to AUT-SCT_info using the combination of PMID and byline position as a key.

Please see the paper for details on how the datasets were created.

Kim, J., & Owen-Smith, J. (In print). ORCID-linked labeled data for evaluating author name disambiguation at scale. Scientometrics. doi:10.1007/s11192-020-03826-6

The uploaded datasets were created by combining the data sources below.

1. ORCID data (2018 version) were downloaded from the link below. Please refer to the policies on the use of ORCID data.
   https://info.orcid.org/public-data-file-use-policy/

2. MEDLINE baseline data (2016 version) were downloaded from the link below. Please refer to the policies on the use of MEDLINE data.
   https://www.nlm.nih.gov/databases/download/pubmed_medline.html

3. The Author-ity2009, Ethnea, and Genni datasets were downloaded from the link below. Please refer to the policies on the use of those datasets.
   https://databank.illinois.edu/datasets/IDB-9087546

   Please cite the three papers below to give proper credit to the creators of the original datasets.

   Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3). doi:10.1145/1552303.1552304

   Torvik, V. I., & Agarwal, S. (2016). Ethnea -- an instance-based ethnicity classifier based on geocoded author names in a large-scale bibliographic database. International Symposium on Science of Science, March 22-23, 2016, Library of Congress, Washington, DC, USA. http://hdl.handle.net/2142/88927

   Smith, B., Singh, M., & Torvik, V. (2013). A search engine approach to estimating temporal changes in gender orientation of first names. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2013 - Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries), 199-208. doi:10.1145/2467696.2467720

4. The dataset of NIH ID linked to Author-ity2009 was downloaded from the link below.
   https://figshare.com/articles/dataset/PLoS_2016_csv/3407461/1

   Please cite the paper below to give proper credit to the creators of the original dataset.

   Lerchenmueller, M. J., & Sorenson, O. (2016). Author Disambiguation in PubMed: Evidence on the Precision and Recall of Author-ity among NIH-Funded Scientists. PLOS ONE, 11(7), e0158731. doi:10.1371/journal.pone.0158731
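The description says that an author name instance in the self-citation pairs file can be connected to the info file using PMID and byline position as a combined key. A minimal sketch of that lookup follows. The column names (`pmid`, `position`, `citing_pmid`, and so on) and the sample rows are assumptions made for illustration; this page does not document the actual file layouts, so check the paper or the files themselves before adapting it.

```python
# Sketch of a composite-key lookup on (PMID, byline position).
# All field names below are hypothetical; the real column layout of
# AUT_SCT_pairs and AUT-SCT_info is not given on this page.

def index_info(info_rows):
    """Map (PMID, byline position) -> author name instance record."""
    return {(row["pmid"], row["position"]): row for row in info_rows}


def attach_info(pair_rows, info_index):
    """Yield each self-citation pair with both author instances looked up.

    A side that has no matching info record comes back as None.
    """
    for pair in pair_rows:
        citing = info_index.get((pair["citing_pmid"], pair["citing_position"]))
        cited = info_index.get((pair["cited_pmid"], pair["cited_position"]))
        yield pair, citing, cited


# Made-up illustration rows, not taken from the datasets.
info_rows = [
    {"pmid": "100", "position": "1", "name": "Doe, J"},
    {"pmid": "200", "position": "3", "name": "Doe, Jane"},
]
pair_rows = [
    {
        "citing_pmid": "200",
        "citing_position": "3",
        "cited_pmid": "100",
        "cited_position": "1",
    },
]

info_index = index_info(info_rows)
joined = list(attach_info(pair_rows, info_index))
```

Building the index once and probing it per pair keeps the join linear in the number of rows, which matters at the stated sizes (millions of pairs and instances). Rows read with `csv.DictReader` would fit the same functions once the real column names are substituted.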

Provider: figshare