
ORCIDs mapped to PubMed authors

Source: doi.org; indexed 2025-01-16
Download link:
https://doi.org/10.13012/B2IDB-9246015_V1
Description:
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's 2018 baseline plus updates throughout 2018) and, for ORCIDs, primarily on the 2019 ORCID Public Data File (https://orcid.org/).

Matching an ORCID to an individual author name on a PMID is a non-trivial process. Anyone can create an ORCID and claim to have contributed to any published work. Many records claim too many articles and most claim too few. Even though ORCID records are (most?) often populated by author name searches in popular bibliographic databases, there is no confirmation that the person's name is listed on the article.

This dataset is the product of mapping ORCIDs to individual author names on PMIDs, even when the ORCID name does not match any author name on the PMID, and when there are multiple (good) candidate author names. The algorithm avoids assigning the ORCID to an article when there are no good candidates and when there are multiple equally good matches. For some ORCIDs that clearly claim too much, it triggers a very strict matching procedure (for ORCIDs that claim too much but the majority appear correct, e.g., 0000-0002-2788-5457), and sometimes deletes ORCIDs altogether when all (or nearly all) of their claimed PMIDs appear incorrect. When an individual clearly has multiple ORCIDs, it deletes the least complete of them (e.g., 0000-0002-1651-2428 vs 0000-0001-6258-4628).

It should be noted that ORCIDs that claim too much are not necessarily due to nefarious or trolling intentions, even though a few appear so. Certainly many are due to laziness, such as claiming everything with a particular last name. Some cases appear to be due to test engineers (e.g., 0000-0001-7243-8157; 0000-0002-1595-6203), or librarians assisting faculty (e.g., 0000-0003-3289-5681), or group/laboratory IDs (0000-0003-4234-1746), or having contributed to an article in capacities other than authorship such as an Investigator, an Editor, or part of a Collective (e.g., 0000-0003-2125-4256 as part of the FlyBase Consortium on PMID 22127867), or as a "Reply To", in which case the identity of the article and authors might be conflated. The NLM has, in the past, also limited the total number of authors indexed.

The dataset certainly has errors, but I have taken great care to fix some glaring ones (individuals who claim too much), while still capturing authors who have published under multiple names and not explicitly listed them in their ORCID profile. The final dataset provides a "matchscore" that could be used for further clean-up.

Four files:

person.tsv: 7,194,692 rows, including header
  1. orcid
  2. lastname
  3. firstname
  4. creditname
  5. othernames
  6. otherids
  7. emails

employment.tsv: 2,884,981 rows, including header
  1. orcid
  2. putcode
  3. role
  4. start-date
  5. end-date
  6. id
  7. source
  8. dept
  9. name
  10. city
  11. region
  12. country
  13. affiliation

education.tsv: 3,202,253 rows, including header
  1. orcid
  2. putcode
  3. role
  4. start-date
  5. end-date
  6. id
  7. source
  8. dept
  9. name
  10. city
  11. region
  12. country
  13. affiliation

pubmed2orcid.tsv: 13,133,065 rows, including header
  1. PMID
  2. au_order (author name position on the article)
  3. orcid
  4. matchscore (see below)
  5. source: orcid (2019 ORCID Public Data File https://orcid.org/), pubmed (NLM's distributed XML files), or patci (an earlier version of ORCID with citations processed through the Patci tool)

Rows by source: 12,037,375 from orcid; 1,065,892 from PubMed XML; 29,797 from Patci

matchscore:
  000: lastname, firstname and middle init match (e.g., Eric T MacKenzie)
  00: lastname, firstname match (e.g., Keith Ward)
  0: lastname, firstname reversed match (e.g., Conde Santiago vs Santiago Conde)
  1: lastname, first and middle init match (e.g., L. F. Panchenko)
  11: lastname and partial firstname match (e.g., Mike Boland vs Michael Boland or Mel Ziman vs Melanie Ziman)
  12: lastname and first init match
  15: 3 part lastname and firstname match (e.g., David Grahame Hardie vs D Grahame Hardie)
  2: lastname match and multipart firstname initial match (e.g., Maria Dolores Suarez Ortega vs M. D. Suarez)
  22: partial lastname match and firstname match (e.g., Erika Friedmann vs Erika Friedman)
  23: e.g., Antonio Garcia Garcia vs A G Garcia
  25: e.g., Allan Downie vs J A Downie
  26: e.g., Oliver Racz vs Oliver Bacz
  27: e.g., Rita Ostrovskaya vs R U Ostrovskaia
  29: e.g., Andrew Staehelin vs L A Staehlin
  3: e.g., M Tronko vs N D Tron'ko
  4: e.g., Sharon Dent (also known as Sharon Y.R. Dent; Sharon Y Roth; Sharon Yoder) vs Sharon Yoder
  45: e.g., Okulov Aleksei vs A B Okulov
  48: e.g., Maria Del Rosario Garcia De Vicuna Pinedo vs R Garcia-Vicuna
  49: e.g., Anatoliy Ivashchenko vs A Ivashenko
  5: lastname match only (weak match but sometimes captures alternative first name for better subsequent matches); e.g., Bill Hieb vs W F Hieb
  6: first name match only (weak match but sometimes captures alternative first name for better subsequent matches); e.g., Maria Borawska vs Maria Koscielak
  7: last or first name match on "other names"; e.g., Hromokovska Tetiana (also known as Gromokovskaia, T. S., Громоковська Тетяна) vs T Gromokovskaia
  77: e.g., Siva Subramanian vs Kolinjavadi N. Sivasubramanian
  88: no name in orcid but match caught by uniqueness of name across paper (at least 90% and 2 more than next most common name)

prefix:
  C: ambiguity reduced (possibly eliminated) using city match (e.g., H Yang on PMID 24972200)
  I: ambiguity eliminated by excluding investigators (i.e., one author and one or more investigators with that name)
  T: ambiguity eliminated using PubMed pos (T for tie-breaker)
  W: ambiguity resolved by authority2018
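
Because the matchscore codes include leading zeros (000, 00 and 0 are three different codes), they are best read as strings rather than integers. The Python sketch below shows one way the matchscore could be used for the further clean-up mentioned above: it splits each value into prefix and code and drops the two codes the description calls weak (5 and 6). It rests on several assumptions that the description does not confirm: that the header row uses the column names exactly as listed, that the prefix letter is stored in the same field directly in front of the numeric code, and that fields are unquoted. The sample rows and ORCIDs are invented for illustration and are not taken from the dataset.

```python
import csv
import io
import re

# Assumption: an optional one-letter prefix (C, I, T or W) directly
# precedes the numeric code inside the matchscore field.
_SCORE_RE = re.compile(r"^([CITW]?)(\d+)$")

# Codes the description labels as weak: last name only (5), first name only (6).
WEAK_CODES = {"5", "6"}


def parse_matchscore(raw):
    """Split a matchscore string into (prefix, code), keeping leading zeros."""
    m = _SCORE_RE.match(raw.strip())
    if m is None:
        raise ValueError(f"unrecognised matchscore: {raw!r}")
    return m.group(1), m.group(2)


def strong_links(tsv_file):
    """Yield rows whose matchscore code is not one of WEAK_CODES."""
    # Assumption: tab-separated, unquoted fields, header names as listed above.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        _prefix, code = parse_matchscore(row["matchscore"])
        if code not in WEAK_CODES:
            yield row


# Tiny in-memory stand-in for pubmed2orcid.tsv (made-up rows, not real data).
sample = io.StringIO(
    "PMID\tau_order\torcid\tmatchscore\tsource\n"
    "1\t1\t0000-0000-0000-0001\t000\torcid\n"
    "2\t3\t0000-0000-0000-0002\tC12\tpubmed\n"
    "3\t2\t0000-0000-0000-0003\t5\torcid\n"
)
kept = list(strong_links(sample))
print([row["PMID"] for row in kept])  # prints ['1', '2']
```

To run this against the real file, replace the in-memory sample with open("pubmed2orcid.tsv", newline="", encoding="utf-8"). Which codes count as too weak depends on the use case, so WEAK_CODES is only a starting point.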

Provider:
Illinois Data Bank