
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide

DataCite Commons | Updated 2025-05-19 | Indexed 2025-04-16
Download link:
https://databank.illinois.edu/datasets/IDB-4354331
Description:
MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik, 2018-04-05.

The dataset comes as a single tab-delimited, Latin-1 encoded file (only the City column uses non-ASCII characters) and should be about 3.5 GB uncompressed.

• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October 2016. Consult NLM for information on obtaining PubMed/MEDLINE and for NLM's data Terms and Conditions.
• Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only. However, MapAffil 2016 covers some PubMed records lacking affiliations that were harvested elsewhere: from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g., 5833220).
• Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and HTML), so they may differ (sometimes a lot; see PMID 27487542) from PubMed records.
• All affiliation strings were processed using the MapAffil procedure, to identify and disambiguate the most specific place-name, as described in: Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p
• Look for Fig. 4 in the following article for coverage statistics over time: Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33. Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at the Torvik Research Group.

• What is the format of the dataset?
The dataset contains 37,406,692 rows. Each row (line) in the file has a unique combination of PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen columns, tab-delimited. All columns are ASCII, except city, which contains Latin-1.
1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002, but many have been harvested from outside PubMed
5. year of publication
6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
8. state: Australia, Canada and USA (which includes territories like PR, GU, AS, and post-codes like AE and AA)
9. country
10. journal
11. lat: at most 3 decimals (only available when city is not a country or state)
12. lon: at most 3 decimals (only available when city is not a country or state)
13. fips: varchar(5); for USA only, retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
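
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' own loader, and it rests on several assumptions the description does not confirm: that the file has no header row, that fields are not quoted, and that a missing lat or lon appears as a value that does not parse as a number. The names `AffilRow`, `parse_line`, `read_mapaffil` and `org_type` are hypothetical, and the sample line is synthetic except for the PMID and author position quoted in the description.

```python
from typing import Iterator, List, NamedTuple, Optional

EXPECTED_COLUMNS = 13  # thirteen tab-delimited columns, per the description


class AffilRow(NamedTuple):
    pmid: int
    au_order: int
    lastname: str
    firstname: str
    year: str                   # kept as text; the description gives no type
    org_type: str               # EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL or UNK
    city_candidates: List[str]  # more than one entry if ambiguity is unresolved
    state: str
    country: str
    journal: str
    lat: Optional[float]
    lon: Optional[float]
    fips: str


def _to_float(text: str) -> Optional[float]:
    # Assumption: anything that is not a number means "not available".
    try:
        return float(text)
    except ValueError:
        return None


def parse_line(line: str) -> AffilRow:
    """Split one tab-delimited line into the thirteen documented columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != EXPECTED_COLUMNS:
        raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(fields)}")
    (pmid, au_order, lastname, firstname, year, org_type,
     city, state, country, journal, lat, lon, fips) = fields
    return AffilRow(
        pmid=int(pmid),
        au_order=int(au_order),
        lastname=lastname,
        firstname=firstname,
        year=year,
        org_type=org_type,
        # Unresolved ambiguities are concatenated by '|' in the city column.
        city_candidates=city.split("|") if city else [],
        state=state,
        country=country,
        journal=journal,
        lat=_to_float(lat),
        lon=_to_float(lon),
        fips=fips,
    )


def read_mapaffil(path: str) -> Iterator[AffilRow]:
    """Stream rows one at a time; the file is about 3.5 GB uncompressed."""
    with open(path, encoding="latin-1") as handle:
        for line in handle:
            yield parse_line(line)


# Synthetic example line (all values except PMID and position are made up).
SAMPLE = ("10786286\t3\tDoe\tJ\t2000\tEDU\tUrbana, IL, USA\tIL\tUSA\t"
          "Example J\t40.110\t-88.207\t17019\n")
row = parse_line(SAMPLE)
print(f"{row.pmid}_{row.au_order}", row.city_candidates, row.lat, row.lon)
```

Streaming with a generator avoids holding all 37,406,692 rows in memory at once. If the file turns out to have a header row, skip the first line before calling `parse_line`, since the integer conversion of PMID would otherwise fail.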

Provider:
University of Illinois Urbana-Champaign
Created:
2018-04-19