
OpCitance: Citation contexts identified from the PubMed Central open access articles

Mendeley Data | Updated 2024-06-25 | Indexed 2024-06-28
Download link:
https://databank.illinois.edu/datasets/IDB-7312599
Description:
Sentences and citation contexts identified from the PubMed Central open access articles
----------------------------------------------------------------------

The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.

The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

Files:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.

Each row in the file is a sentence/citation context and contains the following columns:
• pmcid: PMCID of the article.
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component.
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources where the intxt_pmid can be identified. Xml represents that the PMID is only identified from the XML file; xml,pmc represents that the PMID is not only from the XML file, but also in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
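Because the files hold hundreds of millions of rows in total, streaming them row by row is likely more practical than loading a file whole. The sketch below tallies rows per IMRaD label and counts rows whose intxt_pmid is not "-". It rests on assumptions the description does not confirm: that each file starts with a header row and that the columns appear in the order listed above. The function name `summarize`, the `has_header` flag, and every value in `SAMPLE` are invented for illustration.

```python
import csv
import io
from collections import Counter

# Column names as listed in the description; their order in the real
# files is an assumption.
COLUMNS = [
    "pmcid", "pmid", "location", "IMRaD", "sentence_id", "total_sentences",
    "intxt_id", "intxt_pmid", "intxt_pmid_source", "intxt_mark",
    "best_id", "best_source", "best_id_diff", "citation", "progression",
]


def summarize(lines, has_header=True):
    """Stream tab-delimited rows; tally IMRaD labels and rows with a tagged PMID."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:  # assumption: the first row holds column names
        next(reader, None)
    by_section = Counter()
    with_pmid = 0
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        by_section[row.get("IMRaD", "")] += 1
        # Per the description, "-" means no PMID was tagged in the XML.
        if row.get("intxt_pmid", "-") != "-":
            with_pmid += 1
    return by_section, with_pmid


# Synthetic rows for demonstration only; none of these values come from the dataset.
_ROWS = [
    COLUMNS,
    ["PMC0000001", "NONE", "main text", "I", "1", "3", "1", "12345", "xml",
     "[1]", "12345", "xml", "placeholder", "Prior work reported this [1].", "placeholder"],
    ["PMC0000001", "NONE", "main text", "M", "2", "3", "-", "-", "-",
     "-", "-", "-", "placeholder", "We measured the samples twice.", "placeholder"],
    ["PMC0000001", "NONE", "table", "NoIMRaD", "3", "3", "-", "-", "-",
     "-", "-", "-", "placeholder", "Values are means.", "placeholder"],
]
SAMPLE = "\n".join("\t".join(r) for r in _ROWS) + "\n"

SECTION_COUNTS, TAGGED = summarize(io.StringIO(SAMPLE))
print(dict(SECTION_COUNTS), TAGGED)
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object, for example `open("A_journal_IntxtCit.tsv", encoding="utf-8", newline="")`; the UTF-8 encoding is also an assumption.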
Supplementary Files
• PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column.

Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
• pmcid: PMCID of the citing article.
• pos: The citation's position in the reference list.
• fromPMID: PMID of the citing article.
• toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
• SRC: The sources that confirm the toPMID.
• MatchDB: The origin bibliographic database of the toPMID.
• Probability: The match probability of the toPMID.
• toPMID2: PMID of the citation (as tagged in the XML file).
• SRC2: The sources that confirm the toPMID2.
• intxt_id: The ID of the citation.
• journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
• same_ref_string: Whether the citation string appears in the reference list more than once.
• DIFF: The comparison result between the toPMID column and the toPMID2 column.
• bestID: The best source link ID (e.g., PMID) of the citation.
• bestSRC: The sources that confirm the best ID.
• Match: Matching result produced by Patci.

[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885

• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists [2], retrieved on June 19, 2020, and March 9, 2023. Note that the license information for 189,855 PMCIDs is NO-CC CODE in the file lists, and 521 PMCIDs are absent from the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (NO-CC CODE) and PMCID: 6118074 (absent from PMC's file lists) are under CC-BY licenses according to the PDF versions of the articles.

The intxt_cit_license_fromPMC.tsv file has two columns:
• pmcid: PMCID of the article.
• license: The article's CC license information provided in PMC's file lists. The value is nan when an article is not present in PMC's file lists.

[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/

• Supplementary_File_1.zip – This file contains the code for generating the dataset.
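Since NO-CC CODE and nan do not mean an article lacks a CC license, code that filters by license may want to treat both as "unknown" rather than as "not CC-licensed". The following is a minimal sketch of that idea, not the authors' code. It assumes the license file has a header row and that the pmcid column holds the bare number; the names `load_licenses` and `license_status`, the `has_header` flag, and the "CC BY" row are hypothetical.

```python
import csv
import io

# Values that, per the description, leave the license question open.
UNKNOWN_LICENSE_VALUES = {"NO-CC CODE", "nan"}


def load_licenses(lines, has_header=True):
    """Read two-column (pmcid, license) tab-delimited rows into a dict."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:  # assumption: the first row holds column names
        next(reader, None)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def license_status(licenses, pmcid):
    """Return the recorded license, or "unknown" when the record is inconclusive."""
    value = licenses.get(pmcid)
    if value is None or value in UNKNOWN_LICENSE_VALUES:
        return "unknown"
    return value


# Demonstration rows. The first two mirror the examples in the description;
# the third PMCID and its "CC BY" value are placeholders.
SAMPLE_LICENSES = (
    "pmcid\tlicense\n"
    "6156294\tNO-CC CODE\n"
    "6118074\tnan\n"
    "0000001\tCC BY\n"
)

LICENSES = load_licenses(io.StringIO(SAMPLE_LICENSES))
for pmcid in ("6156294", "6118074", "0000001"):
    print(pmcid, license_status(LICENSES, pmcid))
```

Articles reported as "unknown" here would still need a manual check (for example, against the article PDF) before any conclusion about reuse rights is drawn.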

Created: 2023-06-28