OpCitance: Citation contexts identified from the PubMed Central open access articles

Name: OpCitance: Citation contexts identified from the PubMed Central open access articles
Creator: Illinois Data Bank
License: No description available

Indexed from doi.org on 2025-01-15

Download link:

https://doi.org/10.13012/B2IDB-4353270_V2


Description:

Sentences and citation contexts identified from the PubMed Central open access articles
----------------------------------------------------------------------

The dataset is delivered as 24 tab-delimited text files. The files contain 720,649,608 sentences, 75,848,689 of which are citation contexts. The dataset is based on a snapshot of articles in the XML version of the PubMed Central open access subset (i.e., the PMCOA subset). The PMCOA subset was collected in May 2019.

The dataset is created as described in: Hsiao TK., & Torvik V. I. (manuscript) OpCitance: Citation contexts identified from the PubMed Central open access articles.

Files:
• A_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with A.
• B_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with B.
• C_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with C.
• D_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with D.
• E_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with E.
• F_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with F.
• G_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with G.
• H_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with H.
• I_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with I.
• J_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with J.
• K_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with K.
• L_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with L.
• M_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with M.
• N_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with N.
• O_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with O.
• P_p1_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 1).
• P_p2_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with P (part 2).
• Q_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with Q.
• R_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with R.
• S_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with S.
• T_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with T.
• UV_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with U or V.
• W_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with W.
• XYZ_journal_IntxtCit.tsv – Sentences and citation contexts identified from articles published in journals with journal titles starting with X, Y or Z.

Each row in the file is a sentence/citation context and contains the following columns:
• pmcid: PMCID of the article.
• pmid: PMID of the article. If an article does not have a PMID, the value is NONE.
• location: The article component (abstract, main text, table, figure, etc.) to which the citation context/sentence belongs.
• IMRaD: The type of IMRaD section associated with the citation context/sentence. I, M, R, and D represent introduction/background, method, results, and conclusion/discussion, respectively; NoIMRaD indicates that the section type is not identifiable.
• sentence_id: The ID of the citation context/sentence in the article component.
• total_sentences: The number of sentences in the article component.
• intxt_id: The ID of the citation.
• intxt_pmid: PMID of the citation (as tagged in the XML file). If a citation does not have a PMID tagged in the XML file, the value is "-".
• intxt_pmid_source: The sources where the intxt_pmid can be identified. xml indicates that the PMID is identified only from the XML file; xml,pmc indicates that the PMID is found both in the XML file and in the citation data collected from the NCBI Entrez Programming Utilities. If a citation does not have an intxt_pmid, the value is "-".
• intxt_mark: The citation marker associated with the inline citation.
• best_id: The best source link ID (e.g., PMID) of the citation.
• best_source: The sources that confirm the best ID.
• best_id_diff: The comparison result between the best_id column and the intxt_pmid column.
• citation: A citation context. If no citation is found in a sentence, the value is the sentence.
• progression: Text progression of the citation context/sentence.
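
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' code: it assumes the columns appear in the order listed, that the files carry no header row unless `has_header=True` is passed, and that sentences may contain literal quote characters (hence `csv.QUOTE_NONE`). The helper names `read_rows` and `count_by_imrad` and all sample values are invented for illustration; only the `NONE` and `-` placeholders come from the description.

```python
import csv
import io
from collections import Counter

# Column names as listed in the dataset description; their order is assumed.
COLUMNS = [
    "pmcid", "pmid", "location", "IMRaD", "sentence_id", "total_sentences",
    "intxt_id", "intxt_pmid", "intxt_pmid_source", "intxt_mark",
    "best_id", "best_source", "best_id_diff", "citation", "progression",
]


def read_rows(lines, has_header=False):
    """Yield one dict per sentence/citation context from tab-delimited lines."""
    # QUOTE_NONE keeps quote characters inside sentences as plain text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for fields in reader:
        row = dict(zip(COLUMNS, fields))
        # Documented placeholders for missing identifiers become None.
        if row.get("pmid") == "NONE":
            row["pmid"] = None
        if row.get("intxt_pmid") == "-":
            row["intxt_pmid"] = None
        if row.get("intxt_pmid_source") == "-":
            row["intxt_pmid_source"] = None
        yield row


def count_by_imrad(rows):
    """Count rows per IMRaD section label (I, M, R, D, NoIMRaD)."""
    return Counter(row["IMRaD"] for row in rows)


# Two invented rows standing in for a real *_journal_IntxtCit.tsv file.
sample_rows = [
    ["1000001", "NONE", "main text", "I", "1", "2", "c1", "-", "-", "[1]",
     "-", "-", "-", 'A "quoted" claim [1].', "0.5"],
    ["1000001", "NONE", "main text", "M", "2", "2", "c2", "12345", "xml",
     "[2]", "12345", "xml", "-", "We followed the protocol in [2].", "1.0"],
]
sample = "\n".join("\t".join(fields) for fields in sample_rows)

rows = list(read_rows(io.StringIO(sample)))
counts = count_by_imrad(rows)
print(counts)
```

For a real file, an open file handle can be passed in place of the `io.StringIO` object, so rows are streamed rather than loaded into memory at once, which matters given the size of the dataset.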
Supplementary Files

• PMC-OA-patci.tsv.gz – This file contains the best source link IDs for the references (e.g., PMID). Patci [1] was used to identify the best source link IDs. The best source link IDs are mapped to the citation contexts and displayed in the *_journal_IntxtCit.tsv files as the best_id column.

Each row in the PMC-OA-patci.tsv.gz file is a citation (i.e., a reference extracted from the XML file) and contains the following columns:
• pmcid: PMCID of the citing article.
• pos: The citation's position in the reference list.
• fromPMID: PMID of the citing article.
• toPMID: Source link ID (e.g., PMID) of the citation. This ID is identified by Patci.
• SRC: The sources that confirm the toPMID.
• MatchDB: The origin bibliographic database of the toPMID.
• Probability: The match probability of the toPMID.
• toPMID2: PMID of the citation (as tagged in the XML file).
• SRC2: The sources that confirm the toPMID2.
• intxt_id: The ID of the citation.
• journal: The first letter of the journal title. This maps to the *_journal_IntxtCit.tsv files.
• same_ref_string: Whether the citation string appears in the reference list more than once.
• DIFF: The comparison result between the toPMID column and the toPMID2 column.
• bestID: The best source link ID (e.g., PMID) of the citation.
• bestSRC: The sources that confirm the best ID.
• Match: Matching result produced by Patci.

[1] Agarwal, S., Lincoln, M., Cai, H., & Torvik, V. (2014). Patci – a tool for identifying scientific articles cited by patents. GSLIS Research Showcase 2014. http://hdl.handle.net/2142/54885

• intxt_cit_license_fromPMC.tsv – This file contains the CC licensing information for each article. The licensing information is from PMC's file lists [2], retrieved on June 19, 2020, and March 9, 2023. It should be noted that the license information for 189,855 PMCIDs is NO-CC CODE in the file lists, and 521 PMCIDs are absent in the file lists. The absence of CC licensing information does not indicate that the article lacks a CC license. For example, PMCID: 6156294 (NO-CC CODE) and PMCID: 6118074 (absent in the PMC's file lists) are under CC-BY licenses according to their PDF versions of articles.

The intxt_cit_license_fromPMC.tsv file has two columns:
• pmcid: PMCID of the article.
• license: The article’s CC license information provided in PMC’s file lists. The value is nan when an article is not present in the PMC’s file lists.

[2] https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/

• Supplementary_File_1.zip – This file contains the code for generating the dataset.
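
Because a NO-CC CODE or nan value means the license is unknown rather than absent, a lookup over intxt_cit_license_fromPMC.tsv should keep those cases distinct from a known CC license. The sketch below illustrates one way to do that under stated assumptions: the file is assumed to have a header row, PMCIDs are assumed to be stored in the bare numeric form used in the description, and the `CC BY` license string and the PMCID `1000001` are invented sample values. `load_licenses` is a hypothetical helper, not part of the dataset's code.

```python
import csv
import io

# Values that mean "license unknown", not "no CC license" (see description).
UNKNOWN_VALUES = {"NO-CC CODE", "nan", ""}


def load_licenses(lines, has_header=True):
    """Map each PMCID to its CC license string, or None when it is unknown."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the assumed header row
    licenses = {}
    for fields in reader:
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        pmcid, license_value = fields[0], fields[1]
        if license_value in UNKNOWN_VALUES:
            licenses[pmcid] = None
        else:
            licenses[pmcid] = license_value
    return licenses


# Invented sample; the first two PMCIDs are the examples from the description.
sample = (
    "pmcid\tlicense\n"
    "6156294\tNO-CC CODE\n"
    "6118074\tnan\n"
    "1000001\tCC BY\n"
)

licenses = load_licenses(io.StringIO(sample))
for pmcid, license_value in licenses.items():
    print(pmcid, license_value if license_value is not None else "unknown")
```

Rows that map to None are candidates for a manual check against the article itself, as the two example PMCIDs show.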


Provider:

Illinois Data Bank
