OCR results from 65k pathway figures published between 1995 and 2019
Source: DataCite Commons. Updated 2022-10-25; indexed 2024-07-13.
Download link:
https://nih.figshare.com/articles/OCR_results_from_65k_pathway_figures_published_between_1995_and_2019/12420881
Description:
In this study, we aimed to identify pathway figures published in the past 25 years, to characterize the human gene content in figures by optical character recognition (OCR), and to describe their utility as a resource for pathway knowledge. First, we trained a machine learning service on manually classified figures and applied it to 235,081 image query results from PubMed Central. Then, we performed OCR to extract recognized human gene symbols from the images. Altogether, we identified 64,643 pathway figures published between 1995 and 2019, depicting 1,112,551 instances of human genes (13,464 unique NCBI Genes) in various interactions and contexts.<br>This supplementary file contains the tidy OCR results for the 65k pathway figures published between 1995 and 2019. Columns include:<br>1. <b>figid</b> - Unique figure identifier: concatenation of PMCID and figure filename<br>2. <b>pmcid</b> - PubMed Central Identifier<br>3. <b>word</b> - Character string extracted by OCR and eventually matched to human HGNC lexicon<br>4. <b>symbol</b> - Transformed word that matched lexicon<br>5. <b>source</b> - Source of lexicon that matched symbol<br>6. <b>hgnc_symbol</b> - Official human gene symbol<br>7. <b>entrez</b> - Official NCBI Gene identifier<br>
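The seven columns above describe one row per recognized gene mention, so per-figure gene sets and overall counts can be derived by grouping on figid. The following is a minimal sketch of that idea, not the authors' pipeline. It assumes a tab-delimited text export of the table; the sample rows, the figid separator, and the source values are made up for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; they are NOT taken from the dataset.
# The figid separator ("__") and the `source` values are assumptions.
SAMPLE_TSV = """figid\tpmcid\tword\tsymbol\tsource\thgnc_symbol\tentrez
PMC0000001__fig1.jpg\tPMC0000001\tp53\tP53\talias\tTP53\t7157
PMC0000001__fig1.jpg\tPMC0000001\tAKT1\tAKT1\tsymbol\tAKT1\t207
PMC0000001__fig1.jpg\tPMC0000001\tAkt1\tAKT1\tsymbol\tAKT1\t207
PMC0000002__fig3.jpg\tPMC0000002\tmTOR\tMTOR\tsymbol\tMTOR\t2475
"""


def read_rows(handle, delimiter="\t"):
    """Read the tidy table into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def genes_per_figure(rows):
    """Map each figid to the set of Entrez IDs recognized in that figure."""
    genes = defaultdict(set)
    for row in rows:
        genes[row["figid"]].add(row["entrez"])
    return dict(genes)


rows = read_rows(io.StringIO(SAMPLE_TSV))
per_figure = genes_per_figure(rows)

# Every row is one gene instance; unique genes are counted by Entrez ID.
total_instances = len(rows)
unique_genes = set().union(*per_figure.values())

print("figures:", len(per_figure))
print("gene instances:", total_instances)
print("unique genes:", len(unique_genes))
```

To run this against the real file, replace `io.StringIO(SAMPLE_TSV)` with an open file handle and adjust `delimiter` to match the downloaded format, which is not specified in the description above.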
Provider:
National Institutes of Health
Created:
2020-06-03



