NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition
Source: DataCite Commons. Updated: 2025-05-01. Indexed: 2025-04-09.
Download link:
https://datadryad.org/dataset/doi:10.5061/dryad.dv41ns1wt
Description:
The automatic recognition of gene names and their corresponding database
identifiers in biomedical text is an important first step for many
downstream text-mining applications. The NLM-Gene corpus is a high-quality
manually annotated corpus for genes, covering ambiguous gene names, with
an average of 29 gene mentions (10 unique identifiers) per article, and a
broader representation of different species (including Homo sapiens, Mus
musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis
thaliana, Danio rerio, etc.) when compared to previous gene annotation
corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical
journals, doubly annotated by six experienced NLM indexers, randomly
paired for each article to control for bias. The annotators worked in
three annotation rounds until they reached complete agreement.
Using the new resource, we developed a new gene-finding
algorithm based on deep learning that improved on both precision and
recall over existing tools. The NLM-Gene annotated corpus is freely
available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/.
The gene-finding results from applying this tool to all of PubMed/PMC
are freely accessible through our web-based tool PubTator.
Provider:
Dryad
Created:
2021-07-10



