chemdner
Source: ModelScope community (魔搭社区); updated 2025-11-27; indexed 2025-05-31
Download link:
https://modelscope.cn/datasets/jablonkagroup/chemdner
Resource overview:
## Dataset Details
### Dataset Description
The CHEMDNER corpus is a collection of 10,000 PubMed abstracts containing a total of 84,355 chemical entity mentions, each labeled manually by expert chemistry literature curators following annotation guidelines defined specifically for this task.
- **Curated by:**
- **License:** unknown
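
To make the annotation unit concrete, here is a minimal sketch of how a single chemical entity mention could be represented in Python. The seven class names come from the paper abstract quoted in the Citation section; everything else is an assumption made for illustration: the `Mention` schema, its field names, the helper functions, the example sentence, and the class assigned to each example mention are hypothetical and have not been checked against the released files or the annotation guidelines.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SacemClass(Enum):
    """The seven SACEM classes named in the CHEMDNER paper abstract."""
    ABBREVIATION = "abbreviation"
    FAMILY = "family"
    FORMULA = "formula"
    IDENTIFIER = "identifier"
    MULTIPLE = "multiple"
    SYSTEMATIC = "systematic"
    TRIVIAL = "trivial"


@dataclass(frozen=True)
class Mention:
    """One chemical entity mention (hypothetical schema, not the release format)."""
    start: int         # character offset where the mention begins
    end: int           # character offset one past the last character
    text: str          # surface string of the mention
    sacem: SacemClass  # assigned SACEM class


def check_offsets(abstract: str, mentions: list[Mention]) -> bool:
    """Return True if every mention's offsets select exactly its text."""
    return all(abstract[m.start:m.end] == m.text for m in mentions)


def count_by_class(mentions: list[Mention]) -> Counter:
    """Count mentions per SACEM class."""
    return Counter(m.sacem for m in mentions)


# Made-up sentence and illustrative labels, not taken from the corpus.
abstract = "Aspirin and NaCl were dissolved in DMSO."
mentions = [
    Mention(0, 7, "Aspirin", SacemClass.TRIVIAL),
    Mention(12, 16, "NaCl", SacemClass.FORMULA),
    Mention(35, 39, "DMSO", SacemClass.ABBREVIATION),
]

print(check_offsets(abstract, mentions))
print(count_by_class(mentions))
```

Storing character offsets alongside the surface string makes it cheap to detect misaligned annotations after any text preprocessing step.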
### Dataset Sources
- [original dataset](https://huggingface.co/datasets/bigbio/chemdner)
## Citation
**BibTeX:**
```bibtex
@article{Krallinger2015,
title = {The CHEMDNER corpus of chemicals and drugs and its annotation principles},
author = {
Krallinger, Martin and Rabal, Obdulia and Leitner, Florian and Vazquez,
Miguel and Salgado, David and Lu, Zhiyong and Leaman, Robert and Lu, Yanan
and Ji, Donghong and Lowe, Daniel M. and Sayle, Roger A. and
Batista-Navarro, Riza Theresa and Rak, Rafal and Huber, Torsten and
Rockt{\"a}schel, Tim and Matos, S{\'e}rgio and Campos, David and Tang,
Buzhou and Xu, Hua and Munkhdalai, Tsendsuren and Ryu, Keun Ho and Ramanan,
S. V. and Nathan, Senthil and {\v{Z}}itnik, Slavko and Bajec, Marko and
Weber, Lutz and Irmer, Matthias and Akhondi, Saber A. and Kors, Jan A. and
Xu, Shuo and An, Xin and Sikdar, Utpal Kumar and Ekbal, Asif and Yoshioka,
Masaharu and Dieb, Thaer M. and Choi, Miji and Verspoor, Karin and Khabsa,
Madian and Giles, C. Lee and Liu, Hongfang and Ravikumar, Komandur
Elayavilli and Lamurias, Andre and Couto, Francisco M. and Dai, Hong-Jie
and Tsai, Richard Tzong-Han and Ata, Caglar and Can, Tolga and Usi{\'e},
Anabel and Alves, Rui and Segura-Bedmar, Isabel and Mart{\'i}nez, Paloma
and Oyarzabal, Julen and Valencia, Alfonso
},
year = 2015,
month = {Jan},
day = 19,
journal = {Journal of Cheminformatics},
volume = 7,
number = 1,
pages = {S2},
doi = {10.1186/1758-2946-7-S1-S2},
issn = {1758-2946},
url = {https://doi.org/10.1186/1758-2946-7-S1-S2},
abstract = {
The automatic extraction of chemical information from text requires the
recognition of chemical entity mentions as one of its key steps. When
developing supervised named entity recognition (NER) systems, the
availability of a large, manually annotated text corpus is desirable.
Furthermore, large corpora permit the robust evaluation and comparison of
different approaches that detect chemicals in documents. We present the
CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a
total of 84,355 chemical entity mentions labeled manually by expert
chemistry literature curators, following annotation guidelines specifically
defined for this task. The abstracts of the CHEMDNER corpus were selected
to be representative for all major chemical disciplines. Each of the
chemical entity mentions was manually labeled according to its
structure-associated chemical entity mention (SACEM) class: abbreviation,
family, formula, identifier, multiple, systematic and trivial. The
difficulty and consistency of tagging chemicals in text was measured using
an agreement study between annotators, obtaining a percentage agreement of
91\%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts)
we provide not only the Gold Standard manual annotations, but also mentions
automatically detected by the 26 teams that participated in the BioCreative
IV CHEMDNER chemical mention recognition task. In addition, we release the
CHEMDNER silver standard corpus of automatically extracted mentions from
17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus
in the BioC format has been generated as well. We propose a standard for
required minimum information about entity annotations for the construction
of domain specific corpora on chemical and drug entities. The CHEMDNER
corpus and annotation guidelines are available at:
http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
}
}
```
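
The abstract above reports a percentage agreement between annotators without spelling out the formula here. As a rough illustration only, the sketch below computes one common variant: the share of distinct mention spans that both annotators marked with identical offsets. This definition, the function name `percentage_agreement`, and the example spans are assumptions for illustration and may differ from the measure used in the paper.

```python
def percentage_agreement(annotator_a, annotator_b):
    """Percentage of distinct spans marked identically by both annotators.

    Assumption: agreement = |A intersect B| / |A union B| on exact
    (start, end) character spans. Partial overlaps count as disagreement.
    """
    a, b = set(annotator_a), set(annotator_b)
    union = a | b
    if not union:
        # Neither annotator marked anything, so they trivially agree.
        return 100.0
    return 100.0 * len(a & b) / len(union)


# Made-up spans for one abstract: the annotators agree on two mentions
# and disagree on the boundaries of a third.
spans_a = {(0, 7), (12, 16), (35, 39)}
spans_b = {(0, 7), (12, 16), (30, 39)}

print(percentage_agreement(spans_a, spans_b))  # 2 shared of 4 distinct spans -> 50.0
```

Exact-span matching is the strictest choice; a more lenient variant would count overlapping spans as agreement and report a higher figure on the same data.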
Provider:
maas
Created:
2025-05-28
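Collected Summary
Dataset Introduction

Background and Challenges
Background Overview
CHEMDNER is a chemical named entity recognition dataset comprising 10,000 PubMed abstracts and 84,355 expert-annotated chemical entity mentions, covering all major chemical disciplines. It is an important resource for developing chemical information extraction systems.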