bigbio/chemdner

Name: bigbio/chemdner
Creator: bigbio
Published: 2022-12-22 15:44:21
License: No description available

Hugging Face · Updated 2022-12-22 · Indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/bigbio/chemdner

Resource overview:

---
language:
- en
bigbio_language:
- English
license: unknown
multilinguality: monolingual
bigbio_license_shortname: UNKNOWN
pretty_name: CHEMDNER
homepage: https://biocreative.bioinformatics.udel.edu/resources/biocreative-iv/chemdner-corpus/
bigbio_pubmed: True
bigbio_public: True
bigbio_tasks:
- NAMED_ENTITY_RECOGNITION
- TEXT_CLASSIFICATION
---

# Dataset Card for CHEMDNER

## Dataset Description

- **Homepage:** https://biocreative.bioinformatics.udel.edu/resources/biocreative-iv/chemdner-corpus/
- **Pubmed:** True
- **Public:** True
- **Tasks:** NER,TXTCLASS

We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial.

## Citation Information

```
@article{Krallinger2015,
  title = {The CHEMDNER corpus of chemicals and drugs and its annotation principles},
  author = {Krallinger, Martin and Rabal, Obdulia and Leitner, Florian and Vazquez, Miguel and Salgado, David and Lu, Zhiyong and Leaman, Robert and Lu, Yanan and Ji, Donghong and Lowe, Daniel M. and Sayle, Roger A. and Batista-Navarro, Riza Theresa and Rak, Rafal and Huber, Torsten and Rockt{\"a}schel, Tim and Matos, S{\'e}rgio and Campos, David and Tang, Buzhou and Xu, Hua and Munkhdalai, Tsendsuren and Ryu, Keun Ho and Ramanan, S. V. and Nathan, Senthil and {\v{Z}}itnik, Slavko and Bajec, Marko and Weber, Lutz and Irmer, Matthias and Akhondi, Saber A. and Kors, Jan A. and Xu, Shuo and An, Xin and Sikdar, Utpal Kumar and Ekbal, Asif and Yoshioka, Masaharu and Dieb, Thaer M. and Choi, Miji and Verspoor, Karin and Khabsa, Madian and Giles, C. Lee and Liu, Hongfang and Ravikumar, Komandur Elayavilli and Lamurias, Andre and Couto, Francisco M. and Dai, Hong-Jie and Tsai, Richard Tzong-Han and Ata, Caglar and Can, Tolga and Usi{\'e}, Anabel and Alves, Rui and Segura-Bedmar, Isabel and Mart{\'i}nez, Paloma and Oyarzabal, Julen and Valencia, Alfonso},
  year = 2015,
  month = {Jan},
  day = 19,
  journal = {Journal of Cheminformatics},
  volume = 7,
  number = 1,
  pages = {S2},
  doi = {10.1186/1758-2946-7-S1-S2},
  issn = {1758-2946},
  url = {https://doi.org/10.1186/1758-2946-7-S1-S2},
  abstract = {The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/}
}
```

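Provider:

bigbio

Summary of original information

Dataset overview

Basic information

Name: CHEMDNER
Language: English
License: Unknown
Multilinguality: Monolingual
Public: Yes
Sourced from PubMed: Yes

Dataset description

Content: 10,000 PubMed abstracts containing a total of 84,355 chemical entity mentions, labeled manually by expert chemistry literature curators.
Annotation type: Each mention is manually labeled with its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic, or trivial.
Representativeness: The abstracts were selected to be representative of all major chemical disciplines.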
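
As a rough illustration of the seven SACEM classes listed above, the following sketch tallies mentions by class. The class names follow the wording of the dataset card; the `mentions` list, the class assigned to each example, and the `count_by_class` helper are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# The seven SACEM classes named in the dataset card.
SACEM_CLASSES = (
    "abbreviation", "family", "formula", "identifier",
    "multiple", "systematic", "trivial",
)

def count_by_class(mentions):
    """Count mentions per SACEM class, rejecting unknown labels."""
    counts = Counter({cls: 0 for cls in SACEM_CLASSES})
    for _text, cls in mentions:
        if cls not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {cls!r}")
        counts[cls] += 1
    return counts

# Hypothetical (mention text, class) pairs, invented for illustration.
mentions = [
    ("aspirin", "trivial"),
    ("NaCl", "formula"),
    ("2-acetoxybenzoic acid", "systematic"),
    ("DMSO", "abbreviation"),
    ("alkaloids", "family"),
    ("caffeine", "trivial"),
]

counts = count_by_class(mentions)
for cls in SACEM_CLASSES:
    print(f"{cls:<12} {counts[cls]}")
```

Starting every class at zero keeps classes with no mentions visible in the tally, which matters when comparing class distributions across subsets.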

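Task types

Named entity recognition (NER)
Text classification (TXTCLASS)

Citation information

Article title: The CHEMDNER corpus of chemicals and drugs and its annotation principles
Authors: Krallinger, Martin et al.
Year of publication: 2015
Journal: Journal of Cheminformatics
Volume/issue/pages: 7(1):S2
DOI: 10.1186/1758-2946-7-S1-S2
Abstract: Describes the construction of the CHEMDNER corpus, its annotation principles, and its use in automatic chemical information extraction.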

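Collected summary

Dataset introduction

Construction

In cheminformatics, high-quality annotated corpora underpin progress in named entity recognition. The CHEMDNER corpus was built by selecting 10,000 PubMed abstracts chosen to be representative of all major chemical disciplines. Expert chemistry literature curators then annotated the chemical entities in these abstracts by hand, following guidelines written specifically for the task. The result is 84,355 labeled chemical entity mentions, each assigned a structure-associated chemical entity mention (SACEM) class.

Features

The dataset stands out for its scale and annotation depth. Its mentions are divided into seven classes (abbreviation, family, formula, identifier, multiple, systematic, and trivial), which gives models structured information about the varied ways chemicals are written. For the 3,000-abstract test set, the corpus also includes the mentions automatically detected by the 26 teams that took part in the task, and a separate silver standard corpus provides automatically extracted mentions from 17,000 randomly selected PubMed abstracts. Together these layers support model training, evaluation, and comparison.

Usage

For researchers working on chemical named entity recognition, CHEMDNER serves as a standard benchmark. It can be used directly to train and test supervised machine learning models, and its class labels allow evaluation of both mention detection and mention classification. The gold standard annotations, team predictions, and silver standard corpus can be used for performance comparison, error analysis, and exploration of semi-supervised learning. A version of the corpus is also available in BioC format, which eases integration into existing text-processing pipelines.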
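
Because BioC is an XML-based format, annotations in that version can in principle be read with a standard XML parser. The sketch below assumes the general BioC layout (`collection`, `document`, `passage`, and `annotation` elements with `infon`, `location`, and `text` children). The sample document, its id, and the use of a `type` infon for the SACEM class are assumptions made for illustration, and the actual CHEMDNER BioC files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC-style snippet; the id, text, and offsets are invented.
SAMPLE_BIOC = """<collection>
  <source>example</source>
  <document>
    <id>0000001</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>Aspirin and NaCl were dissolved in DMSO.</text>
      <annotation id="T1">
        <infon key="type">trivial</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">formula</infon>
        <location offset="12" length="4"/>
        <text>NaCl</text>
      </annotation>
    </passage>
  </document>
</collection>"""

def read_mentions(xml_string):
    """Return (doc_id, start, end, class, text) tuples from BioC-style XML."""
    root = ET.fromstring(xml_string)
    mentions = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for ann in doc.iter("annotation"):
            loc = ann.find("location")
            start = int(loc.get("offset"))
            end = start + int(loc.get("length"))  # exclusive end offset
            label = ann.findtext("infon[@key='type']")
            mentions.append((doc_id, start, end, label, ann.findtext("text")))
    return mentions

mentions = read_mentions(SAMPLE_BIOC)
for mention in mentions:
    print(mention)
```

Converting `offset` plus `length` into a half-open `(start, end)` range lets the mention text be recovered with an ordinary string slice of the passage text.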

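Background and challenges

Background overview

Automatically recognizing chemical entities in the scientific literature is a key step in knowledge discovery for cheminformatics. The CHEMDNER corpus was built for the BioCreative IV CHEMDNER chemical mention recognition task and described in a 2015 article by Martin Krallinger and colleagues, with the aim of providing a high-quality annotated resource for chemical named entity recognition. It comprises 10,000 PubMed abstracts selected to represent all major chemical disciplines, in which expert chemistry literature curators manually labeled 84,355 chemical entity mentions and assigned each a structure-associated chemical entity mention (SACEM) class. The resource has supported the development of chemical text mining methods and supplies data for applications such as drug discovery and chemical biology.

Current challenges

Chemical named entity recognition faces several challenges. Chemical names take many forms, including systematic names, trivial names, abbreviations, formulas, and family names, and they often overlap with ordinary vocabulary, which makes mention boundaries ambiguous and disambiguation difficult. Keeping annotations consistent during corpus construction was also hard: it required detailed annotation guidelines and an inter-annotator agreement study, which yielded a percentage agreement of 91. In addition, the abstracts had to be representative of all major chemical disciplines, and large-scale manual annotation demands considerable expert time and cost.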
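
The source reports the agreement figure but not the formula behind it. As one plausible reading, the sketch below computes a percentage agreement over exact character spans by dividing the mentions both annotators marked by all distinct mentions either of them marked. The exact-match criterion, the `percent_agreement` helper, and the example spans are all assumptions rather than the method used for the corpus.

```python
def percent_agreement(spans_a, spans_b):
    """Share of distinct spans that both annotators marked, as a percentage."""
    a, b = set(spans_a), set(spans_b)
    union = a | b
    if not union:
        return 100.0  # neither annotator marked anything: trivially in agreement
    return 100.0 * len(a & b) / len(union)

# Hypothetical (start, end) character spans from two annotators.
annotator_1 = [(0, 7), (12, 16), (35, 39), (50, 58)]
annotator_2 = [(0, 7), (12, 16), (35, 39), (51, 58)]

score = percent_agreement(annotator_1, annotator_2)
print(f"agreement: {score:.1f}%")
```

In this toy case the annotators differ only in where one mention starts, yet under exact matching that single boundary disagreement counts as two unmatched spans.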

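Common use cases

Typical use case

CHEMDNER is a benchmark resource for biomedical text mining, and its typical use is chemical named entity recognition. With 84,355 manually annotated chemical entity mentions across 10,000 PubMed abstracts, it gives researchers a basis for building and evaluating supervised NER models. The annotations follow the SACEM classification, which covers types ranging from abbreviations and families to systematic names, so models can learn to recognize the varied terminology of the chemical literature.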
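
Supervised NER models are commonly trained on token-level tags rather than character offsets. The sketch below shows one way to turn span annotations into BIO tags, assuming whitespace tokenization and half-open `(start, end, class)` character spans. The sentence, the spans, the class labels, and the `bio_tags` helper are hypothetical, and a real pipeline would likely need a chemistry-aware tokenizer.

```python
import re

def bio_tags(text, entities):
    """Split text on whitespace and assign B-/I-/O tags from character spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity when their character ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                # B- if the token covers the entity's first character, else I-.
                prefix = "B-" if tok_start <= ent_start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Hypothetical sentence and (start, end, class) spans, invented for illustration.
text = "Patients received acetylsalicylic acid and NaCl daily."
entities = [(18, 38, "systematic"), (43, 47, "formula")]

tokens, tags = bio_tags(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token:<18} {tag}")
```

Overlap-based assignment keeps a token tagged even when punctuation is attached to it, at the cost of slightly imprecise boundaries.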

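Academic problems addressed

CHEMDNER addresses the long-standing lack of a common standard for evaluating chemical entity recognition. By providing a large, manually annotated corpus, it lets researchers compare different NER methods systematically on the same data. The reported annotator agreement of 91% gives a measure of how consistently chemical mentions can be tagged, which supports the reliability of the labels used for training and evaluation.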
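
Comparing systems against a gold standard usually comes down to precision, recall, and F1 over predicted mentions. The sketch below assumes exact span matching on `(doc_id, start, end)` triples. This page does not describe the official evaluation protocol, so the matching rule, the `span_prf` helper, and the example spans are illustrative assumptions.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over sets of mention spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical (doc_id, start, end) spans, invented for illustration.
gold = [("d1", 0, 7), ("d1", 12, 16), ("d2", 5, 9), ("d2", 20, 28)]
system_a = [("d1", 0, 7), ("d1", 12, 16), ("d2", 5, 9), ("d2", 30, 34)]
system_b = [("d1", 0, 7), ("d2", 5, 10)]

scores = {
    "A": span_prf(gold, system_a),
    "B": span_prf(gold, system_b),
}
for name, (p, r, f) in scores.items():
    print(f"system {name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Scoring every system with the same function on the same gold spans is what makes the comparison meaningful. System B's second prediction misses only by one character at the end, but exact matching still counts it as an error.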

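Derived and related work

CHEMDNER was the basis for the BioCreative IV CHEMDNER chemical mention recognition task, in which 26 teams took part and submitted a range of machine learning and hybrid systems. This work advanced chemical named entity recognition and supports downstream tasks such as chemical relation extraction and knowledge graph construction. The silver standard corpus extends the resource further for later research on chemical text mining.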

The content above was collected and summarized by 遇见数据集.
