jnlpba/jnlpba

Name: jnlpba/jnlpba
Creator: jnlpba
Published: 2024-01-18 11:07:08
License: No description available

Source: Hugging Face. Updated 2024-01-18; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/jnlpba/jnlpba

Resource description:

---
annotations_creators:
- expert-generated
language_creators:
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
size_categories:
- 10K<n<100K
source_datasets:
- extended|other-genia-v3.02
task_categories:
- token-classification
task_ids:
- named-entity-recognition
pretty_name: BioNLP / JNLPBA Shared Task 2004
dataset_info:
  features:
  - name: id
    dtype: string
  - name: tokens
    sequence: string
  - name: ner_tags
    sequence:
      class_label:
        names:
          '0': O
          '1': B-DNA
          '2': I-DNA
          '3': B-RNA
          '4': I-RNA
          '5': B-cell_line
          '6': I-cell_line
          '7': B-cell_type
          '8': I-cell_type
          '9': B-protein
          '10': I-protein
  config_name: jnlpba
  splits:
  - name: train
    num_bytes: 8775707
    num_examples: 18546
  - name: validation
    num_bytes: 1801565
    num_examples: 3856
  download_size: 3171072
  dataset_size: 10577272
---

# Dataset Card for JNLPBA

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004
- **Repository:** [Needs More Information]
- **Paper:** https://www.aclweb.org/anthology/W04-1213.pdf
- **Leaderboard:** https://paperswithcode.com/sota/named-entity-recognition-ner-on-jnlpba?p=biobert-a-pre-trained-biomedical-language
- **Point of Contact:** [Needs More Information]

### Dataset Summary

The data came from the GENIA version 3.02 corpus (Kim et al., 2003). The corpus was formed from a controlled search on MEDLINE using the MeSH terms "human", "blood cells", and "transcription factors". From this search, 2,000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes based on a chemical classification. Among these classes, 36 terminal classes were used to annotate the GENIA corpus.

### Supported Tasks and Leaderboards

NER

### Languages

English

## Dataset Structure

### Data Instances

    {
      'id': '1',
      'tokens': ['IL-2', 'gene', 'expression', 'and', 'NF-kappa', 'B', 'activation', 'through', 'CD28', 'requires', 'reactive', 'oxygen', 'production', 'by', '5-lipoxygenase', '.'],
      'ner_tags': [1, 2, 0, 0, 9, 10, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0],
    }

### Data Fields

- `id`: Sentence identifier.
- `tokens`: Array of tokens composing a sentence.
- `ner_tags`: Array of integer tag IDs, one per token, in BIO format. `0` is `O` (the token is not part of a bio-entity). The odd IDs `1`, `3`, `5`, `7`, and `9` are the `B-` tags that mark the first token of a DNA, RNA, cell_line, cell_type, or protein entity, and the even IDs `2`, `4`, `6`, `8`, and `10` are the matching `I-` tags for the subsequent tokens of that entity.

### Data Splits

Train samples: 18546

Validation samples: 3856

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

[Needs More Information]

## Considerations for Using the Data

### Social Impact of Dataset

[Needs More Information]

### Discussion of Biases

[Needs More Information]

### Other Known Limitations

[Needs More Information]

## Additional Information

### Dataset Curators

[Needs More Information]

### Licensing Information

[Needs More Information]

### Citation Information

    @inproceedings{collier-kim-2004-introduction,
        title = "Introduction to the Bio-entity Recognition Task at {JNLPBA}",
        author = "Collier, Nigel and Kim, Jin-Dong",
        booktitle = "Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications ({NLPBA}/{B}io{NLP})",
        month = aug # " 28th and 29th",
        year = "2004",
        address = "Geneva, Switzerland",
        publisher = "COLING",
        url = "https://aclanthology.org/W04-1213",
        pages = "73--78",
    }

### Contributions

Thanks to [@edugp](https://github.com/edugp) for adding this dataset.


Provider:

jnlpba

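Original information summary

Dataset overview

Dataset description

Name: BioNLP / JNLPBA Shared Task 2004
Language: English
License: unknown
Multilinguality: monolingual
Dataset size: 10K<n<100K
Source dataset: extended from GENIA v3.02
Task category: token classification
Task ID: named entity recognition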

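Dataset structure

Features

id: string; sentence identifier.
tokens: sequence of strings; the tokens that make up the sentence.
ner_tags: sequence of class labels, one per token, where:
- 0: O (not part of a bio-entity)
- 1: B-DNA (first token of a DNA entity)
- 2: I-DNA (subsequent token of a DNA entity)
- 3: B-RNA (first token of an RNA entity)
- 4: I-RNA (subsequent token of an RNA entity)
- 5: B-cell_line (first token of a cell line entity)
- 6: I-cell_line (subsequent token of a cell line entity)
- 7: B-cell_type (first token of a cell type entity)
- 8: I-cell_type (subsequent token of a cell type entity)
- 9: B-protein (first token of a protein entity)
- 10: I-protein (subsequent token of a protein entity)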
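The ID-to-label mapping above can be written down as a small lookup table. The following is a minimal sketch in Python; the label order is copied from the list above, while the names `LABELS`, `ID2LABEL`, `LABEL2ID`, `ids_to_labels`, and `entity_type` are hypothetical helpers introduced only for illustration and are not part of any library.

```python
# Label names in ID order, copied from the feature list above.
LABELS = [
    "O",
    "B-DNA", "I-DNA",
    "B-RNA", "I-RNA",
    "B-cell_line", "I-cell_line",
    "B-cell_type", "I-cell_type",
    "B-protein", "I-protein",
]

# Hypothetical lookup tables in both directions.
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def ids_to_labels(tag_ids):
    """Convert a sequence of integer tag IDs into their string labels."""
    return [ID2LABEL[i] for i in tag_ids]


def entity_type(label):
    """Return the entity type of a BIO label, or None for the O tag."""
    return None if label == "O" else label.split("-", 1)[1]


# Tag IDs of the example sentence shown in the dataset card.
sample_tags = [1, 2, 0, 0, 9, 10, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0]
decoded = ids_to_labels(sample_tags)
print(decoded[:6])
```

Keeping the labels in a single ordered list means the integer ID is simply the list index, so the two dictionaries cannot drift apart.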

Data splits

Training set: 18546 examples, 8775707 bytes
Validation set: 3856 examples, 1801565 bytes
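As a quick sanity check on these figures, the sketch below recomputes the totals and compares the byte total with the `dataset_size` value of 10577272 given in the card metadata. The dictionary layout and variable names are assumptions made for this illustration.

```python
# Split statistics copied from the list above.
splits = {
    "train": {"num_examples": 18546, "num_bytes": 8775707},
    "validation": {"num_examples": 3856, "num_bytes": 1801565},
}

# dataset_size as reported in the card metadata.
reported_dataset_size = 10577272

total_examples = sum(s["num_examples"] for s in splits.values())
total_bytes = sum(s["num_bytes"] for s in splits.values())

# Fraction of all examples that fall in the validation split.
validation_share = splits["validation"]["num_examples"] / total_examples

print(total_examples, total_bytes, round(validation_share, 3))
print(total_bytes == reported_dataset_size)
```

The byte counts of the two splits add up to the reported `dataset_size`, and the validation split holds a little over one sixth of the examples.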

Data instance

{"id": "1", "tokens": ["IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation", "through", "CD28", "requires", "reactive", "oxygen", "production", "by", "5-lipoxygenase", "."], "ner_tags": [1, 2, 0, 0, 9, 10, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0]}
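To make the instance easier to read, the tag IDs can be grouped into entity spans. The sketch below assumes the standard BIO reading of the tags (a `B-` tag opens an entity, matching `I-` tags extend it). The function name `extract_entities` is hypothetical, and treating a stray `I-` tag as the start of a new entity is a lenient choice made here, not something the dataset card specifies.

```python
# Label names in ID order, as listed in the dataset card.
LABELS = [
    "O", "B-DNA", "I-DNA", "B-RNA", "I-RNA",
    "B-cell_line", "I-cell_line", "B-cell_type", "I-cell_type",
    "B-protein", "I-protein",
]


def extract_entities(tokens, tag_ids):
    """Group BIO-tagged tokens into (type, start, end, text) spans.

    `end` is exclusive. A stray I- tag (no matching open entity) is
    treated as the start of a new entity; this is an assumption.
    """
    entities = []
    start, current = None, None
    for i, tag_id in enumerate(tag_ids):
        label = LABELS[tag_id]
        continues = label.startswith("I-") and current == label[2:]
        if continues:
            continue
        # Close the open entity, if any, before handling this token.
        if current is not None:
            entities.append((current, start, i, " ".join(tokens[start:i])))
            start, current = None, None
        if label != "O":
            start, current = i, label[2:]
    # Close an entity that runs to the end of the sentence.
    if current is not None:
        entities.append((current, start, len(tokens), " ".join(tokens[start:])))
    return entities


instance = {
    "id": "1",
    "tokens": ["IL-2", "gene", "expression", "and", "NF-kappa", "B",
               "activation", "through", "CD28", "requires", "reactive",
               "oxygen", "production", "by", "5-lipoxygenase", "."],
    "ner_tags": [1, 2, 0, 0, 9, 10, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0],
}

entities = extract_entities(instance["tokens"], instance["ner_tags"])
for entity in entities:
    print(entity)
```

For this sentence the sketch yields one DNA entity ("IL-2 gene") and three protein entities ("NF-kappa B", "CD28", "5-lipoxygenase").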

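Dataset creation

Data source

The data comes from the GENIA version 3.02 corpus, which was formed from a controlled search on MEDLINE using the MeSH terms "human", "blood cells", and "transcription factors". From this search, 2,000 abstracts were selected and hand-annotated according to a small taxonomy of 48 classes, of which 36 terminal classes were used to annotate the GENIA corpus.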

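Curated summary

Dataset introduction

Construction

The JNLPBA dataset is built on the GENIA version 3.02 corpus. That corpus was assembled through a controlled search of the MEDLINE database using the MeSH terms "human", "blood cells", and "transcription factors", from which 2,000 abstracts were selected. The abstracts were hand-annotated by experts according to a taxonomy of 48 classes, of which 36 terminal classes were used to annotate bio-entities in the GENIA corpus. The dataset metadata lists both the text and the annotations as expert-generated.

Characteristics

The JNLPBA dataset targets named entity recognition (NER) in the biomedical domain and covers five entity types: DNA, RNA, cell line, cell type, and protein. Annotations follow the BIO scheme, with a separate begin tag and inside tag for each entity type, so entity boundaries are explicit. The dataset is moderate in size, with 18,546 training examples and 3,856 validation examples.

Usage

The JNLPBA dataset is suited to biomedical named entity recognition. After loading the dataset, researchers can train and evaluate models on the `tokens` and `ner_tags` fields. The tags mark both the first token and the subsequent tokens of each entity, which lets a model learn entity boundaries as well as entity types. Combining biomedical domain knowledge with careful hyperparameter tuning is recommended to improve recognition accuracy.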
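Evaluation on the `ner_tags` field is commonly done at the entity level rather than per token. The sketch below computes precision, recall, and F1 over exact-match spans. The exact-match criterion and the function names `spans` and `entity_f1` are assumptions for illustration, since the text above does not specify an evaluation protocol.

```python
def spans(labels):
    """Return the set of (entity_type, start, end) spans in a BIO label list."""
    found = set()
    start, current = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        if label.startswith("I-") and current == label[2:]:
            continue  # the open entity continues
        if current is not None:
            found.add((current, start, i))
            start, current = None, None
        if label != "O":
            start, current = i, label[2:]
    return found


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall, and F1 over exact-match spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = spans(gold), spans(pred)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the prediction finds the DNA entity but misses the protein.
gold = [["B-DNA", "I-DNA", "O", "B-protein"]]
pred = [["B-DNA", "I-DNA", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, f1)  # precision 1.0, recall 0.5
```

Under exact matching, a predicted entity counts as correct only when its type and both boundaries agree with the gold annotation, so a partially overlapping prediction is scored as both a miss and a false positive.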

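Background and challenges

Background overview

The JNLPBA dataset, in full BioNLP / JNLPBA Shared Task 2004, was created in 2004 by Nigel Collier, Jin-Dong Kim, and colleagues. It is derived from the GENIA version 3.02 corpus, which was assembled by searching the MEDLINE database with specific MeSH terms ("human", "blood cells", and "transcription factors"); 2,000 abstracts were selected and hand-annotated. Its core research problem is bio-entity recognition, a form of named entity recognition (NER): by annotating entities such as DNA, RNA, cell lines, cell types, and proteins in biomedical text, it aims to advance biomedical natural language processing. The dataset serves both as a resource for biomedical text processing and as a basis for algorithm research and model training in related fields.

Current challenges

The JNLPBA dataset faces several challenges. First, biomedical text is complex and specialized, which makes entity recognition difficult and requires annotators with domain expertise. Second, the annotation relies on experts, which raises cost and may introduce subjective bias. Third, although the dataset falls in the 10K to 100K size range, it remains small relative to the demands of large-scale biomedical text processing. Finally, the license is unknown, which may restrict use in some research or commercial settings. These issues affect the dataset's usability and reliability and place additional demands on follow-up research.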

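Common scenarios

Classic use cases

In the biomedical domain, the JNLPBA dataset is used mainly for named entity recognition (NER). By annotating entities such as proteins, DNA, and RNA in biomedical text, it gives researchers a standardized benchmark for evaluating and improving NER models.

Derived related work

The JNLPBA dataset has supported a range of follow-up work, including more effective NER models and improved biomedical language processing techniques. For example, pre-trained models such as BioBERT have been fine-tuned and evaluated on this dataset, contributing to further progress in biomedical text processing.

Recent research on the dataset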