OTAR3088/BioNLP_Filtered
Source: Hugging Face. Updated 2026-04-22. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/OTAR3088/BioNLP_Filtered
Resource description:
The BioNLP2004 Filtered dataset is an adaptation of the original tner/bionlp2004 dataset, tailored to a project that focuses on CellLine and CellType entities. The original is a biomedical named entity recognition (NER) dataset annotated with five entity types: DNA, Protein, Cell_type, Cell_line, and RNA. The adapted version makes the following modifications:

1. **Entity Filtering**: Only CellLine and CellType entities are retained. All other original entity types (DNA, Protein, RNA) are converted to the O (Outside) tag.
2. **Nomenclature Change**: The retained entities are renamed to match the project-specific style. cell_line entities from the original dataset are now represented as CellLine, and cell_type entities as CellType.

The dataset consists of three splits: train, test, and validation. Each example contains tokens (a list of strings, one per token in the sentence) and tags (a list of strings giving the IOB tag for each token, limited to O, B-CellLine, I-CellLine, B-CellType, and I-CellType).
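The two modifications described above can be illustrated with a minimal Python sketch. This is not the maintainers' actual preprocessing script. It assumes the original tags are strings of the form `B-cell_line`, `I-DNA`, and so on; the `RENAME` table, the `filter_tags` helper, and the example sentence are hypothetical and exist only for illustration.

```python
# Hypothetical mapping from original entity names to the project-style names.
# Assumption: original tags look like "B-cell_line" / "I-cell_type".
RENAME = {"cell_line": "CellLine", "cell_type": "CellType"}


def filter_tags(tags):
    """Keep CellLine/CellType spans (renamed); map every other tag to "O"."""
    out = []
    for tag in tags:
        # Split "B-cell_line" into ("B", "-", "cell_line"); "O" has no separator.
        prefix, sep, entity = tag.partition("-")
        new_entity = RENAME.get(entity.lower())
        if sep and prefix in ("B", "I") and new_entity:
            out.append(f"{prefix}-{new_entity}")
        else:
            # DNA, Protein, RNA, and existing "O" tags all become "O".
            out.append("O")
    return out


# Made-up example in the tokens/tags layout described above.
example = {
    "tokens": ["IL-2", "gene", "in", "Jurkat", "cells", "and", "T", "lymphocytes"],
    "tags": ["B-DNA", "I-DNA", "O", "B-cell_line", "I-cell_line", "O",
             "B-cell_type", "I-cell_type"],
}

filtered = {"tokens": example["tokens"], "tags": filter_tags(example["tags"])}
print(filtered["tags"])
```

Because a whole entity type is either kept or dropped, every retained span still starts with a `B-` tag, so the output remains a valid IOB sequence without further repair.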
Provider:
OTAR3088



