OTAR3088/CeLLaTe_V2.0_no_vague
Hugging Face, updated 2026-03-25, indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/OTAR3088/CeLLaTe_V2.0_no_vague
Dataset description:
---
license: mit
task_categories:
- token-classification
task_ids:
- named-entity-recognition
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: entities
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: start
      dtype: int64
    - name: text
      dtype: string
  - name: data_source
    dtype: string
  splits:
  - name: train
    num_bytes: 1374511
    num_examples: 5451
  - name: validation
    num_bytes: 574060
    num_examples: 2613
  - name: test
    num_bytes: 475616
    num_examples: 2213
  download_size: 1053260
  dataset_size: 2424187
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---
---
# Updated Dataset for the CeLLaTe Model with Vague Entity Categories Filtered
## Overview
This dataset release presents the final experimental data split used for training and
evaluating the CeLLaTe NER model, targeting three core biomedical entity types:
- **CellLine**
- **CellType**
- **Tissue**
Collectively, these entities are referred to as **CeLLaTe**.
The dataset is designed for:
- Supervised biomedical Named Entity Recognition (NER)
- Cross-domain generalisation studies
- Active Learning (AL) experimentation
- Domain-adaptive pretraining evaluation
*Note*:
This version reflects a curated and structured split across heterogeneous biomedical domains.
The filtering of so-called "*vague*" entities applies exclusively to the datasets manually curated
in-house (Single-Cell, ChEMBL-V1, and ChEMBL-V2).
We define vague entities as terms that do not strictly satisfy the criteria of a well-defined
named entity but may exhibit entity-like characteristics depending on contextual usage.
Two versions of the dataset have been created: one retaining these vague entities and one excluding them (this release).
This design enables controlled experimentation to assess model behaviour, label sensitivity,
and robustness when such borderline entity mentions are included during training versus when the label space is restricted to strictly defined named entities.
This release is a revised version, filtered using our updated vague-entity dictionary.
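The card does not describe how the vague-entity dictionary is applied, so the following is only a minimal sketch of one plausible approach: dropping annotations whose surface text matches a dictionary entry. The function name `filter_vague`, the case-insensitive exact matching, the dictionary terms, and the example record are all assumptions made for illustration and are not taken from the dataset or its curation pipeline.

```python
def filter_vague(record, vague_terms):
    """Return a copy of the record without entities whose text is a vague term.

    Matching is case-insensitive and exact (an assumption, not the authors' rule).
    """
    vague = {term.casefold() for term in vague_terms}
    kept = [e for e in record["entities"] if e["text"].casefold() not in vague]
    return {**record, "entities": kept}


# Hypothetical dictionary and record, for illustration only.
vague_terms = {"cells", "tissue"}
record = {
    "sentence": "Cells from the tissue were compared with HEK293 cells.",
    "entities": [
        {"start": 0, "end": 5, "label": "CellType", "text": "Cells"},
        {"start": 15, "end": 21, "label": "Tissue", "text": "tissue"},
        {"start": 41, "end": 47, "label": "CellLine", "text": "HEK293"},
    ],
    "data_source": "example",
}

filtered = filter_vague(record, vague_terms)
print([e["text"] for e in filtered["entities"]])
```

Because the function returns a new record rather than mutating its input, the same source data could be used to build both the vague-retaining and the vague-excluding variants.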
## Dataset Schema
Each split contains the following fields:
- sentence: a single sentence extracted from a biomedical article
- entities: a list of entity annotations associated with the sentence
- data_source: the originating corpus or article collection from which the sentence was derived
Annotations are provided at the sentence level to facilitate downstream NER training, evaluation, and AL-driven re-annotation workflows.
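As a minimal sketch of how a record with this schema might be consumed, the snippet below checks that each annotation's offsets recover its `text` and converts the spans into token-level BIO tags. It assumes that `start` and `end` are character offsets into `sentence` with an exclusive `end`, which the card does not state. The example record, the label strings, and the whitespace tokenisation are illustrative assumptions only.

```python
import re


def spans_are_consistent(record):
    """Return True if sentence[start:end] equals the stored text for every entity."""
    sentence = record["sentence"]
    return all(sentence[e["start"]:e["end"]] == e["text"] for e in record["entities"])


def to_bio(record):
    """Convert character-level spans to BIO tags over whitespace-delimited tokens."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", record["sentence"]):
        tok_start, tok_end = match.span()
        tag = "O"
        for ent in record["entities"]:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                prefix = "B" if tok_start <= ent["start"] else "I"
                tag = f"{prefix}-{ent['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Hypothetical record, not taken from the dataset.
example = {
    "sentence": "HeLa cells were compared with primary hepatocytes from liver.",
    "entities": [
        {"start": 0, "end": 4, "label": "CellLine", "text": "HeLa"},
        {"start": 38, "end": 49, "label": "CellType", "text": "hepatocytes"},
        {"start": 55, "end": 60, "label": "Tissue", "text": "liver"},
    ],
    "data_source": "example",
}

tokens, tags = to_bio(example)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Real records could be obtained with `datasets.load_dataset("OTAR3088/CeLLaTe_V2.0_no_vague")`. Running `spans_are_consistent` on a sample first is a cheap way to confirm or refute the offset assumption before training.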
## Data Sources and Domain Composition
The dataset integrates articles from three complementary biomedical domains, each contributing distinct entity distributions:
### 1. Single-Cell (SC) Transcriptomics Literature
- High prevalence of CellType and Tissue entities
- Rich terminology diversity
- Manually curated to reflect downstream project use cases
### 2. ChEMBL Assay Descriptions
- Enriched in CellLine mentions
- Derived from assay-centric biomedical literature
- Available in two versions (V1 and V2)
### 3. Stem Cell Research (CellFinder)
- Contains all three entity types
- Particularly rich in CellType mentions
- Historically curated dataset with expert annotations (dating back more than 10 years)
This multi-domain composition allows the evaluation of:
- Cross-domain robustness
- Entity distribution shifts
- Label imbalance behaviour
- Domain adaptation strategies
## Stem Cell Article Source (CellFinder)
Stem cell–related articles were obtained from the CellFinder repository. The original dataset and annotation methodology are described in:
>Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser (2012)
>Annotating and evaluating text for stem cell research.
>In Proceedings Third Workshop on Building and Evaluation Resources for Biomedical Text Mining (BioTxtM 2012),
>Language Resources and Evaluation (LREC) 2012.
The CellFinder corpus provides historically curated annotations across multiple stem-cell–related entity types.
## ChEMBL Data Source: Versioning
### ChEMBL-V1 (Originally Silver Standard)
ChEMBL-V1 was initially constructed as a silver-standard corpus using the following pipeline:
- A curated dictionary was assembled by combining:
  - Internal ChEMBL assay descriptions
  - IntAct-curated CellLine terminology
- Articles were retrieved from Europe PMC based on dictionary term occurrences.
- Retrieved texts were automatically annotated using a machine learning model trained specifically for CellLine recognition.
This dataset was later manually reviewed and corrected to improve annotation fidelity.
### ChEMBL-V2 (Gold Standard)
ChEMBL-V2 is a fully gold-standard corpus comprising 12 manually curated and expert-annotated full-text biomedical articles.
Article selection was guided by:
- High-frequency CellLine coverage: Prioritising commonly occurring CellLine entities.
- Journal diversity: Sampling across heterogeneous biomedical journals to reduce source bias and increase generalisability.
Additional details on the curation protocol are available here:
https://huggingface.co/datasets/OTAR3088/CellTissue-manual_testset
## Splitting Strategy
### Design Principles
The split was constructed to:
- Preserve domain heterogeneity
- Maintain representation of all three CeLLaTe entity types
- Establish a stable benchmark set
- Prevent data leakage across article-level boundaries
Splitting was performed at the article level, ensuring no sentence overlap across splits.
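The idea of article-level splitting can be sketched as follows: whole articles, rather than individual sentences, are assigned to a split, so sentences from one article never end up on both sides of a train/evaluation boundary. This is an illustration under stated assumptions, not the authors' procedure. The splits in this release were assigned per corpus by design rather than by random shuffling, and the `article_id` field, the `split_by_article` function, and the fractions are hypothetical (the released schema exposes `data_source` but no article identifier).

```python
import random
from collections import defaultdict


def split_by_article(sentences, fractions=(0.5, 0.25, 0.25), seed=0):
    """Assign whole articles to train/validation/test so no article spans two splits."""
    article_ids = sorted({s["article_id"] for s in sentences})
    random.Random(seed).shuffle(article_ids)

    n_train = round(fractions[0] * len(article_ids))
    n_val = round(fractions[1] * len(article_ids))

    # Decide the split once per article, then route every sentence accordingly.
    assignment = {}
    for i, article_id in enumerate(article_ids):
        if i < n_train:
            assignment[article_id] = "train"
        elif i < n_train + n_val:
            assignment[article_id] = "validation"
        else:
            assignment[article_id] = "test"

    splits = defaultdict(list)
    for s in sentences:
        splits[assignment[s["article_id"]]].append(s)
    return dict(splits)


# Toy corpus: 12 hypothetical articles with 3 sentences each.
corpus = [
    {"article_id": f"article-{a}", "sentence": f"Sentence {i} of article {a}."}
    for a in range(12)
    for i in range(3)
]
splits = split_by_article(corpus)
print({name: len(rows) for name, rows in splits.items()})
```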
#### CellFinder (10 Articles)
- 5 articles -> Training
- 5 articles -> Test (Benchmark)
**Rationale**:
- The dataset is more than a decade old.
- It is the only source containing a robust representation of all three entity types.
- The test portion serves as a stable benchmark reflecting legacy biomedical terminology.
#### Single-Cell Corpus (12 Articles)
- 9 -> Training
- 3 -> Validation
**Rationale**:
- Richest source of diverse CellType and Tissue terminology.
- Carefully curated to reflect project-specific downstream applications.
- Used heavily in training to strengthen representation learning.
#### ChEMBL-V1 (8 Articles)
- 4 -> Training
- 2 -> Validation
- 2 -> Test
Primarily enriched in CellLine entities.
#### ChEMBL-V2 (12 Articles)
- 6 -> Training
- 3 -> Validation
- 3 -> Test
Contains all three entity types, with CellLine mentions predominating.
## Final Split Composition
### Training Set
- 9 Single-Cell
- 5 CellFinder
- 4 ChEMBL-V1
- 6 ChEMBL-V2
### Validation Set
- 3 Single-Cell
- 2 ChEMBL-V1
- 3 ChEMBL-V2
### Test / Benchmark Set
- 5 CellFinder
- 2 ChEMBL-V1
- 3 ChEMBL-V2
The test set is intended to function as the official benchmark for CeLLaTe evaluation.
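The composition above can be tallied with a short script. The article counts are copied from the lists in this section and the sentence counts (`num_examples`) from the card metadata; the variable names are illustrative.

```python
from collections import Counter

# Articles per split and source, as listed under "Final Split Composition".
composition = {
    "train": {"Single-Cell": 9, "CellFinder": 5, "ChEMBL-V1": 4, "ChEMBL-V2": 6},
    "validation": {"Single-Cell": 3, "ChEMBL-V1": 2, "ChEMBL-V2": 3},
    "test": {"CellFinder": 5, "ChEMBL-V1": 2, "ChEMBL-V2": 3},
}

articles_per_split = {split: sum(counts.values()) for split, counts in composition.items()}

articles_per_source = Counter()
for counts in composition.values():
    articles_per_source.update(counts)

# Sentence counts per split, from the dataset_info metadata.
num_examples = {"train": 5451, "validation": 2613, "test": 2213}
total_examples = sum(num_examples.values())
share_percent = {
    split: round(100 * n / total_examples, 1) for split, n in num_examples.items()
}

print(articles_per_split)         # {'train': 24, 'validation': 8, 'test': 10}
print(dict(articles_per_source))  # Single-Cell 12, CellFinder 10, ChEMBL-V1 8, ChEMBL-V2 12
print(total_examples, share_percent)
```

By sentence count, the training split holds roughly 53% of the 10,277 examples, with about 25% in validation and 22% in test.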
## Intended Use
### Primary Use
- Supervised biomedical NER model training
- Evaluation of biomedical domain adaptation strategies
- Active Learning experimentation
- Cross-domain robustness analysis
### Not Intended For
- Clinical or diagnostic decision-making
- Direct patient-level inference
- Biomedical knowledge base construction without further validation
## Limitations and Considerations
- This split reflects experimental design choices and may evolve in future releases.
- Entity frequency distributions do not necessarily reflect real-world biomedical prevalence.
- Domain imbalance is intentional to support robustness evaluation.
- Some terminology reflects historical naming conventions (particularly in CellFinder).
- Annotation density varies across domains by design.
## Reproducibility Notes
- Splitting was performed at the article level.
- No article appears in more than one split.
- Entity boundaries were manually verified in gold-standard subsets.
- Vague or underspecified entity mentions were filtered prior to release.
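One way to spot-check the no-overlap property is to look for sentence strings that occur in more than one split. The sketch below uses exact string matching on the `sentence` field, which is only a proxy: an article-level check would need article identifiers, which the released schema does not expose, and generic sentences could legitimately recur across articles. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def sentences_in_multiple_splits(splits):
    """Map each sentence found in more than one split to the splits containing it."""
    seen = defaultdict(set)
    for split_name, records in splits.items():
        for record in records:
            seen[record["sentence"]].add(split_name)
    return {s: sorted(names) for s, names in seen.items() if len(names) > 1}


# Toy splits for illustration; one sentence is deliberately duplicated.
toy_splits = {
    "train": [{"sentence": "HeLa cells were cultured."}, {"sentence": "Liver tissue was fixed."}],
    "validation": [{"sentence": "T cells were sorted."}],
    "test": [{"sentence": "Liver tissue was fixed."}],
}

print(sentences_in_multiple_splits(toy_splits))
```

An empty result on the real splits would be consistent with, but would not prove, the absence of article-level leakage.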
Provider:
OTAR3088



