OTAR3088/CeLLaTe_V2.0_no_vague

Name: OTAR3088/CeLLaTe_V2.0_no_vague
Creator: OTAR3088
Published: 2026-03-25 15:48:13
License: MIT

Source: Hugging Face. Updated: 2026-03-25. Indexed: 2026-03-29.

Download link:

https://hf-mirror.com/datasets/OTAR3088/CeLLaTe_V2.0_no_vague

Description:

---
license: mit
task_categories:
- token-classification
task_ids:
- named-entity-recognition
dataset_info:
  features:
  - name: sentence
    dtype: string
  - name: entities
    list:
    - name: end
      dtype: int64
    - name: label
      dtype: string
    - name: start
      dtype: int64
    - name: text
      dtype: string
  - name: data_source
    dtype: string
  splits:
  - name: train
    num_bytes: 1374511
    num_examples: 5451
  - name: validation
    num_bytes: 574060
    num_examples: 2613
  - name: test
    num_bytes: 475616
    num_examples: 2213
  download_size: 1053260
  dataset_size: 2424187
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
  - split: test
    path: data/test-*
---

# Updated Dataset for CeLLaTe Model with Vague Entity Categories Filtered

## Overview

This dataset release presents the final experimental data split used for training and evaluating the CeLLaTe NER model, targeting three core biomedical entity types:

- **CellLine**
- **CellType**
- **Tissue**

Collectively, these entities are referred to as **CeLLaTe**.

The dataset is designed for:

- Supervised biomedical Named Entity Recognition (NER)
- Cross-domain generalisation studies
- Active Learning (AL) experimentation
- Domain-adaptive pretraining evaluation

*Note*: This version reflects a curated and structured split across heterogeneous biomedical domains.

The filtering of so-called "*vague*" entities applies exclusively to the datasets manually curated in-house (Single-Cell, ChEMBL-V1, and ChEMBL-V2). We define vague entities as terms that do not strictly satisfy the criteria of a well-defined named entity but may exhibit entity-like characteristics depending on contextual usage. Two versions of the dataset have been created: one retaining these vague entities and one excluding them (this release).
This design enables controlled experimentation to assess model behaviour, label sensitivity, and robustness when such borderline entity mentions are included during training versus when the label space is restricted to strictly defined named entities. This release was regenerated with our updated vague-entity dictionary.

## Dataset Schema

Each split contains the following fields:

- sentence: a single sentence extracted from a biomedical article
- entities: a list of entity annotations associated with the sentence
- data_source: the originating corpus or article collection from which the sentence was derived

Annotations are provided at the sentence level to facilitate downstream NER training, evaluation, and AL-driven re-annotation workflows.

## Data Sources and Domain Composition

The dataset integrates articles from three complementary biomedical domains, each contributing distinct entity distributions:

### 1. Single-Cell (SC) Transcriptomics Literature

- High prevalence of CellType and Tissue entities
- Rich terminology diversity
- Manually curated to reflect downstream project use cases

### 2. ChEMBL Assay Descriptions

- Enriched in CellLine mentions
- Derived from assay-centric biomedical literature
- Available in two versions (V1 and V2)

### 3. Stem Cell Research (CellFinder)

- Contains all three entity types
- Particularly rich in CellType mentions
- Historically curated dataset with expert annotations (dating back more than 10 years)

This multi-domain composition allows the evaluation of:

- Cross-domain robustness
- Entity distribution shifts
- Label imbalance behaviour
- Domain adaptation strategies

## Stem Cell Article Source (CellFinder)

Stem cell–related articles were obtained from the CellFinder repository. The original dataset and annotation methodology are described in:

> Mariana Neves, Alexander Damaschun, Andreas Kurtz, Ulf Leser (2012)
> Annotating and evaluating text for stem cell research.
> In Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012),
> Language Resources and Evaluation (LREC) 2012.

The CellFinder corpus provides historically curated annotations across multiple stem-cell–related entity types.

## ChEMBL Data Source: Versioning

### ChEMBL-V1 (Originally Silver Standard)

ChEMBL-V1 was initially constructed as a silver-standard corpus using the following pipeline:

- A curated dictionary was assembled by combining:
  - Internal ChEMBL assay descriptions
  - IntAct-curated CellLine terminology
- Articles were retrieved from Europe PMC based on dictionary term occurrences.
- Retrieved texts were automatically annotated using a machine learning model trained specifically for CellLine recognition.

This dataset was later manually reviewed and corrected to improve annotation fidelity.

### ChEMBL-V2 (Gold Standard)

ChEMBL-V2 is a fully gold-standard corpus comprising 12 manually curated and expert-annotated full-text biomedical articles.

Article selection was guided by:

- High-frequency CellLine coverage: prioritising commonly occurring CellLine entities.
- Journal diversity: sampling across heterogeneous biomedical journals to reduce source bias and increase generalisability.

Additional details on the curation protocol are available here: https://huggingface.co/datasets/OTAR3088/CellTissue-manual_testset

## Splitting Strategy

### Design Principles

The split was constructed to:

- Preserve domain heterogeneity
- Maintain representation of all three CeLLaTe entity types
- Establish a stable benchmark set
- Prevent data leakage across article-level boundaries

Splitting was performed at the article level, ensuring no sentence overlap across splits.

#### CellFinder (10 Articles)

- 5 articles -> Training
- 5 articles -> Test (Benchmark)

**Rationale**:

- The dataset is more than a decade old.
- It is the only source containing robust representation of all three entity types.
- The test portion serves as a stable benchmark reflecting legacy biomedical terminology.

#### Single-Cell Corpus (12 Articles)

- 9 -> Training
- 3 -> Validation

**Rationale**:

- Richest source of diverse CellType and Tissue terminology.
- Carefully curated to reflect project-specific downstream applications.
- Used heavily in training to strengthen representation learning.

#### ChEMBL-V1 (10 Articles)

- 4 -> Training
- 2 -> Validation
- 2 -> Test

Primarily enriched in CellLine entities.

#### ChEMBL-V2 (12 Articles)

- 6 -> Training
- 3 -> Validation
- 3 -> Test

Contains all three entities, with CellLine predominance.

## Final Split Composition

### Training Set

- 9 Single-Cell
- 5 CellFinder
- 4 ChEMBL-V1
- 6 ChEMBL-V2

### Validation Set

- 3 Single-Cell
- 2 ChEMBL-V1
- 3 ChEMBL-V2

### Test / Benchmark Set

- 5 CellFinder
- 2 ChEMBL-V1
- 3 ChEMBL-V2

The test set is intended to function as the official benchmark for CeLLaTe evaluation.

## Intended Use

### Primary Use

- Supervised biomedical NER model training
- Evaluation of biomedical domain adaptation strategies
- Active Learning experimentation
- Cross-domain robustness analysis

### Not Intended For

- Clinical or diagnostic decision-making
- Direct patient-level inference
- Biomedical knowledge base construction without further validation

## Limitations and Considerations

- This split reflects experimental design choices and may evolve in future releases.
- Entity frequency distributions do not necessarily reflect real-world biomedical prevalence.
- Domain imbalance is intentional to support robustness evaluation.
- Some terminology reflects historical naming conventions (particularly in CellFinder).
- Annotation density varies across domains by design.

## Reproducibility Notes

- Splitting was performed at the article level.
- No article appears in more than one split.
- Entity boundaries were manually verified in gold-standard subsets.
- Vague or underspecified entity mentions were filtered prior to release.
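
The `sentence` and `entities` fields lend themselves to a conversion into token-level BIO tags for NER training. The following is a minimal sketch of one way to do that, not the authors' preprocessing. It assumes that `start` and `end` are character offsets into `sentence` with `end` exclusive (the card does not state this, so the code checks it against `text`), and it uses a simple regex tokenizer chosen only for illustration. The example record, the `to_bio` helper, and the `TOKEN_RE` pattern are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical record shaped like the card's schema. The sentence, offsets,
# and data_source value are invented for illustration, not taken from the data.
record = {
    "sentence": "HeLa cells were compared with hepatocytes from liver.",
    "entities": [
        {"start": 0, "end": 4, "label": "CellLine", "text": "HeLa"},
        {"start": 30, "end": 41, "label": "CellType", "text": "hepatocytes"},
        {"start": 47, "end": 52, "label": "Tissue", "text": "liver"},
    ],
    "data_source": "example",
}

# Illustrative tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def to_bio(sentence: str, entities: List[Dict]) -> Tuple[List[str], List[str]]:
    """Convert character-span annotations into tokens and BIO tags.

    Assumes `end` is an exclusive character offset; raises ValueError when
    the span does not reproduce the annotated `text`.
    """
    spans = [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(sentence)]
    tags = ["O"] * len(spans)
    for ent in sorted(entities, key=lambda e: e["start"]):
        if sentence[ent["start"]:ent["end"]] != ent["text"]:
            raise ValueError(f"offsets do not match text for entity {ent!r}")
        first = True
        for i, (_, tok_start, tok_end) in enumerate(spans):
            # A token belongs to the entity if the two character ranges overlap.
            if tok_start < ent["end"] and tok_end > ent["start"]:
                tags[i] = ("B-" if first else "I-") + ent["label"]
                first = False
    return [tok for tok, _, _ in spans], tags


tokens, tags = to_bio(record["sentence"], record["entities"])
for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

With the Hugging Face `datasets` library installed, real records could presumably be obtained via `datasets.load_dataset("OTAR3088/CeLLaTe_V2.0_no_vague")` and passed to the same helper. If the offset check raises on real data, the exclusive-end assumption should be revisited before training.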

Provided by:

OTAR3088
