IIC/cantemist-ner
Source: Hugging Face. Updated 2026-02-06; indexed 2026-02-07.
Download link:
https://hf-mirror.com/datasets/IIC/cantemist-ner
Description:
CANTEMIST is a manually annotated collection of Spanish-language oncology clinical case reports, intended primarily for named entity recognition (NER). It contains 1,301 case reports in which clinical experts manually annotated tumor morphology mentions and mapped them to a controlled terminology: each mention is linked to an eCIE-O code (the Spanish equivalent of ICD-O). The training subset contains 501 documents, the development subset 500, and the test subset 300. The original dataset is distributed in Brat format. It was created to support the development of Spanish medical language models and is managed by the Text Mining Unit (TeMU) at the Barcelona Supercomputing Center. The dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
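Since the original corpus is distributed in Brat format, a minimal sketch of reading entity annotations from a Brat standoff `.ann` file may help. It assumes the usual Brat convention of tab-separated text-bound lines (`ID<TAB>LABEL START END<TAB>TEXT`). The helper `parse_brat_entities`, the sample sentence, the label `MORFOLOGIA_NEOPLASIA`, and the note line carrying a code are illustrative assumptions, not taken from the listing above.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Collect text-bound annotations (IDs starting with 'T') from Brat standoff text."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip annotator notes ('#'), relations, and blank lines
        ann_id, type_and_span, mention = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "s1 e1;s2 e2"; keep only the outer bounds.
        fragments = [fragment.split() for fragment in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ann_id, label, start, end, mention))
    return entities


# Hypothetical sample document and annotation, for illustration only.
sample_text = "Se confirma carcinoma ductal infiltrante de mama."
sample_ann = (
    "T1\tMORFOLOGIA_NEOPLASIA 12 40\tcarcinoma ductal infiltrante\n"
    "#1\tAnnotatorNotes T1\t8500/3\n"
)

for entity in parse_brat_entities(sample_ann):
    # The offsets index into the raw document text.
    print(entity.label, repr(sample_text[entity.start:entity.end]))
```

The sketch keeps character offsets rather than tokens, so converting to token-level NER tags (for example BIO tags) would be a separate step that depends on the tokenizer chosen.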
Provider:
IIC



