bigbio/codiesp
Source: Hugging Face. Updated 2022-12-22; indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/bigbio/codiesp
Resource description:
---
language:
- es
bigbio_language:
- Spanish
license: cc-by-4.0
multilinguality: monolingual
bigbio_license_shortname: CC_BY_4p0
pretty_name: CodiEsp
homepage: https://temu.bsc.es/codiesp/
bigbio_pubmed: False
bigbio_public: True
bigbio_tasks:
- TEXT_CLASSIFICATION
- NAMED_ENTITY_RECOGNITION
- NAMED_ENTITY_DISAMBIGUATION
---
# Dataset Card for CodiEsp
## Dataset Description
- **Homepage:** https://temu.bsc.es/codiesp/
- **Pubmed:** False
- **Public:** True
- **Tasks:** TXTCLASS,NER,NED
CodiEsp is a synthetic corpus of 1,000 manually selected clinical case studies
in Spanish. It was designed for the Clinical Case Coding in Spanish Shared Task,
held as part of the CLEF 2020 conference.
The goal of the task was to automatically assign ICD10 codes (CIE-10, in
Spanish) to clinical case documents, with the predictions evaluated against
manually generated ICD10 codings. The CodiEsp corpus was selected manually by
practicing physicians and clinical documentalists and annotated by clinical
coding professionals meeting strict quality criteria. The annotators reached an
inter-annotator agreement of 88.6% for diagnosis coding, 88.9% for procedure
coding and 80.5% for the textual reference annotation.
The final collection of 1,000 clinical cases that make up the corpus contains a
total of 16,504 sentences and 396,988 words. All documents are in Spanish, and
CIE10 is the coding terminology (the Spanish version of ICD10-CM and ICD10-PCS).
The CodiEsp corpus has been randomly sampled into three subsets. The train set
contains 500 clinical cases, while the development and test sets have 250
clinical cases each. In addition to these, a collection of 176,294 abstracts
from Lilacs and Ibecs with the corresponding ICD10 codes (ICD10-CM and
ICD10-PCS) was provided by the task organizers. Every abstract has at least one
associated code, with an average of 2.5 ICD10 codes per abstract.
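The 500/250/250 partition described above can be illustrated with a short Python sketch. The card does not describe the organizers' actual sampling procedure, so the function name `split_corpus`, the fixed seed, and the document IDs below are all assumptions made for illustration only.

```python
import random


def split_corpus(doc_ids, n_train=500, n_dev=250, seed=0):
    """Shuffle document IDs and cut them into train/dev/test lists.

    The seed is an assumption added for reproducibility; the original
    sampling procedure is not documented in the dataset card.
    """
    ids = list(doc_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # whatever remains after train and dev
    return train, dev, test


# Hypothetical document IDs standing in for the 1,000 clinical cases.
doc_ids = [f"case_{i:04d}" for i in range(1000)]
train, dev, test = split_corpus(doc_ids)
print(len(train), len(dev), len(test))
```

Shuffling once and slicing keeps the three subsets disjoint by construction, which is the property that matters when the test set is held out for evaluation.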
The CodiEsp track was divided into three sub-tracks (2 main and 1 exploratory):
- CodiEsp-D: The Diagnosis Coding sub-task, which requires automatic ICD10-CM
[CIE10-Diagnóstico] code assignment.
- CodiEsp-P: The Procedure Coding sub-task, which requires automatic ICD10-PCS
[CIE10-Procedimiento] code assignment.
- CodiEsp-X: The Explainable AI exploratory sub-task, which requires submitting
the text reference for each predicted code (both ICD10-CM and ICD10-PCS). The
goal of this novel task was not only to predict the correct codes but also to
present the reference in the text that supports the code predictions.
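To make the difference between plain code assignment (CodiEsp-D, CodiEsp-P) and explainable coding (CodiEsp-X) concrete, the sketch below compares predictions against gold annotations for a single document. It assumes set-based precision, recall and F1, which is an illustrative choice and not necessarily the metric used by the task organizers. The codes, character offsets, and the helper name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 over hashable annotation items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Code assignment: each item is just a code (hypothetical values).
gold_codes = {"n39.0", "r50.9", "k35.80"}
pred_codes = {"n39.0", "r50.9", "j18.9"}
code_scores = precision_recall_f1(pred_codes, gold_codes)

# Explainable coding: each item is (code, start_offset, end_offset), so a
# correct code with the wrong supporting text span does not count as a match.
gold_refs = {("n39.0", 120, 138), ("r50.9", 45, 51)}
pred_refs = {("n39.0", 120, 138), ("r50.9", 40, 51)}
ref_scores = precision_recall_f1(pred_refs, gold_refs)

print(code_scores, ref_scores)
```

Under this assumption the same scoring function serves both settings; only the shape of the compared items changes, which is why the explainable variant is stricter.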
For further information, please visit https://temu.bsc.es/codiesp or send an
email to encargo-pln-life@bsc.es
## Citation Information
```
@article{miranda2020overview,
title={Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020.},
author={Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estap{\'e}, Jordi and Krallinger, Martin},
journal={CLEF (Working Notes)},
volume={2020},
year={2020}
}
```
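Provided by:
bigbio
Summary of original information
Dataset overview
Basic information
- Name: CodiEsp
- Language: Spanish
- License: CC-BY-4.0
- Multilinguality: monolingual
- Tasks: text classification, named entity recognition, named entity disambiguation
Dataset description
- Source: a synthetic corpus of 1,000 clinical cases manually selected by practicing physicians and clinical documentalists, designed for the Clinical Case Coding in Spanish Shared Task as part of the CLEF 2020 conference.
- Goal: automatically assign ICD10 codes (CIE-10, in Spanish) to clinical case documents, evaluated against manually generated ICD10 codings.
- Size: 16,504 sentences and 396,988 words in total.
- Splits: the train set contains 500 clinical cases; the development and test sets contain 250 clinical cases each.
- Additional data: 176,294 abstracts from Lilacs and Ibecs with their corresponding ICD10 codes. Every abstract has at least one associated code, with an average of 2.5 ICD10 codes per abstract.
Sub-tasks
- CodiEsp-D: the diagnosis coding sub-task, which requires automatic ICD10-CM [CIE10-Diagnóstico] code assignment.
- CodiEsp-P: the procedure coding sub-task, which requires automatic ICD10-PCS [CIE10-Procedimiento] code assignment.
- CodiEsp-X: the explainable AI exploratory sub-task, which requires submitting the text reference for each predicted code (both ICD10-CM and ICD10-PCS).
Contact
- Email: encargo-pln-life@bsc.es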



