bigbio/codiesp

Name: bigbio/codiesp
Creator: bigbio
Published: 2022-12-22 15:44:28
License: CC-BY-4.0

Source: Hugging Face | Updated: 2022-12-22 | Indexed: 2024-03-04

Download link:

https://hf-mirror.com/datasets/bigbio/codiesp

Resource description:

---
language:
- es
bigbio_language:
- Spanish
license: cc-by-4.0
multilinguality: monolingual
bigbio_license_shortname: CC_BY_4p0
pretty_name: CodiEsp
homepage: https://temu.bsc.es/codiesp/
bigbio_pubmed: False
bigbio_public: True
bigbio_tasks:
- TEXT_CLASSIFICATION
- NAMED_ENTITY_RECOGNITION
- NAMED_ENTITY_DISAMBIGUATION
---

# Dataset Card for CodiEsp

## Dataset Description

- **Homepage:** https://temu.bsc.es/codiesp/
- **Pubmed:** False
- **Public:** True
- **Tasks:** TXTCLASS,NER,NED

Synthetic corpus of 1,000 manually selected clinical case studies in Spanish that was designed for the Clinical Case Coding in Spanish Shared Task, as part of the CLEF 2020 conference.

The goal of the task was to automatically assign ICD10 codes (CIE-10, in Spanish) to clinical case documents, with predictions evaluated against manually generated ICD10 codifications. The CodiEsp corpus was selected manually by practicing physicians and clinical documentalists and annotated by clinical coding professionals meeting strict quality criteria. They reached an inter-annotator agreement of 88.6% for diagnosis coding, 88.9% for procedure coding and 80.5% for the textual reference annotation.

The final collection of 1,000 clinical cases that make up the corpus had a total of 16,504 sentences and 396,988 words. All documents are in Spanish and CIE10 is the coding terminology (the Spanish version of ICD10-CM and ICD10-PCS). The CodiEsp corpus has been randomly sampled into three subsets. The train set contains 500 clinical cases, while the development and test sets have 250 clinical cases each. In addition to these, a collection of 176,294 abstracts from Lilacs and Ibecs with the corresponding ICD10 codes (ICD10-CM and ICD10-PCS) was provided by the task organizers. Every abstract has at least one associated code, with an average of 2.5 ICD10 codes per abstract.

The CodiEsp track was divided into three sub-tracks (2 main and 1 exploratory):

- CodiEsp-D: The Diagnosis Coding sub-task, which requires automatic ICD10-CM [CIE10-Diagnóstico] code assignment.
- CodiEsp-P: The Procedure Coding sub-task, which requires automatic ICD10-PCS [CIE10-Procedimiento] code assignment.
- CodiEsp-X: The Explainable AI exploratory sub-task, which requires submitting the reference to the predicted codes (both ICD10-CM and ICD10-PCS). The goal of this novel task was not only to predict the correct codes but also to present the reference in the text that supports the code predictions.

For further information, please visit https://temu.bsc.es/codiesp or send an email to encargo-pln-life@bsc.es

## Citation Information

```
@article{miranda2020overview,
  title={Overview of Automatic Clinical Coding: Annotations, Guidelines, and Solutions for non-English Clinical Cases at CodiEsp Track of CLEF eHealth 2020.},
  author={Miranda-Escalada, Antonio and Gonzalez-Agirre, Aitor and Armengol-Estap{\'e}, Jordi and Krallinger, Martin},
  journal={CLEF (Working Notes)},
  volume={2020},
  year={2020}
}
```

Provider:

bigbio

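Summary of original information

Dataset overview

Basic information

Name: CodiEsp
Language: Spanish
License: CC-BY-4.0
Multilinguality: monolingual
Tasks: text classification, named entity recognition, named entity disambiguation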

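Dataset description

Source: A synthetic corpus of 1,000 clinical cases manually selected by practicing physicians and clinical documentalists, designed for the Clinical Case Coding in Spanish Shared Task, part of the CLEF 2020 conference.
Goal: Automatically assign ICD10 codes (CIE-10, in Spanish) to clinical case documents, evaluated against manually generated ICD10 codifications.
Size: 16,504 sentences and 396,988 words in total.
Splits: The training set contains 500 clinical cases; the development and test sets contain 250 clinical cases each.
Additional data: 176,294 abstracts from Lilacs and Ibecs with their corresponding ICD10 codes. Every abstract has at least one associated code, with an average of 2.5 ICD10 codes per abstract.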
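
The corpus figures above imply a few simple ratios that are handy when planning experiments. A minimal sketch in Python, using only the numbers stated in the description (the variable names are illustrative and not part of the dataset):

```python
# Figures taken from the dataset description.
SPLITS = {"train": 500, "development": 250, "test": 250}
TOTAL_SENTENCES = 16_504
TOTAL_WORDS = 396_988

total_cases = sum(SPLITS.values())

# Share of the corpus held by each split.
split_share = {name: count / total_cases for name, count in SPLITS.items()}

# Average document size, derived from the corpus-wide totals.
sentences_per_case = TOTAL_SENTENCES / total_cases
words_per_case = TOTAL_WORDS / total_cases
words_per_sentence = TOTAL_WORDS / TOTAL_SENTENCES

if __name__ == "__main__":
    for name, share in split_share.items():
        print(f"{name}: {share:.0%} of {total_cases} cases")
    print(f"{sentences_per_case:.1f} sentences per case")
    print(f"{words_per_case:.1f} words per case")
    print(f"{words_per_sentence:.1f} words per sentence")
```

In other words, the split is 50/25/25, and a typical case runs to roughly 16.5 sentences and about 397 words. These are corpus-wide averages; the description does not give per-split figures.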

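Sub-tasks

CodiEsp-D: Diagnosis coding sub-task, which requires automatic assignment of ICD10-CM [CIE10-Diagnóstico] codes.
CodiEsp-P: Procedure coding sub-task, which requires automatic assignment of ICD10-PCS [CIE10-Procedimiento] codes.
CodiEsp-X: Explainable AI exploratory sub-task, which requires submitting the text reference that supports each predicted code (both ICD10-CM and ICD10-PCS).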
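
To make the difference between the three sub-tracks concrete, here is a minimal Python sketch of how predictions for one document might be represented and scored. Everything in it is an assumption made for illustration: the `Prediction` class, the `precision_recall_f1` and `codes` helpers, the character-offset representation of the text reference, and the sample codes and offsets are all hypothetical. The set-based scoring is a generic choice and is not claimed to be the official evaluation metric of the shared task.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Prediction:
    """One predicted code for a document (hypothetical structure)."""
    code: str                               # ICD10-CM or ICD10-PCS code
    kind: str                               # "D" (diagnosis) or "P" (procedure)
    span: Optional[Tuple[int, int]] = None  # supporting text offsets (CodiEsp-X)


def precision_recall_f1(predicted: Set, gold: Set) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1; returns zeros when undefined."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1


def codes(preds: Set[Prediction], kind: str) -> Set[str]:
    """Codes of one kind, ignoring spans (CodiEsp-D / CodiEsp-P view)."""
    return {p.code for p in preds if p.kind == kind}


# Made-up gold and predicted annotations for a single document.
gold = {
    Prediction("n20.0", "D", (10, 25)),
    Prediction("r10.9", "D", (40, 55)),
    Prediction("bt40zzz", "P", (70, 90)),
}
predicted = {
    Prediction("n20.0", "D", (10, 25)),    # matching code and span
    Prediction("r10.9", "D", (42, 50)),    # matching code, different span
    Prediction("0t9b7zz", "P", (70, 90)),  # different code
}

diagnosis_scores = precision_recall_f1(codes(predicted, "D"), codes(gold, "D"))
procedure_scores = precision_recall_f1(codes(predicted, "P"), codes(gold, "P"))
# CodiEsp-X view: a prediction counts only if code and span both match.
explainable_scores = precision_recall_f1(predicted, gold)
```

With this toy data, the diagnosis view scores 1.0 on all three measures, the procedure view scores 0.0, and the explainable view drops to one third because only one prediction matches on both code and span. Exact span matching is the strictest possible criterion; a real evaluation could reasonably use a more lenient overlap rule.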

Contact

Email: encargo-pln-life@bsc.es
