Biomedical-TeMU/CodiEsp_corpus

Name: Biomedical-TeMU/CodiEsp_corpus
Creator: Biomedical-TeMU
Published: 2022-03-11 02:24:53
License: cc-by-4.0 (as declared in the dataset card)

Hugging Face, updated 2022-03-11, indexed 2024-03-04

Download link:

https://hf-mirror.com/datasets/Biomedical-TeMU/CodiEsp_corpus

Resource description:

---
license: cc-by-4.0
---

## Introduction

These are the train, development, test and background sets of the CodiEsp corpus. Train and development have gold-standard annotations. The unannotated background and test sets are distributed together. All documents are released in the context of the CodiEsp track for CLEF eHealth 2020 (http://temu.bsc.es/codiesp/).

The CodiEsp corpus contains manually coded clinical cases. All documents are in Spanish, and CIE10 is the coding terminology (it is the Spanish version of ICD10-CM and ICD10-PCS).

The CodiEsp corpus has been randomly sampled into three subsets: the train, the development, and the test set. The train set contains 500 clinical cases, and the development and test sets contain 250 clinical cases each. The test set is released together with the background set (2751 clinical cases). CodiEsp participants must submit predictions for both the test and background sets, but they are evaluated only on the test set.

## Structure

Three folders: train, dev and test. Each one contains the files for the train, development and test corpora, respectively.

+ The train and dev folders have:
  + 3 tab-separated files with the annotation information relevant for each of the 3 sub-tracks of CodiEsp.
  + A subfolder named text_files with the plain text files of the clinical cases.
  + A subfolder named text_files_en with the plain text files machine-translated to English. Due to the translation process, these text files are sentence-split.
+ The test folder has only the text_files and text_files_en subfolders with the plain text files.

## Corpus format description

The CodiEsp corpus is distributed in plain text in UTF-8 encoding, where each clinical case is stored as a single file whose name is the clinical case identifier. Annotations are released in tab-separated files. Since the CodiEsp track has 3 sub-tracks, every annotated set of documents (train and development) has 3 tab-separated files associated with it.

For the sub-tracks CodiEsp-D and CodiEsp-P, the file has the following fields:

`articleID`, `ICD10-code`

Tab-separated files for the sub-track CodiEsp-X contain extra fields that provide the text reference and its position:

`articleID`, `label`, `ICD10-code`, `text-reference`, `reference-position`

## Corpus summary statistics

The final collection of 1000 clinical cases that make up the corpus has a total of 16504 sentences, with an average of 16.5 sentences per clinical case. It contains a total of 396,988 words, with an average of 396.2 words per clinical case.

For more information, visit the track webpage: http://temu.bsc.es/codiesp/

Provider:

Biomedical-TeMU

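Summary of original information

Dataset overview

Dataset name

CodiEsp corpus

Dataset content

Language: Spanish
Coding terminology: CIE10 (the Spanish version of ICD10-CM and ICD10-PCS)
Data type: manually coded clinical cases
Dataset composition:
- Train set: 500 clinical cases
- Development set: 250 clinical cases
- Test set: 250 clinical cases
- Background set: 2751 clinical cases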

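Dataset structure

Folders: train, dev, test
Folder contents:
- Train and development sets:
  - 3 tab-separated files with annotation information
  - Text folder (text_files): plain text files of the clinical cases
  - English text folder (text_files_en): plain text files machine-translated to English, sentence-split
- Test set:
  - Text folder (text_files)
  - English text folder (text_files_en)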
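
As an illustration of the layout above, the following is a minimal sketch for reading one split into memory. The function name `load_cases` and the local path are assumptions made for this example, not part of the corpus distribution; the only facts taken from the description are the `text_files` and `text_files_en` subfolder names, the UTF-8 encoding, and the convention that each file is named after its clinical case identifier.

```python
from pathlib import Path


def load_cases(split_dir, english=False):
    """Return {case_id: text} for one split folder (train, dev or test).

    `load_cases` is a hypothetical helper written for this sketch.
    """
    # text_files holds the Spanish originals, text_files_en the machine translations.
    subfolder = "text_files_en" if english else "text_files"
    cases = {}
    for path in sorted(Path(split_dir, subfolder).iterdir()):
        if path.is_file():
            # The file name, minus any extension, is the clinical case identifier.
            cases[path.stem] = path.read_text(encoding="utf-8")
    return cases
```

Assuming the corpus has been downloaded to a local folder, `load_cases("CodiEsp_corpus/dev")` would return the Spanish development cases and `load_cases("CodiEsp_corpus/dev", english=True)` their English translations.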

Dataset format

Text encoding: UTF-8
File naming: clinical case identifier
Annotation file format: tab-separated
Annotation file fields:
- CodiEsp-D and CodiEsp-P: articleID, ICD10-code
- CodiEsp-X: articleID, label, ICD10-code, text-reference, reference-position
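
The field lists above can be read with the standard `csv` module. This is a sketch under stated assumptions: the files are assumed to have no header row, the helper names `read_codes` and `read_explanations` are invented for this example, and `reference-position` is kept as a raw string because its exact format is not described here.

```python
import csv
from collections import defaultdict

# Column order for CodiEsp-X files, as listed in the corpus description.
X_FIELDS = ["articleID", "label", "ICD10-code", "text-reference", "reference-position"]


def read_codes(path):
    """Read a CodiEsp-D or CodiEsp-P file into {articleID: set of codes}."""
    codes = defaultdict(set)
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the text as literal characters.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:  # skip blank or malformed lines
                codes[row[0]].add(row[1])
    return dict(codes)


def read_explanations(path):
    """Read a CodiEsp-X file into a list of dicts keyed by X_FIELDS."""
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # reference-position stays a raw string (format not assumed here).
        return [dict(zip(X_FIELDS, row)) for row in reader if len(row) == len(X_FIELDS)]
```

Grouping codes per document in `read_codes` reflects that a clinical case can carry several codes; if a header row turns out to be present, it would need to be skipped before using either helper.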

Dataset statistics

Total sentences: 16504
Average sentences per case: 16.5
Total words: 396,988
Average words per case: 396.2
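
Per-case averages like those above are totals divided by the number of clinical cases (for example, 16504 sentences over 1000 cases gives about 16.5). The sketch below recomputes word totals and averages from raw texts. It assumes whitespace tokenization, which may differ from the counting method behind the published figures, so its results are not expected to match them exactly; `word_stats` is a hypothetical helper.

```python
def word_stats(texts):
    """Return (total_words, average_words_per_case) for an iterable of texts.

    Words are approximated by whitespace-separated tokens (an assumption).
    """
    counts = [len(text.split()) for text in texts]
    total = sum(counts)
    # Guard against an empty collection to avoid dividing by zero.
    average = total / len(counts) if counts else 0.0
    return total, average
```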
