CLARA-MeD/claramed3800

Source: Hugging Face. Updated 2024-04-02; indexed 2024-04-21.
Download link:
https://hf-mirror.com/datasets/CLARA-MeD/claramed3800
Resource description:
---
license: cc-by-nc-4.0
---

# Dataset Card for CLARA-MeD-3800

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
- [Dataset Creation](#dataset-creation)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [https://clara-nlp.uned.es/home/med/](https://clara-nlp.uned.es/home/med/)
- **Repositories:** [https://github.com/lcampillos/CLARA-MeD](https://github.com/lcampillos/CLARA-MeD), [https://digital.csic.es/handle/10261/269887](https://digital.csic.es/handle/10261/269887)
- **Paper:** [http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6439](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6439)
- **DOI:** [https://doi.org/10.20350/digitalCSIC/14644](https://doi.org/10.20350/digitalCSIC/14644)
- **Point of Contact:** [Leonardo Campillos-Llanos](mailto:leonardo.campillos@csic.es)

### Dataset Summary

A parallel corpus comprising a subset of 3800 sentence pairs of professional and lay variants (149,862 tokens), intended as a benchmark for medical text simplification. The dataset was collected in the CLARA-MeD project, which aims to simplify medical texts in Spanish and to reduce the language barrier to patients' informed decision-making.

### Supported Tasks and Leaderboards

Medical text simplification

### Languages

Spanish

## Dataset Structure

### Data Instances

Each instance consists of a string for the source text (professional version) and a string for the target text (simplified version).

```
{'SOURCE': 'adenocarcinoma ductal de páncreas',
 'TARGET': 'Cáncer de páncreas'}
```

### Data Fields

- `SOURCE`: a string containing the professional version.
- `TARGET`: a string containing the simplified version.

## Dataset Creation

### Source Data

#### Who are the source language producers?

1. Drug leaflets and summaries of product characteristics from [CIMA](https://cima.aemps.es)
2. Cancer-related information summaries from the [National Cancer Institute](https://www.cancer.gov/)
3. Clinical trial announcements from [EudraCT](https://www.clinicaltrialsregister.eu/)

### Annotations

#### Annotation process

Semi-automatic alignment of technical and patient versions of medical sentences. Inter-annotator agreement was measured with Cohen's kappa (average kappa = 0.839 ± 0.076; very high agreement).

#### Who are the annotators?

- Leonardo Campillos-Llanos
- Adrián Capllonch-Carrión
- Ana Rosa Terroba-Reinares
- Ana Valverde-Mateos
- Sofía Zakhir-Puig

### Personal and Sensitive Information

No personal or sensitive information was used.

### Licensing Information

These data are intended for research and educational purposes and are released under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

### Citation Information

Campillos Llanos, L., Terroba Reinares, A. R., Zakhir Puig, S., Valverde, A., & Capllonch-Carrión, A. (2022). Building a comparable corpus and a benchmark for Spanish medical text simplification. *Procesamiento del Lenguaje Natural*, 69, pp. 189--196.

```
@article{2022claramedcorpus,
  title={Building a comparable corpus and a benchmark for Spanish medical text simplification},
  author={Campillos-Llanos, Leonardo and Terroba Reinares, Ana R. and Zakhir Puig, Sofía and Valverde-Mateos, Ana and Capllonch-Carri{\'o}n, Adri{\'a}n},
  journal={Procesamiento del Lenguaje Natural},
  volume={69},
  year={2022},
  pages={189--196},
  publisher={Sociedad Espa{\~n}ola para el Procesamiento del Lenguaje Natural}
}
```

### Contributions

Thanks to [Jónathan Heras from Universidad de La Rioja](http://www.unirioja.es/cu/joheras) ([@joheras](https://github.com/joheras)) for formatting this dataset for Hugging Face.
Provider:
CLARA-MeD
Summary of Original Information

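Dataset Overview

Dataset Name

CLARA-MeD-3800

Dataset Summary

CLARA-MeD-3800 is a parallel corpus of 3800 pairs of professional and lay (simplified) versions of Spanish medical sentences. The dataset is intended to support the simplification of medical texts and to reduce the language barrier patients face when reading medical information.

Supported Tasks

Medical text simplification

Languages

Spanish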

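Dataset Structure

Data Instances

Each instance contains two fields:

  • SOURCE: a string containing the professional version.
  • TARGET: a string containing the simplified version.

Data Fields

  • SOURCE: the professional version of the medical text.
  • TARGET: the simplified version of the medical text.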
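
To make the two-field record layout concrete, here is a minimal Python sketch built on the sample instance from the dataset card. The helper names (`is_valid_pair`, `token_ratio`, `load_claramed`) are hypothetical and not part of the dataset or any library. The loader assumes that the Hugging Face `datasets` package is installed, that network access is available, and that the repository ID matches the download URL above; it is defined but not called here. Whitespace tokenization is also an assumption, so the counts need not match the token figures reported by the authors.

```python
from typing import Dict

# Sample instance copied from the dataset card.
SAMPLE: Dict[str, str] = {
    "SOURCE": "adenocarcinoma ductal de páncreas",
    "TARGET": "Cáncer de páncreas",
}

REQUIRED_FIELDS = ("SOURCE", "TARGET")


def is_valid_pair(record: Dict[str, str]) -> bool:
    """Return True if both fields are present and are non-empty strings."""
    return all(
        isinstance(record.get(name), str) and bool(record[name].strip())
        for name in REQUIRED_FIELDS
    )


def token_ratio(record: Dict[str, str]) -> float:
    """Whitespace-token count of TARGET divided by that of SOURCE."""
    source_tokens = record["SOURCE"].split()
    target_tokens = record["TARGET"].split()
    return len(target_tokens) / len(source_tokens)


def load_claramed():
    """Hypothetical loader (assumption: `datasets` is installed, network is up).

    The repository ID is inferred from the download URL; the available
    split names are not stated in the card, so none is selected here.
    """
    from datasets import load_dataset  # imported lazily on purpose

    return load_dataset("CLARA-MeD/claramed3800")


if __name__ == "__main__":
    print(is_valid_pair(SAMPLE))
    print(token_ratio(SAMPLE))
```

A ratio below 1 means the simplified sentence is shorter than the professional one by this crude measure; it says nothing about readability on its own.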

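Dataset Creation

Source Data

  • Sources include drug leaflets, summaries of product characteristics, cancer-related information summaries, and clinical trial announcements.

Annotation Process

  • Technical and patient versions of medical sentences were aligned semi-automatically.
  • Inter-annotator agreement was measured with Cohen's kappa; the average kappa was 0.839.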
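
For readers unfamiliar with the agreement statistic mentioned above, the following is a minimal sketch of Cohen's kappa for two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The labels below are invented for illustration and are not taken from the corpus; the card does not state which categories the annotators judged, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[lab] * counts_b[lab] for lab in all_labels) / (n * n)

    if p_expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented labels (assumption): 10 candidate pairs judged by two annotators.
ANNOTATOR_1 = ["ok", "ok", "no", "ok", "no", "no", "ok", "ok", "no", "ok"]
ANNOTATOR_2 = ["ok", "ok", "no", "no", "no", "no", "ok", "ok", "no", "ok"]

if __name__ == "__main__":
    print(round(cohen_kappa(ANNOTATOR_1, ANNOTATOR_2), 3))
```

In this toy case the annotators agree on 9 of 10 items (p_o = 0.9) and chance agreement is 0.5, giving κ = 0.8. An average kappa such as the reported 0.839 would presumably be obtained by averaging values like this across annotator pairs or batches, though the card does not specify the procedure.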

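Annotators

  • Leonardo Campillos-Llanos
  • Adrián Capllonch-Carrión
  • Ana Rosa Terroba-Reinares
  • Ana Valverde-Mateos
  • Sofía Zakhir-Puig

Personal and Sensitive Information

  • The dataset contains no personal or sensitive information.

Licensing Information

  • The dataset is released under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.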

Citation Information

@article{2022claramedcorpus,
  title={Building a comparable corpus and a benchmark for Spanish medical text simplification},
  author={Campillos-Llanos, Leonardo and Terroba Reinares, Ana R. and Zakhir Puig, Sofía and Valverde-Mateos, Ana and Capllonch-Carri{\'o}n, Adri{\'a}n},
  journal={Procesamiento del Lenguaje Natural},
  volume={69},
  year={2022},
  pages={189--196},
  publisher={Sociedad Espa{\~n}ola para el Procesamiento del Lenguaje Natural}
}
