CLARA-MeD corpus

Source: NIAID Data Ecosystem (indexed 2026-05-01)

Download link:

https://zenodo.org/records/10926161

Resource description:

1) A collection of 24 298 pairs of professional and simplified texts (>96 million tokens):
   - Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words).
   - Cancer-related information summaries (201 pairs of texts, >3M tokens).
   - Clinical trial announcements (5748 pairs of texts, 451 690 tokens).
   The files were last downloaded in February 2022.

2) 5000 parallel (technical/lay) sentence pairs to be used as a benchmark for medical text simplification. There are 2 subsets:
   - 3800 parallel sentences (149 862 tokens), semi-automatically aligned and revised by linguists.
   - 1200 parallel sentences (144 019 tokens), manually simplified by linguists.

If you use this resource, please cite as follows:

a) For the comparable corpus and the 3800 sentences: Campillos-Llanos, L., A. R. Terroba-Reinares, S. Zakhir Puig, A. Valverde-Mateos and A. Capllonch-Carrión (2022) "Building a comparable corpus and a benchmark for Spanish medical text simplification". Procesamiento del Lenguaje Natural 69, 189-196.

b) For the 1200 sentences: Campillos-Llanos, L., R. Bartolomé-Rodríguez and A. R. Terroba-Reinares (2024) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del Lenguaje Natural 72, 31-43.

Created:

2024-04-04
