CLARA-MeD corpus
Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/records/10926161
Resource description:
1) A collection of 24 298 pairs of professional and simplified texts (>96 million tokens):
Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words).
Cancer-related information summaries (201 pairs of texts, >3M tokens).
Clinical trials announcements (5748 pairs of texts, 451 690 tokens).
The files were last downloaded in February 2022.
2) 5000 parallel (technical/lay) sentence pairs to be used as a benchmark for medical text simplification. There are 2 subsets:
3800 parallel sentences (149 862 tokens) semi-automatically aligned and revised by linguists.
1200 parallel sentences (144 019 tokens) manually simplified by linguists.
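The subset figures above can be sanity-checked with a few lines of Python. This is only a minimal sketch: the dictionary keys and function names are illustrative and not part of the corpus distribution, and it assumes each reported token count covers both sides of the sentence pairs, which the description does not state.

```python
# Figures copied from the description above; all names here are
# hypothetical and chosen only for illustration.
BENCHMARK_SUBSETS = {
    "semi_automatic": {"pairs": 3800, "tokens": 149_862},
    "manual": {"pairs": 1200, "tokens": 144_019},
}


def total_pairs(subsets):
    """Sum the number of sentence pairs over all subsets."""
    return sum(s["pairs"] for s in subsets.values())


def tokens_per_pair(subsets):
    """Average token count per sentence pair for each subset.

    Assumption: the token counts include both the technical and the
    simplified side of each pair.
    """
    return {name: s["tokens"] / s["pairs"] for name, s in subsets.items()}


if __name__ == "__main__":
    print("total pairs:", total_pairs(BENCHMARK_SUBSETS))
    for name, avg in tokens_per_pair(BENCHMARK_SUBSETS).items():
        print(f"{name}: {avg:.1f} tokens per pair")
```

The two subsets add up to the 5000 pairs stated for the benchmark, and the manually simplified subset has noticeably more tokens per pair than the semi-automatically aligned one.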
If you use this resource, please cite as follows:
a) For the comparable corpus and the 3800 sentences:
Campillos-Llanos, L., A. R. Terroba-Reinares, S. Zakhir Puig, A. Valverde-Mateos and A. Capllonch-Carrión (2022) "Building a comparable corpus and a benchmark for Spanish medical text simplification". Procesamiento del Lenguaje Natural 69, 189-196.
b) For the 1200 sentences:
Campillos-Llanos, L., R. Bartolomé-Rodríguez and A. R. Terroba-Reinares (2024) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del Lenguaje Natural 72, 31-43.
Created:
2024-04-04



