CLARA-MeD corpus

Source: NIAID Data Ecosystem (indexed 2026-05-01)
Download link:
https://zenodo.org/records/10926161
Resource description:
1) A collection of 24 298 pairs of professional and simplified texts (>96 million tokens):
   - Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words).
   - Cancer-related information summaries (201 pairs of texts, >3M tokens).
   - Clinical trial announcements (5748 pairs of texts, 451 690 tokens).
   The files were last downloaded in February 2022.

2) 5000 parallel (technical/layman) sentence pairs to be used as a benchmark for medical text simplification. There are 2 subsets:
   - 3800 parallel sentences (149 862 tokens), semi-automatically aligned and revised by linguists.
   - 1200 parallel sentences (144 019 tokens), manually simplified by linguists.

If you use this resource, please cite as follows:

a) For the comparable corpus and the 3800 sentences:
   Campillos-Llanos, L., A. R. Terroba-Reinares, S. Zakhir Puig, A. Valverde-Mateos and A. Capllonch-Carrión (2022) "Building a comparable corpus and a benchmark for Spanish medical text simplification". Procesamiento del Lenguaje Natural 69, 189-196.

b) For the 1200 sentences:
   Campillos-Llanos, L., R. Bartolomé-Rodríguez and A. R. Terroba-Reinares (2024) "Enhancing the understanding of clinical trials with a sentence-level simplification dataset". Procesamiento del Lenguaje Natural 72, 31-43.
Created:
2024-04-04