Diagnostic Interview Corpus - Translation
Source: DataCite Commons. Updated: 2026-05-05. Indexed: 2026-05-06.
Download link:
https://yareta.unige.ch/archives/03497afb-3903-4d13-bf76-2d361dd3d117
Description:
The Diagnostic Interview Corpus is a multilingual dataset of 12,754 French medical consultation sentences (questions and instructions) with translations into 12 languages and associated UMLS-based semantic glosses. It supports research on low-resource medical machine translation, semantic representation, and pictograph generation.
Languages
- Source: French
- Targets (in translations.csv): Albanian, Modern Standard Arabic, Tunisian Arabic, Moroccan Arabic, Algerian Arabic, Dari (Afghan Persian), Farsi (Iranian Persian), Russian, English, Spanish, Tigrinya, Ukrainian
- Semantic gloss (in translations.csv): French sentences aligned with UMLS glosses (concept sequences + functional tokens).
- Paraphrases (in paraphrases.csv): French paraphrases aligned with the corresponding French source sentences, generated through a grammar-based approach to ensure controlled syntactic variation
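As a minimal sketch of how sentence pairs might be pulled out of translations.csv, the snippet below reads CSV rows and keeps those with a non-empty source and target. The column names (`french`, `english`, `umls_gloss`) and the inline sample row are assumptions made for illustration only; the actual header of the published file should be checked before use.

```python
import csv
import io

# Hypothetical header and row: the real column names in translations.csv may differ.
SAMPLE = (
    "french,english,umls_gloss\n"
    '"Avez-vous des nausées ou des vomissements ?",'
    '"Do you have nausea or vomiting?",'
    '"You | Nausea | or – article | Vomiting | Question"\n'
)


def load_pairs(handle, source_col, target_col):
    """Return (source, target) pairs, skipping rows where either side is empty."""
    reader = csv.DictReader(handle)
    pairs = []
    for row in reader:
        # A missing column yields None, which is treated as an empty string.
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if source and target:
            pairs.append((source, target))
    return pairs


# In practice the handle would come from open("translations.csv", encoding="utf-8", newline="").
pairs = load_pairs(io.StringIO(SAMPLE), "french", "english")
print(pairs[0])
```

Skipping rows with an empty target is a design choice for this sketch: it assumes that not every sentence necessarily has a translation in every target language, which the description does not state either way.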
Domains and registers
- Medical consultations
- Questions and instructions (e.g., symptom checks, treatment directives)
- Categories by body region (e.g., head, chest, abdomen)
Features
- Parallel multilingual translations created and adapted with clinical experts
- Semantic gloss layer (UMLS CUIs + functional tokens) for pictograph generation
- Patient-centered simplifications and cultural adaptations to improve comprehension
Example
French: Avez-vous des nausées ou des vomissements ?
English: Do you have nausea or vomiting?
UMLS gloss: You | Nausea | or – article | Vomiting | Question
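The gloss in the example above can be split into its individual tokens with a few lines of Python. This is only a sketch: it assumes that " | " is the token separator, which is inferred from this single example, and `parse_gloss` is a hypothetical helper name. The example shows readable labels; if the file stores UMLS CUIs instead, the same splitting would apply to those identifiers.

```python
def parse_gloss(gloss):
    """Split a gloss string on '|' and trim whitespace around each token."""
    # Assumption: '|' never occurs inside a token.
    return [token.strip() for token in gloss.split("|") if token.strip()]


tokens = parse_gloss("You | Nausea | or – article | Vomiting | Question")
print(tokens)
# ['You', 'Nausea', 'or – article', 'Vomiting', 'Question']
```

Each resulting token could then be looked up in a concept-to-pictograph mapping, with functional tokens such as "Question" handled separately.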
Intended Use
- Low-resource multilingual MT research
- Semantic representation learning (UMLS-based)
- Pictograph translation systems for patients with limited health literacy
- Evaluation of medical-domain MT beyond surface-level accuracy
Acknowledgements
This corpus was developed in the context of the BabelDr and PictoDr projects at the University of Geneva in collaboration with Geneva University Hospitals. This work is part of the PROPICTO project, funded by the Swiss National Science Foundation (N°197864) and the French National Research Agency (ANR-20-CE93-0005). This project also received funding from the "Fondation Privée des Hôpitaux Universitaires de Genève".
Provider:
Université de Genève, Yareta
Created:
2025-10-17



