MedDialog-FR: a French Version of the MedDialog Corpus for Multi-label Classification and Response Generation related to Women's Intimate Health

Indexed by NIAID Data Ecosystem on 2026-05-02

Download link:

https://zenodo.org/record/10889880

Description:

Contributors: Xingyu Liu, Vincent Segonne, Aidan Mannion, Didier Schwab, Lorraine Goeuriot, François Portet

Total Number of Single-Turn Dialogues: 16,149 dialogues on women's intimate health, 7,120 dialogues on general medicine

Given the lack of French dialogue corpora for data-driven dialogue systems and the paucity of available information related to women's intimate health, MedDialog-FR is an annotated corpus of question-and-answer sessions between a patient and a doctor concerning women's intimate health. The corpus is composed of about 20,000 sessions automatically translated from MedDialog-EN, the English version of MedDialog. The corpus test set is composed of 1,400 sessions that have been manually post-edited and annotated with 22 categories from the UMLS ontology.

Overview of the dataset

To construct the French MedDialog Dataset (MedDialog-FR), we first extracted from MedDialog-EN, and automatically translated, a total of 16,149 dialogues related to women's intimate health and an additional 7,120 dialogues related to general medicine. MedDialog-EN is composed of textual single-turn dialogues: a medical question by a patient and a response by a physician. From the translated dialogues, we randomly selected 900 dialogues on women's intimate health and 500 dialogues on general medicine to be post-edited. We then performed multi-label annotation on the 900 questions taken from the post-edited dialogues on women's intimate health. In total, 1,286 labels were annotated, with 1.43 labels per instance on average.
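As a quick consistency check, the totals quoted in the overview can be recomputed from the per-domain counts. The snippet below is only an arithmetic sketch; the variable names are illustrative and not part of the dataset.

```python
# Counts as reported in the dataset overview.
translated = {"women": 16_149, "general": 7_120}
post_edited = {"women": 900, "general": 500}
annotated_questions = 900
annotated_labels = 1_286

total_translated = sum(translated.values())    # all machine-translated dialogues
total_post_edited = sum(post_edited.values())  # all manually post-edited dialogues
labels_per_question = annotated_labels / annotated_questions

print(total_translated, total_post_edited, round(labels_per_question, 2))
# prints: 23269 1400 1.43
```

The 1,400 post-edited sessions match the size quoted for the manually post-edited set, and 1,286 / 900 gives the reported 1.43 labels per question.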
Summary statistics of the dataset:

| Task | Women | General |
|---|---|---|
| Machine translation (#dialogs) | 16,149 | 7,120 |
| Post-editing (#dialogs) | 900 | 500 |
| Multi-label annotation (#questions) | 900 | - |

Structure of the dataset

The dataset contains the following elements, separated into the general medicine domain (MedDialog-FR-general) and the women's intimate health domain (MedDialog-FR-women):

```
├── MedDialog-FR-general/
├──── machine_translation/meddialog-fr-general_machine_translation.csv
├──── post-editing/meddialog-fr-general_post-editing.csv
├── MedDialog-FR-women/
├──── machine_translation/meddialog-fr-women_machine_translation.csv
├──── post-editing/meddialog-fr-women_post-editing.csv
├──── multilabel_annotation/dataset_multilabel_meddialog_22labels.csv
├──── response_generation/dataset_response_generation_meddialog.csv
```

All the .csv files contain a column named id, which identifies the original file of *MedDialog-EN* and the session id within that file. For example, hm3_96_q and hm3_96_a refer to the session with id 96 in the healthcaremagic3 file. The suffixes '_q' and '_a' indicate the question and the answer, respectively.

Machine translation

The .csv file contains 3 columns: id, en and fr
- en: original question and answer in English
- fr: translated question and answer in French

Example lines:

hm4_1121_q \t J'ai 52 ans, mes dernières règles remontent au 6 décembre, je pensais que c'était peut-être le début de la ménopause. J'ai fait un test d'urine pour la grossesse, qui s'est révélé positif, puis j'ai fait un test quantitatif de hcg 45343 (je suis infirmière et je l'ai fait au laboratoire de l'hôpital où je travaille). J'ai des crampes et des saignements (bruns) depuis 2 à 3 mois.

hm4_1121_a \t Bonjour, j'ai compris votre préoccupation. Comme vous avez mentionné que le taux de bêta HCG est plus élevé, je vous suggère de faire une échographie. Cela confirmera l'âge gestationnel et la viabilité de la grossesse.
Si vous tenez à poursuivre la grossesse, veuillez discuter des risques encourus avec votre gynécologue traitant. Vous pouvez également opter pour une interruption de grossesse avec des médicaments en toute sécurité jusqu'à 9 semaines de grossesse sous surveillance médicale. J'espère que cette réponse vous aidera.

Post-editing

The .csv file contains 3 columns: id, machine_translation and post-edited

Example line:

hm4_3334_q \t bonjour docteur je suis atteinte de pcos, je me suis mariée en novembre 2011.nous essayons d'avoir une grossesse depuis deux mois ... et ma question est comment savoir la sévérité du pcos et quel est le meilleur moment pour concv \t bonjour docteur, je suis atteinte de SOPK, je me suis mariée en novembre 2011. Nous essayons de concevoir depuis deux mois ... et ma question est comment savoir la sévérité du SOPK et quel est le meilleur moment pour concevoir.

Multi-label annotation

The .csv file contains 4 columns: id, source_file, labels and split.
- source_file: the source file in which the text to classify can be looked up by id
- labels: UMLS IDs representing expert-validated labels for classification
- split: train, dev or test

Example line:

hm1_33568_q \t '../post-editing/meddialog-fr-women_post-editing.csv' \t ['C0700589', 'C0227791'] \t train

Partitioning: We split the MedDialog-FR-women multi-label dataset into a training set of 500 instances, a validation set of 100 instances and a test set of 300 instances. The ratio was chosen to maximize the amount of fine-tuning data while keeping the test set large enough for results to be statistically significant, given the scarcity of some categories. To keep the label distribution consistent across splits, we used the iterative stratification algorithm during data splitting. The split statistics are summarized in the following table.
| Split | #Questions |
|---|---|
| Train | 500 |
| Validation | 100 |
| Test | 300 |

Response generation

The .csv file contains 3 columns: id, split and source_file
- split: train, dev or test
- source_file: the source file in which the text for response generation can be looked up by id

Partitioning: The validation and test data contain the same session IDs as the multi-label validation and test sets, but they also include the corresponding answers. For the training set, we use the same sessions as in the multi-label dataset plus the machine-translated sessions.

| Split | #Dialogues |
|---|---|
| Train | 15,749 |
| Validation | 100 |
| Test | 300 |

Corpus data cleaning

By examining the MedDialog-EN corpus, we identified data that could potentially leak personal information such as first and last names, email addresses, URLs, etc. To safeguard privacy, we carried out a series of data cleaning procedures, chiefly anonymization:

1. replace URLs with #URL# (regex pattern: `https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*) `)
2. replace emails with #EMAIL# (regex pattern: `^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}`)
3. replace phone numbers with #TEL# (regex pattern: `^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4,6}`)
4. replace dates with #DATE# (regex patterns: `\d{1,2}\/\d{1,2}\/\d{2,4} `; `(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+(\d{1,2})\s+(\d{4})`)
5. replace hospital or clinic names with #HOSPITAL# (text patterns: `clinic`; `hospital`)
6. replace names in questions with #Person1#, and names in answers with #Person2#. If a name in an answer is identical to the name in its question, replace it with #Person1#. (text patterns: `I am`; `I'm`; `Dr`; `Doctor`)
7. replace the names of data source forums with coded letters (text patterns: forum names)

Ethics Statement and Limitations

Access to actual medical data is very restricted and protected in France.
We thus used an already publicly available corpus in English, but we did not simply translate it. We first made sure that no personal information could be found in the data, which is why we replaced all names that might have remained in the original data. We also performed post-editing after automatic translation to adapt the phrasing and medical terms to the French culture. All people recruited for annotation were treated fairly. This includes, but is not limited to, compensating them fairly and ensuring that they were voluntary participants. We do not foresee any direct social consequences or ethical issues. The authors of MedDialog were informed of our project and answered our questions.

Since the original corpus is derived from dialogues in the U.S.A., there might be some cultural differences with French-speaking countries in the way people interact with doctors and in which treatments and medical advice can be provided. Answers to questions should not be used for self-treatment.
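Steps 1 to 4 of the cleaning procedure listed under "Corpus data cleaning" are plain regex substitutions, and could be sketched in Python roughly as follows. This is an illustration rather than the authors' script, and it rests on three assumptions: the leading `^` anchors are dropped so that a match can occur anywhere in a message, trailing whitespace in the patterns is dropped, and the e-mail character class `[\w-\.]` is rewritten as `[\w.-]` because Python's `re` module rejects the original as a bad character range. The function name `anonymize` is hypothetical. Steps 5 to 7 (hospitals, person names, forum names) rely on text cues rather than a single regex and are not covered.

```python
import re

# Patterns adapted from the cleaning list (see the assumptions stated above).
# Redundant escapes such as \/ are simplified; the matched strings are unchanged.
URL_RE = re.compile(
    r"https?://(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b"
    r"([-a-zA-Z0-9()@:%_+.~#?&/=]*)"
)
EMAIL_RE = re.compile(r"[\w.-]+@([\w-]+\.)+[\w-]{2,4}")
PHONE_RE = re.compile(r"\+?\(?[0-9]{3}\)?[-\s.]?[0-9]{3}[-\s.]?[0-9]{4,6}")
DATE_NUMERIC_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
DATE_MONTH_RE = re.compile(
    r"(Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?)\s+(\d{1,2})\s+(\d{4})"
)


def anonymize(text: str) -> str:
    """Replace URLs, e-mails, phone numbers and dates with placeholder tags."""
    # URLs go first so that their fragments are not picked up by later patterns.
    text = URL_RE.sub("#URL#", text)
    text = EMAIL_RE.sub("#EMAIL#", text)
    text = PHONE_RE.sub("#TEL#", text)
    text = DATE_NUMERIC_RE.sub("#DATE#", text)
    text = DATE_MONTH_RE.sub("#DATE#", text)
    return text


print(anonymize("Write to jane.doe@mail.org or call 555-123-4567."))
# prints: Write to #EMAIL# or call #TEL#.
```

The substitution order is a design choice of this sketch; the source lists the steps but does not say whether their order matters.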

Created:

2024-11-14