
qanastek/frenchmedmcqa

Source: Hugging Face | Updated: 2023-06-08 | Indexed: 2024-03-04
Download link:
https://hf-mirror.com/datasets/qanastek/frenchmedmcqa
Resource description:
---
annotations_creators:
- no-annotation
language_creators:
- expert-generated
language:
- fr
license:
- apache-2.0
multilinguality:
- monolingual
size_categories:
- 1k<n<10k
source_datasets:
- original
task_categories:
- question-answering
- multiple-choice
task_ids:
- multiple-choice-qa
- open-domain-qa
paperswithcode_id: frenchmedmcqa
pretty_name: FrenchMedMCQA
---

# Dataset Card for FrenchMedMCQA : A French Multiple-Choice Question Answering Corpus for Medical domain

## Table of Contents

- [Dataset Card for FrenchMedMCQA : A French Multiple-Choice Question Answering Corpus for Medical domain](#dataset-card-for-frenchmedmcqa--a-french-multiple-choice-question-answering-corpus-for-medical-domain)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contact](#contact)

## Dataset Description

- **Homepage:** https://deft2023.univ-avignon.fr/
- **Repository:** https://deft2023.univ-avignon.fr/
- **Paper:** [FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain](https://hal.science/hal-03824241/document)
- **Leaderboard:** Coming soon
- **Point of Contact:** [Yanis LABRAK](mailto:yanis.labrak@univ-avignon.fr)

### Dataset Summary

This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice
Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.

### Supported Tasks and Leaderboards

Multiple-Choice Question Answering (MCQA)

### Languages

The questions and answers are available in French.

## Dataset Structure

### Data Instances

```json
{
    "id": "1863462668476003678",
    "question": "Parmi les propositions suivantes, laquelle (lesquelles) est (sont) exacte(s) ? Les chylomicrons plasmatiques :",
    "answers": {
        "a": "Sont plus riches en cholestérol estérifié qu'en triglycérides",
        "b": "Sont synthétisés par le foie",
        "c": "Contiennent de l'apolipoprotéine B48",
        "d": "Contiennent de l'apolipoprotéine E",
        "e": "Sont transformés par action de la lipoprotéine lipase"
    },
    "correct_answers": [
        "c",
        "d",
        "e"
    ],
    "subject_name": "pharmacie",
    "type": "multiple"
}
```

### Data Fields

- `id`: a string question identifier for each example
- `question`: question text (a string)
- `answer_a`: Option A
- `answer_b`: Option B
- `answer_c`: Option C
- `answer_d`: Option D
- `answer_e`: Option E
- `correct_answers`: Correct options, e.g., A, D and E
- `choice_type` ({"single", "multiple"}): Question choice type.
  - "single": Single-answer question, with exactly one correct option.
  - "multiple": Multiple-answer question, where the correct answer is a combination of several options.

### Data Splits

| Number of correct answers | Training | Validation | Test | Total |
|:-------------------------:|:--------:|:----------:|:----:|:-----:|
| 1 | 595 | 164 | 321 | 1,080 |
| 2 | 528 | 45 | 97 | 670 |
| 3 | 718 | 71 | 141 | 930 |
| 4 | 296 | 30 | 56 | 382 |
| 5 | 34 | 2 | 7 | 43 |
| Total | 2,171 | 312 | 622 | 3,105 |

## Dataset Creation

### Source Data

#### Initial Data Collection and Normalization

The questions and their associated candidate answer(s) were collected from real French pharmacy exams on the remede website. Questions and answers were manually created by medical experts and used during examinations. The dataset is composed of 2,025 questions with multiple answers and 1,080 with a single one, for a total of 3,105 questions. Each instance of the dataset contains an identifier, a question, five options (labeled from A to E) and correct answer(s).

The average question length is 14.17 tokens and the average answer length is 6.44 tokens. The vocabulary contains about 13k words, of which 3.8k are estimated to be medical domain-specific (i.e. related to the medical field). We find an average of 2.49 medical domain-specific words in each question (17 % of the words) and 2 in each answer (36 % of the words). On average, a medical domain-specific word is present in 2 questions and in 8 answers.

### Personal and Sensitive Information

The corpus is free of personal or sensitive information.

## Additional Information

### Dataset Curators

The dataset was created by Labrak Yanis, Bazoge Adrien, Dufour Richard, Daille Béatrice, Gourraud Pierre-Antoine, Morin Emmanuel and Rouvier Mickael.
### Licensing Information

Apache 2.0

### Citation Information

If you find this useful in your research, please consider citing the dataset paper:

```latex
@inproceedings{labrak-etal-2022-frenchmedmcqa,
    title = "{F}rench{M}ed{MCQA}: A {F}rench Multiple-Choice Question Answering Dataset for Medical domain",
    author = "Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Daille, Beatrice and Gourraud, Pierre-Antoine and Morin, Emmanuel and Rouvier, Mickael",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.5",
    pages = "41--46",
    abstract = "This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.",
}
```

### Contact

Please contact [Yanis LABRAK](https://github.com/qanastek) for more information about this dataset.
Provider:
qanastek
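## Summary of Original Information

### Dataset Overview

### Dataset Description

#### Dataset Summary

FrenchMedMCQA is the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It contains 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single-answer and multiple-answer questions. Each instance contains an identifier, a question, five possible answers and their manual correction(s).

#### Supported Tasks and Leaderboards

- Multiple-Choice Question Answering (MCQA)

#### Languages

The questions and answers in the dataset are in French.

### Dataset Structure

#### Data Instances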

```json
{
    "id": "1863462668476003678",
    "question": "Parmi les propositions suivantes, laquelle (lesquelles) est (sont) exacte(s) ? Les chylomicrons plasmatiques :",
    "answers": {
        "a": "Sont plus riches en cholestérol estérifié qu'en triglycérides",
        "b": "Sont synthétisés par le foie",
        "c": "Contiennent de l'apolipoprotéine B48",
        "d": "Contiennent de l'apolipoprotéine E",
        "e": "Sont transformés par action de la lipoprotéine lipase"
    },
    "correct_answers": [
        "c",
        "d",
        "e"
    ],
    "subject_name": "pharmacie",
    "type": "multiple"
}
```
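To illustrate how the fields of such an instance relate to each other, the following minimal Python sketch parses the example record and lists the options marked as correct. The helper name `correct_options` is hypothetical and not part of any published loader, and the record layout is assumed to match the example instance exactly.

```python
import json

# Raw record copied from the example instance (layout assumed, not verified
# against the distributed files).
RAW = """
{
    "id": "1863462668476003678",
    "question": "Parmi les propositions suivantes, laquelle (lesquelles) est (sont) exacte(s) ? Les chylomicrons plasmatiques :",
    "answers": {
        "a": "Sont plus riches en cholestérol estérifié qu'en triglycérides",
        "b": "Sont synthétisés par le foie",
        "c": "Contiennent de l'apolipoprotéine B48",
        "d": "Contiennent de l'apolipoprotéine E",
        "e": "Sont transformés par action de la lipoprotéine lipase"
    },
    "correct_answers": ["c", "d", "e"],
    "subject_name": "pharmacie",
    "type": "multiple"
}
"""


def correct_options(record):
    """Return (label, text) pairs for the options listed in `correct_answers`."""
    return [(label, record["answers"][label]) for label in record["correct_answers"]]


record = json.loads(RAW)
pairs = correct_options(record)

print(record["question"])
for label, text in pairs:
    # Labels are stored in lower case; upper-case them for display only.
    print(f"  {label.upper()}. {text}")
```

Looking options up by label, rather than by position, keeps the sketch independent of the order in which the options appear in the record.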

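#### Data Fields

- `id`: a string question identifier for each example
- `question`: question text (a string)
- `answer_a`: Option A
- `answer_b`: Option B
- `answer_c`: Option C
- `answer_d`: Option D
- `answer_e`: Option E
- `correct_answers`: correct options, e.g., A, D and E
- `choice_type` ({"single", "multiple"}): question choice type
  - "single": single-answer question, with exactly one correct option
  - "multiple": multiple-answer question, where the correct answer is a combination of several options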
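The example instance nests its options under an `answers` object and carries a `type` key, whereas the field list names `answer_a` to `answer_e` and `choice_type`. The sketch below shows one possible way to map the first layout onto the second. It is an illustration under stated assumptions, not the dataset's official loader: the helper `flatten_record` is hypothetical, and `choice_type` is assumed to be derivable from the number of entries in `correct_answers`.

```python
def flatten_record(raw):
    """Map a nested record onto the flat field names from the field list.

    Assumption: a question is "single" when it has exactly one correct
    answer and "multiple" otherwise.
    """
    flat = {"id": raw["id"], "question": raw["question"]}
    for label in "abcde":
        flat[f"answer_{label}"] = raw["answers"][label]
    flat["correct_answers"] = list(raw["correct_answers"])
    flat["choice_type"] = "single" if len(raw["correct_answers"]) == 1 else "multiple"
    return flat


# Nested record copied from the example instance.
raw = {
    "id": "1863462668476003678",
    "question": "Parmi les propositions suivantes, laquelle (lesquelles) est (sont) exacte(s) ? Les chylomicrons plasmatiques :",
    "answers": {
        "a": "Sont plus riches en cholestérol estérifié qu'en triglycérides",
        "b": "Sont synthétisés par le foie",
        "c": "Contiennent de l'apolipoprotéine B48",
        "d": "Contiennent de l'apolipoprotéine E",
        "e": "Sont transformés par action de la lipoprotéine lipase",
    },
    "correct_answers": ["c", "d", "e"],
    "subject_name": "pharmacie",
    "type": "multiple",
}

flat = flatten_record(raw)
print(sorted(flat))
print(flat["choice_type"])
```

For the example instance, the derived `choice_type` agrees with the record's own `type` value, which is consistent with the assumption but does not prove it for every record.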

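#### Data Splits

| Number of correct answers | Training | Validation | Test | Total |
|:-------------------------:|:--------:|:----------:|:----:|:-----:|
| 1 | 595 | 164 | 321 | 1,080 |
| 2 | 528 | 45 | 97 | 670 |
| 3 | 718 | 71 | 141 | 930 |
| 4 | 296 | 30 | 56 | 382 |
| 5 | 34 | 2 | 7 | 43 |
| Total | 2,171 | 312 | 622 | 3,105 |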
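The totals in the table can be cross-checked with a few lines of Python. The counts below are copied from the table; the variable names are illustrative only.

```python
# Counts copied from the table:
# number of correct answers -> (training, validation, test).
SPLITS = ("train", "validation", "test")
COUNTS = {
    1: (595, 164, 321),
    2: (528, 45, 97),
    3: (718, 71, 141),
    4: (296, 30, 56),
    5: (34, 2, 7),
}

# Total per number of correct answers (table rows).
row_totals = {n: sum(counts) for n, counts in COUNTS.items()}

# Total per split (table columns).
split_totals = {name: sum(col) for name, col in zip(SPLITS, zip(*COUNTS.values()))}

total = sum(row_totals.values())
single = row_totals[1]      # questions with exactly one correct answer
multiple = total - single   # questions with two or more correct answers

print(split_totals)             # {'train': 2171, 'validation': 312, 'test': 622}
print(total, single, multiple)  # 3105 1080 2025
```

The single-answer and multiple-answer counts obtained this way match the 1,080 and 2,025 questions reported in the dataset description.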

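### Dataset Creation

#### Source Data

##### Initial Data Collection and Normalization

The questions and their associated candidate answers were collected from real French pharmacy exams on the remede website. Questions and answers were manually created by medical experts and used during examinations. The dataset is composed of 2,025 questions with multiple answers and 1,080 with a single one, for a total of 3,105 questions. Each instance contains an identifier, a question, five options (labeled from A to E) and the correct answer(s). The average question length is 14.17 tokens and the average answer length is 6.44 tokens. The vocabulary contains about 13k words, of which 3.8k are estimated to be medical domain-specific (i.e. related to the medical field). Each question contains on average 2.49 medical domain-specific words (17% of the words) and each answer contains on average 2 (36% of the words). On average, a medical domain-specific word is present in 2 questions and in 8 answers.

#### Personal and Sensitive Information

The corpus is free of personal or sensitive information.

### Additional Information

#### Dataset Curators

The dataset was created by Labrak Yanis, Bazoge Adrien, Dufour Richard, Daille Béatrice, Gourraud Pierre-Antoine, Morin Emmanuel and Rouvier Mickael.

#### Licensing Information

Apache 2.0

#### Citation Information

If you use this dataset in your research, please consider citing the dataset paper: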

```latex
@inproceedings{labrak-etal-2022-frenchmedmcqa,
    title = "{F}rench{M}ed{MCQA}: A {F}rench Multiple-Choice Question Answering Dataset for Medical domain",
    author = "Labrak, Yanis and Bazoge, Adrien and Dufour, Richard and Daille, Beatrice and Gourraud, Pierre-Antoine and Morin, Emmanuel and Rouvier, Mickael",
    booktitle = "Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.louhi-1.5",
    pages = "41--46",
    abstract = "This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task in order to report on the current performances and to highlight the difficulty of the task. A detailed analysis of the results showed that it is necessary to have representations adapted to the medical domain or to the MCQA task: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.",
}
```
