ELJAOUHARY/YeMedQA_Mutilangual
Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ELJAOUHARY/YeMedQA_Mutilangual
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context_question
    dtype: string
  - name: answer
    dtype: string
  - name: language
    dtype: string
  - name: urgency
    dtype: string
  - name: speciality
    dtype: string
  - name: article_title
    dtype: string
  - name: entities
    struct:
    - name: age
      list: string
    - name: medicament
      list: string
    - name: sympt
      list: string
    - name: medical_field
      list: string
    - name: disease
      list: string
    - name: Test
      list: string
    - name: Result
      list: string
  splits:
  - name: train
    num_bytes: 6948163.361080951
    num_examples: 7460
  - name: test
    num_bytes: 772121.6389190493
    num_examples: 829
  download_size: 4170389
  dataset_size: 7720285.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---
## Multilingual Question Answering Dataset for Healthcare

# Overview:
**YeMedQA** is a multilingual Question-Answering dataset designed for healthcare NLP applications.
It focuses on **patient–doctor medical conversations** in:
- Darija
- English
- French

**Keywords:** Medical Question Answering (MedQA), Large Language Models (LLMs), Natural Language Processing (NLP), AI in Healthcare

The dataset supports the development of **culturally and linguistically adapted medical AI systems**.
## 🌐 Data Collection
YeMedQA was constructed using:
### 1. Web Scraping (Verified Medical Sources)
Medical content was collected and curated from trusted healthcare platforms:
- www.icliniq.com
- www.altibbi.com
### 2. Hugging Face Open Data
- Publicly available medical QA datasets (ANR-Maladies)

These sources were selected for their:
- High medical credibility
- Real patient–doctor interactions
- Multilingual content availability
### Dataset Splits
| Split | Examples | Size (MB) |
| :--- | :---: | :---: |
| **Train** | 7,460 | 6.95 |
| **Test** | 829 | 0.77 |
| **Total** | **8,289** | **7.72** |
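
The split table can be sanity-checked with a few lines of Python. The sketch below is illustrative only: it derives each split's share from the counts above (roughly 90% train and 10% test) and, optionally, compares them with the downloaded data. `load_and_check` assumes the third-party `datasets` package is installed and the Hub is reachable; the helper names are hypothetical and not part of the dataset.

```python
# Split sizes as reported in the table above.
EXPECTED_SIZES = {"train": 7460, "test": 829}
REPO_ID = "ELJAOUHARY/YeMedQA_Mutilangual"


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_and_check(repo_id=REPO_ID):
    """Download the dataset and compare split sizes with the table.

    Assumption: the `datasets` package is installed and the Hub is
    reachable. The import is local so the rest of this module runs
    without it.
    """
    from datasets import load_dataset

    ds = load_dataset(repo_id)  # DatasetDict keyed by split name
    for name, expected in EXPECTED_SIZES.items():
        actual = ds[name].num_rows
        if actual != expected:
            raise ValueError(f"{name}: expected {expected}, got {actual}")
    return ds


if __name__ == "__main__":
    for name, share in split_fractions(EXPECTED_SIZES).items():
        print(f"{name}: {share:.1%}")
```

Calling `load_and_check()` returns the loaded splits, so the same call can double as the entry point for training or evaluation code.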
## Columns
| Feature | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique ID |
| `question` | `string` | The patient's question (e.g., in Darija) |
| `context_question` | `string` | Clinical context or patient background |
| `answer` | `string` | Professional medical response from a doctor |
| `article_title` | `string` | Title of the reference medical article |
| `language` | `string` | Language of the entry (Darija, FR, EN) |
| `urgency` | `string` | Severity level (Low, Medium, High) |
| `speciality` | `string` | Medical department (e.g., Cardiology, Immunology) |
| `entities` | `struct` | Named entity recognition (NER) annotations (diseases, symptoms, tests, ...) |
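
As a rough illustration of how the categorical columns might be used, the sketch below filters rows by `language`, `urgency`, and `speciality`. The sample rows are hypothetical placeholders, not records from the dataset, and the exact label strings stored in each column (for example `FR` versus `French`) are an assumption that should be checked against the data. Matching is case-insensitive to reduce that risk.

```python
from typing import Iterable, Iterator, Mapping, Optional

# Hypothetical rows that only mimic the documented columns.
SAMPLE_ROWS = [
    {"id": "ex-1", "language": "Darija", "urgency": "High", "speciality": "Cardiology"},
    {"id": "ex-2", "language": "FR", "urgency": "Low", "speciality": "Immunology"},
    {"id": "ex-3", "language": "EN", "urgency": "High", "speciality": "Cardiology"},
    {"id": "ex-4", "language": "FR", "urgency": "High", "speciality": "Cardiology"},
]


def _matches(value: str, wanted: Optional[str]) -> bool:
    """True when no filter is set or the labels agree, ignoring case."""
    return wanted is None or value.strip().lower() == wanted.strip().lower()


def select_rows(
    rows: Iterable[Mapping[str, str]],
    language: Optional[str] = None,
    urgency: Optional[str] = None,
    speciality: Optional[str] = None,
) -> Iterator[Mapping[str, str]]:
    """Yield the rows that satisfy every filter that was provided."""
    for row in rows:
        if (
            _matches(row["language"], language)
            and _matches(row["urgency"], urgency)
            and _matches(row["speciality"], speciality)
        ):
            yield row


if __name__ == "__main__":
    for row in select_rows(SAMPLE_ROWS, language="fr", urgency="high"):
        print(row["id"])
```

Because `select_rows` accepts any iterable of mappings, it should also work on rows produced by iterating over a loaded split.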
## NER Entities Metadata (`entities` column)
| Entity | Type | Description |
| :--- | :--- | :--- |
| `disease` | `list[string]` | Diagnosed conditions or illnesses |
| `sympt` | `list[string]` | Reported symptoms (e.g., "حكة", "fever") |
| `medicament` | `list[string]` | Prescribed or mentioned drugs |
| `medical_field` | `list[string]` | Broad medical categories (e.g., "Allergologie") |
| `age` | `list[string]` | Patient age or age group mentions |
| `Test` / `Result` | `list[string]` | Clinical exams and their respective outcomes |
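
One way to explore the `entities` struct is to count how often each mention appears under a given key. The following is a minimal sketch under stated assumptions: the sample rows are hypothetical, and it assumes that a missing struct or list may appear as `None`, which the real data may or may not do. Mentions are trimmed and lower-cased before counting.

```python
from collections import Counter

# Keys of the `entities` struct as listed in the table above.
ENTITY_KEYS = ("age", "medicament", "sympt", "medical_field",
               "disease", "Test", "Result")

# Hypothetical rows; only the `entities` column matters here.
SAMPLE_ROWS = [
    {"id": "ex-1", "entities": {"sympt": ["fever", "حكة"], "disease": [], "Test": None}},
    {"id": "ex-2", "entities": {"sympt": ["Fever "], "disease": ["asthma"]}},
    {"id": "ex-3", "entities": None},
]


def count_entities(rows, key):
    """Count normalized mentions stored under one entity key."""
    if key not in ENTITY_KEYS:
        raise ValueError(f"unknown entity key: {key!r}")
    counts = Counter()
    for row in rows:
        entities = row.get("entities") or {}  # tolerate a missing struct
        for mention in entities.get(key) or []:  # tolerate a missing list
            counts[mention.strip().lower()] += 1
    return counts


if __name__ == "__main__":
    print(count_entities(SAMPLE_ROWS, "sympt").most_common())
```

Normalizing by case and whitespace is a deliberately simple choice; multilingual mentions such as "fever" and "حكة" remain separate entries unless they are mapped to a shared vocabulary.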
<!-- This is an open-source dataset to enhance research in healthcare, with multilingual support for Arabic Darija, French, and English, plus named entity extraction.
--- -->
## ✍️ Author & Citation
This dataset was curated and processed by **Youssef Eljaouhary**.
If you use this dataset in your research or project, please cite it as:
> Eljaouhary, Y. (2026). MedQA Multilingual Dataset (Darija/FR/EN). Hugging Face.
## ⚖️ License
This project is licensed under the **MIT License**. You are free to use, modify, and distribute this dataset for both commercial and non-commercial purposes, provided that the original author is credited.
<!-- task_categories:
- question-answering
- text-classification
- text-generation
language:
- ar
- fr
- en
tags:
- medical
pretty_name: >-
Question Answering Dataset for Healthcare Domain (original data), collected
by scraping two websites, icliniq.com and Altibbi.com, and from the MedQA dataset
size_categories:
- 10K<n<100K
--- -->
Provider:
ELJAOUHARY



