ELJAOUHARY/YeMedQA_Mutilangual

Source: Hugging Face. Updated 2026-04-21; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/ELJAOUHARY/YeMedQA_Mutilangual
Resource description:
---
dataset_info:
  features:
  - name: id
    dtype: string
  - name: question
    dtype: string
  - name: context_question
    dtype: string
  - name: answer
    dtype: string
  - name: language
    dtype: string
  - name: urgency
    dtype: string
  - name: speciality
    dtype: string
  - name: article_title
    dtype: string
  - name: entities
    struct:
    - name: age
      list: string
    - name: medicament
      list: string
    - name: sympt
      list: string
    - name: medical_field
      list: string
    - name: disease
      list: string
    - name: Test
      list: string
    - name: Result
      list: string
  splits:
  - name: train
    num_bytes: 6948163.361080951
    num_examples: 7460
  - name: test
    num_bytes: 772121.6389190493
    num_examples: 829
  download_size: 4170389
  dataset_size: 7720285.0
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: test
    path: data/test-*
---

## Multilingual Question Answering Dataset for Healthcare

![YeMedQA.drawio (2)](https://cdn-uploads.huggingface.co/production/uploads/6962771c8b0bef761b53df3f/_M4txQSX_wMRNsxTsyTiq.png)

# Overview:

**YeMedQA** is a multilingual question-answering dataset designed for healthcare NLP applications. It focuses on **patient–doctor medical conversations** in:

- Darija
- English
- French

**Keywords:** Medical Question Answering (MedQA), Large Language Models (LLMs), Natural Language Processing (NLP), AI in Healthcare

The dataset supports the development of **culturally and linguistically adapted medical AI systems**.

## 🌐 Data Collection

YeMedQA was constructed from two kinds of sources:

### 1. Web Scraping (Verified Medical Sources)

Medical content was collected and curated from trusted healthcare platforms:

- www.icliniq.com
- www.altibbi.com

### 2. Hugging Face Open Data

- Publicly available medical QA datasets (ANR-Maladies)

These sources were selected for their:

- High medical credibility
- Real patient–doctor interactions
- Multilingual content availability

### Dataset Splits

| Split | Examples | Size (MB) |
| :--- | :---: | :---: |
| **Train** | 7,460 | 6.95 |
| **Test** | 829 | 0.77 |
| **Total** | **8,289** | **7.72** |

## Columns:

| Feature | Type | Description |
| :--- | :--- | :--- |
| `id` | `string` | Unique ID |
| `question` | `string` | The patient's question (e.g., in Darija) |
| `context_question` | `string` | Clinical context or patient background |
| `answer` | `string` | Professional medical response from a doctor |
| `article_title` | `string` | Title of the reference medical article |
| `language` | `string` | Language of the entry (Darija, FR, EN) |
| `urgency` | `string` | Severity level (Low, Medium, High) |
| `speciality` | `string` | Medical department (e.g., Cardiology, Immunology) |
| `entities` | `struct` | Named entity recognition (NER) output (diseases, symptoms, tests, ...); see the next table |

## NER Entities Metadata (`entities` column)

| Entity | Type | Description |
| :--- | :--- | :--- |
| `disease` | `list[string]` | Diagnosed conditions or illnesses |
| `sympt` | `list[string]` | Reported symptoms (e.g., "حكة", "fever") |
| `medicament` | `list[string]` | Prescribed or mentioned drugs |
| `medical_field` | `list[string]` | Broad medical categories (e.g., "Allergologie") |
| `age` | `list[string]` | Patient age or age group mentions |
| `Test` / `Result` | `list[string]` | Clinical exams and their respective outcomes |

<!--
This is an open-source dataset to enhance research in healthcare, with multilingual support for Arabic Darija, French, and English, and with named entity extraction.
-->

## ✍️ Author & Citation

This dataset was curated and processed by **Youssef Eljaouhary**.

If you use this dataset in your research or project, please cite it as:

> Eljaouhary, Y. (2026). MedQA Multilingual Dataset (Darija/FR/EN). Hugging Face.

## ⚖️ License

This project is licensed under the **MIT License**. You are free to use, modify, and distribute this dataset for both commercial and non-commercial purposes, provided that the original author is credited.

<!--
task_categories:
- question-answering
- text-classification
- text-generation
language:
- ar
- fr
- en
tags:
- medical
pretty_name: >-
  Question Answering Dataset for Healthcare Domain (Original data) has collected
  by Scrapping Two website icliniq.com and Altibbi.com and MedQA dataset
size_categories:
- 10K<n<100K
-->
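
The column and entity tables above can be exercised with a short Python sketch. It is not the author's code. It assumes the repository ID shown in the download link, and it assumes that `language` and `urgency` are stored as short labels such as `EN` or `High`; the card does not state the exact stored strings, so the comparison ignores case. The helper names (`load_split`, `make_entities`, `filter_records`) and the sample rows are invented for illustration.

```python
from typing import Iterable, Iterator

REPO_ID = "ELJAOUHARY/YeMedQA_Mutilangual"  # repository ID as given on the card

# Keys of the `entities` struct, as listed in the card's metadata
ENTITY_KEYS = ("age", "medicament", "sympt", "medical_field", "disease", "Test", "Result")


def load_split(split: str = "train"):
    """Load one split from the Hub; requires the `datasets` package and network access."""
    from datasets import load_dataset  # lazy import keeps the rest usable offline
    return load_dataset(REPO_ID, split=split)


def make_entities(**found: list) -> dict:
    """Build an `entities` struct with every key present, defaulting to empty lists."""
    return {key: list(found.get(key, [])) for key in ENTITY_KEYS}


def filter_records(records: Iterable[dict], language: str, urgency: str) -> Iterator[dict]:
    """Yield records whose `language` and `urgency` match, ignoring case and padding."""
    for rec in records:
        same_language = rec["language"].strip().lower() == language.lower()
        same_urgency = rec["urgency"].strip().lower() == urgency.lower()
        if same_language and same_urgency:
            yield rec


# Hypothetical rows shaped like the card's schema; every value here is invented
SAMPLE = [
    {"id": "demo-1", "question": "I have had a fever for two days.",
     "context_question": "Adult, no known allergies.", "answer": "Please see a doctor.",
     "language": "EN", "urgency": "High", "speciality": "General Medicine",
     "article_title": "Fever in adults", "entities": make_entities(sympt=["fever"])},
    {"id": "demo-2", "question": "J'ai des démangeaisons.",
     "context_question": "", "answer": "Consultez un allergologue.",
     "language": "FR", "urgency": "Low", "speciality": "Allergologie",
     "article_title": "Démangeaisons", "entities": make_entities(sympt=["démangeaisons"])},
]

urgent_english = list(filter_records(SAMPLE, language="en", urgency="high"))
```

Rows loaded with `load_split("test")` iterate as dictionaries as well, so the same `filter_records` call should apply to them unchanged, provided the stored labels match the assumptions above.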

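
The split figures in the card's metadata can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the card. The reading that the fractional `num_bytes` values are the total size apportioned by row count (as would result from a roughly 90/10 split) is an inference from the numbers, not something the card states.

```python
# Figures copied from the card's metadata
TRAIN_EXAMPLES = 7460
TEST_EXAMPLES = 829
TRAIN_BYTES = 6948163.361080951
TEST_BYTES = 772121.6389190493
DATASET_SIZE_BYTES = 7720285.0

total_examples = TRAIN_EXAMPLES + TEST_EXAMPLES
test_fraction = TEST_EXAMPLES / total_examples

# Assumption being checked: each split's size is the total size scaled by its share of rows
estimated_train_bytes = DATASET_SIZE_BYTES * TRAIN_EXAMPLES / total_examples
estimated_test_bytes = DATASET_SIZE_BYTES * TEST_EXAMPLES / total_examples

# Decimal megabytes, as used in the Dataset Splits table
train_mb = round(TRAIN_BYTES / 1e6, 2)
test_mb = round(TEST_BYTES / 1e6, 2)

print(f"total examples: {total_examples}")
print(f"test fraction:  {test_fraction:.4f}")
print(f"train bytes, listed vs estimated: {TRAIN_BYTES:.2f} vs {estimated_train_bytes:.2f}")
print(f"test bytes, listed vs estimated:  {TEST_BYTES:.2f} vs {estimated_test_bytes:.2f}")
```

If the estimates agree with the listed values to within a byte, the per-split sizes are proportional estimates rather than measured file sizes, which would explain why they are not whole numbers.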
Provider:
ELJAOUHARY