lavita/MedQuAD
Source: Hugging Face. Updated 2023-12-22. Indexed 2024-03-04.
Download link:
https://hf-mirror.com/datasets/lavita/MedQuAD
Resource summary:
This MedQuAD dataset is converted from the original MedQuAD resources and is intended primarily for medical question answering. Its features are document ID, document source, document URL, category, UMLS CUI, UMLS semantic types, UMLS semantic group, synonyms, question ID, question focus, question type, question, and answer. The dataset has a single train split containing 47,441 samples. To respect MedlinePlus copyright, answers from some sources have been removed. The README also lists the differences between the question types in the dataset and those in the corresponding paper, and provides citation information.
Provider:
lavita
Original information summary
Dataset overview
Dataset information
Features
- document_id: string
- document_source: string
- document_url: string
- category: string
- umls_cui: string
- umls_semantic_types: string
- umls_semantic_group: string
- synonyms: string
- question_id: string
- question_focus: string
- question_type: string
- question: string
- answer: string
Splits
- train: 47,441 samples, 34,989,308 bytes
Download and dataset size
- download_size: 10,718,159 bytes
- dataset_size: 34,989,308 bytes
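The feature list and split size above can be checked once the data is loaded. The following is a minimal sketch, assuming the Hugging Face `datasets` package is installed and that the repository id `lavita/MedQuAD` (taken from the download link) is reachable. `EXPECTED_COLUMNS`, `missing_columns`, and `load_medquad_train` are hypothetical helper names introduced for illustration.

```python
# Column names as listed in the feature section above.
EXPECTED_COLUMNS = [
    "document_id", "document_source", "document_url", "category",
    "umls_cui", "umls_semantic_types", "umls_semantic_group", "synonyms",
    "question_id", "question_focus", "question_type", "question", "answer",
]


def missing_columns(column_names):
    """Return the expected columns that are absent from column_names."""
    present = set(column_names)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def load_medquad_train():
    """Load the train split (needs network access and the `datasets` package)."""
    # Imported lazily so the helpers above can be used offline.
    from datasets import load_dataset

    train = load_dataset("lavita/MedQuAD", split="train")
    absent = missing_columns(train.column_names)
    if absent:
        raise ValueError(f"unexpected schema, missing columns: {absent}")
    return train
```

After calling `load_medquad_train()`, `len(train)` can be compared with the 47,441 samples stated for the train split.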
Task categories
- Question answering
Language
- English
Tags
- Medical
Size category
- 10K<n<100K
Dataset notes
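- Multiple values in the umls_cui, umls_semantic_types, and synonyms columns are separated by the | character.
- Answers from the [GARD, MPlusHerbsSupplements, ADAM, MPlusDrugs] sources (31,034 records) have been removed from the original dataset to respect MedlinePlus copyright.
- UMLS: Unified Medical Language System
- CUI: Concept Unique Identifier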
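The two data notes above (the | separator and the removed answers) suggest a small preprocessing step. This is a sketch under assumptions: the source does not say how a removed answer is stored, so the code treats both `None` and an empty string as missing. `split_multi` and `has_answer` are hypothetical helper names, and the sample rows are made up for illustration.

```python
def split_multi(value):
    """Split a |-separated field into a list; empty or missing input gives []."""
    if not value:
        return []
    return [part.strip() for part in value.split("|") if part.strip()]


def has_answer(row):
    """True when the row still carries a non-empty answer string.

    Assumption: a removed answer appears as None or as an empty string.
    """
    answer = row.get("answer")
    return bool(answer and answer.strip())


# Hypothetical rows for illustration only; the values are invented.
rows = [
    {"umls_cui": "C0000001|C0000002", "synonyms": "term a|term b", "answer": "Example answer."},
    {"umls_cui": "C0000003", "synonyms": None, "answer": None},
]

answered = [row for row in rows if has_answer(row)]
cuis = [split_multi(row["umls_cui"]) for row in rows]
```

Filtering with `has_answer` before training keeps only question-answer pairs that still have answer text.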
Question type differences
- The question types in the dataset differ from those mentioned in the paper as follows:
| Dataset question type | Paper question type |
|---|---|
| how can i learn more | learn more |
| brand names of combination products | brand names |
| other information | information |
| outlook | prognosis |
| exams and tests | diagnosis (exams and tests) |
| stages | ? |
| precautions | ? |
| interactions with herbs and supplements | interaction with herbs and supplements |
| when to contact a medical professional | contact a medical professional |
| research | research (or clinical trial) |
| interactions with medications | interaction with medications |
| interactions with foods | interaction with food |
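The table above can be expressed as a lookup for aligning dataset labels with the paper's labels. This is a sketch, not part of the dataset: `DATASET_TO_PAPER`, `UNMAPPED`, and `to_paper_type` are hypothetical names, the two rows marked `?` are returned as `None` because the table gives no counterpart, and labels not listed in the table are assumed to be identical in both.

```python
# Dataset label -> paper label, copied from the table above.
DATASET_TO_PAPER = {
    "how can i learn more": "learn more",
    "brand names of combination products": "brand names",
    "other information": "information",
    "outlook": "prognosis",
    "exams and tests": "diagnosis (exams and tests)",
    "interactions with herbs and supplements": "interaction with herbs and supplements",
    "when to contact a medical professional": "contact a medical professional",
    "research": "research (or clinical trial)",
    "interactions with medications": "interaction with medications",
    "interactions with foods": "interaction with food",
}

# Rows whose paper counterpart is listed as "?" in the table.
UNMAPPED = {"stages", "precautions"}


def to_paper_type(question_type):
    """Map a dataset question type to the paper's label.

    Returns None for types with no known counterpart. Types absent from the
    table are returned unchanged (assumption: they match in both sources).
    """
    key = question_type.strip().lower()
    if key in UNMAPPED:
        return None
    return DATASET_TO_PAPER.get(key, key)
```

For example, `to_paper_type("outlook")` gives `"prognosis"`, while `to_paper_type("stages")` gives `None`.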
Collected summary
Dataset introduction

Background and challenges
Background overview
MedQuAD is a medical question-answering dataset with 47,441 entries, featuring questions and answers annotated with UMLS identifiers and semantic types. It is formatted in Parquet and intended for developing AI models in healthcare, with specific notes on copyright and question type variations.
The content above was collected and summarized by 遇见数据集.



