
medical_llama3_instruct_dataset_short

ModelScope community · updated 2025-12-05 · indexed 2025-12-06
Download link:
https://modelscope.cn/datasets/Shekswess/medical_llama3_instruct_dataset_short
Resource description:
Dataset made for instruction-supervised fine-tuning of Llama 2 LLMs, built by combining two medical datasets and taking 2k entries from them:

- Medical meadow wikidoc (https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc/blob/main/README.md)
- Medquad (https://www.kaggle.com/datasets/jpmiller/layoutlm)

## Medical meadow wikidoc

The Medical Meadow Wikidoc dataset comprises question-answer pairs sourced from WikiDoc, an online platform where medical professionals collaboratively contribute and share contemporary medical knowledge. WikiDoc has two primary sections: the "Living Textbook" and "Patient Information". The "Living Textbook" contains chapters across various medical specialties, from which the content was extracted. GPT-3.5-Turbo was used to rephrase each paragraph heading as a question, and the corresponding paragraph serves as the answer. The structure of "Patient Information" is different: each section's subheading is already a question, so no rephrasing is needed.

## Medquad

MedQuAD is a collection of 47,457 medical question-answer pairs compiled from 12 authoritative sources within the National Institutes of Health (NIH), including cancer.gov, niddk.nih.gov, GARD, and MedlinePlus Health Topics. The pairs span 37 distinct question types and cover a wide range of medical subjects, including diseases, drugs, and medical procedures.

The dataset ships additional annotations in XML files, which support various Information Retrieval (IR) and Natural Language Processing (NLP) tasks. The annotations include the question type, the question focus, synonyms, the Concept Unique Identifier (CUI) from the Unified Medical Language System (UMLS), and the Semantic Type. Each question focus is also assigned to one of three categories (Disease, Drug, or Other), except in the collections from MedlinePlus, which focus exclusively on diseases.
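The combining step mentioned at the start of this description can be illustrated with a minimal sketch. It is not the dataset author's actual pipeline: the function names, the record fields (`instruction`, `input`, `output`, `source`), the fixed instruction string, and the reading of "2k entries" as 2,000 records in total are all assumptions made for illustration.

```python
import random


def to_instruction_record(question, answer, source):
    """Wrap one question-answer pair in a generic instruction-style record.

    The field names and the instruction text are hypothetical; the real
    dataset may use a different schema or prompt template.
    """
    return {
        "instruction": "Answer the medical question.",
        "input": question.strip(),
        "output": answer.strip(),
        "source": source,
    }


def build_subset(wikidoc_pairs, medquad_pairs, n_total=2000, seed=0):
    """Merge two lists of (question, answer) tuples and keep at most n_total.

    Pairs with an empty question or answer are dropped. The merged list is
    shuffled with a fixed seed so the selection is reproducible.
    n_total=2000 is an assumed interpretation of "2k entries".
    """
    records = [
        to_instruction_record(q, a, "wikidoc")
        for q, a in wikidoc_pairs
        if q and a
    ]
    records += [
        to_instruction_record(q, a, "medquad")
        for q, a in medquad_pairs
        if q and a
    ]
    rng = random.Random(seed)
    rng.shuffle(records)
    return records[:n_total]
```

Calling `build_subset(wikidoc_pairs, medquad_pairs)` with two lists of `(question, answer)` tuples returns a shuffled list of at most 2,000 records. Sampling a fixed number from each source separately would be an equally plausible reading of the description.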

Provider:
maas
Created:
2025-10-03