
miriad-4.4M

ModelScope Community · Updated 2025-12-05 · Indexed 2025-06-21
Download link:
https://modelscope.cn/datasets/miriad/miriad-4.4M
Resource description:
<img src="logo_miriad.png" alt="Centered Image" style="display: block; margin: 0 auto;" width="500">

# Dataset Summary

**MIRIAD** is a further curated million scale Medical Instruction and RetrIeval Dataset. It contains **4.4 million medical question-answer pairs**, distilled from peer-reviewed biomedical literature using LLMs. MIRIAD provides structured, high-quality QA pairs, enabling diverse downstream tasks like RAG, medical retrieval, hallucination detection, and instruction tuning.

The dataset was introduced in our [arXiv preprint](https://arxiv.org/abs/2506.06091).

### To load the dataset, run:

```python
from datasets import load_dataset

dataset = load_dataset("miriad/miriad-4.4M", split="train")
```

# Licensing

In this paper, we use the Semantic Scholar Open Research Corpus (S2ORC) as the source of documents to generate our dataset. These documents are made available under the Open Data Commons Attribution License (ODC-By) v1.0 (https://opendatacommons.org/licenses/by/1-0/), which permits reuse and modification of the dataset, including for commercial use, provided that proper attribution is given.

To construct our dataset, we used S2ORC documents as input to OpenAI’s language models. The resulting model-generated outputs are owned by us, as per OpenAI’s Terms of Use, which also specify that outputs must not be used for medical diagnosis or decision-making about real individuals (https://openai.com/policies/terms-of-use/).

Since our outputs are generated using both S2ORC documents and OpenAI’s models, we release the dataset under the ODC-By v1.0 license, subject to the usage restrictions in OpenAI’s Terms of Use.

# Intended use

At this stage, the outputs of this study and the provided assets are supplied exclusively for academic research and educational exploration. They have not been reviewed or cleared by any regulatory body, and accordingly must not be used for clinical decision-making or considered a certified medical device.
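
### Previewing a few records (illustrative sketch)

Since the train split holds 4.4 million pairs, it can be convenient to look at a handful of records before downloading everything. The sketch below is an illustration and not part of the original dataset card. It assumes the Hugging Face `datasets` library is installed and that its streaming mode works for this repository. The helper names `peek`, `load_miriad_stream`, and `show_first_rows` are hypothetical. No column names are assumed: the code only prints whatever keys each record carries.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return the first n records from any iterable of dict-like rows."""
    return list(islice(records, n))


def load_miriad_stream():
    """Open the train split lazily instead of downloading it in full.

    Assumption: streaming is supported for this repository.
    """
    # Imported here so that `peek` stays usable without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("miriad/miriad-4.4M", split="train", streaming=True)


def show_first_rows(n: int = 3) -> None:
    """Print the field names of the first n streamed records (needs network)."""
    for row in peek(load_miriad_stream(), n):
        # Only the keys are printed; the schema is not assumed here.
        print(sorted(row.keys()))
```

Calling `show_first_rows()` fetches only the first few records. `peek` also works on any in-memory list of dictionaries, which makes it easy to try without network access.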

# Cite

```bibtex
@misc{zheng2025miriadaugmentingllmsmillions,
  title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs},
  author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
  year={2025},
  eprint={2506.06091},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.06091},
}
```

Provider:
maas
Created:
2025-06-19
Dataset introduction
Background overview
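The MIRIAD dataset contains 4.4 million high-quality medical question-answer pairs and is suited to tasks such as medical retrieval and instruction tuning. It is released under the ODC-By v1.0 license, subject to the usage restrictions in OpenAI's Terms of Use.
The summary above was collected and generated by 遇见数据集.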