miriad-5.8M

ModelScope community · updated 2026-01-07 · indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/AI-ModelScope/miriad-5.8M
Resource description:
<img src="logo_miriad.png" alt="Centered Image" style="display: block; margin: 0 auto;" width="500">

# Dataset Summary

**MIRIAD** is a curated million scale Medical Instruction and RetrIeval Dataset. It contains **5.8 million medical question-answer pairs**, distilled from peer-reviewed biomedical literature using LLMs. MIRIAD provides structured, high-quality QA pairs, enabling diverse downstream tasks like RAG, medical retrieval, hallucination detection, and instruction tuning.

The dataset was introduced in our [arXiv preprint](https://arxiv.org/abs/2506.06091).

### To load the dataset, run:

```python
from datasets import load_dataset

dataset = load_dataset("miriad/miriad-5.8M", split="train")
```

# Licensing

In this paper, we use the Semantic Scholar Open Research Corpus (S2ORC) as the source of documents to generate our dataset. These documents are made available under the Open Data Commons Attribution License (ODC-By) v1.0 (https://opendatacommons.org/licenses/by/1-0/), which permits reuse and modification of the dataset, including for commercial use, provided that proper attribution is given.

To construct our dataset, we used S2ORC documents as input to OpenAI’s language models. The resulting model-generated outputs are owned by us, as per OpenAI’s Terms of Use, which also specify that outputs must not be used for medical diagnosis or decision-making about real individuals (https://openai.com/policies/terms-of-use/).

Since our outputs are generated using both S2ORC documents and OpenAI’s models, we release the dataset under the ODC-By v1.0 license, subject to the usage restrictions in OpenAI’s Terms of Use.

# Intended use

At this stage, the outputs of this study and the provided assets are supplied exclusively for academic research and educational exploration. They have not been reviewed or cleared by any regulatory body, and accordingly must not be used for clinical decision-making or considered a certified medical device.

# Cite

```bibtex
@misc{zheng2025miriadaugmentingllmsmillions,
  title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs},
  author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
  year={2025},
  eprint={2506.06091},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.06091},
}
```
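
### Previewing records without a full download

With 5.8 million QA pairs, it may be convenient to inspect a few records before downloading everything. The sketch below is not part of the original card: it assumes the `streaming=True` mode of the Hugging Face `datasets` library works for this repository, and the helper names `peek` and `preview_miriad` are hypothetical. Column names are not listed in this card, so the sketch prints them instead of assuming any.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def peek(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Return at most the first n records of any iterable, stopping after n items."""
    return list(islice(records, n))


def preview_miriad(n: int = 3) -> None:
    """Print the column names of the first n MIRIAD records (needs network access)."""
    # Imported lazily so that peek() stays usable without the datasets library.
    from datasets import load_dataset

    # Assumption: streaming mode is supported for this repository; it yields
    # records lazily instead of downloading every shard up front.
    stream = load_dataset("miriad/miriad-5.8M", split="train", streaming=True)
    for record in peek(stream, n):
        # Column names are not documented in this card, so list them as found.
        print(sorted(record.keys()))
```

Calling `preview_miriad()` fetches only as much data as is needed for the first few records; `peek` itself works on any iterable, including a fully loaded split.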

Provider:
maas
Created:
2025-06-12