pubmed

ModelScope community. Updated 2025-12-05. Indexed 2025-06-14.

Download link:

https://modelscope.cn/datasets/MedRAG/pubmed

Resource description:

# The PubMed Corpus in MedRAG

This HF dataset contains the snippets from the PubMed corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).

## News

- (02/26/2024) The "id" column has been reformatted. A new "PMID" column is added.

## Dataset Details

### Dataset Descriptions

[PubMed](https://pubmed.ncbi.nlm.nih.gov/) is the most widely used biomedical literature resource, containing over 36 million biomedical articles. For MedRAG, we use a PubMed subset of 23.9 million articles with valid titles and abstracts. This HF dataset contains our ready-to-use snippets for the PubMed corpus, including 23,898,701 snippets with an average of 296 tokens.

### Dataset Structure

Each row is a snippet of PubMed, which includes the following features:

- id: a unique identifier of the snippet
- title: the title of the PubMed article from which the snippet is collected
- content: the abstract of the PubMed article from which the snippet is collected
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever

## Uses

### Direct Use

```shell
git clone https://huggingface.co/datasets/MedRAG/pubmed
```

### Use in MedRAG

```python
from src.medrag import MedRAG

# A multiple-choice question and its answer options
question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

# Retrieve from the PubMed corpus with MedCPT and answer with the LLM
medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="PubMed")
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)  # scores are given by the retrieval system
```

## Citation

```bibtex
@article{xiong2024benchmarking,
    title={Benchmarking Retrieval-Augmented Generation for Medicine},
    author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
    journal={arXiv preprint arXiv:2402.13178},
    year={2024}
}
```
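### Example: Row Layout (Illustrative Sketch)

The row layout listed under Dataset Structure can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the corpus: the `Snippet` class, the `make_snippet` helper, the `". "` separator, and the sample row are all assumptions introduced here. The dataset card only states that `contents` is a concatenation of `title` and `content`, without naming the separator. The "PMID" column mentioned under News is left out because its format is not described.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Snippet:
    """One row of the corpus, using the field names listed in the dataset card."""
    id: str
    title: str
    content: str
    contents: str


def make_snippet(snippet_id: str, title: str, content: str, sep: str = ". ") -> Snippet:
    """Build a row; `sep` is an assumed separator, not confirmed by the card."""
    title = title.strip()
    content = content.strip()
    # Join only the non-empty parts so a missing title leaves no stray separator.
    contents = sep.join(part for part in (title, content) if part)
    return Snippet(id=snippet_id, title=title, content=content, contents=contents)


# Hypothetical example row (made-up identifier and text, not taken from the corpus).
row = make_snippet(
    "example_0",
    "Facial nerve anatomy",
    "The facial nerve exits the skull through the stylomastoid foramen.",
)
print(asdict(row)["contents"])
```

Keeping a single `contents` field lets a lexical retriever such as BM25 index one text per snippet, so query terms can match words from either the title or the abstract.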

Created: 2025-10-16

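Background overview

This dataset is the collection of PubMed snippets used in the MedRAG project. It contains title-and-abstract snippets from about 23.9 million biomedical articles, averaging 296 tokens per snippet. It is designed for retrieval-augmented generation (RAG) in the medical domain and can be used directly in such applications.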

The overview above was collected and summarized by 遇见数据集.
