pubmed
ModelScope community · updated 2025-12-05 · indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/MedRAG/pubmed
# The PubMed Corpus in MedRAG
This HF dataset contains the snippets from the PubMed corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).
## News
- (02/26/2024) The "id" column has been reformatted, and a new "PMID" column has been added.
## Dataset Details
### Dataset Description
[PubMed](https://pubmed.ncbi.nlm.nih.gov/) is the most widely used biomedical literature resource, containing over 36 million articles.
For MedRAG, we use a PubMed subset of 23.9 million articles with valid titles and abstracts.
This HF dataset contains our ready-to-use snippets for the PubMed corpus: 23,898,701 snippets with an average of 296 tokens each.
### Dataset Structure
Each row is one PubMed snippet with the following features:
- id: a unique identifier of the snippet
- title: the title of the PubMed article from which the snippet is collected
- content: the abstract of the PubMed article from which the snippet is collected
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever
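
As a rough illustration of the row layout described above, the sketch below models a snippet with the four listed features and builds `contents` from `title` and `content`. The `Snippet` type, the helper names, the `". "` separator, and the example row are all assumptions made for illustration; the card does not state the exact concatenation format.

```python
from typing import TypedDict


class Snippet(TypedDict):
    """One dataset row, using the four features listed in the card."""
    id: str
    title: str
    content: str
    contents: str


def build_contents(title: str, content: str, sep: str = ". ") -> str:
    # ASSUMPTION: the separator is illustrative; the card only says that
    # `contents` is a concatenation of `title` and `content`.
    return f"{title}{sep}{content}"


def make_snippet(snippet_id: str, title: str, content: str) -> Snippet:
    # Build a row whose `contents` field is derived from the other two.
    return Snippet(
        id=snippet_id,
        title=title,
        content=content,
        contents=build_contents(title, content),
    )


# Hypothetical row, not taken from the dataset.
row = make_snippet("example_0", "An example title", "An example abstract.")
print(row["contents"])
```

Keeping `contents` as a single text field means a lexical retriever such as BM25 can index one string per snippet while `title` and `content` remain available separately.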
## Uses
### Direct Use
```shell
git clone https://huggingface.co/datasets/MedRAG/pubmed
```
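
Once the repository is cloned, the snippets can be read locally. The following is a minimal sketch, assuming the clone stores rows as JSON Lines files (one JSON object per line) somewhere under the repository root and that the large files were actually fetched (Hugging Face repositories typically store them with Git LFS). The file layout, the `iter_snippets` helper, and the `*.jsonl` pattern are assumptions, not something the card specifies.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_snippets(root: str, pattern: str = "**/*.jsonl") -> Iterator[dict]:
    """Yield one dict per non-empty JSON line found under `root`.

    ASSUMPTION: rows are stored as JSON Lines files; adjust `pattern`
    if the cloned repository uses a different layout or format.
    """
    for path in sorted(Path(root).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # "pubmed" is the directory created by the `git clone` command above.
    for index, snippet in enumerate(iter_snippets("pubmed")):
        print(snippet.get("id"), snippet.get("title"))
        if index >= 4:  # only preview the first five rows
            break
```

Iterating lazily avoids loading all 23,898,701 snippets into memory at once.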
### Use in MedRAG
```python
# Assumes the MedRAG repository is on the import path so that `src.medrag` resolves.
from src.medrag import MedRAG

# A multiple-choice medical question and its answer options.
question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation.",
}

# Use the PubMed corpus with the MedCPT retriever.
medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="PubMed")
# Retrieve k=32 snippets; scores are given by the retrieval system.
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)
```
## Citation
```bibtex
@article{xiong2024benchmarking,
title={Benchmarking Retrieval-Augmented Generation for Medicine},
author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
journal={arXiv preprint arXiv:2402.13178},
year={2024}
}
```
Provider: maas
Created: 2025-10-16



