
pubmed

ModelScope community · Updated 2025-12-05 · Indexed 2025-06-14
Download link:
https://modelscope.cn/datasets/MedRAG/pubmed
Resource description:
# The PubMed Corpus in MedRAG

This HF dataset contains the snippets from the PubMed corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).

## News

- (02/26/2024) The "id" column has been reformatted. A new "PMID" column has been added.

## Dataset Details

### Dataset Descriptions

[PubMed](https://pubmed.ncbi.nlm.nih.gov/) is the most widely used biomedical literature resource, containing over 36 million articles. For MedRAG, we use a PubMed subset of 23.9 million articles with valid titles and abstracts. This HF dataset contains our ready-to-use snippets for the PubMed corpus: 23,898,701 snippets with an average of 296 tokens each.

### Dataset Structure

Each row is a snippet of PubMed, which includes the following features:

- id: a unique identifier of the snippet
- title: the title of the PubMed article from which the snippet is collected
- content: the abstract of the PubMed article from which the snippet is collected
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever

## Uses

### Direct Use

```shell
git clone https://huggingface.co/datasets/MedRAG/pubmed
```

### Use in MedRAG

```python
# Assumes the MedRAG repository is on the import path (the module lives under src/).
from src.medrag import MedRAG

question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

# Retrieve from the PubMed corpus with the MedCPT retriever and answer with the chosen LLM.
medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="PubMed")

# k is the number of snippets to retrieve; scores are given by the retrieval system.
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)
```

## Citation

```bibtex
@article{xiong2024benchmarking,
  title={Benchmarking Retrieval-Augmented Generation for Medicine},
  author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
  journal={arXiv preprint arXiv:2402.13178},
  year={2024}
}
```
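As a rough illustration of the row schema listed under Dataset Structure, the sketch below builds a row with the four documented fields and parses rows from JSON Lines text. The helper names `make_snippet` and `parse_jsonl`, the `". "` separator used to join 'title' and 'content', the example id string, and the assumption that rows are stored one JSON object per line are all hypothetical; the dataset card only states that 'contents' is a concatenation of the two fields.

```python
import json


def make_snippet(snippet_id, title, content, sep=". "):
    """Build one row with the four documented fields.

    The separator is an assumption: the card only says 'contents'
    is a concatenation of 'title' and 'content'.
    """
    title = title.strip()
    content = content.strip()
    if title and content:
        contents = f"{title}{sep}{content}"
    else:
        # Fall back to whichever field is non-empty.
        contents = title or content
    return {"id": snippet_id, "title": title, "content": content, "contents": contents}


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into dicts, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


if __name__ == "__main__":
    # Hypothetical id and text, for illustration only.
    row = make_snippet("example_0", "A sample title", "A sample abstract.")
    print(json.dumps(row))
    print(parse_jsonl([json.dumps(row), ""])[0]["contents"])
```

Keeping 'title' and 'content' separate while also storing the joined 'contents' lets a lexical retriever such as BM25 index a single text field without losing the original columns.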

Provider:
maas
Created:
2025-10-16