Download link: https://modelscope.cn/datasets/MedRAG/textbooks (ModelScope community; updated 2025-11-27, indexed 2025-11-15)
# The Textbooks Corpus in MedRAG
This HF dataset contains the chunked snippets from the Textbooks corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).
## Dataset Details
### Dataset Descriptions
[Textbooks](https://github.com/jind11/MedQA) is a collection of 18 widely used medical textbooks, which are important references for students taking the United States Medical Licensing Examination (USMLE).
In MedRAG, the textbooks are processed as chunks with no more than 1000 characters.
We used the RecursiveCharacterTextSplitter from [LangChain](https://www.langchain.com/) to perform the chunking.
This HF dataset contains our ready-to-use chunked snippets for the Textbooks corpus, including 125,847 snippets with an average of 182 tokens.
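To make the chunking step concrete, here is a simplified, standard-library stand-in for recursive character splitting. It is not LangChain's RecursiveCharacterTextSplitter, and the separators and overlap actually used for this corpus are not stated here, so the `separators` tuple and the absence of overlap are assumptions; `recursive_split` is a hypothetical helper name.

```python
from typing import List, Tuple


def recursive_split(text: str, max_chars: int = 1000,
                    separators: Tuple[str, ...] = ("\n\n", "\n", " ")) -> List[str]:
    """Split text into chunks of at most max_chars characters.

    Tries the coarsest separator first and falls back to finer ones
    (and finally to a hard cut) only for pieces that are still too long.
    """
    if len(text) <= max_chars:
        return [text] if text else []

    # Pick the first separator that actually occurs in the text.
    for index, sep in enumerate(separators):
        if sep in text:
            pieces = text.split(sep)
            finer = separators[index + 1:]
            break
    else:
        # No separator left: cut at fixed character offsets.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


sample = ("Paragraph one about anatomy.\n\n"
          "Paragraph two about physiology.\n\n"
          "Paragraph three.")
for chunk in recursive_split(sample, max_chars=40):
    print(len(chunk), repr(chunk))
```

With `max_chars=1000`, every chunk this sketch returns respects the 1000-character limit described above, although the chunk boundaries will not necessarily match those in the published snippets.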
### Dataset Structure
Each row is a snippet of Textbooks, which includes the following features:
- id: a unique identifier of the snippet
- title: the title of the textbook from which the snippet is collected
- content: the content of the snippet
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever
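As an illustration of the four fields, the sketch below builds one row and derives `contents` from `title` and `content`. The separator used for the concatenation is not specified above, so the `". "` joiner is an assumption, as are the sample values and the `make_row` helper name.

```python
import json


def make_row(snippet_id: str, title: str, content: str, joiner: str = ". ") -> dict:
    """Build a row with the four documented fields.

    `joiner` is an assumed separator; the card only says that `contents`
    is a concatenation of `title` and `content`.
    """
    return {
        "id": snippet_id,
        "title": title,
        "content": content,
        "contents": f"{title}{joiner}{content}",
    }


# Hypothetical values, for illustration only.
row = make_row(
    "Example_Textbook_0",
    "Example Textbook",
    "The facial nerve exits the skull at the stylomastoid foramen.",
)
print(json.dumps(row, indent=2))
```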
## Uses
### Direct Use
```shell
git clone https://huggingface.co/datasets/MedRAG/textbooks
```
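After cloning, the snippets can be read with the standard library alone. The sketch below assumes the repository stores rows as JSON Lines files (one JSON object per line, `*.jsonl`) somewhere under the clone directory; that layout and the `iter_snippets` helper name are assumptions, so adjust the glob pattern to whatever the clone actually contains.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_snippets(root: str, pattern: str = "**/*.jsonl") -> Iterator[dict]:
    """Yield one dict per non-empty line from every matching file under root."""
    for path in sorted(Path(root).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    # "textbooks" is the directory created by the git clone command above.
    count = sum(1 for _ in iter_snippets("textbooks"))
    print(f"loaded {count} snippets")
```

Reading the files lazily keeps memory use low, which matters for a corpus of this size.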
### Use in MedRAG
```python
from src.medrag import MedRAG

question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

# Use the MedCPT retriever over the Textbooks corpus with RAG enabled.
medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="Textbooks")
# scores are given by the retrieval system
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)
```
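The returned values might then be inspected along the following lines. This sketch assumes `snippets` is a list of dicts carrying the fields described under Dataset Structure and `scores` is a parallel list of floats where higher means more relevant; none of this is confirmed here, and `top_snippets` is a hypothetical helper.

```python
from typing import List, Tuple


def top_snippets(snippets: List[dict], scores: List[float],
                 n: int = 3) -> List[Tuple[float, str, str]]:
    """Pair each snippet with its score and return the n highest-scoring ones."""
    if len(snippets) != len(scores):
        raise ValueError("snippets and scores must have the same length")
    ranked = sorted(zip(scores, snippets), key=lambda pair: pair[0], reverse=True)
    return [(score, snip["id"], snip["title"]) for score, snip in ranked[:n]]


# Stand-in data with made-up ids and scores, in place of a real MedRAG call.
demo_snippets = [
    {"id": "a_0", "title": "A", "content": "first"},
    {"id": "b_1", "title": "B", "content": "second"},
    {"id": "c_2", "title": "C", "content": "third"},
]
demo_scores = [0.42, 0.87, 0.65]

for score, snippet_id, title in top_snippets(demo_snippets, demo_scores, n=2):
    print(f"{score:.2f}  {snippet_id}  {title}")
```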
## Citation
```bibtex
@article{xiong2024benchmarking,
title={Benchmarking Retrieval-Augmented Generation for Medicine},
author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
journal={arXiv preprint arXiv:2402.13178},
year={2024}
}
```
Provider: maas
Created: 2025-10-16



