MedRAG/statpearls
The StatPearls Corpus in MedRAG
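Dataset Details
Dataset Description
StatPearls is a clinical decision support tool similar to UpToDate. We built the StatPearls corpus from the 9,330 publicly available StatPearl articles on NCBI Bookshelf. We chunked StatPearls according to each article's hierarchical structure: every paragraph in an article is treated as one snippet, and all of its relevant hierarchical headings are concatenated to form the snippet's title. The chunked corpus contains 301,202 snippets, averaging 119 words per snippet.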
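The chunking rule described above (one snippet per paragraph, titled with its full heading path) can be sketched as follows. This is a minimal illustration and not the code in the MedRAG repository: the `Section` class, the `chunk_section` function, the " > " separator, and the placeholder article are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    """Hypothetical article node: a heading, its paragraphs, and its subsections."""
    heading: str
    paragraphs: list = field(default_factory=list)
    subsections: list = field(default_factory=list)


def chunk_section(section, parents=()):
    """Yield one snippet per paragraph, titled with the concatenated heading path."""
    path = parents + (section.heading,)
    title = " > ".join(path)  # the separator is an assumption, not taken from the source
    for paragraph in section.paragraphs:
        yield {"title": title, "content": paragraph}
    for child in section.subsections:
        yield from chunk_section(child, path)


# Placeholder article; the headings and text are illustrative only.
article = Section(
    heading="Example Article",
    subsections=[
        Section("Etiology", paragraphs=["First paragraph.", "Second paragraph."]),
        Section("Treatment", paragraphs=["Third paragraph."]),
    ],
)
snippets = list(chunk_section(article))
```

Each paragraph becomes its own snippet, and paragraphs under the same heading share a title, so a retriever can still see where a short passage sits in the article.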
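Dataset Structure
Each row is a StatPearls snippet with the following features:
- id: a unique identifier of the snippet
- title: the title and subtitles of the StatPearl article to which the snippet belongs
- content: the content of the snippet
- contents: the concatenation of title and content, which is used by the BM25 retriever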
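To illustrate why a concatenated `contents` field is useful for lexical retrieval, here is a toy Okapi BM25 scorer written from scratch. It is only a sketch: the retriever implementation MedRAG actually uses is not shown in this card, and the ". " separator, the tokenizer, the `k1` and `b` values, and the placeholder rows are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Placeholder rows using the four fields listed above.
rows = [
    {"id": "demo_0", "title": "Example Article > Etiology", "content": "Placeholder paragraph text."},
    {"id": "demo_1", "title": "Example Article > Treatment", "content": "Another placeholder paragraph."},
]
for row in rows:
    row["contents"] = row["title"] + ". " + row["content"]  # separator is an assumption

scores_on_contents = bm25_scores("etiology", [row["contents"] for row in rows])
scores_on_content = bm25_scores("etiology", [row["content"] for row in rows])
```

In this toy case the query term appears only in a heading, so it matches when scoring `contents` but not when scoring `content` alone.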
Uses
Direct Use
```shell
git clone https://github.com/Teddy-XiongGZ/MedRAG.git
cd MedRAG
wget https://ftp.ncbi.nlm.nih.gov/pub/litarch/3d/12/statpearls_NBK430685.tar.gz -P ./corpus/statpearls
tar -xzvf ./corpus/statpearls/statpearls_NBK430685.tar.gz -C ./corpus/statpearls
python src/data/statpearls.py
```
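After the build script finishes, a quick sanity check is to count the snippets and compare the total and mean length with the figures reported in the dataset description. The sketch below assumes the chunks are stored as JSON Lines files with a `content` field in a single directory. That layout and the `corpus_stats` helper are assumptions, so adjust the path and format to whatever the script actually writes. Whitespace splitting may also differ from how the reported average was computed.

```python
import json
import tempfile
from pathlib import Path


def corpus_stats(chunk_dir):
    """Return (snippet count, mean whitespace-separated words per snippet)."""
    count, words = 0, 0
    for path in sorted(Path(chunk_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                count += 1
                words += len(record["content"].split())
    return count, (words / count if count else 0.0)


# Demo on a throwaway directory so the sketch runs without the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    demo_rows = [
        {"id": "demo_0", "title": "Example", "content": "one two three"},
        {"id": "demo_1", "title": "Example", "content": "four five"},
    ]
    demo_text = "\n".join(json.dumps(row) for row in demo_rows) + "\n"
    Path(tmp, "demo.jsonl").write_text(demo_text, encoding="utf-8")
    demo_count, demo_mean = corpus_stats(tmp)
```

Pointing `corpus_stats` at the real chunk directory should give numbers in the neighborhood of those stated in the description if the build succeeded.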
Use in MedRAG
```python
from src.medrag import MedRAG

question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="StatPearls")
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)  # scores are given by the retrieval system
```
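To inspect what was retrieved, each snippet can be paired with its score. The return types are not documented in this card, so the sketch below assumes that `snippets` is a list of dicts with a `title` field and that `scores` is a parallel list of numbers in the same order. The `pair_snippets_with_scores` helper and the stand-in values are hypothetical.

```python
def pair_snippets_with_scores(snippets, scores):
    """Return one 'score<TAB>title' line per retrieved snippet, in the order given."""
    if len(snippets) != len(scores):
        raise ValueError("snippets and scores must have the same length")
    return [
        f"{score:.3f}\t{snippet.get('title', '')}"
        for snippet, score in zip(snippets, scores)
    ]


# Stand-in values; real ones would come from medrag.answer(...).
demo_snippets = [
    {"title": "Example Article > Etiology", "content": "Placeholder text."},
    {"title": "Example Article > Treatment", "content": "Placeholder text."},
]
demo_scores = [0.91, 0.42]
summary = pair_snippets_with_scores(demo_snippets, demo_scores)
```

The helper keeps the retriever's ordering instead of re-sorting, because whether a higher score means a better match depends on the retriever.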
Citation
```bibtex
@article{xiong2024benchmarking,
  title={Benchmarking Retrieval-Augmented Generation for Medicine},
  author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
  journal={arXiv preprint arXiv:2402.13178},
  year={2024}
}
```




