minsu/medrag_wikipedia

Name: minsu/medrag_wikipedia
Creator: minsu
Published: 2026-02-13 06:31:59
License: not specified

Source: Hugging Face (updated 2026-02-13, indexed 2026-03-29)

Download link:

https://hf-mirror.com/datasets/minsu/medrag_wikipedia


Description:

---
task_categories:
- question-answering
language:
- en
tags:
- medical
- question answering
- large language model
- retrieval-augmented generation
size_categories:
- 10M<n<100M
---

# The Wikipedia Corpus in MedRAG

This HF dataset contains the chunked snippets from the Wikipedia corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).

## News

- (02/26/2024) The "id" column has been reformatted. A new "wiki_id" column has been added.

## Dataset Details

### Dataset Descriptions

As a large-scale open-source encyclopedia, Wikipedia is frequently used as a corpus in information retrieval tasks. We select Wikipedia as one of the corpora to test whether a general-domain database can improve medical question answering. We downloaded the processed Wikipedia data from [HuggingFace](https://huggingface.co/datasets/wikipedia) and chunked the text with [LangChain](https://www.langchain.com/) into snippets of no more than 1000 characters. This HF dataset contains our ready-to-use chunked snippets for the Wikipedia corpus: 29,913,202 snippets with an average of 162 tokens each.
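
The 1000-character limit described above can be illustrated with a minimal sketch. This is not the LangChain splitter the authors used, and the exact splitter settings are not stated in this card; the function name `chunk_text` and the greedy word-packing strategy are assumptions chosen only to show what "snippets of no more than 1000 characters" means in practice.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    Hypothetical helper for illustration; it is not the LangChain splitter
    used to build this dataset.
    """
    chunks: list[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split into pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding the word would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    article = "Wikipedia is a free online encyclopedia. " * 100
    snippets = chunk_text(article, max_chars=1000)
    print(len(snippets), max(len(s) for s in snippets))
```

Splitting on whitespace keeps words intact at the cost of slightly uneven chunk lengths; a production splitter would typically also prefer paragraph and sentence boundaries.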
### Dataset Structure

Each row is a snippet of Wikipedia, which includes the following features:

- id: a unique identifier of the snippet
- title: the title of the Wikipedia article from which the snippet is collected
- content: the content of the snippet
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever

## Uses

### Direct Use

```shell
git clone https://huggingface.co/datasets/MedRAG/wikipedia
```

### Use in MedRAG

```python
from src.medrag import MedRAG

question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

# Retrieve from the Wikipedia corpus with MedCPT and answer with the chosen LLM
medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="Wikipedia")
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)  # scores are given by the retrieval system
```

## Citation

```bibtex
@article{xiong2024benchmarking,
    title={Benchmarking Retrieval-Augmented Generation for Medicine},
    author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
    journal={arXiv preprint arXiv:2402.13178},
    year={2024}
}
```
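
To make the row layout listed under Dataset Structure concrete, the following sketch builds one row with the four documented features. The helper name `make_row`, the example `id` value, and the `". "` separator between title and content are all assumptions for illustration; the card states only that `contents` is a concatenation of `title` and `content`.

```python
def make_row(snippet_id: str, title: str, content: str, sep: str = ". ") -> dict[str, str]:
    """Build one snippet row with the four features documented in the card.

    Hypothetical helper: the separator used in the real dataset is not
    specified here, so `sep` is an assumption.
    """
    return {
        "id": snippet_id,
        "title": title,
        "content": content,
        # Single text field of the kind a BM25 retriever would index.
        "contents": f"{title}{sep}{content}",
    }


if __name__ == "__main__":
    row = make_row(
        "example_0",  # hypothetical id; the real id format is not shown in the card
        "Facial nerve",
        "The facial nerve is the seventh cranial nerve.",
    )
    print(row["contents"])
```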
