
minsu/medrag_wikipedia

Hugging Face | Updated 2026-02-13 | Indexed 2026-03-29
Download link:
https://hf-mirror.com/datasets/minsu/medrag_wikipedia
Resource description:
---
task_categories:
- question-answering
language:
- en
tags:
- medical
- question answering
- large language model
- retrieval-augmented generation
size_categories:
- 10M<n<100M
---

# The Wikipedia Corpus in MedRAG

This HF dataset contains the chunked snippets from the Wikipedia corpus used in [MedRAG](https://arxiv.org/abs/2402.13178). It can be used for medical Retrieval-Augmented Generation (RAG).

## News

- (02/26/2024) The "id" column has been reformatted. A new "wiki_id" column has been added.

## Dataset Details

### Dataset Descriptions

As a large-scale open-source encyclopedia, Wikipedia is frequently used as a corpus in information retrieval tasks. We select Wikipedia as one of the corpora to see whether a general-domain database can improve performance on medical QA. We downloaded the processed Wikipedia data from [HuggingFace](https://huggingface.co/datasets/wikipedia) and chunked the text using [LangChain](https://www.langchain.com/) into snippets of no more than 1000 characters. This HF dataset contains our ready-to-use chunked snippets for the Wikipedia corpus, including 29,913,202 snippets with an average of 162 tokens.

### Dataset Structure

Each row is a snippet of Wikipedia, which includes the following features:

- id: a unique identifier of the snippet
- title: the title of the Wikipedia article from which the snippet is collected
- content: the content of the snippet
- contents: a concatenation of 'title' and 'content', which will be used by the [BM25](https://github.com/castorini/pyserini) retriever

## Uses

### Direct Use

```shell
git clone https://huggingface.co/datasets/MedRAG/wikipedia
```

### Use in MedRAG

```python
from src.medrag import MedRAG

question = "A lesion causing compression of the facial nerve at the stylomastoid foramen will cause ipsilateral"
options = {
    "A": "paralysis of the facial muscles.",
    "B": "paralysis of the facial muscles and loss of taste.",
    "C": "paralysis of the facial muscles, loss of taste and lacrimation.",
    "D": "paralysis of the facial muscles, loss of taste, lacrimation and decreased salivation."
}

medrag = MedRAG(llm_name="OpenAI/gpt-3.5-turbo-16k", rag=True, retriever_name="MedCPT", corpus_name="Wikipedia")
answer, snippets, scores = medrag.answer(question=question, options=options, k=32)  # scores are given by the retrieval system
```

## Citation

```bibtex
@article{xiong2024benchmarking,
    title={Benchmarking Retrieval-Augmented Generation for Medicine},
    author={Guangzhi Xiong and Qiao Jin and Zhiyong Lu and Aidong Zhang},
    journal={arXiv preprint arXiv:2402.13178},
    year={2024}
}
```
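
## Appendix: row construction sketch

To make the row layout in Dataset Structure concrete, here is a minimal sketch of how an article could be split into snippets of at most 1000 characters and turned into rows with the `id`, `title`, `content`, and `contents` fields. It is an illustration under stated assumptions, not the authors' pipeline. The card says LangChain was used for chunking, whereas this sketch substitutes a simple greedy word packer. The helper names `chunk_text` and `build_rows`, the `<wiki_id>_<index>` id scheme, and the `". "` separator in `contents` are all hypothetical.

```python
from typing import Dict, List

MAX_CHARS = 1000  # snippet size limit stated in the dataset card


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_chars.

    This is a stand-in for the LangChain splitter mentioned in the card,
    not a reimplementation of it.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-split any single word that alone exceeds the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def build_rows(wiki_id: str, title: str, text: str) -> List[Dict[str, str]]:
    """Build one row per snippet with the four fields listed in the card."""
    rows: List[Dict[str, str]] = []
    for index, content in enumerate(chunk_text(text)):
        rows.append({
            "id": f"{wiki_id}_{index}",         # hypothetical id scheme
            "title": title,
            "content": content,
            "contents": f"{title}. {content}",  # assumed separator for the BM25 field
        })
    return rows
```

Calling `build_rows("wiki-0", "Facial nerve", article_text)` on an article string returns a list of dictionaries that mirror the four documented columns. The actual dataset may format `id` and `contents` differently.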
Provider:
minsu