disi-unibo-nlp/medqa-MedGENIE

Name: disi-unibo-nlp/medqa-MedGENIE
Creator: disi-unibo-nlp
Published: 2024-05-17 07:37:43
License: MIT

Source: Hugging Face. Updated 2024-05-17; indexed 2024-06-22.

Download link:

https://hf-mirror.com/datasets/disi-unibo-nlp/medqa-MedGENIE


Resource description:

---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: question
    dtype: string
  - name: target
    dtype: string
  - name: answers
    sequence: string
  - name: ctxs
    list:
    - name: text
      dtype: string
  splits:
  - name: train
    num_bytes: 75592146
    num_examples: 10178
  - name: validation
    num_bytes: 9526548
    num_examples: 1272
  - name: test
    num_bytes: 9660480
    num_examples: 1273
  download_size: 5680157
  dataset_size: 94779174
license: mit
task_categories:
- question-answering
language:
- en
tags:
- medical
---

# Dataset Card for "medqa-MedGENIE"

## Dataset Description

The data is part of the MedGENIE collection of medical datasets augmented with artificial contexts generated by [PMC-LLaMA-13B](https://huggingface.co/axiong/PMC_LLaMA_13B). Specifically, up to 5 artificial contexts were generated for each question in [MedQA-USMLE](https://github.com/jind11/MedQA) (4 options), employing a multi-view approach to cover the various perspectives associated with the given question.

For more information, refer to our paper ["**To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering**"](https://arxiv.org/abs/2403.01924)

## Dataset Structure

The dataset has three splits, suitable for:

* Training *question-answering* models, including *fusion-in-decoder* architectures.
* Augmenting your LLMs during inference with generated contexts rather than retrieved chunks.
* Augmenting your knowledge base of factual documents with generated contexts for a standard RAG pipeline.

The number of examples per split is:

- **train:** 10178 samples
- **validation:** 1272 samples
- **test:** 1273 samples

The dataset is stored in parquet format with each entry using the following schema:

```
{
    "id": 0,
    "question": "A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7\u00b0F (36.5\u00b0C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?\nA. Ampicillin\nB. Ceftriaxone\nC. Doxycycline\nD. Nitrofurantoin",
    "target": "D",
    "answers": [
        "D"
    ],
    "ctxs": [
        {
            "text": "The burning upon urination in a pregnant female is often due to asymptomatic bacteriuria that results in a urinary tract infection (UTI). Such UTIs must be aggressively treated because of their association with preterm labor..."
        },
        {
            "text": "This patient has urinary tract infection (UTI) symptoms, which is a common condition in pregnancy.\n- Nitrofurantoin and cephalexin are considered safe for use during pregnancy. Ceftriaxone and ampicillin can cross the placenta..."
        },
        {
            "text": "Asymptomatic bacteriuria is defined as the presence of a positive urine culture in an asymptomatic patient. The most common complication from untreated asymptomatic bacteriuria is a UTI during pregnancy which can result in kidney..."
        },
        {
            "text": "Asymptomatic bacteriuria is a frequent finding in pregnancy. Treatment is not recommended unless there are signs of an upper urinary tract infection, ie, fever (temperature >99\u00b0F/37\u00b0C), flank pain or tenderness, or pyuria... "
        },
        {
            "text": "Asymptomatic bacteriuria is present if a patient has persistent (>2 weeks) bacteria in the urine as documented by a positive urine culture with no symptoms. In pregnancy, even if asymptomatic, bacteriuria increases the risk of pyelonephritis..."
        }
    ]
}
```

## Augmenting LLMs during inference

Augmenting *state-of-the-art* LLMs with generated contexts from both **medqa-MedGENIE** and [medmcqa-MedGENIE](https://huggingface.co/datasets/disi-unibo-nlp/medmcqa-MedGENIE/blob/main/README.md) produced a substantial accuracy gain. For a given question, all relevant contexts are concatenated and passed within the context window of the LLM.

| Model | Learning | medqa-5-opt-MedGENIE | Accuracy |
|------|------|-----|-----|
| LLaMA-2-chat (7B) | 2-shot | NO | 36.9 |
| LLaMA-2-chat (7B) | 2-shot | YES | 52.4 **(+15.5)** |
| Zephyr-β (7B) | 2-shot | NO | 49.3 |
| Zephyr-β (7B) | 2-shot | YES | 59.7 **(+10.4)** |

## Evaluation for RAG

To assess the effectiveness of our generated contexts in a RAG pipeline, we augment the [MedWiki](https://huggingface.co/datasets/VOD-LM/medwiki) dataset with a smaller portion of artificially generated chunks derived from the train and test sets of **medqa-MedGENIE** and [medmcqa-MedGENIE](https://huggingface.co/datasets/disi-unibo-nlp/medmcqa-MedGENIE).

| MedWiki chunks | Artificial chunks | Rerank | LLaMA-2-chat (7B) | mistral-instruct (7B) | Zephyr-β (7B) |
|------|-----|----------------|-------------------|-----------------------|---------------------|
| 4.5M | - | NO | 37.2 | 45.1 | 50.4 |
| 4.5M | 96K (only test) | NO | 40.2 **(+3.0)** | 44.9 | 50.5 **(+0.1)** |
| 4.5M | 2M (train + test) | NO | 40.8 **(+3.6)** | 44.4 | 51.0 **(+0.6)** |
| 4.5M | - | YES | 36.3 | 44.6 | 50.5 |
| 4.5M | 96K (only test) | YES | 41.4 **(+5.1)** | 45.6 **(+1.0)** | 50.8 **(+0.3)** |
| 4.5M | 2M (train + test) | YES | 40.5 **(+4.2)** | 45.9 **(+1.3)** | 51.2 **(+0.7)** |

## Citation

If you find this dataset useful in your work, please cite it with:

```
@misc{frisoni2024generate,
      title={To Generate or to Retrieve? On the Effectiveness of Artificial Contexts for Medical Open-Domain Question Answering},
      author={Giacomo Frisoni and Alessio Cocchieri and Alex Presepi and Gianluca Moro and Zaiqiao Meng},
      year={2024},
      eprint={2403.01924},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Provided by:

disi-unibo-nlp

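Summary of the original information

Dataset card for "medqa-MedGENIE"

Dataset description

This dataset is part of the MedGENIE collection of medical datasets and is augmented with synthetic contexts generated by PMC-LLaMA-13B. Specifically, up to 5 synthetic contexts were generated for each question in MedQA-USMLE (4 options), using a multi-view approach to cover the various perspectives associated with the given question.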

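Dataset structure

The dataset has three splits, suitable for:

Training question-answering models, including fusion-in-decoder architectures.
Augmenting LLMs during inference with generated contexts rather than retrieved chunks.
Augmenting a knowledge base of factual documents with generated contexts for a standard RAG pipeline.

The number of samples per split is:

Train: 10178 samples
Validation: 1272 samples
Test: 1273 samples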

The dataset is stored in parquet format. Each entry follows the schema shown in the dataset card above, with the fields `id`, `question`, `target`, `answers`, and `ctxs` (a list of objects, each holding one generated context in its `text` field).

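Collected summary

Dataset introduction

Construction

In the medical domain, high-quality question-answering datasets are essential for building reliable clinical decision support systems. The medqa-MedGENIE dataset builds on the MedQA-USMLE question bank: PMC-LLaMA-13B generates up to five artificial contexts per question, using a multi-view strategy to cover the different medical perspectives relevant to that question. The resulting augmented dataset has train, validation, and test splits with 10178, 1272, and 1273 samples respectively. It is stored in parquet format, and each record contains the question, the correct answer, a list of answers, and the generated context texts.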
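As a rough illustration of the record layout described above, the sketch below checks a single record against the documented fields. The helper name `validate_record`, the `max_contexts` parameter, and the shortened sample record are assumptions made for this example; they are not part of the dataset or its tooling.

```python
from typing import Any

# Top-level fields and their Python types, following the schema in the dataset card.
EXPECTED_FIELDS = {
    "id": int,
    "question": str,
    "target": str,
    "answers": list,
    "ctxs": list,
}


def validate_record(record: dict[str, Any], max_contexts: int = 5) -> list[str]:
    """Return a list of schema problems for one record (empty when it looks valid)."""
    problems: list[str] = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # the nested checks below assume a sound top level

    if not all(isinstance(a, str) for a in record["answers"]):
        problems.append("answers must be a list of strings")
    if len(record["ctxs"]) > max_contexts:
        problems.append(f"more than {max_contexts} contexts")
    for i, ctx in enumerate(record["ctxs"]):
        if not isinstance(ctx, dict) or not isinstance(ctx.get("text"), str):
            problems.append(f"ctxs[{i}] must be an object with a string 'text'")
    return problems


# Hypothetical, shortened record in the documented shape.
sample = {
    "id": 0,
    "question": "Which of the following is the best treatment for this patient?\n"
                "A. Ampicillin\nB. Ceftriaxone\nC. Doxycycline\nD. Nitrofurantoin",
    "target": "D",
    "answers": ["D"],
    "ctxs": [{"text": "This patient has urinary tract infection (UTI) symptoms..."}],
}

print(validate_record(sample))  # prints []
```

Records loaded with the Hugging Face `datasets` library, for example via `load_dataset("disi-unibo-nlp/medqa-MedGENIE", split="validation")`, could be passed to the same function one at a time.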

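Features

The dataset's core feature is that its generated artificial contexts stand in for the document passages a retrieval system might return, while the multi-view generation strategy aims for broad and diverse coverage of each question. Compared with conventional retrieval-augmented generation, the contexts in medqa-MedGENIE are tied more closely to the core of the question and can supply key medical knowledge that sparse retrieval may miss. Experiments show that feeding the generated contexts to large language models at inference time substantially improves accuracy on the MedQA benchmark. For example, LLaMA-2-chat (7B) gains 15.5 percentage points in the 2-shot setting, rising from 36.9 to 52.4.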
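To make the size of these gains concrete, the short sketch below recomputes them from the 2-shot accuracy figures reported in the dataset card, separating the absolute gain in percentage points from the relative gain in percent. The function name `gains` and the `RESULTS` dictionary are illustrative choices for this example.

```python
# Accuracy values (percent) from the 2-shot table in the dataset card:
# (without generated contexts, with generated contexts).
RESULTS = {
    "LLaMA-2-chat (7B)": (36.9, 52.4),
    "Zephyr-β (7B)": (49.3, 59.7),
}


def gains(baseline: float, augmented: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(augmented - baseline, 1)
    relative = round(100 * (augmented - baseline) / baseline, 1)
    return absolute, relative


for model, (baseline, augmented) in RESULTS.items():
    absolute, relative = gains(baseline, augmented)
    print(f"{model}: +{absolute} points ({relative}% relative)")
```

The distinction matters when comparing models: a 15.5-point gain from a 36.9 baseline is a relative improvement of about 42%, while a 10.4-point gain from 49.3 is about 21%.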

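Usage

The dataset can be applied in several ways: training question-answering models for the medical domain, in particular fusion-in-decoder architectures; augmenting large language models at inference time with generated contexts in place of conventionally retrieved passages; or adding the generated contexts to a knowledge base for a standard RAG pipeline. In the inference setting, all contexts relevant to a question are concatenated and passed to the model within its context window. The dataset can also be combined with medical document collections such as MedWiki, where extending the corpus with artificially generated chunks improves retrieval-augmented generation, yielding accuracy gains of roughly 3 to 5 percentage points on LLaMA-2-chat (7B).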
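The concatenation step described above can be sketched as follows. The prompt layout ("Context:", "Question:", "Answer:"), the function name `build_prompt`, and the shortened record are assumptions for illustration; the source does not specify the exact template used in the experiments.

```python
def build_prompt(question: str, contexts: list[str], separator: str = "\n\n") -> str:
    """Concatenate all generated contexts and place them before the question."""
    # Skip empty or whitespace-only contexts so they do not waste context window.
    cleaned = [c.strip() for c in contexts if c and c.strip()]
    context_block = separator.join(cleaned)
    return (
        f"Context:\n{context_block}\n\n"
        f"Question:\n{question.strip()}\n\n"
        "Answer:"
    )


# Hypothetical, shortened record in the dataset's documented shape.
record = {
    "question": "Which of the following is the best treatment for this patient?\n"
                "A. Ampicillin\nB. Ceftriaxone\nC. Doxycycline\nD. Nitrofurantoin",
    "ctxs": [
        {"text": "Nitrofurantoin and cephalexin are considered safe for use during pregnancy."},
        {"text": "  "},
        {"text": "In pregnancy, bacteriuria increases the risk of pyelonephritis."},
    ],
}

prompt = build_prompt(record["question"], [c["text"] for c in record["ctxs"]])
print(prompt)
```

In a few-shot setting such as the 2-shot runs reported in the card, solved demonstrations in the same layout would be placed before this prompt.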

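Background and challenges

Background overview

In natural language processing, combining retrieval-augmented generation with large language models has become an important way to improve question-answering systems, especially in knowledge-intensive settings. In medicine, however, terminology is complex, knowledge changes quickly, and accuracy requirements are very high, so retrieval-based methods often suffer from mismatched or sparse context. To address this, researchers at the University of Bologna's DISI, including Giacomo Frisoni and Alessio Cocchieri, introduced the MedGENIE dataset series in 2024. As a core component of that series, medqa-MedGENIE uses PMC-LLaMA-13B to generate up to five multi-view artificial contexts for each question in MedQA-USMLE, with the goal of studying how effective generated contexts are compared with retrieved ones in medical open-domain question answering. The dataset contains about 12,700 samples across its train, validation, and test splits. The accompanying study shows that generated contexts can substantially raise the accuracy of models such as LLaMA-2-chat on medical question answering, by up to 15.5 percentage points.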

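Current challenges

The core domain challenge that medqa-MedGENIE addresses is the unreliability of context in medical question answering: passages that a conventional retrieval system pulls from a knowledge base often drift away from the focus of the question or carry redundant noise, which distorts LLM reasoning in high-stakes settings such as clinical decision-making and diagnosis. Building the dataset involved several technical hurdles. First, generating contexts with PMC-LLaMA-13B that are both medically factual and cover multiple perspectives on a question requires balancing accuracy against diversity and avoiding errors introduced by model hallucination. Second, the dataset has to work with different architectures, including fusion-in-decoder models and standard RAG pipelines, so the context structure must plug in cleanly at both training and inference time. Finally, evaluating how generated contexts interact with genuinely retrieved documents requires a large mixed knowledge base and carefully controlled comparison experiments to quantify the gains, which involves trade-offs between compute and data scale.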

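Common scenarios

Classic use case

In biomedical natural language processing, the main application of medqa-MedGENIE is to provide generated contexts for open-domain question-answering systems. Built on the MedQA-USMLE question bank, the dataset uses PMC-LLaMA-13B to generate up to five multi-view artificial contexts for each four-option medical question, which can replace or complement a conventional retrieval knowledge base during training and inference. This lets a model reason directly over generated content without relying on external document retrieval, and it is particularly suited to training and evaluating architectures such as Fusion-in-Decoder. The dataset is split into train (10178 examples), validation (1272 examples), and test (1273 examples) sets; each sample contains the question, the correct answer, and a list of contexts, giving medical QA research a standardized benchmark.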
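Fusion-in-Decoder models encode each question-passage pair independently and let the decoder attend over all encoded passages jointly. A minimal sketch of preparing one record for such a model is shown below. The `question: ... context: ...` text template and the helper name `fid_inputs` are assumptions for illustration, not the authors' documented preprocessing.

```python
def fid_inputs(record: dict, max_contexts: int = 5) -> list[str]:
    """Build one encoder input string per generated context."""
    question = record["question"].strip()
    passages = [c["text"].strip() for c in record["ctxs"][:max_contexts]]
    # Each string is encoded separately; fusion happens later, in the decoder.
    return [f"question: {question} context: {passage}" for passage in passages]


# Hypothetical, shortened record in the dataset's documented shape.
record = {
    "question": "Which of the following is the best treatment for this patient?",
    "target": "D",
    "ctxs": [
        {"text": "Nitrofurantoin and cephalexin are considered safe for use during pregnancy."},
        {"text": "In pregnancy, bacteriuria increases the risk of pyelonephritis."},
    ],
}

encoder_inputs = fid_inputs(record)
decoder_target = record["target"]  # the answer letter the model is trained to produce

for text in encoder_inputs:
    print(text)
```

Unlike the single concatenated prompt used for LLM inference, this layout keeps every context in its own encoder input, so the encoder cost grows linearly with the number of contexts.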

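Derived and related work

medqa-MedGENIE has given rise to several lines of work. Its core paper, "To Generate or to Retrieve?", presents a systematic comparison of generated and retrieved contexts for medical question answering and proposes the multi-view context generation strategy. Further experiments explore a hybrid RAG pipeline that merges generated contexts with knowledge bases such as MedWiki: with reranking enabled, adding 96K artificial chunks derived from the test sets raises LLaMA-2-chat (7B) accuracy by 5.1 percentage points (36.3 to 41.4). Together with medmcqa-MedGENIE, the dataset forms the MedGENIE series, which has advanced research on context augmentation for large language models in medicine. Some work has reportedly also borrowed the generation approach, applying artificial contexts to medical question answering in low-resource languages and to knowledge injection in cross-modal (for example image-text) question-answering systems.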
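The hybrid knowledge base mentioned above can be sketched as a simple merge of retrieved-corpus passages and artificial chunks flattened out of the dataset records. The function names, the `source` tagging scheme, and the toy inputs are assumptions for illustration; the source does not describe how the authors actually built or indexed their corpus.

```python
def artificial_chunks(records: list[dict], split: str) -> list[dict]:
    """Flatten the generated contexts of one split into tagged corpus chunks."""
    seen: set[str] = set()
    chunks = []
    for record in records:
        for ctx in record["ctxs"]:
            text = ctx["text"].strip()
            if text and text not in seen:  # drop empty and exact-duplicate chunks
                seen.add(text)
                chunks.append(
                    {"text": text, "source": f"artificial/{split}", "question_id": record["id"]}
                )
    return chunks


def build_mixed_corpus(retrieved_texts: list[str], records_by_split: dict[str, list[dict]]) -> list[dict]:
    """Combine conventional corpus passages with artificial chunks from the chosen splits."""
    corpus = [{"text": t, "source": "medwiki", "question_id": None} for t in retrieved_texts]
    for split, records in records_by_split.items():
        corpus.extend(artificial_chunks(records, split))
    return corpus


# Toy inputs standing in for MedWiki passages and dataset records.
wiki = ["Nitrofurantoin is an antibiotic used to treat urinary tract infections."]
test_records = [
    {"id": 0, "ctxs": [{"text": "Context A"}, {"text": "Context B"}]},
    {"id": 1, "ctxs": [{"text": "Context A"}, {"text": "Context C"}]},
]

corpus = build_mixed_corpus(wiki, {"test": test_records})
print(len(corpus), [c["source"] for c in corpus])
```

Passing only `{"test": ...}` or both `{"train": ..., "test": ...}` mirrors the "only test" and "train + test" configurations in the evaluation table; the resulting list would then be handed to whatever retriever and optional reranker the pipeline uses.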

Recent research on the dataset