
MedTutor

ModelScope community · updated 2025-11-27 · indexed 2025-11-22
Download link:
https://modelscope.cn/datasets/yale-nlp/MedTutor
Description:
# MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education

[**📄 Paper**](https://aclanthology.org/2025.emnlp-demos.24/) | [**💻 Code**](https://github.com/yale-nlp/medical-rag) | [**🎬 Videos**](https://www.youtube.com/watch?v=7NlCjVf8V4E)

## Introduction

The training process for medical residents presents significant challenges, requiring both the interpretation of complex case reports and the rapid acquisition of accurate medical knowledge. Residents often find it time-consuming and difficult to locate relevant educational materials and evidence to support their learning for specific cases.

To address this, we created **MedTutor**, a novel system that augments resident training by automatically generating evidence-based educational content and multiple-choice questions (MCQs) from clinical case reports. MedTutor leverages a **Retrieval-Augmented Generation (RAG)** pipeline to transform any given clinical report into a concise, reliable, and highly relevant educational module.

This repository contains the dataset generated by the MedTutor system. It is designed to serve as a valuable benchmark resource for evaluating the quality and clinical utility of generative models in medicine.

![Figure 1: MedTutor System Architecture](figure1.png)

*Figure 1: The overall architecture of the MedTutor system, showing the 4 main stages: Query Generation, Retrieval, Generation, and Evaluation.*

## Dataset Structure

The dataset is provided as a collection of JSON files. Each file contains results from a specific **generator model** (which creates the content) and **annotator model** (which scores the content), reflected in the filename `[generator_model]_[annotator_model].json`.

### Data Fields

A single data instance within a file includes the following fields:

* `case_id`: A unique identifier for the case, corresponding to the original source dataset.
* `source_dataset`: The original dataset the case was sourced from (`mimic-cxr`, `mimic-iv-note`, or `chexpert`).
* `original_keywords`: A list of primary diagnostic keywords extracted from the original report by an LLM.
* `evidence_reranked_papers`: Supporting evidence retrieved from academic literature (PubMed, Semantic Scholar) and reranked for relevance.
* `evidence_retrieved_textbook_pages`: Supporting evidence retrieved from a local knowledge base of medical textbooks.
* `generated_textbook_summaries`: Concise, query-focused summaries of the retrieved textbook content, created by the generator model.
* `generated_final_feedback`: A comprehensive educational module synthesizing all retrieved evidence in the context of the original report.
* `generated_mcqs`: A set of multiple-choice questions designed to test understanding of the key concepts in the report.
* `annotation`: A dictionary of quality scores provided by the annotator model, evaluating various aspects of the generated content (e.g., `final_feedback_quality`, `mcq_quality`).

### Data Instance Example

```json
{
  "case_id": "s59802",
  "source_dataset": "mimic-cxr",
  "original_keywords": [
    "Small right apical pneumothorax"
  ],
  "evidence_reranked_papers": { "...": [] },
  "evidence_retrieved_textbook_pages": { "...": [] },
  "generated_textbook_summaries": {
    "Small right apical pneumothorax": "A small right apical pneumothorax is characterized by..."
  },
  "generated_final_feedback": "### Small right apical pneumothorax\n\n**Clinical Teaching Points:**...",
  "generated_mcqs": "### Multiple Choice Questions\n\n#### Small right apical pneumothorax\n\nQ1. What is...",
  "annotation": {
    "keyword_appropriateness": { "...": 5 },
    "paper_relevance": { "...": 4 },
    "textbook_summary_quality": { "...": 3 },
    "mcq_quality": { "...": 4 },
    "final_feedback_quality": 3
  }
}
```

## 🚀 Usage Guide

Due to licensing restrictions, the original radiology reports are not included directly in this repository. To get the complete dataset with the reports, please follow the steps below.
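The release files can be inspected on their own, before any original reports are linked in. The following is a minimal sketch, not part of the MedTutor code: the helper names (`parse_models`, `load_instances`, `mean_scores`) are hypothetical, and it assumes that the top level of each file is either a list of instances or a dictionary of instances, that the generator model name contains no underscore, and that annotation scores are numeric as in the example above.

```python
import json
from pathlib import Path
from statistics import mean


def parse_models(path):
    """Split '[generator_model]_[annotator_model].json' into (generator, annotator)."""
    stem = Path(path).stem
    # Drop leading bracketed tags such as '[RELEASE]' or '[PUBLIC][RELEASE]'.
    while stem.startswith("[") and "]" in stem:
        stem = stem.split("]", 1)[1]
    # Assumption: the generator name itself contains no underscore.
    generator, _, annotator = stem.partition("_")
    return generator, annotator


def load_instances(path):
    """Load all data instances from one release file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Assumption: the top level is a list of instances or a dict of instances.
    return list(data.values()) if isinstance(data, dict) else list(data)


def mean_scores(instances):
    """Average every annotation dimension over the given instances."""
    buckets = {}
    for inst in instances:
        for name, value in inst.get("annotation", {}).items():
            # Most dimensions map keywords to scores; final_feedback_quality
            # is a single number in the example instance.
            scores = value.values() if isinstance(value, dict) else [value]
            numeric = [s for s in scores if isinstance(s, (int, float))]
            buckets.setdefault(name, []).extend(numeric)
    return {name: mean(vals) for name, vals in buckets.items() if vals}


if __name__ == "__main__":
    # In-memory sample mirroring the shape of the example instance.
    sample = [{
        "case_id": "s59802",
        "annotation": {"mcq_quality": {"kw": 4}, "final_feedback_quality": 3},
    }]
    print(parse_models("[RELEASE]qwen3-32b_gemini2.5-pro.json"))
    print(mean_scores(sample))
```

With a downloaded file, `mean_scores(load_instances(path))` would give one average per annotation dimension, which is a quick way to compare generator and annotator pairings across files.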
#### Step 1: Download Original Datasets

Download the source datasets from their official websites. This requires applying for credentialed access on PhysioNet for the MIMIC datasets.

- MIMIC-CXR v2.1.0: [Link to source](https://physionet.org/content/mimic-cxr/2.1.0/)
- MIMIC-IV Note v2.2: [Link to source](https://www.physionet.org/content/mimic-iv-note/2.2/)
- CheXpert-Plus: [Link to source](https://stanfordaimi.azurewebsites.net/datasets/5158c524-d3ab-4e02-96e9-6ee9efc110a1)

*Note: The public release of this dataset was generated using 2,000 clinical reports from each of the three datasets listed above. Due to licensing and de-identification challenges, reports from our internal Yale Hospital dataset and the ReXGradient dataset are not included.*

#### Step 2: Set Up Directory Structure

Create a root directory and place the downloaded datasets and the files from this repository in it as follows:

```
.
├── MIMIC-CXR/
│   └── files/                  <-- Contains p10/, p11/, etc.
├── MIMIC-IV-Note/
│   └── radiology.csv
├── CheXpert-Plus/
│   └── df_chexpert_plus_240401.csv
└── MedTutor_Dataset/           <-- Save all MedTutor dataset files here
    ├── [RELEASE]...json
    └── link_dataset.py
```

#### Step 3: Run the Finalization Script

Use the provided `link_dataset.py` script to automatically link the original reports to our dataset files. Run the command from the `MedTutor_Dataset` directory.

```bash
# The filename is quoted so the shell does not treat [RELEASE] as a glob pattern.
python link_dataset.py \
    --input_json "[RELEASE]qwen3-32b_gemini2.5-pro.json" \
    --data_root ../
```

This will create a new file, `[PUBLIC][RELEASE]...json`, which contains the complete data including the `original_reviewer_report` field.

### Contact

If you have any questions or suggestions, please don't hesitate to let us know.
You can post an issue on this repository, or contact us directly at:

- Dongsuk Jang: jamesjang26@snu.ac.kr

### Citation

If you use the MedTutor system or dataset in your research, please cite our paper:

```
@inproceedings{jang-etal-2025-medtutor,
    title = "{M}ed{T}utor: A Retrieval-Augmented {LLM} System for Case-Based Medical Education",
    author = "Jang, Dongsuk and Shangguan, Ziyao and Tegtmeyer, Kyle and Gupta, Anurag and Czerminski, Jan T and Chheang, Sophie and Cohan, Arman",
    editor = {Habernal, Ivan and Schulam, Peter and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.24/",
    pages = "319--353",
    ISBN = "979-8-89176-334-0"
}
```

### License

The MedTutor dataset and code are licensed under the ODC-BY License. The original report texts are subject to the licenses of their respective sources (PhysioNet, Stanford AIMI).
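
### Checking the Directory Layout

Returning to Step 2 of the usage guide: the expected layout can be checked before running `link_dataset.py`. The following is a minimal sketch, not part of the repository; the function name `missing_paths` is hypothetical, and the path list is copied from the directory tree in Step 2. It only tests that each entry exists and does not validate file contents.

```python
from pathlib import Path

# Entries taken from the directory tree in Step 2, relative to the root directory.
EXPECTED = [
    "MIMIC-CXR/files",
    "MIMIC-IV-Note/radiology.csv",
    "CheXpert-Plus/df_chexpert_plus_240401.csv",
    "MedTutor_Dataset/link_dataset.py",
]


def missing_paths(root):
    """Return the expected entries that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # When run from MedTutor_Dataset/, the root is the parent directory,
    # matching the --data_root ../ argument used in Step 3.
    missing = missing_paths("..")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Directory layout looks complete.")
```

An empty result means every entry from the tree is present; any listed path points to a dataset that still needs to be downloaded or moved.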

Provider: maas
Created: 2025-11-06