casey-martin/MedInstruct

Name: casey-martin/MedInstruct
Creator: casey-martin
Published: 2023-12-02 12:32:38
License: No description available

Source: Hugging Face. Updated 2023-12-02; indexed 2024-03-04.

Download link:

https://hf-mirror.com/datasets/casey-martin/MedInstruct

Resource description:

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/MedInstruct-52k.json
  - split: test
    path: data/MedInstruct-test.jsonl
task_categories:
- text-generation
language:
- en
tags:
- medical
---

# MedInstruct

<hr>

This is the repo for *MedInstruct*, a dataset of synthetically generated medical instructions. The repo contains:

- The 52K medical instruction-response dataset [*MedInstruct-52k*](https://github.com/XZhang97666/AlpaCare/blob/master/data/MedInstruct-52k.json) used for fine-tuning *AlpaCare*, and the corresponding [clinician-crafted seed tasks](https://github.com/XZhang97666/AlpaCare/blob/master/data/med_seed.json) used to generate the instructions.
- A free-form instruction evaluation test set of 217 clinician-crafted tasks, [*MedInstruct-test*](https://github.com/XZhang97666/AlpaCare/blob/master/data/MedInstruct-test.jsonl).
- The code for:
  1. [medical task generation](https://github.com/XZhang97666/AlpaCare/tree/master/test_generation);
  2. [fine-tuning LLaMA series models](https://github.com/XZhang97666/AlpaCare/tree/master/training);
  3. [instruction-tuned model response generation](https://github.com/XZhang97666/AlpaCare/tree/master/test_generation);
  4. [response evaluation via LLMs](https://github.com/XZhang97666/AlpaCare/tree/master/evaluation).

## Overview

*AlpaCare* comprises four models (7B and 13B, each based on LLaMA[1] and LLaMA-2[2]) tuned on the 52K medical instruction-following dataset *MedInstruct-52k*, following Alpaca[3] and Self-Instruct[4]. You can find our model weights at:

| Version | Link |
| --- | --- |
| *AlpaCare*-LLaMA_7B | [https://huggingface.co/xz97/AlpaCare-llama1-7b](https://huggingface.co/xz97/AlpaCare-llama1-7b) |
| *AlpaCare*-LLaMA2_7B | [https://huggingface.co/xz97/AlpaCare-llama2-7b](https://huggingface.co/xz97/AlpaCare-llama2-7b) |
| *AlpaCare*-LLaMA_13B | [https://huggingface.co/xz97/AlpaCare-llama-13b](https://huggingface.co/xz97/AlpaCare-llama-13b) |
| *AlpaCare*-LLaMA2_13B | [https://huggingface.co/xz97/AlpaCare-llama2-13b](https://huggingface.co/xz97/AlpaCare-llama2-13b) |

[1]: LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1

[2]: Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. https://arxiv.org/abs/2307.09288

[3]: Stanford Alpaca: An Instruction-following LLaMA Model. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. https://crfm.stanford.edu/2023/03/13/alpaca.html

[4]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560

## Data Release

[*MedInstruct*](https://huggingface.co/datasets/xz97/MedInstruct) contains:

- MedInstruct datasets:
  1. *MedInstruct-52K*: 52K medical instruction-following examples used for fine-tuning the *AlpaCare* models.
  2. *MedInstruct-test*: 217 clinician-crafted free-form instruction evaluation tasks, with reference responses generated by `gpt-4`, `gpt-3.5-turbo`, `text-davinci-003` and `claude-2`.

  Each file is a list of dictionaries in JSON/JSONL format. Each dictionary contains the following fields:
  - `instruction`: `str`, the medical task the model should perform. Each instruction in *MedInstruct-52K* and *MedInstruct-test* is unique. The instructions in *MedInstruct-52K* are generated by OpenAI `gpt-4`, while those in *MedInstruct-test* are clinician-crafted.
  - `input`: `str`, optional medical context or example for the task, which further explains the instruction and increases task diversity.
  - `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
  - `difficulity`: `str`, the medical knowledge difficulty level on a scale of 1 to 5, evaluated by clinicians following the [scoring system](plots/scoresystem.pdf). This field appears only in *MedInstruct-test*.
- Instructions:
  1. Clinician-crafted instruction seed set, containing 167 instances.
  2. All medical instructions generated by OpenAI `gpt-4`, with and without filtering.

## Data Generation Process

We follow the data generation pipeline from [self-instruct](https://github.com/yizhongw/self-instruct) and [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), with several modifications:

*Instructions*:

- We begin with a [medical task seed set](https://github.com/XZhang97666/AlpaCare/blob/master/data/med_seed.json) that contains 167 clinician-crafted clinical tasks.
- We use the GPT-4 engine for instruction data generation.
- We craft a [prompt](https://github.com/XZhang97666/AlpaCare/blob/master/prompt/instruction_prompt.txt) that serves as the medical instruction generation guidelines for GPT-4.
- To reduce task generation cost and introduce greater randomness, 12 instructions are generated per request, with 3 seed instructions selected at random from the seed set.

*Responses*:

- The GPT-3.5-turbo engine is used to synthesize responses, processing each instruction individually.

Here's an illustration of an instruction-response pair:

```
{
  "instruction": "Rewrite this instruction on how to use an inhaler to be more patient-friendly and less technical.",
  "input": "Exhale deeply, engage mouthpiece of inhaler, depress canister while simultaneously inhaling long and deep, hold the breath for ten seconds, then gently exhale",
  "output": "Here's a more patient-friendly version of the instruction on how to use an inhaler: 1. Take a deep breath and completely breathe out. 2. Put the mouthpiece of the inhaler into your mouth and seal your lips around it properly. 3. Press down the canister while breathing in slowly and deeply at the same time. 4. Hold your breath for ten seconds. 5. Finally, breathe out gently."
}
```
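
The batching step in the instruction generation process can be sketched as follows. This is a minimal illustration under stated assumptions, not the AlpaCare generation code: `build_generation_prompt`, the placeholder guideline text, the stand-in seed tasks, and the prompt layout are all hypothetical, and no model API is called. Only the two numbers (3 seed tasks sampled, 12 new instructions requested per call) come from the description above.

```python
import random

NUM_SEEDS_PER_PROMPT = 3    # seed tasks sampled per request (from the README)
NUM_TASKS_PER_REQUEST = 12  # new instructions requested per call (from the README)


def build_generation_prompt(seed_tasks, guidelines, rng):
    """Sample seed tasks and assemble one instruction-generation prompt.

    The prompt layout is an assumption made for illustration only.
    """
    seeds = rng.sample(seed_tasks, NUM_SEEDS_PER_PROMPT)
    lines = [guidelines.strip(), ""]
    lines.append(f"Generate {NUM_TASKS_PER_REQUEST} new medical tasks. Examples:")
    for i, task in enumerate(seeds, start=1):
        lines.append(f"{i}. Instruction: {task['instruction']}")
        if task.get("input"):  # the input field is optional
            lines.append(f"   Input: {task['input']}")
    return "\n".join(lines), seeds


# Hypothetical seed tasks standing in for the contents of med_seed.json.
seed_tasks = [{"instruction": f"Seed task {n}", "input": ""} for n in range(1, 6)]

prompt, seeds = build_generation_prompt(
    seed_tasks, "You write diverse medical tasks.", random.Random(0)
)
print(prompt)
```

Passing an explicit `random.Random` instance keeps the seed selection reproducible, which makes it easier to trace which seed tasks produced a given batch.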

Provider:

casey-martin

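Summary of Original Information

Dataset Overview

Dataset Name

MedInstruct

Dataset Contents

MedInstruct-52K: 52,000 medical instruction examples used to fine-tune the AlpaCare models.
MedInstruct-test: 217 clinician-crafted free-form instruction evaluation tasks, with reference responses.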

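Dataset Format

The data is stored in JSON/JSONL format. Each file contains a list of dictionaries, and each dictionary has the following fields:
- instruction: string, the medical task the model should perform.
- input: string, optional medical context or example for the task, which further explains the instruction and increases task diversity.
- output: string, the answer to the instruction as generated by text-davinci-003.
- difficulity: string, the medical knowledge difficulty level on a scale of 1 to 5; appears only in MedInstruct-test.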
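
A minimal sketch of reading both file types with the Python standard library is shown below. The helper names `load_records` and `missing_fields` are hypothetical, and the sketch assumes the `.json` file holds a single JSON array while the `.jsonl` file holds one JSON object per line. The temporary files only stand in for the real data files.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("instruction", "input", "output")


def load_records(path):
    """Load a list of dicts from a .json (one array) or .jsonl (one object per line) file."""
    text = Path(path).read_text(encoding="utf-8")
    if str(path).endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return json.loads(text)


def missing_fields(record):
    """Return the required field names that a record lacks."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Demo on throwaway files; the sample record is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"instruction": "Define hypertension.", "input": "", "output": "High blood pressure."}
    json_path = Path(tmp) / "sample.json"
    jsonl_path = Path(tmp) / "sample.jsonl"
    json_path.write_text(json.dumps([sample]), encoding="utf-8")
    jsonl_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    from_json = load_records(json_path)
    from_jsonl = load_records(jsonl_path)

print(from_json == from_jsonl, missing_fields(from_json[0]))
```

With the real files, the same call would be `load_records("data/MedInstruct-52k.json")` or `load_records("data/MedInstruct-test.jsonl")`, using the paths listed in the dataset configuration.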

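Data Generation Process

Instructions are generated with the GPT-4 engine.
Responses are synthesized with the GPT-3.5-turbo engine.

Dataset Use

Fine-tuning the AlpaCare models, which are based on LLaMA-series models.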

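Collected Summary

Dataset Introduction

Construction

In medical natural language processing, high-quality instruction datasets are essential for improving a model's domain competence. MedInstruct follows the Self-Instruct and Alpaca paradigms with targeted modifications. Construction starts from a seed set of 167 medical tasks written by clinicians. GPT-4 is then given a dedicated medical instruction generation prompt and synthesizes new medical instructions in batches. To increase diversity and control cost, each generation round randomly selects 3 seed instructions and produces 12 new instructions at once. The response to each instruction is generated individually by GPT-3.5-turbo, which yields the complete instruction-response pairs.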

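Usage

The dataset is released in standard JSON and JSONL formats, so it can be integrated directly into a training pipeline. For fine-tuning, load the 'MedInstruct-52k.json' file; each record contains the 'instruction', 'input' and 'output' fields and can be used directly for supervised instruction tuning. For evaluation, use the 'MedInstruct-test.jsonl' test set, which adds a 'difficulity' label to those fields and supports fine-grained analysis of model outputs by difficulty level. Together with the accompanying code, the dataset supports the full experimental workflow: task generation, model fine-tuning, response generation, and LLM-based evaluation.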
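
The two uses described above can be sketched as follows. This is an illustration under stated assumptions: the prompt text is the Stanford Alpaca template, which the AlpaCare training code is assumed (not confirmed here) to follow, and `to_training_text` and `group_by_difficulty` are hypothetical helper names. The field name `difficulity` is spelled as in the dataset README, and the demo records are made up.

```python
from collections import defaultdict

# Stanford Alpaca prompt templates; their use for AlpaCare is an assumption.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n"
    "### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def to_training_text(record):
    """Render one record as prompt plus target text for supervised fine-tuning."""
    if record.get("input"):
        prompt = PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    else:
        prompt = PROMPT_NO_INPUT.format(instruction=record["instruction"])
    return prompt + record["output"]


def group_by_difficulty(records, key="difficulity"):
    """Bucket test records by difficulty label; records without one go to 'unknown'."""
    buckets = defaultdict(list)
    for record in records:
        buckets[str(record.get(key, "unknown"))].append(record)
    return dict(buckets)


# Made-up records for demonstration only.
train_record = {
    "instruction": "Explain the term.",
    "input": "hypertension",
    "output": "High blood pressure.",
}
text = to_training_text(train_record)

test_records = [
    {"instruction": "a", "difficulity": "1"},
    {"instruction": "b", "difficulity": "3"},
    {"instruction": "c", "difficulity": "1"},
]
buckets = group_by_difficulty(test_records)
print({level: len(items) for level, items in sorted(buckets.items())})
```

Grouping by difficulty before scoring lets per-level results be reported separately, rather than as a single average over all 217 test tasks.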

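Background and Challenges

Background

In medical AI, high-quality instruction-following datasets are essential for making large language models usable in specialized settings. MedInstruct was created in 2023 by its research team to provide a dedicated resource for fine-tuning medical large language models, in the form of synthetically generated medical instruction-response pairs. The dataset builds on the Alpaca and Self-Instruct methods, uses GPT-4 to generate instructions, and relies on seed tasks curated by clinicians. Its core research question is how to efficiently build large-scale, diverse medical instruction data that improves model accuracy and reliability on complex medical tasks such as diagnostic support and patient education. The derived AlpaCare models have shown potential in medical text generation, and the dataset offers a data foundation and methodological reference for later work on domain-specific medical models.

Current Challenges

The medical instruction-following task that MedInstruct targets is inherently difficult: medical knowledge is highly specialized and context varies widely, so a model must understand medical terminology and also follow clinical reasoning and ethical norms. In building the dataset, the first challenge is ensuring that generated instructions are both medically accurate and diverse, which depends on the quality of the initial seed tasks and on careful design of the generation procedure. Second, response synthesis must balance automation efficiency against clinical credibility, so that the model does not produce misleading content or content that conflicts with medical guidelines. Finally, evaluation lacks a unified, authoritative performance benchmark for medical tasks, which makes data quality hard to quantify and compare. Together these factors are the core challenges in creating and applying the dataset.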

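Common Scenarios

Typical Use Cases

In medical natural language processing, MedInstruct supports instruction tuning. Its main use is training large language models to follow complex medical instructions, such as producing patient-friendly medical guidance or interpreting clinical text. With 52,000 synthetically generated medical instruction-response pairs, the dataset lets a model learn to understand and answer a wide range of medical queries, improving its performance in medical dialogue and knowledge reasoning.

Academic Problems Addressed

The dataset addresses the scarcity of instruction data in the medical domain. Medical text data has traditionally been limited to structured records or a small amount of dialogue, with few high-quality, diverse instruction-following samples. MedInstruct uses synthetic generation to build a large medical instruction set spanning multiple difficulty levels, providing a benchmark for studying the generalization, knowledge alignment and safe responses of medical language models. It helps move medical AI from passive retrieval toward active interaction and supports more reliable deployment of models in complex clinical scenarios.

Practical Applications

In real medical settings, models trained on MedInstruct can be applied to clinical decision support and patient education. For example, a model can generate personalized medication guidance, explain medical terms, or simulate doctor-patient conversations in response to instructions, helping medical staff work more efficiently. The dataset also provides a training basis for medical chatbots and automated medical record summarization tools, so that AI systems can handle unstructured medical text more naturally and accurately and reduce the risk of misreading medical information.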

Recent Research on the Dataset