bio-nlp-umass/bioinstruct

Name: bio-nlp-umass/bioinstruct
Creator: bio-nlp-umass
Published: 2024-06-06 16:22:42
License: MIT

Source: Hugging Face. Updated 2024-06-06; indexed 2024-06-15.

Download link:

https://hf-mirror.com/datasets/bio-nlp-umass/bioinstruct


Resource description:

---
license: mit
task_categories:
- text-generation
- question-answering
- summarization
- zero-shot-classification
language:
- en
tags:
- medical
- clinical
- healthcare
- instruction-finetuning
- multi-task learning
size_categories:
- 10K<n<100K
---

# Dataset Card for BioInstruct

GitHub repo: https://github.com/bio-nlp/BioInstruct

## Dataset Summary

[BioInstruct](https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocae122/7687618) is a dataset of 25k instructions and demonstrations generated by OpenAI's GPT-4 engine in July 2023. This instruction data can be used to instruction-tune language models (e.g. Llama) so that they follow biomedical instructions better. Improvements of Llama on 9 common biomedical tasks are shown in the [results section](https://arxiv.org/pdf/2310.19975).

Taking inspiration from [Self-Instruct](https://github.com/yizhongw/self-instruct), the collection of BioInstruct is a fully automated process. This process requires only an initial set of 80 manually constructed seed tasks, which can be produced within roughly three hours of human effort. These seed examples span a diverse range of biomedical and clinical NLP tasks, covering areas such as answering biomedical questions, summarizing, assessing eligibility for clinical trials, and determining differential diagnoses. During the data collection phase, we prompted the pretrained GPT-4 language model with three examples randomly selected from the seed tasks, guiding it to generate new samples.

Among the GPT-4 created instructions, we plot the top 20 most common root verbs and their top 4 direct noun objects in the figure below. We further used GPT-4 to classify the instructions into the following 4 major categories. Their proportions in the dataset are:

- 33.8% on information extraction.
- 33.5% on text generation.
- 22.8% on question answering.
- 10.0% on others.

Seed examples were collected from the training splits of the biomedical datasets below (see the [paper](https://arxiv.org/pdf/2310.19975) for a comprehensive list): [MeQSum](https://huggingface.co/datasets/sumedh/MeQSum), [Primock57](https://github.com/babylonhealth/primock57), [MedQA](https://huggingface.co/collections/lavita/medical-qa-datasets-6540b9b1992b1c560eda935c), [emrQA](https://github.com/panushri25/emrQA#download-dataset), [DiSCQ](https://github.com/elehman16/discq), [MEDIQA-AnS](https://osf.io/9afru), [CliCR](https://github.com/clips/clicr), [Diagnoise-me](https://www.kaggle.com/datasets/dsxavier/diagnoise-me?resource=download), [pubhealth](https://huggingface.co/datasets/bigbio/pubhealth), [MedNLI](https://huggingface.co/datasets/bigbio/mednli), [CASI](https://arxiv.org/pdf/2205.12689), [Medal](https://huggingface.co/datasets/McGill-NLP/medal), [MedTextSimplifier](https://github.com/vanh17/MedTextSimplifier), BIOSSES, ChemProt, GAD

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6182988c68444be3259d8b69/s1tysNSWZrcI7o6Kp6SN9.png)

## Dataset Structure

### Data Instances

An example of "train" looks as follows:

```json
{
  "instruction": "Explain the mechanism of action of a given drug in non-medical terms.",
  "input": "Metformin",
  "output": "Metformin is a medication that helps to lower blood sugar levels. It works by making your body more sensitive to insulin, a hormone that helps control sugar levels, and by decreasing the amount of sugar your liver produces."
}
```

### Data Fields

The data fields are as follows:

* `instruction`: describes the task the model should perform. Each of the 25K instructions is unique.
* `input`: optional context or input for the task. For example, when the instruction is "Explain how the drug works", the input is the drug name.
* `output`: the answer to the instruction as generated by GPT-4.

### Languages

The data in BioInstruct are in English (BCP-47 en).

### Licensing Information

The dataset is available under the MIT license.

### Citation Information

```
@article{Tran2024Bioinstruct,
    author = {Tran, Hieu and Yang, Zhichao and Yao, Zonghai and Yu, Hong},
    title = "{BioInstruct: instruction tuning of large language models for biomedical natural language processing}",
    journal = {Journal of the American Medical Informatics Association},
    pages = {ocae122},
    year = {2024},
    month = {06},
    issn = {1527-974X},
    doi = {10.1093/jamia/ocae122},
    url = {https://doi.org/10.1093/jamia/ocae122},
    eprint = {https://academic.oup.com/jamia/advance-article-pdf/doi/10.1093/jamia/ocae122/58084577/ocae122.pdf},
}
```

### Acknowledgments

We thank [bigbio](https://huggingface.co/bigbio), [openlifescienceai](https://huggingface.co/openlifescienceai), and [hf4h](https://huggingface.co/hf4h) for organizing a collection of biomedical datasets. We thank [Meta](https://huggingface.co/meta-llama) for releasing their Llama models.

### Contribution

[Hieu Tran](https://huggingface.co/hieutran81), [Zhichao Yang](https://huggingface.co/whaleloops), Zonghai Yao, Hong Yu


Provider:

bio-nlp-umass


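Collected Summary

Dataset Introduction

Construction

In biomedical natural language processing, the scarcity of high-quality instruction data has pushed researchers toward automated generation. BioInstruct follows the Self-Instruct framework and is built with a fully automated pipeline. The pipeline starts from only 80 manually constructed seed tasks, which cover diverse clinical and biomedical NLP tasks such as biomedical question answering, text summarization, clinical trial eligibility assessment, and differential diagnosis. During generation, the team prompted the pretrained GPT-4 model with three randomly selected seed examples so that it would produce new instructions and demonstrations, yielding a dataset of 25,000 unique instructions.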
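The prompting step described above can be sketched roughly as follows. This is only an illustration: the seed tasks, the `build_prompt` helper, and the prompt wording are hypothetical and are not taken from the BioInstruct code. The only detail drawn from the source is that three seed examples are sampled at random for each prompt.

```python
import random

# Hypothetical seed tasks for illustration only; the real seed set has 80 tasks.
SEED_TASKS = [
    {"instruction": "Summarize the consumer health question in one sentence.",
     "input": "I have had a dry cough for three weeks. Should I see a doctor?",
     "output": "When should a persistent dry cough be evaluated?"},
    {"instruction": "Explain the drug's purpose in plain language.",
     "input": "Metformin",
     "output": "Metformin helps lower blood sugar."},
    {"instruction": "Decide whether the patient meets the age criterion.",
     "input": "Criterion: age 18 to 65. Patient age: 70.",
     "output": "Not eligible."},
    {"instruction": "Expand the abbreviation in the clinical note.",
     "input": "Pt reports SOB on exertion.",
     "output": "SOB means shortness of breath."},
    {"instruction": "List one possible diagnosis for the symptoms.",
     "input": "Fever, stiff neck, sensitivity to light.",
     "output": "Meningitis."},
]


def build_prompt(seed_tasks, k=3, rng=None):
    """Sample k seed tasks at random and lay them out as a few-shot prompt."""
    if rng is None:
        rng = random.Random()
    examples = rng.sample(seed_tasks, k)  # k distinct seed tasks
    # Assumed prompt wording; the actual BioInstruct prompt is not given here.
    parts = ["Write a new biomedical task in the same format as the examples."]
    for i, task in enumerate(examples, start=1):
        parts.append(
            f"Task {i}\n"
            f"Instruction: {task['instruction']}\n"
            f"Input: {task['input']}\n"
            f"Output: {task['output']}"
        )
    # Leave the next task open for the model to complete.
    parts.append(f"Task {k + 1}\nInstruction:")
    return "\n\n".join(parts)


prompt = build_prompt(SEED_TASKS, rng=random.Random(0))
print(prompt)
```

Passing a seeded `random.Random` keeps the sampled examples reproducible; in a real generation loop a fresh sample would presumably be drawn for every request.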

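Features

The defining features of BioInstruct are the diversity and the fairly even distribution of its task types. The instructions were classified into information extraction (33.8%), text generation (33.5%), question answering (22.8%), and other tasks (10.0%). This mix is intended to help models adapt to a range of biomedical scenarios. Each instruction comes with an optional input context and a GPT-4-generated output, forming an instruction-input-output triple that gives instruction tuning a structured corpus.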
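To make the category shares concrete, the short calculation below converts the reported percentages into approximate instruction counts. It assumes a total of exactly 25,000 instructions, which is a rounding of the "25k" stated in the dataset card, so the counts are estimates only. Note that the reported shares add up to 100.1% because of rounding, so the estimates do not sum to exactly 25,000.

```python
# Assumed total: "25k" from the dataset card, taken here as exactly 25,000.
TOTAL = 25_000

# Category shares (percent) as reported in the dataset card.
SHARES = {
    "information extraction": 33.8,
    "text generation": 33.5,
    "question answering": 22.8,
    "other": 10.0,
}


def approx_counts(shares, total):
    """Convert percentage shares into rounded instruction counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


counts = approx_counts(SHARES, TOTAL)
for name, n in counts.items():
    print(f"{name}: about {n} instructions")
```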

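Usage

The dataset is designed for instruction tuning of large language models, with the goal of improving how well they follow biomedical instructions. Users can load the dataset directly and apply it to supervised fine-tuning of language models such as Llama. A typical workflow concatenates the `instruction` and `input` fields to form the model input and uses the `output` field as the training target, so that the model learns to understand and carry out biomedical instructions. The dataset card reports improvements for Llama on 9 common biomedical tasks after such tuning.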
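The concatenation step described above might look like the sketch below. The `format_example` helper and the `### Instruction` / `### Input` / `### Response` template are assumptions chosen for illustration (an Alpaca-style layout); the source does not specify which prompt template the authors used. The sample record is the one shown in the dataset card.

```python
def format_example(record):
    """Turn one instruction/input/output record into a (prompt, target) pair.

    The section markers below are an assumed template, not the authors' format.
    """
    instruction = record["instruction"].strip()
    context = (record.get("input") or "").strip()  # the input field is optional
    if context:
        prompt = (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{context}\n\n"
            f"### Response:\n"
        )
    else:
        # Skip the input section entirely when no context is provided.
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, record["output"].strip()


record = {
    "instruction": "Explain the mechanism of action of a given drug in non-medical terms.",
    "input": "Metformin",
    "output": "Metformin is a medication that helps to lower blood sugar levels. "
              "It works by making your body more sensitive to insulin, a hormone "
              "that helps control sugar levels, and by decreasing the amount of "
              "sugar your liver produces.",
}

prompt, target = format_example(record)
print(prompt + target)
```

With the Hugging Face `datasets` library, records could presumably be obtained with `load_dataset("bio-nlp-umass/bioinstruct")`; that call needs network access and is left out so the sketch stays self-contained. Keeping the prompt and target separate makes it easy to compute the training loss on the target tokens only, which is a common choice in supervised fine-tuning.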

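Background and Challenges

Background Overview

In biomedical natural language processing, the scarcity of high-quality instruction data has long limited fine-grained tuning of large language models. To address this, a research team at the University of Massachusetts (bio-nlp-umass) built BioInstruct, whose samples were generated with GPT-4 in July 2023. The dataset consists of 25,000 instructions and demonstrations, and its central research question is how efficient instruction tuning can help general-purpose language models understand and carry out complex biomedical tasks. Generated through an automated pipeline, BioInstruct is reported to improve model performance on 9 biomedical tasks spanning question answering, information extraction, and text generation, providing a data foundation for clinical decision support and medical knowledge mining.

Current Challenges

BioInstruct targets the core challenges of instruction following and multi-task learning in the biomedical domain, and building it involves two kinds of difficulty. At the domain level, biomedical text is dense with specialized terminology and complex logical relations, so instruction data must be semantically precise and scientifically sound to avoid generating misleading content. At the construction level, although GPT-4-based automated generation reduces manual cost, the small size and limited diversity of the initial seed tasks may affect the coverage and quality balance of the generated instructions; careful prompt engineering and post-processing are needed to ensure the data are reliable and generalize well.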

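Common Scenarios

Typical Use Cases

In biomedical natural language processing, BioInstruct provides 25,000 GPT-4-generated instruction-demonstration pairs as a resource for instruction tuning of language models. Built through an automated pipeline, the dataset covers task types including information extraction, text generation, and question answering. Researchers can use it with models such as Llama to improve instruction following in concrete scenarios such as explaining drug mechanisms or generating clinical summaries, and thus improve model behavior in specialized contexts.

Practical Applications

In real healthcare settings, BioInstruct can support intelligent clinical assistance systems. For example, in electronic health record analysis, a model can extract key symptom information on instruction; in patient education, it can generate plain-language drug explanations; and in clinical trial screening or diagnostic support, models trained on the dataset can help clinicians process specialized text quickly, improving the efficiency and accuracy of care.

Derived and Related Work

Follow-up work building on BioInstruct has mainly focused on improving biomedical instruction-tuning frameworks, for example by refining multi-task learning strategies to strengthen model performance on benchmarks such as MedQA and MedNLI. The dataset has also encouraged work on domain adaptation and serves as a template for customized training of later biomedical large language models.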

The above content was collected and summarized by 遇见数据集.
