medmcqa-cot-llama31

Hugging Face2024-11-08 更新2024-12-12 收录

下载链接：

https://huggingface.co/datasets/HPAI-BSC/medmcqa-cot-llama31

下载链接

链接失效反馈

官方服务：

资源简介：

该数据集是对MedMCQA数据集的合成增强响应，通过使用Llama-3.1-70B-Instruct生成思维链（Chain of Thought, CoT）答案，提高了训练数据的质量。数据集包括多选题和问答任务，涵盖医学和生物学领域。数据集的创建旨在提供高质量的指令调优数据集，基于MedQA。

This dataset consists of synthetic augmented responses for the MedMCQA dataset, where Chain of Thought (CoT) answers are generated via Llama-3.1-70B-Instruct to improve the quality of training data. It covers multiple-choice questions and question answering tasks across the fields of medicine and biology. This dataset was created to deliver high-quality instruction-tuned datasets based on MedQA.

创建时间：

2024-10-29

原始信息汇总

MedMCQA - CoT 数据集

数据集概述

MedMCQA - CoT 数据集是对 MedMCQA 数据集的合成增强响应。该数据集用于训练 Aloe-Beta 模型。

数据集详情

数据集描述

为了提高 MedMCQA 数据集训练拆分中答案的质量，我们利用 Llama-3.1-70B-Instruct 生成 Chain of Thought (CoT) 答案。我们为数据集创建了一个自定义提示，并结合了手工制作的少量示例。对于多选答案，我们要求模型重新表述并解释问题，然后根据问题解释每个选项，最后总结这些解释以得出最终解决方案。在合成数据生成过程中，模型还会被提供解决方案和参考答案。在模型未能生成正确响应并仅重复输入问题的情况下，我们会重新生成解决方案，直到生成正确的响应。更多细节可在论文中找到。

语言(NLP): 英语
许可证: Apache 2.0

数据集来源

论文: Aloe: A Family of Fine-tuned Open Healthcare LLMs

数据集创建

创建理由

该数据集的创建旨在提供一个基于 MedQA 的高质量、易于使用的指令调优数据集。

引用

BibTeX:

@misc{gururajan2024aloe, title={Aloe: A Family of Fine-tuned Open Healthcare LLMs}, author={Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Jordi Bayarri-Planas and Adrian Tormos and Daniel Hinjos and Pablo Bernabeu-Perez and Anna Arias-Duart and Pablo Agustin Martin-Torres and Lucia Urcelay-Ganzabal and Marta Gonzalez-Mallo and Sergio Alvarez-Napagao and Eduard Ayguadé-Parra and Ulises Cortés Dario Garcia-Gasulla}, year={2024}, eprint={2405.01886}, archivePrefix={arXiv}, primaryClass={cs.CL} }

@InProceedings{pmlr-v174-pal22a, title = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering}, author = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan}, booktitle = {Proceedings of the Conference on Health, Inference, and Learning}, pages = {248--260}, year = {2022}, editor = {Flores, Gerardo and Chen, George H and Pollard, Tom and Ho, Joyce C and Naumann, Tristan}, volume = {174}, series = {Proceedings of Machine Learning Research}, month = {07--08 Apr}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v174/pal22a/pal22a.pdf}, url = {https://proceedings.mlr.press/v174/pal22a.html}, abstract = {This paper introduces MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which requires a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.} }

数据集卡片作者

Jordi Bayarri Planas

数据集卡片联系

hpai@bsc.es

搜集汇总

数据集介绍

构建方式

medmcqa-cot-llama31数据集的构建基于MedMCQA数据集，通过Llama-3.1-70B-Instruct模型生成链式思维（CoT）答案，以提升训练数据的质量。在生成过程中，模型被赋予自定义提示和少量示例，要求其重新表述并解释问题，随后逐一分析每个选项，最终总结出正确答案。为确保生成答案的准确性，模型在生成错误答案时会重新生成，直至得到正确响应。

特点

该数据集的特点在于其高质量的多选题答案生成，涵盖了广泛的医学主题和21个医学科目。每个样本包含问题、正确答案及其他选项，要求模型具备深层次的语言理解能力，以测试其在多种医学主题上的推理能力。此外，数据集还提供了详细的解决方案解释，进一步增强了其教育价值。

使用方法

medmcqa-cot-llama31数据集主要用于训练和评估医疗领域的语言模型，特别是针对多选题问答任务。用户可以通过加载数据集，利用其生成的链式思维答案进行模型微调，从而提升模型在医疗问答任务中的表现。此外，数据集还可用于研究链式思维在复杂医学问题中的应用效果。

背景与挑战

背景概述

medmcqa-cot-llama31数据集是基于MedMCQA数据集的一个增强版本，旨在通过生成链式思维（Chain of Thought, CoT）答案来提高医学领域多选问题的回答质量。该数据集由Jordi Bayarri Planas等人于2024年创建，主要依托Llama-3.1-70B-Instruct模型生成详细的解释性答案。MedMCQA数据集本身包含了超过19.4万道来自印度医学入学考试的多选题，涵盖了21个医学学科和2400多个医疗主题。medmcqa-cot-llama31的创建不仅为医学领域的自然语言处理任务提供了高质量的训练数据，还推动了医疗大语言模型（如Aloe-Beta）的发展，为医疗问答系统的智能化提供了重要支持。

当前挑战

medmcqa-cot-llama31数据集在构建过程中面临多重挑战。首先，医学领域的多选问题通常涉及复杂的推理能力和广泛的知识背景，生成准确且逻辑严密的链式思维答案需要模型具备极高的理解能力和知识覆盖范围。其次，在数据生成过程中，模型有时会重复输入问题而非生成有效答案，这需要通过多次迭代和人工干预来确保数据质量。此外，尽管Llama-3.1-70B-Instruct模型在生成解释性答案方面表现出色，但其对医学专业术语和复杂概念的准确理解仍需进一步提升。这些挑战不仅反映了医学领域自然语言处理任务的复杂性，也为未来研究提供了改进方向。

常用场景

经典使用场景

在医学领域的自然语言处理研究中，medmcqa-cot-llama31数据集被广泛应用于训练和评估医疗问答系统。该数据集通过引入Llama-3.1-70B-Instruct模型生成的链式思维（CoT）答案，显著提升了模型在复杂医学问题上的推理能力。研究者通常利用该数据集进行多选问答任务的训练，以增强模型对医学知识的理解和推理能力。

衍生相关工作

基于medmcqa-cot-llama31数据集，研究者们开发了多个经典的医疗问答模型，如Aloe-Beta模型。这些模型在医学领域的自然语言处理任务中表现出色，推动了医疗人工智能技术的进一步发展。此外，该数据集还激发了更多关于链式思维推理的研究，为其他领域的问答系统提供了新的思路和方法。

数据集最近研究