five

HPAI-BSC/medical-fields

收藏
Hugging Face2024-07-11 更新2024-07-13 收录
下载链接:
https://hf-mirror.com/datasets/HPAI-BSC/medical-fields
下载链接
链接失效反馈
官方服务:
资源简介:
该数据集设计用于医疗语言模型的评估。它将多个重要的医疗问答数据集合并为统一格式,并将其分类为35个不同的医疗类别。这种结构使用户能够识别模型在特定类别中可能表现不佳的地方,并相应地解决这些问题。数据集的结构包括每个问题的唯一标识、问题文本、四个选项、正确答案、来源数据集名称、预测的医疗领域、医疗领域的思维链以及思维链的对数概率。数据集的创建过程使用了Llama-3-70B-Instruct模型,并详细列出了使用的数据集和提示配置。

This dataset is designed for medical language models evaluation. It merges several of the most important medical QA datasets into a common format and classifies them into 35 distinct medical categories. This structure enables users to identify any specific categories where the models performance may be lacking and address these areas accordingly. The dataset includes fields such as id, question, options, correct option, source dataset name, predicted medical field, chain of thought for the medical field, and log probability of the CoT medical field. The dataset was created using the Llama-3-70B-Instruct model to classify medical questions into predefined medical fields.
提供机构:
HPAI-BSC
原始信息汇总

Medical Question Classification Dataset

数据集概述

该数据集旨在评估医疗语言模型。它将多个重要的医疗问答数据集合并为统一格式,并将其分类为35个不同的医疗类别。这种结构使用户能够识别模型在特定类别中的性能不足,并相应地解决这些问题。

数据集结构

数据字段

  • id: 每个问题的唯一标识符。
  • question: 医疗问题。
  • op1: 问题的第一个选项。
  • op2: 问题的第二个选项。
  • op3: 问题的第三个选项。
  • op4: 问题的第四个选项。
  • cop: 正确选项(1, 2, 3, 或 4)。
  • dataset: 数据集来源名称。
  • medical_field: 问题预测的医疗领域。
  • cot_medical_field: 医疗领域的思维链(CoT)。
  • cumulative_logprob_cot_medical_field: 医疗领域CoT的对数概率。

示例实例

json [ { "id": "test-00000", "question": "A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?", "op1": "Disclose the error to the patient and put it in the operative report", "op2": "Tell the attending that he cannot fail to disclose this mistake", "op3": "Report the physician to the ethics committee", "op4": "Refuse to dictate the operative report", "cop": 2, "dataset": "medqa_4options_test", "medical_field": "Surgery", "cot_medical_field": "This question involves a scenario related to surgical procedures and reporting complications, which falls under the category of Surgery. The category is: Surgery", "cumulative_logprob_cot_medical_field": -2.603069230914116 } ]

数据集创建

该数据集使用Llama-3-70B-Instruct模型将医疗问题分类到预定义的医疗领域。创建过程包括从HuggingFace下载数据集,根据配置文件中的指定字段对问题进行分类,并创建合并数据集。

使用的数据集

  • CareQA: https://huggingface.co/datasets/HPAI-BSC/CareQA (CareQA_en.json)
  • headqa_test: https://huggingface.co/datasets/openlifescienceai/headqa (test split)
  • medmcqa_validation: https://huggingface.co/datasets/openlifescienceai/medmcqa (validation split)
  • medqa_4options_test: https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf (test split)
  • mmlu_anatomy_test: https://huggingface.co/datasets/openlifescienceai/mmlu_anatomy (test split)
  • mmlu_clinical_knowledge_test: https://huggingface.co/datasets/openlifescienceai/mmlu_clinical_knowledge (test split)
  • mmlu_college_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_college_medicine (test split)
  • mmlu_medical_genetics_test: https://huggingface.co/datasets/openlifescienceai/mmlu_medical_genetics (test split)
  • mmlu_professional_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_professional_medicine (test split)

提示配置

yaml system_prompt: "You are a medical assistant tasked with classifying medical questions into specific categories. You will be given a medical question. Your job is to categorize the question into one of the following categories: MEDICAL_FIELDS. Ensure that your output includes a step-by-step explanation of your reasoning process followed by the final category. Provide the name of the category as a single word and nothing else. If you have any doubts or the question does not fit clearly into one category, respond with The category is: None. End your response with The category is: <category>." fewshot_examples:

  • question: "What are the common symptoms of a myocardial infarction?" answer: "Myocardial infarction refers to a heart attack, which is a condition related to the heart. Heart conditions are categorized under Cardiology. The category is: Cardiology"
  • question: "What is the first-line treatment for type 2 diabetes?" answer: "Type 2 diabetes is a metabolic disorder that involves insulin regulation. Disorders related to metabolism and insulin are categorized under Endocrinology. The category is: Endocrinology"
  • question: "What are the stages of non-small cell lung cancer?" answer: "Non-small cell lung cancer is a type of cancer. The staging of cancer is a process that falls under the field of Oncology. The category is: Oncology"
  • question: "How is rheumatoid arthritis diagnosed?" answer: "Rheumatoid arthritis is an autoimmune disease that affects the joints. Diseases affecting the joints and autoimmune conditions are categorized under Rheumatology. The category is: Rheumatology"
  • question: "What are the side effects of the MMR vaccine?" answer: "The MMR vaccine triggers immune responses to prevent measles, mumps, and rubella. Immune responses and vaccinations are categorized under Immunology. The category is: Immunology"
  • question: "What is the capital of France?" answer: "The question is unrelated to medical fields and does not fit into any medical category. The category is: None"
  • question: "Waht are l" answer: "The question is incomplete and contains significant typos, making it unclear and impossible to categorize. The category is: None" regex: "The category is: (?P<category>\w+)"

数据集统计

image.png

image/png

引用

如果使用此数据集,请引用:

bibtex @misc{gururajan2024aloe, title={Aloe: A Family of Fine-tuned Open Healthcare LLMs}, author={Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Jordi Bayarri-Planas and Adrian Tormos and Daniel Hinjos and Pablo Bernabeu-Perez and Anna Arias-Duart and Pablo Agustin Martin-Torres and Lucia Urcelay-Ganzabal and Marta Gonzalez-Mallo and Sergio Alvarez-Napagao and Eduard Ayguadé-Parra and Ulises Cortés Dario Garcia-Gasulla}, year={2024}, eprint={2405.01886}, archivePrefix={arXiv}, primaryClass={cs.CL} }

搜集汇总
数据集介绍
main_image_url
以上内容由遇见数据集搜集并总结生成
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作