HPAI-BSC/medical-fields
收藏Medical Question Classification Dataset
数据集概述
该数据集旨在评估医疗语言模型。它将多个重要的医疗问答数据集合并为统一格式,并将其分类为35个不同的医疗类别。这种结构使用户能够识别模型在特定类别中的性能不足,并相应地解决这些问题。
数据集结构
数据字段
- id: 每个问题的唯一标识符。
- question: 医疗问题。
- op1: 问题的第一个选项。
- op2: 问题的第二个选项。
- op3: 问题的第三个选项。
- op4: 问题的第四个选项。
- cop: 正确选项(1, 2, 3, 或 4)。
- dataset: 数据集来源名称。
- medical_field: 问题预测的医疗领域。
- cot_medical_field: 医疗领域的思维链(CoT)。
- cumulative_logprob_cot_medical_field: 医疗领域CoT的对数概率。
示例实例
json [ { "id": "test-00000", "question": "A junior orthopaedic surgery resident is completing a carpal tunnel repair with the department chairman as the attending physician. During the case, the resident inadvertently cuts a flexor tendon. The tendon is repaired without complication. The attending tells the resident that the patient will do fine, and there is no need to report this minor complication that will not harm the patient, as he does not want to make the patient worry unnecessarily. He tells the resident to leave this complication out of the operative report. Which of the following is the correct next action for the resident to take?", "op1": "Disclose the error to the patient and put it in the operative report", "op2": "Tell the attending that he cannot fail to disclose this mistake", "op3": "Report the physician to the ethics committee", "op4": "Refuse to dictate the operative report", "cop": 2, "dataset": "medqa_4options_test", "medical_field": "Surgery", "cot_medical_field": "This question involves a scenario related to surgical procedures and reporting complications, which falls under the category of Surgery. The category is: Surgery", "cumulative_logprob_cot_medical_field": -2.603069230914116 } ]
数据集创建
该数据集使用Llama-3-70B-Instruct模型将医疗问题分类到预定义的医疗领域。创建过程包括从HuggingFace下载数据集,根据配置文件中的指定字段对问题进行分类,并创建合并数据集。
使用的数据集
- CareQA: https://huggingface.co/datasets/HPAI-BSC/CareQA (CareQA_en.json)
- headqa_test: https://huggingface.co/datasets/openlifescienceai/headqa (test split)
- medmcqa_validation: https://huggingface.co/datasets/openlifescienceai/medmcqa (validation split)
- medqa_4options_test: https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf (test split)
- mmlu_anatomy_test: https://huggingface.co/datasets/openlifescienceai/mmlu_anatomy (test split)
- mmlu_clinical_knowledge_test: https://huggingface.co/datasets/openlifescienceai/mmlu_clinical_knowledge (test split)
- mmlu_college_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_college_medicine (test split)
- mmlu_medical_genetics_test: https://huggingface.co/datasets/openlifescienceai/mmlu_medical_genetics (test split)
- mmlu_professional_medicine_test: https://huggingface.co/datasets/openlifescienceai/mmlu_professional_medicine (test split)
提示配置
yaml system_prompt: "You are a medical assistant tasked with classifying medical questions into specific categories. You will be given a medical question. Your job is to categorize the question into one of the following categories: MEDICAL_FIELDS. Ensure that your output includes a step-by-step explanation of your reasoning process followed by the final category. Provide the name of the category as a single word and nothing else. If you have any doubts or the question does not fit clearly into one category, respond with The category is: None. End your response with The category is: <category>." fewshot_examples:
- question: "What are the common symptoms of a myocardial infarction?" answer: "Myocardial infarction refers to a heart attack, which is a condition related to the heart. Heart conditions are categorized under Cardiology. The category is: Cardiology"
- question: "What is the first-line treatment for type 2 diabetes?" answer: "Type 2 diabetes is a metabolic disorder that involves insulin regulation. Disorders related to metabolism and insulin are categorized under Endocrinology. The category is: Endocrinology"
- question: "What are the stages of non-small cell lung cancer?" answer: "Non-small cell lung cancer is a type of cancer. The staging of cancer is a process that falls under the field of Oncology. The category is: Oncology"
- question: "How is rheumatoid arthritis diagnosed?" answer: "Rheumatoid arthritis is an autoimmune disease that affects the joints. Diseases affecting the joints and autoimmune conditions are categorized under Rheumatology. The category is: Rheumatology"
- question: "What are the side effects of the MMR vaccine?" answer: "The MMR vaccine triggers immune responses to prevent measles, mumps, and rubella. Immune responses and vaccinations are categorized under Immunology. The category is: Immunology"
- question: "What is the capital of France?" answer: "The question is unrelated to medical fields and does not fit into any medical category. The category is: None"
- question: "Waht are l" answer: "The question is incomplete and contains significant typos, making it unclear and impossible to categorize. The category is: None" regex: "The category is: (?P<category>\w+)"
数据集统计


引用
如果使用此数据集,请引用:
bibtex @misc{gururajan2024aloe, title={Aloe: A Family of Fine-tuned Open Healthcare LLMs}, author={Ashwin Kumar Gururajan and Enrique Lopez-Cuena and Jordi Bayarri-Planas and Adrian Tormos and Daniel Hinjos and Pablo Bernabeu-Perez and Anna Arias-Duart and Pablo Agustin Martin-Torres and Lucia Urcelay-Ganzabal and Marta Gonzalez-Mallo and Sergio Alvarez-Napagao and Eduard Ayguadé-Parra and Ulises Cortés Dario Garcia-Gasulla}, year={2024}, eprint={2405.01886}, archivePrefix={arXiv}, primaryClass={cs.CL} }




