ekplatebiryani/human_anatomy_qa_with_difficulty

Name: ekplatebiryani/human_anatomy_qa_with_difficulty
Creator: ekplatebiryani
Published: 2025-12-10 21:36:20
License: MIT (declared in the dataset card metadata)

Source: Hugging Face. Updated 2025-12-10. Indexed 2025-12-20.

Download link:

https://hf-mirror.com/datasets/ekplatebiryani/human_anatomy_qa_with_difficulty

Description:

---
tags:
- medical-ai
- medical-question-answering
- anatomy
- true-false
- evaluation
- safety
task_categories:
- question-answering
language:
- en
license: mit
dataset_size: 1077
annotations_creators:
- expert-generated
dataset_summary: >
  A dataset of 1,077 clinically validated True/False anatomy questions
  designed to evaluate medical LLMs on honesty, helpfulness, and harmlessness.
citation: >
  Azeez et al., Truth, Trust, and Trouble: Medical AI on the Edge,
  EMNLP Industry Track 2025.
---

# Truth, Trust, and Trouble (TTT) – Medical Anatomy QA Benchmark

This repository hosts the dataset introduced in the EMNLP Industry Track 2025 paper **“Truth, Trust, and Trouble: Medical AI on the Edge.”**

The dataset contains **1,077 high-quality, clinically validated True/False anatomy questions**, designed to evaluate medical LLMs along three critical axes:

* **Honesty** (factual alignment)
* **Helpfulness** (semantic relevance & completeness)
* **Harmlessness** (safety under clinical constraints)

This benchmark is built to stress-test medical reasoning, factual accuracy, and safe behavior in LLMs intended for healthcare contexts.

## Dataset Contents

The dataset file included in this repository is:

* `main_dataset_with_difficulty.csv`: The final cleaned benchmark containing:
    * **question**
    * **answer** (TRUE/FALSE)
    * **difficulty** (1–3 scale)
    * **question type indicators** (template / model-generated)
    * **safety flag**
    * **semantic annotations**

## Dataset Construction (Summary)

According to the methodology described in the paper:

1. **Sourcing:** Content was sourced from standard anatomy textbooks and clinical case reports.
2. **Generation:** Two pipelines generated QA pairs:
    * Rule-based templates
    * LLM-generated natural phrasing
3. **Filtering:** Safety filtering removed harmful, misleading, or clinically inappropriate questions.
4. **Validation:** Three licensed clinical annotators validated correctness and factual grounding.
    * **Cohen’s Kappa agreement:** 0.81 (indicating strong inter-annotator consistency).
5. **Final Size:** 1,077 validated QA pairs.

*See Section 3 (Methodology) of the paper for full details.*

## Intended Use

This dataset is ideal for evaluating:

* Clinical question-answering models
* Factual alignment / hallucination resistance
* Safety-oriented model behaviors
* Reasoning on anatomical concepts
* Few-shot vs zero-shot performance gaps

The paper also benchmarks three open-source models:

* Mistral-7B
* BioMistral-7B-DARE
* AlpaCare-13B

*Note: AlpaCare-13B achieved 91.7% accuracy and 0.92 harmlessness in the original study.*

## Safety Notice

> **Warning:** This dataset is **not intended to train clinical decision-making systems for real-world deployment without human oversight.** It evaluates research models, not medical devices.

## Citation

```bibtex
@inproceedings{azeez-etal-2025-truth,
    title = "Truth, Trust, and Trouble: Medical {AI} on the Edge",
    author = "Azeez, Mohammad Anas and Ali, Rafiq and Shabbir, Ebad and Siddiqui, Zohaib Hasan and Kashyap, Gautam Siddharth and Gao, Jiechao and Naseem, Usman",
    editor = "Potdar, Saloni and Rojas-Barahona, Lina and Montella, Sebastien",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = nov,
    year = "2025",
    address = "Suzhou (China)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-industry.69/",
    doi = "10.18653/v1/2025.emnlp-industry.69",
    pages = "1017--1025",
    ISBN = "979-8-89176-333-3",
    abstract = "Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework via a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models{---}Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7{\%}) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite smaller scale. Few-shot prompting improves accuracy from 78{\%} to 85{\%}, and all models show reduced helpfulness on complex queries, highlighting challenges in clinical QA. Our code is available at: https://github.com/AnasAzeez/TTT"
}
```
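## Example: scoring predictions by difficulty

The card lists `question`, `answer`, and `difficulty` among the fields of `main_dataset_with_difficulty.csv`. The sketch below shows one way to break True/False accuracy down by difficulty level. It rests on assumptions: the exact CSV header names are not confirmed by the card, the three rows are toy placeholders rather than dataset entries, and `load_rows`, `accuracy_by_difficulty`, and `always_true` are hypothetical helpers, not part of the paper's code.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only; they are NOT taken from the dataset.
# Column names are assumed from the field list in the dataset card.
SAMPLE_CSV = """question,answer,difficulty
The femur is the longest bone in the human body.,TRUE,1
The adult human heart has three chambers.,FALSE,1
The ulnar nerve passes behind the medial epicondyle of the humerus.,TRUE,2
"""


def load_rows(handle):
    """Read rows and normalise the label and difficulty columns."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "question": row["question"].strip(),
            "answer": row["answer"].strip().upper() == "TRUE",
            "difficulty": int(row["difficulty"]),
        })
    return rows


def accuracy_by_difficulty(rows, predict):
    """Return {difficulty: accuracy} for a predict(question) -> bool callable."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["difficulty"]] += 1
        if predict(row["question"]) == row["answer"]:
            correct[row["difficulty"]] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


def always_true(question):
    """Hypothetical baseline that answers TRUE to every question."""
    return True


rows = load_rows(io.StringIO(SAMPLE_CSV))
scores = accuracy_by_difficulty(rows, always_true)
print(scores)
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an opened handle to `main_dataset_with_difficulty.csv` and swap `always_true` for a function that queries the model under test. Adjust the column names if the actual header differs.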

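## Example: Cohen's kappa for two annotators

The construction summary reports a Cohen's Kappa of 0.81. As a reminder of what that statistic measures, the sketch below computes kappa for a single pair of annotators using the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). The card does not say how agreement across the three annotators was aggregated, so this covers only the two-rater case. The label lists are made-up illustrations, not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label, so agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up labels for ten items; the raters disagree on one item only.
annotator_a = [True, True, False, False, True, False, True, True, False, False]
annotator_b = [True, True, False, False, True, False, True, False, False, False]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))  # observed 0.9, expected 0.5, so kappa is 0.8
```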

Provider:

ekplatebiryani
