MedVAL-Bench: Expert-Annotated Medical Text Validation Benchmark

Name: MedVAL-Bench: Expert-Annotated Medical Text Validation Benchmark
Creator: PhysioNet
Published: 2025-11-14 07:11:57
License: No description available

DataCite Commons: updated 2025-11-14, indexed 2026-05-04

Download link:

https://physionet.org/content/medval-bench/

Description:

MedVAL-Bench is a dataset of physician evaluations of errors in language model (LM)-generated medical text. It spans 6 medical text generation tasks and includes annotations from 12 physicians on clinically significant errors in 840 LM-generated outputs. Each task is a text-to-text generation task that transforms an input medical text into an output suited to a specific use case. Every task provides inputs and the corresponding LM-generated outputs, which physicians evaluate for factual consistency. The MedVAL framework and dataset rely only on the inputs during evaluation, so they also apply to datasets that have no reference outputs. The evaluation determines whether an output is factually consistent with its input and safe for use. MedVAL-Bench constitutes the first large-scale physician-validated benchmark with triage-style risk grading aligned to real-world clinical decision-making, supporting the development of automated, expert-aligned evaluation methods and facilitating research toward trustworthy medical text generation.
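As a rough illustration of how such a benchmark might be used, the sketch below pairs each input and LM-generated output with a physician risk grade and measures how often an automated validator assigns the same grade. Everything in it is an assumption made for illustration: the field names, the integer grade scale, the task labels, and the sample values are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationRecord:
    """One benchmark item. All field names here are hypothetical."""
    task: str             # label for one of the generation tasks
    input_text: str       # source medical text
    output_text: str      # LM-generated text to validate
    physician_grade: int  # risk grade from a physician (scale assumed)


def agreement_rate(records, predicted_grades):
    """Fraction of records whose automated grade equals the physician grade."""
    if len(records) != len(predicted_grades):
        raise ValueError("records and predicted_grades must have equal length")
    if not records:
        return 0.0
    matches = sum(
        1 for rec, pred in zip(records, predicted_grades)
        if rec.physician_grade == pred
    )
    return matches / len(records)


# Made-up placeholder records, not real benchmark data.
records = [
    ValidationRecord("task_a", "input 1", "output 1", physician_grade=1),
    ValidationRecord("task_a", "input 2", "output 2", physician_grade=3),
    ValidationRecord("task_b", "input 3", "output 3", physician_grade=2),
    ValidationRecord("task_b", "input 4", "output 4", physician_grade=1),
]
predicted = [1, 2, 2, 1]  # grades from a hypothetical automated validator
print(agreement_rate(records, predicted))  # 0.75
```

No reference output appears in the record, which mirrors the reference-free design described above: the validator only ever sees the input and the generated output.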

Provider:

PhysioNet

Created:

2025-10-29
