TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Name: TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine
Creator: Alibaba Cloud Tianchi
Published: 2026-06-03 17:09:43
License: No description available

Alibaba Cloud Tianchi · Updated 2026-06-03 · Indexed 2024-06-15

Download link:

https://tianchi.aliyun.com/dataset/180328


Resource description:

Large language models (LLMs) have performed remarkably well on benchmarks across a variety of natural language processing tasks, including tasks in the Western medical domain. However, professional evaluation benchmarks for LLMs do not yet cover the traditional Chinese medicine (TCM) domain, which has a long history and broad influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, which consists of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. The dataset covers the core components of the TCMLE, including TCM fundamentals and clinical practice. To evaluate LLMs beyond question-answering accuracy, we propose TCMScore, a metric tailored to assessing the quality of answers that LLMs generate for TCM-related questions. It jointly considers the consistency of TCM semantics and TCM knowledge. Comprehensive experimental analyses from diverse perspectives yield the following findings: (1) The unsatisfactory performance of LLMs on this benchmark shows that they have significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models such as ZhongJing-TCM, the quality of the generated analysis text decreases, and we hypothesize that their fine-tuning process impairs the underlying LLM's basic capabilities. (3) Traditional metrics for text generation quality, such as Rouge and BertScore, are sensitive to text length and surface-level semantic ambiguity, while domain-specific metrics such as TCMScore can supplement and help explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide deeper support for medical research.
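The accuracy side of the evaluation described above can be illustrated with a minimal sketch. It assumes each question is a multiple-choice item with a single gold option letter and that a model's answer can be recovered from its free-text response by a simple pattern match. The `Question` record, its field names, and the `extract_choice` heuristic are hypothetical and are not taken from the TCM-ED release; the sketch covers plain accuracy only and does not implement TCMScore.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Question:
    # Hypothetical record layout; the real TCM-ED schema may differ.
    stem: str
    options: Dict[str, str]
    answer: str  # gold option letter, e.g. "B"


# Matches an option letter A-E that is not part of a longer ASCII word.
# Lookarounds are used instead of \b so that a letter directly adjacent
# to Chinese characters (e.g. "答案是B") is still found.
_CHOICE = re.compile(r"(?<![A-Za-z])([A-E])(?![A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in a response, if any."""
    match = _CHOICE.search(response)
    return match.group(1) if match else None


def accuracy(questions: List[Question], responses: List[str]) -> float:
    """Fraction of questions whose extracted choice equals the gold answer."""
    if not questions:
        return 0.0
    correct = sum(
        extract_choice(resp) == q.answer
        for q, resp in zip(questions, responses)
    )
    return correct / len(questions)


if __name__ == "__main__":
    # Placeholder items for illustration only, not real exam content.
    qs = [
        Question("Placeholder stem 1", {"A": "x", "B": "y"}, "B"),
        Question("Placeholder stem 2", {"C": "x", "D": "y"}, "C"),
    ]
    outputs = ["答案是B", "I think D"]
    print(accuracy(qs, outputs))
```

A first-match heuristic like this will misread responses that mention several option letters before settling on one, so a real evaluation would likely constrain the output format or use a stricter parser.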


Provider:

Alibaba Cloud Tianchi

Created:

2024-06-09

Collection summary

Dataset introduction