TCM-SD
OpenDataLab, updated 2026-05-17, indexed 2024-05-09
Download link:
https://opendatalab.org.cn/OpenDataLab/TCM-SD
Resource description:
Traditional Chinese Medicine (TCM) is a natural, safe, and effective therapy that has spread and been applied worldwide. Its distinctive diagnosis and treatment system requires comprehensive analysis of patient symptoms hidden in free-text clinical records. Previous studies have shown that this system can be digitized and made more intelligent with artificial intelligence (AI) technologies such as natural language processing (NLP). However, existing datasets are insufficient in both quality and quantity to support further development of data-driven AI in TCM. In this paper, we therefore focus on the core task of the TCM diagnosis and treatment system, syndrome differentiation (SD), and introduce the first public large-scale benchmark for SD, called TCM-SD. The benchmark contains 54,152 real clinical records covering 148 syndromes. We also collect a large-scale unlabeled text corpus in the TCM domain and propose a domain-specific pre-trained language model called ZYBERT. Experiments with deep neural networks establish strong performance baselines, reveal various challenges in SD, and demonstrate the potential of domain-specific pre-trained language models. Our study and analysis reveal opportunities to integrate knowledge from computer science and linguistics to explore the empirical validity of TCM theories.
Provider:
OpenDataLab
Created:
2022-12-21
Collected summary
Dataset introduction

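Background and challenges
Background overview
TCM-SD is the first large-scale public benchmark dataset for the syndrome differentiation (SD) task in Traditional Chinese Medicine (TCM). It contains 54,152 real clinical records covering 148 syndromes and aims to advance data-driven AI in the TCM domain. The dataset is accompanied by a large-scale unlabeled text corpus and a domain-specific pre-trained language model, ZYBERT, to support natural language processing research.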
The content above was collected and summarized by 遇见数据集.



