TCM-Pretrain-Data-ShizhenGPT

Name: TCM-Pretrain-Data-ShizhenGPT
Creator: maas
Published: 2026-05-23 06:11:07
License: No description available

ModelScope community · Updated 2026-05-23 · Indexed 2025-08-30

Download link:

https://modelscope.cn/datasets/FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT

Resource description:

# 📚 Introduction

This dataset is the pre-training dataset for [ShizhenGPT](https://github.com/FreedomIntelligence/ShizhenGPT), a multimodal LLM for **Traditional Chinese Medicine (TCM)**. We open-source the largest existing TCM corpus (over 5B tokens), collected from TCM-related websites and books. We also open-source the largest-scale TCM image-text pre-training dataset. For details, see our [paper](https://arxiv.org/abs/2508.14706) and [GitHub repository](https://github.com/FreedomIntelligence/ShizhenGPT).

# 📊 Dataset Overview

The open-sourced pre-training dataset consists of five parts:

| Subset | Modality | Description | Data Quantity |
| ------------------------------- | ------------------ | -------------------------------------------------------------------------- | ---------------------------------- |
| TCM_Book_Corpus | 📝 Text | A cleaned corpus of 3,256 TCM textbooks. | ~0.5B tokens |
| TCM_Web_Corpus | 📝 Text | A TCM corpus collected from the web. | Over 5B tokens |
| TCM_Book_Interleaved_Data | 📝 Text, 👁️ Visual | Interleaved text-image data from 306 TCM books. | 41,459 entries, 50,690 images |
| TCM_Web_Interleaved_Data | 📝 Text, 👁️ Visual | Interleaved text-image data from the TCM web corpus. | 505,465 entries, 1,143,954 images |
| TCM_pretrain_synthesized_vision | 📝 Text, 👁️ Visual | TCM image-text pairs generated from images and their context using GPT-4o. | 144,239 entries, 159,534 images |

> ⚠️ Note: Due to privacy and ethical concerns, TCM signal datasets (e.g., sound and pulse) are not provided. For some signal data, refer to the [Instruction Dataset](https://huggingface.co/datasets/FreedomIntelligence/TCM-Instruction-Tuning-ShizhenGPT).

# 📖 Citation

If you find our data useful, please consider citing our work!
```
@misc{chen2025shizhengptmultimodalllmstraditional,
      title={ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine},
      author={Junying Chen and Zhenyang Cai and Zhiheng Liu and Yunjin Yang and Rongsheng Wang and Qingying Xiao and Xiangyi Feng and Zhan Su and Jing Guo and Xiang Wan and Guangjun Yu and Haizhou Li and Benyou Wang},
      year={2025},
      eprint={2508.14706},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.14706},
}
```
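
As a rough aid for reading the Dataset Overview table, the sketch below totals the entry and image counts of the three image-text subsets and computes the average number of images per entry. The counts are copied from the table; the function name `summarize` and the dictionary layout are illustrative assumptions, not part of the dataset or its tooling.

```python
# Entry and image counts copied from the Dataset Overview table.
# The dictionary layout is an illustrative assumption, not an official schema.
INTERLEAVED_SUBSETS = {
    "TCM_Book_Interleaved_Data": (41459, 50690),
    "TCM_Web_Interleaved_Data": (505465, 1143954),
    "TCM_pretrain_synthesized_vision": (144239, 159534),
}


def summarize(subsets):
    """Return (total entries, total images, images-per-entry by subset)."""
    total_entries = sum(entries for entries, _ in subsets.values())
    total_images = sum(images for _, images in subsets.values())
    ratios = {
        name: images / entries
        for name, (entries, images) in subsets.items()
    }
    return total_entries, total_images, ratios


total_entries, total_images, ratios = summarize(INTERLEAVED_SUBSETS)
print(f"entries: {total_entries:,}  images: {total_images:,}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} images per entry")
```

By these figures, the web-derived interleaved subset carries roughly two images per entry, while the book-derived and synthesized subsets sit close to one image per entry.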


Provider:

maas

Created:

2025-08-22
