TCM-Pretrain-Data-ShizhenGPT

Source: ModelScope community. Updated 2026-05-23. Indexed 2025-08-30.
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT
Resource description:
# 📚 Introduction

This dataset is the pre-training dataset for [ShizhenGPT](https://github.com/FreedomIntelligence/ShizhenGPT), a multimodal LLM for **Traditional Chinese Medicine (TCM)**. We open-source the largest existing TCM text corpus (over 5B tokens), collected from TCM-related websites and books, along with the largest TCM image-text pre-training dataset.

For details, see our [paper](https://arxiv.org/abs/2508.14706) and [GitHub repository](https://github.com/FreedomIntelligence/ShizhenGPT).

# 📊 Dataset Overview

The open-sourced pre-training dataset consists of five parts:

| Subset | Modality | Description | Data Quantity |
| ------------------------------- | ------------------ | -------------------------------------------------------------------------- | ---------------------------------- |
| TCM_Book_Corpus | 📝 Text | A cleaned corpus of 3,256 TCM textbooks. | ~0.5B tokens |
| TCM_Web_Corpus | 📝 Text | A TCM corpus collected from the web. | Over 5B tokens |
| TCM_Book_Interleaved_Data | 📝 Text, 👁️ Visual | Interleaved text-image data from 306 TCM books. | 41,459 entries, 50,690 images |
| TCM_Web_Interleaved_Data | 📝 Text, 👁️ Visual | Interleaved text-image data from the TCM web corpus. | 505,465 entries, 1,143,954 images |
| TCM_pretrain_synthesized_vision | 📝 Text, 👁️ Visual | TCM image-text pairs generated from images and their context using GPT-4o. | 144,239 entries, 159,534 images |

> ⚠️ Note: Due to privacy and ethical concerns, TCM signal datasets (e.g., sound and pulse) are not provided. For some signal data, refer to the [Instruction Dataset](https://huggingface.co/datasets/FreedomIntelligence/TCM-Instruction-Tuning-ShizhenGPT).

# 📖 Citation

If you find our data useful, please consider citing our work!

```
@misc{chen2025shizhengptmultimodalllmstraditional,
  title={ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine},
  author={Junying Chen and Zhenyang Cai and Zhiheng Liu and Yunjin Yang and Rongsheng Wang and Qingying Xiao and Xiangyi Feng and Zhan Su and Jing Guo and Xiang Wan and Guangjun Yu and Haizhou Li and Benyou Wang},
  year={2025},
  eprint={2508.14706},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.14706},
}
```
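As a quick sanity check on the three multimodal rows of the Dataset Overview table, the following minimal sketch tallies their entry and image counts and derives an images-per-entry ratio for each subset. The counts are copied from the table; the `Subset` class and the `summarize` helper are hypothetical names introduced here for illustration and are not part of the dataset or of any ShizhenGPT tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """One multimodal row of the overview table (illustrative type, not an official schema)."""
    name: str
    entries: int
    images: int


# Counts copied from the Dataset Overview table.
MULTIMODAL_SUBSETS = [
    Subset("TCM_Book_Interleaved_Data", 41_459, 50_690),
    Subset("TCM_Web_Interleaved_Data", 505_465, 1_143_954),
    Subset("TCM_pretrain_synthesized_vision", 144_239, 159_534),
]


def summarize(subsets):
    """Return (total entries, total images, {subset name: images per entry})."""
    total_entries = sum(s.entries for s in subsets)
    total_images = sum(s.images for s in subsets)
    images_per_entry = {s.name: s.images / s.entries for s in subsets}
    return total_entries, total_images, images_per_entry


if __name__ == "__main__":
    entries, images, ratios = summarize(MULTIMODAL_SUBSETS)
    print(f"total entries: {entries:,}")
    print(f"total images:  {images:,}")
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f} images per entry")
```

Under these figures, the web-interleaved subset carries roughly two images per entry, while the book-interleaved and synthesized subsets stay close to one image per entry.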

Provider:
maas
Created:
2025-08-22
Collected summary
Dataset introduction
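Background and challenges
Background overview
This multimodal dataset is used to pre-train ShizhenGPT and focuses on Traditional Chinese Medicine (TCM). It comprises five parts spanning text and visual data, with over 5B tokens of text and a large number of image-text pairs, making it the largest TCM corpus and image-text pre-training dataset to date.
The summary above was collected and generated by 遇见数据集.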