FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT

Name: FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT
Creator: FreedomIntelligence
Published: 2025-09-08 10:58:08
License: No description available

Hugging Face | Updated 2025-09-08 | Indexed 2025-09-13

Download link:

https://hf-mirror.com/datasets/FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT

Description:

This Traditional Chinese Medicine (TCM) dataset was built for pre-training the ShizhenGPT model. It contains text collected from TCM-related websites and books, together with combined image-text pre-training data. The listing describes it as the largest TCM corpus to date, with over 5 billion tokens. The dataset has five parts: TCM_Book_Corpus (book text), TCM_Web_Corpus (web text), TCM_Book_Interleaved_Data (interleaved image-text data from books), TCM_Web_Interleaved_Data (interleaved image-text data from web pages), and TCM_pretrain_synthesized_vision (image-text pairs generated with GPT-4o from images and their context).
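As a rough illustration of how the five parts might be located in the repository, the sketch below lists the repository's files and groups them by subset name. It assumes that file paths contain the subset names given in the description, which this listing does not confirm; the helper names `group_by_subset` and `fetch_repo_files` are hypothetical. `list_repo_files` is the `huggingface_hub` function for listing files in a repository and requires network access.

```python
from collections import defaultdict

REPO_ID = "FreedomIntelligence/TCM-Pretrain-Data-ShizhenGPT"

# The five parts named in the dataset description.
SUBSETS = [
    "TCM_Book_Corpus",
    "TCM_Web_Corpus",
    "TCM_Book_Interleaved_Data",
    "TCM_Web_Interleaved_Data",
    "TCM_pretrain_synthesized_vision",
]


def group_by_subset(paths):
    """Group file paths by the subset name they contain.

    Assumption: each file path includes its subset name somewhere.
    Paths that match no subset are collected under "other".
    """
    groups = defaultdict(list)
    for path in paths:
        for name in SUBSETS:
            if name in path:
                groups[name].append(path)
                break
        else:
            groups["other"].append(path)
    return dict(groups)


def fetch_repo_files():
    """List all files in the dataset repository (needs network access)."""
    # Imported lazily so the grouping helper works without the dependency.
    from huggingface_hub import list_repo_files

    return list_repo_files(REPO_ID, repo_type="dataset")
```

Calling `group_by_subset(fetch_repo_files())` would then show which files, if any, match each part; if the repository uses a different naming scheme, everything will land under "other" and the matching rule needs adjusting.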

Provider:

FreedomIntelligence