HuatuoGPT2-Pretraining-Instruction
Source: ModelScope community; updated 2025-11-27; indexed 2025-01-25
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/HuatuoGPT2-Pretraining-Instruction
Resource description:
## HuatuoGPT2-Pretraining-Instruction-5200K
This dataset contains the pre-training instructions for HuatuoGPT-II, generated with **ChatGPT** from a medical corpus of about **5.2 million** entries.
This dataset is used to incorporate extensive medical knowledge and enable a one-stage medical adaptation. All our data have been made publicly accessible.
## Data Volume
The following table details the volume and distribution of the pre-training data for HuatuoGPT-II, counted in entries per source:
| Data Source | Entries |
| --------------------------- | ----------- |
| Medical_Web_Corpus_cn | 640,621 |
| Medical_Web_Corpus_en | 394,490 |
| Medical_Literature_cn | 177,261 |
| Medical_Literature_en | 878,241 |
| Medical_Encyclopedia_cn | 411,183 |
| Medical_Encyclopedia_en | 147,059 |
| Medical_Books_cn | 1,835,931 |
| Medical_Books_en | 801,522 |
| **Total** | **5,286,308** |
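
The total and the split between Chinese and English sources can be checked directly from the table. The sketch below is only an illustration: the counts are copied from the rows above, the `_cn`/`_en` suffix is assumed to mark the language of each source, and `totals_by_language` is a hypothetical helper name, not part of the HuatuoGPT-II repository.

```python
# Per-source entry counts, copied from the Data Volume table.
volumes = {
    "Medical_Web_Corpus_cn": 640_621,
    "Medical_Web_Corpus_en": 394_490,
    "Medical_Literature_cn": 177_261,
    "Medical_Literature_en": 878_241,
    "Medical_Encyclopedia_cn": 411_183,
    "Medical_Encyclopedia_en": 147_059,
    "Medical_Books_cn": 1_835_931,
    "Medical_Books_en": 801_522,
}


def totals_by_language(counts):
    """Sum entry counts per language suffix (hypothetical helper).

    Assumes every source name ends in "_cn" or "_en".
    """
    totals = {}
    for name, n in counts.items():
        lang = name.rsplit("_", 1)[-1]
        totals[lang] = totals.get(lang, 0) + n
    return totals


total = sum(volumes.values())
by_lang = totals_by_language(volumes)

print(f"total entries: {total:,}")
for lang, n in sorted(by_lang.items()):
    print(f"{lang}: {n:,} ({n / total:.1%})")

# Per-source share of the whole corpus, largest first.
for name, n in sorted(volumes.items(), key=lambda kv: -kv[1]):
    print(f"{name:<26}{n:>12,}{n / total:>8.1%}")
```

The sum reproduces the table's total of 5,286,308, with 3,064,996 entries from Chinese sources and 2,221,312 from English sources, so Chinese sources make up a little under 58% of the corpus.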
## Repository
- **GitHub:** https://github.com/FreedomIntelligence/HuatuoGPT-II
## Citations
```
@misc{chen2023huatuogptii,
title={HuatuoGPT-II: One-Stage Training for Medical Adaptation of Large Language Models},
author={Junying Chen and Xidong Wang and Anningzhe Gao and Feng Jiang and Shunian Chen and Hongbo Zhang and Dingjie Song and Wenya Xie and Chuyi Kong and Jianquan Li and Xiang Wan and Haizhou Li and Benyou Wang},
year={2023},
eprint={2311.09774},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@article{huatuogpt-2023,
title={HuatuoGPT: Pioneering the Integration of Medical Expertise into Language Models},
author={Hongbo Zhang and Junying Chen and Feng Jiang and Fei Yu and Zhihong Chen and Jianquan Li and Guiming Chen and Xiangbo Wu and Zhiyi Zhang and Qingying Xiao and Xiang Wan and Benyou Wang and Haizhou Li},
journal={arXiv preprint arXiv:2305.15075},
year={2023}
}
```
Provider:
maas
Created:
2025-01-20



