
HuatuoGPT2-Pretraining-Instruction

Source: ModelScope. Updated 2025-11-27; indexed 2025-01-25.
Download link:
https://modelscope.cn/datasets/FreedomIntelligence/HuatuoGPT2-Pretraining-Instruction
Description:
## HuatuoGPT2-Pretraining-Instruction-5200K

This dataset contains the pre-training instructions for HuatuoGPT-II, generated with **ChatGPT** from a medical corpus of **5.2 million** entries. It is used to incorporate extensive medical knowledge into the model and to enable one-stage medical adaptation. All of the data is publicly accessible.

## Data Volume

The following table details the volume and distribution of the pre-training data for HuatuoGPT2:

| Data Source                 | Data Volume   |
| --------------------------- | ------------- |
| Medical_Web_Corpus_cn       | 640,621       |
| Medical_Web_Corpus_en       | 394,490       |
| Medical_Literature_cn       | 177,261       |
| Medical_Literature_en       | 878,241       |
| Medical_Encyclopedia_cn     | 411,183       |
| Medical_Encyclopedia_en     | 147,059       |
| Medical_Books_cn            | 1,835,931     |
| Medical_Books_en            | 801,522       |
| **Total**                   | **5,286,308** |

## Repository

- **Github:** https://github.com/FreedomIntelligence/HuatuoGPT-II

## Citations

```
@misc{chen2023huatuogptii,
      title={HuatuoGPT-II: One-Stage Training for Medical Adaptation of Large Language Models},
      author={Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, and Benyou Wang},
      year={2023},
      eprint={2311.09774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@article{huatuogpt-2023,
      title={HuatuoGPT: Pioneering the Integration of Medical Expertise into Language Models},
      author={Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li},
      journal={arXiv preprint arXiv:2305.15075},
      year={2023}
}
```
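The per-source counts in the Data Volume table can be cross-checked against the stated total, and grouped by language, with a few lines of Python. This is only an illustrative sketch: the counts are copied from the table, while the names `volumes` and `totals_by_language` are hypothetical, and the grouping assumes the `_cn` and `_en` suffixes mark Chinese and English sources respectively.

```python
# Per-source counts copied from the Data Volume table.
volumes = {
    "Medical_Web_Corpus_cn": 640_621,
    "Medical_Web_Corpus_en": 394_490,
    "Medical_Literature_cn": 177_261,
    "Medical_Literature_en": 878_241,
    "Medical_Encyclopedia_cn": 411_183,
    "Medical_Encyclopedia_en": 147_059,
    "Medical_Books_cn": 1_835_931,
    "Medical_Books_en": 801_522,
}


def totals_by_language(counts):
    """Sum counts by the suffix after the last underscore (assumed: cn or en)."""
    totals = {}
    for name, n in counts.items():
        lang = name.rsplit("_", 1)[-1]
        totals[lang] = totals.get(lang, 0) + n
    return totals


total = sum(volumes.values())
by_lang = totals_by_language(volumes)

print(f"total: {total:,}")
for lang, n in sorted(by_lang.items()):
    # Share of each language in the overall corpus.
    print(f"{lang}: {n:,} ({n / total:.1%})")
```

The sum of the eight rows matches the table's total of 5,286,308, with 3,064,996 entries from Chinese sources and 2,221,312 from English sources.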

Provider:
maas
Created:
2025-01-20