five

qingdu-giter/HuatuoGPT2-Pretraining-Instruction

收藏
Hugging Face2026-04-14 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/qingdu-giter/HuatuoGPT2-Pretraining-Instruction
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 task_categories: - question-answering - text-generation language: - zh tags: - medical - biology size_categories: - 5M<n<6M dataset_info: features: - name: id dtype: string - name: conversations list: - name: from dtype: string - name: value dtype: string configs: - config_name: Meidcal_Encyclopedia_en data_files: data/HuatuoGPT2_Pretrain_Meidcal_Encyclopedia_en.json - config_name: Meidcal_Encyclopedia_cn data_files: data/HuatuoGPT2_Pretrain_Meidcal_Encyclopedia_cn.json - config_name: Meidcal_Books_en data_files: data/HuatuoGPT2_Pretrain_Meidcal_Books_en.json - config_name: Meidcal_Literature_en data_files: data/HuatuoGPT2_Pretrain_Meidcal_Literature_en.json - config_name: Meidcal_Literature_cn data_files: data/HuatuoGPT2_Pretrain_Meidcal_Literature_cn.json - config_name: Meidcal_Web_en data_files: data/HuatuoGPT2_Pretrain_Meidcal_Web_Corpus_en.json - config_name: Meidcal_Web_cn data_files: data/HuatuoGPT2_Pretrain_Meidcal_Web_Corpus_cn.json --- ## HuatuoGPT2-Pretraining-Instruction-5200K Here are the pre-training instructions for HuatuoGPT-II, developed with **5.2 million** medical corpus using **ChatGPT**. This dataset is used to incorporate extensive medical knowledge and enable a one-stage medical adaptation. All our data have been made publicly accessible. ## Data Volume The following table details the volume and distribution of pre-training data for HuatuoGPT2: | Data Source | Data Volume | | --------------------------- | ----------- | | Medical_Web_Corpus_cn | 640,621 | | Medical_Web_Corpus_en | 394,490 | | Medical_Literature_cn | 177,261 | | Medical_Literature_en | 878,241 | | Medical_Encyclopedia_cn | 411,183 | | Medical_Encyclopedia_en | 147,059 | | Medical_Books_cn | 1,835,931 | | Medical_Books_en | 801,522 | | **Total** | **5,286,308** | ## Repository - **Github:** https://github.com/FreedomIntelligence/HuatuoGPT-II ## Citations ``` @misc{chen2023huatuogptii, title={HuatuoGPT-II: One-Stage Training for Medical Adaptation of Large Language Models}, author={Junying Chen, Xidong Wang, Anningzhe Gao, Feng Jiang, Shunian Chen, Hongbo Zhang, Dingjie Song, Wenya Xie, Chuyi Kong, Jianquan Li, Xiang Wan, Haizhou Li, and Benyou Wang}, year={2023}, eprint={2311.09774}, archivePrefix={arXiv}, primaryClass={cs.CL} } @article{huatuogpt-2023, title={HuatuoGPT: Pioneering the Integration of Medical Expertise into Language Models}, author={Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li}, journal={arXiv preprint arXiv:2305.15075}, year={2023} } ```
提供机构:
qingdu-giter
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作