中文微调数据集
收藏魔搭社区2026-05-23 更新2024-05-15 收录
下载链接:
https://modelscope.cn/datasets/zhuangxialie/SFT-Chinese-Dataset
下载链接
链接失效反馈官方服务:
资源简介:
## 中文微调数据集
### 附带Python脚本,可统一转为ShareGPT格式
### firefly-train-1.1M
- 包含了23种常见的中文NLP任务的数据,并且构造了许多与中华文化相关的数据,如对联、作诗、文言文翻译、散文、金庸小说等。对于每个任务,由人工书写若干种指令模板,保证数据的高质量与丰富度,数据量为115万。
### CodeChat
- 主要包含逻辑推理、代码问答、代码生成相关语料样本。
### shareAIShareGPT-Chinese-English-90k
- 中英文平行双语优质人机问答数据集,覆盖真实复杂场景下的用户提问。(包含大量多轮对话)
### ruozhiba
- 弱智吧数据问答,据说比较锻炼模型的心智能力。
### 整理好的ShareGPT文件(包含以上全部数据集)
- https://modelscope.cn/datasets/zhuangxialie/Llama3-Chinese-Dataset/files
#### 下载方法
:modelscope-code[]{type="sdk"}
:modelscope-code[]{type="git"}
## Chinese Fine-tuning Dataset
### Equipped with Python scripts that can uniformly convert the dataset into ShareGPT format
### firefly-train-1.1M
- Contains data from 23 common Chinese NLP tasks, and constructs a large amount of data related to Chinese culture, such as couplets, poem composition, classical Chinese translation, prose, Jin Yong's novels, etc. For each task, multiple instruction templates are manually written to ensure the high quality and richness of the data, with a total of 1.15 million data samples.
### CodeChat
- Mainly contains corpus samples related to logical reasoning, code question answering and code generation.
### shareAIShareGPT-Chinese-English-90k
- A high-quality parallel bilingual human-machine question answering dataset in Chinese and English, covering user questions in real and complex scenarios (including a large number of multi-turn conversations).
### ruozhiba
- Question answering data from Ruozhiba Bar, which is reported to help enhance the mental capability of AI models.
### Packaged ShareGPT-format Files (Including All the Above Datasets)
- https://modelscope.cn/datasets/zhuangxialie/Llama3-Chinese-Dataset/files
#### Download Methods
:modelscope-code[]{type="sdk"}
:modelscope-code[]{type="git"}
提供机构:
maas
创建时间:
2024-04-24
搜集汇总
数据集介绍

背景与挑战
背景概述
中文微调数据集包含多个子数据集,如firefly-train-1.1M、CodeChat等,覆盖23种中文NLP任务及文化相关数据,总量达115万条。数据集还提供转换为ShareGPT格式的脚本,适用于复杂场景下的多轮对话和代码生成任务。
以上内容由遇见数据集搜集并总结生成



