ultrachat_200k
Source: ModelScope community (魔搭社区) · Updated 2026-05-23 · Indexed 2024-05-15
Download link:
https://modelscope.cn/datasets/swift/ultrachat_200k
Resource description:
# Dataset Card for UltraChat 200k
## Dataset Description
This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state-of-the-art 7B chat model.
The original dataset consists of 1.4M dialogues generated by ChatGPT, spanning a wide range of topics. To create `UltraChat 200k`, we applied the following logic:
- Selection of a subset of data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.
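The last two steps can be illustrated with a minimal sketch. This is not the authors' pipeline: the phrase list contains only the two examples quoted above, `truecase` is a crude regex heuristic (it would also capitalize after abbreviations such as "e.g."), and the names `REFUSAL_PHRASES`, `truecase`, `has_refusal`, and `filter_dialogues` are hypothetical.

```python
import re

# Hypothetical phrase list: only the two examples quoted in the card.
REFUSAL_PHRASES = ("i do not have emotions", "i don't have opinions")


def truecase(text: str) -> str:
    """Capitalize the first letter of the text and of each sentence after '.', '!' or '?'."""
    def upper_match(m: re.Match) -> str:
        return m.group(1) + m.group(2).upper()

    return re.sub(r"(^\s*|[.!?]\s+)([a-z])", upper_match, text)


def has_refusal(messages: list[dict]) -> bool:
    """Return True if any assistant message contains one of the listed phrases."""
    return any(
        m["role"] == "assistant"
        and any(phrase in m["content"].lower() for phrase in REFUSAL_PHRASES)
        for m in messages
    )


def filter_dialogues(dialogues: list[dict]) -> list[dict]:
    """Drop dialogues with refusal phrases and truecase the remaining messages."""
    kept = []
    for d in dialogues:
        if has_refusal(d["messages"]):
            continue
        cleaned = [{**m, "content": truecase(m["content"])} for m in d["messages"]]
        kept.append({**d, "messages": cleaned})
    return kept
```

A production filter would likely need a longer phrase list and a proper truecasing model; the sketch only shows the shape of the two operations.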
## Dataset Structure
The dataset has four splits, suitable for:
* Supervised fine-tuning (`sft`).
* Generation ranking (`gen`) via techniques like rejection sampling or PPO.
The number of examples per split is as follows:
| train_sft | test_sft | train_gen | test_gen |
|:-------:|:-----------:|:-----:| :-----:|
| 207865 | 23110 | 256032 | 28304 |
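As a quick sanity check on these counts, the short sketch below computes the total size and held-out fraction for each task. The numbers come from the table; the `held_out_fraction` helper is a hypothetical name introduced for illustration.

```python
# Split sizes copied from the table above.
split_sizes = {
    "train_sft": 207865,
    "test_sft": 23110,
    "train_gen": 256032,
    "test_gen": 28304,
}


def held_out_fraction(task: str) -> float:
    """Fraction of a task's examples that fall in its test split."""
    train = split_sizes[f"train_{task}"]
    test = split_sizes[f"test_{task}"]
    return test / (train + test)


for task in ("sft", "gen"):
    total = split_sizes[f"train_{task}"] + split_sizes[f"test_{task}"]
    print(f"{task}: {total} examples, held-out fraction {held_out_fraction(task):.3f}")
```

Both tasks hold out roughly 10% of their examples for testing.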
The dataset is stored in parquet format with each entry using the following schema:
```
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```
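A small checker can make the layout of this example concrete. The sketch below assumes, based only on the example entry, that roles alternate starting with `user` and that the first message repeats `prompt`; the card does not state that either holds for every entry. The function name `check_entry` is hypothetical.

```python
def check_entry(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry matches the example layout."""
    problems = [f"missing key: {key}" for key in ("prompt", "messages", "prompt_id") if key not in entry]
    if problems:
        return problems

    messages = entry["messages"]
    if not messages:
        return ["messages is empty"]

    for i, msg in enumerate(messages):
        # Assumption from the example: user and assistant turns alternate, user first.
        expected = "user" if i % 2 == 0 else "assistant"
        if msg.get("role") != expected:
            problems.append(f"message {i}: expected role {expected!r}, got {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i}: content is not a string")

    # Assumption from the example: the first user message repeats the prompt.
    if messages[0].get("content") != entry["prompt"]:
        problems.append("first message does not repeat the prompt")
    return problems


example = {
    "prompt": "Create a fully-developed protagonist ...",
    "messages": [
        {"content": "Create a fully-developed protagonist ...", "role": "user"},
        {"content": "Name: Ava ...", "role": "assistant"},
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af",
}
print(check_entry(example))
```

Returning a list of problems rather than raising on the first one makes it easier to survey how often each assumption fails across a split.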
## Citation
If you find this dataset useful in your work, please cite the original UltraChat dataset:
```
@misc{ding2023enhancing,
title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
year={2023},
eprint={2305.14233},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Example Code
```python
from modelscope import MsDataset
from modelscope.utils.constant import DownloadMode

# Load the dataset from ModelScope, forcing a fresh download instead of reusing the local cache.
ds = MsDataset.load('AI-ModelScope/ultrachat_200k', subset_name='default', split='train', download_mode=DownloadMode.FORCE_REDOWNLOAD)
# Print the first entry.
print(next(iter(ds)))
```
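To inspect an entry more comfortably than with a raw `print`, the messages can be flattened into a plain transcript. This is a minimal sketch that assumes each entry is a dictionary with the `messages` layout described in this card; `to_transcript` is a hypothetical helper, not part of ModelScope.

```python
def to_transcript(entry: dict) -> str:
    """Render the messages of one entry as 'role: content' lines."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in entry["messages"])


sample = {
    "messages": [
        {"content": "Hello. How are you?", "role": "user"},
        {"content": "Doing well, thanks!", "role": "assistant"},
    ]
}
print(to_transcript(sample))
```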
## Additional Citation
In addition to the original UltraChat paper cited above, you can also cite the Zephyr 7B technical report:
```
@misc{tunstall2023zephyr,
title={Zephyr: Direct Distillation of LM Alignment},
author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
year={2023},
eprint={2310.16944},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
Provider:
maas
Created:
2024-06-05
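Background and Challenges
Background Overview
UltraChat 200k is a filtered version of the UltraChat dataset containing roughly 200,000 ChatGPT-generated dialogues. After truecasing and removal of low-quality replies, it was used to train the Zephyr-7B-β model. The dataset is divided into splits for two tasks, supervised fine-tuning and generation ranking, which makes it suitable for training and evaluating chat models. It is stored in parquet format, and each entry contains a multi-turn dialogue between a user and an assistant.
The summary above was collected and generated by 遇见数据集.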



