ultrachat_200k

ModelScope Community · Updated 2026-01-06 · Indexed 2024-05-15

Download link:

https://modelscope.cn/datasets/AI-ModelScope/ultrachat_200k


Resource description:

# Dataset Card for UltraChat 200k

## Example code

```python
from modelscope import MsDataset
from modelscope.utils.constant import DownloadMode

# Load the dataset from ModelScope, forcing a fresh download.
ds = MsDataset.load(
    'AI-ModelScope/ultrachat_200k',
    subset_name='default',
    split='train',
    download_mode=DownloadMode.FORCE_REDOWNLOAD,
)

# Print the first example.
print(next(iter(ds)))
```

## Dataset Description

This is a heavily filtered version of the [UltraChat](https://github.com/thunlp/UltraChat) dataset and was used to train [Zephyr-7B-β](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a state-of-the-art 7B chat model.

The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create `UltraChat 200k`, we applied the following logic:

- Selection of a subset of data for faster supervised fine-tuning.
- Truecasing of the dataset, as we observed around 5% of the data contained grammatical errors like "Hello. how are you?" instead of "Hello. How are you?"
- Removal of dialogues where the assistant replies with phrases like "I do not have emotions" or "I don't have opinions", even for fact-based prompts that don't involve either.

## Dataset Structure

The dataset has four splits, suitable for:

* Supervised fine-tuning (`sft`).
* Generation ranking (`gen`) via techniques like rejection sampling or PPO.

The number of examples per split is as follows:

| train_sft | test_sft | train_gen | test_gen |
|:---------:|:--------:|:---------:|:--------:|
| 207865 | 23110 | 256032 | 28304 |

The dataset is stored in parquet format, with each entry using the following schema:

```
{
    "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
    "messages": [
        {
            "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...",
            "role": "user"
        },
        {
            "content": "Name: Ava\n\n Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...",
            "role": "assistant"
        },
        {
            "content": "Wow, Ava's story is so intense and inspiring! Can you provide me with more details. ...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        },
        {
            "content": "That's really interesting! I would love to hear more...",
            "role": "user"
        },
        {
            "content": "Certainly! ....",
            "role": "assistant"
        }
    ],
    "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af"
}
```

## Citation

If you find this dataset useful in your work, please cite the original UltraChat dataset:

```
@misc{ding2023enhancing,
      title={Enhancing Chat Language Models by Scaling High-quality Instructional Conversations},
      author={Ning Ding and Yulin Chen and Bokai Xu and Yujia Qin and Zhi Zheng and Shengding Hu and Zhiyuan Liu and Maosong Sun and Bowen Zhou},
      year={2023},
      eprint={2305.14233},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

You may also wish to cite the Zephyr 7B technical report:

```
@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment},
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
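As an illustration of the entry schema and the phrase-based filtering described in the dataset card, the following is a minimal sketch in plain Python. It is not the authors' filtering code: the function names `is_valid_entry` and `has_refusal_phrase` are hypothetical, the phrase list contains only the two phrases quoted in the card, and the checks that roles strictly alternate starting with `user` and that `prompt` equals the first user message are assumptions drawn from the single example entry shown.

```python
from typing import Any

# Only the two phrases quoted in the dataset card; the full list used by the
# dataset authors is not given there, so this tuple is an assumption.
REFUSAL_PHRASES = ("I do not have emotions", "I don't have opinions")


def is_valid_entry(entry: dict[str, Any]) -> bool:
    """Check an entry against the schema shown in the dataset card."""
    # "prompt" and "prompt_id" are both strings in the example entry.
    if not all(isinstance(entry.get(key), str) for key in ("prompt", "prompt_id")):
        return False
    messages = entry.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            return False
        # Assumption: turns alternate user, assistant, user, ...
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message.get("role") != expected_role:
            return False
        if not isinstance(message.get("content"), str):
            return False
    # Assumption: the prompt repeats the first user message, as in the example.
    return messages[0]["content"] == entry["prompt"]


def has_refusal_phrase(entry: dict[str, Any]) -> bool:
    """Return True if any assistant turn contains one of REFUSAL_PHRASES."""
    return any(
        phrase.lower() in message["content"].lower()
        for message in entry["messages"]
        if message["role"] == "assistant"
        for phrase in REFUSAL_PHRASES
    )


# A shortened, hypothetical entry in the documented shape.
example = {
    "prompt": "Create a fully-developed protagonist.",
    "messages": [
        {"content": "Create a fully-developed protagonist.", "role": "user"},
        {"content": "Name: Ava", "role": "assistant"},
    ],
    "prompt_id": "0" * 64,  # placeholder, not a real prompt_id
}

print(is_valid_entry(example))
print(has_refusal_phrase(example))
```

A filter in this style would keep only entries for which `is_valid_entry` is true and `has_refusal_phrase` is false. Matching is case-insensitive here so that the check does not depend on whether truecasing has already been applied.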

Provider:

maas

Created:

2023-12-05

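AI-collected summary

Dataset introduction

Background and challenges

Background overview

UltraChat 200k is a filtered version of the UltraChat dataset containing about 200,000 dialogues generated by ChatGPT. It was processed with truecasing and the removal of low-quality replies, and was used to train the Zephyr-7B-β model. The dataset is divided into splits for two tasks, supervised fine-tuning and generation ranking, and is suited to training and evaluating chat models. It is stored in parquet format, and each entry contains a multi-turn dialogue between a user and an assistant.

The above content was collected and summarized by AI.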
