abhinand/ultrachat_200k_sharegpt
收藏数据集概述
数据集描述
这是一个经过严格筛选的UltraChat数据集版本,用于训练Zephyr-7B-β,一个先进的7亿参数聊天模型。
原始数据集包含140万条由ChatGPT生成的对话,涵盖广泛的主题。为了创建UltraChat 200k,我们采用了以下逻辑:
- 选择数据子集以加快监督微调。
- 对数据集进行大小写修正,因为我们观察到约5%的数据包含语法错误,如“Hello. how are you?”而不是“Hello. How are you?”。
- 删除助手回复中包含“I do not have emotions”或“I dont have opinions”等短语的对话,即使在基于事实的提示中不涉及这些内容。
数据集结构
数据集包含四个部分,适用于:
- 监督微调(
sft)。 - 生成排序(
gen),通过拒绝采样或PPO等技术。
各部分的示例数量如下:
| train_sft | test_sft | train_gen | test_gen |
|---|---|---|---|
| 207865 | 23110 | 256032 | 28304 |
数据集以parquet格式存储,每个条目使用以下模式:
json { "prompt": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...", "messages":[ { "content": "Create a fully-developed protagonist who is challenged to survive within a dystopian society under the rule of a tyrant. ...", "role": "user" }, { "content": "Name: Ava
Ava was just 16 years old when the world as she knew it came crashing down. The government had collapsed, leaving behind a chaotic and lawless society. ...", "role": "assistant" }, { "content": "Wow, Avas story is so intense and inspiring! Can you provide me with more details. ...", "role": "user" }, { "content": "Certainly! ....", "role": "assistant" }, { "content": "Thats really interesting! I would love to hear more...", "role": "user" }, { "content": "Certainly! ....", "role": "assistant" } ], "prompt_id": "d938b65dfe31f05f80eb8572964c6673eddbd68eff3db6bd234d7f1e3b86c2af" }




