Zikrihakim66/Vioner

Name: Zikrihakim66/Vioner
Creator: Zikrihakim66
Published: 2025-12-15 03:43:32
License: 暂无描述

Hugging Face2025-12-15 更新2025-12-20 收录

下载链接：

https://hf-mirror.com/datasets/Zikrihakim66/Vioner

下载链接

链接失效反馈

官方服务：

资源简介：

这是UltraChat数据集的一个严格过滤版本，用于训练最先进的7b聊天模型Zephyr-7B-β。原始数据集包含140万个由ChatGPT生成的对话，涵盖广泛的主题。UltraChat 200k通过以下方式处理：选择子集以加快监督微调，纠正语法错误（如大小写错误），并删除助手回复中不合适的对话（如“我没有情感”或“我没有意见”）。数据集分为四个部分，适用于监督微调（sft）和生成排名（gen）。每个部分的具体示例数量如下：train_sft（207865）、test_sft（23110）、train_gen（256032）、test_gen（28304）。数据集以parquet格式存储，每个条目包含prompt、messages和prompt_id。

This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state of the art 7b chat model. The original datasets consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic: Selection of a subset of data for faster supervised fine tuning, truecasing of the dataset, and removal of dialogues where the assistant replies with phrases like I do not have emotions or I dont have opinions. The dataset has four splits, suitable for supervised fine-tuning (sft) and generation ranking (gen) via techniques like rejection sampling or PPO. The number of examples per split is as follows: train_sft (207865), test_sft (23110), train_gen (256032), test_gen (28304). The dataset is stored in parquet format with each entry containing prompt, messages, and prompt_id.

提供机构：

Zikrihakim66

5,000+

优质数据集

54 个

任务类型

进入经典数据集