onedevelopment/oneai-1.2-dataset

Name: onedevelopment/oneai-1.2-dataset
Creator: onedevelopment
Published: 2026-04-24 17:51:32
License: 暂无描述

Hugging Face2026-04-24 更新2026-04-26 收录

下载链接：

https://hf-mirror.com/datasets/onedevelopment/oneai-1.2-dataset

下载链接

链接失效反馈

官方服务：

资源简介：

该项目包含一个数据集（及生成脚本），用于通过监督微调（SFT）过程训练对话语言模型。主要数据文件为`sftdataset.json`。数据集包含50,000个合成对话示例，专门设计用于训练原始新初始化的模型，使其表现得像一个有用且礼貌的AI助手。数据分布如下：30%为问候语（普通对话/问候），20%为身份问题（模型身份一致性），50%为常识（世界基本事实和简单互动）。当前版本故意不包含数学任务，以集中小型模型（如38M参数）的能力于语言流畅性和自然性。数据采用标准化的`messages`结构（类似OpenAI Chat API），并针对Hugging Face的`datasets`库和`trl`的`SFTTrainer`工具进行了优化。

This project contains a dataset (and scripts to generate it) intended for training conversational language models using the Supervised Fine-Tuning (SFT) process. The main data file is `sftdataset.json`. The dataset contains exactly 50,000 synthetic conversational examples, specially prepared to teach a raw, newly initialized model to behave like a helpful and polite AI assistant. The data distribution is as follows: 30% - Greetings (ordinary conversations/greetings), 20% - Identity (model identity consistency), 50% - General Knowledge (basic world facts and simple interactions). The current version deliberately does not include mathematical tasks to focus the small models power (e.g., 38M parameters) on being linguistically fluent and natural. The data has a standardized messages structure (often seen in OpenAI Chat API) and is optimized for the Hugging Face `datasets` library and the `SFTTrainer` tool from `trl`.

提供机构：

onedevelopment

5,000+

优质数据集

54 个

任务类型

进入经典数据集