five

onedevelopment/oneai-1.2-dataset

收藏
Hugging Face2026-04-24 更新2026-04-26 收录
下载链接:
https://hf-mirror.com/datasets/onedevelopment/oneai-1.2-dataset
下载链接
链接失效反馈
官方服务:
资源简介:
该项目包含一个数据集(及生成脚本),用于通过监督微调(SFT)过程训练对话语言模型。主要数据文件为`sftdataset.json`。数据集包含50,000个合成对话示例,专门设计用于训练原始新初始化的模型,使其表现得像一个有用且礼貌的AI助手。数据分布如下:30%为问候语(普通对话/问候),20%为身份问题(模型身份一致性),50%为常识(世界基本事实和简单互动)。当前版本故意不包含数学任务,以集中小型模型(如38M参数)的能力于语言流畅性和自然性。数据采用标准化的`messages`结构(类似OpenAI Chat API),并针对Hugging Face的`datasets`库和`trl`的`SFTTrainer`工具进行了优化。

This project contains a dataset (and scripts to generate it) intended for training conversational language models using the Supervised Fine-Tuning (SFT) process. The main data file is `sftdataset.json`. The dataset contains exactly 50,000 synthetic conversational examples, specially prepared to teach a raw, newly initialized model to behave like a helpful and polite AI assistant. The data distribution is as follows: 30% - Greetings (ordinary conversations/greetings), 20% - Identity (model identity consistency), 50% - General Knowledge (basic world facts and simple interactions). The current version deliberately does not include mathematical tasks to focus the small models power (e.g., 38M parameters) on being linguistically fluent and natural. The data has a standardized messages structure (often seen in OpenAI Chat API) and is optimized for the Hugging Face `datasets` library and the `SFTTrainer` tool from `trl`.
提供机构:
onedevelopment
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作