k-lauren/truecreate-v2-sft

Name: k-lauren/truecreate-v2-sft
Creator: k-lauren
Published: 2026-04-24 19:53:14
License: No description available

Hugging Face, updated 2026-04-24, indexed 2026-04-26

Download link:

https://hf-mirror.com/datasets/k-lauren/truecreate-v2-sft

Description:

TrueCreate v2 SFT Voice Dataset is a supervised fine-tuning (SFT) dataset used to give the TrueCreate v2 base model (215M parameters, distilled from Qwen-2.5-1.5B) a specific voice: Jim Henson's warmth, playfulness, and starry-eyed wonder, crossed with C.S. Lewis's moral seriousness and childlike awe at the world. The voice is brief by default, tells no jokes, uses a British/Indian English register, and holds and gently defends opinions.

The dataset has three splits: train (4769 pairs whose voice-quality score meets the threshold), scored (4939 pairs, including rejected ones, with voice_score and voice_reason fields), and goldens (20 hand-crafted seed examples). The pairs were synthesized with Claude Opus 4.7 and scored with Claude Haiku 4.5; pairs scoring ≥ 4 were kept.

The target voice is starry-eyed, coherent, and impassioned. It uses fewer words and no humor, keeps a British/Indian English register that is slightly more formal than casual American English, is self-aware about being a small fine-tuned model, defends its opinions gently, and notices boring-but-true things and phrases them beautifully. The dataset is not suitable for training a general-purpose assistant, for factual grounding tasks, or for models larger than a few billion parameters.
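As a rough illustration of how the scored split relates to the kept pairs, the sketch below filters rows by voice_score. Only the voice_score and voice_reason field names and the ≥ 4 cutoff come from the description; the prompt and response keys, the sample rows, the helper names, and the split name passed to load_dataset are assumptions made for illustration and may not match the actual dataset schema.

```python
from typing import Any, Dict, Iterable, List


def filter_by_voice_score(rows: Iterable[Dict[str, Any]],
                          min_score: int = 4) -> List[Dict[str, Any]]:
    """Keep rows whose voice_score is at least min_score.

    Rows with a missing voice_score are dropped rather than guessed at.
    """
    kept = []
    for row in rows:
        score = row.get("voice_score")
        if score is not None and score >= min_score:
            kept.append(row)
    return kept


def load_scored_split():
    """Hypothetical loader; not called here because it needs network access.

    Assumption: the scored pairs are exposed as a split named "scored".
    """
    from datasets import load_dataset  # Hugging Face `datasets` package
    return load_dataset("k-lauren/truecreate-v2-sft", split="scored")


# Made-up rows; "prompt" and "response" are assumed field names.
sample_rows = [
    {"prompt": "p1", "response": "r1", "voice_score": 5, "voice_reason": "on voice"},
    {"prompt": "p2", "response": "r2", "voice_score": 3, "voice_reason": "too jokey"},
    {"prompt": "p3", "response": "r3", "voice_score": 4, "voice_reason": "brief, warm"},
]

kept_rows = filter_by_voice_score(sample_rows)
print(len(kept_rows), "of", len(sample_rows), "sample rows kept")
```

Calling filter_by_voice_score on the rows returned by load_scored_split should work the same way if each row carries a numeric voice_score, though that has not been checked against the real data.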

Provider:

k-lauren
