
k-lauren/truecreate-v2-sft

Source: Hugging Face. Updated 2026-04-24; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/k-lauren/truecreate-v2-sft
Description:
TrueCreate v2 SFT Voice Dataset is a supervised fine-tuning dataset used to give the TrueCreate v2 base model (215M parameters, distilled from Qwen-2.5-1.5B) a specific voice: Jim Henson's warmth, playfulness, and starry-eyed wonder, crossed with C.S. Lewis's moral seriousness and childlike awe at the world. The target voice is starry-eyed, coherent, and impassioned. It is brief by default, avoids jokes, and uses a British/Indian English register that is slightly more formal than casual American English. It is self-aware about being a small fine-tuned model, holds opinions and defends them gently, and notices boring-but-true things and phrases them beautifully.

The dataset has three splits: train (4769 pairs whose voice-quality score meets the keep threshold), scored (4939 pairs, including rejected ones, with voice_score and voice_reason fields), and goldens (20 hand-crafted seed examples). The pairs were synthesized with Claude Opus 4.7 and scored with Claude Haiku 4.5; pairs scoring ≥ 4 were kept.

The dataset is not suitable for training a general-purpose assistant, for factual grounding tasks, or for models larger than a few billion parameters.
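The keep rule described above (retain pairs whose score is at least 4) can be sketched in a few lines of Python. This is only an illustration, not the author's pipeline: the `voice_score` and `voice_reason` field names come from the description, while the `prompt` and `response` keys, the sample rows, and the `select_train_pairs` helper are hypothetical.

```python
# Minimal sketch of a score-based keep rule, assuming each scored row is a dict.
# "voice_score" and "voice_reason" are named in the dataset description;
# "prompt" and "response" are hypothetical placeholders for the pair itself.

KEEP_THRESHOLD = 4  # the description says pairs scoring >= 4 were kept


def select_train_pairs(scored_rows, threshold=KEEP_THRESHOLD):
    """Return the rows whose voice_score meets or exceeds the threshold."""
    return [row for row in scored_rows if row["voice_score"] >= threshold]


# Hypothetical rows standing in for the scored split.
sample_rows = [
    {"prompt": "p1", "response": "r1", "voice_score": 5, "voice_reason": "warm, brief"},
    {"prompt": "p2", "response": "r2", "voice_score": 3, "voice_reason": "too jokey"},
    {"prompt": "p3", "response": "r3", "voice_score": 4, "voice_reason": "gentle opinion"},
]

kept = select_train_pairs(sample_rows)
rejected = [row for row in sample_rows if row not in kept]
```

Applied to the real scored split, a filter of this shape would be expected to separate the kept pairs from the rejected ones, with `voice_reason` available to inspect why a pair fell below the threshold.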
Provider:
k-lauren