bnovikov/SpeakEasy
Hugging Face. Updated 2026-04-03; indexed 2026-04-12.
Download link:
https://hf-mirror.com/datasets/bnovikov/SpeakEasy
---
license: apache-2.0
language:
- en
pretty_name: SpeakEasy
task_categories:
- text-generation
- text-to-speech
tags:
- voice-assistant
- sft
- tts
- spoken-style
- chatml
size_categories:
- 10K<n<100K
---
# SpeakEasy
A supervised fine-tuning dataset for building voice assistants that sound natural through text-to-speech. Responses are written in spoken style with full text normalization — no digits, no symbols, no markdown.
## Why this dataset?
Most LLM training data is written for screens. When read aloud by TTS, it sounds robotic — bullet points, URLs, abbreviations, and digits all break the illusion. SpeakEasy targets this gap: every response is crafted to be spoken aloud naturally.
## Dataset Details
- **Size:** ~11k examples (train + validation)
- **Format:** ChatML messages (system / user / assistant)
- **Generation model:** GPT-5.4-mini
- **Source prompts:** Sampled and filtered from CLINC OOS Plus
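For readers new to ChatML, a record in this format can be pictured as a list of role-tagged messages. The sketch below is illustrative only: the `messages` key, the system prompt text, and the `to_chatml` helper are assumptions rather than details taken from the dataset, and the `<|im_start|>` / `<|im_end|>` delimiters follow the common ChatML convention. Check the actual column names after loading.

```python
from typing import Dict, List

# Hypothetical record. The "messages" key and the system prompt are assumptions;
# the user and assistant text come from the example on this card.
example = {
    "messages": [
        {"role": "system", "content": "You are a friendly voice assistant."},
        {"role": "user", "content": "how do i get my check directly deposited"},
        {
            "role": "assistant",
            "content": (
                "The first step is to ask your employer or payer for their "
                "direct deposit setup form, then fill in your bank's routing "
                "number and account number."
            ),
        },
    ]
}


def to_chatml(messages: List[Dict[str, str]]) -> str:
    """Render role-tagged messages as a single ChatML-style string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    return "\n".join(parts) + "\n"


print(to_chatml(example["messages"]))
```

Most training frameworks apply a chat template themselves, so a helper like this is mainly useful for inspecting what the model would see.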
## Key Properties
**Spoken style** — contractions, fillers ("hmm", "oh", "well"), warm casual tone like talking to a knowledgeable friend.
**Full text normalization** — all digits spelled out ("twenty" not "20"), all symbols expanded ("percent" not "%"), dates/times/money/phone numbers in spoken form.
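The normalization rule above can be illustrated with a small sketch. This is not the pipeline used to build the dataset (the responses were generated by a model with a spoken-style system prompt); it is a hypothetical helper that only handles whole numbers from zero to ninety-nine and the percent sign. Decimals, dates, times, money, and phone numbers are out of scope here.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def spell_number(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand one- and two-digit numbers and percent signs into words."""
    # Handle "N%" first so the number and the symbol are expanded together.
    text = re.sub(
        r"(?<!\d)(\d{1,2})\s*%",
        lambda m: f"{spell_number(int(m.group(1)))} percent",
        text,
    )
    # Then spell out any remaining standalone one- or two-digit numbers.
    return re.sub(r"\b\d{1,2}\b", lambda m: spell_number(int(m.group())), text)


print(normalize("Save 20% in 45 days"))
```

Numbers with three or more digits are left untouched by this sketch, which makes the gaps easy to spot when checking text by eye.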
**Turn-taking** — procedural requests get one step at a time with a handoff ("Want me to go on?"). Ambiguous requests get a clarifying question.
**Voice-native** — no "send me" or "type in", only voice-appropriate phrasing like "tell me" or "let me know".
**No markdown** — zero bullet points, headers, bold, italic, code blocks, or links.
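The properties above lend themselves to a quick automated check. The following is a heuristic sketch, assuming a handful of regular expressions are a good enough proxy for the rules; the `CHECKS` table and the `spoken_style_violations` function are hypothetical names, not part of the dataset or any library, and the patterns will miss some cases (for example single-asterisk italics).

```python
import re
from typing import List

# Heuristic patterns, one per rule described above (assumed, not official).
CHECKS = {
    "digit": re.compile(r"\d"),
    "symbol": re.compile(r"[%$&@+=<>]"),
    "url": re.compile(r"https?://|www\."),
    "markdown": re.compile(
        r"(^\s*([-*]|#{1,6})\s)|\*\*|`|\[[^\]]*\]\([^)]*\)", re.MULTILINE
    ),
    "screen_phrase": re.compile(r"\b(send me|type in)\b", re.IGNORECASE),
}


def spoken_style_violations(text: str) -> List[str]:
    """Return the names of the rules that the text appears to break."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]


good = "It's about twenty percent, if that helps. Want me to go on?"
bad = "- Save 20% at https://example.com"
print(spoken_style_violations(good))
print(spoken_style_violations(bad))
```

A check like this could be run over model outputs after fine-tuning to see how often screen-style habits leak back in.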
## Categories
| Category | Description |
|----------|-------------|
| `step_gated_task` | Procedural how-to, one step at a time |
| `clarify_first` | Ambiguous request, asks for clarification |
| `direct_answer` | Factual question, direct spoken answer |
| `multi_turn_guided` | Multi-turn conversation with guided flow |
| `repair_and_rephrase` | Follow-up corrections and rephrasing |
| `boundary_and_safety_tone` | Safety-sensitive topics with appropriate tone |
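To see how examples are spread across the six categories in the table, a small counting helper is enough. The sketch below assumes each row carries its category under a `category` field; that field name is a guess and should be checked against the actual columns. The `category_counts` function is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Category names as listed in the table above.
CATEGORIES = (
    "step_gated_task",
    "clarify_first",
    "direct_answer",
    "multi_turn_guided",
    "repair_and_rephrase",
    "boundary_and_safety_tone",
)


def category_counts(
    rows: Iterable[Mapping[str, str]], field: str = "category"
) -> Dict[str, int]:
    """Count rows per category; `field` is an assumed column name."""
    counts = Counter(row[field] for row in rows)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected categories: {sorted(unknown)}")
    # Report every known category, including those with zero rows.
    return {name: counts.get(name, 0) for name in CATEGORIES}


# Toy rows standing in for real dataset records.
rows = [
    {"category": "direct_answer"},
    {"category": "clarify_first"},
    {"category": "direct_answer"},
]
print(category_counts(rows))
```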
## Example
```
User: how do i get my check directly deposited
Assistant: The first step is to ask your employer or payer for their direct deposit
setup form, then fill in your bank's routing number and account number. If you
want, I can go on and tell you how to find those numbers.
```
## Usage
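```python
from datasets import load_dataset
# Downloads from the Hugging Face Hub and returns a DatasetDict keyed by split name
ds = load_dataset("bnovikov/SpeakEasy")
print(ds)  # lists the available splits and their row counts
```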
## Source Datasets
This dataset was built using prompts sampled and filtered from:
- [**CLINC OOS Plus**](https://huggingface.co/datasets/clinc_oos_plus) (CC-BY-3.0) — intent classification dataset with short user utterances, used as seed prompts
Only the user-side prompts were used as seeds. All assistant responses were independently generated by GPT-5.4-mini with a custom spoken-style system prompt.
## Citation
If you use this dataset, please cite:
```bibtex
@misc{novikov2026speakeasy,
title={SpeakEasy: Spoken-Style SFT Dataset for Voice Assistants},
author={Borislav Novikov},
year={2026},
url={https://huggingface.co/datasets/bnovikov/SpeakEasy}
}
```
Provider: bnovikov