bnovikov/SpeakEasy

Name: bnovikov/SpeakEasy
Creator: bnovikov
Published: 2026-04-03 11:18:00
License: 暂无描述

Hugging Face2026-04-03 更新2026-04-12 收录

下载链接：

https://hf-mirror.com/datasets/bnovikov/SpeakEasy

下载链接

链接失效反馈

官方服务：

资源简介：

--- license: apache-2.0 language: - en pretty_name: SpeakEasy task_categories: - text-generation - text-to-speech tags: - voice-assistant - sft - tts - spoken-style - chatml size_categories: - 10K<n<100K --- # SpeakEasy A supervised fine-tuning dataset for building voice assistants that sound natural through text-to-speech. Responses are written in spoken style with full text normalization — no digits, no symbols, no markdown. ## Why this dataset? Most LLM training data is written for screens. When read aloud by TTS, it sounds robotic — bullet points, URLs, abbreviations, and digits all break the illusion. SpeakEasy targets this gap: every response is crafted to be spoken aloud naturally. ## Dataset Details - **Size:** ~11k examples (train + validation) - **Format:** ChatML messages (system / user / assistant) - **Generation model:** GPT-5.4-mini - **Source prompts:** Sampled and filtered from CLINC OOS Plus ## Key Properties **Spoken style** — contractions, fillers ("hmm", "oh", "well"), warm casual tone like talking to a knowledgeable friend. **Full text normalization** — all digits spelled out ("twenty" not "20"), all symbols expanded ("percent" not "%"), dates/times/money/phone numbers in spoken form. **Turn-taking** — procedural requests get one step at a time with a handoff ("Want me to go on?"). Ambiguous requests get a clarifying question. **Voice-native** — no "send me" or "type in", only voice-appropriate phrasing like "tell me" or "let me know". **No markdown** — zero bullet points, headers, bold, italic, code blocks, or links. ## Categories | Category | Description | |----------|-------------| | `step_gated_task` | Procedural how-to, one step at a time | | `clarify_first` | Ambiguous request, asks for clarification | | `direct_answer` | Factual question, direct spoken answer | | `multi_turn_guided` | Multi-turn conversation with guided flow | | `repair_and_rephrase` | Follow-up corrections and rephrasing | | `boundary_and_safety_tone` | Safety-sensitive topics with appropriate tone | ## Example ``` User: how do i get my check directly deposited Assistant: The first step is to ask your employer or payer for their direct deposit setup form, then fill in your bank's routing number and account number. If you want, I can go on and tell you how to find those numbers. ``` ## Usage ```python from datasets import load_dataset ds = load_dataset("bnovikov/SpeakEasy") ``` ## Source Datasets This dataset was built using prompts sampled and filtered from: - [**CLINC OOS Plus**](https://huggingface.co/datasets/clinc_oos_plus) (CC-BY-3.0) — intent classification dataset with short user utterances, used as seed prompts Only the user-side prompts were used as seeds. All assistant responses were independently generated by GPT-5.4-mini with a custom spoken-style system prompt. ## Citation If you use this dataset, please cite: ```bibtex @misc{novikov2026speakeasy, title={SpeakEasy: Spoken-Style SFT Dataset for Voice Assistants}, author={Borislav Novikov}, year={2026}, url={https://huggingface.co/datasets/bnovikov/SpeakEasy} } ```

--- 许可证：Apache-2.0 语言：英语数据集名称：SpeakEasy 任务类别： - 文本生成 - 文本转语音（Text-to-Speech，TTS）标签： - 语音助手 - 监督微调（Supervised Fine-tuning，SFT） - TTS - 口语化风格 - ChatML 样本量范围： - 1万 < 样本量 < 10万 --- # SpeakEasy 本数据集为一款监督微调（SFT）数据集，旨在构建通过文本转语音（TTS）呈现自然语音效果的语音助手。数据集的回复均采用口语化风格，并完成了完整的文本归一化处理——无数字、无特殊符号、无Markdown格式。 ## 数据集设计初衷当前绝大多数大语言模型（Large Language Model，LLM）的训练数据均适配屏幕阅读场景，当通过TTS朗读时会显得机械生硬：项目符号、URL链接、缩写及数字都会破坏自然语音的沉浸感。SpeakEasy正是针对这一技术空白打造，所有回复均专为自然口语表达而设计。 ## 数据集详情 - **样本量**：约1.1万条（训练集+验证集） - **数据格式**：ChatML格式对话消息（系统提示/用户提问/助手回复） - **生成模型**：GPT-5.4-mini - **源提示词来源**：从CLINC OOS Plus数据集中采样并筛选得到 ## 核心特性 **口语化风格**：包含缩约形式、填充语（如“嗯”“哦”“好吧”），语气温暖随意，如同与博学的友人交谈。 **完整文本归一化**：所有数字均以文字形式拼写（如“twenty”而非“20”），所有符号均展开为自然语言（如“percent”而非“%”），日期、时间、金额、电话号码均采用口语化表达形式。 **轮次交互规范**：程序性请求会分步骤呈现，并附带交接语（如“需要我继续吗？”）；模糊请求则会提出澄清问题。 **适配语音交互场景**：无“发送给我”或“输入”这类书面化表述，仅使用适合语音交互的措辞，如“告诉我”或“让我知晓”。 **无Markdown格式**：不包含项目符号、标题、粗体、斜体、代码块或链接。 ## 对话类别 | 类别 | 描述 | |------|------| | `step_gated_task` | 分步骤呈现的程序性操作指南 | | `clarify_first` | 针对模糊请求，先进行澄清提问 | | `direct_answer` | 针对事实类问题，给出直接的口语化回复 | | `multi_turn_guided` | 带有引导流程的多轮对话 | | `repair_and_rephrase` | 包含后续修正与重述的对话 | | `boundary_and_safety_tone` | 涉及安全敏感话题，采用恰当的语气 | ## 示例用户：如何让我的支票直接存入账户助手：第一步需要向你的雇主或付款方索要直接存款申请表，然后填写你银行的路由号码与账户号码。如果你愿意，我可以继续为你讲解如何查询这些号码。 ## 使用方法 python from datasets import load_dataset ds = load_dataset("bnovikov/SpeakEasy") ## 源数据集本数据集基于以下采样并筛选得到的提示词构建： - [**CLINC OOS Plus**](https://huggingface.co/datasets/clinc_oos_plus)（CC-BY-3.0协议）—— 一款包含短用户话语的意图分类数据集，被用作种子提示词。仅使用其中的用户侧提示词作为种子，所有助手回复均由GPT-5.4-mini基于定制的口语化系统提示独立生成。 ## 引用规范如果使用本数据集，请引用如下文献： bibtex @misc{novikov2026speakeasy, title={SpeakEasy: Spoken-Style SFT Dataset for Voice Assistants}, author={Borislav Novikov}, year={2026}, url={https://huggingface.co/datasets/bnovikov/SpeakEasy} }

提供机构：

bnovikov

5,000+

优质数据集

54 个

任务类型

进入经典数据集