five

bnovikov/SpeakEasy

收藏
Hugging Face2026-04-03 更新2026-04-12 收录
下载链接:
https://hf-mirror.com/datasets/bnovikov/SpeakEasy
下载链接
链接失效反馈
官方服务:
资源简介:
--- license: apache-2.0 language: - en pretty_name: SpeakEasy task_categories: - text-generation - text-to-speech tags: - voice-assistant - sft - tts - spoken-style - chatml size_categories: - 10K<n<100K --- # SpeakEasy A supervised fine-tuning dataset for building voice assistants that sound natural through text-to-speech. Responses are written in spoken style with full text normalization — no digits, no symbols, no markdown. ## Why this dataset? Most LLM training data is written for screens. When read aloud by TTS, it sounds robotic — bullet points, URLs, abbreviations, and digits all break the illusion. SpeakEasy targets this gap: every response is crafted to be spoken aloud naturally. ## Dataset Details - **Size:** ~11k examples (train + validation) - **Format:** ChatML messages (system / user / assistant) - **Generation model:** GPT-5.4-mini - **Source prompts:** Sampled and filtered from CLINC OOS Plus ## Key Properties **Spoken style** — contractions, fillers ("hmm", "oh", "well"), warm casual tone like talking to a knowledgeable friend. **Full text normalization** — all digits spelled out ("twenty" not "20"), all symbols expanded ("percent" not "%"), dates/times/money/phone numbers in spoken form. **Turn-taking** — procedural requests get one step at a time with a handoff ("Want me to go on?"). Ambiguous requests get a clarifying question. **Voice-native** — no "send me" or "type in", only voice-appropriate phrasing like "tell me" or "let me know". **No markdown** — zero bullet points, headers, bold, italic, code blocks, or links. ## Categories | Category | Description | |----------|-------------| | `step_gated_task` | Procedural how-to, one step at a time | | `clarify_first` | Ambiguous request, asks for clarification | | `direct_answer` | Factual question, direct spoken answer | | `multi_turn_guided` | Multi-turn conversation with guided flow | | `repair_and_rephrase` | Follow-up corrections and rephrasing | | `boundary_and_safety_tone` | Safety-sensitive topics with appropriate tone | ## Example ``` User: how do i get my check directly deposited Assistant: The first step is to ask your employer or payer for their direct deposit setup form, then fill in your bank's routing number and account number. If you want, I can go on and tell you how to find those numbers. ``` ## Usage ```python from datasets import load_dataset ds = load_dataset("bnovikov/SpeakEasy") ``` ## Source Datasets This dataset was built using prompts sampled and filtered from: - [**CLINC OOS Plus**](https://huggingface.co/datasets/clinc_oos_plus) (CC-BY-3.0) — intent classification dataset with short user utterances, used as seed prompts Only the user-side prompts were used as seeds. All assistant responses were independently generated by GPT-5.4-mini with a custom spoken-style system prompt. ## Citation If you use this dataset, please cite: ```bibtex @misc{novikov2026speakeasy, title={SpeakEasy: Spoken-Style SFT Dataset for Voice Assistants}, author={Borislav Novikov}, year={2026}, url={https://huggingface.co/datasets/bnovikov/SpeakEasy} } ```

--- 许可证:Apache-2.0 语言:英语 数据集名称:SpeakEasy 任务类别: - 文本生成 - 文本转语音(Text-to-Speech,TTS) 标签: - 语音助手 - 监督微调(Supervised Fine-tuning,SFT) - TTS - 口语化风格 - ChatML 样本量范围: - 1万 < 样本量 < 10万 --- # SpeakEasy 本数据集为一款监督微调(SFT)数据集,旨在构建通过文本转语音(TTS)呈现自然语音效果的语音助手。数据集的回复均采用口语化风格,并完成了完整的文本归一化处理——无数字、无特殊符号、无Markdown格式。 ## 数据集设计初衷 当前绝大多数大语言模型(Large Language Model,LLM)的训练数据均适配屏幕阅读场景,当通过TTS朗读时会显得机械生硬:项目符号、URL链接、缩写及数字都会破坏自然语音的沉浸感。SpeakEasy正是针对这一技术空白打造,所有回复均专为自然口语表达而设计。 ## 数据集详情 - **样本量**:约1.1万条(训练集+验证集) - **数据格式**:ChatML格式对话消息(系统提示/用户提问/助手回复) - **生成模型**:GPT-5.4-mini - **源提示词来源**:从CLINC OOS Plus数据集中采样并筛选得到 ## 核心特性 **口语化风格**:包含缩约形式、填充语(如“嗯”“哦”“好吧”),语气温暖随意,如同与博学的友人交谈。 **完整文本归一化**:所有数字均以文字形式拼写(如“twenty”而非“20”),所有符号均展开为自然语言(如“percent”而非“%”),日期、时间、金额、电话号码均采用口语化表达形式。 **轮次交互规范**:程序性请求会分步骤呈现,并附带交接语(如“需要我继续吗?”);模糊请求则会提出澄清问题。 **适配语音交互场景**:无“发送给我”或“输入”这类书面化表述,仅使用适合语音交互的措辞,如“告诉我”或“让我知晓”。 **无Markdown格式**:不包含项目符号、标题、粗体、斜体、代码块或链接。 ## 对话类别 | 类别 | 描述 | |------|------| | `step_gated_task` | 分步骤呈现的程序性操作指南 | | `clarify_first` | 针对模糊请求,先进行澄清提问 | | `direct_answer` | 针对事实类问题,给出直接的口语化回复 | | `multi_turn_guided` | 带有引导流程的多轮对话 | | `repair_and_rephrase` | 包含后续修正与重述的对话 | | `boundary_and_safety_tone` | 涉及安全敏感话题,采用恰当的语气 | ## 示例 用户:如何让我的支票直接存入账户 助手:第一步需要向你的雇主或付款方索要直接存款申请表,然后填写你银行的路由号码与账户号码。如果你愿意,我可以继续为你讲解如何查询这些号码。 ## 使用方法 python from datasets import load_dataset ds = load_dataset("bnovikov/SpeakEasy") ## 源数据集 本数据集基于以下采样并筛选得到的提示词构建: - [**CLINC OOS Plus**](https://huggingface.co/datasets/clinc_oos_plus)(CC-BY-3.0协议)—— 一款包含短用户话语的意图分类数据集,被用作种子提示词。 仅使用其中的用户侧提示词作为种子,所有助手回复均由GPT-5.4-mini基于定制的口语化系统提示独立生成。 ## 引用规范 如果使用本数据集,请引用如下文献: bibtex @misc{novikov2026speakeasy, title={SpeakEasy: Spoken-Style SFT Dataset for Voice Assistants}, author={Borislav Novikov}, year={2026}, url={https://huggingface.co/datasets/bnovikov/SpeakEasy} }
提供机构:
bnovikov
5,000+
优质数据集
54 个
任务类型
进入经典数据集
二维码
社区交流群

面向社区/商业的数据集话题

二维码
科研交流群

面向高校/科研机构的开源数据集话题

数据驱动未来

携手共赢发展

商业合作