
juanquivilla/sotto-transcript-cleanup

Source: Hugging Face · Updated 2026-04-12 · Indexed 2026-04-12
Download link:
https://hf-mirror.com/datasets/juanquivilla/sotto-transcript-cleanup
Description:
---
license: mit
task_categories:
- text-generation
language:
- en
size_categories:
- 100K<n<1M
tags:
- speech-to-text
- transcript-cleanup
- disfluency-correction
- synthetic-data
- sotto-asr
pretty_name: SottoASR Transcript Cleanup Dataset
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: validation
    path: data/validation-*
dataset_info:
  features:
  - name: input
    dtype: string
  - name: output
    dtype: string
  splits:
  - name: train
    num_bytes: 33192538
    num_examples: 135503
  - name: validation
    num_bytes: 1296731
    num_examples: 6921
  download_size: 18979669
  dataset_size: 34489269
---

# SottoASR Transcript Cleanup Dataset

<p align="center">
  <a href="https://sotto.app">sotto.app</a> ·
  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m">Trained Model (bf16)</a> ·
  <a href="https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m-mlx-5bit">MLX 5-bit Model</a>
</p>

## Overview

124K+ synthetic training pairs for fine-tuning small language models on speech-to-text transcript cleanup. This dataset was used to train the [SottoASR transcript cleanup model](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m), a 350M parameter model that **exceeds a prompted 2B model** on this task while being 8x faster.

Part of [**SottoASR**](https://sotto.app), a local, privacy-first speech-to-text application for macOS.

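The `configs` and `dataset_info` metadata above declare a `train` and a `validation` split, each with string `input` and `output` columns. The following is a minimal sketch of loading and inspecting them, assuming the Hugging Face `datasets` library is installed and the repository ID matches this page. `load_pairs` and `summarize` are hypothetical helper names used for illustration, not part of the dataset or of any library.

```python
from statistics import mean


def load_pairs(split="train"):
    """Download one split from the Hub (needs network access and `pip install datasets`)."""
    # Imported lazily so the helper below can be used without the library installed.
    from datasets import load_dataset

    return load_dataset("juanquivilla/sotto-transcript-cleanup", split=split)


def summarize(rows):
    """Count rows and report average character lengths of the two columns."""
    rows = list(rows)
    return {
        "num_examples": len(rows),
        "mean_input_chars": mean(len(r["input"]) for r in rows),
        "mean_output_chars": mean(len(r["output"]) for r in rows),
    }


# Offline demonstration on two rows that follow the declared schema.
sample = [
    {"input": "ship it", "output": "Ship it."},
    {"input": "uh the server is uh running low on memory",
     "output": "The server is running low on memory."},
]
stats = summarize(sample)
print(stats)
```

With network access, `summarize(load_pairs("validation"))` should work the same way, since iterating a loaded split yields one dictionary per row.
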
## Task

**Input:** Raw, lowercase, unpunctuated ASR transcript with speech disfluencies

**Output:** Clean, properly formatted text with disfluencies removed

```jsonl
{"input": "uh the server is uh running low on memory", "output": "The server is running low on memory."}
{"input": "use redis wait no memcached is better", "output": "Use Memcached."}
{"input": "ship it", "output": "Ship it."}
{"input": "send the email to john period", "output": "Send the email to John."}
```

## Categories

| Category | % | Description |
|----------|---|-------------|
| self_correction | 14% | Speaker corrects themselves mid-sentence |
| preserve_wording | 13% | Clean input; model must NOT over-edit |
| filler_removal | 11% | Remove uh, um, uhm, er, ah |
| mixed | 10% | Multiple disfluency types combined |
| crutch_words | 8% | Remove basically, you know, I mean, etc. |
| false_start | 8% | Remove abandoned sentence beginnings |
| dictation_commands | 8% | Convert "period" → ".", "comma" → "," |
| misheard_words | 7% | Fix ASR errors (post gress → Postgres) |
| grammar | 7% | Fix spoken grammar (gonna → going to) |
| list_formatting | 6% | Convert spoken lists to numbered format |
| adversarial | 5% | Words that look like fillers but are meaningful |

## Domains

Software engineering (24%), general business (19%), casual conversation (15%), medical (10%), legal (8%), finance (7%), technical (5%), creative (5%), academic (5%)

## Generation Method

Three-layer approach:

1. **Programmatic corruption** (Layer 1): deterministic disfluency injection into clean public text
2. **LLM-generated** (Layer 2): context-dependent patterns via Qwen3.5-35B and Grok 4.20
3. **Hand-crafted** (Layer 3): expert-written samples for edge cases

94.6% validation pass rate. Details in the [training research document](https://huggingface.co/juanquivilla/sotto-cleanup-lfm25-350m).

## Splits

| Split | Samples |
|-------|---------|
| train | 118,069 |
| val | 6,215 |

## License

MIT
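
## Example: Layer 1 style corruption (illustrative)

The card describes Layer 1 only as deterministic disfluency injection into clean text and does not publish the generator. The sketch below is therefore one possible reading of that idea and not the author's pipeline: it lowercases and strips punctuation to mimic raw ASR output, then inserts fillers with a seeded random generator so the same seed always yields the same pair. The function names, the `rate` and `seed` parameters, and the ASCII-only normalization are all assumptions.

```python
import random
import re

# Filler tokens taken from the filler_removal row of the Categories table.
FILLERS = ["uh", "um", "uhm", "er", "ah"]


def normalize_like_asr(text):
    """Lowercase and drop punctuation (ASCII-only assumption) to mimic raw ASR text."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return " ".join(text.split())


def inject_fillers(clean_text, rate=0.2, seed=0):
    """Build an {"input", "output"} pair by inserting a filler before some words."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = []
    for word in normalize_like_asr(clean_text).split():
        if rng.random() < rate:
            noisy.append(rng.choice(FILLERS))
        noisy.append(word)
    return {"input": " ".join(noisy), "output": clean_text}


pair = inject_fillers("The server is running low on memory.", rate=0.3, seed=42)
print(pair)
```

The clean sentence is kept verbatim as `output`, so each generated pair has the same shape as the JSONL records shown under Task. A real generator would also need to cover the other categories, such as self-corrections and dictation commands, which this sketch does not attempt.
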
Provider:
juanquivilla