danielrosehill/Transcription-Cleanup-Trainer

Name: danielrosehill/Transcription-Cleanup-Trainer
Creator: danielrosehill
Published: 2025-12-18 14:13:19
License: No description available

Hugging Face | Updated 2025-12-18 | Indexed 2025-12-20

Download link:

https://hf-mirror.com/datasets/danielrosehill/Transcription-Cleanup-Trainer

Resource overview:

This dataset is a curated collection for training speech-to-text cleanup models. It contains paired examples of raw speech-to-text transcriptions and manually cleaned versions, designed for fine-tuning models to clean up transcripts to a specific quality level ("Goldilocks" cleanup: not too much, not too little). The dataset includes audio files, Whisper ASR transcriptions, automated cleanups, and manual cleanups, along with question and dataset metadata. The cleanup guidelines focus on removing filler words, false starts, and repetitions while preserving the natural conversational tone and the speaker's intent and personality. The dataset was created in four steps: recording, ASR transcription, automated cleanup, and manual cleanup. It is intended for fine-tuning text cleanup models, training audio multimodal models, and improving voice-to-text pipelines; it is not suitable for speech recognition, speaker identification, or voice cloning. The dataset statistics cover sample count, average audio length, average word count, and language.
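
To make the cleanup guidelines concrete, here is a minimal rule-based sketch of the kind of "automated cleanup" pass that could sit between the raw ASR text and the manual version. It is not the dataset author's pipeline: the filler list, the function name `light_cleanup`, and the sample sentence are all assumptions made for illustration.

```python
import re

# Hypothetical filler list; the dataset's actual guidelines are not reproduced here.
FILLERS = ("um", "uh", "erm")


def light_cleanup(text: str) -> str:
    """Remove filler words and immediate word repetitions, and nothing else."""
    # Drop each filler as a whole word, along with a trailing comma if present.
    filler_pattern = r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b,?\s*"
    text = re.sub(filler_pattern, "", text, flags=re.IGNORECASE)
    # Collapse immediate repetitions such as "I I think" into "I think".
    # Caveat: this also collapses legitimate doubles like "had had".
    text = re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Normalise whitespace left behind by the removals.
    text = re.sub(r"\s+", " ", text).strip()
    # Restore the capital letter in case a leading filler was removed.
    return text[:1].upper() + text[1:]


if __name__ == "__main__":
    raw = "Um, so I I think the the model is uh pretty good."
    print(light_cleanup(raw))
```

A rule-based pass like this cannot tell a stutter from an intentional repetition, and it cannot detect false starts at all. That gap is one reason paired manual cleanups are useful as fine-tuning targets.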

Provided by:

danielrosehill
