Gilbert-AI/french-education-speech
Source: Hugging Face. Updated 2025-12-17; indexed 2025-12-20.
Download link:
https://hf-mirror.com/datasets/Gilbert-AI/french-education-speech
Description:
A French educational speech dataset with transcriptions, prepared for training automatic speech recognition (ASR) models. It contains 3,933 transcribed audio segments from the French educational domain, totaling approximately 12.82 hours of audio. All transcriptions were produced with the OpenAI Whisper API (optimized Whisper-1 model), with the stated aim of high accuracy, especially for educational terminology and acronyms. The dataset is split into a train set (3,720 segments, 12.12 hours) and a validation set (213 segments, 0.70 hours). Each example includes the audio file, transcribed text, duration, category (conferences, podcasts, courses, interviews), quality (clean, medium), source, speaker role (teacher, student, etc.), and domain. According to the description, dataset creation followed four steps: model selection, quality filtering, transcription execution, and a final quality-control pass.
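The per-example fields described above lend themselves to simple filtering and duration accounting. The following is a minimal sketch, not the dataset authors' tooling. The field names `duration`, `category`, and `quality` are assumptions taken from the prose description, and the duration is assumed to be in seconds; the actual column names and units may differ. The `load_split` helper uses the Hugging Face `datasets` library with the repository ID shown in the title and requires network access.

```python
from collections import defaultdict

# Hypothetical field names inferred from the description; the real columns may differ.
DURATION_KEY = "duration"  # assumed to be in seconds
CATEGORY_KEY = "category"
QUALITY_KEY = "quality"


def hours_by_category(examples, quality=None):
    """Sum durations (assumed seconds) per category and return them in hours.

    If `quality` is given, only examples with that quality label are counted.
    """
    totals = defaultdict(float)
    for ex in examples:
        if quality is not None and ex[QUALITY_KEY] != quality:
            continue
        totals[ex[CATEGORY_KEY]] += ex[DURATION_KEY] / 3600.0
    return dict(totals)


def load_split(split="train"):
    """Load one split with the Hugging Face `datasets` library (needs network access)."""
    # Imported lazily so hours_by_category has no third-party dependency.
    from datasets import load_dataset

    return load_dataset("Gilbert-AI/french-education-speech", split=split)


if __name__ == "__main__":
    # Toy records for illustration only; they are not taken from the dataset.
    toy = [
        {"duration": 1800.0, "category": "courses", "quality": "clean"},
        {"duration": 1800.0, "category": "courses", "quality": "medium"},
        {"duration": 3600.0, "category": "podcasts", "quality": "clean"},
    ]
    print(hours_by_category(toy, quality="clean"))
```

Passing the result of `load_split("validation")` to `hours_by_category` should work if the assumed field names match. Iterating over a full split decodes every audio file, so dropping the audio column first may be preferable for this kind of bookkeeping.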
Provided by:
Gilbert-AI



