
ufal/parczech4speech-unsegmented

Source: Hugging Face. Updated 2025-06-20, indexed 2025-07-05.
Download link:
https://hf-mirror.com/datasets/ufal/parczech4speech-unsegmented
Description:
ParCzech4Speech (Unsegmented Variant) is a large-scale Czech speech dataset derived from parliamentary recordings and official transcripts. This variant captures continuous speech segments without enforcing sentence boundaries, which makes it well suited to streaming ASR scenarios and to speech modeling tasks that benefit from natural discourse flow. The dataset was created using a combination of WhisperX and Wav2Vec 2.0 models for automatic alignment and filtering. Segments are formed by aggregating consecutive well-aligned words until a speaker change or a misalignment is encountered. The data is sourced from the ParCzech 4.0 corpus (official transcripts of parliamentary sessions) and the corresponding AudioPSP 24.01 audio.
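The segment-forming rule described above (aggregate consecutive well-aligned words, stop at a speaker change or a misalignment) can be illustrated with a minimal Python sketch. This is not the dataset authors' pipeline: the `Word` and `Segment` types, the boolean `aligned` flag, and the `build_segments` function are all hypothetical names introduced here, and the sketch assumes that a misaligned word is dropped rather than kept in either neighboring segment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    speaker: str
    start: float   # seconds
    end: float     # seconds
    aligned: bool  # hypothetical flag: True if the word passed alignment filtering


@dataclass
class Segment:
    speaker: str
    start: float
    end: float
    text: str


def build_segments(words: List[Word]) -> List[Segment]:
    """Group consecutive well-aligned words of one speaker into segments."""
    segments: List[Segment] = []
    current: List[Word] = []

    def flush() -> None:
        # Close the running segment, if any, and start a fresh one.
        if current:
            segments.append(
                Segment(
                    speaker=current[0].speaker,
                    start=current[0].start,
                    end=current[-1].end,
                    text=" ".join(w.text for w in current),
                )
            )
            current.clear()

    for word in words:
        if not word.aligned:
            # Misalignment ends the segment; the word itself is dropped (assumption).
            flush()
            continue
        if current and word.speaker != current[0].speaker:
            # Speaker change ends the segment; the word starts the next one.
            flush()
        current.append(word)

    flush()
    return segments
```

Note that no sentence-boundary check appears anywhere in the loop, which is what distinguishes this unsegmented grouping from a sentence-level one.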
Provider:
ufal