ilsp/pomak-speech-corpus

Name: ilsp/pomak-speech-corpus
Creator: ilsp
Published: 2026-01-22 14:42:00
License: No description available

Hugging Face: updated 2026-01-22, indexed 2026-02-07

Download link:

https://hf-mirror.com/datasets/ilsp/pomak-speech-corpus

Description:

Pomak is an endangered Southeastern Slavic language variety spoken primarily in northern Greece. This dataset is a speech corpus for training and evaluating automatic speech recognition (ASR) systems for Pomak. It contains readings of Pomak texts by four native speakers, recorded at the ILSP Audiovisual Studio in Xanthi, Greece, with a total duration of approximately 14 hours. All recordings were made with informed consent. The audio was segmented into short clips of up to 25 seconds, yielding a final training set of 11 hours and 8 minutes. The dataset was used to fine-tune a pre-trained Slavic wav2vec2 model; the resulting model, wav2vec2-xls-r-slavic-pomak, is the first ASR system for the Pomak language. On the held-out test set, the word error rate (WER) fell from 31.47% to 3.12% and the character error rate (CER) fell from 87.31% to 9.06%. The study shows that fine-tuning multilingual and language-family-specific ASR models can deliver high-quality speech recognition in an extremely low-resource setting. The corpus is intended to support future linguistic research, corpus construction, and language documentation for Pomak.
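Since the results above are reported as WER and CER, a minimal sketch of how these two metrics are commonly computed may be useful. It assumes the standard definition (Levenshtein edit distance divided by the reference length, over words for WER and over characters for CER). The function names and the English example sentences are illustrative only; they are not taken from the corpus or from the authors' evaluation code, which may apply text normalization not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative placeholder sentences (not Pomak data).
reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

In this toy case one of six reference words is missing, so the WER is 1/6, and four of 22 reference characters are missing, so the CER is 4/22. Both rates can exceed 100% when the hypothesis contains many insertions.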

Provided by:

ilsp
