Reubencf/Adaption-low-resource-audio
Source: Hugging Face. Updated 2026-04-24. Indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/Reubencf/Adaption-low-resource-audio
Description:
The Adaption Low-Resource Audio dataset is a low-resource-language subset of the Reubencf/PolyglotAudio dataset, remastered with Adaption's Adaptive Data platform. It contains 3,704 rows of paired audio and text across 10 languages, ranging from comparatively well-documented languages (e.g., Marathi, Czech) to languages that are genuinely rare in open speech data (e.g., Kabyle, Toki Pona). Each row holds the original Tatoeba-derived audio clip alongside refined enhanced_prompt and enhanced_completion columns, so the data can be used directly for speech-model fine-tuning and evaluation. Intended uses are low-resource ASR and TTS fine-tuning, cross-lingual speech retrieval, and instruction-tuning of speech models via the enhanced_prompt and enhanced_completion columns. The language distribution is skewed toward Berber and Marathi; the other 8 languages together account for only about 19% of the rows. Audio quality varies with the recording conditions of the Tatoeba volunteer contributors, and the enhanced_prompt and enhanced_completion columns may carry bias from the model that generated them. The dataset is licensed under CC-BY-NC 4.0.
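The skew described above matters when fine-tuning, since two languages supply most of the rows. The following is a minimal sketch of how one might quantify the imbalance and compute inverse-frequency sampling weights. It works on plain Python dictionaries rather than the real dataset, and the `language` key is a hypothetical column name: the description only names the enhanced_prompt and enhanced_completion columns. Inverse-frequency weighting is one common choice here, not something the dataset authors prescribe.

```python
from collections import Counter

# Figures quoted in the description above.
TOTAL_ROWS = 3704
MINORITY_SHARE = 0.19  # approximate share held by the 8 smaller languages


def approx_minority_rows(total=TOTAL_ROWS, share=MINORITY_SHARE):
    """Rough row count for the 8 smaller languages combined."""
    return round(total * share)


def language_shares(rows, language_key="language"):
    """Fraction of rows per language. `language_key` is an assumed column name."""
    counts = Counter(row[language_key] for row in rows)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def balancing_weights(rows, language_key="language"):
    """Inverse-frequency weight per row.

    Each language ends up with the same total weight, and the weights
    sum to len(rows), so the overall scale of the data is preserved.
    """
    counts = Counter(row[language_key] for row in rows)
    n_langs = len(counts)
    total = len(rows)
    return [total / (n_langs * counts[row[language_key]]) for row in rows]


# Toy rows standing in for dataset records (not real data).
toy_rows = [
    {"language": "ber", "enhanced_prompt": "p1", "enhanced_completion": "c1"},
    {"language": "ber", "enhanced_prompt": "p2", "enhanced_completion": "c2"},
    {"language": "ber", "enhanced_prompt": "p3", "enhanced_completion": "c3"},
    {"language": "mr", "enhanced_prompt": "p4", "enhanced_completion": "c4"},
]
toy_shares = language_shares(toy_rows)
toy_weights = balancing_weights(toy_rows)
```

The resulting weights could be passed to a weighted sampler so that the smaller languages are seen more often during training, at the cost of repeating their few clips more frequently.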
Provider:
Reubencf



