Reubencf/Adaption-low-resource-audio

Name: Reubencf/Adaption-low-resource-audio
Creator: Reubencf
Published: 2026-04-24 10:33:15
License: CC-BY-NC 4.0 (as stated in the description)

Source: Hugging Face. Updated: 2026-04-24. Indexed: 2026-04-26.

Download link:

https://hf-mirror.com/datasets/Reubencf/Adaption-low-resource-audio

Description:

The Adaption Low-Resource Audio dataset is a low-resource-language subset of the Reubencf/PolyglotAudio dataset, remastered with Adaption's Adaptive Data platform. It contains 3,704 rows of paired audio and text across 10 languages, ranging from comparatively well-documented languages (e.g., Marathi, Czech) to languages that are genuinely rare in open speech data (e.g., Kabyle, Toki Pona). Each row holds the original Tatoeba-derived audio clip alongside refined enhanced_prompt and enhanced_completion columns, so the data can be used directly for speech-model fine-tuning and evaluation. Intended uses are low-resource ASR and TTS fine-tuning, cross-lingual speech retrieval, and instruction-tuning of speech models via the enhanced_prompt and enhanced_completion columns. The language distribution is skewed toward Berber and Marathi; the other 8 languages together account for only about 19% of the rows. Audio quality varies with the recording conditions of Tatoeba's volunteer contributors, and the enhanced_prompt and enhanced_completion columns may carry bias from the model that produced them. The dataset is licensed under CC-BY-NC 4.0.
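Because the description reports that two languages dominate the rows, it may be worth checking the per-language shares before fine-tuning or splitting the data. The following is a minimal sketch, not the dataset author's tooling. The repository id is taken from the listing; the column name "language" and the split name "train" are assumptions that should be checked against the actual dataset schema.

```python
from collections import Counter


def language_shares(labels):
    """Return {language: fraction of rows}, largest share first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}


def tail_share(shares, head=2):
    """Combined share of every language outside the `head` largest ones."""
    return sum(list(shares.values())[head:])


def load_language_labels(column="language", split="train"):
    """Fetch one label per row from the Hugging Face Hub.

    Requires `pip install datasets` and network access.
    ASSUMPTION: `column` and `split` are hypothetical names; inspect the
    dataset's features before relying on them.
    """
    from datasets import load_dataset

    ds = load_dataset("Reubencf/Adaption-low-resource-audio", split=split)
    # Reading a single text column avoids decoding the audio clips.
    return list(ds[column])
```

With network access, `tail_share(language_shares(load_language_labels()))` would give the combined share of the languages outside the two largest, which can be compared with the roughly 19% quoted in the description.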

Provider:

Reubencf
