tiny-aya-translate/fleurs-tr-hi-mimi-encoded
Source: Hugging Face. Updated 2026-04-22; indexed 2026-04-26.
Download link:
https://hf-mirror.com/datasets/tiny-aya-translate/fleurs-tr-hi-mimi-encoded
Resource description:
fleurs-tr-hi-mimi-encoded contains Turkish↔Hindi parallel speech pairs for TinyAya Stage 2 speech-to-speech translation training. It holds 9,212 Mimi-encoded audio pairs, produced with kyutai/mimi using 8 codebooks at a 12.5 Hz frame rate and 24 kHz audio. Each file has the keys pair_id, src_lang, tgt_lang, src_text, tgt_text, src_codes [8, T_src], and tgt_codes [8, T_tgt]. The dataset also includes 18,424 Whisper word-level alignment sidecars, two per pair (.src.alignments.json and .tgt.alignments.json), plus split files for training (8,283 rows, 90%) and validation (929 rows, 10%). It is derived from tiny-aya-translate/fleurs-tr-hi-parallel-speech and was processed with the configuration kyutai/mimi, mimi_num_codebooks=8, output_sample_rate=24000.
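The record layout and counts described above can be sanity-checked with a short sketch. It assumes a record has already been loaded into a plain Python dict with the listed keys and that each code array is a nested list of shape [8, T]; the on-disk file format is not stated in the listing, so no loader is shown. The function name `check_record`, the example values, and the language codes "tr" and "hi" are illustrative placeholders, not taken from the dataset.

```python
NUM_CODEBOOKS = 8        # mimi_num_codebooks=8 in the listing
FRAME_RATE_HZ = 12.5     # Mimi frame rate stated in the listing
EXPECTED_KEYS = {
    "pair_id", "src_lang", "tgt_lang",
    "src_text", "tgt_text", "src_codes", "tgt_codes",
}


def check_record(record):
    """Validate one record and return the approximate duration (seconds) per side.

    Hypothetical helper: assumes codes are nested lists shaped [8, T].
    """
    missing = EXPECTED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    durations = {}
    for side in ("src", "tgt"):
        codes = record[f"{side}_codes"]
        if len(codes) != NUM_CODEBOOKS:
            raise ValueError(f"{side}_codes has {len(codes)} codebooks, expected {NUM_CODEBOOKS}")
        frame_counts = {len(row) for row in codes}
        if len(frame_counts) != 1:
            raise ValueError(f"{side}_codes rows have unequal lengths")
        num_frames = frame_counts.pop()
        # Each frame covers 1 / 12.5 s = 80 ms of audio.
        durations[side] = num_frames / FRAME_RATE_HZ
    return durations


# Made-up record for illustration only (not real dataset content).
example = {
    "pair_id": "example-0001",
    "src_lang": "tr",   # placeholder language code
    "tgt_lang": "hi",   # placeholder language code
    "src_text": "...",
    "tgt_text": "...",
    "src_codes": [[0] * 50 for _ in range(NUM_CODEBOOKS)],
    "tgt_codes": [[0] * 75 for _ in range(NUM_CODEBOOKS)],
}
durations = check_record(example)

# Counts quoted in the listing are mutually consistent.
TOTAL_PAIRS, TRAIN_ROWS, VAL_ROWS, SIDECARS = 9212, 8283, 929, 18424
counts_consistent = (TRAIN_ROWS + VAL_ROWS == TOTAL_PAIRS) and (2 * TOTAL_PAIRS == SIDECARS)
```

At 12.5 frames per second, the 50-frame and 75-frame example arrays correspond to 4 and 6 seconds of audio, which is a quick way to estimate sequence lengths before training.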
Provider:
tiny-aya-translate



