
kyutai/Audio-NTREX-4L

Source: Hugging Face. Updated 2026-02-12; indexed 2026-05-10.
Download link:
https://hf-mirror.com/datasets/kyutai/Audio-NTREX-4L
Resource overview:
---
license: cc
configs:
- config_name: default
  data_files:
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: source_language
    dtype: string
  - name: target_language
    dtype: string
  - name: source_ntrex_file
    dtype: string
  - name: target_ntrex_file
    dtype: string
  - name: ntrex_lines
    list: int32
  - name: tts
    dtype: string
  - name: source_audio
    dtype: audio
  - name: source_text
    dtype: string
  - name: source_aligned_transcript
    struct:
    - name: text
      list: string
    - name: timestamp
      list:
        list: float64
  - name: target_text
    dtype: string
  splits:
  - name: valid
    num_bytes: 7218314006
    num_examples: 1800
  - name: test
    num_bytes: 7170348264
    num_examples: 1800
  download_size: 14054511442
  dataset_size: 14388662270
task_categories:
- translation
language:
- fr
- es
- pt
- de
- en
pretty_name: Audio-NTREX-4L
size_categories:
- 1K<n<10K
---

# Audio-NTREX-4L

<p align="center"> <img src="Audio_NTREX_4L.png" width="400" alt="logo"> </p>

## Dataset Description

**Audio-NTREX-4L** is a long-form multilingual speech translation dataset from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English, designed to evaluate speech translation models on multi-sentence utterances. It is built from the text translation dataset [NTREX](https://github.com/MicrosoftTranslator/NTREX) by aggregating multiple sentences from the same context to create new source texts and their reference translations. We then use 3 different state-of-the-art commercial Text-To-Speech systems from [ElevenLabs](https://elevenlabs.io/text-to-speech), [Cartesia](https://cartesia.ai/sonic) and [Gradium](https://gradium.ai/#models) to synthesize the source texts into speech. We condition audio generation on voices from the multilingual [CML-TTS](https://www.openslr.org/146/) dataset.
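The metadata declares `source_aligned_transcript` as a struct with a `text` list of strings and a `timestamp` list of float lists. The sketch below shows one way to pair the two for a single row, for example a row loaded with the Hugging Face `datasets` library. It rests on an assumption the card does not state: that `timestamp[i]` is the time span of `text[i]`. The helper name `iter_aligned_segments` and the example values are made up for illustration.

```python
from typing import Dict, List, Tuple


def iter_aligned_segments(transcript: Dict[str, list]) -> List[Tuple[str, List[float]]]:
    """Pair each transcript chunk with its timestamp entry.

    Hypothetical helper, not part of the dataset or of any library.
    Assumption: transcript["timestamp"][i] describes transcript["text"][i];
    the dataset card only declares the field types, not this pairing.
    """
    texts = transcript["text"]
    stamps = transcript["timestamp"]
    if len(texts) != len(stamps):
        raise ValueError("text and timestamp lists differ in length")
    return [(chunk, list(span)) for chunk, span in zip(texts, stamps)]


# Made-up fragment shaped like the declared struct (not real dataset content).
example_transcript = {
    "text": ["Bonjour", "le", "monde"],
    "timestamp": [[0.0, 0.4], [0.4, 0.5], [0.5, 0.9]],
}

segments = iter_aligned_segments(example_transcript)
for chunk, span in segments:
    print(chunk, span)
```

Checking the lengths first makes a mismatch between the two lists fail loudly instead of being silently truncated by `zip`.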
---

## Dataset Summary

* **Original data:** [NTREX](https://github.com/MicrosoftTranslator/NTREX)
* **Source modalities:** Audio, Text
* **Target modality:** Text
* **Source languages:** French, Spanish, Portuguese, German
* **Target language:** English
* **Total number of source/target pairs:** 3600
* **Number of unique source texts per language:** 300
* **Average source sample duration:** 45 seconds

---

## Dataset Construction

We use the following files containing text translation data from the [NTREX-128](https://github.com/MicrosoftTranslator/NTREX/tree/main/NTREX-128) corpus:

* 🇬🇧 English: `newstest2019-ref.eng-US.txt`
* 🇫🇷 French: `newstest2019-ref.fra.txt`
* 🇪🇸 Spanish: `newstest2019-ref.spa.txt`
* 🇵🇹 Portuguese: `newstest2019-ref.por.txt`
* 🇩🇪 German: `newstest2019-ref.deu.txt`

Using the English file, we select 300 groups of consecutive lines belonging to the same original document to form our multi-sentence source texts, and we obtain the target text translations accordingly. We define an `id` for each source-target pair as a hash of the ordered NTREX line indexes it comes from. We clean the source and target texts by removing elements in parentheses to make them better suited for natural speech.

Each source text is then synthesized into 3 audio versions, each using a different TTS system and a different voice conditioning. We transcribe the synthesized audio using the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) Speech-To-Text model and check the Word Error Rate with respect to the source texts to ensure that the audio versions were correctly synthesized.

We split the 3600 source/target pairs (4 source languages × 300 source texts × 3 audio versions) into balanced valid and test sets such that all pairs with the same target text stay in the same set, i.e. we keep 150 distinct `id` values for each language in each set.
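The Word Error Rate check described above can be sketched as a word-level edit distance between a source text and a transcript of its synthesized audio. This is a minimal illustration, not the authors' pipeline. The lowercase and whitespace normalization, the `max_wer` threshold of 0.1, and the function names are assumptions, since the card does not specify them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are compared after lowercasing and whitespace
    splitting; the dataset card does not describe its normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def is_well_synthesized(source_text: str, transcript: str, max_wer: float = 0.1) -> bool:
    """Hypothetical acceptance rule; the 0.1 threshold is an assumption."""
    return word_error_rate(source_text, transcript) <= max_wer


print(word_error_rate("le chat dort sur le canapé", "le chat dort sur le canapé"))
print(is_well_synthesized("le chat dort", "le chien dort"))
```

In practice the transcript would come from a Speech-To-Text model such as the one named above, and punctuation would likely need to be stripped before comparison.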
---

## Citations

If you use this dataset, please cite:

```bibtex
@unpublished{hibikizero2026,
  title={Simultaneous Speech-to-Speech Translation Without Aligned Data},
  author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour},
  note={Preprint},
  year={2026},
  url={https://arxiv.org/abs/2602.11072v1}
}
```

**License:** CC BY-NC-SA 4.0
Provider:
kyutai