kyutai/Audio-NTREX-4L
Source: Hugging Face · Updated 2026-02-12 · Indexed 2026-05-10
Download link:
https://hf-mirror.com/datasets/kyutai/Audio-NTREX-4L
Resource description:
---
license: cc
configs:
- config_name: default
  data_files:
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
dataset_info:
  features:
  - name: id
    dtype: string
  - name: source_language
    dtype: string
  - name: target_language
    dtype: string
  - name: source_ntrex_file
    dtype: string
  - name: target_ntrex_file
    dtype: string
  - name: ntrex_lines
    list: int32
  - name: tts
    dtype: string
  - name: source_audio
    dtype: audio
  - name: source_text
    dtype: string
  - name: source_aligned_transcript
    struct:
    - name: text
      list: string
    - name: timestamp
      list:
        list: float64
  - name: target_text
    dtype: string
  splits:
  - name: valid
    num_bytes: 7218314006
    num_examples: 1800
  - name: test
    num_bytes: 7170348264
    num_examples: 1800
  download_size: 14054511442
  dataset_size: 14388662270
task_categories:
- translation
language:
- fr
- es
- pt
- de
- en
pretty_name: Audio-NTREX-4L
size_categories:
- 1K<n<10K
---
# Audio-NTREX-4L
<p align="center">
<img src="Audio_NTREX_4L.png" width="400" alt="logo">
</p>
## Dataset Description
**Audio-NTREX-4L** is a long-form multilingual speech translation dataset from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English, designed to evaluate speech translation models on multi-sentence utterances. It is built from the text translation dataset [NTREX](https://github.com/MicrosoftTranslator/NTREX) by aggregating multiple sentences from the same context to create new source texts and their reference translations. We then synthesize the source texts into speech with three state-of-the-art commercial Text-To-Speech systems from [ElevenLabs](https://elevenlabs.io/text-to-speech), [Cartesia](https://cartesia.ai/sonic) and [Gradium](https://gradium.ai/#models). We condition audio generation on voices from the multilingual [CML-TTS](https://www.openslr.org/146/) dataset.
---
## Dataset Summary
* **Original data:** [NTREX](https://github.com/MicrosoftTranslator/NTREX)
* **Source modalities:** Audio, Text
* **Target modality:** Text
* **Source languages:** French, Spanish, Portuguese, German
* **Target language:** English
* **Total number of source/target pairs:** 3600
* **Number of unique source texts per language:** 300
* **Average source sample duration:** 45 seconds
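
The figures above are consistent with each of the 300 source texts per language being synthesized by three TTS systems. As a quick sanity check, the sketch below recomputes the total number of pairs and derives a rough total audio duration. The duration is only an estimate, since the 45-second average is a rounded value.

```python
# Figures taken from this dataset card.
SOURCE_LANGUAGES = ["fr", "es", "pt", "de"]
UNIQUE_TEXTS_PER_LANGUAGE = 300
TTS_SYSTEMS = ["ElevenLabs", "Cartesia", "Gradium"]
AVERAGE_DURATION_S = 45  # rounded average, so the total below is approximate

# One pair per (language, source text, TTS system).
total_pairs = len(SOURCE_LANGUAGES) * UNIQUE_TEXTS_PER_LANGUAGE * len(TTS_SYSTEMS)

# Rough total amount of source audio, in hours.
approx_total_hours = total_pairs * AVERAGE_DURATION_S / 3600

print(total_pairs)         # 3600
print(approx_total_hours)  # 45.0
```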
---
## Dataset Construction
We use the following files containing text translation data from the [NTREX-128](https://github.com/MicrosoftTranslator/NTREX/tree/main/NTREX-128) corpus:
* 🇬🇧 English: `newstest2019-ref.eng-US.txt`
* 🇫🇷 French: `newstest2019-ref.fra.txt`
* 🇪🇸 Spanish: `newstest2019-ref.spa.txt`
* 🇵🇹 Portuguese: `newstest2019-ref.por.txt`
* 🇩🇪 German: `newstest2019-ref.deu.txt`
Using the English file, we select 300 groups of consecutive lines belonging to the same original document to form our multi-sentence source texts and obtain the corresponding target translations. We define an `id` for each source-target pair as a hash of the ordered NTREX line indexes it comes from.
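
A minimal sketch of this step follows. It assumes that the NTREX files are line-aligned and that line indexes are zero-based. The hash function is not specified in this card, so SHA-1 over a comma-joined string is a stand-in, and `make_id` and `build_pair` are hypothetical helper names.

```python
import hashlib


def make_id(ntrex_lines: list[int]) -> str:
    """Hash an ordered list of NTREX line indexes into a stable identifier.

    Assumption: SHA-1 over a comma-joined string; the actual hash used to
    build the dataset is not documented here.
    """
    key = ",".join(str(index) for index in ntrex_lines)
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def build_pair(source_lines: list[str], target_lines: list[str],
               ntrex_lines: list[int]) -> dict:
    """Join the selected lines of two line-aligned files into one pair."""
    return {
        "id": make_id(ntrex_lines),
        "ntrex_lines": ntrex_lines,
        "source_text": " ".join(source_lines[i] for i in ntrex_lines),
        "target_text": " ".join(target_lines[i] for i in ntrex_lines),
    }


# Toy stand-ins for two line-aligned NTREX files.
french = ["Bonjour.", "Comment allez-vous ?", "Au revoir."]
english = ["Hello.", "How are you?", "Goodbye."]
print(build_pair(french, english, [0, 1]))
```

Because the `id` depends only on the line indexes, the same group of lines gets the same `id` in every source language.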
We clean the source and target texts by removing parenthesized elements to make them better suited for natural speech. Each source text is then synthesized into three audio versions, each using a different TTS system and a different conditioning voice.
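
The cleaning step could look like the sketch below. The exact rules used for the dataset are not given in this card, so the regular expression and the `strip_parentheses` name are assumptions for illustration.

```python
import re

# Matches one innermost parenthesized group plus any whitespace before it.
_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheses(text: str) -> str:
    """Remove parenthesized elements, including nested ones."""
    previous = None
    # Repeat until nothing changes so that nested groups are peeled
    # from the inside out.
    while previous != text:
        previous = text
        text = _PAREN.sub("", text)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


print(strip_parentheses("The WHO (World Health Organization) said so."))
```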
We transcribe the synthesized audio with the [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) Speech-To-Text model and check the Word Error Rate (WER) against the source texts to ensure that each audio version was correctly synthesized.
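
For reference, WER is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below is a plain-Python version. The lowercasing and whitespace tokenization are simplifying assumptions, and `MAX_WER` is a hypothetical threshold, since this card does not state the normalization or cutoff actually used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


MAX_WER = 0.1  # hypothetical acceptance threshold, not taken from the dataset

wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(wer, wer <= MAX_WER)
```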
We split the 3600 source/target pairs into balanced valid and test sets such that all pairs with the same target text stay in the same set, i.e. each set contains 150 distinct `id` values for each language.
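
One way to obtain such a split is to partition the unique `id` values and assign every pair according to its `id`, as sketched below. The random shuffle and the seed are assumptions, since the card does not say how ids were assigned to each set, and `split_by_id` is a hypothetical helper name.

```python
import random


def split_by_id(pairs: list[dict], seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Split pairs into two halves so that each `id` lands in only one set."""
    ids = sorted({pair["id"] for pair in pairs})
    random.Random(seed).shuffle(ids)  # assumption: random assignment of ids
    valid_ids = set(ids[: len(ids) // 2])

    valid = [pair for pair in pairs if pair["id"] in valid_ids]
    test = [pair for pair in pairs if pair["id"] not in valid_ids]
    return valid, test


# Toy data with the same shape as the dataset: 300 ids x 4 languages x 3 TTS.
toy_pairs = [
    {"id": f"id{i}", "source_language": language, "tts": tts}
    for i in range(300)
    for language in ("fr", "es", "pt", "de")
    for tts in ("elevenlabs", "cartesia", "gradium")
]
valid_set, test_set = split_by_id(toy_pairs)
print(len(valid_set), len(test_set))  # 1800 1800
```

Splitting on `id` rather than on individual rows keeps every language and TTS version of a given text in the same set, so no reference translation appears in both valid and test.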
---
## Citations
If you use this dataset, please cite:
```bibtex
@unpublished{hibikizero2026,
title={Simultaneous Speech-to-Speech Translation Without Aligned Data},
author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour},
note={Preprint},
year={2026},
url={https://arxiv.org/abs/2602.11072v1}
}
```
**License:** CC BY-NC-SA 4.0
Provided by:
kyutai



