deepdml/mtedx
Source: Hugging Face. Updated 2026-02-26; indexed 2026-03-29.
Download link: https://hf-mirror.com/datasets/deepdml/mtedx
Dataset description:
---
language:
- es
- fr
- pt
- it
- ar
- ru
- el
- de
license: cc-by-nc-nd-4.0
task_categories:
- automatic-speech-recognition
- translation
pretty_name: Multilingual TEDx (mTEDx) — SLR100
tags:
- speech
- audio
- tedx
- multilingual
- asr
- speech-translation
configs:
- config_name: ar
data_files:
- split: train
path: ar/train-*
- split: valid
path: ar/valid-*
- split: test
path: ar/test-*
- config_name: de
data_files:
- split: train
path: de/train-*
- split: valid
path: de/valid-*
- split: test
path: de/test-*
- config_name: el
data_files:
- split: train
path: el/train-*
- split: valid
path: el/valid-*
- split: test
path: el/test-*
- config_name: el-en
data_files:
- split: train
path: el-en/train-*
- split: valid
path: el-en/valid-*
- split: test
path: el-en/test-*
- config_name: es
data_files:
- split: train
path: es/train-*
- split: valid
path: es/valid-*
- split: test
path: es/test-*
- config_name: es-en
data_files:
- split: train
path: es-en/train-*
- split: valid
path: es-en/valid-*
- split: test
path: es-en/test-*
- config_name: es-fr
data_files:
- split: train
path: es-fr/train-*
- split: valid
path: es-fr/valid-*
- split: test
path: es-fr/test-*
- config_name: es-it
data_files:
- split: train
path: es-it/train-*
- split: valid
path: es-it/valid-*
- split: test
path: es-it/test-*
- config_name: es-pt
data_files:
- split: train
path: es-pt/train-*
- split: valid
path: es-pt/valid-*
- split: test
path: es-pt/test-*
- config_name: fr
data_files:
- split: train
path: fr/train-*
- split: valid
path: fr/valid-*
- split: test
path: fr/test-*
- config_name: fr-en
data_files:
- split: train
path: fr-en/train-*
- split: valid
path: fr-en/valid-*
- split: test
path: fr-en/test-*
- config_name: fr-es
data_files:
- split: train
path: fr-es/train-*
- split: valid
path: fr-es/valid-*
- split: test
path: fr-es/test-*
- config_name: fr-pt
data_files:
- split: train
path: fr-pt/train-*
- split: valid
path: fr-pt/valid-*
- split: test
path: fr-pt/test-*
- config_name: it
data_files:
- split: train
path: it/train-*
- split: valid
path: it/valid-*
- split: test
path: it/test-*
- config_name: it-en
data_files:
- split: train
path: it-en/train-*
- split: valid
path: it-en/valid-*
- split: test
path: it-en/test-*
- config_name: it-es
data_files:
- split: train
path: it-es/train-*
- split: valid
path: it-es/valid-*
- split: test
path: it-es/test-*
- config_name: pt
data_files:
- split: train
path: pt/train-*
- split: valid
path: pt/valid-*
- split: test
path: pt/test-*
- config_name: ru
data_files:
- split: train
path: ru/train-*
- split: valid
path: ru/valid-*
- split: test
path: ru/test-*
- config_name: ru-en
data_files:
- split: train
path: ru-en/train-*
- split: valid
path: ru-en/valid-*
- split: test
path: ru-en/test-*
dataset_info:
- config_name: ar
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 2178889228.384
num_examples: 11129
- name: valid
num_bytes: 142140185.32
num_examples: 1024
- name: test
num_bytes: 132688587.25
num_examples: 1025
download_size: 2472218095
dataset_size: 2453718000.954
- config_name: de
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 1765760361.764
num_examples: 6764
- name: valid
num_bytes: 256591446.396
num_examples: 1172
- name: test
num_bytes: 245484939.996
num_examples: 1126
download_size: 2035086468
dataset_size: 2267836748.1559997
- config_name: el
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 2568039234.829
num_examples: 12817
- name: valid
num_bytes: 304746052.0
num_examples: 972
- name: test
num_bytes: 276022477.156
num_examples: 1018
download_size: 4035781133
dataset_size: 3148807763.985
- config_name: el-en
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 1308582725.54
num_examples: 4324
- name: valid
num_bytes: 304854379.0
num_examples: 972
- name: test
num_bytes: 276124096.156
num_examples: 1018
download_size: 1726358354
dataset_size: 1889561200.696
- config_name: es
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 26561374066.116
num_examples: 100198
- name: valid
num_bytes: 223052682.0
num_examples: 894
- name: test
num_bytes: 245397409.388
num_examples: 1003
download_size: 25062015078
dataset_size: 27029824157.504
- config_name: es-en
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 8439704112.165
num_examples: 35619
- name: valid
num_bytes: 222635791.0
num_examples: 894
- name: test
num_bytes: 245390358.21
num_examples: 1003
download_size: 9579841898
dataset_size: 8907730261.375
- config_name: es-fr
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 901562061.99
num_examples: 3599
- name: valid
num_bytes: 222644718.0
num_examples: 894
- name: test
num_bytes: 245404215.21
num_examples: 1003
download_size: 1362766440
dataset_size: 1369610995.2
- config_name: es-it
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 1337609300.41
num_examples: 5510
- name: valid
num_bytes: 3469791.0
num_examples: 16
- name: test
num_bytes: 75562486.0
num_examples: 264
download_size: 1494586040
dataset_size: 1416641577.41
- config_name: es-pt
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 4688684250.412
num_examples: 20729
- name: valid
num_bytes: 222639494.0
num_examples: 894
- name: test
num_bytes: 245393422.21
num_examples: 1003
download_size: 5662737434
dataset_size: 5156717166.622
- config_name: fr
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 20224863779.006
num_examples: 113954
- name: valid
num_bytes: 281149192.02
num_examples: 1020
- name: test
num_bytes: 221199082.174
num_examples: 1046
download_size: 24748069249
dataset_size: 20727212053.2
- config_name: fr-en
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 6601653666.472
num_examples: 29656
- name: valid
num_bytes: 281180195.78
num_examples: 1020
- name: test
num_bytes: 221048522.14
num_examples: 1043
download_size: 6807053209
dataset_size: 7103882384.392
- config_name: fr-es
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 3531289952.782
num_examples: 20417
- name: valid
num_bytes: 281183223.78
num_examples: 1020
- name: test
num_bytes: 221052191.14
num_examples: 1043
download_size: 4966571176
dataset_size: 4033525367.702
- config_name: fr-pt
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 2280734741.884
num_examples: 13062
- name: valid
num_bytes: 281184777.78
num_examples: 1020
- name: test
num_bytes: 221050976.14
num_examples: 1043
download_size: 3363496343
dataset_size: 2782970495.8039994
- config_name: it
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 14681799161.35
num_examples: 49043
- name: valid
num_bytes: 239961988.0
num_examples: 920
- name: test
num_bytes: 249234907.0
num_examples: 992
download_size: 14005414812
dataset_size: 15170996056.35
- config_name: it-en
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 6537697373.862
num_examples: 24187
- name: valid
num_bytes: 240056624.0
num_examples: 920
- name: test
num_bytes: 249340745.0
num_examples: 992
download_size: 6868910303
dataset_size: 7027094742.862
- config_name: it-es
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 637123264.152
num_examples: 2218
- name: valid
num_bytes: 240063519.0
num_examples: 920
- name: test
num_bytes: 249348573.0
num_examples: 992
download_size: 1155283492
dataset_size: 1126535356.152
- config_name: pt
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 23704544310.012
num_examples: 88798
- name: valid
num_bytes: 184985513.136
num_examples: 1002
- name: test
num_bytes: 234352081.034
num_examples: 1006
download_size: 21173051837
dataset_size: 24123881904.182003
- config_name: ru
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 8638149553.84
num_examples: 28824
- name: valid
num_bytes: 269571389.0
num_examples: 965
- name: test
num_bytes: 283388832.071
num_examples: 1117
download_size: 7547655322
dataset_size: 9191109774.911
- config_name: ru-en
features:
- name: id
dtype: string
- name: audio
dtype: audio
- name: transcript
dtype: string
- name: translation
dtype: string
- name: duration
dtype: float32
- name: talk_id
dtype: string
- name: segment_id
dtype: int32
- name: start
dtype: float32
- name: end
dtype: float32
splits:
- name: train
num_bytes: 1168027677.668
num_examples: 4868
- name: valid
num_bytes: 269664453.0
num_examples: 965
- name: test
num_bytes: 283483764.071
num_examples: 1117
download_size: 1772511093
dataset_size: 1721175894.7389998
---
# Multilingual TEDx (mTEDx) — SLR100
## Dataset Description
- **Homepage:** [OpenSLR SLR100](https://www.openslr.org/100/)
**mTEDx** is a multilingual speech recognition and translation corpus built from
[TEDx Talks](https://www.ted.com/watch/tedx-talks).
Original resource: [https://www.openslr.org/100/](https://www.openslr.org/100/)
The corpus provides audio recordings and VTT transcripts for **8 languages**
(Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with
aligned translations into up to 5 languages (English, Spanish, French,
Portuguese, Italian).
**License:** [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)
**Contact:** Elizabeth Salesky (`esalesky@jhu.edu`), Matthew Wiesner (`wiesner@jhu.edu`)
---
## Corpus Statistics
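Each row in the dataset corresponds to **one segment** (an individual audio clip plus its transcript).
The tables below give talk, sentence, and word counts and total audio duration per split, as reported in each language pack's `docs/statistics.txt`.
Row counts in the converted dataset can be lower than the sentence counts listed here, presumably because the upload script discards segments shorter than 0.5 s or longer than 30 s. The `iwslt2021` rows describe a split of the original release that is not among the splits (`train`, `valid`, `test`) published here.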
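The Duration column pairs a value in seconds with an hours/minutes/seconds rendering. As a minimal sketch of that conversion, the helper below (`format_duration` is a hypothetical name, not part of the dataset or its upload script) rounds to the nearest second and formats the result in the same style as the tables. Figures in the tables may still differ by a second where the original statistics carried fractional seconds.

```python
def format_duration(seconds: float) -> str:
    """Render a duration in seconds as 'Hh Mm Ss', e.g. '1h 56m53s'."""
    total = int(round(seconds))          # round to the nearest whole second
    hours, rem = divmod(total, 3600)     # 3600 seconds per hour
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m{secs}s"


# Spanish train split duration from the table below
print(format_duration(764301))
```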
### Spanish (`es`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 988 | 102 171 | 1 676 862 | 764 301 s ≈ 212h 18m21s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **1 031** | **105 084** | **1 722 857** | **786 155 s ≈ 218h 22m35s** |
### French (`fr`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 949 | 116 045 | 1 838 447 | 780 355 s ≈ 216h 45m55s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **982** | **119 181** | **1 887 826** | **803 889 s ≈ 223h 18m9s** |
### Portuguese (`pt`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 820 | 90 244 | 1 433 073 | 642 853 s ≈ 178h 34m14s |
| valid | 9 | 1 013 | 14 457 | 6 522 s ≈ 1h 48m43s |
| test | 13 | 1 020 | 17 626 | 7 648 s ≈ 2h 7m28s |
| iwslt2021 | 11 | 1 022 | 15 498 | 7 290 s ≈ 2h 1m30s |
| **total** | **853** | **93 299** | **1 480 654** | **664 315 s ≈ 184h 31m55s** |
### Italian (`it`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 489 | 49 964 | 883 138 | 420 141 s ≈ 116h 42m22s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **516** | **52 873** | **935 181** | **443 756 s ≈ 123h 15m56s** |
### Russian (`ru`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 238 | 29 161 | 400 666 | 205 222 s ≈ 57h 0m23s |
| valid | 7 | 973 | 13 739 | 7 258 s ≈ 2h 0m58s |
| test | 9 | 1 132 | 14 598 | 7 554 s ≈ 2h 5m55s |
| **total** | **254** | **31 266** | **429 003** | **220 035 s ≈ 61h 7m15s** |
### Greek (`el`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 113 | 12 965 | 221 625 | 104 084 s ≈ 28h 54m45s |
| valid | 10 | 982 | 18 586 | 9 412 s ≈ 2h 36m53s |
| test | 8 | 1 027 | 17 164 | 8 493 s ≈ 2h 21m33s |
| **total** | **131** | **14 974** | **257 375** | **121 991 s ≈ 33h 53m11s** |
### Arabic (`ar`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 95 | 11 821 | 115 259 | 68 310 s ≈ 18h 58m |
| valid | 7 | 1 079 | 9 374 | 5 280 s ≈ 1h 28m |
| test | 7 | 1 066 | 8 964 | 5 187 s ≈ 1h 26m |
| **total** | **109** | **13 966** | **133 597** | **78 778 s ≈ 21h 53m** |
### German (`de`)
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 53 | 6 764 | 94 984 | 44 958 s ≈ 12h 29m18s |
| valid | 9 | 1 172 | 14 661 | 6 893 s ≈ 1h 54m53s |
| test | 9 | 1 126 | 14 289 | 6 715 s ≈ 1h 51m55s |
| **total** | **71** | **9 062** | **123 934** | **58 566 s ≈ 16h 16m6s** |
### el-en
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 35 | 4 384 | 73 886 | 34 042 s ≈ 9h 27m22s |
| valid | 10 | 982 | 18 586 | 9 412 s ≈ 2h 36m53s |
| test | 8 | 1 027 | 17 164 | 8 493 s ≈ 2h 21m33s |
| **total** | **53** | **6 393** | **109 636** | **51 948 s ≈ 14h 25m48s** |
### es-en
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 378 | 36 263 | 596 484 | 279 645 s ≈ 77h 40m45s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **421** | **39 176** | **642 479** | **301 499 s ≈ 83h 44m59s** |
### es-fr
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 43 | 3 663 | 61 547 | 27 427 s ≈ 7h 37m8s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **86** | **6 576** | **107 542** | **49 282 s ≈ 13h 41m22s** |
### es-it
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 57 | 5 600 | 92 182 | 42 529 s ≈ 11h 48m49s |
| valid | 1 | 16 | 156 | 110 s ≈ 0h 1m50s |
| test | 3 | 267 | 3 812 | 2 079 s ≈ 0h 34m40s |
| iwslt2021 | 3 | 225 | 3 539 | 1 625 s ≈ 0h 27m6s |
| **total** | **64** | **6 108** | **99 689** | **46 345 s ≈ 12h 52m25s** |
### es-pt
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 225 | 21 107 | 351 555 | 162 260 s ≈ 45h 4m20s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **268** | **24 020** | **397 550** | **184 114 s ≈ 51h 8m34s** |
### fr-en
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 250 | 30 171 | 477 516 | 199 524 s ≈ 55h 25m25s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **283** | **33 307** | **526 895** | **223 058 s ≈ 61h 57m38s** |
### fr-es
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 196 | 20 826 | 331 569 | 144 016 s ≈ 40h 0m17s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **229** | **23 962** | **380 948** | **167 550 s ≈ 46h 32m30s** |
### fr-pt
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 112 | 13 286 | 209 488 | 88 800 s ≈ 24h 40m1s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **145** | **16 422** | **258 867** | **112 334 s ≈ 31h 12m14s** |
### it-en
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 221 | 24 576 | 425 219 | 199 294 s ≈ 55h 21m34s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **248** | **27 485** | **477 262** | **222 908 s ≈ 61h 55m8s** |
### it-es
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 27 | 2 261 | 42 161 | 20 636 s ≈ 5h 43m57s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **54** | **5 170** | **94 204** | **44 251 s ≈ 12h 17m31s** |
### ru-en
| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 40 | 4 921 | 66 021 | 33 160 s ≈ 9h 12m41s |
| valid | 7 | 973 | 13 740 | 7 258 s ≈ 2h 0m58s |
| test | 9 | 1 132 | 14 605 | 7 554 s ≈ 2h 5m55s |
| **total** | **56** | **7 026** | **94 366** | **47 973 s ≈ 13h 19m33s** |
### All Languages — Download Sizes (original tarballs)
| Config | Language | Tarball size |
|--------|-------------|-------------:|
| `es` | Spanish | 35 GB |
| `fr` | French | 34 GB |
| `pt` | Portuguese | 29 GB |
| `it` | Italian | 19 GB |
| `ru` | Russian | 10 GB |
| `el` | Greek | 5.5 GB |
| `ar` | Arabic | 3.6 GB |
| `de` | German | 2.6 GB |
---
## Dataset Structure
### Schema
Each example corresponds to **one audio segment**, cut from a full TEDx talk
using the timestamps in the Kaldi `segments` file.
| Field | Type | Description |
|--------------|-------------------|---------------------------------------------------------|
| `id` | `string` | Unique segment id: `<talk_stem>_<index>` (e.g. `14zpc3Nj_e4_0003`) |
| `audio` | `Audio` | Audio float32 waveform of the segment |
| `transcript` | `string` | Transcription text |
| `translation` | `string` | Translated text; present only in translation configs (e.g. `es-en`) |
| `duration` | `float32` | Duration of the audio segment **in seconds** |
| `talk_id` | `string` | Source talk file stem |
| `segment_id` | `int32` | 0-based index of the segment within its talk |
| `start` | `float32` | Segment start time within the source talk (seconds) |
| `end` | `float32` | Segment end time within the source talk (seconds) |
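The fields in the schema are related to one another, which makes a quick sanity check possible. The sketch below is illustrative only: `check_segment` is a hypothetical helper, and it assumes that the index in `id` is zero-padded to four digits (as in the example id) and that `duration` equals `end - start` up to rounding. Neither assumption is stated outright in this card.

```python
import math
from typing import Dict, List


def check_segment(row: Dict, tol: float = 1e-2) -> List[str]:
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []

    # Assumption: id == "<talk_id>_<segment_id zero-padded to 4 digits>"
    expected_id = f"{row['talk_id']}_{row['segment_id']:04d}"
    if row["id"] != expected_id:
        problems.append(f"id {row['id']!r} != expected {expected_id!r}")

    # Assumption: duration is the difference of the two timestamps
    span = row["end"] - row["start"]
    if not math.isclose(row["duration"], span, abs_tol=tol):
        problems.append(f"duration {row['duration']} != end - start ({span:.3f})")

    return problems


# Metadata values copied from the example row in the Usage section
row = {
    "id": "14zpc3Nj_e4_0001",
    "talk_id": "14zpc3Nj_e4",
    "segment_id": 1,
    "start": 9.332,
    "end": 13.492,
    "duration": 4.16,
}
print(check_segment(row))
```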
### Splits
| Split | Description |
|---------|-------------------------------------|
| `train` | Training set |
| `valid` | Validation / development set |
| `test` | Test set |
---
## Usage
```python
from datasets import load_dataset

# Load the Arabic training split
ds = load_dataset("deepdml/mtedx", "ar", split="train")
print(ds[0])
# {
#   'id': '14zpc3Nj_e4_0001',
#   'audio': {'array': array([...], dtype=float32), 'sampling_rate': 16000},
#   'transcript': 'أكل العالم وغص بنخلة',
#   'duration': 4.16,
#   'talk_id': '14zpc3Nj_e4',
#   'segment_id': 1,
#   'start': 9.332,
#   'end': 13.492,
# }

# Stream a large language without downloading everything
ds = load_dataset("deepdml/mtedx", "es", split="train", streaming=True)
for sample in ds:
    audio = sample["audio"]["array"]  # numpy float32 array @ 16 kHz
    text = sample["transcript"]
    dur = sample["duration"]  # seconds
    break

# ASR fine-tuning example (Whisper / wav2vec2): keep only the needed columns
ds = load_dataset("deepdml/mtedx", "fr", split="train")
ds = ds.select_columns(["audio", "transcript", "duration"])
```
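The examples above cover the transcription-only configs. For the translation configs (such as `es-en`), each row also carries a `translation` field. The following is a minimal sketch of turning rows into source/target text pairs; `config_languages` and `translation_pairs` are hypothetical helper names, and the single row used here is a made-up stand-in rather than real data.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def config_languages(config: str) -> Tuple[str, Optional[str]]:
    """Split 'es-en' into ('es', 'en'); target is None for configs like 'ar'."""
    source, _, target = config.partition("-")
    return source, (target or None)


def translation_pairs(rows: Iterable[Dict], config: str) -> Iterator[Dict]:
    """Yield {source_lang: transcript, target_lang: translation} per row."""
    source, target = config_languages(config)
    if target is None:
        raise ValueError(f"{config!r} is a transcription-only config")
    for row in rows:
        yield {source: row["transcript"], target: row["translation"]}


# Made-up stand-in row. With the real data, `rows` could instead be
# load_dataset("deepdml/mtedx", "es-en", split="valid", streaming=True).
rows = [{"transcript": "Hola a todos.", "translation": "Hello everyone."}]
pairs = list(translation_pairs(rows, "es-en"))
print(pairs)
```

Because the helper only reads the `transcript` and `translation` keys, it works the same way on a streamed dataset and on an in-memory list.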
---
## Source Data
Downloaded from [OpenSLR SLR100](https://www.openslr.org/100/).
Each language pack (`mtedx_<lang>.tgz`) contains:
- `data/<split>/wav/` — Full-talk FLAC audio files
- `data/<split>/vtt/` — WebVTT transcript files (`<id>.<lang>.vtt`)
- `data/<split>/txt/` — Segments and Plain-text transcripts
- `docs/statistics.txt` — Per-split statistics
The upload script (`create_mtedx_dataset.py`) slices the full-talk FLAC files
into individual segments using the timestamps in the Kaldi `segments` file, and
discards segments shorter than 0.5 s or longer than 30 s.
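The slicing and filtering step described above can be sketched as follows. This is not the actual `create_mtedx_dataset.py`, which is not shown in this card; it assumes the standard Kaldi `segments` layout (`<utterance-id> <recording-id> <start> <end>`, times in seconds) and a 16 kHz sampling rate as in the usage example. All function names and the sample lines are hypothetical.

```python
from typing import List, NamedTuple, Tuple

MIN_SEC, MAX_SEC = 0.5, 30.0  # duration bounds quoted in the text above


class Segment(NamedTuple):
    utt_id: str
    talk_id: str
    start: float  # seconds from the start of the talk
    end: float


def parse_segments(text: str) -> List[Segment]:
    """Parse Kaldi-style segments lines: '<utt-id> <rec-id> <start> <end>'."""
    segments = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        utt_id, talk_id, start, end = line.split()
        segments.append(Segment(utt_id, talk_id, float(start), float(end)))
    return segments


def keep(seg: Segment) -> bool:
    """True unless the segment is shorter than 0.5 s or longer than 30 s."""
    return MIN_SEC <= (seg.end - seg.start) <= MAX_SEC


def sample_range(seg: Segment, sampling_rate: int = 16000) -> Tuple[int, int]:
    """Sample indices that would slice this segment out of the full-talk audio."""
    return round(seg.start * sampling_rate), round(seg.end * sampling_rate)


# Hypothetical segments file content (not taken from the corpus)
example = """\
talkA_0000 talkA 0.00 0.30
talkA_0001 talkA 9.332 13.492
talkA_0002 talkA 20.0 55.0
"""
kept = [seg for seg in parse_segments(example) if keep(seg)]
print([seg.utt_id for seg in kept], sample_range(kept[0]))
```

With a decoded full-talk waveform `wav`, the kept segment would then be `wav[start_idx:end_idx]` for the indices returned by `sample_range`.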
---
## Citation
```bibtex
@inproceedings{salesky2021mtedx,
title = {Multilingual TEDx Corpus for Speech Recognition and Translation},
author = {Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and
Roldano Cattoni and Matteo Negri and Marco Turchi and
Douglas W. Oard and Matt Post},
booktitle = {Proceedings of Interspeech},
year = {2021},
}
```
Provider: deepdml



