Name: deepdml/mtedx
Creator: deepdml
Published: 2026-02-26 05:22:22
License: No description provided

Download link:

https://hf-mirror.com/datasets/deepdml/mtedx


Resource description:

---
language:
- es
- fr
- pt
- it
- ar
- ru
- el
- de
license: cc-by-nc-nd-4.0
task_categories:
- automatic-speech-recognition
- translation
pretty_name: Multilingual TEDx (mTEDx) — SLR100
tags:
- speech
- audio
- tedx
- multilingual
- asr
- speech-translation
configs:
- config_name: ar
  data_files:
  - split: train
    path: ar/train-*
  - split: valid
    path: ar/valid-*
  - split: test
    path: ar/test-*
- config_name: de
  data_files:
  - split: train
    path: de/train-*
  - split: valid
    path: de/valid-*
  - split: test
    path: de/test-*
- config_name: el
  data_files:
  - split: train
    path: el/train-*
  - split: valid
    path: el/valid-*
  - split: test
    path: el/test-*
- config_name: el-en
  data_files:
  - split: train
    path: el-en/train-*
  - split: valid
    path: el-en/valid-*
  - split: test
    path: el-en/test-*
- config_name: es
  data_files:
  - split: train
    path: es/train-*
  - split: valid
    path: es/valid-*
  - split: test
    path: es/test-*
- config_name: es-en
  data_files:
  - split: train
    path: es-en/train-*
  - split: valid
    path: es-en/valid-*
  - split: test
    path: es-en/test-*
- config_name: es-fr
  data_files:
  - split: train
    path: es-fr/train-*
  - split: valid
    path: es-fr/valid-*
  - split: test
    path: es-fr/test-*
- config_name: es-it
  data_files:
  - split: train
    path: es-it/train-*
  - split: valid
    path: es-it/valid-*
  - split: test
    path: es-it/test-*
- config_name: es-pt
  data_files:
  - split: train
    path: es-pt/train-*
  - split: valid
    path: es-pt/valid-*
  - split: test
    path: es-pt/test-*
- config_name: fr
  data_files:
  - split: train
    path: fr/train-*
  - split: valid
    path: fr/valid-*
  - split: test
    path: fr/test-*
- config_name: fr-en
  data_files:
  - split: train
    path: fr-en/train-*
  - split: valid
    path: fr-en/valid-*
  - split: test
    path: fr-en/test-*
- config_name: fr-es
  data_files:
  - split: train
    path: fr-es/train-*
  - split: valid
    path: fr-es/valid-*
  - split: test
    path: fr-es/test-*
- config_name: fr-pt
  data_files:
  - split: train
    path: fr-pt/train-*
  - split: valid
    path: fr-pt/valid-*
  - split: test
    path: fr-pt/test-*
- config_name: it
  data_files:
  - split: train
    path: it/train-*
  - split: valid
    path: it/valid-*
  - split: test
    path: it/test-*
- config_name: it-en
  data_files:
  - split: train
    path: it-en/train-*
  - split: valid
    path: it-en/valid-*
  - split: test
    path: it-en/test-*
- config_name: it-es
  data_files:
  - split: train
    path: it-es/train-*
  - split: valid
    path: it-es/valid-*
  - split: test
    path: it-es/test-*
- config_name: pt
  data_files:
  - split: train
    path: pt/train-*
  - split: valid
    path: pt/valid-*
  - split: test
    path: pt/test-*
- config_name: ru
  data_files:
  - split: train
    path: ru/train-*
  - split: valid
    path: ru/valid-*
  - split: test
    path: ru/test-*
- config_name: ru-en
  data_files:
  - split: train
    path: ru-en/train-*
  - split: valid
    path: ru-en/valid-*
  - split: test
    path: ru-en/test-*
dataset_info:
- config_name: ar
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 2178889228.384
    num_examples: 11129
  - name: valid
    num_bytes: 142140185.32
    num_examples: 1024
  - name: test
    num_bytes: 132688587.25
    num_examples: 1025
  download_size: 2472218095
  dataset_size: 2453718000.954
- config_name: de
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 1765760361.764
    num_examples: 6764
  - name: valid
    num_bytes: 256591446.396
    num_examples: 1172
  - name: test
    num_bytes: 245484939.996
    num_examples: 1126
  download_size: 2035086468
  dataset_size: 2267836748.1559997
- config_name: el
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 2568039234.829
    num_examples: 12817
  - name: valid
    num_bytes: 304746052.0
    num_examples: 972
  - name: test
    num_bytes: 276022477.156
    num_examples: 1018
  download_size: 4035781133
  dataset_size: 3148807763.985
- config_name: el-en
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 1308582725.54
    num_examples: 4324
  - name: valid
    num_bytes: 304854379.0
    num_examples: 972
  - name: test
    num_bytes: 276124096.156
    num_examples: 1018
  download_size: 1726358354
  dataset_size: 1889561200.696
- config_name: es
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 26561374066.116
    num_examples: 100198
  - name: valid
    num_bytes: 223052682.0
    num_examples: 894
  - name: test
    num_bytes: 245397409.388
    num_examples: 1003
  download_size: 25062015078
  dataset_size: 27029824157.504
- config_name: es-en
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 8439704112.165
    num_examples: 35619
  - name: valid
    num_bytes: 222635791.0
    num_examples: 894
  - name: test
    num_bytes: 245390358.21
    num_examples: 1003
  download_size: 9579841898
  dataset_size: 8907730261.375
- config_name: es-fr
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 901562061.99
    num_examples: 3599
  - name: valid
    num_bytes: 222644718.0
    num_examples: 894
  - name: test
    num_bytes: 245404215.21
    num_examples: 1003
  download_size: 1362766440
  dataset_size: 1369610995.2
- config_name: es-it
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 1337609300.41
    num_examples: 5510
  - name: valid
    num_bytes: 3469791.0
    num_examples: 16
  - name: test
    num_bytes: 75562486.0
    num_examples: 264
  download_size: 1494586040
  dataset_size: 1416641577.41
- config_name: es-pt
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 4688684250.412
    num_examples: 20729
  - name: valid
    num_bytes: 222639494.0
    num_examples: 894
  - name: test
    num_bytes: 245393422.21
    num_examples: 1003
  download_size: 5662737434
  dataset_size: 5156717166.622
- config_name: fr
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 20224863779.006
    num_examples: 113954
  - name: valid
    num_bytes: 281149192.02
    num_examples: 1020
  - name: test
    num_bytes: 221199082.174
    num_examples: 1046
  download_size: 24748069249
  dataset_size: 20727212053.2
- config_name: fr-en
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 6601653666.472
    num_examples: 29656
  - name: valid
    num_bytes: 281180195.78
    num_examples: 1020
  - name: test
    num_bytes: 221048522.14
    num_examples: 1043
  download_size: 6807053209
  dataset_size: 7103882384.392
- config_name: fr-es
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 3531289952.782
    num_examples: 20417
  - name: valid
    num_bytes: 281183223.78
    num_examples: 1020
  - name: test
    num_bytes: 221052191.14
    num_examples: 1043
  download_size: 4966571176
  dataset_size: 4033525367.702
- config_name: fr-pt
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 2280734741.884
    num_examples: 13062
  - name: valid
    num_bytes: 281184777.78
    num_examples: 1020
  - name: test
    num_bytes: 221050976.14
    num_examples: 1043
  download_size: 3363496343
  dataset_size: 2782970495.8039994
- config_name: it
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 14681799161.35
    num_examples: 49043
  - name: valid
    num_bytes: 239961988.0
    num_examples: 920
  - name: test
    num_bytes: 249234907.0
    num_examples: 992
  download_size: 14005414812
  dataset_size: 15170996056.35
- config_name: it-en
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 6537697373.862
    num_examples: 24187
  - name: valid
    num_bytes: 240056624.0
    num_examples: 920
  - name: test
    num_bytes: 249340745.0
    num_examples: 992
  download_size: 6868910303
  dataset_size: 7027094742.862
- config_name: it-es
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 637123264.152
    num_examples: 2218
  - name: valid
    num_bytes: 240063519.0
    num_examples: 920
  - name: test
    num_bytes: 249348573.0
    num_examples: 992
  download_size: 1155283492
  dataset_size: 1126535356.152
- config_name: pt
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 23704544310.012
    num_examples: 88798
  - name: valid
    num_bytes: 184985513.136
    num_examples: 1002
  - name: test
    num_bytes: 234352081.034
    num_examples: 1006
  download_size: 21173051837
  dataset_size: 24123881904.182003
- config_name: ru
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 8638149553.84
    num_examples: 28824
  - name: valid
    num_bytes: 269571389.0
    num_examples: 965
  - name: test
    num_bytes: 283388832.071
    num_examples: 1117
  download_size: 7547655322
  dataset_size: 9191109774.911
- config_name: ru-en
  features:
  - name: id
    dtype: string
  - name: audio
    dtype: audio
  - name: transcript
    dtype: string
  - name: translation
    dtype: string
  - name: duration
    dtype: float32
  - name: talk_id
    dtype: string
  - name: segment_id
    dtype: int32
  - name: start
    dtype: float32
  - name: end
    dtype: float32
  splits:
  - name: train
    num_bytes: 1168027677.668
    num_examples: 4868
  - name: valid
    num_bytes: 269664453.0
    num_examples: 965
  - name: test
    num_bytes: 283483764.071
    num_examples: 1117
  download_size: 1772511093
  dataset_size: 1721175894.7389998
---

# Multilingual TEDx (mTEDx) — SLR100

## Dataset Description

- **Homepage:** [OpenSLR SLR100](https://www.openslr.org/100/)

**mTEDx** is a multilingual speech recognition and translation corpus built from [TEDx Talks](https://www.ted.com/watch/tedx-talks).

Original resource: [https://www.openslr.org/100/](https://www.openslr.org/100/)

The corpus provides audio recordings and VTT transcripts for **8 languages** (Spanish, French, Portuguese, Italian, Russian, Greek, Arabic, German) with aligned translations into up to 5 languages (English, Spanish, French, Portuguese, Italian).

**License:** [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/)

**Contact:** Elizabeth Salesky (`esalesky@jhu.edu`), Matthew Wiesner (`wiesner@jhu.edu`)

---

## Corpus Statistics

Each row in the dataset corresponds to **one segment** (individual audio clip + transcript). The tables below reflect the sentence counts and total audio duration reported in `docs/statistics.txt` for each language. The `iwslt2021` rows come from those original statistics; this dataset publishes only the `train`, `valid`, and `test` splits.
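
The Duration column in the statistics tables pairs a raw second count with an approximate hours/minutes/seconds reading. A minimal sketch of that conversion follows, assuming fractional seconds are simply dropped before formatting; the helper name `format_duration` is hypothetical and not part of the dataset tooling.

```python
def format_duration(total_seconds: float) -> str:
    """Format a second count in the 'Hh MmSs' style used in the statistics tables."""
    whole = int(total_seconds)            # assumption: drop any fractional part
    hours, rem = divmod(whole, 3600)      # 3600 seconds per hour
    minutes, seconds = divmod(rem, 60)    # 60 seconds per minute
    return f"{hours}h {minutes}m{seconds}s"


# Spanish train split, reported as 764 301 s
print(format_duration(764301))  # 212h 18m21s
```
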
### Spanish (`es`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 988 | 102 171 | 1 676 862 | 764 301 s ≈ 212h 18m21s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **1 031** | **105 084** | **1 722 857** | **786 155 s ≈ 218h 22m35s** |

### French (`fr`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 949 | 116 045 | 1 838 447 | 780 355 s ≈ 216h 45m55s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **982** | **119 181** | **1 887 826** | **803 889 s ≈ 223h 18m9s** |

### Portuguese (`pt`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 820 | 90 244 | 1 433 073 | 642 853 s ≈ 178h 34m14s |
| valid | 9 | 1 013 | 14 457 | 6 522 s ≈ 1h 48m43s |
| test | 13 | 1 020 | 17 626 | 7 648 s ≈ 2h 7m28s |
| iwslt2021 | 11 | 1 022 | 15 498 | 7 290 s ≈ 2h 1m30s |
| **total** | **853** | **93 299** | **1 480 654** | **664 315 s ≈ 184h 31m55s** |

### Italian (`it`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 489 | 49 964 | 883 138 | 420 141 s ≈ 116h 42m22s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **516** | **52 873** | **935 181** | **443 756 s ≈ 123h 15m56s** |

### Russian (`ru`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 238 | 29 161 | 400 666 | 205 222 s ≈ 57h 0m23s |
| valid | 7 | 973 | 13 739 | 7 258 s ≈ 2h 0m58s |
| test | 9 | 1 132 | 14 598 | 7 554 s ≈ 2h 5m55s |
| **total** | **254** | **31 266** | **429 003** | **220 035 s ≈ 61h 7m15s** |

### Greek (`el`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 113 | 12 965 | 221 625 | 104 084 s ≈ 28h 54m45s |
| valid | 10 | 982 | 18 586 | 9 412 s ≈ 2h 36m53s |
| test | 8 | 1 027 | 17 164 | 8 493 s ≈ 2h 21m33s |
| **total** | **131** | **14 974** | **257 375** | **121 991 s ≈ 33h 53m11s** |

### Arabic (`ar`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 95 | 11 821 | 115 259 | 68 310 s ≈ 18h 58m |
| valid | 7 | 1 079 | 9 374 | 5 280 s ≈ 1h 28m |
| test | 7 | 1 066 | 8 964 | 5 187 s ≈ 1h 26m |
| **total** | **109** | **13 966** | **133 597** | **78 778 s ≈ 21h 53m** |

### German (`de`)

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 53 | 6 764 | 94 984 | 44 958 s ≈ 12h 29m18s |
| valid | 9 | 1 172 | 14 661 | 6 893 s ≈ 1h 54m53s |
| test | 9 | 1 166 | 14 289 | 6 715 s ≈ 1h 51m55s |
| **total** | **71** | **9 062** | **123 934** | **58 566 s ≈ 16h 16m6s** |

### el-en

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 35 | 4 384 | 73 886 | 34 042 s ≈ 9h 27m22s |
| valid | 10 | 982 | 18 586 | 9 412 s ≈ 2h 36m53s |
| test | 8 | 1 027 | 17 164 | 8 493 s ≈ 2h 21m33s |
| **total** | **53** | **6 393** | **109 636** | **51 948 s ≈ 14h 25m48s** |

### es-en

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 378 | 36 263 | 596 484 | 279 645 s ≈ 77h 40m45s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365.96 s ≈ 2h 2m46s |
| **total** | **421** | **39 176** | **642 479** | **301 499 s ≈ 83h 44m59s** |

### es-fr

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 43 | 3 663 | 61 547 | 27 427 s ≈ 7h 37m8s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **86** | **6 576** | **107 542** | **49 282 s ≈ 13h 41m22s** |

### es-it

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 57 | 5 600 | 92 182 | 42 529 s ≈ 11h 48m49s |
| valid | 1 | 16 | 156 | 110 s ≈ 0h 1m50s |
| test | 3 | 267 | 3 812 | 2 079 s ≈ 0h 34m40s |
| iwslt2021 | 3 | 225 | 3 539 | 1 625 s ≈ 0h 27m6s |
| **total** | **64** | **6 108** | **99 689** | **46 345 s ≈ 12h 52m25s** |

### es-pt

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 225 | 21 107 | 351 555 | 162 260 s ≈ 45h 4m20s |
| valid | 16 | 905 | 14 327 | 7 013 s ≈ 1h 56m53s |
| test | 12 | 1 012 | 15 439 | 7 475 s ≈ 2h 4m35s |
| iwslt2021 | 15 | 996 | 16 229 | 7 365 s ≈ 2h 2m46s |
| **total** | **268** | **24 020** | **397 550** | **184 114 s ≈ 51h 8m34s** |

### fr-en

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 250 | 30 171 | 477 516 | 199 524 s ≈ 55h 25m25s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **283** | **33 307** | **526 895** | **223 058 s ≈ 61h 57m38s** |

### fr-es

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 196 | 20 826 | 331 569 | 144 016 s ≈ 40h 0m17s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **229** | **23 962** | **380 948** | **167 550 s ≈ 46h 32m30s** |

### fr-pt

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 112 | 13 286 | 209 488 | 88 800 s ≈ 24h 40m1s |
| valid | 12 | 1 036 | 16 590 | 8 033 s ≈ 2h 13m54s |
| test | 10 | 1 059 | 16 136 | 7 158 s ≈ 1h 59m18s |
| iwslt2021 | 11 | 1 041 | 16 653 | 8 342 s ≈ 2h 19m2s |
| **total** | **145** | **16 422** | **258 867** | **112 334 s ≈ 31h 12m14s** |

### it-en

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 221 | 24 576 | 425 219 | 199 294 s ≈ 55h 21m34s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **248** | **27 485** | **477 262** | **222 908 s ≈ 61h 55m8s** |

### it-es

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 27 | 2 261 | 42 161 | 20 636 s ≈ 5h 43m57s |
| valid | 8 | 931 | 16 316 | 7 883 s ≈ 2h 11m24s |
| test | 8 | 999 | 18 359 | 7 790 s ≈ 2h 9m51s |
| iwslt2021 | 11 | 979 | 17 368 | 7 940 s ≈ 2h 12m20s |
| **total** | **54** | **5 170** | **94 204** | **44 251 s ≈ 12h 17m31s** |

### ru-en

| Split | Talks | Sentences | Words | Duration |
|-------|------:|----------:|------:|-----------------|
| train | 40 | 4 921 | 66 021 | 33 160 s ≈ 9h 12m41s |
| valid | 7 | 973 | 13 740 | 7 258 s ≈ 2h 0m58s |
| test | 9 | 1 132 | 14 605 | 7 554 s ≈ 2h 5m55s |
| **total** | **56** | **7 026** | **94 366** | **47 973 s ≈ 13h 19m33s** |

### All Languages: Download Sizes (original tarballs)

| Config | Language | Tarball size |
|--------|-------------|-------------:|
| `es` | Spanish | 35 GB |
| `fr` | French | 34 GB |
| `pt` | Portuguese | 29 GB |
| `it` | Italian | 19 GB |
| `ru` | Russian | 10 GB |
| `el` | Greek | 5.5 GB |
| `ar` | Arabic | 3.6 GB |
| `de` | German | 2.6 GB |

---

## Dataset Structure

### Schema

Each example corresponds to **one audio segment** extracted from a full TEDx talk using the timestamps in the Kaldi `segments` file.

| Field | Type | Description |
|--------------|-------------------|---------------------------------------------------------|
| `id` | `string` | Unique segment id: `<talk_stem>_<index>` (e.g. `14zpc3Nj_e4_0003`) |
| `audio` | `Audio` | Audio of the segment, decoded as a float32 waveform |
| `transcript` | `string` | Transcription text |
| `translation` | `string` | Translated text (translation configs only, e.g. `es-en`) |
| `duration` | `float32` | Duration of the audio segment **in seconds** |
| `talk_id` | `string` | Source talk file stem |
| `segment_id` | `int32` | 0-based index of the segment within its talk |
| `start` | `float32` | Segment start time within the source talk (seconds) |
| `end` | `float32` | Segment end time within the source talk (seconds) |

### Splits

| Split | Description |
|---------|-------------------------------------|
| `train` | Training set |
| `valid` | Validation / development set |
| `test` | Test set |

---

## Usage

```python
from datasets import load_dataset

# Load the Arabic training split
ds = load_dataset("deepdml/mtedx", "ar", split="train")
print(ds[0])
# {
#   'id': '14zpc3Nj_e4_0001',
#   'audio': {'array': array([...], dtype=float32), 'sampling_rate': 16000},
#   'transcript': 'أكل العالم وغص بنخلة',
#   'duration': 4.16,
#   'talk_id': '14zpc3Nj_e4',
#   'segment_id': 1,
#   'start': 9.332,
#   'end': 13.492,
# }

# Stream a large language without downloading everything
ds = load_dataset("deepdml/mtedx", "es", split="train", streaming=True)
for sample in ds:
    audio = sample["audio"]["array"]  # numpy float32 array @ 16 kHz
    text = sample["transcript"]
    dur = sample["duration"]  # seconds
    break  # inspect only the first sample

# ASR fine-tuning example (Whisper / wav2vec2): keep only the needed columns
ds = load_dataset("deepdml/mtedx", "fr", split="train")
ds = ds.select_columns(["audio", "transcript", "duration"])
```

---

## Source Data

Downloaded from [OpenSLR SLR100](https://www.openslr.org/100/).

Each language pack (`mtedx_<lang>.tgz`) contains:

- `data/<split>/wav/`: full-talk FLAC audio files
- `data/<split>/vtt/`: WebVTT transcript files (`<id>.<lang>.vtt`)
- `data/<split>/txt/`: segments and plain-text transcripts
- `docs/statistics.txt`: per-split statistics

The upload script (`create_mtedx_dataset.py`) slices the full-talk FLAC files into individual segments using the Kaldi `segments` timestamps and discards segments shorter than 0.5 s or longer than 30 s.

---

## Citation

```bibtex
@inproceedings{salesky2021mtedx,
  title     = {Multilingual TEDx Corpus for Speech Recognition and Translation},
  author    = {Elizabeth Salesky and Matthew Wiesner and Jacob Bremerman and Roldano Cattoni and Matteo Negri and Marco Turchi and Douglas W. Oard and Matt Post},
  booktitle = {Proceedings of Interspeech},
  year      = {2021},
}
```
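
As a rough illustration of the segment filtering described under Source Data, the sketch below parses Kaldi-style `segments` lines (`<segment-id> <recording-id> <start> <end>`) and keeps only segments between 0.5 s and 30 s. It is not the actual `create_mtedx_dataset.py`: the names `Segment`, `parse_segments`, and `keep_segment`, the inclusive bounds, and the first and third sample lines are assumptions made for illustration.

```python
from typing import Iterable, Iterator, NamedTuple

MIN_DURATION = 0.5   # seconds; shorter segments are discarded
MAX_DURATION = 30.0  # seconds; longer segments are discarded


class Segment(NamedTuple):
    segment_id: str
    talk_id: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield one Segment per well-formed line of a Kaldi segments file."""
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        seg_id, talk_id, start, end = parts
        yield Segment(seg_id, talk_id, float(start), float(end))


def keep_segment(seg: Segment) -> bool:
    """Assumed rule: keep segments whose duration lies within both bounds."""
    return MIN_DURATION <= seg.duration <= MAX_DURATION


# Hypothetical input: only the middle line uses timestamps from the usage example.
example_lines = [
    "14zpc3Nj_e4_0000 14zpc3Nj_e4 0.00 0.30",     # 0.3 s, too short
    "14zpc3Nj_e4_0001 14zpc3Nj_e4 9.332 13.492",  # about 4.16 s, kept
    "14zpc3Nj_e4_0002 14zpc3Nj_e4 20.0 55.0",     # 35 s, too long
]
kept = [seg for seg in parse_segments(example_lines) if keep_segment(seg)]
print([seg.segment_id for seg in kept])
```

On published data the same bounds can be checked against the `duration` field of each example instead of recomputing `end - start`.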

Application scenarios: