Speech-MASSIVE
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
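Introduction
Speech-MASSIVE is a multilingual spoken language understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. It covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) and inherits MASSIVE's annotations for the intent prediction and slot filling tasks. MASSIVE utterance labels span 18 domains, with 60 intents and 55 slots. Full training sets are provided for French and German only; few-shot train, dev, and test sets are provided for all 12 languages, French and German included. The few-shot train set (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).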
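The split layout described above can be sketched as a small lookup. This is only an illustration: `LANGUAGES`, `FULL_TRAIN_LANGUAGES`, and `available_splits` are hypothetical names introduced here, not part of any released loader, and the locale codes and split labels are copied from the statistics table in this card.

```python
# Locale codes as they appear in the statistics table of this card.
LANGUAGES = [
    "ar-SA", "de-DE", "es-ES", "fr-FR", "hu-HU", "ko-KR",
    "nl-NL", "pl-PL", "pt-PT", "ru-RU", "tr-TR", "vi-VN",
]

# Per the introduction, only German and French ship a full training set.
FULL_TRAIN_LANGUAGES = {"de-DE", "fr-FR"}


def available_splits(locale: str) -> list[str]:
    """Return the split labels this card lists for a locale (hypothetical helper)."""
    if locale not in LANGUAGES:
        raise ValueError(f"{locale!r} is not one of the 12 Speech-MASSIVE locales")
    splits = ["few-shot train", "dev", "test"]
    if locale in FULL_TRAIN_LANGUAGES:
        # The full training set comes in addition to the few-shot one.
        splits.insert(0, "train-full")
    return splits


if __name__ == "__main__":
    for locale in LANGUAGES:
        print(locale, available_splits(locale))
```

Keeping the full-train locales in a separate set mirrors the wording of the description: every language has the three common splits, and two languages have one extra.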
Data Statistics
| Language | Split | Samples | Duration (hours) | Total speakers (male/female/unidentified) |
|---|---|---|---|---|
| ar-SA | few-shot train | 115 | 0.14 | 8 (4/4/0) |
| ar-SA | dev | 2033 | 2.12 | 36 (22/14/0) |
| ar-SA | test | 2974 | 3.23 | 37 (15/17/5) |
| de-DE | train-full | 11514 | 12.61 | 117 (50/63/4) |
| de-DE | few-shot train | 115 | 0.15 | 7 (3/4/0) |
| de-DE | dev | 2033 | 2.33 | 68 (35/32/1) |
| de-DE | test | 2974 | 3.41 | 82 (36/36/10) |
| es-ES | few-shot train | 115 | 0.13 | 7 (3/4/0) |
| es-ES | dev | 2033 | 2.53 | 109 (51/53/5) |
| es-ES | test | 2974 | 3.61 | 85 (37/33/15) |
| fr-FR | train-full | 11514 | 12.42 | 103 (50/52/1) |
| fr-FR | few-shot train | 115 | 0.12 | 103 (50/52/1) |
| fr-FR | dev | 2033 | 2.20 | 55 (26/26/3) |
| fr-FR | test | 2974 | 2.65 | 75 (31/35/9) |
| hu-HU | few-shot train | 115 | 0.12 | 8 (3/4/1) |
| hu-HU | dev | 2033 | 2.27 | 69 (33/33/3) |
| hu-HU | test | 2974 | 3.30 | 55 (25/24/6) |
| ko-KR | few-shot train | 115 | 0.14 | 8 (4/4/0) |
| ko-KR | dev | 2033 | 2.12 | 21 (8/13/0) |
| ko-KR | test | 2974 | 2.66 | 31 (10/18/3) |
| nl-NL | few-shot train | 115 | 0.12 | 7 (3/4/0) |
| nl-NL | dev | 2033 | 2.14 | 37 (17/19/1) |
| nl-NL | test | 2974 | 3.30 | 100 (48/49/3) |
| pl-PL | few-shot train | 115 | 0.10 | 7 (3/4/0) |
| pl-PL | dev | 2033 | 2.24 | 105 (50/52/3) |
| pl-PL | test | 2974 | 3.21 | 151 (73/71/7) |
| pt-PT | few-shot train | 115 | 0.12 | 8 (4/4/0) |
| pt-PT | dev | 2033 | 2.20 | 107 (51/53/3) |
| pt-PT | test | 2974 | 3.25 | 102 (48/50/4) |
| ru-RU | few-shot train | 115 | 0.12 | 7 (3/4/0) |
| ru-RU | dev | 2033 | 2.25 | 40 (7/31/2) |
| ru-RU | test | 2974 | 3.44 | 51 (25/23/3) |
| tr-TR | few-shot train | 115 | 0.11 | 6 (3/3/0) |
| tr-TR | dev | 2033 | 2.17 | 71 (36/34/1) |
| tr-TR | test | 2974 | 3.00 | 42 (17/18/7) |
| vi-VN | few-shot train | 115 | 0.11 | 7 (2/4/1) |
| vi-VN | dev | 2033 | 2.10 | 28 (13/14/1) |
| vi-VN | test | 2974 | 3.23 | 30 (11/14/5) |
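As a quick way to read the table, the sample counts and durations can be combined into an approximate mean utterance length per split. The sketch below copies a handful of rows from the table; `ROWS` and `mean_utterance_seconds` are illustrative names, not part of the dataset tooling. Durations in the table are rounded to two decimals, so results for the small few-shot splits are coarse.

```python
# (locale, split) -> (number of samples, duration in hours), copied from the table.
ROWS = {
    ("de-DE", "train-full"): (11514, 12.61),
    ("fr-FR", "train-full"): (11514, 12.42),
    ("fr-FR", "dev"): (2033, 2.20),
    ("fr-FR", "test"): (2974, 2.65),
    ("es-ES", "test"): (2974, 3.61),
    ("pl-PL", "few-shot train"): (115, 0.10),
}


def mean_utterance_seconds(samples: int, hours: float) -> float:
    """Approximate mean utterance length in seconds for one split."""
    if samples <= 0:
        raise ValueError("samples must be positive")
    return hours * 3600.0 / samples


if __name__ == "__main__":
    for (locale, split), (samples, hours) in ROWS.items():
        seconds = mean_utterance_seconds(samples, hours)
        print(f"{locale:6s} {split:15s} {seconds:.2f} s per utterance")
```

For the rows sampled here, the estimate lands in the range of roughly three to four and a half seconds per utterance, which is consistent with short voice-assistant commands.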
License
The Speech-MASSIVE dataset is released under the CC-BY-SA-4.0 license.
All code in this repository is released under the Apache License 2.0.
Citation
Please cite both our Speech-MASSIVE paper and the MASSIVE paper, since Speech-MASSIVE uses the MASSIVE text data as its seed data.
MASSIVE paper:
@misc{fitzgerald2022massive,
  title={MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages},
  author={Jack FitzGerald and Christopher Hench and Charith Peris and Scott Mackie and Kay Rottmann and Ana Sanchez and Aaron Nash and Liam Urbach and Vishesh Kakarala and Richa Singh and Swetha Ranganath and Laurie Crist and Misha Britan and Wouter Leeuwis and Gokhan Tur and Prem Natarajan},
  year={2022},
  eprint={2204.08582},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Speech-MASSIVE paper:
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond. University of Trento; NAVER LABS Europe, France; Fondazione Bruno Kessler, Italy. 2024.



