Speech-MASSIVE-test
Hosted on ModelScope. Updated 2025-12-05; indexed 2025-09-27.
Download link: https://modelscope.cn/datasets/FBK-MT/Speech-MASSIVE-test
# Speech-MASSIVE Test Split
This dataset repository contains **_only_** the `test` split of Speech-MASSIVE.
The `train` and `dev` splits are available in a separate dataset repository: [https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE](https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE)
## Dataset Description
Speech-MASSIVE is a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the [MASSIVE](https://aclanthology.org/2023.acl-long.235) textual corpus. Speech-MASSIVE covers 12 languages (Arabic, German, Spanish, French, Hungarian, Korean, Dutch, Polish, European Portuguese, Russian, Turkish, and Vietnamese) from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. The labels of MASSIVE utterances span 18 domains, with 60 intents and 55 slots. A full train split is provided for French and German; for all 12 languages (including French and German), we provide few-shot train, dev, and test splits. The few-shot train split (115 examples) covers all 18 domains, 60 intents, and 55 slots (including empty slots).
Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across diverse languages and tasks. To facilitate advances in speech technology, we release Speech-MASSIVE publicly under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
Speech-MASSIVE was accepted at INTERSPEECH 2024 (Kos, Greece) and nominated for the ISCA best student paper award.
## Dataset Summary
- `test`: test split available for all 12 languages
- ⚠️ The `dev`, `train_115`, and `train` splits are available in a separate dataset repository: [https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE](https://huggingface.co/datasets/FBK-MT/Speech-MASSIVE) ⚠️
  - `dev`: dev split available for all 12 languages
  - `train_115`: few-shot split available for all 12 languages (all 115 samples are cross-lingually aligned)
  - `train`: train split available for French (fr-FR) and German (de-DE)
| lang | split | # samples | # hours | total # speakers <br>(Male/Female/Unidentified) |
|:---:|:---:|:---:|:---:|:---:|
| ar-SA | dev | 2033 | 2.12 | 36 (22/14/0) |
| | test | 2974 | 3.23 | 37 (15/17/5) |
| | train_115 | 115 | 0.14 | 8 (4/4/0) |
| de-DE | dev | 2033 | 2.33 | 68 (35/32/1) |
| | test | 2974 | 3.41 | 82 (36/36/10) |
| | train | 11514 | 12.61 | 117 (50/63/4) |
| | train_115 | 115 | 0.15 | 7 (3/4/0) |
| es-ES | dev | 2033 | 2.53 | 109 (51/53/5) |
| | test | 2974 | 3.61 | 85 (37/33/15) |
| | train_115 | 115 | 0.13 | 7 (3/4/0) |
| fr-FR | dev | 2033 | 2.20 | 55 (26/26/3) |
| | test | 2974 | 2.65 | 75 (31/35/9) |
| | train | 11514 | 12.42 | 103 (50/52/1) |
| | train_115 | 115 | 0.12 | 103 (50/52/1) |
| hu-HU | dev | 2033 | 2.27 | 69 (33/33/3) |
| | test | 2974 | 3.30 | 55 (25/24/6) |
| | train_115 | 115 | 0.12 | 8 (3/4/1) |
| ko-KR | dev | 2033 | 2.12 | 21 (8/13/0) |
| | test | 2974 | 2.66 | 31 (10/18/3) |
| | train_115 | 115 | 0.14 | 8 (4/4/0) |
| nl-NL | dev | 2033 | 2.14 | 37 (17/19/1) |
| | test | 2974 | 3.30 | 100 (48/49/3) |
| | train_115 | 115 | 0.12 | 7 (3/4/0) |
| pl-PL | dev | 2033 | 2.24 | 105 (50/52/3) |
| | test | 2974 | 3.21 | 151 (73/71/7) |
| | train_115 | 115 | 0.10 | 7 (3/4/0) |
| pt-PT | dev | 2033 | 2.20 | 107 (51/53/3) |
| | test | 2974 | 3.25 | 102 (48/50/4) |
| | train_115 | 115 | 0.12 | 8 (4/4/0) |
| ru-RU | dev | 2033 | 2.25 | 40 (7/31/2) |
| | test | 2974 | 3.44 | 51 (25/23/3) |
| | train_115 | 115 | 0.12 | 7 (3/4/0) |
| tr-TR | dev | 2033 | 2.17 | 71 (36/34/1) |
| | test | 2974 | 3.00 | 42 (17/18/7) |
| | train_115 | 115 | 0.11 | 6 (3/3/0) |
| vi-VN | dev | 2033 | 2.10 | 28 (13/14/1) |
| | test | 2974 | 3.23 | 30 (11/14/5) |
| | train_115 | 115 | 0.11 | 7 (2/4/1) |
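
As a quick sanity check on the table, the average utterance length of a split can be derived from its sample count and total hours. The sketch below uses three rows copied from the table; the helper name `mean_utterance_seconds` is illustrative and not part of the dataset or of any library.

```python
def mean_utterance_seconds(num_samples: int, total_hours: float) -> float:
    """Average clip length in seconds, from a row's sample count and total hours."""
    return total_hours * 3600 / num_samples


# (lang, split) -> (# samples, # hours), copied from the table above
rows = {
    ("fr-FR", "test"): (2974, 2.65),
    ("de-DE", "test"): (2974, 3.41),
    ("de-DE", "train"): (11514, 12.61),
}

for (lang, split), (num_samples, hours) in rows.items():
    seconds = mean_utterance_seconds(num_samples, hours)
    print(f"{lang} {split}: {seconds:.2f} s per utterance")
```

Because the hour totals in the table are rounded to two decimals, the derived averages are approximate.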
## How to use
The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function.
For example, to download the French config, simply specify the corresponding language config name (i.e., "fr-FR" for French):
```python
from datasets import load_dataset
speech_massive_fr_test = load_dataset("FBK-MT/Speech-MASSIVE-test", "fr-FR", split="test", trust_remote_code=True)
```
If your machine does not have enough disk space, you can stream the dataset by adding a `streaming=True` argument to the `load_dataset` function call. In streaming mode, samples are loaded one at a time rather than downloading the entire dataset to disk.
```python
from datasets import load_dataset
speech_massive_de_test = load_dataset("FBK-MT/Speech-MASSIVE-test", "de-DE", split="test", streaming=True, trust_remote_code=True)
list(speech_massive_de_test.take(2))
```
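The lazy behaviour described above can be illustrated without any download. The sketch below is not the `datasets` implementation: `fake_stream` is a hypothetical stand-in generator, and `itertools.islice` plays the role that `.take(2)` plays in the snippet above. It only shows that taking two samples from a lazily produced sequence materializes two samples, not the whole split.

```python
from itertools import islice
from typing import Dict, Iterator

# Tracks how many samples the stand-in stream has actually produced.
stats = {"produced": 0}


def fake_stream(total: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed split: yields one sample at a time."""
    for i in range(total):
        stats["produced"] += 1
        yield {"id": str(i), "locale": "de-DE"}


# Take the first two samples; the remaining ones are never generated.
first_two = list(islice(fake_stream(2974), 2))
print(len(first_two), stats["produced"])
```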
You can also load all the available languages at once with the `all` config, and then access each split:
```python
from datasets import load_dataset
speech_massive = load_dataset("FBK-MT/Speech-MASSIVE-test", "all", trust_remote_code=True)
multilingual_test = speech_massive['test']
```
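Since the `all` config mixes languages in one dataset instance, a per-language breakdown can be recovered from the `locale` field of each sample. The following is a minimal sketch on plain dictionaries; `count_by_locale` and `toy_samples` are illustrative names, and the toy rows are invented stand-ins rather than real records.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_locale(samples: Iterable[Mapping]) -> Counter:
    """Count how many samples carry each `locale` value."""
    return Counter(sample["locale"] for sample in samples)


# Hypothetical stand-in rows; only the `locale` field matters here.
toy_samples = [
    {"id": "1", "locale": "fr-FR"},
    {"id": "2", "locale": "de-DE"},
    {"id": "3", "locale": "fr-FR"},
]

locale_counts = count_by_locale(toy_samples)
print(locale_counts.most_common())
```

Iterating over a loaded split row by row also decodes the audio of every row, so on real data it is likely cheaper to read only the `locale` column.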
## Dataset Structure
### Data configs
- `all`: load all 12 languages in a single dataset instance
- `lang`: load a single language by specifying one of the following language codes
- `ar-SA, de-DE, es-ES, fr-FR, hu-HU, ko-KR, nl-NL, pl-PL, pt-PT, ru-RU, tr-TR, vi-VN`
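A config name can be checked locally before it is passed to `load_dataset`. The sketch below hard-codes the twelve language codes listed above plus `all`; `SUPPORTED_LANGS` and `resolve_config` are hypothetical helpers, and treating the names as case-sensitive is an assumption.

```python
SUPPORTED_LANGS = (
    "ar-SA", "de-DE", "es-ES", "fr-FR", "hu-HU", "ko-KR",
    "nl-NL", "pl-PL", "pt-PT", "ru-RU", "tr-TR", "vi-VN",
)


def resolve_config(name: str) -> str:
    """Return `name` if it is `all` or a listed language code, else raise."""
    if name == "all" or name in SUPPORTED_LANGS:
        return name
    raise ValueError(
        f"Unknown config {name!r}; expected 'all' or one of {', '.join(SUPPORTED_LANGS)}"
    )


print(resolve_config("fr-FR"))
```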
### Data Splits
- `test`: test split available for all 12 languages
> [!WARNING]
> `validation`, `train_115` and `train` splits are uploaded to a separate dataset repository.
- `validation`: validation (dev) split available for all 12 languages
- `train_115`: few-shot (115 samples) split available for all 12 languages
- `train`: train split available for French (fr-FR) and German (de-DE)
### Data Instances
```json
{
  // Start of the data collected in Speech-MASSIVE
  'audio': {
    'path': 'train/2b12a21ca64a729ccdabbde76a8f8d56.wav',
    'array': array([-7.80913979e-...7259e-03]),
    'sampling_rate': 16000},
  'path': '/path/to/wav/file.wav',
  'is_transcript_reported': False,
  'is_validated': True,
  'speaker_id': '60fcc09cb546eee814672f44',
  'speaker_sex': 'Female',
  'speaker_age': '25',
  'speaker_ethnicity_simple': 'White',
  'speaker_country_of_birth': 'France',
  'speaker_country_of_residence': 'Ireland',
  'speaker_nationality': 'France',
  'speaker_first_language': 'French',
  // End of the data collected in Speech-MASSIVE
  // Start of the data extracted from MASSIVE
  // (https://huggingface.co/datasets/AmazonScience/massive/blob/main/README.md#data-instances)
  'id': '7509',
  'locale': 'fr-FR',
  'partition': 'train',
  'scenario': 2,
  'scenario_str': 'calendar',
  'intent_idx': 32,
  'intent_str': 'calendar_query',
  'utt': 'après les cours de natation quoi d autre sur mon calendrier mardi',
  'annot_utt': 'après les cours de natation quoi d autre sur mon calendrier [date : mardi]',
  'worker_id': '22',
  'slot_method': {'slot': ['date'], 'method': ['translation']},
  'judgments': {
    'worker_id': ['22', '19', '0'],
    'intent_score': [1, 2, 1],
    'slots_score': [1, 1, 1],
    'grammar_score': [4, 4, 4],
    'spelling_score': [2, 1, 2],
    'language_identification': ['target', 'target', 'target']
  },
  'tokens': ['après', 'les', 'cours', 'de', 'natation', 'quoi', 'd', 'autre', 'sur', 'mon', 'calendrier', 'mardi'],
  'labels': ['Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'date'],
  // End of the data extracted from MASSIVE
}
```
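The slot annotations in the instance above appear in two forms: inline in `annot_utt` as `[slot : value]`, and as parallel `tokens` and `labels` lists. The sketch below reads both forms from the example values shown above. The regular expression and the helper names `extract_slots` and `labeled_tokens` are assumptions made for illustration, and the bracket format is inferred from this single example only.

```python
import re
from typing import List, Tuple

# Matches "[slot_name : slot value]" as it appears in `annot_utt`.
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


def extract_slots(annot_utt: str) -> List[Tuple[str, str]]:
    """Return (slot_name, slot_value) pairs found in an annotated utterance."""
    return [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annot_utt)]


def labeled_tokens(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with its label, keeping only tokens that carry a slot."""
    return [(tok, lab) for tok, lab in zip(tokens, labels) if lab != "Other"]


annot_utt = "après les cours de natation quoi d autre sur mon calendrier [date : mardi]"
tokens = ["après", "les", "cours", "de", "natation", "quoi", "d", "autre",
          "sur", "mon", "calendrier", "mardi"]
labels = ["Other"] * 11 + ["date"]

print(extract_slots(annot_utt))
print(labeled_tokens(tokens, labels))
```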
### Data Fields
`audio.path`: Original audio file name
`audio.array`: Decoded audio samples, read at a sampling rate of 16,000 Hz
`audio.sampling_rate`: Sampling rate
`path`: Original audio file full path
`is_transcript_reported`: Whether the transcript was reported as 'syntactically wrong' by a crowdsourced worker
`is_validated`: Whether a crowdsourced worker has validated that the recorded audio matches the transcript exactly
`speaker_id`: Unique hash ID of the crowdsourced speaker
`speaker_sex`: Speaker's sex information provided by the crowd-source platform ([Prolific](http://prolific.com))
- Male
- Female
- Unidentified : Information not available from Prolific
`speaker_age`: Speaker's age information provided by Prolific
- age value (`str`)
- Unidentified : Information not available from Prolific
`speaker_ethnicity_simple`: Speaker's ethnicity information provided by Prolific
- ethnicity value (`str`)
- Unidentified : Information not available from Prolific
`speaker_country_of_birth`: Speaker's country of birth information provided by Prolific
- country value (`str`)
- Unidentified : Information not available from Prolific
`speaker_country_of_residence`: Speaker's country of residence information provided by Prolific
- country value (`str`)
- Unidentified : Information not available from Prolific
`speaker_nationality`: Speaker's nationality information provided by Prolific
- nationality value (`str`)
- Unidentified : Information not available from Prolific
`speaker_first_language`: Speaker's first language information provided by Prolific
- language value (`str`)
- Unidentified : Information not available from Prolific
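The speaker fields make it possible to compute per-split speaker statistics, such as a total speaker count with a Male/Female/Unidentified breakdown. The sketch below counts unique `speaker_id` values on invented toy rows; `speaker_breakdown` is a hypothetical helper, and counting one sex value per speaker is an assumption about how such a breakdown would be derived.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def speaker_breakdown(samples: Iterable[Mapping]) -> Tuple[int, int, int, int]:
    """Return (total, male, female, unidentified) counts of unique speakers."""
    sex_by_speaker: Dict[str, str] = {}
    for sample in samples:
        # Each speaker is counted once, however many utterances they recorded.
        sex_by_speaker[sample["speaker_id"]] = sample["speaker_sex"]
    counts = Counter(sex_by_speaker.values())
    return (len(sex_by_speaker), counts["Male"], counts["Female"], counts["Unidentified"])


# Invented stand-in rows with hypothetical speaker IDs.
toy_samples = [
    {"speaker_id": "spk-a", "speaker_sex": "Female"},
    {"speaker_id": "spk-a", "speaker_sex": "Female"},
    {"speaker_id": "spk-b", "speaker_sex": "Male"},
    {"speaker_id": "spk-c", "speaker_sex": "Unidentified"},
]

total, male, female, unidentified = speaker_breakdown(toy_samples)
print(f"{total} ({male}/{female}/{unidentified})")
```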
### Limitations
Because Speech-MASSIVE is built on the MASSIVE dataset, it inherits some grammatical errors present in the original MASSIVE text. Correcting these errors was outside the scope of our project. However, the `is_transcript_reported` attribute in Speech-MASSIVE lets users of the dataset identify the affected utterances.
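One way to act on this flag is to set aside the utterances whose transcript was reported. The sketch below does this on plain dictionaries with invented toy rows; `split_by_report` is a hypothetical helper. On a loaded `datasets.Dataset`, the `filter` method with a similar predicate would likely serve the same purpose.

```python
from typing import Iterable, List, Mapping, Tuple


def split_by_report(samples: Iterable[Mapping]) -> Tuple[List[Mapping], List[Mapping]]:
    """Separate samples into (clean, reported) using `is_transcript_reported`."""
    clean: List[Mapping] = []
    reported: List[Mapping] = []
    for sample in samples:
        (reported if sample["is_transcript_reported"] else clean).append(sample)
    return clean, reported


# Invented stand-in rows; only the flag and an ID are needed here.
toy_samples = [
    {"id": "10", "is_transcript_reported": False},
    {"id": "11", "is_transcript_reported": True},
    {"id": "12", "is_transcript_reported": False},
]

clean, reported = split_by_report(toy_samples)
print(len(clean), len(reported))
```

Keeping the reported utterances in a separate list, rather than discarding them, leaves room to evaluate on both subsets and compare.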
## License
All datasets are licensed under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
### Citation Information
Speech-MASSIVE was accepted at INTERSPEECH 2024 (Kos, Greece).
You can access the Speech-MASSIVE paper on [Interspeech archive](https://www.isca-archive.org/interspeech_2024/lee24i_interspeech.html) or [arXiv (same paper content but with appendix)](https://arxiv.org/abs/2408.03900).
Please cite the paper when referencing the Speech-MASSIVE corpus as:
```bibtex
@inproceedings{lee24i_interspeech,
title = {{Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond}},
author = {Beomseok Lee and Ioan Calapodescu and Marco Gaido and Matteo Negri and Laurent Besacier},
year = {2024},
booktitle = {{Interspeech 2024}},
pages = {817--821},
doi = {10.21437/Interspeech.2024-957},
issn = {2958-1796},
}
```
Provider: maas
Created: 2025-09-26



