VoiceBrowseComp
Source: ModelScope community · Updated 2026-03-11 · Indexed 2026-05-03
Download link:
https://modelscope.cn/datasets/ylc0411/VoiceBrowseComp
Resource description:
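# Voice-BrowseComp Dataset JSON Format Specification
This document describes the unified format of the JSON annotation files generated for the three Voice-BrowseComp datasets (AudioMarathon, MMAU, MMSU), so that downstream evaluation code can consume them consistently.
---
## 1. File Structure
Each dataset produces one JSON file whose content is an array of samples: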
```
voice_browsecomp_output/
├── GTZAN/
│ ├── GTZAN_voice_browsecomp.json
│ └── audios/
├── DESED/
│ ├── DESED_voice_browsecomp.json
│ └── audios/
├── HAD/ ...
├── LibriSpeech/ ...
├── RACE/ ...
├── SLUE/ ...
├── TAU/ ...
├── VESUS/ ...
├── Vox/ ...
└── Vox_age/ ...
voice_browsecomp_mmau_output/
├── voice_browsecomp_mmau.json
└── audios/
voice_browsecomp_mmsu_output/
├── voice_browsecomp_mmsu.json
└── audios/
```
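The layout above can be traversed with a few lines of Python. The following is a minimal sketch, assuming every annotation file name contains the substring `voice_browsecomp` and ends in `.json`, as in the tree shown; the helper names `find_annotation_files` and `load_samples` are hypothetical and not part of the dataset tooling.

```python
import json
from pathlib import Path


def find_annotation_files(root):
    """Return all annotation JSON files under `root`, sorted by path.

    Assumption: annotation files are the only *.json files whose name
    contains "voice_browsecomp", as in the directory tree above.
    """
    return sorted(Path(root).rglob("*voice_browsecomp*.json"))


def load_samples(json_path):
    """Load one annotation file and check that it holds an array of samples."""
    with open(json_path, encoding="utf-8") as f:
        samples = json.load(f)
    if not isinstance(samples, list):
        raise ValueError(f"{json_path}: expected a JSON array of samples")
    return samples
```

Calling `find_annotation_files` on the directory that contains the three `*_output` folders would return one path per dataset file, each of which can then be passed to `load_samples`.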
---
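## 2. Unified Field Format
The JSON annotations of all three datasets share the following core fields:
| Field | Type | Description |
|---|---|---|
| `uniq_id` | `string` | Unique sample identifier |
| `task_name` | `string` | Task name |
| `dataset_source` | `string` | Source dataset (e.g. `"GTZAN"`, `"MMAU"`, `"MMSU"`) |
| `path` | `string` | **Relative path** of the transformed audio (relative to the directory containing the JSON file), e.g. `"audios/xxx.wav"` |
| `original_path` | `string` | **Absolute path** of the original audio |
| `transforms_applied` | `list[object]` | List of applied transforms (full record; see the detailed description below) |
| `num_transforms` | `int` | Number of transforms |
| `question` | `string` | Evaluation question |
| `answer_gt` | `string` | Ground-truth answer |
| `choice_a` | `string` | Option A |
| `choice_b` | `string` | Option B |
| `choice_c` | `string` | Option C |
| `choice_d` | `string` | Option D |
| `choice_e` | `string` | Option E |
For multiple-choice questions, some options may be empty; for short-answer questions, all options are empty strings.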
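As an illustration of the core fields, the sketch below checks that a sample contains each of them with the JSON type listed in the table. The `CORE_FIELDS` mapping and the function name `core_field_problems` are hypothetical names introduced here; only the field names and types come from the table.

```python
# Core field names mapped to the Python type that json.load produces for them.
CORE_FIELDS = {
    "uniq_id": str,
    "task_name": str,
    "dataset_source": str,
    "path": str,
    "original_path": str,
    "transforms_applied": list,
    "num_transforms": int,
    "question": str,
    "answer_gt": str,
    "choice_a": str,
    "choice_b": str,
    "choice_c": str,
    "choice_d": str,
    "choice_e": str,
}


def core_field_problems(sample):
    """Return a list of human-readable problems; an empty list means OK.

    Note: Python treats bool as a subclass of int, so a boolean
    num_transforms would not be flagged by this simple check.
    """
    problems = []
    for name, expected in CORE_FIELDS.items():
        if name not in sample:
            problems.append(f"missing field: {name}")
        elif not isinstance(sample[name], expected):
            actual = type(sample[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems
```

Dataset-specific extra fields are ignored by this check, so the same function can be applied to samples from any of the three datasets.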
---
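## 3. The `transforms_applied` Field in Detail
Each transform is a complete object that records the transform type, its parameters, and the corresponding reverse-recovery information: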
```json
{
  "transform_type": "noise_addition",
  "noise_type": "white",
  "snr_db": -4.98,
  "required_tool": "denoiser",
  "reverse_params": {
    "noise_type": "white",
    "estimated_snr": -4.98
  },
  "transform_name": "white_noise",
  "order": 0
}
```
### Field Descriptions
| Field | Type | Description |
|---|---|---|
| `transform_type` | `string` | Transform category (e.g. `noise_addition`, `speed_change`, `reverb`) |
| `transform_name` | `string` | Transform name (e.g. `white_noise`, `speed_change`, `reverb`) |
| `order` | `int` | Application order (starting from 0) |
| `required_tool` | `string` | Name of the recommended recovery tool |
| `reverse_params` | `object` | Parameters needed for reverse recovery |
| Other fields | Varies | Transform-specific parameters (e.g. `snr_db`, `speed_factor`, `gain_db`) |
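The `order`, `required_tool`, and `reverse_params` fields together suggest how a recovery pipeline might be planned. The sketch below sorts the transforms by `order` and lists the tools with the last-applied transform first. Undoing transforms in reverse application order is an assumption of this example, not something the format prescribes, and `restoration_plan` is a hypothetical helper name.

```python
def restoration_plan(transforms_applied):
    """Return (required_tool, reverse_params) pairs, last-applied transform first.

    Assumption: transforms are undone in the reverse of their `order`.
    The annotation format records the order but does not mandate this.
    """
    ordered = sorted(transforms_applied, key=lambda t: t["order"])
    return [(t["required_tool"], t["reverse_params"]) for t in reversed(ordered)]
```

For a record with white noise (`order` 0), a volume change (`order` 1), and click noise (`order` 2), the plan would start with the `declicker` and end with the `denoiser`.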
### Supported Transform Types
| `transform_name` | `transform_type` | Description | Key parameters |
|---|---|---|---|
| `white_noise` | `noise_addition` | White noise | `snr_db` |
| `colored_noise` | `noise_addition` | Colored noise (pink/brown) | `noise_type`, `snr_db` |
| `hum_noise` | `noise_addition` | Electrical hum | `frequency`, `snr_db` |
| `volume_change` | `volume_change` | Volume change | `gain_db` |
| `speed_change` | `speed_change` | Speed change | `speed_factor` |
| `pitch_shift` | `pitch_shift` | Pitch shift | `semitones` |
| `reverb` | `reverb` | Reverb | `room_size`, `rt60` |
| `time_stretch` | `time_stretch` | Time stretch | `stretch_factor` |
| `click_noise` | `click_noise` | Click noise | `click_rate`, `intensity` |
| `silence_gaps` | `silence_insertion` | Silence insertion | `num_gaps`, `gap_duration` |
| `low_pass` | `low_pass_filter` | Low-pass filtering | `cutoff_freq` |
| `telephone_effect` | `telephone_effect` | Telephone effect | `low_freq_hz`, `high_freq_hz` |
| `codec_compression` | `codec_compression` | Codec compression | `codec`, `bitrate` |
| `reverse_audio` | `audio_reversal` | Audio reversal | `reverse_type` |
| `repeat_segments` | `segment_repetition` | Segment repetition | `num_repeats` |
| `cross_talk` | `cross_talk` | Cross-talk | not listed |
| `irrelevant_speech` | `irrelevant_speech` | Irrelevant speech | not listed |
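Because `transform_name` and `transform_type` differ for several rows of this table, a lookup keyed by name can help catch records that mix them up. The following is a sketch only: the mapping is transcribed from the table, while `TRANSFORM_TYPE_BY_NAME` and `mismatched_transforms` are hypothetical names.

```python
# transform_name -> transform_type, transcribed from the table above.
TRANSFORM_TYPE_BY_NAME = {
    "white_noise": "noise_addition",
    "colored_noise": "noise_addition",
    "hum_noise": "noise_addition",
    "volume_change": "volume_change",
    "speed_change": "speed_change",
    "pitch_shift": "pitch_shift",
    "reverb": "reverb",
    "time_stretch": "time_stretch",
    "click_noise": "click_noise",
    "silence_gaps": "silence_insertion",
    "low_pass": "low_pass_filter",
    "telephone_effect": "telephone_effect",
    "codec_compression": "codec_compression",
    "reverse_audio": "audio_reversal",
    "repeat_segments": "segment_repetition",
    "cross_talk": "cross_talk",
    "irrelevant_speech": "irrelevant_speech",
}


def mismatched_transforms(transforms_applied):
    """Return transforms whose name is unknown or whose type disagrees with the table."""
    bad = []
    for t in transforms_applied:
        name = t.get("transform_name")
        expected_type = TRANSFORM_TYPE_BY_NAME.get(name)
        if expected_type is None or expected_type != t.get("transform_type"):
            bad.append(t)
    return bad
```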
---
## 4. Full Sample Examples
### AudioMarathon Dataset (GTZAN as an Example)
```json
{
  "uniq_id": "GTZAN_000000_v00",
  "task_name": "音乐流派分类任务",
  "dataset_source": "GTZAN",
  "path": "audios/GTZAN_000000_v00.wav",
  "original_path": "/data1/.../GTZAN/concatenated_audio/wav/blues/blues_concatenated_01.wav",
  "transforms_applied": [
    {
      "transform_type": "noise_addition",
      "noise_type": "white",
      "snr_db": -4.98,
      "required_tool": "denoiser",
      "reverse_params": {
        "noise_type": "white",
        "estimated_snr": -4.98
      },
      "transform_name": "white_noise",
      "order": 0
    },
    {
      "transform_type": "volume_change",
      "gain_db": -11.96,
      "gain_linear": 0.25,
      "clipped": false,
      "required_tool": "volume_normalizer",
      "reverse_params": {
        "target_gain_db": 11.96
      },
      "transform_name": "volume_change",
      "order": 1
    },
    {
      "transform_type": "click_noise",
      "click_rate": 10.25,
      "num_clicks": 576,
      "intensity": 0.47,
      "required_tool": "declicker",
      "reverse_params": {
        "detect_clicks": true,
        "interpolate": true
      },
      "transform_name": "click_noise",
      "order": 2
    }
  ],
  "num_transforms": 3,
  "question": "What music genre is represented in this audio segment?",
  "answer_gt": "blues",
  "choice_a": "Country - country",
  "choice_b": "Metal - metal",
  "choice_c": "Blues - blues",
  "choice_d": "Classical - classical",
  "choice_e": ""
}
```
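In the record above, `num_transforms` equals the length of `transforms_applied`, and the `order` values run 0, 1, 2. A loader might verify that relationship, as in the sketch below. Treating `order` as a contiguous 0-based sequence is an assumption drawn from this example, and `transform_record_problems` is a hypothetical name.

```python
def transform_record_problems(sample):
    """Check num_transforms and order values against transforms_applied.

    Assumption: `order` values form the contiguous sequence 0..n-1,
    as in the GTZAN example; the format only states that they start at 0.
    """
    problems = []
    transforms = sample["transforms_applied"]
    if sample["num_transforms"] != len(transforms):
        problems.append(
            f"num_transforms is {sample['num_transforms']} "
            f"but {len(transforms)} transforms are listed"
        )
    orders = sorted(t["order"] for t in transforms)
    if orders != list(range(len(transforms))):
        problems.append(f"order values {orders} are not contiguous from 0")
    return problems
```

A record whose transform list has been shortened for display (one entry listed with `num_transforms` set to 5, for instance) would be reported by the first check.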
### MMAU Dataset
The `transforms_applied` list in this example has a single entry although `num_transforms` is 5, so the listing appears to be abridged.
```json
{
  "uniq_id": "c93e3644-5227-4710-b27b-5c46750afbff_v00",
  "task_name": "sound",
  "dataset_source": "MMAU",
  "path": "audios/c93e3644-5227-4710-b27b-5c46750afbff_v00_5transforms.wav",
  "original_path": "/data1/.../MMAU-Pro/data/c93e3644-5227-4710-b27b-5c46750afbff.wav",
  "transforms_applied": [
    {
      "transform_type": "noise_addition",
      "noise_type": "white",
      "snr_db": -4.98,
      "required_tool": "denoiser",
      "reverse_params": { "noise_type": "white", "estimated_snr": -4.98 },
      "transform_name": "white_noise",
      "order": 0
    }
  ],
  "num_transforms": 5,
  "question": "What is being prepared in the audio?",
  "answer_gt": "Boba tea",
  "choice_a": "Boba tea",
  "choice_b": "Milk",
  "choice_c": "Coffee",
  "choice_d": "Milk tea",
  "choice_e": "Green tea",
  "category": "sound",
  "length_type": "medium",
  "perceptual_skills": ["Acoustic Source Characterization"],
  "reasoning_skills": ["Procedural Reasoning"]
}
```
### MMSU Dataset
As in the MMAU example, only one transform is listed here although `num_transforms` is 5, so the listing appears to be abridged.
```json
{
  "uniq_id": "volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf_v00",
  "task_name": "volume_comparison",
  "dataset_source": "MMSU",
  "path": "audios/volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf_v00_5transforms.wav",
  "original_path": "/data1/.../MMSU/audio/volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf.wav",
  "transforms_applied": [
    {
      "transform_type": "volume_change",
      "gain_db": -11.53,
      "gain_linear": 0.27,
      "clipped": false,
      "required_tool": "volume_normalizer",
      "reverse_params": { "target_gain_db": 11.53 },
      "transform_name": "volume_change",
      "order": 0
    }
  ],
  "num_transforms": 5,
  "question": "Which volume pattern best matches the audio?",
  "answer_gt": "high-low-medium",
  "choice_a": "high-low-medium",
  "choice_b": "medium-high-low",
  "choice_c": "low-high-medium",
  "choice_d": "low-medium-high",
  "choice_e": "",
  "category": "Perception",
  "sub_category": "Paralinguistics"
}
```
---
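## 5. Dataset-Specific Fields
In addition to the unified core fields, each dataset may include extra metadata fields:
### MMAU-Specific Fields
| Field | Type | Description |
|---|---|---|
| `category` | `string` | Audio category (`sound`, `music`, `speech`) |
| `length_type` | `string` | Audio length type (`short`, `medium`, `long`) |
| `perceptual_skills` | `list[string]` | Required perceptual skills |
| `reasoning_skills` | `list[string]` | Required reasoning skills |
> **Note**: MMAU samples may have more than 5 options (`choice_a` through `choice_j`), because the `choices` array in the original data can contain up to 10 items.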
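Since the number of `choice_*` fields varies, evaluation code may prefer to collect options dynamically rather than hard-code five letters. The sketch below scans `choice_a` through `choice_j` and then looks up which letter matches `answer_gt`. The `" - "` handling is an assumption based on the GTZAN example (`"Blues - blues"` with `answer_gt` of `"blues"`); both function names are hypothetical.

```python
import string


def present_choices(sample, max_choices=10):
    """Map option letters to non-empty option texts, scanning choice_a..choice_j."""
    choices = {}
    for letter in string.ascii_lowercase[:max_choices]:
        text = sample.get(f"choice_{letter}", "")
        if text:
            choices[letter.upper()] = text
    return choices


def answer_letter(sample):
    """Return the letter of the option matching answer_gt, or None if none matches."""
    target = sample["answer_gt"].strip().lower()
    for letter, text in present_choices(sample).items():
        candidates = {text.strip().lower()}
        # Assumption from the GTZAN example: in "Blues - blues" the label
        # compared against answer_gt is the part after " - ".
        if " - " in text:
            candidates.add(text.rsplit(" - ", 1)[1].strip().lower())
        if target in candidates:
            return letter
    return None
```

For open-ended samples with no options, `present_choices` returns an empty mapping and `answer_letter` returns `None`, so callers can fall back to free-text scoring.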
### MMSU-Specific Fields
| Field | Type | Description |
|---|---|---|
| `category` | `string` | Task category (`Perception`, `Reasoning`, etc.) |
| `sub_category` | `string` | Sub-category (`Paralinguistics`, `Phonetics`, etc.) |
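### AudioMarathon (9 Datasets) Notes
| Dataset | `task_name` | `answer_gt` type | Number of options |
|---|---|---|---|
| GTZAN | `音乐流派分类任务` (music genre classification) | Genre name | 4 |
| DESED | `声音事件检测任务` (sound event detection) | Event class name | 5 |
| HAD | `人声真假检测任务` (real/fake voice detection) | `real` / `fake` | 2 |
| LibriSpeech | `语音识别任务` (speech recognition) | Transcript text | 0 (open-ended) |
| RACE | `阅读理解任务` (reading comprehension) | Answer text | 4 |
| SLUE | `语义理解评估任务` (semantic understanding evaluation) | Sentiment label | 3 |
| TAU | `声学场景分类任务` (acoustic scene classification) | Scene name | 5 |
| VESUS | `情感识别任务` (emotion recognition) | Emotion label | 5 |
| Vox | `性别分类任务` (gender classification) | `male` / `female` | 2 |
| Vox_age | `年龄分类任务` (age classification) | Age group | 4 |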
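The option counts in this table can serve as a sanity check when loading AudioMarathon files. The sketch below is illustrative only. It assumes that `dataset_source` uses the same spelling as the first column (confirmed here only for `"GTZAN"`) and that unused options are empty strings; `EXPECTED_OPTION_COUNT` and `option_count_matches` are hypothetical names.

```python
# Number of options per AudioMarathon dataset, transcribed from the table above.
EXPECTED_OPTION_COUNT = {
    "GTZAN": 4,
    "DESED": 5,
    "HAD": 2,
    "LibriSpeech": 0,  # open-ended, no options
    "RACE": 4,
    "SLUE": 3,
    "TAU": 5,
    "VESUS": 5,
    "Vox": 2,
    "Vox_age": 4,
}


def option_count_matches(sample):
    """Compare the number of non-empty choice_a..choice_e fields with the table.

    Returns True or False for sources listed in the table, and None for
    any other dataset_source (for example "MMAU" or "MMSU").
    """
    expected = EXPECTED_OPTION_COUNT.get(sample["dataset_source"])
    if expected is None:
        return None
    present = sum(1 for letter in "abcde" if sample.get(f"choice_{letter}", ""))
    return present == expected
```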
---
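## 6. Notes
1. **Audio path**: `path` is relative to the directory containing the JSON file; join it with that base directory when loading.
2. **Option format**: For questions without options (e.g. LibriSpeech speech recognition), `choice_a` through `choice_e` are all empty strings.
3. **Number of transforms**: By default, 3 to 5 transforms are applied to each sample; this can be adjusted with `--min-transforms` and `--max-transforms`.
4. **Multiple variants**: Several transformed versions can be generated from the same original audio, controlled by `--variants-per-sample`; sample IDs are distinguished by suffixes such as `_v00` and `_v01`.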
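Notes 1 and 4 translate directly into two small helpers: one that resolves `path` against the JSON file's directory, and one that splits the variant suffix off `uniq_id`. This is a sketch under the assumption that the variant suffix is always `_v` followed by digits at the very end of `uniq_id`, as in the examples; `resolve_audio_path` and `split_variant` are hypothetical names.

```python
import re
from pathlib import Path

# Assumption: uniq_id ends with "_v" followed by digits, e.g. "_v00", "_v01".
_VARIANT_SUFFIX = re.compile(r"^(?P<base>.+)_v(?P<index>\d+)$")


def resolve_audio_path(json_path, sample):
    """Join the sample's relative `path` with the directory of its JSON file."""
    return Path(json_path).parent / sample["path"]


def split_variant(uniq_id):
    """Split a uniq_id into (base id, variant index); index is None if no suffix."""
    match = _VARIANT_SUFFIX.match(uniq_id)
    if match is None:
        return uniq_id, None
    return match.group("base"), int(match.group("index"))
```

Grouping samples by the base ID returned from `split_variant` is one way to collect all variants of the same original audio, for instance when reporting robustness per source clip.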
Provider:
maas
Created:
2026-02-26



