VoiceBrowseComp

Name: VoiceBrowseComp
Creator: maas
Published: 2026-03-11 22:24:40
License: no description available

ModelScope community · Updated 2026-03-11 · Indexed 2026-05-03

Download link:

https://modelscope.cn/datasets/ylc0411/VoiceBrowseComp


Resource overview:

# Voice-BrowseComp Dataset JSON Format

This document describes the unified format of the JSON annotation files generated for the three Voice-BrowseComp datasets (AudioMarathon, MMAU, MMSU), for use in downstream evaluation.

---

## 1. File structure

Each dataset produces one JSON file whose content is an array of samples:

```
voice_browsecomp_output/
├── GTZAN/
│   ├── GTZAN_voice_browsecomp.json
│   └── audios/
├── DESED/
│   ├── DESED_voice_browsecomp.json
│   └── audios/
├── HAD/ ...
├── LibriSpeech/ ...
├── RACE/ ...
├── SLUE/ ...
├── TAU/ ...
├── VESUS/ ...
├── Vox/ ...
└── Vox_age/ ...

voice_browsecomp_mmau_output/
├── voice_browsecomp_mmau.json
└── audios/

voice_browsecomp_mmsu_output/
├── voice_browsecomp_mmsu.json
└── audios/
```

---

## 2. Unified field format

The JSON annotations of all three datasets share the following core fields:

| Field | Type | Description |
|---|---|---|
| `uniq_id` | `string` | Unique sample identifier |
| `task_name` | `string` | Task name |
| `dataset_source` | `string` | Source dataset (e.g. `"GTZAN"`, `"MMAU"`, `"MMSU"`) |
| `path` | `string` | **Relative path** of the transformed audio (relative to the directory containing the JSON file), e.g. `"audios/xxx.wav"` |
| `original_path` | `string` | **Absolute path** of the original audio |
| `transforms_applied` | `list[object]` | List of applied transforms (full record, see section 3) |
| `num_transforms` | `int` | Number of transforms |
| `question` | `string` | Evaluation question |
| `answer_gt` | `string` | Ground-truth answer |
| `choice_a` | `string` | Choice A |
| `choice_b` | `string` | Choice B |
| `choice_c` | `string` | Choice C |
| `choice_d` | `string` | Choice D |
| `choice_e` | `string` | Choice E |

For multiple-choice questions, some choices may be empty; for short-answer questions, all choices are empty.

---

## 3. `transforms_applied` field details

Each transform is a complete object that records the transform type, its parameters, and the corresponding information for reversing it:

```json
{
  "transform_type": "noise_addition",
  "noise_type": "white",
  "snr_db": -4.98,
  "required_tool": "denoiser",
  "reverse_params": {
    "noise_type": "white",
    "estimated_snr": -4.98
  },
  "transform_name": "white_noise",
  "order": 0
}
```

### Field descriptions

| Field | Type | Description |
|---|---|---|
| `transform_type` | `string` | Transform category (e.g. `noise_addition`, `speed_change`, `reverb`) |
| `transform_name` | `string` | Transform name (e.g. `white_noise`, `speed_change`, `reverb`) |
| `order` | `int` | Order of application (starting from 0) |
| `required_tool` | `string` | Name of the recommended restoration tool |
| `reverse_params` | `object` | Parameters needed to reverse the transform |
| Other fields | varies | Transform-specific parameters (e.g. `snr_db`, `speed_factor`, `gain_db`) |

### Supported transform types

| `transform_name` | `transform_type` | Description | Key parameters |
|---|---|---|---|
| `white_noise` | `noise_addition` | White noise | `snr_db` |
| `colored_noise` | `noise_addition` | Colored noise (pink/brown) | `noise_type`, `snr_db` |
| `hum_noise` | `noise_addition` | Electrical hum | `frequency`, `snr_db` |
| `volume_change` | `volume_change` | Volume change | `gain_db` |
| `speed_change` | `speed_change` | Speed change | `speed_factor` |
| `pitch_shift` | `pitch_shift` | Pitch shift | `semitones` |
| `reverb` | `reverb` | Reverberation | `room_size`, `rt60` |
| `time_stretch` | `time_stretch` | Time stretch | `stretch_factor` |
| `click_noise` | `click_noise` | Click noise | `click_rate`, `intensity` |
| `silence_gaps` | `silence_insertion` | Silence insertion | `num_gaps`, `gap_duration` |
| `low_pass` | `low_pass_filter` | Low-pass filtering | `cutoff_freq` |
| `telephone_effect` | `telephone_effect` | Telephone effect | `low_freq_hz`, `high_freq_hz` |
| `codec_compression` | `codec_compression` | Codec compression | `codec`, `bitrate` |
| `reverse_audio` | `audio_reversal` | Audio reversal | `reverse_type` |
| `repeat_segments` | `segment_repetition` | Segment repetition | `num_repeats` |
| `cross_talk` | `cross_talk` | Cross-talk | none listed |
| `irrelevant_speech` | `irrelevant_speech` | Irrelevant speech | none listed |

---

## 4. Complete sample examples

### AudioMarathon dataset (GTZAN as an example)

```json
{
  "uniq_id": "GTZAN_000000_v00",
  "task_name": "音乐流派分类任务",
  "dataset_source": "GTZAN",
  "path": "audios/GTZAN_000000_v00.wav",
  "original_path": "/data1/.../GTZAN/concatenated_audio/wav/blues/blues_concatenated_01.wav",
  "transforms_applied": [
    {
      "transform_type": "noise_addition",
      "noise_type": "white",
      "snr_db": -4.98,
      "required_tool": "denoiser",
      "reverse_params": {
        "noise_type": "white",
        "estimated_snr": -4.98
      },
      "transform_name": "white_noise",
      "order": 0
    },
    {
      "transform_type": "volume_change",
      "gain_db": -11.96,
      "gain_linear": 0.25,
      "clipped": false,
      "required_tool": "volume_normalizer",
      "reverse_params": {
        "target_gain_db": 11.96
      },
      "transform_name": "volume_change",
      "order": 1
    },
    {
      "transform_type": "click_noise",
      "click_rate": 10.25,
      "num_clicks": 576,
      "intensity": 0.47,
      "required_tool": "declicker",
      "reverse_params": {
        "detect_clicks": true,
        "interpolate": true
      },
      "transform_name": "click_noise",
      "order": 2
    }
  ],
  "num_transforms": 3,
  "question": "What music genre is represented in this audio segment?",
  "answer_gt": "blues",
  "choice_a": "Country - country",
  "choice_b": "Metal - metal",
  "choice_c": "Blues - blues",
  "choice_d": "Classical - classical",
  "choice_e": ""
}
```

### MMAU dataset

```json
{
  "uniq_id": "c93e3644-5227-4710-b27b-5c46750afbff_v00",
  "task_name": "sound",
  "dataset_source": "MMAU",
  "path": "audios/c93e3644-5227-4710-b27b-5c46750afbff_v00_5transforms.wav",
  "original_path": "/data1/.../MMAU-Pro/data/c93e3644-5227-4710-b27b-5c46750afbff.wav",
  "transforms_applied": [
    {
      "transform_type": "noise_addition",
      "noise_type": "white",
      "snr_db": -4.98,
      "required_tool": "denoiser",
      "reverse_params": {
        "noise_type": "white",
        "estimated_snr": -4.98
      },
      "transform_name": "white_noise",
      "order": 0
    }
  ],
  "num_transforms": 5,
  "question": "What is being prepared in the audio?",
  "answer_gt": "Boba tea",
  "choice_a": "Boba tea",
  "choice_b": "Milk",
  "choice_c": "Coffee",
  "choice_d": "Milk tea",
  "choice_e": "Green tea",
  "category": "sound",
  "length_type": "medium",
  "perceptual_skills": ["Acoustic Source Characterization"],
  "reasoning_skills": ["Procedural Reasoning"]
}
```

In this example `num_transforms` is 5 while `transforms_applied` shows a single entry; the list appears to be abbreviated for display.

### MMSU dataset

```json
{
  "uniq_id": "volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf_v00",
  "task_name": "volume_comparison",
  "dataset_source": "MMSU",
  "path": "audios/volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf_v00_5transforms.wav",
  "original_path": "/data1/.../MMSU/audio/volume_comparison_6b58eff0-f0ff-4558-89e9-52ca0ed489bf.wav",
  "transforms_applied": [
    {
      "transform_type": "volume_change",
      "gain_db": -11.53,
      "gain_linear": 0.27,
      "clipped": false,
      "required_tool": "volume_normalizer",
      "reverse_params": {
        "target_gain_db": 11.53
      },
      "transform_name": "volume_change",
      "order": 0
    }
  ],
  "num_transforms": 5,
  "question": "Which volume pattern best matches the audio?",
  "answer_gt": "high-low-medium",
  "choice_a": "high-low-medium",
  "choice_b": "medium-high-low",
  "choice_c": "low-high-medium",
  "choice_d": "low-medium-high",
  "choice_e": "",
  "category": "Perception",
  "sub_category": "Paralinguistics"
}
```

As in the MMAU example, `transforms_applied` appears to be abbreviated here (`num_transforms` is 5).

---

## 5. Dataset-specific fields

In addition to the unified core fields, each dataset may include extra metadata fields:

### MMAU-specific fields

| Field | Type | Description |
|---|---|---|
| `category` | `string` | Audio category (`sound`, `music`, `speech`) |
| `length_type` | `string` | Audio length type (`short`, `medium`, `long`) |
| `perceptual_skills` | `list[string]` | Required perceptual skills |
| `reasoning_skills` | `list[string]` | Required reasoning skills |

> **Note**: MMAU samples may have more than 5 choices (`choice_a` through `choice_j`), because the `choices` array in the original data can contain up to 10 items.

### MMSU-specific fields

| Field | Type | Description |
|---|---|---|
| `category` | `string` | Task category (`Perception`, `Reasoning`, etc.) |
| `sub_category` | `string` | Sub-category (`Paralinguistics`, `Phonetics`, etc.) |

### AudioMarathon (9 Datasets) notes

The `task_name` values are stored in Chinese; an English gloss is given in parentheses. The table lists ten output subsets, with `Vox` and `Vox_age` as separate entries.

| Dataset | `task_name` | `answer_gt` type | Number of choices |
|---|---|---|---|
| GTZAN | `音乐流派分类任务` (music genre classification) | Genre name | 4 |
| DESED | `声音事件检测任务` (sound event detection) | Event class name | 5 |
| HAD | `人声真假检测任务` (real/fake voice detection) | `real` / `fake` | 2 |
| LibriSpeech | `语音识别任务` (speech recognition) | Transcript | 0 (open-ended) |
| RACE | `阅读理解任务` (reading comprehension) | Answer text | 4 |
| SLUE | `语义理解评估任务` (semantic understanding evaluation) | Sentiment label | 3 |
| TAU | `声学场景分类任务` (acoustic scene classification) | Scene name | 5 |
| VESUS | `情感识别任务` (emotion recognition) | Emotion label | 5 |
| Vox | `性别分类任务` (gender classification) | `male` / `female` | 2 |
| Vox_age | `年龄分类任务` (age classification) | Age group | 4 |

---

## 6. Notes

1. **Audio path**: `path` is a relative path (relative to the directory containing the JSON file); join it with that base directory when loading.
2. **Choice format**: For questions without choices (such as LibriSpeech speech recognition), `choice_a` through `choice_e` are all empty strings.
3. **Number of transforms**: Each sample has 3 to 5 transforms applied by default; this can be adjusted with `--min-transforms` and `--max-transforms`.
4. **Multiple variants**: The same original audio can yield several transformed variants, controlled by `--variants-per-sample`; sample IDs are distinguished by suffixes such as `_v00`, `_v01`.
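The path and choice conventions above can be sketched in Python as follows. This is a minimal illustration, not part of the dataset tooling: the function names `load_samples` and `get_choices` and the extra `audio_file` key are hypothetical, and the code assumes only what the format description states (a JSON array of samples, `path` relative to the JSON file, empty strings for unused choices, up to `choice_j` for MMAU).

```python
import json
from pathlib import Path
from string import ascii_lowercase


def load_samples(json_path):
    """Load one annotation file and attach a resolved audio path to each sample."""
    json_path = Path(json_path)
    with json_path.open(encoding="utf-8") as f:
        samples = json.load(f)  # the file content is an array of sample objects
    base_dir = json_path.parent
    for sample in samples:
        # `path` is relative to the directory that contains the JSON file.
        # `audio_file` is a hypothetical convenience key, not a dataset field.
        sample["audio_file"] = base_dir / sample["path"]
    return samples


def get_choices(sample):
    """Return non-empty choices as {letter: text}; empty dict for open-ended items."""
    choices = {}
    for letter in ascii_lowercase[:10]:  # choice_a .. choice_j
        text = sample.get(f"choice_{letter}", "")
        if text:  # skip missing keys and empty strings
            choices[letter.upper()] = text
    return choices
```

For an open-ended sample such as a LibriSpeech transcription item, `get_choices` returns an empty dict, which a caller can use to switch between multiple-choice and free-text scoring.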

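Since every entry of `transforms_applied` carries `order`, `required_tool`, and `reverse_params`, a restoration plan can be derived from a sample directly. The sketch below assumes that transforms should be undone in the opposite order to which they were applied; the format description does not state this, so treat it as an assumption. The function name `restoration_plan` is hypothetical.

```python
def restoration_plan(sample):
    """List (required_tool, reverse_params) pairs, last-applied transform first."""
    # Sort by `order` rather than trusting the list position.
    transforms = sorted(sample["transforms_applied"], key=lambda t: t["order"])
    # Assumption: undo transforms in reverse order of application.
    return [
        (t.get("required_tool"), t.get("reverse_params", {}))
        for t in reversed(transforms)
    ]
```

Applied to the GTZAN sample from section 4, this yields `declicker`, then `volume_normalizer`, then `denoiser`. The length of the plan is not checked against `num_transforms`, since the abbreviated MMAU and MMSU examples show the two can differ in displayed records.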

Provider:

maas

Created:

2026-02-26
