MECAT-QA
Source: ModelScope community (魔搭社区). Updated 2025-11-29; indexed 2025-08-09.
Download link: https://modelscope.cn/datasets/midasheng/MECAT-QA
# MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks
[**📖 Paper**](https://arxiv.org/abs/2507.23511) | [**🛠️ GitHub**](https://github.com/xiaomi-research/mecat) | [**🔊 MECAT-Caption Dataset**](https://huggingface.co/datasets/mispeech/MECAT-Caption) | [**🔊 MECAT-QA Dataset**](https://huggingface.co/datasets/mispeech/MECAT-QA)
## Dataset Description
MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on **large-scale data** to evaluate machine understanding of audio content through two core tasks:
- **Audio Captioning**: Generating textual descriptions for given audio
- **Audio Question Answering**: Answering questions about given audio

## Dataset Structure
### Audio Captioning Dataset (MECAT-Caption)
The captioning dataset contains audio clips paired with high-quality textual descriptions across multiple subtasks:
- **Systematic Captioning**: Long-form (1-2 sentences) and short-form (≤15 words) descriptions
- **Content-Specific Captioning**: Speech, music, and sound-focused descriptions
- **Environment Captioning**: Acoustic characteristics and environmental context
### Audio Question Answering Dataset (MECAT-QA)
The QA dataset features audio clips with associated questions spanning multiple difficulty levels and reasoning types:
- **Perception**: Direct sound type identification
- **Analysis**: Sound characteristics and quality assessment
- **Reasoning**: Environmental reasoning, inference, and application context
## Data Distribution
| Data Code | Description | Caption (Train/Test) | QA (Train/Test) |
|-----------|-------------|---------------------|-----------------|
| **000** | Silence | 173 / 179 | 865 / 895 |
| **00A** | General sound (excluding speech and music) | 837 / 848 | 4,185 / 4,240 |
| **0M0** | Music | 2,593 / 2,593 | 12,965 / 12,965 |
| **0MA** | Music and general sound | 206 / 199 | 1,030 / 995 |
| **S00** | Speech | 7,839 / 7,839 | 39,195 / 39,195 |
| **S0A** | Speech and general sound | 2,424 / 2,439 | 12,120 / 12,195 |
| **SM0** | Speech and music | 5,312 / 5,312 | 26,560 / 26,560 |
| **SMA** | Speech, music and general sound | 668 / 643 | 3,340 / 3,215 |

**Total** (per split, summing the columns above): 20,052 caption pairs (~20K) and 100,260 QA pairs (~100K)
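
The three-character data codes in the table appear to work as presence flags: the first position marks speech (`S`), the second music (`M`), and the third general sound (`A`), with `0` meaning the component is absent. The sketch below decodes a code under that reading and sums the caption counts from the table. `decode_data_code`, `FLAGS`, and `CAPTION_COUNTS` are names chosen for this illustration and are not part of the dataset's tooling.

```python
# Flag letters by position, read off the Data Distribution table.
FLAGS = (("S", "speech"), ("M", "music"), ("A", "general sound"))


def decode_data_code(code: str) -> list[str]:
    """Return the audio components present in a code such as 'S0A'."""
    if len(code) != 3:
        raise ValueError(f"expected a 3-character code, got {code!r}")
    present = []
    for char, (letter, name) in zip(code, FLAGS):
        if char == letter:
            present.append(name)
        elif char != "0":
            raise ValueError(f"unexpected character {char!r} in {code!r}")
    return present  # an empty list corresponds to silence ('000')


# Caption (train, test) counts copied from the table above.
CAPTION_COUNTS = {
    "000": (173, 179), "00A": (837, 848), "0M0": (2593, 2593),
    "0MA": (206, 199), "S00": (7839, 7839), "S0A": (2424, 2439),
    "SM0": (5312, 5312), "SMA": (668, 643),
}

train_total = sum(train for train, _ in CAPTION_COUNTS.values())
test_total = sum(test for _, test in CAPTION_COUNTS.values())

print(decode_data_code("S0A"))  # ['speech', 'general sound']
print(train_total, test_total)  # 20052 20052
```

Every QA count in the table is exactly five times the matching caption count, which suggests five questions per audio clip.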
## Task Categories
### Audio Captioning Subtasks
| Type | Subtask | Category | Level | Description |
|------|---------|----------|-------|-------------|
| **Systematic** | Short | - | 🔵 Specialized | Simplified caption within 15 words |
| **Systematic** | Long | - | 🔵 Specialized | Caption using 1-2 sentences |
| **Content-Specific** | Speech | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption speech content |
| **Content-Specific** | Music | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption music content |
| **Content-Specific** | Sound | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption general sounds |
| **Content-Unrelated** | Environment | - | 🔵 Specialized | Acoustic characteristics and environment |
### Audio Question Answering Subtasks
| Type | Subtask | Level | Description |
|------|---------|-------|-------------|
| **Perception** | Direct_Perception | 🟢🟡 | Perceive sound types |
| **Analysis** | Sound_Characteristics | 🟢🟡🟠🔴 | Analyze sound characteristics |
| **Analysis** | Quality_Assessment | 🟢🟡🟠🔴 | Analyze sound quality |
| **Reasoning** | Environment_Reasoning | 🟢🟡🟠🔴 | Reason about the acoustic environment |
| **Reasoning** | Inference_Judgment | 🟢🟡🟠🔴 | Cross-modal reasoning |
| **Reasoning** | Application_Context | 🟢🟡🟠🔴 | Semantic understanding |
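
When aggregating results, it can help to roll the six subtasks up into their three types. The mapping below is copied from the table; the helper `task_type` and the case-insensitive lookup are assumptions made for this sketch, on the guess that stored category strings are lowercase forms of the subtask names.

```python
# Subtask -> type, copied from the Audio Question Answering Subtasks table.
SUBTASK_TYPE = {
    "Direct_Perception": "Perception",
    "Sound_Characteristics": "Analysis",
    "Quality_Assessment": "Analysis",
    "Environment_Reasoning": "Reasoning",
    "Inference_Judgment": "Reasoning",
    "Application_Context": "Reasoning",
}

# Case-insensitive lookup table, so 'direct_perception' also resolves.
_LOOKUP = {name.lower(): kind for name, kind in SUBTASK_TYPE.items()}


def task_type(subtask: str) -> str:
    """Map a subtask name to Perception, Analysis, or Reasoning."""
    try:
        return _LOOKUP[subtask.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown subtask: {subtask!r}") from None


print(task_type("direct_perception"))   # Perception
print(task_type("Quality_Assessment"))  # Analysis
```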
#### Difficulty Levels
- 🟢 **Basic** (25%): Direct descriptive questions
- 🟡 **Intermediate** (35%): Analytical questions
- 🟠 **Advanced** (25%): Inferential questions
- 🔴 **Complex** (15%): Comprehensive judgment questions
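
To check how a set of questions compares with these proportions, one can count the `level` labels and compare the observed shares with the listed percentages. This is a minimal sketch on synthetic labels: `level_shares` and `TARGET_SHARE` are illustrative names, and the source does not say whether the percentages are design targets or measured shares.

```python
from collections import Counter

# Shares as listed above, keyed by the lowercase level names.
TARGET_SHARE = {"basic": 0.25, "intermediate": 0.35, "advanced": 0.25, "complex": 0.15}


def level_shares(levels):
    """Return the observed share of each difficulty level in `levels`."""
    levels = list(levels)
    if not levels:
        raise ValueError("no levels given")
    counts = Counter(levels)
    return {name: counts.get(name, 0) / len(levels) for name in TARGET_SHARE}


# Synthetic example: 20 labels chosen to match the stated proportions.
sample = ["basic"] * 5 + ["intermediate"] * 7 + ["advanced"] * 5 + ["complex"] * 3
observed = level_shares(sample)
for name, target in TARGET_SHARE.items():
    print(f"{name:<12} target={target:.2f} observed={observed[name]:.2f}")
```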
## Usage
### Loading the Datasets
```python
from datasets import load_dataset
# Load Caption dataset
caption_data = load_dataset('mispeech/MECAT-Caption', split='test')
print(f"Caption dataset: {len(caption_data)} samples")
# Load QA dataset
qa_data = load_dataset('mispeech/MECAT-QA', split='test')
print(f"QA dataset: {len(qa_data)} samples")
```
### Data Format
#### Caption Dataset
```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'long': 'Long-form caption text',
        'short': 'Short caption',
        'speech': 'Speech-focused caption',
        'music': 'Music-focused caption',
        'sound': 'Sound-focused caption',
        'environment': 'Environment description'
    }
}
```
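
Given a record in this layout, the clip duration follows from the waveform length and the sampling rate, and the short caption can be checked against the 15-word limit. The sketch below assumes a one-dimensional (mono) waveform and counts words by whitespace splitting; the helper names and the example record are made up for illustration.

```python
def clip_duration_seconds(record: dict) -> float:
    """Duration of the clip: number of samples divided by the sampling rate."""
    audio = record["flac"]
    return len(audio["array"]) / audio["sampling_rate"]


def short_caption_ok(record: dict, max_words: int = 15) -> bool:
    """Check the short caption against the word limit (whitespace-split words)."""
    return len(record["json"]["short"].split()) <= max_words


# Made-up record in the documented layout; a plain list stands in for the waveform.
example = {
    "__key__": "example_id",
    "flac": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "json": {"short": "A dog barks twice in a quiet room"},
}

print(clip_duration_seconds(example))  # 2.0
print(short_caption_ok(example))       # True
```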
#### QA Dataset
```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'question': 'Question about the audio',
        'answer': 'Ground truth answer',
        'category': 'direct_perception|sound_characteristics|...',
        'level': 'basic|intermediate|advanced|complex'
    }
}
```
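
Because each QA record carries `category` and `level` fields, a subset of the benchmark can be selected with a simple filter. The generator below is a sketch over plain dictionaries in the documented layout; `select_qa` is a hypothetical helper, and the two records are invented for the example.

```python
def select_qa(records, category=None, level=None):
    """Yield QA records whose 'category' and 'level' match the given filters."""
    for record in records:
        meta = record["json"]
        if category is not None and meta["category"] != category:
            continue
        if level is not None and meta["level"] != level:
            continue
        yield record


# Made-up records in the documented layout (audio omitted for brevity).
records = [
    {"__key__": "a", "json": {"question": "What sound is heard?",
                              "answer": "A dog barking.",
                              "category": "direct_perception", "level": "basic"}},
    {"__key__": "b", "json": {"question": "How clear is the recording?",
                              "answer": "Slightly noisy.",
                              "category": "sound_characteristics", "level": "intermediate"}},
]

basic = list(select_qa(records, level="basic"))
print([r["__key__"] for r in basic])  # ['a']
```

Any iterable of records in this layout can be passed as `records`, provided its rows expose the same `json` fields.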
### Evaluation
For detailed evaluation methods and comprehensive evaluation results, please refer to our [GitHub repository](https://github.com/xiaomi-research/mecat). The repository includes:
- **Evaluation Framework**: Complete evaluation scripts and metrics for both captioning and QA tasks
- **Baseline Results**: Performance benchmarks from various state-of-the-art audio understanding models
- **Evaluation Metrics**: Detailed explanations of evaluation criteria and scoring methods
- **Result Analysis**: Comprehensive analysis of model performance across different audio categories and difficulty levels
## Citation
```bibtex
@article{mecat2025,
title={MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks},
author={Niu, Yadong and Wang, Tianzi and Dinkel, Heinrich and Sun, Xingwei and Zhou, Jiahao and Li, Gang and Liu, Jizhong and Liu, Xunying and Zhang, Junbo and Luan, Jian},
journal={arXiv preprint arXiv:2507.23511},
year={2025}
}
```
## License
This dataset is released under the **Creative Commons Attribution 3.0 (CC BY 3.0) license**.
## Contact
For questions about the dataset or benchmark, please open an issue on the [GitHub repository](https://github.com/xiaomi-research/mecat).
Provider: maas
Created: 2025-08-08



