MECAT-QA

Name: MECAT-QA
Creator: maas
Published: 2025-11-29 18:13:22
License: No description provided

ModelScope community. Updated: 2025-11-29. Indexed: 2025-08-09.

Download link:

https://modelscope.cn/datasets/midasheng/MECAT-QA


Description:

# MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

[**📖 Paper**](https://arxiv.org/abs/2507.23511) | [**🛠️ GitHub**](https://github.com/xiaomi-research/mecat) | [**🔊 MECAT-Caption Dataset**](https://huggingface.co/datasets/mispeech/MECAT-Caption) | [**🔊 MECAT-QA Dataset**](https://huggingface.co/datasets/mispeech/MECAT-QA)

## Dataset Description

MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on **large-scale data** to evaluate machine understanding of audio content through two core tasks:

- **Audio Captioning**: Generating textual descriptions for given audio
- **Audio Question Answering**: Answering questions about given audio

![image](framework.png)

## Dataset Structure

### Audio Captioning Dataset (MECAT-Caption)

The captioning dataset contains audio clips paired with high-quality textual descriptions across multiple subtasks:

- **Systematic Captioning**: Long-form (1-2 sentences) and short-form (≤15 words) descriptions
- **Content-Specific Captioning**: Speech, music, and sound-focused descriptions
- **Environment Captioning**: Acoustic characteristics and environmental context

### Audio Question Answering Dataset (MECAT-QA)

The QA dataset features audio clips with associated questions spanning multiple difficulty levels and reasoning types:

- **Perception**: Direct sound type identification
- **Analysis**: Sound characteristics and quality assessment
- **Reasoning**: Environmental reasoning, inference, and application context

## Data Distribution

| Data Code | Description | Caption (Train/Test) | QA (Train/Test) |
|-----------|-------------|---------------------|-----------------|
| **000** | Silence | 173 / 179 | 865 / 895 |
| **00A** | General sound (excluding speech and music) | 837 / 848 | 4,185 / 4,240 |
| **0M0** | Music | 2,593 / 2,593 | 12,965 / 12,965 |
| **0MA** | Music and general sound | 206 / 199 | 1,030 / 995 |
| **S00** | Speech | 7,839 / 7,839 | 39,195 / 39,195 |
| **S0A** | Speech and general sound | 2,424 / 2,439 | 12,120 / 12,195 |
| **SM0** | Speech and music | 5,312 / 5,312 | 26,560 / 26,560 |
| **SMA** | Speech, music and general sound | 668 / 643 | 3,340 / 3,215 |

**Total**: ~20K caption pairs, ~100K QA pairs

## Task Categories

### Audio Captioning Subtasks

| Type | Subtask | Category | Level | Description |
|------|---------|----------|-------|-------------|
| **Systematic** | Short | - | 🔵 Specialized | Simplified caption within 15 words |
| **Systematic** | Long | - | 🔵 Specialized | Caption using 1-2 sentences |
| **Content-Specific** | Speech | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption speech content |
| **Content-Specific** | Music | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption music content |
| **Content-Specific** | Sound | Clear/Mixed | 🟢 Basic / 🔴 Complex | Caption general sounds |
| **Content-Unrelated** | Environment | - | 🔵 Specialized | Acoustic characteristics and environment |

### Audio Question Answering Subtasks

| Type | Subtask | Level | Description |
|------|---------|-------|-------------|
| **Perception** | Direct_Perception | 🟢🟡 | Perceive sound types |
| **Analysis** | Sound_Characteristics | 🟢🟡🟠🔴 | Analyze sound characteristics |
| **Analysis** | Quality_Assessment | 🟢🟡🟠🔴 | Analyze sound quality |
| **Reasoning** | Environment_Reasoning | 🟢🟡🟠🔴 | Reason about the acoustic environment |
| **Reasoning** | Inference_Judgment | 🟢🟡🟠🔴 | Cross-modal reasoning |
| **Reasoning** | Application_Context | 🟢🟡🟠🔴 | Semantic understanding |

#### Difficulty Levels

- 🟢 **Basic** (25%): Direct descriptive questions
- 🟡 **Intermediate** (35%): Analytical questions
- 🟠 **Advanced** (25%): Inferential questions
- 🔴 **Complex** (15%): Comprehensive judgment questions

## Usage

### Loading the Datasets

```python
from datasets import load_dataset

# Load Caption dataset
caption_data = load_dataset('mispeech/MECAT-Caption', split='test')
print(f"Caption dataset: {len(caption_data)} samples")

# Load QA dataset
qa_data = load_dataset('mispeech/MECAT-QA', split='test')
print(f"QA dataset: {len(qa_data)} samples")
```

### Data Format

#### Caption Dataset

```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'long': 'Long-form caption text',
        'short': 'Short caption',
        'speech': 'Speech-focused caption',
        'music': 'Music-focused caption',
        'sound': 'Sound-focused caption',
        'environment': 'Environment description'
    }
}
```

#### QA Dataset

```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'question': 'Question about the audio',
        'answer': 'Ground truth answer',
        'category': 'direct_perception|sound_characteristics|...',
        'level': 'basic|intermediate|advanced|complex'
    }
}
```

### Evaluation

For detailed evaluation methods and comprehensive evaluation results, please refer to our [GitHub repository](https://github.com/xiaomi-research/mecat). The repository includes:

- **Evaluation Framework**: Complete evaluation scripts and metrics for both captioning and QA tasks
- **Baseline Results**: Performance benchmarks from various state-of-the-art audio understanding models
- **Evaluation Metrics**: Detailed explanations of evaluation criteria and scoring methods
- **Result Analysis**: Comprehensive analysis of model performance across different audio categories and difficulty levels

## Citation

```bibtex
@article{mecat2025,
  title={MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks},
  author={Niu, Yadong and Wang, Tianzi and Dinkel, Heinrich and Sun, Xingwei and Zhou, Jiahao and Li, Gang and Liu, Jizhong and Liu, Xunying and Zhang, Junbo and Luan, Jian},
  journal={arXiv preprint arXiv:2507.23511},
  year={2025}
}
```

## License

This dataset is released under the **Creative Commons Attribution License 3.0 (CC BY-3.0)**.

## Contact

For questions about the dataset or benchmark, please open an issue on the [GitHub repository](https://github.com/xiaomi-research/mecat).
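
### Example: Decoding Data Codes

The three-character data codes in the Data Distribution table read like positional flags (speech, music, general sound), and every QA count in that table is five times the matching caption count. The sketch below works under those two assumptions, which are inferred from the table rather than stated by the authors. The names `decode_data_code`, `split_totals`, and `QA_PER_CLIP` are hypothetical helpers introduced for illustration and are not part of the MECAT repository.

```python
# Per-code caption counts copied from the Data Distribution table:
# code -> (train, test).
CAPTION_COUNTS = {
    "000": (173, 179),
    "00A": (837, 848),
    "0M0": (2593, 2593),
    "0MA": (206, 199),
    "S00": (7839, 7839),
    "S0A": (2424, 2439),
    "SM0": (5312, 5312),
    "SMA": (668, 643),
}

# Assumed reading of a code: position 0 = speech (S), 1 = music (M),
# 2 = general sound (A); "0" marks an absent component.
COMPONENTS = (("S", "speech"), ("M", "music"), ("A", "general sound"))

# Inferred from the table: each QA count equals 5x the caption count.
QA_PER_CLIP = 5


def decode_data_code(code: str) -> list[str]:
    """Return the audio components flagged by a three-character data code."""
    if len(code) != len(COMPONENTS):
        raise ValueError(f"expected a 3-character code, got {code!r}")
    present = []
    for char, (flag, name) in zip(code, COMPONENTS):
        if char == flag:
            present.append(name)
        elif char != "0":
            raise ValueError(f"unexpected character {char!r} in {code!r}")
    # A code with no component flagged ("000") is listed as silence.
    return present or ["silence"]


def split_totals(counts: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the train and test columns over all data codes."""
    train = sum(pair[0] for pair in counts.values())
    test = sum(pair[1] for pair in counts.values())
    return train, test


caption_train, caption_test = split_totals(CAPTION_COUNTS)
qa_train = caption_train * QA_PER_CLIP
qa_test = caption_test * QA_PER_CLIP

for code in CAPTION_COUNTS:
    print(code, decode_data_code(code))
print(f"caption train/test: {caption_train} / {caption_test}")
print(f"QA train/test: {qa_train} / {qa_test}")
```

Summing the table this way gives 20,052 caption pairs and 100,260 QA pairs in each split, which matches the stated totals of ~20K and ~100K if those figures are read per split.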


Provider:

maas

Created:

2025-08-08
