MECAT-Caption

Name: MECAT-Caption
Creator: maas
Published: 2026-01-09 03:22:41
License: No description provided

ModelScope community · Updated 2026-01-09 · Indexed 2025-08-09

Download link:

https://modelscope.cn/datasets/midasheng/MECAT-Caption


Description:

# MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

[**📖 Paper**](https://arxiv.org/abs/2507.23511) | [**🛠️ GitHub**](https://github.com/xiaomi-research/mecat) | [**🔊 MECAT-Caption Dataset**](https://huggingface.co/datasets/mispeech/MECAT-Caption) | [**🔊 MECAT-QA Dataset**](https://huggingface.co/datasets/mispeech/MECAT-QA)

## Dataset Description

MECAT (Multi-Expert Chain for Audio Tasks) is a comprehensive benchmark constructed on **large-scale data** to evaluate machine understanding of audio content through two core tasks:

- **Audio Captioning**: Generating textual descriptions for given audio
- **Audio Question Answering**: Answering questions about given audio

Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation), which penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability.
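To make the idea of combining single-sample similarity with cross-sample discriminability concrete, here is a minimal toy sketch. It is not the MECAT implementation: the function name `toy_date_score`, the rank-based discriminability term, and the harmonic-mean combination are all assumptions chosen for illustration, and the similarity matrix is assumed to come from some text-embedding model that is not shown. See the GitHub repository for the actual metric.

```python
import numpy as np


def toy_date_score(sim: np.ndarray) -> float:
    """Toy DATE-style score from a square similarity matrix (illustrative only).

    sim[i, j] is the similarity between the caption generated for clip i
    and the reference caption of clip j, assumed to lie in [0, 1].
    """
    n = sim.shape[0]
    matched = np.diag(sim)  # similarity of each caption to its own reference

    # Single-sample term: mean similarity to the correct reference.
    similarity = float(matched.mean())

    # Cross-sample term (assumed form): share of *other* references that score
    # strictly lower than the correct one, averaged over clips.
    at_least_as_good = (sim >= matched[:, None]).sum(axis=1) - 1  # drop self
    if n > 1:
        discriminability = float((1.0 - at_least_as_good / (n - 1)).mean())
    else:
        discriminability = 1.0

    # Harmonic mean (assumed combination): both terms must be high.
    total = similarity + discriminability
    return 0.0 if total == 0 else 2 * similarity * discriminability / total


# A generic caption matches every reference equally well, so it cannot
# tell clips apart; a specific caption matches only its own reference.
generic = np.full((3, 3), 0.6)
specific = np.array([[0.9, 0.1], [0.2, 0.8]])
print(toy_date_score(generic))
print(toy_date_score(specific))
```

In this toy setup the generic captions receive a low score despite a moderate similarity to every reference, which is the behavior the DATE description aims for.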
![MECAT Framework](framework.png)

## Features

- **Data Source**: Diverse scenarios drawn from a subset of the ACAV100M dataset
- **Processing Pipeline**:
  - **MetaInfo**: Source video metadata extraction (titles/descriptions)
  - **Content-Specific**: Content-specific feature extraction using 10-20 dedicated models (speech/music/general audio)
  - **Content-Unrelated**: Non-content audio analysis: quality metrics, loudness measurements, reverberation assessment
  - **Understanding & Generation**: LLM-powered comprehension and generation with Chain-of-Thought
- **Quality Control**: Multi-stage verification framework
- **Evaluation System**: Multi-perspective assessment with progressive difficulty levels

## Dataset Structure

### Audio Captioning Dataset (MECAT-Caption)

The captioning dataset contains audio clips paired with high-quality textual descriptions across multiple subtasks:

- **Systematic Captioning**: Long-form (1-2 sentences) and short-form (≤15 words) descriptions
- **Content-Specific Captioning**: Speech, music, and sound-focused descriptions
- **Environment Captioning**: Acoustic characteristics and environmental context

### Audio Question Answering Dataset (MECAT-QA)

The QA dataset features audio clips with associated questions spanning multiple difficulty levels and reasoning types:

- **Perception**: Direct sound type identification
- **Analysis**: Sound characteristics and quality assessment
- **Reasoning**: Environmental reasoning, inference, and application context

## Data Distribution

| Data Code | Description | Caption (Train/Test) | QA (Train/Test) |
|---|---|---|---|
| **000** | Silence | 173 / 179 | 865 / 895 |
| **00A** | General sound (excluding speech and music) | 837 / 848 | 4,185 / 4,240 |
| **0M0** | Music | 2,593 / 2,593 | 12,965 / 12,965 |
| **0MA** | Music and general sound | 206 / 199 | 1,030 / 995 |
| **S00** | Speech | 7,839 / 7,839 | 39,195 / 39,195 |
| **S0A** | Speech and general sound | 2,424 / 2,439 | 12,120 / 12,195 |
| **SM0** | Speech and music | 5,312 / 5,312 | 26,560 / 26,560 |
| **SMA** | Speech, music and general sound | 668 / 643 | 3,340 / 3,215 |

**Total**: ~20K caption pairs and ~100K QA pairs per split (the rows above sum to 20,052 caption pairs and 100,260 QA pairs for both train and test, i.e. five QA pairs per captioned clip).

## Task Categories

### Audio Captioning Subtasks

| Type | Subtask | Category | Level | Description |
|---|---|---|---|---|
| **Systematic** | Short | - | 🔵 Specialized | Simplified caption within 15 words |
| **Systematic** | Long | - | 🔵 Specialized | Caption using 1-2 sentences |
| **Content-Specific** | Speech | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption speech content |
| **Content-Specific** | Music | Clean/Mixed | 🟢 Basic / 🔴 Complex | Caption music content |
| **Content-Specific** | Sound | Clear/Mixed | 🟢 Basic / 🔴 Complex | Caption general sounds |
| **Content-Unrelated** | Environment | - | 🔵 Specialized | Acoustic characteristics and environment |

### Audio Question Answering Subtasks

| Type | Subtask | Level | Description |
|---|---|---|---|
| **Perception** | Direct_Perception | 🟢🟡 | Perceive sound types |
| **Analysis** | Sound_Characteristics | 🟢🟡🟠🔴 | Analyze sound characteristics |
| **Analysis** | Quality_Assessment | 🟢🟡🟠🔴 | Analyze sound quality |
| **Reasoning** | Environment_Reasoning | 🟢🟡🟠🔴 | Reason about the acoustic environment |
| **Reasoning** | Inference_Judgment | 🟢🟡🟠🔴 | Cross-modal reasoning |
| **Reasoning** | Application_Context | 🟢🟡🟠🔴 | Semantic understanding |

#### Difficulty Levels

- 🟢 **Basic** (25%): Direct descriptive questions
- 🟡 **Intermediate** (35%): Analytical questions
- 🟠 **Advanced** (25%): Inferential questions
- 🔴 **Complex** (15%): Comprehensive judgment questions

## Usage

### Loading the Datasets

```python
from datasets import load_dataset

# Load Caption dataset
caption_data = load_dataset('mispeech/MECAT-Caption', split='test')
print(f"Caption dataset: {len(caption_data)} samples")

# Load QA dataset
qa_data = load_dataset('mispeech/MECAT-QA', split='test')
print(f"QA dataset: {len(qa_data)} samples")
```

### Data Format

#### Caption Dataset

```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'long': 'Long-form caption text',
        'short': 'Short caption',
        'speech': 'Speech-focused caption',
        'music': 'Music-focused caption',
        'sound': 'Sound-focused caption',
        'environment': 'Environment description'
    }
}
```

#### QA Dataset

```python
{
    '__key__': 'unique_audio_id',
    'flac': {
        'array': numpy.array,  # Audio waveform
        'sampling_rate': 16000
    },
    'json': {
        'question': 'Question about the audio',
        'answer': 'Ground truth answer',
        'category': 'direct_perception|sound_characteristics|...',
        'level': 'basic|intermediate|advanced|complex'
    }
}
```

### Evaluation

For detailed evaluation methods and comprehensive evaluation results, please refer to our [GitHub repository](https://github.com/xiaomi-research/mecat). The repository includes:

- **Evaluation Framework**: Complete evaluation scripts and metrics for both captioning and QA tasks
- **Baseline Results**: Performance benchmarks from various state-of-the-art audio understanding models
- **Evaluation Metrics**: Detailed explanations of evaluation criteria and scoring methods
- **Result Analysis**: Comprehensive analysis of model performance across different audio categories and difficulty levels

## Evaluation Metrics

MECAT supports multiple evaluation metrics for comprehensive assessment:

- **Traditional Metrics**: BLEU
- **FENSE**: Fluency ENhanced Sentence-bert Evaluation for audio captioning
- **DATE**: Discriminative-Enhanced Audio Text Evaluation

DATE is particularly effective for audio captioning and question-answering tasks as it considers both the quality of generated text and the model's discriminative capabilities.
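Returning to the sample layout documented under Data Format, the sketch below shows one way to pull the caption fields and clip duration out of a caption sample and to tally QA samples by difficulty level. The helper names `summarize_caption_sample` and `count_qa_levels` are hypothetical, and the hand-built samples stand in for real records so that the example runs without downloading the datasets. It assumes the fields appear exactly as documented above.

```python
from collections import Counter

# Caption fields as listed in the documented 'json' payload.
CAPTION_FIELDS = ("long", "short", "speech", "music", "sound", "environment")


def summarize_caption_sample(sample: dict) -> dict:
    """Return the key, duration in seconds, and captions of one sample."""
    audio = sample["flac"]
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    captions = {name: sample["json"].get(name) for name in CAPTION_FIELDS}
    return {"key": sample["__key__"], "duration_s": duration_s, "captions": captions}


def count_qa_levels(qa_samples) -> Counter:
    """Count QA samples per difficulty level."""
    return Counter(sample["json"]["level"] for sample in qa_samples)


# Synthetic stand-ins that follow the documented layout (not real data).
fake_caption = {
    "__key__": "demo_clip",
    "flac": {"array": [0.0] * 32000, "sampling_rate": 16000},
    "json": {"long": "A placeholder long caption.", "short": "Placeholder."},
}
fake_qa = [
    {"json": {"question": "q1", "answer": "a1",
              "category": "direct_perception", "level": "basic"}},
    {"json": {"question": "q2", "answer": "a2",
              "category": "quality_assessment", "level": "complex"}},
    {"json": {"question": "q3", "answer": "a3",
              "category": "direct_perception", "level": "basic"}},
]

summary = summarize_caption_sample(fake_caption)
levels = count_qa_levels(fake_qa)
print(summary["key"], summary["duration_s"], summary["captions"]["short"])
print(dict(levels))
```

With real data, the same helpers could be applied to the items of `caption_data` and `qa_data` from the loading example, provided the loaded records match the documented layout.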
## Results

### Audio-Captioning Task

#### DATE

| Model Type | Model Name | Systematic long | Systematic short | Speech-Focused pure | Speech-Focused mixed | Music-Focused pure | Music-Focused mixed | Sound-Focused pure | Sound-Focused mixed | Content-Unrelated environment | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caption-Only | enclap | 48.6 | 53.1 | 30.2 | 31.8 | 17.9 | 15.9 | 48.8 | 15.2 | 6.8 | 33.3 |
| Caption-Only | pengi | 43.5 | 46.8 | 27.2 | 29.5 | 29.3 | 13.1 | 42.8 | 14.6 | 7.1 | 30.6 |
| LALM | audio-flamingo | 48.6 | 49.7 | 30.5 | 34.3 | 28.8 | 25.6 | 41.2 | 18.5 | 17.5 | 35.6 |
| LALM | kimi-audio | 49.5 | 54.2 | 30.0 | 31.3 | 27.7 | 16.9 | 43.1 | 16.2 | 7.0 | 34.3 |
| LALM | omni3b | 56.4 | 55.2 | 42.5 | 41.3 | 46.6 | 29.7 | 52.9 | 23.9 | 19.4 | 42.6 |
| LALM | omni7b | 61.1 | 56.5 | 39.9 | 40.9 | 32.1 | 30.9 | 50.7 | 23.8 | 17.9 | 43.0 |

#### FENSE

| Model Type | Model Name | Systematic long | Systematic short | Speech-Focused pure | Speech-Focused mixed | Music-Focused pure | Music-Focused mixed | Sound-Focused pure | Sound-Focused mixed | Content-Unrelated environment | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Caption-Only | enclap-both | 40.5 | 45.0 | 28.7 | 29.5 | 39.3 | 15.0 | 41.2 | 17.3 | 17.9 | 31.6 |
| Caption-Only | pengi | 37.5 | 41.0 | 26.6 | 29.2 | 39.6 | 11.8 | 35.4 | 16.2 | 17.8 | 29.5 |
| LLM-Based | audio-flamingo2 | 43.8 | 43.3 | 28.5 | 33.7 | 43.1 | 30.3 | 41.0 | 24.7 | 45.4 | 39.4 |
| LLM-Based | kimi-audio | 40.8 | 45.7 | 25.6 | 27.1 | 39.5 | 16.2 | 35.8 | 19.4 | 16.7 | 30.8 |
| LLM-Based | qwen2.5-omni3b | 48.3 | 45.3 | 37.3 | 37.5 | 50.7 | 34.7 | 46.6 | 34.1 | 47.8 | 44.1 |
| LLM-Based | qwen2.5-omni7b | 52.7 | 46.2 | 35.3 | 37.5 | 39.2 | 33.1 | 45.2 | 32.1 | 41.0 | 43.4 |

### Audio-Question-Answering

#### DATE

| Model Type | Model Name | Perception direct perception | Analysis sound characteristics | Analysis quality assessment | Reasoning environment reasoning | Reasoning inference judgment | Reasoning application context | Overall |
|---|---|---|---|---|---|---|---|---|
| LLM-Based | audio-flamingo2 | 45.1 | 46.3 | 34.9 | 37.5 | 44.0 | 42.4 | 41.7 |
| LLM-Based | kimi-audio | 45.6 | 39.2 | 18.7 | 34.6 | 48.9 | 41.2 | 38.0 |
| LLM-Based | qwen2.5-omni3b | 55.7 | 53.2 | 38.6 | 41.1 | 51.8 | 50.8 | 48.5 |
| LLM-Based | qwen2.5-omni7b | 57.8 | 52.9 | 39.1 | 44.0 | 53.2 | 50.8 | 49.6 |

#### FENSE

| Model Type | Model Name | Perception direct perception | Analysis sound characteristics | Analysis quality assessment | Reasoning environment reasoning | Reasoning inference judgment | Reasoning application context | Overall |
|---|---|---|---|---|---|---|---|---|
| LALM | audio-flamingo2 | 39.1 | 39.0 | 37.4 | 41.3 | 35.5 | 35.8 | 38.0 |
| LALM | kimi-audio | 37.5 | 32.5 | 19.2 | 37.5 | 38.8 | 33.8 | 33.2 |
| LALM | qwen2.5-omni3b | 47.2 | 43.8 | 39.7 | 43.2 | 41.0 | 41.9 | 42.8 |
| LALM | qwen2.5-omni7b | 49.7 | 43.8 | 40.5 | 44.1 | 42.5 | 41.9 | 43.7 |

## Citation

```bibtex
@article{mecat2025,
  title={MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks},
  author={Niu, Yadong and Wang, Tianzi and Dinkel, Heinrich and Sun, Xingwei and Zhou, Jiahao and Li, Gang and Liu, Jizhong and Liu, Xunying and Zhang, Junbo and Luan, Jian},
  journal={arXiv preprint arXiv:2507.23511},
  year={2025}
}
```

## License

This dataset is released under the **Creative Commons Attribution 3.0 (CC BY 3.0) license**. The associated code is licensed under the **Apache License 2.0**.

## Contact

For questions about the dataset or benchmark, please open an issue on the [GitHub repository](https://github.com/xiaomi-research/mecat).

Provider: maas

Created: 2025-08-08
