genggenggeng/AudioDER
Source: Hugging Face; updated 2026-04-07; indexed 2026-04-12
Download link: https://hf-mirror.com/datasets/genggenggeng/AudioDER
Resource description:
# AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models
**Official repository for the paper _"AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models"_**
## Quick Links
- **Base Dataset**: [AudioDER on Google Drive](https://drive.google.com/file/d/1XJqV_JcjHBoCMEiiruCNrHgD8eVgCcEB/view?usp=drive_link)
- **Filtered Subset**: [AudioDER-Filtered on Google Drive](https://drive.google.com/file/d/1krYbu-kNjuAJaI-fCuxFbR7vHZCQAjEV/view?usp=drive_link)
---
## Table of Contents
- [Overview](#overview)
- [Key Highlights](#key-highlights)
- [Dataset Access](#dataset-access)
- [Dataset Construction Pipeline](#dataset-construction-pipeline)
  - [A. Caption Generation](#a-caption-generation)
  - [B. Question-Answer Generation](#b-question-answer-generation)
  - [C. Chain-of-Thought Generation](#c-chain-of-thought-generation)
  - [D. Filtered Process](#d-filtered-process)
- [Examples](#examples)
- [Data Format](#data-format)
- [Repository Structure](#repository-structure)
- [Experiment Details](#experiment-details)
- [Related Resources](#related-resources)
---
## Overview
Large Audio-Language Models (LALMs) have achieved strong performance across a wide range of audio understanding tasks, but they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of supervision data.
However, existing audio-language datasets often contain substantial redundancy: many samples are highly similar in acoustic content and therefore provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training.
To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication over raw audio datasets to improve corpus diversity. We then leverage **Qwen3-30B** to automatically generate structured annotations, including audio captions, multiple-choice questions, and corresponding chain-of-thought (CoT) rationales.
Based on this pipeline, we construct **AudioDER**, a reasoning-oriented post-training dataset containing approximately **191K** samples spanning **sound**, **speech**, and **music**. Each sample consists of:
- an audio clip,
- a multiple-choice question,
- four answer candidates,
- an audio caption,
- and a CoT rationale.

---
## Key Highlights
- **191K high-quality samples** covering sound, speech, and music
- **Reasoning-oriented supervision** for post-training LALMs
- **Automatic structured annotation** with Qwen3-30B
- **Acoustic similarity-based deduplication** built on CLAP
- **Filtered subset** with additional quality control
- **Multi-field annotations** including caption, question, options, answer, and CoT rationale
---
## Dataset Access
For complete dataset information, statistics, data format, and download instructions, please visit:
### [Google Drive Dataset Repository](https://drive.google.com/drive/folders/1o1TaaXZZ9GD2jzBUb4faG7vXhyKtp44K?usp=drive_link)
> **Note**
> This repository mainly documents the construction approach, prompts, and supporting resources for AudioDER.
---
## Dataset Construction Pipeline
Our dataset construction pipeline consists of four main stages:
1. **Acoustic similarity-based deduplication**
2. **Caption generation**
3. **Question-answer generation**
4. **Chain-of-thought generation and quality filtering**
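The deduplication stage can be illustrated with a minimal sketch. It assumes each clip already has a fixed-length embedding (for example from CLAP, which the highlights mention) and uses a greedy cosine-similarity pass. The greedy strategy, the `deduplicate` helper, and the `threshold` value are illustrative assumptions, not the settings used to build AudioDER.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(embeddings: Sequence[Sequence[float]],
                threshold: float = 0.9) -> List[int]:
    """Greedy pass: keep a clip only if it is not too similar to any kept clip.

    `threshold` is a hypothetical value chosen for illustration.
    Returns the indices of the clips that survive.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": the second clip is a near-duplicate of the first.
    demo = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
    print(deduplicate(demo, threshold=0.95))
```

The pairwise loop is quadratic in the number of kept clips, so a corpus of this size would more plausibly rely on batched matrix products or an approximate nearest-neighbour index; the sketch only shows the selection rule.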
### A. Caption Generation
We use different prompts for different audio types to generate detailed captions.
```text
Sound Audio
Please generate a detailed caption describing the audio scene in the input audio, including background.
Music Audio
Please generate a detailed caption describing the music scene in the input audio, including background.
Speech Audio
Please generate a detailed caption describing the audio scene in the speech audio, including background.
```
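One way to wire these prompts into a captioning script is a simple lookup by audio type. This is a sketch: it assumes the first prompt is the one for general sound clips, and the `caption_prompt` helper and the type keys are hypothetical names.

```python
# Prompt text copied from the block above; the dictionary keys are assumed labels.
CAPTION_PROMPTS = {
    "sound": ("Please generate a detailed caption describing the audio scene "
              "in the input audio, including background."),
    "music": ("Please generate a detailed caption describing the music scene "
              "in the input audio, including background."),
    "speech": ("Please generate a detailed caption describing the audio scene "
               "in the speech audio, including background."),
}


def caption_prompt(audio_type: str) -> str:
    """Return the caption prompt for an audio type, or raise on unknown types."""
    try:
        return CAPTION_PROMPTS[audio_type]
    except KeyError:
        raise ValueError(f"unknown audio type: {audio_type!r}") from None
```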
### B. Question-Answer Generation
We use the following prompt to construct audio-dependent multiple-choice questions:
```text
You are a professional test designer specializing in advanced audio comprehension exams.
Your task is to generate a new multiple-choice question based on the provided original question-answer pair. The new question must assess understanding that can only be gained by carefully listening to the audio.
Source Information:
- ORIGINAL QUESTION: {q_text}
- ORIGINAL ANSWER: {answer}
Requirements:
1. Question Type
Choose exactly one of: "sound", "music", or "speech" — based on the primary audio element the new question targets.
2. New Question
- Must end with a question mark (?).
- Be clear, concise, natural, and easy to understand.
- Test specific content that definitely exists in the ORIGINAL ANSWER.
- Focus exclusively on audio-dependent understanding (details like tone, pronunciation, speaker's exact wording, sound characteristics, music elements, etc.) that cannot be inferred from text alone.
3. Multiple-Choice Options
Create exactly 4 options (1 correct + 3 incorrect).
Each option must:
- Be 1–8 words long.
- Start with a capital letter only; no ending punctuation.
- Have identical word count (to prevent length-based guessing).
- Follow the exact same sentence structure and grammatical pattern.
- Maintain consistent level of detail and vocabulary complexity.
- Appear equally plausible to a partially attentive listener.
4. Correct Answer Design
The correct answer must:
- Require actual listening to the audio to identify.
- Be based strictly on content in the ORIGINAL ANSWER.
- Rephrase the key information using synonyms or parallel expressions (do not reuse exact phrases).
- Extract only the most essential element while omitting secondary details.
5. Distractor Design
Each incorrect option must:
- Represent plausible misunderstandings or shallow listening (e.g., misheard similar sounds/words, partial information, common confusions).
- Focus on the same topic/aspect as the correct answer.
- Match the correct answer in structure, complexity, length, and vocabulary level.
- Be clearly distinct in meaning, but not obviously wrong at first glance.
6. Validation Check
Before outputting, confirm:
- All 4 options are indistinguishable in length, tone, and phrasing.
- The question cannot be answered correctly by a language model or reader without hearing the audio.
- No option stands out due to word choice, grammar, or detail level.
- The pair purely tests audio comprehension, not general logic or reading skills.
Output Format:
Return ONLY a valid JSON object with exactly these keys. No extra text, no explanations, no trailing commas.
{
  "question_type": "sound" or "music" or "speech",
  "question": "Your new question here?",
  "choices": [
    "Option A",
    "Option B",
    "Option C",
    "Option D"
  ],
  "answer": "Correct option here"
}
```
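Because the prompt demands a strict JSON object, the mechanical rules can be checked automatically before a generated item is accepted. The following is a minimal sketch of such a check; the `validate_mcq` helper is hypothetical and covers only the rules that can be tested without the audio (key set, question mark, four options, word counts, ending punctuation, answer membership).

```python
import json
from typing import List

REQUIRED_KEYS = {"question_type", "question", "choices", "answer"}
QUESTION_TYPES = {"sound", "music", "speech"}


def validate_mcq(raw: str) -> List[str]:
    """Return a list of rule violations; an empty list means the item passes."""
    try:
        item = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
        return ["keys must be exactly: " + ", ".join(sorted(REQUIRED_KEYS))]

    errors: List[str] = []
    if item["question_type"] not in QUESTION_TYPES:
        errors.append("question_type must be sound, music, or speech")
    question = item["question"]
    if not (isinstance(question, str) and question.endswith("?")):
        errors.append("question must end with a question mark")

    choices = item["choices"]
    if not (isinstance(choices, list) and len(choices) == 4
            and all(isinstance(c, str) for c in choices)):
        errors.append("choices must be a list of 4 strings")
        return errors

    word_counts = {len(c.split()) for c in choices}
    if len(word_counts) != 1:
        errors.append("all options must share one word count")
    if any(not 1 <= n <= 8 for n in word_counts):
        errors.append("each option must be 1-8 words long")
    if any(c and c[-1] in ".,;:!?" for c in choices):
        errors.append("options must not end with punctuation")
    if item["answer"] not in choices:
        errors.append("answer must be one of the options")
    return errors
```

Rules about plausibility, paraphrasing, and audio dependence cannot be verified this way and still need a model or human reviewer.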
### C. Chain-of-Thought Generation
We further generate a concise reasoning process for each sample:
```text
TASK: Complete the THINKING PROCESS for this audio-based multiple-choice question using clear Chain-of-Thought reasoning.
QUESTION DETAILS:
- Question: ***question_text***
- Question Type: ***question_type***
- Choices: ***multi_choice***
- Answer: ***answer***
INCOMPLETE THINKING PROCESS:
According to the question text, <first_analysis>...</first_analysis>, so the question type is ***question_type***.
I need to firstly analyze the audio content: ***caption***
According to the audio content, <second_analysis>...</second_analysis>, so the correct answer is ***answer***.
COMPLETION REQUIREMENTS:
FIRST THINKING PROCESS (Question Analysis):
- Max 30 words
- Start with lowercase letter
- One continuous paragraph, no breaks
- Be analytical and methodical
- Maintain coherence with context
- Must: identify what the question asks, specify needed audio evidence, link to question type
SECOND THINKING PROCESS (Audio Analysis & Answer Selection):
- Max 30 words
- Start with lowercase letter
- One continuous paragraph, no breaks
- Maintain coherence with context
- Use "quotes" only for choice options, never for audio
- Must: highlight key evidence from audio, explain how it supports the correct answer, briefly show why other options don't fit, connect evidence to answer
OUTPUT FORMAT:
- Place first process in <first_analysis> </first_analysis>
- Place second process in <second_analysis> </second_analysis>
- Output nothing else.
```
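The model returns only the two tagged spans, so the full rationale has to be assembled by inserting them back into the template. A minimal sketch of that step is shown below. The `extract_tag` and `assemble_cot` helpers are hypothetical, and the quoting of the type, caption, and answer follows the published examples rather than a documented rule.

```python
import re

# Template mirrors the INCOMPLETE THINKING PROCESS above; the quotation marks
# around the filled-in values are an assumption based on the examples.
COT_TEMPLATE = (
    'According to the question text, {first}, '
    'so the question type is "{question_type}".\n'
    'I need to firstly analyze the audio content: "{caption}"\n'
    'According to the audio content, {second}, '
    'so the correct answer is "{answer}".'
)


def extract_tag(text: str, tag: str) -> str:
    """Return the stripped content of <tag>...</tag>, or raise if it is missing."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"missing <{tag}> block")
    return match.group(1).strip()


def assemble_cot(model_output: str, question_type: str,
                 caption: str, answer: str) -> str:
    """Fill the CoT template with the two analyses parsed from the model output."""
    return COT_TEMPLATE.format(
        first=extract_tag(model_output, "first_analysis"),
        second=extract_tag(model_output, "second_analysis"),
        question_type=question_type,
        caption=caption,
        answer=answer,
    )
```

A word-count check on each extracted span (at most 30 words) could be added in the same place to enforce the completion requirements.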
### D. Filtered Process
After constructing the full AudioDER dataset, we derive a filtered subset with additional quality control.
This stage mainly targets the AVQA portion of the data and applies two constraints:
- Restrict audio duration to 5–30 seconds.
- Apply a quality filtering pipeline that verifies the mutual consistency of:
  1. audio
  2. caption
  3. question
  4. answer
  5. reasoning process
The prompt used for filtering is as follows:
```text
You are a data reviewer. You will be given an audio clip together with its caption, question, answer, and—most importantly—the reasoning process used to solve the task. Your job is to carefully verify whether all of these elements are accurate and fully consistent with the audio content.
Pay special attention to the reasoning process and check whether it contains any hallucinations, unsupported claims, or logical inconsistencies. Also verify that the caption, question, and answer are all correct and aligned with the audio.
Return <True> if everything is correct and there are no issues.
Return <False> if you find any error, inconsistency, or hallucination.
Here is the caption of the audio: "{caption}".
Here is the question: "{question}".
Here is the choices: "{choices_text}".
Here is the answer: "{answer}".
Here is the reasoning process: "{cot_think}".
Please perform a careful and thorough evaluation and return the result in the specified format only.
Output only one sentence.
No explanation.
No extra text.
Output:
<True> or <False>
```
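The two filtering constraints can be combined in a small gate, sketched below. The helper names are hypothetical, the duration bounds are assumed to be inclusive, and a reviewer reply that does not contain exactly one kind of verdict tag is treated as a rejection.

```python
import re
from typing import Optional

MIN_SECONDS = 5.0   # lower duration bound from the text; inclusivity is assumed
MAX_SECONDS = 30.0  # upper duration bound from the text; inclusivity is assumed


def duration_ok(seconds: float) -> bool:
    """True if the clip length falls inside the 5-30 second window."""
    return MIN_SECONDS <= seconds <= MAX_SECONDS


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a reviewer reply to True/False, or None if it is ambiguous."""
    found = re.findall(r"<(True|False)>", reply)
    if len(set(found)) != 1:
        return None  # no verdict, or both tags present
    return found[0] == "True"


def keep_sample(seconds: float, reviewer_reply: str) -> bool:
    """Keep a sample only if it passes both the duration and the review check."""
    return duration_ok(seconds) and parse_verdict(reviewer_reply) is True
```

Treating ambiguous replies as rejections is the conservative choice for a quality-controlled subset; a production pipeline might instead retry the reviewer call.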
## Examples
- Example 1
```text
Question: What are the main sources of sound in the audio?
A: wolf
B: apes and monkeys
C: duck
D: lion
Qwen3-30B CoT:
According to the question text, the question requires identifying audible animal sources from the audio, focusing on the choices provided as sound references without prior audio details, so the question type is "sound".
I need to firstly analyze the audio content: "This audio contains two instances of Gibbon howling sounds, and background noise sounds."
According to the audio content, gibbon howling directly identifies ape sounds, matching "apes and monkeys"; wolf, duck, and lion lack any audio reference in the description, making them incorrect choices, so the correct answer is "apes and monkeys".
```
- Example 2
```text
Question: What's that noise?
A: skiing
B: disco dancing
C: The sound of running water
D: sing
Qwen3-30B CoT:
According to the question text, the question asks for the source of a sound among given options, requiring identification of vocal content from audio, so the question type is "sound".
I need to firstly analyze the audio content: "The audio contains a section of a pop song where a French-speaking female lead vocal is accompanied by an electronic musical background featuring beats and synthesized sounds."
According to the audio content, the audio features a female vocal, confirming "sing" as the sound source, while skiing, disco dancing, and running water describe non-vocal activities mismatched to the audio, so the correct answer is "sing".
```
## Data Format
Each sample in AudioDER is organized as a structured instance containing the following fields:
```text
{
  "audio_path": "path/or/url/to/audio",
  "question_type": "speech",
  "question": "What is the speaker doing?",
  "options": [
    "Reading a poem",
    "Giving a speech",
    "Singing a chorus",
    "Asking for help"
  ],
  "answer": "Giving a speech",
  "caption": "A male speaker talks clearly in front of a crowd with light background noise.",
  "cot_think": "<first_analysis>...</first_analysis><second_analysis>...</second_analysis>"
}
```
### Field Description
- audio_path: audio file path, identifier, or URL
- question_type: one of sound, music, or speech
- question: multiple-choice question
- options: four answer candidates (named choices in the question-generation prompt output)
- answer: correct answer, given as the full option text
- caption: generated audio caption
- cot_think: generated reasoning process
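For training or evaluation, a sample usually has to be rendered as a lettered multiple-choice question, as in the examples earlier in this README. The sketch below assumes the field names shown in the JSON sample (`options`, `answer`); `render_question` and `answer_letter` are hypothetical helpers, not part of any released tooling.

```python
from string import ascii_uppercase
from typing import Any, Dict


def render_question(sample: Dict[str, Any]) -> str:
    """Format the question and its options as 'A: ...' lines."""
    lines = [f"Question: {sample['question']}"]
    lines += [f"{letter}: {option}"
              for letter, option in zip(ascii_uppercase, sample["options"])]
    return "\n".join(lines)


def answer_letter(sample: Dict[str, Any]) -> str:
    """Return the letter of the option whose text equals the answer field."""
    return ascii_uppercase[sample["options"].index(sample["answer"])]


if __name__ == "__main__":
    sample = {
        "question": "What is the speaker doing?",
        "options": ["Reading a poem", "Giving a speech",
                    "Singing a chorus", "Asking for help"],
        "answer": "Giving a speech",
    }
    print(render_question(sample))
    print("Answer:", answer_letter(sample))
```

`answer_letter` raises `ValueError` when the answer text is not among the options, which doubles as a cheap integrity check while loading the data.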
## Repository Structure
A recommended repository structure is shown below:
```text
AudioDER/
├── assets/
│ └── framework.jpg
├── README.md
└── LICENSE
```
## Experiment Details
We will provide the hyperparameters and experimental settings used in our study in this repository.

## Additional Experiments
More supplementary experiments will be added here.

## Related Resources
- Qwen2-Audio-7B-Instruct
- AudioMCQ
- R1-AQA Evaluation Format
Provider: genggenggeng



