LongAudio
Source: ModelScope community. Updated: 2026-04-28. Indexed: 2025-07-19.
Download link:
https://modelscope.cn/datasets/nv-community/LongAudio
# LongAudio-XL Dataset
[Paper](https://huggingface.co/papers/2507.08128) | [Project Page](https://research.nvidia.com/labs/adlr/AF3/) | [Code](https://github.com/NVIDIA/audio-flamingo)
🚨 Note: This repository now also contains the datasets for the latest model in the Audio Flamingo series, Audio Flamingo Next.
## Dataset Description
**LongAudio-XL** is a large-scale **long** audio question-answering (AQA) dataset for developing (large) audio-language models that reason and solve problems over long audio clips (30 seconds to 10 minutes). It expands the original LongAudio collection with approximately **1 million new QA pairs** for long speech, for a total of **~1.25 million** diverse examples. This release includes the full dataset, covering both LongAudio and LongAudio-XL. The dataset is partitioned into subsets based on each audio’s source dataset:
1. **DailyTalk (`DailyTalk_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://github.com/keonlee9420/DailyTalk
- Additional Note: The entire non-segmented original wav files are treated as the corresponding audios.
2. **IEMOCAP (`IEMOCAP_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://sail.usc.edu/iemocap/
- Additional Note: The entire non-segmented original wav files are treated as the corresponding audios.
3. **MELD (`MELD_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://github.com/declare-lab/MELD
- Additional Note: The entire non-segmented original episodes are treated as the corresponding audios.
4. **MultiDialog (`MultiDialog_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://huggingface.co/datasets/IVLLab/MultiDialog
- Additional Note: The entire original dialogues are treated as the corresponding audios.
5. **LibriSpeech (`LibriSpeech_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://www.openslr.org/12/
- Additional Note: Combine each audio in the list in the exact order for the corresponding audio.
6. **VoxPopuli (`VoxPopuli_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://github.com/facebookresearch/voxpopuli
- Additional Note: Combine each audio in the list in the exact order for the corresponding audio.
7. **Switchboard (`Switchboard_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://catalog.ldc.upenn.edu/LDC97S62
- Additional Note: Combine each audio in the list in the exact order for the corresponding audio.
8. **Europarl (`Europarl_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://www.statmt.org/europarl/
- Additional Note: Combine each audio in the list in the exact order for the corresponding audio.
9. **Fisher (`Fisher_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://catalog.ldc.upenn.edu/LDC2004T19
- Additional Note: Each audio file is named in the format `file_start_end.wav`. Segment the original wav by the start and end times to obtain the corresponding audio.
10. **MiraData (`MiraData_LongAudio.json`)**
- Domain: Sound and Music
- Link to original dataset: https://github.com/mira-space/MiraData
- Additional Note: Follow the instructions in the original GitHub repository to obtain the audio from YouTube.
11. **Recap_LongAudio (`Recap_LongAudio.json`)**
- Domain: Sound and Music
- Link to original dataset: https://github.com/md-mohaiminul/VideoRecap
- Additional Note: Follow the instructions in the original GitHub repository to obtain the audio from [EGO4D](https://ego4d-data.org/).
12. **GigaSpeech_LongAudio (`GigaSpeech_LongAudio.json`)**
- Domain: Speech
- Link to original dataset: https://github.com/SpeechOcean/GigaSpeech
- Additional Note: Download the original dataset. The entire non-segmented original files are treated as the corresponding audio.
13. **LongAudioBench (`Bench_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Additional Note: Please contact the corresponding authors for this dataset.
14. **LongAudioXXL (`MiraData_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Link to original dataset: https://github.com/mira-space/MiraData
- Additional Note: Follow the instructions in the original GitHub repository to obtain the audio from YouTube.
15. **LongAudioXXL (`LongVila_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Link to original dataset: https://huggingface.co/datasets/LongVILA/longvila_sft_dataset
16. **LongAudioXXL (`LongVale_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Link to original dataset: https://huggingface.co/datasets/ttgeng233/LongVALE
17. **LongAudioXXL (`MMTrail_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Link to original dataset: https://huggingface.co/datasets/litwell/MMTrail-20M
18. **LongAudioXXL (`General_Emotion_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Same as `MMTrail_AFNext_LongAudio.json` but with additional emotional information in captions.
- The audios need to be downloaded from YouTube (using the corresponding YouTube IDs in the "id" key).
19. **LongAudioXXL (`General_Time_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Same as `MMTrail_AFNext_LongAudio.json` but with QAs based on time grounding.
- The audios need to be downloaded from YouTube (using the corresponding YouTube IDs in the "id" key).
20. **LongAudioXXL (`YouTube_AFNext_LongAudio.json`)**
- Domain: Speech, Sounds and Music
- Time-stamped captions.
- The audios need to be downloaded from YouTube (using the corresponding YouTube IDs in the "id" key).
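The Fisher note in the list above describes clips named `file_start_end.wav` that must be cut from the original recording. The following is a minimal sketch of that step using only the Python standard library. It rests on several assumptions that the dataset card does not state: that `start` and `end` are expressed in seconds, that the source recording has already been converted to uncompressed PCM WAV, and that the helper names `parse_segment_name` and `slice_wav` are hypothetical, chosen here for illustration.

```python
import wave
from pathlib import Path


def parse_segment_name(name: str) -> tuple[str, float, float]:
    """Split 'file_start_end.wav' into (file, start, end).

    Assumption: start and end are the last two underscore-separated
    fields of the file name and are given in seconds.
    """
    stem = Path(name).stem
    base, start, end = stem.rsplit("_", 2)
    return base, float(start), float(end)


def slice_wav(src: str, dst: str, start_s: float, end_s: float) -> None:
    """Copy the frames between start_s and end_s (seconds) into a new WAV."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        first = int(start_s * rate)
        count = max(0, int(end_s * rate) - first)
        # Clamp the seek position so a start past the end yields an empty clip.
        reader.setpos(min(first, reader.getnframes()))
        frames = reader.readframes(count)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # The frame count in the header is corrected when the file is closed.
        writer.writeframes(frames)
```

A typical call would parse the name found in the "sound" field, locate the matching original recording, and then call `slice_wav` with the parsed times. If the times turn out to be in another unit, only `parse_segment_name` needs to change.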
By releasing LongAudio-XL, we enable researchers to train models on a broad spectrum of audio reasoning tasks. **Please note that we only provide the text QA annotations. Due to licensing constraints, we do not host the original audio files. Users are responsible for retrieving the corresponding audio clips from their original sources (e.g., YouTube8M, LibriSpeech, Music4All) by using the wav file name from the "sound" tag in the JSONs and downloading the dataset from the URLs listed above. The audio files then need to be either sliced or combined (see the Additional Note for each dataset). We acknowledge that this process may be complex; please raise an issue or contact the corresponding authors if you run into problems.**
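For the subsets whose note says to combine each audio in the list in the exact order (LibriSpeech, VoxPopuli, Switchboard, Europarl), the combining step might look like the sketch below. It assumes that every part has already been converted to PCM WAV with the same channel count, sample width, and sample rate; the function names `wav_format` and `concat_wavs` are hypothetical and not part of any released tooling.

```python
import wave


def wav_format(path: str) -> tuple[int, int, int]:
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(path, "rb") as reader:
        return reader.getnchannels(), reader.getsampwidth(), reader.getframerate()


def concat_wavs(parts: list[str], dst: str) -> None:
    """Concatenate WAV files into dst, preserving the order of `parts`."""
    if not parts:
        raise ValueError("parts must not be empty")
    # Check all formats up front so no partial output is written on mismatch.
    formats = [wav_format(path) for path in parts]
    if any(fmt != formats[0] for fmt in formats):
        raise ValueError("all parts must share channels, sample width and rate")
    channels, width, rate = formats[0]
    with wave.open(dst, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(width)
        writer.setframerate(rate)
        for path in parts:
            with wave.open(path, "rb") as reader:
                writer.writeframes(reader.readframes(reader.getnframes()))
```

Sources that ship in other containers (for example FLAC) would need to be decoded to WAV first, or handled with an audio library that reads them directly.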
## Dataset Owner(s)
NVIDIA Corporation
## Dataset Creation Date
2025/07/10
## Last Update
2026/04/03
## License / Terms of Use
The use of LongAudio-XL is governed by the [NVIDIA OneWay Noncommercial License](licenses/NVIDIA-OneWay-Noncommercial-License_22Mar2022-research.docx).
Synthetic data generation may be subject to OpenAI’s [Terms of Use](https://openai.com/policies/terms-of-use). Additionally, each audio source may be governed by its own dataset license, which users should review before downloading or using the audio content.
## Intended Usage
LongAudio-XL (and LongAudio) is intended to support:
- Training and fine-tuning (large) audio-language models for understanding and reasoning over long audios.
## Dataset Characterization
LongAudio-XL focuses on seven primary skills for sounds and music:
- **Captioning:** Generate comprehensive descriptions of long audio, capturing key events and the overall context.
- **Plot QA:** Answer questions about the audio’s narrative or storyline, reasoning over temporal and causal relationships.
- **Temporal QA:** Identify when events occur and how they relate in time, including sequencing, overlap, and attribute changes.
- **Needle QA:** Locate and reason about a specific “needle” segment within a longer audio “haystack,” ensuring answers reference that segment.
- **Subscene QA:** Answer questions about a distinct subscene in the audio, requiring focus on localized events and details.
- **General QA:** Address broad, open-ended questions spanning multiple events or themes, demonstrating overall comprehension.
and six primary skills for speech:
- **Sarcasm Identification:** Inferring sarcasm from speech by analyzing content, tone, and emotional cues.
- **Emotional State Reasoning:** Identifying a speaker’s emotion, reasoning about its cause, and explaining any emotion flips.
- **Topic Relationship Reasoning:** Determining how two ideas or topics relate within the conversation.
- **Information Extraction (IE):** Needle QA, Causal QA, Response QA, and Topic QA for extracting specific facts, causes, responses, or main topics.
- **Summarization:** Producing a concise summary of the speech content.
- **Order:** Temporal Order, Temporal Attribute, Temporal Referring, and Temporal Grounding to locate and sequence topics over time.
Each example pairs a long audio clip with a corresponding QA item. The audio encompasses environmental sounds, speech (primarily English), and music, and is sourced from open-source datasets (see Tables 9 and 10 in the paper appendix). The text QA is generated using a variety of methods described in the paper. Metadata from the original datasets (if available) is used for QA generation.
## Data Curation Method
- Audio is drawn from several open-source datasets. Some audios are synthetically generated.
- Available metadata (e.g., captions, transcripts) from the respective datasets is curated. Additional metadata (if required) is generated (see the paper for details).
- LLMs are used to generate QA pairs from the metadata using expert-designed reasoning prompts.
- Dataset curation was human-in-the-loop: prompts and data sources were iteratively refined based on model outputs.
## Data Collection Method
Hybrid: Human, Synthetic and Automated
## Labeling Method
Synthetic
## Dataset Format
- **Modality**: Audio (WAV/MP3/FLAC) + Text (JSON)
- **JSON Schema Example**:
```json
[
  {
    "id": "ID",
    "sound": "Name of the wav file.",
    "duration": "The duration in floating point.",
    "conversations": [
      {
        "from": "human",
        "value": "<sound>\nThe Question."
      },
      {
        "from": "gpt",
        "value": "The Answer."
      }
    ]
  }
]
```
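As an illustration of how the schema above might be consumed, the sketch below loads one annotation file and yields (sound, question, answer) triples. It assumes that conversations alternate between "human" and "gpt" turns, as in the example, and that the `<sound>` placeholder can be stripped from the question text. The names `load_annotations` and `iter_qa_pairs` are hypothetical helpers, not part of the released code.

```python
import json
from typing import Iterator


def load_annotations(path: str) -> list[dict]:
    """Read one annotation file, e.g. DailyTalk_LongAudio.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_qa_pairs(records: list[dict]) -> Iterator[tuple[object, str, str]]:
    """Yield (sound, question, answer) for each human/gpt turn pair."""
    for record in records:
        turns = record["conversations"]
        # Assumption: turns alternate human, gpt, human, gpt, ...
        for human, model in zip(turns[0::2], turns[1::2]):
            if human["from"] != "human" or model["from"] != "gpt":
                continue
            # Drop the <sound> placeholder and surrounding whitespace.
            question = human["value"].replace("<sound>", "").strip()
            yield record["sound"], question, model["value"]
```

The "sound" value is passed through unchanged, so the same helper should work whether a subset stores a single file name or a list of files to combine.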
## Reference(s):
- Audio Flamingo Next
```
@misc{ghoshaudioflamingonext,
title={Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music},
author={Sreyan Ghosh and Arushi Goel and Kaousheik Jayakumar and Lasha Koroshinadze and Nishit Anand and Zhifeng Kong and Siddharth Gururani and Sang-gil Lee and Jaehyeon Kim and Aya Aljafari and Chao-Han Huck Yang and Sungwon Kim and Ramani Duraiswami and Dinesh Manocha and Mohammad Shoeybi and Bryan Catanzaro and Ming-Yu Liu and Wei Ping},
year={2026},
eprint={},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={},
}
```
- Audio Flamingo 3
```
@misc{goel2025audioflamingo3advancing,
title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
author={Arushi Goel and Sreyan Ghosh and Jaehyeon Kim and Sonal Kumar and Zhifeng Kong and Sang-gil Lee and Chao-Han Huck Yang and Ramani Duraiswami and Dinesh Manocha and Rafael Valle and Bryan Catanzaro},
year={2025},
eprint={2507.08128},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2507.08128},
}
```
- Audio Flamingo
```
@inproceedings{kong2024audio,
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
booktitle={International Conference on Machine Learning},
pages={25125--25148},
year={2024},
organization={PMLR}
}
```
- Audio Flamingo 2
```
@article{ghosh2025audio,
title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2503.03983},
year={2025}
}
```
## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
Provider: maas
Created: 2025-07-12



