AudioSkills

ModelScope Community · updated 2026-04-28 · indexed 2025-07-19
Download link:
https://modelscope.cn/datasets/nv-community/AudioSkills
Resource description:
# AudioSkills-XL Dataset

[Project page](https://research.nvidia.com/labs/adlr/AF3/) | [Paper](https://huggingface.co/papers/2507.08128) | [Code](https://github.com/NVIDIA/audio-flamingo)

## Dataset Description

**AudioSkills-XL** is a large-scale audio question-answering (AQA) dataset designed to develop (large) audio-language models on expert-level reasoning and problem-solving tasks over short audio clips (≤30 seconds). It expands upon the original AudioSkills collection by adding approximately **4.5 million new QA pairs**, resulting in a total of **~10 million** diverse examples. The release contains the full dataset, covering both AudioSkills and AudioSkills-XL.

The dataset is partitioned into subsets based on each audio's source dataset:

1. **WavText5K (`WavText5K.json`)**
   - Domain: Sound
   - Link to original dataset: https://github.com/microsoft/WavText5K
2. **SONNISS (`SONNISS.json`)**
   - Domain: Sound
   - Link to original dataset: https://sonniss.com/
3. **MusicCaps (`MusicCaps.json`)**
   - Domain: Sound
   - Link to original dataset: https://huggingface.co/datasets/google/MusicCaps
4. **BBC Sound Effects (`BBC_Sound_Effects.json`)**
   - Domain: Sound
   - Link to original dataset: [NA](https://sound-effects.bbcrewind.co.uk/)
5. **AudioSet (`AudioSet.json`)**
   - Domain: Sound
   - Link to original dataset: https://research.google.com/audioset/ (can also be downloaded from https://github.com/JishengBai/AudioSetCaps)
6. **MusicBench (`MusicBench.json`)**
   - Domain: Music
   - Link to original dataset: https://huggingface.co/datasets/amaai-lab/MusicBench
7. **YouTube-8M (`YouTube8M.json`)**
   - Domain: Sound, Speech
   - Link to original dataset: https://research.google.com/youtube8m/ (can also be downloaded from https://github.com/JishengBai/AudioSetCaps)
8. **MACS (`MACS.json`)**
   - Domain: Sound
   - Link to original dataset: https://zenodo.org/records/5114771
9. **ESC-50 (`ESC-50.json`)**
   - Domain: Sound
   - Link to original dataset: https://github.com/karolpiczak/ESC-50
10. **CountingQA (`CountingQA.json`)**
    - Domain: Sound
    - Link to original dataset: [Google Drive](https://drive.google.com/file/d/163YvlQ6gzDt7pskMa3pKGZ0vg422Je2F/view?usp=sharing)
    - Additional Note: This split has both counting and temporal QAs.
11. **MagnaTagATune (`MagnaTagATune.json`)**
    - Domain: Music
    - Link to original dataset: http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset
12. **FSD50k (`FSD50k.json`)**
    - Domain: Sound
    - Link to original dataset: https://zenodo.org/records/4060432
13. **VoxCeleb2 (`VoxCeleb2.json`)**
    - Domain: Speech
    - Link to original dataset: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/
    - Note: Audio paths follow the pattern `voxceleb-2/dev/aac/id07175/GDQK8Nu5-cA/combined.wav`. In each folder (`voxceleb-2/dev/aac/id07175/`), all WAV files were merged in chronological order to create the final combined file (`combined.wav`).
14. **FMA (`FMA.json`)**
    - Domain: Music
    - Link to original dataset: https://github.com/mdeff/fma
15. **Music4ALL (`Music4ALL.json`)**
    - Domain: Music
    - Link to original dataset: https://github.com/amaai-lab/Music4All
    - Additional Note: Please email the corresponding authors with an approved license to obtain access to this JSON.
16. **UrbanSound8K (`UrbanSound8K.json`)**
    - Domain: Sound
    - Link to original dataset: https://urbansounddataset.weebly.com/urbansound8k.html
17. **SoundDescs (`SoundDescs.json`)**
    - Domain: Sound
    - Link to original dataset: https://github.com/akoepke/audio-retrieval-benchmark
18. **Medley-solos-DB (`Medley-solos-DB.json`)**
    - Domain: Music
    - Link to original dataset: https://zenodo.org/records/3464194
19. **Medley-Pitch-DB (`Medley-Pitch-DB.json`)**
    - Domain: Music
    - Link to original dataset: https://zenodo.org/records/3464194
20. **GTZAN (`GTZAN.json`)**
    - Domain: Music
    - Link to original dataset: https://github.com/chittalpatel/Music-Genre-Classification-GTZAN
21. **Clotho-v2 (`Clotho-v2.json`)**
    - Domain: Sound
    - Link to original dataset: https://zenodo.org/records/4783391
22. **Freesound (`Freesound.json`)**
    - Domain: Sound
    - Link to original dataset: https://freesound.org (can also be downloaded from https://github.com/XinhaoMei/WavCaps)
23. **CochlScene (`CochlScene.json`)**
    - Domain: Sound
    - Link to original dataset: https://github.com/cochlearai/cochlscene
24. **WavCaps (`WavCaps.json`)**
    - Domain: Sound
    - Link to original dataset: https://github.com/XinhaoMei/WavCaps
25. **Million Song Dataset (`MSD.json`)**
    - Domain: Music
    - Link to original dataset: http://millionsongdataset.com/
26. **VGGSound (`VGG.json`)**
    - Domain: Sound
    - Link to original dataset: https://github.com/amirabd/vggsound
27. **TUT_Urban (`TUT_Urban.json`)**
    - Domain: Sound
    - Link to original dataset: https://dcase-repo.github.io/dcase_datalist/datasets/scenes/tut_asc_2018_mobile_eval.html
28. **SoundBible (`SoundBible.json`)**
    - Domain: Sound
    - Link to original dataset: http://soundbible.com
29. **AudioSet_SL (`AudioSet_SL.json`)**
    - Domain: Sound
    - Link to original dataset: https://research.google.com/audioset/ (can also be downloaded from https://github.com/JishengBai/AudioSetCaps)

By releasing AudioSkills-XL, we enable researchers to train models on a broad spectrum of audio reasoning tasks. **Please note that we only provide the text QA annotations. Due to licensing constraints, we do not host the original audio files. Users are responsible for retrieving the corresponding audio clips from their original sources (e.g., YouTube8M, LibriSpeech, Music4All) by downloading each dataset from the URLs listed above and matching files against the wav file name in the "sound" tag of the JSONs.**

## Sample Usage

To download the dataset files, you can use `git lfs`:

```bash
git lfs install
git clone git@hf.co:datasets/nvidia/AudioSkills-XL
```

## Dataset Owner(s)

NVIDIA Corporation

## Dataset Creation Date

2025/07/10

## License / Terms of Use

The use of AudioSkills-XL is governed by the [NVIDIA OneWay Noncommercial License](licenses/NVIDIA-OneWay-Noncommercial-License_22Mar2022-research.docx).
Synthetic data generation may be subject to OpenAI's [Terms of Use](https://openai.com/policies/terms-of-use). Additionally, each audio source may be governed by its own dataset license, which users should review before downloading or using the audio content.

## Intended Usage

AudioSkills-XL (and AudioSkills) is intended to support:

- Training and fine-tuning (large) audio-language models for expert-level reasoning over audio.

## Dataset Characterization

AudioSkills-XL focuses on seven primary skills for sounds and music:

- **Temporal Reasoning:** Understanding temporal relationships in audio (order, attribute changes, referring, grounding).
- **Attribute Identification:** Recognizing specific event properties (e.g., loudness, speaker gender).
- **Counting:** Quantifying occurrences of target sounds at varying difficulty levels.
- **Contextual Sound Event Reasoning:** Inferring the purpose or cause of a sound in its acoustic context.
- **Contextual Speech Event Reasoning:** Explaining spoken utterances in relation to surrounding sounds or dialogue.
- **Information Extraction:** Pulling out detailed facts, entities, or responses from audio content.
- **General Reasoning:** Addressing complex questions that combine multiple reasoning skills.

and six primary skills for speech:

- **Sarcasm Identification:** Inferring sarcasm from speech by analyzing content, tone, and emotional cues.
- **Emotional State Reasoning:** Identifying a speaker's emotion, reasoning about its cause, and explaining any emotion flips.
- **Topic Relationship Reasoning:** Determining how two ideas or topics relate within the conversation.
- **Information Extraction (IE):** Needle QA, Causal QA, Response QA, and Topic QA for extracting specific facts, causes, responses, or main topics.
- **Summarization:** Producing a concise summary of the speech content.
- **Order:** Temporal Order, Temporal Attribute, Temporal Referring, and Temporal Grounding to locate and sequence topics over time.
Each example is a pair of a short audio clip (≤30 s) and a corresponding QA item. Audio encompasses environmental sounds, speech (primarily English), and music. Audios are sourced from open-source datasets (see Table 6 in paper appendix). Text QA is generated using a variety of methods mentioned in the paper. Metadata from the original datasets (if available) is used for QA generation.

## Data Curation Method

- Audio is drawn from several open-source datasets. Some audios are synthetically generated.
- Available metadata (e.g., captions, transcripts, etc.) from the respective datasets is curated. Additional metadata (if required) is generated (see paper for details).
- LLMs are used to generate QA pairs from the metadata using expert-designed reasoning prompts.
- Dataset curation was human-in-the-loop: prompts and data sources were iteratively refined based on model outputs.

## Data Collection Method

Hybrid: Human, Synthetic and Automated

## Labeling Method

Synthetic

## Dataset Format

- **Modality**: Audio (WAV/MP3/FLAC) + Text (JSON)
- **JSON Schema Example**:

```json
[
  {
    "id": "ID",
    "sound": "Name of the wav file.",
    "duration": "The duration in floating point.",
    "conversations": [
      {
        "from": "human",
        "value": "<sound> The Question."
      },
      {
        "from": "gpt",
        "value": "The Answer."
      }
    ]
  }
]
```

**Note:** While the `duration` field is accurate in most cases, it may be incorrect in some files and should be treated as a placeholder. If your code relies on audio durations, we recommend recalculating them. Please also note that all QA pairs are intended to correspond to the entire audio clip, not just a segment.
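As an illustration of the schema and the `duration` caveat, the following is a minimal Python sketch, not part of the official release. It assumes that each `conversations` list alternates `human` and `gpt` turns, and that the audio you retrieved is uncompressed PCM WAV (the standard-library `wave` module does not read MP3 or FLAC). The function names `iter_qa_pairs` and `wav_duration_seconds` are hypothetical helpers introduced here for illustration.

```python
import json
import wave


def iter_qa_pairs(json_path):
    """Yield (id, sound, question, answer) tuples from one subset JSON file.

    Assumption: turns alternate human question / gpt answer.
    """
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        turns = rec["conversations"]
        # Pair each even-indexed turn with the turn that follows it.
        for q, a in zip(turns[0::2], turns[1::2]):
            if q["from"] != "human" or a["from"] != "gpt":
                continue  # skip pairs that do not match the assumed order
            # Drop the "<sound>" placeholder that prefixes the question.
            question = q["value"].replace("<sound>", "").strip()
            yield rec["id"], rec["sound"], question, a["value"]


def wav_duration_seconds(wav_path):
    """Recompute a clip's duration from its WAV header.

    Assumption: the file is PCM WAV; other formats need a different reader.
    """
    with wave.open(str(wav_path), "rb") as w:
        return w.getnframes() / w.getframerate()
```

A typical use would be to iterate over `iter_qa_pairs("ESC-50.json")`, resolve each `sound` value against the directory where you stored the downloaded audio, and replace the `duration` value with the result of `wav_duration_seconds` when your pipeline depends on clip length.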
## Reference(s):

- Audio Flamingo 3

```
@misc{goel2025audioflamingo3advancing,
  title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models},
  author={Arushi Goel and Sreyan Ghosh and Jaehyeon Kim and Sonal Kumar and Zhifeng Kong and Sang-gil Lee and Chao-Han Huck Yang and Ramani Duraiswami and Dinesh Manocha and Rafael Valle and Bryan Catanzaro},
  year={2025},
  eprint={2507.08128},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2507.08128},
}
```

- Audio Flamingo

```
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
```

- Audio Flamingo 2

```
@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
```

## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

Provider:
maas
Created:
2025-07-12
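Background overview
AudioSkills-XL is a large-scale audio question-answering dataset containing about 10 million QA pairs across sound, music, and speech, intended to support expert-level reasoning in audio-language models. The dataset does not include the original audio files; users must obtain them from the original sources.
The summary above was collected and generated by 遇见数据集.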