HKUSTAudio/AudioX-IFcaps

Name: HKUSTAudio/AudioX-IFcaps
Creator: HKUSTAudio
Published: 2026-02-10 12:13:38
License: cc-by-nc-nd-4.0

Source: Hugging Face. Updated 2026-02-10; indexed 2026-04-05.

Download link:

https://hf-mirror.com/datasets/HKUSTAudio/AudioX-IFcaps
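The link above points at a Hugging Face mirror. A minimal sketch of fetching the repository programmatically follows, assuming the third-party `huggingface_hub` package is installed and that the mirror honors the standard `HF_ENDPOINT` override. The helper names `dataset_page_url` and `download_dataset` are hypothetical, and because the dataset card does not describe the repository's file layout or size, the sketch simply downloads the whole repository.

```python
import os

REPO_ID = "HKUSTAudio/AudioX-IFcaps"
MIRROR = "https://hf-mirror.com"  # mirror host taken from the download link above


def dataset_page_url(repo_id: str, endpoint: str = MIRROR) -> str:
    """Build the dataset page URL for a given Hub endpoint (hypothetical helper)."""
    return f"{endpoint.rstrip('/')}/datasets/{repo_id}"


def download_dataset(local_dir: str = "AudioX-IFcaps", use_mirror: bool = True) -> str:
    """Download a snapshot of the dataset repo and return its local path.

    Assumption: the repo is publicly accessible. Its total size is not
    stated in the card, so check free disk space before calling this.
    """
    if use_mirror:
        # huggingface_hub reads HF_ENDPOINT when it is first imported,
        # so the variable must be set before the import below.
        os.environ.setdefault("HF_ENDPOINT", MIRROR)
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)


if __name__ == "__main__":
    print(dataset_page_url(REPO_ID))
```

Calling `download_dataset()` performs the actual network transfer; running the file as a script only prints the page URL. If `huggingface_hub` was already imported elsewhere in the process, the endpoint override will not take effect.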


Resource description:

---
license: cc-by-nc-nd-4.0
task_categories:
- text-to-audio
size_categories:
- 1M<n<10M
pretty_name: AudioX-IFcaps
---

# [ICLR 2026] AudioX-IFcaps: Instruction-Following Audio Caption Dataset

<a href="https://zeyuet.github.io/AudioX/" target="_blank"><img src="https://img.shields.io/badge/🌐%20Project%20Page-blue" alt="Project Page"></a>
<a href="https://github.com/ZeyueT/AudioX" target="_blank"><img src="https://img.shields.io/badge/💻%20GitHub-ffffff?logo=github&logoColor=181717" alt="GitHub"></a>
<a href="https://arxiv.org/pdf/2503.10522" target="_blank"><img src="https://img.shields.io/badge/📄%20Paper-ICLR%202026-red" alt="Paper"></a>

**AudioX-IFcaps** (Instruction-Following) is a large-scale, high-quality multimodal dataset designed for training unified audio and music generation models. The dataset contains over **7 million samples** with fine-grained, structured annotations that enable precise control over audio generation, including sound event categories, counts, temporal ordering, and timestamps.

## 📊 Dataset Statistics

- **General Audio**: ~1.3M 10-second video-audio clips
- **Music**: ~5.7M 10-second video-music clips
- **Total Duration**: ~16k hours of audio content

## 📝 Citation

If you use this dataset in your research, please cite:

```bibtex
@article{tian2025audiox,
  title={Audiox: Diffusion transformer for anything-to-audio generation},
  author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2503.10522},
  year={2025}
}

@inproceedings{tian2025vidmuse,
  title={Vidmuse: A simple video-to-music generation framework with long-short-term modeling},
  author={Tian, Zeyue and Liu, Zhaoyang and Yuan, Ruibin and Pan, Jiahao and Liu, Qifeng and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={18782--18793},
  year={2025}
}
```

## 🔗 Related Resources

- **Paper**: <a href="https://arxiv.org/pdf/2503.10522" target="_blank">AudioX: Diffusion Transformer for Anything-to-Audio Generation</a> (Accepted to ICLR 2026)
- **Project Page**: <a href="https://zeyuet.github.io/AudioX/" target="_blank">https://zeyuet.github.io/AudioX/</a>
- **Code**: <a href="https://github.com/ZeyueT/AudioX" target="_blank">GitHub Repository</a>

---

**Note**: This dataset is part of the AudioX project. For more information, please refer to the paper and project page.
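As a quick sanity check on the Dataset Statistics figures, the sketch below relates the approximate clip counts to the stated total duration. It assumes the rounded figures from the card and treats "10-second" as a maximum clip length; the implied average is an inference from those rounded numbers, not a value reported by the authors.

```python
# Approximate figures copied from the Dataset Statistics section.
GENERAL_CLIPS = 1_300_000   # ~1.3M video-audio clips
MUSIC_CLIPS = 5_700_000     # ~5.7M video-music clips
CLIP_SECONDS = 10           # nominal clip length
STATED_HOURS = 16_000       # ~16k hours, as stated in the card

total_clips = GENERAL_CLIPS + MUSIC_CLIPS

# Upper bound on duration if every clip ran the full 10 seconds.
upper_bound_hours = total_clips * CLIP_SECONDS / 3600

# Average clip length implied by the stated total (inference only).
implied_avg_seconds = STATED_HOURS * 3600 / total_clips

print(f"total clips: {total_clips:,}")                        # 7,000,000
print(f"upper bound: {upper_bound_hours:,.0f} hours")         # about 19,444
print(f"implied average clip: {implied_avg_seconds:.1f} s")   # about 8.2
```

The counts are consistent with the "over 7 million samples" claim. The stated ~16k hours sits below the ~19.4k-hour upper bound, which would be expected if the counts are rounded or some clips are shorter than 10 seconds.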

Provider:

HKUSTAudio
