MINT
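MINT Dataset Overview
Data Format
The MINT dataset contains images, narrative text, audio captions, and audio. The data is organized as a JSON file in which each line represents one data sample. Audio can be retrieved with the yt-dlp tool using the provided youtube_id and audio_start_time; each extracted clip is 10 seconds long. Images are referenced by index in the JSON file, and the image files themselves can be downloaded from Zenodo.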
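The audio retrieval step can be sketched as follows. This is a minimal illustration, not the authors' script: the helper name `build_ytdlp_command`, the `audio` output directory, and the choice of WAV output are assumptions. The yt-dlp options used (`--extract-audio`, `--audio-format`, `--download-sections`, `--output`) are standard yt-dlp flags, and cutting a section requires ffmpeg to be installed. The sketch only builds the command; running it is left to the caller.

```python
CLIP_SECONDS = 10  # clip length stated in the dataset description


def build_ytdlp_command(sample: dict, out_dir: str = "audio") -> list:
    """Build a yt-dlp command that fetches the 10-second clip for one sample.

    `out_dir` and the WAV format are illustrative choices, not part of the dataset.
    """
    start = int(sample["audio_start_time"])  # stored as a string in the JSON
    end = start + CLIP_SECONDS
    url = "https://www.youtube.com/watch?v=" + sample["youtube_id"]
    return [
        "yt-dlp",
        "--extract-audio",
        "--audio-format", "wav",
        # "*START-END" selects a time range in seconds
        "--download-sections", f"*{start}-{end}",
        "--output", f"{out_dir}/{sample['audiocaps_id']}.%(ext)s",
        url,
    ]


sample = {
    "audiocaps_id": "97151",
    "youtube_id": "vfY_TJq7n_U",
    "audio_start_time": "130",
}
cmd = build_ytdlp_command(sample)
print(" ".join(cmd))
```

To download the clip, the command list can be passed to `subprocess.run(cmd, check=True)`. Some videos may no longer be available on YouTube, so a batch script would likely need to tolerate failed downloads.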
Data Example
```json
{
  "audiocaps_id": "97151",
  "youtube_id": "vfY_TJq7n_U",
  "audio_start_time": "130",
  "audio_caption": "Rustling occurs, ducks quack and water splashes, followed by an adult female and adult male speaking and duck calls being blown",
  "image": "97151.png",
  "narrative_text": "As I make my way along the winding path, I come across a loving couple, their gentle conversation a warm and intimate accompaniment to the natural soundscape. The adult females voice is soft and melodious, while the adult males is deep and soothing. Their words are lost in the distance, but the love and contentment in their tone is palpable. Suddenly, a duck call pierces the air, followed by a chorus of quacks and honks from the ducks in the water. The sounds blend together in perfect harmony, a beautiful tapestry of sound that envelops me in its serenity."
}
```
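A record like the one above can be read into a typed structure. The sketch below assumes the one-sample-per-line layout described under Data Format; the `MintSample` class and the `parse_line` and `load_samples` helpers are hypothetical names introduced for illustration. The field names come from the example record. Note that `audio_start_time` is stored as a string, so it is converted to an integer here.

```python
import json
from dataclasses import dataclass


@dataclass
class MintSample:
    audiocaps_id: str
    youtube_id: str
    audio_start_time: int  # seconds; converted from the string in the JSON
    audio_caption: str
    image: str             # image file name, e.g. "97151.png"
    narrative_text: str


def parse_line(line: str) -> MintSample:
    """Parse one JSON line into a MintSample."""
    raw = json.loads(line)
    return MintSample(
        audiocaps_id=raw["audiocaps_id"],
        youtube_id=raw["youtube_id"],
        audio_start_time=int(raw["audio_start_time"]),
        audio_caption=raw["audio_caption"],
        image=raw["image"],
        narrative_text=raw["narrative_text"],
    )


def load_samples(lines) -> list:
    """Parse an iterable of JSON lines, skipping blank ones."""
    return [parse_line(line) for line in lines if line.strip()]


# Shortened demo record; the real captions and narratives are longer.
demo_line = json.dumps({
    "audiocaps_id": "97151",
    "youtube_id": "vfY_TJq7n_U",
    "audio_start_time": "130",
    "audio_caption": "Rustling occurs, ducks quack and water splashes",
    "image": "97151.png",
    "narrative_text": "As I make my way along the winding path...",
})
samples = load_samples([demo_line, ""])
print(samples[0].youtube_id, samples[0].audio_start_time)
```

With a real file, `load_samples(open(path, encoding="utf-8"))` would work the same way, where `path` points to the dataset's JSON file.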
Image Data
The image data is available from Zenodo at https://zenodo.org/records/11606725.
License
The MINT dataset is licensed under CC BY-NC-SA 4.0.
Citation
To cite this dataset, use the following entry:
@article{fu2024mint,
  title={MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation},
  author={Ruibo Fu and Shuchen Shi and Hongming Guo and Tao Wang and Chunyu Qiang and Zhengqi Wen and Jianhua Tao and Xin Qi and Yi Lu and Xiaopeng Wang and Zhiyong Wang and Yukun Liu and Xuefei Liu and Shuai Zhang and Guanjun Li},
  journal={arXiv preprint arXiv:2406.10591},
  year={2024}
}




