MMTrail | Multimodal Dataset | Video Content Understanding Dataset

Source: arXiv · Updated 2024-08-06 · Indexed 2024-08-02

Multimodal data

Video content understanding

Download link:

https://github.com/litwellchi/MMTrail


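Overview:

MMTrail, created jointly by the Hong Kong University of Science and Technology and Peking University, is a large-scale multimodal video-language dataset of more than 20 million video clips. All clips carry visual captions, and 2 million of them are high-quality clips that additionally carry multimodal captions. The content spans movies, news, games, and other types. During construction, a large language model (LLM) merges the per-modality annotations so that the final caption retains the music perspective while preserving the authority of the visual context. The dataset mainly targets fine-grained, large-scale multimodal language model training and aims to address multimodal fusion in video content generation and understanding.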

Provided by:

Hong Kong University of Science and Technology, Peking University

Created:

2024-07-31

Summary of original information

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

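Dataset overview

MMTrail is a large-scale multimodal video-language dataset of more than 20 million trailer clips with high-quality multimodal captions that integrate context, visual frames, and background music. It is intended to advance cross-modal research and fine-grained multimodal language model training. The dataset provides 2 million LLaVA video captions, 2 million music captions, and 60 million CoCa frame captions, covering 27.1k hours of trailer video.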

Download information

Split	Download link	Samples	Video duration	Storage
Training-Coca	Download	20M	27.1k hours	~8.0 TB
Training	Download	2.1M	8.2k hours	~1.6 TB
Test (2M(sample 1w))	Download	2.1M	8.2k hours	~1.6 TB
Test	TODO (2.77 MB)	1,000	3.5 hours	794 MB

Metadata format

[
  {
    "video_id": "zW1-6V_cN8I",
    "video_path": "group_32/zW1-6V_cN8I.mp4",
    "video_duration": 1645.52,
    "video_resolution": [720, 1280],
    "video_fps": 25.0,
    "clip_id": "zW1-6V_cN8I_0000141",
    "clip_path": "video_dataset_32/zW1-6V_cN8I_0000141.mp4",
    "clip_duration": 9.92,
    "clip_start_end_idx": [27102, 27350],
    "image_quality": 45.510545094807945,
    "of_score": 6.993135,
    "aesthetic_score": [4.515582084655762, 4.1147027015686035, 3.796849250793457],
    "music_caption_wo_vocal": [
      {
        "text": "This song features a drum machine playing a simple beat. A siren sound is played on the low register. Then, a synth plays a descending lick and the other voice starts rapping. This is followed by a descending run. The mid range of the instruments cannot be heard. This song can be played in a meditation center.",
        "time": "0:00-10:00"
      }
    ],
    "vocal_caption": "I was just wondering...",
    "frame_caption": [
      "two people are standing in a room under an umbrella .",
      "a woman in a purple robe standing in front of a man .",
      "a man and a woman dressed in satin robes ."
    ],
    "music_caption": [
      {
        "text": "This music is instrumental. The tempo is medium with a synthesiser arrangement and digital drumming with a lot of vibrato and static. The music is loud, emphatic, youthful, groovy, energetic and pulsating. This music is a Electro Trap.",
        "time": "0:00-10:00"
      }
    ],
    "objects": ["bed", "Woman", "wall", "pink robe", "pillow"],
    "background": "Bedroom",
    "ocr_score": 0.0,
    "caption": "The video shows a woman in a pink robe standing in a room with a bed and a table, captured in a series of keyframes that show her in various poses and expressions.",
    "polish_caption": "A woman in a pink robe poses and expresses herself in various ways in a room with a bed and a table, capturing her graceful movements and emotive facial expressions.",
    "merge_caption": "In a cozy bedroom setting, a stunning woman adorned in a pink robe gracefully poses and expresses herself, her movements and facial expressions captured in a series of intimate moments. \nThe scene is set against the backdrop of a comfortable bed and a table, with an umbrella standing in a corner of the room. The video features two people standing together under the umbrella, a woman in a purple robe standing confidently in front of a man, and a man and woman dressed in satin robes, all set to an energetic and pulsating electro trap beat with a synthesiser arrangement and digital drumming. The music is loud and emphatic, capturing the youthful and groovy vibe of the video."
  }
]
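As a small illustration of how the clip fields in the sample record relate to each other, the sketch below converts `clip_start_end_idx` into timestamps using `video_fps`. It assumes, without confirmation from the dataset authors, that the two indices are frame numbers in the source video; the helper name `clip_time_span` is hypothetical and the record is abridged to the fields it needs.

```python
import json

# One record, abridged to the fields used below (values copied from the sample).
record_json = """
[{"video_id": "zW1-6V_cN8I",
  "video_fps": 25.0,
  "clip_id": "zW1-6V_cN8I_0000141",
  "clip_duration": 9.92,
  "clip_start_end_idx": [27102, 27350]}]
"""


def clip_time_span(record):
    """Return (start_s, end_s, duration_s) for a clip.

    Assumption: clip_start_end_idx holds frame indices into the source video.
    """
    start_idx, end_idx = record["clip_start_end_idx"]
    fps = record["video_fps"]
    return start_idx / fps, end_idx / fps, (end_idx - start_idx) / fps


record = json.loads(record_json)[0]
start_s, end_s, duration_s = clip_time_span(record)
print(f"{record['clip_id']}: {start_s:.2f}s to {end_s:.2f}s ({duration_s:.2f}s)")
# prints: zW1-6V_cN8I_0000141: 1084.08s to 1094.00s (9.92s)
```

For this record the derived duration, (27350 - 27102) / 25 = 9.92 s, equals the stored `clip_duration`, which is consistent with the frame-index reading.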

Updates

[2024/07/30] Released the 2M and 20M caption data files for download.
[2024/06/10] Created the GitHub page.

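License

The video samples come from publicly available datasets. Users must comply with the applicable licenses when using these video samples. We provide the caption files.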

Citation

@misc{chi2024mmtrailmultimodaltrailervideo,
  title={MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions},
  author={Xiaowei Chi and Yatian Wang and Aosong Cheng and Pengjun Fang and Zeyue Tian and Yingqing He and Zhaoyang Liu and Xingqun Qi and Jiahao Pan and Rongyu Zhang and Mengfei Li and Ruibin Yuan and Yanbing Jiang and Wei Xue and Wenhan Luo and Qifeng Chen and Shanghang Zhang and Qifeng Liu and Yike Guo},
  year={2024},
  eprint={2407.20962},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.20962},
}

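AI-collected summary

Dataset introduction

Construction

MMTrail is built from large-scale trailer videos, with multimodal annotation produced by a systematic captioning framework. First, more than 20 million clips are extracted from over 290,000 trailer videos and given visual captions. An LLM then adaptively merges all annotations, so that the caption preserves the visual context while also reflecting the music perspective. The dataset also includes a high-quality subset of 2 million clips that carry multimodal captions, such as music descriptions and background information, in addition to visual captions.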
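To make the merging step concrete, here is a minimal sketch of how per-modality annotations could be flattened into a single prompt for an LLM. It is not the authors' prompt or pipeline: the instruction wording and the helper `build_merge_prompt` are hypothetical, and only the field names are taken from the metadata sample on this page.

```python
def build_merge_prompt(record):
    """Flatten one metadata record into a single LLM prompt.

    The instruction wording is an illustrative assumption, not the
    prompt used to build MMTrail.
    """
    frames = " | ".join(c.strip() for c in record.get("frame_caption", []))
    music = " ".join(m["text"].strip() for m in record.get("music_caption", []))
    objects = ", ".join(o.strip() for o in record.get("objects", []))
    lines = [
        "Merge the annotations below into one fluent description of the clip.",
        "Treat the visual caption as authoritative and add the music as supporting detail.",
        f"Visual caption: {record.get('polish_caption', '')}",
        f"Frame captions: {frames}",
        f"Music: {music}",
        f"Objects: {objects}",
        f"Background: {record.get('background', '')}",
    ]
    return "\n".join(lines)


# Abridged values from the sample record.
record = {
    "polish_caption": "A woman in a pink robe poses in a room with a bed and a table.",
    "frame_caption": ["two people are standing in a room under an umbrella . "],
    "music_caption": [{"text": "This music is instrumental.", "time": "0:00-10:00"}],
    "objects": [" bed", "Woman", "wall", "pink robe", "pillow"],
    "background": "Bedroom",
}
prompt = build_merge_prompt(record)
print(prompt)
```

Listing the visual caption first and naming it as authoritative mirrors the stated goal of keeping the visual context intact while the music description is folded in.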

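Usage

MMTrail suits a range of multimodal tasks, including video generation, video understanding, and cross-modal tasks between video and music. Researchers can use its visual captions and music descriptions to train and evaluate video generation models, such as video diffusion models, as well as video-language models. The high-quality annotations and varied video content also provide rich training material for video understanding models. The dataset can further be used to explore latent associations between video and music in cross-modal research.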
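One way to prepare such training data is to filter records by their quality fields and pair each clip path with its richest available caption. The sketch below assumes the field names from the metadata sample on this page; the thresholds, the reading of `ocr_score` as "higher means more on-screen text", and the helper `select_training_pairs` are illustrative assumptions rather than recommendations from the authors.

```python
def select_training_pairs(records, min_aesthetic=4.0, max_ocr=0.1):
    """Return (clip_path, caption) pairs for clips that pass simple quality gates.

    Thresholds are hypothetical; tune them on real data.
    """
    pairs = []
    for r in records:
        scores = r.get("aesthetic_score") or []
        if not scores:
            continue  # no aesthetic scores, so the clip cannot be judged
        mean_aesthetic = sum(scores) / len(scores)
        if mean_aesthetic < min_aesthetic or r.get("ocr_score", 0.0) > max_ocr:
            continue
        # Prefer the merged multimodal caption, then fall back to visual ones.
        text = r.get("merge_caption") or r.get("polish_caption") or r.get("caption")
        if text:
            pairs.append((r["clip_path"], text))
    return pairs


records = [
    {   # abridged sample record
        "clip_path": "video_dataset_32/zW1-6V_cN8I_0000141.mp4",
        "aesthetic_score": [4.515582084655762, 4.1147027015686035, 3.796849250793457],
        "ocr_score": 0.0,
        "polish_caption": "A woman in a pink robe poses in a room with a bed and a table.",
    },
    {   # hypothetical low-quality record, included only to exercise the filter
        "clip_path": "hypothetical/low_quality_clip.mp4",
        "aesthetic_score": [2.0, 2.5, 3.0],
        "ocr_score": 0.0,
        "caption": "A blurry scene.",
    },
]
pairs = select_training_pairs(records)
print(pairs)
```

The sample record passes because the mean of its three aesthetic scores is about 4.14, above the assumed cutoff of 4.0, while the second record is dropped.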

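Background and challenges

Background

MMTrail was developed jointly by the Hong Kong University of Science and Technology and Peking University to address the neglect of audio information in existing video-language datasets. Created in 2024, it contains more than 20 million trailer clips with visual captions, of which 2 million also carry multimodal captions. Its core research question is how to integrate visual and audio information effectively in video-language models to obtain more complete and precise descriptions. The dataset is relevant to multimodal learning and video content generation, and it is intended to pave the way for fine-grained, large-scale multimodal language model training.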

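Current challenges

Building MMTrail raised several challenges. The first is selecting high-quality trailer videos from a very large pool while ensuring that their visual and audio information is complete. The second is the complexity of multimodal data, which makes processing and annotation harder, particularly given the requirement for high correlation between audio and visual content. The third is designing a systematic captioning framework that preserves the authority of the visual context while incorporating the music perspective. These challenges affect not only dataset construction but also the effectiveness of downstream model training and the breadth of applications.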

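Common scenarios

Typical use case

The typical use case for MMTrail is generating multimodal video descriptions. By combining visual, audio, and text information, the dataset supports detailed descriptions of video content and can be used to train and evaluate video-language models. Its trailer format, in which the background music is carefully designed, makes MMTrail well suited to multimodal content understanding and generation tasks.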

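Academic problems addressed

MMTrail addresses the limited use of audio information in existing video-language datasets. By providing multimodal descriptions that cover visual, audio, and text content, it is intended to make video content understanding more accurate and complete and to give cross-modal research a solid foundation. Its high-quality annotations and data diversity support the further development of multimodal language models.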

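Practical applications

MMTrail has broad potential in practical applications, especially in video content generation and understanding. Examples include automatic generation of movie and game trailers, summarization of news videos, and content analysis of educational videos. The dataset can also support intelligent video recommendation systems that improve user experience by understanding both video content and background music.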

Recent research on the dataset

Related papers

1. MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions. Hong Kong University of Science and Technology, Peking University · 2024

The content above was collected and summarized by AI.
