VISTA-400K
Source: ModelScope community | Updated: 2025-12-04 | Indexed: 2025-02-08
Download link:
https://modelscope.cn/datasets/TIGER-Lab/VISTA-400K
Resource description:
# VISTA-400K
This repo contains all subsets for **VISTA-400K**. [VISTA](https://huggingface.co/papers/2412.00927) is a video spatiotemporal augmentation method that generates long-duration and high-resolution video instruction-following data to enhance the video understanding capabilities of video LMMs.
### This repo is under construction. Please stay tuned.
[**🌐 Homepage**](https://tiger-ai-lab.github.io/VISTA/) | [**📖 arXiv**](https://arxiv.org/abs/2412.00927) | [**💻 GitHub**](https://github.com/TIGER-AI-Lab/VISTA) | [**🤗 VISTA-400K**](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K) | [**🤗 Models**](https://huggingface.co/collections/TIGER-Lab/vista-674a2f0fab81be728a673193) | [**🤗 HRVideoBench**](https://huggingface.co/datasets/TIGER-Lab/HRVideoBench)
## Video Instruction Data Synthesis Pipeline
<p align="center">
<img src="https://tiger-ai-lab.github.io/VISTA/static/images/vista_main.png" width="900">
</p>
VISTA draws on data augmentation techniques from image and video classification, such as CutMix, MixUp and VideoMix, which show that training on synthetic data created by overlaying or mixing multiple images or videos yields more robust classifiers. Similarly, our method combines videos spatially and temporally to create artificial augmented video samples with longer durations and higher resolutions, and then synthesizes instruction data based on these new videos. Our data synthesis pipeline uses existing public video-caption datasets, making it fully open-source and scalable. This allows us to construct VISTA-400K, a high-quality video instruction-following dataset aimed at improving the long and high-resolution video understanding capabilities of video LMMs.
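To make the idea of combining videos "spatially and temporally" concrete, here is a minimal sketch in plain Python. It is not the VISTA pipeline: the function names (`temporal_concat`, `spatial_grid`, `solid_clip`) and the toy representation of a video as nested lists of grayscale pixels are assumptions made for illustration only. Temporal concatenation produces a longer clip, and tiling clips into a grid produces frames with higher resolution.

```python
from typing import List

Frame = List[List[int]]  # H x W grid of pixel values (grayscale, for brevity)
Video = List[Frame]      # ordered sequence of frames


def temporal_concat(clips: List[Video]) -> Video:
    """Join clips end to end, giving a video with a longer duration."""
    out: Video = []
    for clip in clips:
        out.extend(clip)
    return out


def spatial_grid(clips: List[Video], cols: int) -> Video:
    """Tile clips into a grid, giving a video with a higher resolution.

    Assumes every clip has the same frame size. Clips are truncated to the
    length of the shortest one so that each grid cell has a frame at every step.
    """
    if not clips or cols <= 0 or len(clips) % cols != 0:
        raise ValueError("number of clips must be a positive multiple of cols")
    n_frames = min(len(clip) for clip in clips)
    out: Video = []
    for t in range(n_frames):
        frame: Frame = []
        # Walk the clips one grid row at a time.
        for start in range(0, len(clips), cols):
            row_clips = clips[start:start + cols]
            height = len(row_clips[0][t])
            for y in range(height):
                row: List[int] = []
                for clip in row_clips:
                    row.extend(clip[t][y])  # place pixel rows side by side
                frame.append(row)
        out.append(frame)
    return out


def solid_clip(value: int, frames: int, height: int, width: int) -> Video:
    """Hypothetical helper: a clip in which every pixel equals `value`."""
    return [[[value] * width for _ in range(height)] for _ in range(frames)]


if __name__ == "__main__":
    clips = [solid_clip(v, frames=3, height=2, width=2) for v in range(4)]
    longer = temporal_concat(clips)       # 4 clips x 3 frames each
    larger = spatial_grid(clips, cols=2)  # 2 x 2 grid of 2 x 2 frames
    print(len(longer), len(larger[0]), len(larger[0][0]))
```

In a real setting the frames would come from decoded video files, and the instruction data would then be synthesized from the captions of the source clips. That step is outside the scope of this sketch.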
## Citation
If you find our paper useful, please cite it as:
```
@misc{ren2024vistaenhancinglongdurationhighresolution,
title={VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation},
author={Weiming Ren and Huan Yang and Jie Min and Cong Wei and Wenhu Chen},
year={2024},
eprint={2412.00927},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.00927},
}
```
Provider:
maas
Created:
2025-02-03



