multi-shot video dataset
ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models
Authors and Affiliations
- Ozgur Kara (1, 2)
- Krishna Kumar Singh (2)
- Feng Liu (2)
- Duygu Ceylan (2)
- James Matthew Rehg (1)
- Tobias Hinz (2)
(1) University of Illinois Urbana-Champaign
(2) Adobe
Venue
CVPR 2025
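Abstract
Current diffusion-based text-to-video methods are limited to generating short single-shot clips and cannot produce multi-shot videos with discrete transitions. To address this limitation, the authors propose a framework that combines a dataset collection pipeline with architectural extensions to video diffusion models, enabling text-to-multi-shot video generation. The approach generates multi-shot videos with consistent characters and backgrounds, and lets users control the number, duration, and content of the shots through shot-specific conditioning.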
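Method
- A pretrained T2V model is fine-tuned with newly introduced "transition tokens".
- For an n-shot video, n-1 transition tokens, initialized as learnable parameters, are fed into the pretrained T2V model together with the video and the shot-specific prompts.
- The model processes the concatenated input token sequence, guided by the joint attention layers in the DiT blocks.
- A local attention mask ensures that each transition token interacts only with the visual frames where its transition occurs, and that each text token interacts only with its corresponding visual tokens.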
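The masking rule in the last bullet can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper name `build_local_attention_mask`, the sequence layout (text tokens, then transition tokens, then visual tokens), the choice to let visual tokens attend to all visual tokens, and the choice to tie each transition token to the last frame of one shot and the first frame of the next are all hypothetical.

```python
def build_local_attention_mask(frames_per_shot, text_tokens_per_shot, tokens_per_frame=1):
    """Return (mask, layout); mask[q][k] is True if token q may attend to token k.

    Assumed layout (hypothetical):
    [text of shot 0 | text of shot 1 | ... | n-1 transition tokens | visual frames in order]
    """
    n = len(frames_per_shot)
    if n < 1 or len(text_tokens_per_shot) != n:
        raise ValueError("need one frame count and one text-token count per shot")
    if min(frames_per_shot) < 1 or min(text_tokens_per_shot) < 0 or tokens_per_frame < 1:
        raise ValueError("every shot needs at least one frame")

    # Assign contiguous token indices to each group.
    idx = 0
    text = []
    for count in text_tokens_per_shot:
        text.append(list(range(idx, idx + count)))
        idx += count
    transition = list(range(idx, idx + n - 1))  # n-1 transition tokens for n shots
    idx += n - 1
    visual = []  # visual[s][f] = token indices of frame f in shot s
    for count in frames_per_shot:
        frames = []
        for _ in range(count):
            frames.append(list(range(idx, idx + tokens_per_frame)))
            idx += tokens_per_frame
        visual.append(frames)

    total = idx
    mask = [[False] * total for _ in range(total)]

    def allow(group_a, group_b):
        # Open attention in both directions between the two groups.
        for i in group_a:
            for j in group_b:
                mask[i][j] = True
                mask[j][i] = True

    # Assumption: visual tokens attend to all visual tokens.
    all_visual = [i for shot in visual for frame in shot for i in frame]
    allow(all_visual, all_visual)

    # Each shot's text tokens see only themselves and that shot's visual tokens.
    for s in range(n):
        shot_visual = [i for frame in visual[s] for i in frame]
        allow(text[s], text[s])
        allow(text[s], shot_visual)

    # Assumption: transition token k sees the frames on both sides of cut k.
    for k, token in enumerate(transition):
        boundary = visual[k][-1] + visual[k + 1][0]
        allow([token], [token])
        allow([token], boundary)

    layout = {"text": text, "transition": transition, "visual": visual}
    return mask, layout
```

For instance, `build_local_attention_mask([2, 3], [2, 2])` lays out 4 text tokens, 1 transition token, and 5 visual tokens. In a real model a boolean mask like this would typically be converted into whatever mask format the attention layer expects.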
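Dataset Collection
- Method one: sample videos with large motion, randomly split each into n shots, and concatenate the shots into a multi-shot video.
- Method two: randomly sample n videos of the same identity from pre-clustered groups and concatenate them into a multi-shot video.
- Post-processing: verify identity consistency and obtain shot-specific captions with LLaVA-NeXT.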
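The two sampling strategies above can be sketched roughly as follows. The function names `split_into_shots` and `sample_same_identity`, the `min_len` parameter, and the dictionary layout of the identity clusters are hypothetical choices made for this example. The sketch keeps the segments contiguous, which is also an assumption, and it leaves out the identity check and the captioning step.

```python
import random


def split_into_shots(num_frames, n, min_len=1, rng=None):
    """Method one (sketch): cut one video into n random segments.

    Returns a list of (start, end) frame ranges with end exclusive.
    Every segment has at least min_len frames.
    """
    rng = rng or random.Random()
    if n < 1 or min_len < 1 or num_frames < n * min_len:
        raise ValueError("video is too short for the requested shots")
    # Distribute the frames beyond the guaranteed minimum at random.
    slack = num_frames - n * min_len
    offsets = sorted(rng.randint(0, slack) for _ in range(n - 1))
    cuts = [offset + (i + 1) * min_len for i, offset in enumerate(offsets)]
    bounds = [0] + cuts + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]


def sample_same_identity(clusters, n, rng=None):
    """Method two (sketch): draw n distinct videos from one identity cluster.

    clusters maps an identity label to a list of video paths (assumed layout).
    """
    rng = rng or random.Random()
    eligible = sorted(key for key, videos in clusters.items() if len(videos) >= n)
    if not eligible:
        raise ValueError("no cluster has enough videos")
    identity = rng.choice(eligible)
    return identity, rng.sample(clusters[identity], n)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which helps when a dataset build has to be rerun.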
Qualitative Results
- Generated 2-shot video example:
  - Shot 1 prompt: “a young girl paints at an easel in her bedroom”
  - Shot 2 prompt: “she then reads a comic book in her bed”
- Generated 3-shot video example:
  - Shot 1 prompt: “a man sketches in a notebook at a quiet cafe, his hand moving quickly across the page”
  - Shot 2 prompt: “he pauses, looking up thoughtfully before continuing his drawing”
  - Shot 3 prompt: “later, the man steps outside, his notebook tucked under his arm as he takes in the city around him”
- Generated 4-shot video example:
  - Shot 1 prompt: “scientist in lab coat examines a specimen”
  - Shot 2 prompt: “she writes notes on a clipboard”
  - Shot 3 prompt: “she adjusts dials on a machine”
  - Shot 4 prompt: “she pours a liquid into a beaker”
Comparison
- Shot 1 prompt: “a man reads a book under tree”
- Shot 2 prompt: “a man walks from the forest towards lake”
- Compared methods: MEVG [1], FreeNoise [2], Gen-L-Video [3], SEINE [4]
Citation
@inproceedings{kara2025shotadapter,
  title={ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models},
  author={Ozgur Kara and Krishna Kumar Singh and Feng Liu and Duygu Ceylan and James M. Rehg and Tobias Hinz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
References
[1] Oh, G., et al. (2024). MEVG: Multi-Event Video Generation with Text-to-Video Models. European Conference on Computer Vision.
[2] Qiu, H., et al. (2024). FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling. ICLR.
[3] Wang, F. Y., et al. (2023). Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising. arXiv.
[4] Chen, X., et al. (2023). SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. ICLR.




