Wan2.2-Syn-121x704x1280_32k
Source: ModelScope community (魔搭社区). Updated 2026-04-28; indexed 2025-08-09.
Download link:
https://modelscope.cn/datasets/FastVideo/Wan2.2-Syn-121x704x1280_32k
Resource description:
# FastVideo Synthetic Wan2.2 720P dataset
<p align="center">
<img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.png" width="200"/>
</p>
<div>
<div align="center">
<a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a> 
</div>
<div align="center">
<a href="https://arxiv.org/abs/2505.13389">Paper</a> |
<a href="https://github.com/hao-ai-lab/FastVideo">Github</a> |
<a href="https://hao-ai-lab.github.io/FastVideo">Project Page</a>
</div>
</div>
## Abstract
Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at *both* training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight *critical tokens*; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout that ensures hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.
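To make the coarse-to-fine idea in the abstract concrete, here is a toy pure-Python sketch. It is not the authors' kernel: it works on a flat 1D token sequence, uses mean pooling and top-k tile selection as assumed stand-ins for the coarse stage, and leaves out the block-sparse GPU layout entirely. All function and parameter names (`coarse_to_fine_attention`, `tile`, `top_k`) are hypothetical.

```python
import math


def mean_pool(x, tile):
    """Average each consecutive group of `tile` token vectors into one vector."""
    return [
        [sum(col) / tile for col in zip(*x[i:i + tile])]
        for i in range(0, len(x), tile)
    ]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_to_fine_attention(q, k, v, tile, top_k):
    """Toy two-stage sparse attention over lists of token vectors.

    Assumption: each query tile keeps the `top_k` key tiles with the highest
    pooled score, and token-level attention runs only inside those tiles.
    """
    n, d = len(q), len(q[0])
    if n % tile != 0:
        raise ValueError("sequence length must be a multiple of the tile size")
    scale = 1.0 / math.sqrt(d)
    q_tiles, k_tiles = mean_pool(q, tile), mean_pool(k, tile)
    out = []
    for qi, q_tile in enumerate(q_tiles):
        # Coarse stage: score every key tile against the pooled query tile.
        tile_scores = [dot(q_tile, kt) * scale for kt in k_tiles]
        ranked = sorted(range(len(k_tiles)), key=lambda j: tile_scores[j], reverse=True)
        chosen = sorted(ranked[:top_k])
        key_idx = [j * tile + t for j in chosen for t in range(tile)]
        # Fine stage: ordinary softmax attention restricted to the chosen tiles.
        for row in range(qi * tile, (qi + 1) * tile):
            weights = softmax([dot(q[row], k[c]) * scale for c in key_idx])
            out.append([
                sum(w * v[c][h] for w, c in zip(weights, key_idx))
                for h in range(len(v[0]))
            ])
    return out
```

With `top_k` equal to the number of tiles, the sketch reduces to dense attention, which is a convenient sanity check; smaller values of `top_k` trade accuracy for less fine-stage work.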
## Dataset Overview
- The prompts were randomly sampled from the [Vchitect_T2V_DataVerse](https://huggingface.co/datasets/Vchitect/Vchitect_T2V_DataVerse) dataset.
- Each sample was generated with the **Wan2.2-TI2V-5B-Diffusers** model, and its latents were stored.
- Each latent sample corresponds to a video of **121 frames**, with each frame sized **704×1280**.
- It includes all preprocessed latents required for the **Text-to-Video (T2V)** task, as well as the first-frame image.
- The dataset is fully compatible with the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository and can be directly loaded and used without any additional preprocessing.
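As a rough illustration of what "121 frames at 704×1280" could look like in latent space, the sketch below computes the latent grid from assumed VAE compression factors. The temporal stride of 4, spatial stride of 16, and 48 latent channels are assumptions about the Wan2.2-TI2V-5B VAE rather than values stated in this card, and the helper name `latent_shape` is hypothetical; check the model configuration before relying on these numbers.

```python
def latent_shape(frames, height, width,
                 temporal_stride=4,   # assumed temporal compression
                 spatial_stride=16,   # assumed spatial compression
                 channels=48):        # assumed latent channel count
    """Return (channels, latent_frames, latent_height, latent_width).

    Assumes a causal video VAE: the first frame is encoded on its own and
    every following group of `temporal_stride` frames adds one latent frame.
    """
    if (frames - 1) % temporal_stride != 0:
        raise ValueError("frames must have the form temporal_stride * n + 1")
    if height % spatial_stride or width % spatial_stride:
        raise ValueError("height and width must be multiples of spatial_stride")
    latent_frames = (frames - 1) // temporal_stride + 1
    return (channels, latent_frames,
            height // spatial_stride, width // spatial_stride)


if __name__ == "__main__":
    # Under the assumed strides this prints (48, 31, 44, 80).
    print(latent_shape(121, 704, 1280))
```

Under these assumptions each sample would hold 31 × 44 × 80 latent positions, far fewer values than the 121 × 704 × 1280 pixels they represent, which is one reason storing latents is convenient for training.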
## Sample Usage
To download this dataset, ensure you have Git LFS installed, then clone the repository:
```bash
git lfs install
git clone https://huggingface.co/datasets/FastVideo/Wan2.2-Syn-121x704x1280_32k
```
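If Git LFS was not active during the clone, large files are left as small text stubs instead of real data. The sketch below scans a cloned directory for such stubs by checking for the standard Git LFS pointer header; the helper name `find_lfs_pointers` is hypothetical, and the script makes no assumption about the dataset's file layout.

```python
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return files under `repo_dir` that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            pointers.append(path)
    return pointers


if __name__ == "__main__":
    repo = Path("Wan2.2-Syn-121x704x1280_32k")  # directory created by git clone
    if repo.is_dir():
        stubs = find_lfs_pointers(repo)
        print(f"{len(stubs)} file(s) are still LFS pointers")
```

If the script reports any pointer files, running `git lfs pull` inside the cloned directory should fetch the actual contents.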
This dataset contains preprocessed latents ready for Text-to-Video (T2V) tasks and is designed to be directly used with the [FastVideo repository](https://github.com/hao-ai-lab/FastVideo) without further preprocessing. Refer to the FastVideo [documentation](https://hao-ai-lab.github.io/FastVideo) for detailed instructions on how to load and use the dataset for training or finetuning.
If you use the FastVideo Synthetic Wan2.2 dataset in your research, please cite our papers:
```
@article{zhang2025vsa,
title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
journal={arXiv preprint arXiv:2505.13389},
year={2025}
}
@article{zhang2025fast,
title={Fast video generation with sliding tile attention},
author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
journal={arXiv preprint arXiv:2502.04507},
year={2025}
}
```
Provider: maas
Created: 2025-08-08
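Dataset introduction

Background overview:
This synthetic video dataset was created by the FastVideo team. It contains latents generated with the Wan2.2-TI2V-5B-Diffusers model; each sample corresponds to 121 frames at a resolution of 704×1280 and is intended for the text-to-video task. The prompts were randomly sampled, and the dataset is fully compatible with the FastVideo repository, so it can be used directly without additional preprocessing.
The summary above was collected and generated by 遇见数据集.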



