Wan2.2-Syn-121x704x1280_32k

Name: Wan2.2-Syn-121x704x1280_32k
Creator: maas
Published: 2026-04-28 16:43:22
License: No description available

ModelScope community; updated 2026-04-28; indexed 2025-08-09

Download link:

https://modelscope.cn/datasets/FastVideo/Wan2.2-Syn-121x704x1280_32k


Resource description:

# FastVideo Synthetic Wan2.2 720P dataset

<p align="center">
  <img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.png" width="200"/>
</p>
<div>
  <div align="center">
    <a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
  </div>
  <div align="center">
    <a href="https://arxiv.org/abs/2505.13389">Paper</a> |
    <a href="https://github.com/hao-ai-lab/FastVideo">GitHub</a> |
    <a href="https://hao-ai-lab.github.io/FastVideo">Project Page</a>
  </div>
</div>

## Abstract

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at *both* training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight *critical tokens*; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPs by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models.

## Dataset Overview

- The prompts were randomly sampled from the [Vchitect_T2V_DataVerse](https://huggingface.co/datasets/Vchitect/Vchitect_T2V_DataVerse) dataset.
- Each sample was generated with the **Wan2.2-TI2V-5B-Diffusers** model, and its latents were stored.
- Each latent sample corresponds to a video of **121 frames** at **704×1280** resolution.
- It includes all preprocessed latents required for the **Text-to-Video (T2V)** task; the first-frame image is also included.
- The dataset is fully compatible with the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository and can be loaded and used directly, without any additional preprocessing.

## Sample Usage

To download this dataset, ensure you have Git LFS installed, then clone the repository:

```bash
git lfs install
git clone https://huggingface.co/datasets/FastVideo/Wan2.2-Syn-121x704x1280_32k
```

This dataset contains preprocessed latents ready for Text-to-Video (T2V) tasks and is designed to be used directly with the [FastVideo repository](https://github.com/hao-ai-lab/FastVideo) without further preprocessing. Refer to the FastVideo [documentation](https://hao-ai-lab.github.io/FastVideo) for detailed instructions on how to load and use the dataset for training or finetuning.

If you use the FastVideo Synthetic Wan2.2 dataset in your research, please cite our papers:

```
@article{zhang2025vsa,
  title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
  author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
  journal={arXiv preprint arXiv:2505.13389},
  year={2025}
}

@article{zhang2025fast,
  title={Fast video generation with sliding tile attention},
  author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
  journal={arXiv preprint arXiv:2502.04507},
  year={2025}
}
```
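
As a rough sanity check on the speedup figures quoted in the Abstract (attention 6× faster, end-to-end time 31s to 18s), the sketch below applies Amdahl's law to estimate what share of the original runtime attention would have to occupy for both numbers to hold. This is an illustrative back-of-envelope calculation, not part of the FastVideo codebase: the function name `implied_attention_fraction` is hypothetical, and it assumes that only attention was accelerated and that the 6× factor applies uniformly to all attention time.

```python
def implied_attention_fraction(t_before: float, t_after: float, attn_speedup: float) -> float:
    """Share of the original runtime spent in attention, under Amdahl's law.

    Assumption (not stated in the source): everything except attention takes
    the same time before and after, so
        t_after / t_before = (1 - f) + f / attn_speedup
    which is solved here for f.
    """
    ratio = t_after / t_before
    return (1.0 - ratio) / (1.0 - 1.0 / attn_speedup)


# Figures quoted in the Abstract for the retrofitted Wan-2.1 model.
f = implied_attention_fraction(t_before=31.0, t_after=18.0, attn_speedup=6.0)
end_to_end_speedup = 31.0 / 18.0

print(f"end-to-end speedup: {end_to_end_speedup:.2f}x")
print(f"implied attention share of original runtime: {f:.1%}")
```

Under these assumptions the end-to-end speedup is about 1.72× and attention would account for roughly half of the original generation time, which is consistent with the Abstract's framing of attention as the dominant cost.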


Provider: maas

Created: 2025-08-08
