Shenbi: accelerating video diffusion models with INT8-attention

Source: 中国科学数据 (China Scientific Data) · Updated 2026-04-23 · Indexed 2026-04-25

Download link:

https://www.sciengine.com/AA/doi/10.1007/s11432-025-4915-9

Resource description:

Video generation has achieved remarkable progress with the adoption of diffusion models and transformer-based architectures, enabling the creation of highly realistic and intricate video content. However, the computational cost of attention in diffusion transformers grows quadratically with video length, which poses a significant challenge for high-resolution and long-duration videos. Quantization is widely used to optimize inference, but quantizing the attention layers of video diffusion models remains difficult owing to their sensitivity to quantization errors. To tackle these issues, we propose Shenbi, the first INT8 attention solution for video diffusion models. Shenbi employs fine-grained block quantization and softmax-friendly quantization to ensure high precision. It further employs rowmax fusion and scale factor fusion techniques to reduce quantization and dequantization overhead, and adopts a hybrid quantization and attention block size strategy to maximize GPU utilization. The experimental results show that Shenbi delivers $2.63\times$ the throughput for attention kernels and a $1.58\times$ speedup for the entire model compared to the FP16 baseline, while preserving high model accuracy. The source code is available at https://github.com/HaiShuangFan/Shenbi
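To make the idea of fine-grained block quantization more concrete, the following is a minimal pure-Python sketch of symmetric per-block INT8 quantization applied to the attention scores $QK^\top$. It is not Shenbi's kernel and is not taken from the linked repository: the function names (`quantize_block`, `int8_scores`, and so on), the row-wise blocking, and the choice of one scale per block are all assumptions made for illustration. Softmax-friendly quantization, rowmax fusion, and scale factor fusion are not modeled here.

```python
def quantize_block(block):
    """Symmetric INT8 quantization of one block (a list of rows) with one shared scale."""
    amax = max(abs(x) for row in block for x in row)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [[max(-127, min(127, round(x / scale))) for x in row] for row in block]
    return q, scale


def quantize_blockwise(mat, block_rows):
    """Quantize each group of `block_rows` consecutive rows independently."""
    return [quantize_block(mat[i:i + block_rows])
            for i in range(0, len(mat), block_rows)]


def int8_scores(Q, K, block_rows):
    """Approximate S = Q K^T with integer dot products and per-block scales."""
    q_blocks = quantize_blockwise(Q, block_rows)
    k_blocks = quantize_blockwise(K, block_rows)
    S = []
    for q_int, q_scale in q_blocks:
        for q_row in q_int:
            out_row = []
            for k_int, k_scale in k_blocks:
                for k_row in k_int:
                    # Integer accumulation, as an INT8 matrix unit would do.
                    acc = sum(a * b for a, b in zip(q_row, k_row))
                    # Dequantize with the product of the two block scales.
                    out_row.append(acc * q_scale * k_scale)
            S.append(out_row)
    return S


def exact_scores(Q, K):
    """Reference S = Q K^T in floating point."""
    return [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]


def max_abs_error(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical toy input: one query row holds an outlier that would dominate a shared scale.
Q = [[0.01, 0.02, 0.03, 0.04], [100.0, 0.0, 0.0, 0.0]]
K = [[1.0, 1.0, 1.0, 1.0]]
reference = exact_scores(Q, K)
print("one scale per row:     ", max_abs_error(int8_scores(Q, K, 1), reference))
print("one scale for all rows:", max_abs_error(int8_scores(Q, K, 2), reference))
```

In this toy input, a single scale shared by both query rows is set by the outlier, so the small-valued row rounds to zero; giving each row its own scale keeps its entries representable. This is the general motivation for finer quantization granularity, under the assumptions stated above.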

Created:

2026-04-14
