# Awesome Video Attention

A curated list of recent papers on **efficient video attention** for video diffusion models, covering **sparsification**, **quantization**, **caching**, and related techniques.

> 📌 Sorted in **reverse chronological order** by arXiv submission date.

---

## Papers

- **[SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention](https://www.arxiv.org/abs/2509.24006)** (Sep 2025)
  SLA is a trainable attention method that combines sparse and linear attention to accelerate Diffusion Transformer models for video generation with minimal quality loss. It reduces attention computation by 95% without degrading end-to-end generation quality. An efficient GPU kernel for SLA yields a 13.7× speedup in attention computation and a 2.2× end-to-end speedup in video generation on Wan2.1-1.3B.

- **[Mixture of Contexts for Long Video Generation](https://arxiv.org/abs/2508.21058)** (Aug 2025)
  Recasts long-context video generation as an internal information-retrieval task and proposes a simple, learnable sparse attention routing module, Mixture of Contexts, as an effective long-term memory retrieval engine. By discarding 85% of the context, the approach achieves a 2.2× speedup.

- **[Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers](https://arxiv.org/abs/2506.03065)** (Jun 2025)
  Provides a detailed analysis of attention maps in video diffusion transformers and identifies three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Achieves a 2.09× theoretical FLOP reduction and a 1.76× inference speedup on CogVideoX while maintaining visual fidelity.

- **[Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas](https://arxiv.org/abs/2506.03275)** (Jun 2025)
  Exploits step-to-step activation redundancy in DiTs via dynamic sparsity and voxel-based token reordering. Implements efficient column-sparse GPU kernels and overlapping strategies to hide latency. Achieves up to a 3.72× speedup (HunyuanVideo) without retraining or quality loss.

- **[Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers](https://arxiv.org/abs/2506.05096)** (Jun 2025)
  Proposes an automatic framework combining lightweight token selection and GPU-parallel sparse attention. Uses evolutionary search to optimize token budgets across timesteps. Achieves up to a 2.4× speedup on 1 GPU and 13.2× on 8 GPUs with <0.5% quality drop on VBench.

- **[PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models](https://arxiv.org/abs/2506.16054)** (Jun 2025)
  Proposes a token reordering method that transforms irregular attention into hardware-friendly block-wise patterns, simplifying both sparsification and quantization. Achieves lossless visual generation with INT8/INT4 at roughly 20–30% density, yielding a 1.9×–2.7× latency speedup. (A toy block-sparse attention sketch follows this entry.)
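Several of the entries above (Sparse-vDiT, Chipmunk, PAROAttention) and many below share the same primitive: attention evaluated only on a subset of (query-block, key-block) pairs. The PyTorch snippet below is a minimal reference sketch of that primitive, not any paper's kernel: it materializes a block mask with diagonal plus vertical-stripe blocks and applies it to dense scores, whereas real implementations skip the masked blocks entirely with custom GPU kernels. The shapes and the `block_sparse_attention` helper are illustrative assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    """Toy reference: dense attention restricted by a block-level mask.

    q, k, v:    (batch, heads, seq_len, head_dim)
    block_mask: (seq_len // block_size, seq_len // block_size) boolean tensor;
                True keeps the corresponding (query-block, key-block) pair.
    A real block-sparse kernel would skip masked blocks instead of computing
    and then masking their scores.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, S, S)

    # Expand the block-level mask to token resolution.
    token_mask = block_mask.repeat_interleave(block_size, dim=0)
    token_mask = token_mask.repeat_interleave(block_size, dim=1)
    scores = scores.masked_fill(~token_mask, float("-inf"))

    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Example: 1024 tokens in 64-token blocks; keep the diagonal (local) blocks
# plus two "vertical stripe" key blocks that every query attends to.
seq_len, block_size = 1024, 64
num_blocks = seq_len // block_size
block_mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
block_mask.fill_diagonal_(True)
block_mask[:, :2] = True

q = torch.randn(1, 8, seq_len, 64)
k = torch.randn(1, 8, seq_len, 64)
v = torch.randn(1, 8, seq_len, 64)
out = block_sparse_attention(q, k, v, block_mask, block_size)
print(out.shape, f"kept density: {block_mask.float().mean():.1%}")
```

In a real kernel the masked blocks are never computed, so attention cost scales with the kept density rather than with the full S × S score matrix.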
- **[VMoBA: Mixture-of-Block Attention for Video Diffusion Models](https://arxiv.org/abs/2506.23858)** (Jun 2025)
  Proposes a sparse mixture-of-block attention mechanism that partitions video tokens into 1D, 2D, and 3D blocks to exploit spatio-temporal locality, achieving ≈2.9× lower FLOPs and 1.35× faster training/inference in long video generation.

- **[Radial Attention: Sparse Attention with Energy Decay for Long Video Generation](https://www.arxiv.org/abs/2506.19852)** (Jun 2025)
  Introduces a static $O(n\log n)$ attention mask inspired by spatiotemporal energy decay, enabling ~4× longer videos with up to a 1.9× speedup over dense attention in pretrained video diffusion models. (A toy distance-decaying mask sketch appears after the next entry.)

- **[FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion](https://arxiv.org/abs/2506.04648)** (Jun 2025)
  Develops a training-aware co-design of FP8 quantization and structured sparsity for 3D attention, yielding a 7.09× attention speedup (≈4.96× end-to-end) at 720p with negligible quality loss.
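The Radial Attention entry above describes a static mask whose attention span shrinks as spatiotemporal distance grows. The snippet below is a simplified, hypothetical stand-in for that idea rather than the paper's actual mask: the per-frame attention window shrinks roughly in proportion to frame distance, so the kept fraction per frame pair decays like 1/(d + 1) and the total mask size grows on the order of n log n in the number of frames.

```python
import torch

def distance_decay_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Toy token-level mask: the attention window between two frames shrinks
    with their temporal distance d, keeping roughly 1/(d + 1) of the keys.

    Illustrative stand-in only -- not the exact mask from Radial Attention.
    """
    seq_len = num_frames * tokens_per_frame
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for fi in range(num_frames):
        for fj in range(num_frames):
            d = abs(fi - fj)
            window = max(1, tokens_per_frame // (d + 1))  # narrower when farther
            for t in range(tokens_per_frame):
                q_idx = fi * tokens_per_frame + t
                lo = max(0, t - window // 2)
                hi = min(tokens_per_frame, t + window // 2 + 1)
                mask[q_idx, fj * tokens_per_frame + lo : fj * tokens_per_frame + hi] = True
    return mask

# 8 frames of 16 tokens each; the kept fraction drops well below 100%.
mask = distance_decay_mask(num_frames=8, tokens_per_frame=16)
print(f"kept fraction: {mask.float().mean():.1%}")
```

A mask like this can be plugged into a masked-attention reference such as the block-sparse sketch earlier in this list; production kernels instead exploit the structure directly so that compute scales with the kept entries.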
- **[Interspatial Attention for Efficient 4D Human Video Generation](https://arxiv.org/abs/2505.15800)** (May 2025)
  Introduces an interspatial attention (ISA) mechanism for diffusion-transformer-based video generation models, using relative positional encodings tailored for human videos. Achieves state-of-the-art 4D human video synthesis with motion consistency and identity preservation.

- **[MAGI-1: Autoregressive Video Generation at Scale](https://arxiv.org/abs/2505.13211)** (May 2025)
  `MagiAttention` is a distributed attention mechanism, i.e. a context-parallel (CP) strategy, that supports a wide variety of attention mask types with **kernel-level flexibility** while achieving **linear scalability** with respect to CP size across a broad range of scenarios. It is particularly suited to training on ultra-long, heterogeneous masks, such as video generation for MAGI-1.

- **[GRAT: Grouping First, Attending Smartly – Training-Free Acceleration for Diffusion Transformers](https://arxiv.org/abs/2505.14687)** (May 2025)
  Proposes GRAT, a training-free strategy that partitions tokens into GPU-friendly groups and restricts attention to structured regions. Delivers up to a 35.8× speedup for 8192×8192 generation on an A100 while preserving quality on pretrained Flux and HunyuanVideo.

- **[Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation](https://arxiv.org/abs/2505.18875)** (May 2025)
  Proposes a training-free method that reorders tokens based on semantic clustering to form dense blocks. Achieves up to a 2.3× speedup with minimal quality drop.

- **[DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance](https://arxiv.org/abs/2505.14708)** (May 2025)
  Proposes a training-free method that uses low-resolution draft attention to reorder tokens for structured sparse computation. Enables up to a 1.75× speedup on GPUs with superior video generation quality.

- **[VSA: Faster Video Diffusion with Trainable Sparse Attention](https://arxiv.org/abs/2505.13389)** (May 2025)
  Proposes a trainable sparse attention mechanism that routes attention to important tokens in video diffusion models, reducing training FLOPs by 2.5× and cutting inference latency from 31 s to 18 s without degrading generation quality.

- **[VORTA: Efficient Video Diffusion via Routing Sparse Attention](https://arxiv.org/abs/2505.18809)** (May 2025)
  Introduces a routing-based framework that replaces full 3D attention with specialized sparse patterns during sampling. Delivers a 1.76× end-to-end speedup (14.4× with distillation) without quality degradation.

- **[FastCAR: Cache Attentive Replay for Fast Auto-regressive Video Generation on the Edge](https://arxiv.org/abs/2505.14709)** (May 2025)
  Exploits temporal redundancy by caching MLP outputs between frames to skip redundant decoding steps. Achieves >2.1× faster decoding and better energy efficiency on edge devices.

- **[SageAttention3: Microscaling FP4 Attention and 8-Bit Training](https://arxiv.org/abs/2505.11594)** (May 2025)
  Leverages FP4 Tensor Cores on Blackwell GPUs to reach 1038 TOPS of attention throughput and explores 8-bit attention training with promising results. (A toy low-bit attention sketch follows this entry.)
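The SageAttention line of work (SageAttention3 above, SageAttention2 and SageAttention further below) keeps attention accurate while running its matmuls in low-bit formats. The sketch below shows only the core quantize → low-bit matmul → dequantize pattern for a symmetric INT8 case, with hypothetical helper names; it emulates the integer matmul in float32 for portability and omits the smoothing, per-block scales, and FP8/FP4 PV paths that the real kernels rely on.

```python
import torch

def int8_quantize(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns the int8 tensor and its scale."""
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_attention(q, k, v):
    """Toy low-bit attention: INT8 Q/K for the QK^T matmul, float softmax and PV.

    On real hardware the matmul runs on INT8 tensor cores with int32
    accumulation; here it is emulated in float32 so the example runs anywhere.
    """
    q_i8, q_scale = int8_quantize(q)
    k_i8, k_scale = int8_quantize(k)

    scores = torch.matmul(q_i8.float(), k_i8.float().transpose(-2, -1))
    scores = scores * (q_scale * k_scale / q.shape[-1] ** 0.5)  # dequantize + softmax scale
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

q, k, v = (torch.randn(1, 8, 256, 64) for _ in range(3))
out = int8_attention(q, k, v)

# Compare against full-precision attention to inspect the quantization error.
ref = torch.softmax((q @ k.transpose(-2, -1)) / 64 ** 0.5, dim=-1) @ v
print(f"max abs error vs. FP32 attention: {(out - ref).abs().max():.4f}")
```

Because the per-tensor scales preserve relative magnitudes of Q and K, the rounding error in the scores stays small before the softmax, which is why such low-bit schemes can be nearly lossless in practice.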
- **[Analysis of Attention in Video Diffusion Transformers](https://arxiv.org/abs/2504.10317)** (Apr 2025)
  Provides an in-depth study of attention in VDiTs, identifying three key attention properties: **Structure**, **Sparsity**, and **Sinks**. Shows that attention patterns are prompt-agnostic, that sparsity methods are not universally effective, and that attention sinks differ from those in language models. Suggests future directions for improving the efficiency-quality tradeoff.

- **[Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light](https://arxiv.org/abs/2504.16922)** (Apr 2025)
  Introduces a unified framework (GNA) for local sparse attention patterns (sliding window, strided, and blocked) and provides an analytical performance simulator. Implements GNA on NVIDIA Blackwell FMHA kernels, achieving up to 1.3 PFLOPS and 28%–46% end-to-end speedups on Cosmos-7B, FLUX, and HunyuanVideo without fine-tuning.

- **\[ICML 25\] [XAttention: Block Sparse Attention with Antidiagonal Scoring](https://arxiv.org/abs/2503.16428)** (Mar 2025)
  Proposes a sparse attention method using antidiagonal scoring for efficient block pruning, achieving up to a 13.5× speedup with minimal accuracy loss on long-context language and video benchmarks.

- **[Training-free and Adaptive Sparse Attention for Efficient Long Video Generation](https://arxiv.org/abs/2502.21079)** (Feb 2025)
  AdaSpa uses blockified adaptive sparse attention with an online cached search to reduce PFLOPs in long video generation while maintaining fidelity.

- **\[ICML 25\] [SpargeAttention: Accurate Sparse Attention Accelerating Any Model Inference](https://arxiv.org/abs/2502.18137)** (Feb 2025)
  Introduces a training-free two-stage filter for fast sparse attention inference. Achieves high speedups with no quality loss across LLMs, image, and video models.

- **[DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training](https://arxiv.org/abs/2502.07590)** (Feb 2025)
  Leverages dynamic attention sparsity and hybrid parallelism to achieve 3.02× training throughput on large-scale VDiT models.

- **\[ICML 25\] [Fast Video Generation with Sliding Tile Attention](https://arxiv.org/abs/2502.04507)** (Feb 2025)
  Restricts attention to a sliding 3D window, accelerating attention by 2.8–17× over FlashAttention-2 and reducing end-to-end latency by 27% without quality loss.

- **\[ICML 25\] [Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity](https://arxiv.org/abs/2502.01776)** (Feb 2025)
  Classifies attention heads as spatial vs. temporal and skips irrelevant computation. Yields ≈2.3× end-to-end speedups on modern video diffusion models.

- **\[ICML 25\] [SageAttention2: Efficient Attention with INT4 Quantization](https://arxiv.org/abs/2411.10958)** (Nov 2024)
  Combines INT4 $QK^\top$ and FP8 $PV$ with outlier smoothing to reach 3× higher throughput than FlashAttention2 while retaining high accuracy.

- **\[ICLR 25\] [SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration](https://arxiv.org/abs/2410.02367)** (Oct 2024)
  Pioneers 8-bit attention with an INT8+FP16 strategy and smoothing. Achieves 2.1×–2.7× speedups over baselines with negligible accuracy drop.

---

## Contributing

If your paper is related to attention in video generation, feel free to open a pull request!

---

## License

This project is released under the MIT License.