# Awesome Video Attention

A curated list of recent papers on **efficient video attention** for video diffusion models, covering **sparsification**, **quantization**, **caching**, and related techniques.

> 📌 Sorted in **reverse chronological order** by arXiv submission date.

---

## Papers

- **[SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention](https://www.arxiv.org/abs/2509.24006)** (Sep 2025)
  SLA is a trainable attention method that combines sparse and linear attention to accelerate Diffusion Transformer models for video generation with minimal quality loss. It reduces attention computation by 95% without degrading end-to-end generation quality. An efficient GPU kernel for SLA yields a 13.7× speedup in attention computation and a 2.2× end-to-end speedup in video generation on Wan2.1-1.3B.

- **[Mixture of Contexts for Long Video Generation](https://arxiv.org/abs/2508.21058)** (Aug 2025)
  Recasts long-context video generation as an internal information-retrieval task and proposes a simple, learnable sparse attention routing module, Mixture of Contexts, as an effective long-term memory retrieval engine. By discarding 85% of the context, the approach achieves a 2.2× speedup.

- **[Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers](https://arxiv.org/abs/2506.03065)** (Jun 2025)
  Provides a detailed analysis of attention maps in video diffusion transformers and identifies three recurring sparsity patterns: diagonal, multi-diagonal, and vertical-stripe structures. Achieves a 2.09× theoretical FLOP reduction and a 1.76× inference speedup on CogVideoX while maintaining visual fidelity.

- **[Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas](https://arxiv.org/abs/2506.03275)** (Jun 2025)
  Exploits step-to-step activation redundancy in DiTs via dynamic sparsity and voxel-based token reordering. Implements efficient column-sparse GPU kernels and overlapping strategies to hide latency. Achieves up to a 3.72× speedup (HunyuanVideo) without retraining or quality loss.

- **[Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers](https://arxiv.org/abs/2506.05096)** (Jun 2025)
  Proposes an automatic framework combining lightweight token selection and GPU-parallel sparse attention. Uses evolutionary search to optimize token budgets across timesteps. Achieves up to a 2.4× speedup on 1 GPU and 13.2× on 8 GPUs with <0.5% quality drop on VBench.

- **[PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models](https://arxiv.org/abs/2506.16054)** (Jun 2025)
  Proposes a token reordering method that transforms irregular attention into hardware-friendly block-wise patterns, simplifying both sparsification and quantization. Achieves lossless visual generation with INT8/INT4 at roughly 20–30% density, yielding a 1.9×–2.7× latency speedup. (A toy block-sparse attention sketch follows this entry.)
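Several of the entries above (Sparse-vDiT, Chipmunk, PAROAttention) and many below share the same primitive: attention evaluated only on a subset of (query-block, key-block) pairs. The PyTorch snippet below is a minimal reference sketch of that primitive, not any paper's kernel: it materializes a block mask with diagonal plus vertical-stripe blocks and applies it to dense scores, whereas real implementations skip the masked blocks entirely with custom GPU kernels. The shapes and the `block_sparse_attention` helper are illustrative assumptions.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    """Toy reference: dense attention restricted by a block-level mask.

    q, k, v:    (batch, heads, seq_len, head_dim)
    block_mask: (seq_len // block_size, seq_len // block_size) boolean tensor;
                True keeps the corresponding (query-block, key-block) pair.
    A real block-sparse kernel would skip masked blocks instead of computing
    and then masking their scores.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (B, H, S, S)

    # Expand the block-level mask to token resolution.
    token_mask = block_mask.repeat_interleave(block_size, dim=0)
    token_mask = token_mask.repeat_interleave(block_size, dim=1)
    scores = scores.masked_fill(~token_mask, float("-inf"))

    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Example: 1024 tokens in 64-token blocks; keep the diagonal (local) blocks
# plus two "vertical stripe" key blocks that every query attends to.
seq_len, block_size = 1024, 64
num_blocks = seq_len // block_size
block_mask = torch.zeros(num_blocks, num_blocks, dtype=torch.bool)
block_mask.fill_diagonal_(True)
block_mask[:, :2] = True

q = torch.randn(1, 8, seq_len, 64)
k = torch.randn(1, 8, seq_len, 64)
v = torch.randn(1, 8, seq_len, 64)
out = block_sparse_attention(q, k, v, block_mask, block_size)
print(out.shape, f"kept density: {block_mask.float().mean():.1%}")
```

In a real kernel the masked blocks are never computed, so attention cost scales with the kept density rather than with the full S × S score matrix.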
- **[VMoBA: Mixture-of-Block Attention for Video Diffusion Models](https://arxiv.org/abs/2506.23858)** (Jun 2025)
  Proposes a sparse mixture-of-block attention mechanism that partitions video tokens into 1D, 2D, and 3D blocks to exploit spatio-temporal locality, achieving ≈2.9× lower FLOPs and 1.35× faster training/inference in long video generation.

- **[Radial Attention: Sparse Attention with Energy Decay for Long Video Generation](https://www.arxiv.org/abs/2506.19852)** (Jun 2025)
  Introduces a static $O(n\log n)$ attention mask inspired by spatiotemporal energy decay, enabling ~4× longer videos with up to a 1.9× speedup over dense attention in pretrained video diffusion models. (A toy distance-decaying mask sketch appears after the next entry.)

- **[FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion](https://arxiv.org/abs/2506.04648)** (Jun 2025)
  Develops a training-aware co-design of FP8 quantization and structured sparsity for 3D attention, yielding a 7.09× attention speedup (≈4.96× end-to-end) at 720p with negligible quality loss.
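The Radial Attention entry above describes a static mask whose attention span shrinks as spatiotemporal distance grows. The snippet below is a simplified, hypothetical stand-in for that idea rather than the paper's actual mask: the per-frame attention window shrinks roughly in proportion to frame distance, so the kept fraction per frame pair decays like 1/(d + 1) and the total mask size grows on the order of n log n in the number of frames.

```python
import torch

def distance_decay_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Toy token-level mask: the attention window between two frames shrinks
    with their temporal distance d, keeping roughly 1/(d + 1) of the keys.

    Illustrative stand-in only -- not the exact mask from Radial Attention.
    """
    seq_len = num_frames * tokens_per_frame
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for fi in range(num_frames):
        for fj in range(num_frames):
            d = abs(fi - fj)
            window = max(1, tokens_per_frame // (d + 1))  # narrower when farther
            for t in range(tokens_per_frame):
                q_idx = fi * tokens_per_frame + t
                lo = max(0, t - window // 2)
                hi = min(tokens_per_frame, t + window // 2 + 1)
                mask[q_idx, fj * tokens_per_frame + lo : fj * tokens_per_frame + hi] = True
    return mask

# 8 frames of 16 tokens each; the kept fraction drops well below 100%.
mask = distance_decay_mask(num_frames=8, tokens_per_frame=16)
print(f"kept fraction: {mask.float().mean():.1%}")
```

A mask like this can be plugged into a masked-attention reference such as the block-sparse sketch earlier in this list; production kernels instead exploit the structure directly so that compute scales with the kept entries.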
- **[Interspatial Attention for Efficient 4D Human Video Generation](https://arxiv.org/abs/2505.15800)** (May 2025)
  Introduces an interspatial attention (ISA) mechanism for diffusion-transformer-based video generation models, using relative positional encodings tailored for human videos. Achieves state-of-the-art 4D human video synthesis with motion consistency and identity preservation.

- **[MAGI-1: Autoregressive Video Generation at Scale](https://arxiv.org/abs/2505.13211)** (May 2025)
  `MagiAttention` is a distributed attention mechanism, i.e. a context-parallel (CP) strategy, that supports a wide variety of attention mask types with **kernel-level flexibility** while achieving **linear scalability** with respect to CP size across a broad range of scenarios. It is particularly suited to training on ultra-long, heterogeneous masks, such as video generation for MAGI-1.

- **[GRAT: Grouping First, Attending Smartly – Training-Free Acceleration for Diffusion Transformers](https://arxiv.org/abs/2505.14687)** (May 2025)
  Proposes GRAT, a training-free strategy that partitions tokens into GPU-friendly groups and restricts attention to structured regions. Delivers up to a 35.8× speedup for 8192×8192 generation on an A100 while preserving quality on pretrained Flux and HunyuanVideo.

- **[Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation](https://arxiv.org/abs/2505.18875)** (May 2025)
  Proposes a training-free method that reorders tokens based on semantic clustering to form dense blocks. Achieves up to a 2.3× speedup with minimal quality drop.

- **[DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance](https://arxiv.org/abs/2505.14708)** (May 2025)
  Proposes a training-free method that uses low-resolution draft attention to reorder tokens for structured sparse computation. Enables up to a 1.75× speedup on GPUs with superior video generation quality.

- **[VSA: Faster Video Diffusion with Trainable Sparse Attention](https://arxiv.org/abs/2505.13389)** (May 2025)
  Proposes a trainable sparse attention mechanism that routes attention to important tokens in video diffusion models, reducing training FLOPs by 2.5× and cutting inference latency from 31 s to 18 s without degrading generation quality.

- **[VORTA: Efficient Video Diffusion via Routing Sparse Attention](https://arxiv.org/abs/2505.18809)** (May 2025)
  Introduces a routing-based framework that replaces full 3D attention with specialized sparse patterns during sampling. Delivers a 1.76× end-to-end speedup (14.4× with distillation) without quality degradation.

- **[FastCAR: Cache Attentive Replay for Fast Auto-regressive Video Generation on the Edge](https://arxiv.org/abs/2505.14709)** (May 2025)
  Exploits temporal redundancy by caching MLP outputs between frames to skip redundant decoding steps. Achieves >2.1× faster decoding and better energy efficiency on edge devices.

- **[SageAttention3: Microscaling FP4 Attention and 8-Bit Training](https://arxiv.org/abs/2505.11594)** (May 2025)
  Leverages FP4 Tensor Cores on Blackwell GPUs to reach 1038 TOPS of attention throughput and explores 8-bit attention training with promising results. (A toy low-bit attention sketch follows this entry.)
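The SageAttention line of work (SageAttention3 above, SageAttention2 and SageAttention further below) keeps attention accurate while running its matmuls in low-bit formats. The sketch below shows only the core quantize → low-bit matmul → dequantize pattern for a symmetric INT8 case, with hypothetical helper names; it emulates the integer matmul in float32 for portability and omits the smoothing, per-block scales, and FP8/FP4 PV paths that the real kernels rely on.

```python
import torch

def int8_quantize(x: torch.Tensor):
    """Symmetric per-tensor INT8 quantization: returns the int8 tensor and its scale."""
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def int8_attention(q, k, v):
    """Toy low-bit attention: INT8 Q/K for the QK^T matmul, float softmax and PV.

    On real hardware the matmul runs on INT8 tensor cores with int32
    accumulation; here it is emulated in float32 so the example runs anywhere.
    """
    q_i8, q_scale = int8_quantize(q)
    k_i8, k_scale = int8_quantize(k)

    scores = torch.matmul(q_i8.float(), k_i8.float().transpose(-2, -1))
    scores = scores * (q_scale * k_scale / q.shape[-1] ** 0.5)  # dequantize + softmax scale
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

q, k, v = (torch.randn(1, 8, 256, 64) for _ in range(3))
out = int8_attention(q, k, v)

# Compare against full-precision attention to inspect the quantization error.
ref = torch.softmax((q @ k.transpose(-2, -1)) / 64 ** 0.5, dim=-1) @ v
print(f"max abs error vs. FP32 attention: {(out - ref).abs().max():.4f}")
```

Because the per-tensor scales preserve relative magnitudes of Q and K, the rounding error in the scores stays small before the softmax, which is why such low-bit schemes can be nearly lossless in practice.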
- **[Analysis of Attention in Video Diffusion Transformers](https://arxiv.org/abs/2504.10317)** (Apr 2025)
  Provides an in-depth study of attention in VDiTs, identifying three key attention properties: **Structure**, **Sparsity**, and **Sinks**. Shows that attention patterns are prompt-agnostic, that sparsity methods are not universally effective, and that attention sinks differ from those in language models. Suggests future directions for improving the efficiency-quality tradeoff.

- **[Generalized Neighborhood Attention: Multi-dimensional Sparse Attention at the Speed of Light](https://arxiv.org/abs/2504.16922)** (Apr 2025)
  Introduces a unified framework (GNA) for local sparse attention patterns (sliding window, strided, and blocked) and provides an analytical performance simulator. Implements GNA on NVIDIA Blackwell FMHA kernels, achieving up to 1.3 PFLOPS and 28%–46% end-to-end speedups on Cosmos-7B, FLUX, and HunyuanVideo without fine-tuning.

- **\[ICML 25\] [XAttention: Block Sparse Attention with Antidiagonal Scoring](https://arxiv.org/abs/2503.16428)** (Mar 2025)
  Proposes a sparse attention method using antidiagonal scoring for efficient block pruning, achieving up to a 13.5× speedup with minimal accuracy loss on long-context language and video benchmarks.

- **[Training-free and Adaptive Sparse Attention for Efficient Long Video Generation](https://arxiv.org/abs/2502.21079)** (Feb 2025)
  AdaSpa uses blockified adaptive sparse attention with an online cached search to reduce PFLOPs in long video generation while maintaining fidelity.

- **\[ICML 25\] [SpargeAttention: Accurate Sparse Attention Accelerating Any Model Inference](https://arxiv.org/abs/2502.18137)** (Feb 2025)
  Introduces a training-free two-stage filter for fast sparse attention inference. Achieves high speedups with no quality loss across LLMs, image, and video models.

- **[DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training](https://arxiv.org/abs/2502.07590)** (Feb 2025)
  Leverages dynamic attention sparsity and hybrid parallelism to achieve 3.02× training throughput on large-scale VDiT models.

- **\[ICML 25\] [Fast Video Generation with Sliding Tile Attention](https://arxiv.org/abs/2502.04507)** (Feb 2025)
  Restricts attention to a sliding 3D window, accelerating attention by 2.8–17× over FlashAttention-2 and reducing end-to-end latency by 27% without quality loss.

- **\[ICML 25\] [Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity](https://arxiv.org/abs/2502.01776)** (Feb 2025)
  Classifies attention heads as spatial vs. temporal and skips irrelevant computation. Yields ≈2.3× end-to-end speedups on modern video diffusion models.

- **\[ICML 25\] [SageAttention2: Efficient Attention with INT4 Quantization](https://arxiv.org/abs/2411.10958)** (Nov 2024)
  Combines INT4 $QK^\top$ and FP8 $PV$ with outlier smoothing to reach 3× higher throughput than FlashAttention2 while retaining high accuracy.

- **\[ICLR 25\] [SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration](https://arxiv.org/abs/2410.02367)** (Oct 2024)
  Pioneers 8-bit attention with an INT8+FP16 strategy and smoothing. Achieves 2.1×–2.7× speedups over baselines with negligible accuracy drop.

---

## Contributing

If your paper is related to attention in video generation, feel free to open a pull request!

---

## License

This project is released under the MIT License.