├── png_figs
│   ├── fig1.png
│   ├── fig2.png
│   ├── fig3.png
│   ├── fig4.png
│   ├── fig5.png
│   ├── table2.png
│   ├── table3.png
│   ├── table4.png
│   ├── table5.png
│   ├── table6.png
│   ├── table7.png
│   ├── formula1.png
│   ├── formula2.png
│   ├── formula3.png
│   ├── formula4.png
│   ├── formula5.png
│   ├── fig1_with_caption.png
│   ├── fig2_with_caption.png
│   ├── fig3_with_caption.png
│   ├── fig4_with_caption.png
│   ├── table2_with_caption.png
│   ├── table3_with_caption.png
│   ├── table4_with_caption.png
│   ├── table5_with_caption.png
│   ├── table6_with_caption.png
│   └── table7_with_caption.png
├── resource
│   └── Attention_Survey.pdf
└── README.md

/README.md:
--------------------------------------------------------------------------------

# A Survey of Efficient Attention Methods

**Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention**

**PDF**: https://attention-survey.github.io/files/Attention_Survey.pdf

**Paper webpage**: https://attention-survey.github.io

![](./png_figs/fig2.png)

This paper provides a comprehensive survey of **Efficient Attention Methods**, categorizing them into four classes: hardware-efficient, compact, sparse, and linear attention.

-----

## Updates

- **[2025/8/19]** 🎉 Our survey paper is now publicly available on [GitHub](https://attention-survey.github.io/files/Attention_Survey.pdf)! If you find our resources helpful, please [cite our paper](#citation).

-----

## Class 1: Hardware-efficient Attention

💡 **Core Idea**: Accelerate attention by leveraging hardware characteristics, e.g., the GPU memory hierarchy and low-precision compute units.

📝 **Overall Formulations**:

Hardware-efficient attention for the prefilling stage can be formulated as:

$$O = \mathrm{softmax}\left(\frac{\Psi(Q)\,\Theta(K)^{\top}}{\sqrt{d}}\right)V,$$

where $\Psi(\cdot), \Theta(\cdot)$ are preprocessing functions that accelerate computation, e.g., quantization functions in SageAttention.

Hardware-efficient attention for the decoding stage can be formulated as:

$$o = \mathrm{softmax}\left(\frac{q\,\Psi(K_{\mathrm{cache}})^{\top}}{\sqrt{d}}\right)\Theta(V_{\mathrm{cache}}),$$

where $\Psi(\cdot), \Theta(\cdot)$ are KV-cache preprocessing functions.

---

An example is FlashAttention, which tiles $Q, K, V$ into blocks and computes the attention output $O$ progressively. This strategy avoids reading and writing the intermediate $S$ and $P$ matrices of shape $N \times N$ to global memory.
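To make the tiling idea concrete, below is a minimal, unoptimized PyTorch sketch of block-wise attention with an online softmax. It is only illustrative: the block size is arbitrary, it handles a single head without a causal mask, and real FlashAttention fuses these steps into one GPU kernel that keeps each tile in on-chip SRAM.

```python
import torch

def tiled_attention(Q, K, V, block_size=64):
    """Block-wise attention with online softmax (FlashAttention-style sketch).

    K/V are processed in tiles, so the full N x N score matrix S and the
    softmax matrix P are never materialized at once.
    """
    N, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)                       # running (unnormalized) output
    m = torch.full((N, 1), float("-inf"))         # running row-wise max of scores
    l = torch.zeros(N, 1)                         # running softmax normalizer

    for start in range(0, N, block_size):
        Kb = K[start:start + block_size]          # one K tile
        Vb = V[start:start + block_size]          # one V tile
        S = (Q @ Kb.T) * scale                    # scores for this tile only
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)              # rescale previous statistics
        P = torch.exp(S - m_new)                  # tile-local softmax numerator
        l = alpha * l + P.sum(dim=-1, keepdim=True)
        O = alpha * O + P @ Vb
        m = m_new

    return O / l                                  # normalize once at the end

# Matches the reference softmax(Q K^T / sqrt(d)) V on random inputs.
Q, K, V = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((Q @ K.T) * 32 ** -0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```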
---

The table below summarizes various hardware-efficient attention methods. 👇

![](./png_figs/table2.png)

-----

## Class 2: Compact Attention

💡 **Core Idea**: Compress the KV cache of attention through weight sharing or low-rank decomposition, while the computational cost stays essentially the same as with a full-sized KV cache.

📝 **Overall Formulations**:

---

The table below summarizes various compact attention approaches. 👇

![](./png_figs/table3.png)
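As a rough illustration of the weight-sharing route (not a formulation taken from the paper), the sketch below implements grouped-query-style attention in PyTorch: several query heads share each key/value head, so the cached $K, V$ tensors shrink by the group factor while the attention computation itself is unchanged. The head counts, shapes, and names here are illustrative.

```python
import torch

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads=8, n_kv_heads=2):
    """Grouped-query attention sketch: n_q_heads query heads share n_kv_heads
    KV heads, so the KV cache is n_q_heads / n_kv_heads times smaller than in
    full multi-head attention."""
    N, d_model = x.shape
    d_head = d_model // n_q_heads
    group = n_q_heads // n_kv_heads

    q = (x @ Wq).view(N, n_q_heads, d_head)      # (N, Hq, d)
    k = (x @ Wk).view(N, n_kv_heads, d_head)     # (N, Hkv, d) -- this is what gets cached
    v = (x @ Wv).view(N, n_kv_heads, d_head)     # (N, Hkv, d) -- this is what gets cached

    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)        # (N, Hq, d)
    v = v.repeat_interleave(group, dim=1)        # (N, Hq, d)

    scores = torch.einsum("nhd,mhd->hnm", q, k) / d_head ** 0.5
    attn = torch.softmax(scores, dim=-1)
    out = torch.einsum("hnm,mhd->nhd", attn, v)
    return out.reshape(N, d_model)

d_model, N = 256, 16
x = torch.randn(N, d_model)
Wq = torch.randn(d_model, d_model) / d_model ** 0.5
Wk = torch.randn(d_model, d_model // 4) / d_model ** 0.5   # KV projections are 4x smaller
Wv = torch.randn(d_model, d_model // 4) / d_model ** 0.5
y = grouped_query_attention(x, Wq, Wk, Wv)
print(y.shape)  # torch.Size([16, 256])
```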
-----

## Class 3: Sparse Attention

💡 **Core Idea**: Selectively perform a subset of the attention computation while omitting the rest.

📝 **Overall Formulations**:

---

The table below summarizes various sparse attention methods. 👇

![](./png_figs/table4.png)

-----

## Class 4: Linear Attention

💡 **Core Idea**: Redesign the computational formulation of attention to achieve $\mathcal{O}(N)$ time complexity.

📝 **Overall Formulations**:

---

### Computational Forms

Linear attention can be implemented in three forms: **parallel**, **recurrent**, and **chunkwise**.

![](./png_figs/fig3.png)
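To illustrate why the recurrent form runs in $\mathcal{O}(N)$ time, here is a minimal PyTorch sketch of ungated linear attention in its parallel and recurrent forms. It is a generic illustration (with an assumed ELU+1 feature map), not any particular method from the survey; the recurrent form carries only a constant-size state, independent of sequence length.

```python
import torch

def phi(x):
    # A simple positive feature map; the kernel choice varies across methods.
    return torch.nn.functional.elu(x) + 1

def linear_attention_parallel(Q, K, V):
    """Parallel form: (phi(Q) phi(K)^T) V with row-wise normalization, no softmax."""
    Qf, Kf = phi(Q), phi(K)
    scores = Qf @ Kf.T                            # (N, N) kernelized scores
    return (scores @ V) / (scores.sum(dim=-1, keepdim=True) + 1e-6)

def linear_attention_recurrent(Q, K, V):
    """Recurrent form: a (d, d_v) state S and a (d,) normalizer z are updated
    per token, so time and memory grow linearly with sequence length."""
    N, d = Q.shape
    Qf, Kf = phi(Q), phi(K)
    S = torch.zeros(d, V.shape[-1])               # running sum of phi(k_t) v_t^T
    z = torch.zeros(d)                            # running sum of phi(k_t)
    out = []
    for t in range(N):
        S = S + torch.outer(Kf[t], V[t])          # constant-size state update
        z = z + Kf[t]
        out.append((Qf[t] @ S) / (Qf[t] @ z + 1e-6))
    return torch.stack(out)

Q, K, V = (torch.randn(64, 16) for _ in range(3))
# The recurrent form is causal by construction, while this parallel form has no
# mask, so the two coincide only at the last position.
last_parallel = linear_attention_parallel(Q, K, V)[-1]
last_recurrent = linear_attention_recurrent(Q, K, V)[-1]
assert torch.allclose(last_parallel, last_recurrent, atol=1e-4)
```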
---

### Gating Mechanisms

Many linear attention methods incorporate **forget gates** and **select gates**.

Based on the presence of these gates, we can classify linear attention methods as follows:

1. **Naive Linear Attention (No Gates)**

   📝 The table below summarizes naive linear attention methods. 👇

   ![](./png_figs/table5.png)

2. **Linear Attention with a Forget Gate**

   📝 The table below compares methods that use a forget gate. 👇

   ![](./png_figs/table6.png)

3. **Linear Attention with Forget and Select Gates**

   📝 The table below compares methods that use both a forget gate and a select gate. 👇

   ![](./png_figs/table7.png)

### A Special Case: Test-Time Training (TTT)

A unique approach, **Test-Time Training (TTT)**, treats the hidden state of linear attention as learnable parameters, which are updated online as the sequence is processed.

-----

## Citation

If you find our work helpful, please cite our paper:

```
@article{zhangsurvey,
  title={Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention},
  author={Zhang, Jintao and Su, Rundong and Liu, Chunyu and Wei, Jia and Wang, Ziteng and Zhang, Pengle and Wang, Haoxu and Jiang, Huiqiang and Huang, Haofeng and Xiang, Chendong and others}
}
```

--------------------------------------------------------------------------------