├── README.md
├── formula.md
├── gqa.png
├── mha.png
├── mla.ipynb
├── mla.png
├── mla_from_paper.png
└── mqa.png

/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/README.md
--------------------------------------------------------------------------------

/formula.md:
--------------------------------------------------------------------------------
### Part 3

One final detail: the finished version of MLA also switches the input of Q to a low-rank projection. This has nothing to do with reducing the KV cache; it mainly reduces the memory occupied during training by the parameters and their gradients (the original paper says activations, which I personally find hard to understand):

$$
\begin{equation}
\begin{gathered}
\boldsymbol{o}_t = \left[\boldsymbol{o}_t^{(1)}, \boldsymbol{o}_t^{(2)}, \cdots, \boldsymbol{o}_t^{(h)}\right] \\[10pt]
\boldsymbol{o}_t^{(s)} = Attention\left(\boldsymbol{q}_t^{(s)}, \boldsymbol{k}_{\leq t}^{(s)} ,\boldsymbol{v}_{\leq t}^{(s)}\right)\triangleq\frac{\sum_{i\leq t}\exp\left(\boldsymbol{q}_t^{(s)} \boldsymbol{k}_i^{(s)}{}^{\top}\right)\boldsymbol{v}_i^{(s)}}{\sum_{i\leq t}\exp\left(\boldsymbol{q}_t^{(s)} \boldsymbol{k}_i^{(s)}{}^{\top}\right)} \\[15pt]
\boldsymbol{q}_i^{(s)} = \left[\boldsymbol{c}_i'\boldsymbol{W}_{qc}^{(s)}, \boldsymbol{c}_i'\boldsymbol{W}_{qr}^{(s)}\color{#3ce2f7}{\boldsymbol{\mathcal{R}}_i}\right]\in\mathbb{R}^{d_k + d_r},\quad \boldsymbol{W}_{qc}^{(s)}\in\mathbb{R}^{d_c'\times d_k},\boldsymbol{W}_{qr}^{(s)}\in\mathbb{R}^{d_c'\times d_r}\\
\boldsymbol{k}_i^{(s)} = \left[\boldsymbol{c}_i\boldsymbol{W}_{kc}^{(s)}, \boldsymbol{x}_i\boldsymbol{W}_{kr}^{\color{#ccc}{\smash{\bcancel{(s)}}}}\color{#3ce2f7}{\boldsymbol{\mathcal{R}}_i}\right]\in\mathbb{R}^{d_k+d_r},\quad \boldsymbol{W}_{kc}^{(s)}\in\mathbb{R}^{d_c\times d_k}, \boldsymbol{W}_{kr}^{\color{#ccc}{\smash{\bcancel{(s)}}}}\in\mathbb{R}^{d\times d_r} \\
\boldsymbol{v}_i^{(s)} = \boldsymbol{c}_i\boldsymbol{W}_v^{(s)}\in\mathbb{R}^{d_v},\quad \boldsymbol{W}_v^{(s)}\in\mathbb{R}^{d_c\times d_v} \\[10pt]
\boldsymbol{c}_i' = \boldsymbol{x}_i \boldsymbol{W}_c'\in\mathbb{R}^{d_c'},\quad \boldsymbol{W}_c'\in\mathbb{R}^{d\times d_c'} \\
\boldsymbol{c}_i = \boldsymbol{x}_i \boldsymbol{W}_c\in\mathbb{R}^{d_c},\quad \boldsymbol{W}_c\in\mathbb{R}^{d\times d_c} \\
\end{gathered}
\end{equation}
$$

Note the second component of $\boldsymbol{k}_i^{(s)}$, the RoPE part: its input is still $\boldsymbol{x}_i$ rather than $\boldsymbol{c}_i$. This keeps the original paper's setup and is not a typo. Also, the paper sets $d_c' = 1536$, which differs from $d_c = 512$. (A code sketch of these projections follows at the end of this note.)

At inference time, MLA is instead rewritten as

$$
\begin{equation}
\begin{gathered}
\boldsymbol{o}_t = \left[\boldsymbol{o}_t^{(1)}\boldsymbol{W}_v^{(1)}, \boldsymbol{o}_t^{(2)}\boldsymbol{W}_v^{(2)}, \cdots, \boldsymbol{o}_t^{(h)}\boldsymbol{W}_v^{(h)}\right] \\[10pt]
\boldsymbol{o}_t^{(s)} = Attention\left(\boldsymbol{q}_t^{(s)}, \boldsymbol{k}_{\leq t}^{\color{#ccc}{\smash{\bcancel{(s)}}}} ,\boldsymbol{c}_{\leq t}\right)\triangleq\frac{\sum_{i\leq t}\exp\left(\boldsymbol{q}_t^{(s)} \boldsymbol{k}_i^{\color{#ccc}{\smash{\bcancel{(s)}}}}{}^{\top}\right)\boldsymbol{c}_i}{\sum_{i\leq t}\exp\left(\boldsymbol{q}_t^{(s)} \boldsymbol{k}_i^{\color{#ccc}{\smash{\bcancel{(s)}}}}{}^{\top}\right)} \\[15pt]
\boldsymbol{q}_i^{(s)} = \left[\boldsymbol{c}_i'\boldsymbol{W}_{qc}^{(s)}\boldsymbol{W}_{kc}^{(s)}{}^{\top}, \boldsymbol{c}_i'\boldsymbol{W}_{qr}^{(s)}\color{#3ce2f7}{\boldsymbol{\mathcal{R}}_i}\right]\in\mathbb{R}^{d_c + d_r}\\
\boldsymbol{k}_i^{\color{#ccc}{\smash{\bcancel{(s)}}}} = \left[\boldsymbol{c}_i, \boldsymbol{x}_i\boldsymbol{W}_{kr}^{\color{#ccc}{\smash{\bcancel{(s)}}}}\color{#3ce2f7}{\boldsymbol{\mathcal{R}}_i}\right]\in\mathbb{R}^{d_c+d_r}\\
\boldsymbol{W}_{qc}^{(s)}\in\mathbb{R}^{d_c'\times d_k},\boldsymbol{W}_{kc}^{(s)}\in\mathbb{R}^{d_c\times d_k},\boldsymbol{W}_{qr}^{(s)}\in\mathbb{R}^{d_c'\times d_r},\boldsymbol{W}_{kr}^{\color{#ccc}{\smash{\bcancel{(s)}}}}\in\mathbb{R}^{d\times d_r} \\[10pt]
\boldsymbol{c}_i' = \boldsymbol{x}_i \boldsymbol{W}_c'\in\mathbb{R}^{d_c'},\quad \boldsymbol{W}_c'\in\mathbb{R}^{d\times d_c'} \\
\boldsymbol{c}_i = \boldsymbol{x}_i \boldsymbol{W}_c\in\mathbb{R}^{d_c},\quad \boldsymbol{W}_c\in\mathbb{R}^{d\times d_c} \\
\end{gathered}
\end{equation}
$$

The head size of Q and K now becomes $d_c + d_r$, and that of V becomes $d_c$; under the original paper's settings this is roughly 4 times $d_k$ and $d_v$. So although the transformation MLA performs at inference effectively shrinks the KV cache, it actually increases the amount of computation.

Why, then, does it still improve inference efficiency? This brings us back to the issue discussed in the "bottleneck" section. LLM inference splits into two parts: generating the first token (prefill) and generating each subsequent token (generation). Prefill computes over all input tokens in parallel and then stores the corresponding KV cache; here compute, bandwidth, and memory are all bottlenecks, and while MLA adds computation, the smaller KV cache also relieves memory and bandwidth pressure, so the two roughly break even. Generation, however, processes only one token per step, so it is dominated by bandwidth and memory bottlenecks; introducing MLA should therefore noticeably speed up generation.

One more detail nicely illustrates this property. Typical LLM architectures satisfy $h \times d_k = d$, i.e. num_heads * head_size = hidden_size, but DeepSeek-V2 differs: with $d_k = 128$ and $d = 5120$ it uses $h = 128$, more than 3 times the usual setting! This works because the size of MLA's KV cache is independent of $h$: increasing $h$ only adds computation and model capacity, without growing the KV cache, so it introduces no speed bottleneck.
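To make the shapes concrete, here is a minimal PyTorch sketch of the training-time projections above, for a single head. All variable names are mine, `apply_rope` is an identity stand-in for $\boldsymbol{\mathcal{R}}_i$, and $d_r = 64$ is taken from the paper rather than from the formulas above:

```python
import torch

# Hypothetical names; dimensions follow DeepSeek-V2.
d, d_c, d_cq, d_k, d_v, d_r = 5120, 512, 1536, 128, 128, 64

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for multiplying by R_i; a real implementation rotates
    pairs of dimensions by position-dependent angles."""
    return x

torch.manual_seed(0)

# Shared low-rank down-projections (no head index s).
W_c  = torch.randn(d, d_c)  / d ** 0.5   # x_i -> c_i  (what gets cached)
W_cq = torch.randn(d, d_cq) / d ** 0.5   # x_i -> c'_i (query-side latent)
W_kr = torch.randn(d, d_r)  / d ** 0.5   # RoPE key part, shared across heads

# Per-head up-projections (a single head s shown).
W_qc = torch.randn(d_cq, d_k) / d_cq ** 0.5
W_qr = torch.randn(d_cq, d_r) / d_cq ** 0.5
W_kc = torch.randn(d_c, d_k)  / d_c ** 0.5
W_v  = torch.randn(d_c, d_v)  / d_c ** 0.5

x = torch.randn(10, d)           # 10 tokens of hidden size d
c, cq = x @ W_c, x @ W_cq        # latents: (10, d_c) and (10, d_cq)

# q, k, v exactly as in the training-time formulas; note that the
# RoPE part of k is computed from x_i, not from c_i.
q = torch.cat([cq @ W_qc, apply_rope(cq @ W_qr)], dim=-1)  # (10, d_k + d_r)
k = torch.cat([c @ W_kc, apply_rope(x @ W_kr)], dim=-1)    # (10, d_k + d_r)
v = c @ W_v                                                # (10, d_v)
print(q.shape, k.shape, v.shape)  # (10, 192) (10, 192) (10, 128)
```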
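And here is a self-contained sketch of the identity that justifies the inference-time rewrite: absorbing $\boldsymbol{W}_{kc}^{(s)}$ into the query leaves the attention scores unchanged, so only $\boldsymbol{c}_i$ (plus the shared $d_r$-dimensional RoPE key) needs to be cached. Again the names are mine, and $d_r = 64$ is the paper's value, not one stated above:

```python
import torch

# The non-RoPE part of the score satisfies
#   (c' W_qc) (c W_kc)^T == (c' W_qc W_kc^T) c^T,
# so at inference W_kc is folded into the query and k_c is never formed.
d_c, d_cq, d_k = 512, 1536, 128
torch.manual_seed(0)

cq = torch.randn(10, d_cq)                  # query latents c'_t (10 queries)
c  = torch.randn(20, d_c)                   # cached latents c_{<=t} (20 keys)
W_qc = torch.randn(d_cq, d_k) / d_cq ** 0.5
W_kc = torch.randn(d_c, d_k)  / d_c ** 0.5

scores_train = (cq @ W_qc) @ (c @ W_kc).T   # training-time pairing, head size d_k
scores_infer = (cq @ W_qc @ W_kc.T) @ c.T   # absorbed pairing, head size d_c
assert torch.allclose(scores_train, scores_infer, atol=1e-2)  # equal up to fp32 rounding

# KV cache per token per layer: d_c + d_r floats, independent of h.
d_r, h = 64, 128                            # paper's RoPE dim and head count
print(d_c + d_r)                            # 576 for MLA
print(2 * h * d_k)                          # 32768 for MHA-style K+V caching
```

The second print makes the $h$-independence tangible: under these assumed dimensions, doubling $h$ doubles the MHA-style cache but leaves the MLA cache at 576 floats per token per layer.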
--------------------------------------------------------------------------------

/gqa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/gqa.png
--------------------------------------------------------------------------------

/mha.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/mha.png
--------------------------------------------------------------------------------

/mla.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/mla.png
--------------------------------------------------------------------------------

/mla_from_paper.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/mla_from_paper.png
--------------------------------------------------------------------------------

/mqa.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/preacher-1/MLA_tutorial/20ec3f385f07f44ad47549057946a52e847afeb0/mqa.png
--------------------------------------------------------------------------------