├── .github
│   └── workflows
│       └── static.yml
├── .gitignore
├── CNAME
├── LICENSE
├── README.md
├── index.html
└── static
    ├── css
    │   ├── bulma-carousel.min.css
    │   ├── bulma-slider.min.css
    │   ├── bulma.css.map.txt
    │   ├── bulma.min.css
    │   ├── fontawesome.all.min.css
    │   └── index.css
    ├── images
    │   ├── CLKV_Algorithm.jpg
    │   ├── CLKV_main_crop_x.jpg
    │   ├── CLKV_method_crop_v2.jpg
    │   ├── Figure2.png
    │   ├── hf.svg
    │   ├── hfbw.svg
    │   ├── longbench.png
    │   ├── main_exps_crop.jpg
    │   ├── run.ico
    │   └── run.svg
    ├── js
    │   ├── bulma-carousel.js
    │   ├── bulma-carousel.min.js
    │   ├── bulma-slider.js
    │   ├── bulma-slider.min.js
    │   ├── fontawesome.all.min.js
    │   └── index.js
    ├── pdfs
    │   └── Motion_Mamba_Slides_miHoYo.pdf
    ├── scholar.html
    └── videos
        ├── badminton.mp4
        ├── banner_video.mp4
        ├── circle.mp4
        ├── somersault.mp4
        ├── stand-up.mp4
        ├── street-dance.mp4
        └── walk.mp4

/.github/workflows/static.yml:
--------------------------------------------------------------------------------
 1 | # Simple workflow for deploying static content to GitHub Pages
 2 | name: Deploy static content to Pages
 3 | 
 4 | on:
 5 |   # Runs on pushes targeting the default branch
 6 |   push:
 7 |     branches: ["main"]
 8 | 
 9 |   # Allows you to run this workflow manually from the Actions tab
10 |   workflow_dispatch:
11 | 
12 | # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
13 | permissions:
14 |   contents: read
15 |   pages: write
16 |   id-token: write
17 | 
18 | # Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
19 | # However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
20 | concurrency:
21 |   group: "pages"
22 |   cancel-in-progress: false
23 | 
24 | jobs:
25 |   # Single deploy job since we're just deploying
26 |   deploy:
27 |     environment:
28 |       name: github-pages
29 |       url: ${{ steps.deployment.outputs.page_url }}
30 |     runs-on: ubuntu-latest
31 |     steps:
32 |       - name: Checkout
33 |         uses: actions/checkout@v4
34 |       - name: Setup Pages
35 |         uses: actions/configure-pages@v5
36 |       - name: Upload artifact
37 |         uses: actions/upload-pages-artifact@v3
38 |         with:
39 |           # Upload entire repository
40 |           path: '.'
41 |       - name: Deploy to GitHub Pages
42 |         id: deployment
43 |         uses: actions/deploy-pages@v4
44 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | .idea/
--------------------------------------------------------------------------------
/CNAME:
--------------------------------------------------------------------------------
1 | minicache.vmv.re
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | BSD 2-Clause License
 2 | 
 3 | Copyright (c) 2024, Akide Liu
 4 | 
 5 | Redistribution and use in source and binary forms, with or without
 6 | modification, are permitted provided that the following conditions are met:
 7 | 
 8 | 1. Redistributions of source code must retain the above copyright notice, this
 9 |    list of conditions and the following disclaimer.
10 | 
11 | 2. Redistributions in binary form must reproduce the above copyright notice,
12 |    this list of conditions and the following disclaimer in the documentation
13 |    and/or other materials provided with the distribution.
14 | 
15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
16 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
17 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
18 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
19 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
20 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
21 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
22 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
23 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
24 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
25 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # MiniCache
--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for repetitive computations and thereby lowering latency in autoregressive generation. However, the size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation. In this paper, we present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective, significantly reducing the memory footprint for LLM inference. Our approach is based on the observation that KV cache states exhibit high similarity between adjacent layers in the middle-to-deep portion of LLMs. To facilitate merging, we propose disentangling the states into magnitude and direction components, interpolating the directions of the state vectors while keeping their lengths unchanged. Furthermore, we introduce a token retention strategy to keep highly distinct state pairs unmerged, thus preserving the information with minimal additional storage overhead. MiniCache is training-free and general, complementing existing KV cache compression strategies such as quantization and sparsity. We conduct a comprehensive evaluation of MiniCache on various models, including LLaMA-2, LLaMA-3, Phi-3, Mistral, and Mixtral, across multiple benchmarks, demonstrating its exceptional performance in achieving superior compression ratios and high throughput. On the ShareGPT dataset, LLaMA-2-7B with 4-bit MiniCache achieves a remarkable compression ratio of up to 5.02x, enhances inference throughput by approximately 5x, and reduces the memory footprint by 41% compared to the FP16 full cache baseline, all while maintaining near-lossless performance.
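One way to read the magnitude/direction disentanglement above is as spherical interpolation of unit directions with per-token L2 norms kept aside and reapplied afterwards. The sketch below is an illustrative NumPy rendering of that idea, not the paper's exact implementation; `slerp` and `merge_and_restore` are names introduced here.

```python
import numpy as np

def slerp(u, v, t, eps=1e-8):
    """Spherical interpolation between rows of unit vectors u and v."""
    cos = np.clip(np.sum(u * v, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    sin = np.sin(omega)
    lin = (1.0 - t) * u + t * v  # fallback when directions are nearly parallel
    sph = (np.sin((1.0 - t) * omega) * u + np.sin(t * omega) * v) / np.where(sin > eps, sin, 1.0)
    return np.where(sin > eps, sph, lin)

def merge_and_restore(x_prev, x_l, t=0.5):
    """Merge token states from adjacent layers, then rebuild each layer.

    Only the directions are interpolated; each layer's per-token L2
    magnitudes are stored separately and reapplied on restoration, so
    vector lengths stay unchanged.
    """
    mag_prev = np.linalg.norm(x_prev, axis=-1, keepdims=True)
    mag_l = np.linalg.norm(x_l, axis=-1, keepdims=True)
    shared = slerp(x_prev / mag_prev, x_l / mag_l, t)
    shared /= np.linalg.norm(shared, axis=-1, keepdims=True)  # keep unit norm
    return shared * mag_prev, shared * mag_l
```

Note that when the two layers' states point in the same direction, the round trip is exact; the merged direction only has to absorb the angular difference, never the length difference.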
Overview of our MiniCache strategy and example results: (a) shows that the KV cache states of two adjacent layers are highly similar, particularly across the middle-to-deep layers; the x-axis uses index/2 to denote the similarity for each pair of layers. (b) compares the performance of MiniCache and the mean baseline, which simply averages the KV caches of two layers, using the LLaMA-3-70B model on the GSM8K dataset; MiniCache, which begins merging from the half-layer depth, achieves near-lossless performance. (c) highlights the primary difference between MiniCache and previous approaches: MiniCache investigates the inter-layer redundancy of KV caches along the depth dimension of LLMs, an aspect overlooked by intra-layer-based methods. Here, T refers to the last timestamp of pre-filling, and T+1 denotes the first timestamp of decoding.
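The adjacent-layer similarity in panel (a) can be reproduced in a few lines; the sketch below assumes per-token cosine similarity averaged over tokens, and `adjacent_layer_similarity` is a name of our own, not from the released code.

```python
import numpy as np

def adjacent_layer_similarity(layer_states):
    """Mean cosine similarity of token states between each pair of
    adjacent layers, mirroring the pairwise analysis in panel (a).

    layer_states: list of (num_tokens, dim) arrays, one per layer.
    """
    sims = []
    for lower, upper in zip(layer_states[:-1], layer_states[1:]):
        a = lower / np.linalg.norm(lower, axis=-1, keepdims=True)
        b = upper / np.linalg.norm(upper, axis=-1, keepdims=True)
        # Average the per-token cosine similarities for this layer pair.
        sims.append(float(np.mean(np.sum(a * b, axis=-1))))
    return sims
```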
Overview of our explorations and observations: (a) shows the strong baseline obtained by performing average merging on the KV cache. (b) shows the pairwise similarity of cache states between adjacent layers. (c) compares MiniCache, the simple average, and the full cache baseline across five different datasets.
Illustration of the proposed MiniCache method. (a) depicts the cross-layer compression process. We fetch the KV caches from layers l and l-1 and merge them into shared states via Eq. (3). Additionally, we compute the ℒ2 norm of the caches to obtain their magnitudes. Furthermore, we select unmergeable tokens for retention, then store the merged cache, retention tokens, and magnitudes at layer l in C. (b) illustrates the restoration process for layers l and l-1, which includes magnitude rescaling via Eq. (2) and retention-token recovery.
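A minimal sketch of the compression/restoration round trip in (a) and (b): it substitutes a normalized mean of the two unit directions for Eq. (3) and a cosine threshold for the retention criterion, so it illustrates the data flow rather than the exact method; all names here are ours.

```python
import numpy as np

def compress_pair(x_prev, x_l, angle_threshold=0.5, eps=1e-8):
    """Cross-layer compression for layers l-1 and l (cf. panel (a)).

    Stores one shared cache (here: the normalized mean of the two unit
    directions, standing in for Eq. (3)), the per-token magnitudes, and
    the raw states of retained tokens whose directions are too distinct.
    """
    mag_prev = np.linalg.norm(x_prev, axis=-1, keepdims=True)
    mag_l = np.linalg.norm(x_l, axis=-1, keepdims=True)
    d_prev, d_l = x_prev / (mag_prev + eps), x_l / (mag_l + eps)
    cos = np.sum(d_prev * d_l, axis=-1)
    keep = cos < np.cos(angle_threshold)          # highly distinct pairs
    merged = d_prev + d_l
    merged /= np.linalg.norm(merged, axis=-1, keepdims=True) + eps
    return {"merged": merged, "mags": (mag_prev, mag_l),
            "keep_idx": np.nonzero(keep)[0],
            "keep_vals": (x_prev[keep], x_l[keep])}

def restore_pair(cache):
    """Restoration for both layers (cf. panel (b)): rescale the shared
    directions by each layer's magnitudes, then put retained tokens back."""
    mag_prev, mag_l = cache["mags"]
    x_prev = cache["merged"] * mag_prev
    x_l = cache["merged"] * mag_l
    x_prev[cache["keep_idx"]] = cache["keep_vals"][0]
    x_l[cache["keep_idx"]] = cache["keep_vals"][1]
    return x_prev, x_l
```

Tokens whose directions nearly agree are reconstructed from the shared cache; the retained tokens are recovered verbatim, which is why the extra storage overhead stays small.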
Performance comparisons of our proposed MiniCache with the “averaging baseline” and the “unmerged full cache baseline” on multiple datasets with Phi-3-Mini, Mixtral-8x7B, LLaMA-3-8B, and LLaMA-3-70B. More result details are shown in Section 4. The x-axis indicates the number of merged layers; as more layers are merged, a greater reduction in memory usage is achieved.
Evaluation of different KV cache compression methods on LongBench. MiniCache builds on top of 4-bit KIVI and achieves the best performance with the strongest compression ratio.
Overall prefilling and decoding logic of MiniCache, which performs cross-layer merging and recovery within our framework.
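The merging schedule this figure implies (keep shallow layers intact, merge adjacent pairs from the half-layer depth onward) can be sketched as a small driver; `merge_fn` is a placeholder for the actual magnitude/direction merge, and `prefill_with_minicache` is a name we introduce for illustration.

```python
def prefill_with_minicache(num_layers, caches, merge_fn, start_layer=None):
    """After prefilling, share KV caches across adjacent layer pairs.

    Layers below `start_layer` (default: half depth) keep their own
    caches; from there on, each pair of adjacent layers points at one
    merged cache produced by `merge_fn`.
    """
    start = num_layers // 2 if start_layer is None else start_layer
    shared = dict(enumerate(caches[:start]))       # unmerged shallow layers
    for l in range(start + 1, num_layers, 2):
        shared[l] = shared[l - 1] = merge_fn(caches[l - 1], caches[l])
    # Any trailing unpaired layer keeps its own cache.
    for l in range(start, num_layers):
        shared.setdefault(l, caches[l])
    return shared
```

During decoding, a layer that maps to a merged entry restores its own states from the shared cache (magnitude rescaling plus retention-token recovery) before attention.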
509 | @article{liu2024minicache,
510 | title={MiniCache: KV Cache Compression in Depth Dimension for Large Language Models},
511 | author={Liu, Akide and Liu, Jing and Pan, Zizheng and He, Yefei and Haffari, Gholamreza and Zhuang, Bohan},
512 | journal={arXiv preprint arXiv:2405.14366},
513 | year={2024}
514 | }
515 |
516 | 
--------------------------------------------------------------------------------
2 | @article{liu2024minicache,
3 |     title={MiniCache: KV Cache Compression in Depth Dimension for Large Language Models},
4 |     author={Liu, Akide and Liu, Jing and Pan, Zizheng and He, Yefei and Haffari, Gholamreza and Zhuang, Bohan},
5 |     journal={arXiv preprint arXiv:2405.14366},
6 |     year={2024}
7 | }
8 | 
9 | 
--------------------------------------------------------------------------------
/static/videos/badminton.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/badminton.mp4
--------------------------------------------------------------------------------
/static/videos/banner_video.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/banner_video.mp4
--------------------------------------------------------------------------------
/static/videos/circle.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/circle.mp4
--------------------------------------------------------------------------------
/static/videos/somersault.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/somersault.mp4
--------------------------------------------------------------------------------
/static/videos/stand-up.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/stand-up.mp4
--------------------------------------------------------------------------------
/static/videos/street-dance.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/street-dance.mp4
--------------------------------------------------------------------------------
/static/videos/walk.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/AkideLiu/MiniCache/f1baccacba6bc83dc499f5db9257576397bb4ed8/static/videos/walk.mp4
--------------------------------------------------------------------------------