# A Survey on Hardware Accelerators for Large Language Models

You can read the relevant paper here: [A Survey on Hardware Accelerators for Large Language Models](https://arxiv.org/abs/2401.09890)

An overview of the speedup and energy efficiency of hardware accelerators for LLMs.
(If a paper reports no energy efficiency measurement, it is plotted on the x-axis as if its energy efficiency were 1x.)

![Survey on Hardware accelerators for LLMs](survey.png)
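As a reference for reading the figure, here is a minimal, hypothetical Python/matplotlib sketch of how such a speedup-versus-energy-efficiency scatter plot can be built from the table data. It is not the script that produced `survey.png`; the entries in the snippet are a small illustrative subset of the table below, and papers with no reported energy efficiency are placed at 1x on the x-axis, following the convention noted above.

```python
# Minimal sketch (not the original script behind survey.png): scatter-plot a few
# accelerators by energy efficiency (x) and speedup (y) on log-log axes.
import matplotlib.pyplot as plt

# (title, platform, speedup, energy efficiency or None) -- illustrative subset of the table
papers = [
    ("FTRANS",        "FPGA",       81.0,    8.8),
    ("DFX",           "FPGA",        3.8,    4.0),
    ("LightSeq2",     "GPU",         3.0,    None),
    ("A3",            "ASIC",        7.0,   11.0),
    ("SpAtten",       "ASIC",      347.0, 4059.0),
    ("ReTransformer", "In-memory",  23.0, 1086.0),
]

colors = {"FPGA": "tab:blue", "GPU": "tab:green", "ASIC": "tab:red", "In-memory": "tab:purple"}

fig, ax = plt.subplots()
for title, platform, speedup, energy in papers:
    energy = 1.0 if energy is None else energy  # no reported energy efficiency -> plot at 1x
    ax.scatter(energy, speedup, color=colors[platform], label=platform)
    ax.annotate(title, (energy, speedup), textcoords="offset points", xytext=(4, 4), fontsize=8)

ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Energy efficiency (x)")
ax.set_ylabel("Speedup (x)")
ax.set_title("Hardware accelerators for LLMs")

# Collapse duplicate legend entries so each platform appears once.
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys())

plt.tight_layout()
plt.savefig("survey_sketch.png", dpi=150)
```

Extending the sketch to the full survey only means adding the remaining rows of the table to the `papers` list.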
The following table lists the research papers focused on the acceleration of LLMs (mostly transformers), categorized by computing platform (FPGA, GPU, CPU, ASIC, in-memory, flash).


| Year | Platform | Title | Speedup | Energy efficiency |
| ---- | -------- | ----- | ------- | ----------------- |
| 2019 | FPGA | [MnnFast: A Fast and Scalable System Architecture for Memory-Augmented Neural Networks](https://ieeexplore.ieee.org/document/8980322) | 5.4x | 4.5x |
| 2020 | FPGA | [FTRANS: Energy-Efficient Acceleration of Transformers Using FPGA](https://arxiv.org/pdf/2007.08563) | 27x-81x | 8.8x |
| 2020 | FPGA | [Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer](https://arxiv.org/abs/2009.08605) | 14x | |
| 2021 | FPGA | [An FPGA-Based Overlay Processor for Natural Language Processing](https://arxiv.org/abs/2104.06535) | 35x | 4x-6x |
| 2021 | FPGA | [Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning](https://ieeexplore.ieee.org/document/9424344) | 11x/2x | |
| 2022 | FPGA | [DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation](https://arxiv.org/abs/2209.10797) | 3.8x | 4x |
| 2022 | FPGA | [Hardware Acceleration of Transformer Networks using FPGAs](https://ieeexplore.ieee.org/document/9976354) | 2.3x | |
| 2023 | FPGA | [Transformer-OPU: An FPGA-based Overlay Processor for Transformer Networks](https://ieeexplore.ieee.org/document/10171578/) | 15x | |
| 2023 | FPGA | [A Fast and Flexible FPGA-Based Accelerator for Natural Language Processing Neural Networks](https://dl.acm.org/doi/10.1145/3564606) | 2.7x | |
| 2023 | FPGA | [A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE](https://arxiv.org/abs/2401.02721) | 12.8x | 9.2x |
| 2022 | GPU A100 | [Accelerating Transformer Networks through Recomposing Softmax Layers](https://ieeexplore.ieee.org/document/9975410) | 2.5x | |
| 2022 | GPU A100 | [LightSeq2: Accelerated Training for Transformer-Based Models on GPUs](https://arxiv.org/abs/2110.05722) | 3x | |
| 2023 | GPU V100 | [Inference with Reference: Lossless Acceleration of Large Language Models](https://arxiv.org/abs/2304.04487) | 2x | |
| 2023 | CPU | [Exponentially Faster Language Modelling](https://arxiv.org/abs/2311.10770) | 78x | |
| 2020 | ASIC 40nm | [A3: Accelerating Attention Mechanisms in Neural Networks with Approximation](https://taejunham.github.io/data/a3_hpca2020.pdf) | 7x | 11x |
| 2021 | ASIC 40nm | [ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks](https://taejunham.github.io/data/elsa_isca21.pdf) | 157x | 1265x |
| 2021 | ASIC 40nm | [SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning](https://hanlab.mit.edu/projects/spatten) | 347x/162x | 4059x/1093x |
| 2021 | ASIC 55nm | [Sanger: A Co-Design Framework for Enabling Sparse Attention Using Reconfigurable Architecture](https://dl.acm.org/doi/abs/10.1145/3466752.3480125) | 22.7x/4.64x | |
| 2023 | ASIC 45nm | [Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention](https://arxiv.org/abs/2110.09310) | 168x/8.7x | 10000x/1000x |
| 2020 | In-memory | [ATT: A Fault-Tolerant ReRAM Accelerator for Attention-based Neural Networks](https://www.computer.org/csdl/proceedings-article/iccd/2020/971000a213/1pK5cZB2u1G) | 202x | 11x |
| 2020 | In-memory | [ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration](https://par.nsf.gov/servlets/purl/10225128) | 23x | 1086x |
| 2022 | In-memory | [In-Memory Computing based Accelerator for Transformer Networks for Long Sequences](https://ieeexplore.ieee.org/document/9474146) | 200x | 41x |
| 2023 | In-memory | [X-Former: In-Memory Acceleration of Transformers](https://arxiv.org/abs/2303.07470) | 85x | 7.5x |
| 2023 | Flash | [LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https://arxiv.org/abs/2312.11514) | 25x/5x | |


If you would like to add your research paper to the list, contact me here: [Christoforos Kachris](https://users.uniwa.gr/kachris/)


Feel free to cite the paper:
```
@Article{kachris2024survey,
  AUTHOR = {Kachris, Christoforos},
  TITLE = {A Survey on Hardware Accelerators for Large Language Models},
  JOURNAL = {Applied Sciences},
  VOLUME = {15},
  YEAR = {2025},
  NUMBER = {2},
  ARTICLE-NUMBER = {586},
  URL = {https://www.mdpi.com/2076-3417/15/2/586},
  ISSN = {2076-3417},
}
```