# Awesome MoE LLM Inference System and Algorithm

![Awesome](https://awesome.re/badge.svg)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/JustQJ/awesome-moe-inference/pulls)

A curated list of awesome papers on optimizing the inference of MoE-based LLMs.

Example: [Conference'year] [Paper Title]() [[Code]()]

## Contents

## Survey

[Preprints'24.8] [The Evolution of Mixture of Experts: A Survey from Basics to Breakthroughs](https://www.preprints.org/manuscript/202408.0583/v2)

[Arxiv'24.8] [A Survey on Mixture of Experts](https://arxiv.org/abs/2407.06204) [[Code](https://github.com/withinmiaov/A-Survey-on-Mixture-of-Experts)]

[Arxiv'22] [A Review of Sparse Expert Models in Deep Learning](https://arxiv.org/abs/2209.01667)

## SOTA Open Source MoE LLMs

The Experts column reads activated/total/shared experts per MoE layer; #L is the number of layers and #H the number of attention heads.

| Reference | Para. | Experts | #L | #H | $d_{model}$ | $d_{ffn}$ | $d_{expert}$ | Affiliation | Time |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [NLLB](https://huggingface.co/facebook/nllb-moe-54b) | 54B | 2/64/0 | 24 | 16 | 1024 | 8192 | 8192 | Facebook | 2022.07 |
| [Qwen2-57B-A14B](https://huggingface.co/Qwen/Qwen2-57B-A14B) | 57.4B | 8/64/0 | 28 | 28 | 3584 | 18944 | 2560 | Alibaba | 2023.05 |
| [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) | 46.7B | 2/8/0 | 32 | 32 | 4096 | 14336 | 14336 | Mistral AI | 2023.12 |
| [OpenMoE](https://huggingface.co/OrionZheng/openmoe-base) | 34B | 2/16/0 | 12 | 12 | 768 | 2048 | 2048 | NUS et al. | 2023.12 |
| [DeepSeekMoE](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base) | 16.4B | 6/64/2 | 28 | 16 | 2048 | 10944 | 1408 | DeepSeek-AI | 2024.01 |
| [Qwen1.5-MoE](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) | 14.3B | 4/60/0 | 24 | 16 | 2048 | 5632 | 1408 | Alibaba | 2024.02 |
| [JetMoE](https://huggingface.co/jetmoe/jetmoe-8b) | 8.52B | 2/8/0 | 24 | 32 | 2048 | 5632 | 5632 | MIT et al. | 2024.03 |
| [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1) | 51.6B | 2/16/0 | 32 | 32 | 4096 | 14336 | 14336 | AI21 Labs | 2024.03 |
| [DBRX](https://huggingface.co/databricks/dbrx-base) | 132B | 4/16/0 | 40 | 48 | 6144 | 10752 | 10752 | Databricks | 2024.03 |
| [Grok-1](https://huggingface.co/xai-org/grok-1) | 314B | 2/8/0 | 64 | 48 | 6144 | UNK | UNK | xAI | 2024.03 |
| [Arctic](https://huggingface.co/Snowflake/snowflake-arctic-base) | 482B | 2/128/0 | 35 | 56 | 7168 | 4864 | 4864 | Snowflake | 2024.04 |
| [Mixtral-8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1) | 141B | 2/8/0 | 56 | 48 | 6144 | 16384 | 16384 | Mistral AI | 2024.04 |
| [DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | 236B | 6/160/2 | 60 | 128 | 5120 | 12288 | 1536 | DeepSeek-AI | 2024.04 |
| [Skywork-MoE](https://huggingface.co/Skywork/Skywork-MoE-Base) | 13B | 2/16/0 | 52 | 36 | 4608 | 12288 | 12288 | Kunlun Tech | 2024.05 |
| [Yuan2](https://huggingface.co/IEITYuan/Yuan2-M32-hf) | 40B | 2/32/0 | 24 | 16 | 2048 | 8192 | 8192 | IEIT-Yuan | 2024.05 |
| [LLaMA-MoE](https://github.com/pjlab-sys4nlp/llama-moe) | 6.7B | 2/8/0 | 32 | 32 | 4096 | 11008 | 11008 | Zhu et al. | 2024.06 |
| [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924) | 6.92B | 8/64/0 | 16 | 16 | 2048 | 1024 | 1024 | AllenAI | 2024.07 |
| [Phi-3.5-MoE](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.08 |
| [GRIN-MoE](https://huggingface.co/microsoft/GRIN-MoE) | 41.9B | 2/16/0 | 32 | 32 | 4096 | 6400 | 6400 | Microsoft | 2024.09 |
| [Hunyuan-Large](https://huggingface.co/tencent/Tencent-Hunyuan-Large/tree/main/Hunyuan-A52B-Pretrain) | 389B | 1/16/1 | 64 | 80 | 6400 | 18304 | 18304 | Tencent | 2024.11 |
| [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) | 671B | 8/256/1 | 61 | 128 | 7168 | 18432 | 2048 | DeepSeek-AI | 2024.12 |
| [MiniMax-Text-01](https://huggingface.co/MiniMaxAI/MiniMax-Text-01) | 456B | 2/32/0 | 80 | 64 | 6144 | 9216 | 9216 | MiniMax-AI | 2025.01 |
| [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | 671B | 8/256/1 | 61 | 128 | 7168 | 18432 | 2048 | DeepSeek-AI | 2025.01 |
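
Together with #L, $d_{model}$, and $d_{expert}$, the Experts column is enough to ballpark the total vs. activated parameter split. A minimal sketch in plain Python; this is an approximation only, ignoring embeddings, routers, norms, and attention variants such as GQA/MLA:

```python
# Back-of-the-envelope MoE parameter count from the table's columns.
# Approximation: embeddings, routers, norms, and attention variants
# (GQA/MLA) are ignored, so expect a few percent of error.
def approx_params_billion(n_layers, d_model, d_expert, n_experts, top_k, n_shared=0):
    attn = 4 * d_model * d_model               # Q, K, V, O projections
    expert = 3 * d_model * d_expert            # gate/up/down of a SwiGLU-style expert
    total = n_layers * (attn + (n_experts + n_shared) * expert)
    active = n_layers * (attn + (top_k + n_shared) * expert)
    return total / 1e9, active / 1e9

# Mixtral-8x7B row: 32 layers, d_model=4096, d_expert=14336, 2-of-8 routing
print(approx_params_billion(32, 4096, 14336, 8, 2))  # ~(47.2, 13.4) vs. the listed 46.7B
```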

[DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf) [[Code](https://huggingface.co/deepseek-ai/DeepSeek-R1)]

[Arxiv'25.1] [MiniMax-Text-01](https://arxiv.org/pdf/2501.08313) [[Code](https://huggingface.co/MiniMaxAI/MiniMax-Text-01)]

[Arxiv'24.11] [Hunyuan-Large](https://arxiv.org/abs/2411.02265) [[Code](https://huggingface.co/tencent/Tencent-Hunyuan-Large)]

[Arxiv'24.1] [Mixtral-8x7B](https://arxiv.org/abs/2401.04088) [[Code](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)]

[Arxiv'24.1] [Mixtral-8x22B](https://arxiv.org/abs/2401.04088) [[Code](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)]

[Arxiv'24.1] [DeepSeekMoE](https://arxiv.org/abs/2401.06066) [[Code](https://huggingface.co/deepseek-ai/deepseek-moe-16b-base)]

[Arxiv'24.6] [DeepSeek-V2](https://arxiv.org/abs/2405.04434) [[Code](https://huggingface.co/deepseek-ai/DeepSeek-V2)]

[Arxiv'24.8] [PhiMoE](https://arxiv.org/abs/2404.14219) [[Code](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct)]

[Arxiv'24.9] [GRadient-INformed MoE](https://arxiv.org/abs/2409.12136) [[Code](https://huggingface.co/microsoft/GRIN-MoE)]

[Arxiv'24.9] [Qwen2-57B-A14B](https://arxiv.org/abs/2407.10671) [[Code](https://huggingface.co/Qwen/Qwen2-57B-A14B)]

[QwenBlog'24.3] [Qwen1.5-MoE](https://qwenlm.github.io/blog/qwen-moe/) [[Code](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)]

[Arxiv'24.9] [OLMoE: Open Mixture-of-Experts Language Models](https://arxiv.org/abs/2409.02060) [[Code](https://github.com/allenai/OLMoE)]

[Arxiv'24.3] [OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models](https://arxiv.org/abs/2402.01739) [[Code](https://github.com/XueFuzhao/OpenMoE)]

[Arxiv'24.6] [Skywork-MoE](https://arxiv.org/abs/2406.06563) [[Code](https://huggingface.co/Skywork/Skywork-MoE-Base)]

[Arxiv'24.4] [JetMoE: Reaching Llama2 Performance with 0.1M Dollars](https://arxiv.org/abs/2404.07413) [[Code](https://github.com/myshell-ai/JetMoE)]

[Arxiv'24.5] [Yuan 2.0-M32](https://arxiv.org/abs/2405.17976) [[Code](https://huggingface.co/IEITYuan/Yuan2-M32-hf)]

[MosaicResearchBlog'24.3] [DBRX](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) [[Code](https://huggingface.co/databricks/dbrx-base)]

[SnowflakeBlog'24.4] [Arctic](https://www.snowflake.com/en/blog/arctic-open-efficient-foundation-language-models-snowflake/) [[Code](https://huggingface.co/Snowflake/snowflake-arctic-base)]

[XAIBlog'24.3] [Grok-1](https://x.ai/blog/grok-os) [[Code](https://github.com/xai-org/grok-1)]

[Arxiv'24.7] [Jamba](https://arxiv.org/abs/2403.19887) [[Code](https://huggingface.co/ai21labs/Jamba-v0.1)]

[Arxiv'24.6] [LLaMA-MoE](https://arxiv.org/abs/2406.16554) [[Code](https://github.com/pjlab-sys4nlp/llama-moe)]

[Arxiv'22] [NLLB-MoE](https://arxiv.org/abs/2207.04672) [[Code](https://huggingface.co/facebook/nllb-moe-54b)]

[ICCV'21] [Swin-MoE](https://openaccess.thecvf.com/content/ICCV2021/papers/Liu_Swin_Transformer_Hierarchical_Vision_Transformer_Using_Shifted_Windows_ICCV_2021_paper.pdf) [[Code](https://github.com/microsoft/Swin-Transformer)]

## Model-Level Optimizations

### Efficient Architecture Design

#### Attention Module

[Arxiv'24.8] [BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts](https://arxiv.org/abs/2408.08274)

[Arxiv'24.10] [MoH: Multi-Head Attention as Mixture-of-Head Attention](https://arxiv.org/abs/2410.11842) [[Code](https://github.com/SkyworkAI/MoH)]

[Arxiv'24.4] [Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models](https://arxiv.org/abs/2404.05567)

[Arxiv'24.4] [JetMoE: Reaching Llama2 Performance with 0.1M Dollars](https://arxiv.org/abs/2404.07413) [[Code](https://github.com/myshell-ai/JetMoE)]

[NeurIPS'24.10] [MoEUT: Mixture-of-Experts Universal Transformers](https://arxiv.org/abs/2405.16039) [[Code](https://github.com/robertcsordas/moeut)]

[NeurIPS'24.9] [SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention](https://arxiv.org/abs/2312.07987) [[Code](https://github.com/robertcsordas/switchhead)]

[Arxiv'23] [ModuleFormer: Modularity Emerges from Mixture-of-Experts](https://arxiv.org/abs/2306.04640) [[Code](https://github.com/IBM/ModuleFormer)]

[Arxiv'23] [Sparse Universal Transformer](https://arxiv.org/abs/2310.07096)

[EMNLP'22] [Mixture of Attention Heads: Selecting Attention Heads Per Token](https://arxiv.org/abs/2210.05144) [[Code](https://github.com/yikangshen/MoA)]

[ACL'20] [A Mixture of h - 1 Heads is Better than h Heads](https://aclanthology.org/2020.acl-main.587/)
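
The papers above move sparse routing into the attention sub-layer, e.g. selecting a few attention heads per token (MoA, MoH, SwitchHead). A minimal, hypothetical sketch of per-token top-k head selection in PyTorch; unlike the papers, it still computes every head and only sparsifies the combination, so it shows the routing logic rather than the speedup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKHeadAttention(nn.Module):
    """Self-attention where each token keeps only its top-k router-scored
    heads, in the spirit of MoA/MoH. Illustrative only: all heads are still
    computed; the papers add sparse kernels to turn selection into savings."""
    def __init__(self, d_model: int, n_heads: int, k: int):
        super().__init__()
        self.h, self.k, self.d_head = n_heads, k, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.router = nn.Linear(d_model, n_heads)      # per-token head scores

    def forward(self, x):                              # x: (B, T, d_model)
        B, T, _ = x.shape
        q, k, v = (t.view(B, T, self.h, self.d_head).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        heads = F.scaled_dot_product_attention(q, k, v)     # (B, H, T, d_head)
        gates = self.router(x).softmax(dim=-1)              # (B, T, H)
        topv, topi = gates.topk(self.k, dim=-1)
        mask = torch.zeros_like(gates).scatter(-1, topi, topv)  # zero dropped heads
        heads = heads.transpose(1, 2) * mask.unsqueeze(-1)      # gate each head
        return self.out(heads.reshape(B, T, self.h * self.d_head))
```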

#### MoE Module

[Arxiv'24.10] [MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts](https://arxiv.org/abs/2410.07348) [[Code](https://github.com/SkyworkAI/MoE-plus-plus)]

[Arxiv'24.2] [MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2402.12851)

[Arxiv'23] [Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/abs/2308.12066) [[Code](https://github.com/ranggihwang/Pregated_MoE)]

[ICLR'23] [SCoMoE: Efficient Mixtures of Experts with Structured Communication](https://openreview.net/forum?id=s-c96mSU0u5)

[KDD'23] [COMET: Learning Cardinality Constrained Mixture of Experts with Trees and Local Search](https://dl.acm.org/doi/pdf/10.1145/3580305.3599278)
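
As a common reference point for this and the following sections, here is a minimal top-k token-choice MoE FFN of the kind these designs modify. A sketch only: no capacity limits or load-balancing loss, and the per-expert Python loop stands in for the grouped GEMMs real systems use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Minimal top-k token-choice MoE FFN (illustrative sketch)."""
    def __init__(self, d_model: int, d_expert: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.SiLU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts))

    def forward(self, x):                              # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)
        weight, idx = gates.topk(self.top_k, dim=-1)
        weight = weight / weight.sum(-1, keepdim=True)  # renormalize over top-k
        y = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel():                       # only routed tokens hit expert e
                y[token_ids] += weight[token_ids, slot, None] * expert(x[token_ids])
        return y
```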

### Model Compression

#### Pruning

[Arxiv'24.10] [MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router](https://arxiv.org/abs/2410.12013)

[ACL'24] [HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts](https://aclanthology.org/2024.acl-long.571/) [[Code](https://github.com/Bumble666/Hyper_MoE)]

[Arxiv'24.4] [SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts](https://arxiv.org/abs/2404.05089)

[Arxiv'24.10] [Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts](https://arxiv.org/abs/2407.09590)

[Arxiv'24.7] [Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs](https://arxiv.org/abs/2407.00945) [[Code](https://github.com/imagination-research/EEP)]

[ACL'24.5] [Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models](https://arxiv.org/abs/2402.14800) [[Code](https://github.com/Lucky-Lance/Expert_Sparsity)]

[Arxiv'24.9] [Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning](https://arxiv.org/abs/2409.01483)

[Arxiv'24.9] [STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning](https://arxiv.org/abs/2409.06211)

[Arxiv'24.6] [Demystifying the Compression of Mixture-of-Experts Through a Unified Framework](https://arxiv.org/abs/2406.02500) [[Code](https://github.com/DaizeDong/Unified-MoE-Compression)]

[Arxiv'24.5] [A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts](https://arxiv.org/abs/2405.16646)

[Arxiv'24.11] [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) [[Code](https://github.com/xiaochengsky/MoEI-2)]

[ICLR'24.3] [Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy](https://arxiv.org/abs/2310.01334) [[Code](https://github.com/unites-lab/mc-smoe)]

[Arxiv'23] [ModuleFormer: Modularity Emerges from Mixture-of-Experts](https://arxiv.org/abs/2306.04640) [[Code](https://github.com/IBM/ModuleFormer)]

[Arxiv'22] [Task-Specific Expert Pruning for Sparse Mixture-of-Experts](https://arxiv.org/abs/2206.00277)

[SenSys'24] [LiteMoE: Customizing On-device LLM Serving via Proxy Submodel Tuning](https://dl.acm.org/doi/abs/10.1145/3666025.3699355)

#### Quantization

[Arxiv'24.10] [MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) [[Code](https://github.com/Aaronhuang-778/MC-MoE)]

[Arxiv'23] [Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness](https://arxiv.org/abs/2310.02410)

[Arxiv'23] [QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models](https://arxiv.org/abs/2310.16795) [[Code](http://github.com/IST-DASLab/qmoe)]

[Arxiv'24.11] [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433)

[Arxiv'24.9] [Mixture of Experts with Mixture of Precisions for Tuning Quality of Service](https://arxiv.org/abs/2407.14417)

[Arxiv'24.6] [Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark](https://arxiv.org/abs/2406.08155) [[Code](https://github.com/UNITES-Lab/moe-quantization)]

[INTERSPEECH'23] [Compressed MoE ASR Model Based on Knowledge Distillation and Quantization](https://www.isca-archive.org/interspeech_2023/yuan23c_interspeech.pdf)

[Arxiv'23] [EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models](https://arxiv.org/abs/2308.14352) [Quantization]

[EMNLP'22] [Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production](https://arxiv.org/abs/2211.10017)
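
The common baseline behind these methods: expert weights dominate MoE memory but each expert is touched sparsely, so they tolerate aggressive low-bit storage. A sketch of the per-channel symmetric (absmax) int8 round-trip that the papers above refine with mixed precision and expert-aware bit allocation:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-output-channel symmetric (absmax) int8 quantization of a weight
    matrix w of shape (out, in). Returns int8 weights and fp scales."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(scale.dtype) * scale

w = torch.randn(1024, 4096)        # one expert's projection matrix (example shape)
q, s = quantize_int8(w)
print((dequantize(q, s) - w).abs().max())  # small per-channel rounding error
```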

#### Knowledge Distillation

[Arxiv'24.10] [LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation](https://arxiv.org/abs/2408.15881)

[Arxiv'24.8] [LaDiMo: Layer-wise Distillation Inspired MoEfier](https://arxiv.org/abs/2408.04278)

[INTERSPEECH'23] [Compressed MoE ASR Model Based on Knowledge Distillation and Quantization](https://www.isca-archive.org/interspeech_2023/yuan23c_interspeech.pdf)

[ICML'22] [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://proceedings.mlr.press/v162/rajbhandari22a.html) [[Code](https://github.com/microsoft/DeepSpeed)]

[Microsoft'22] [Knowledge Distillation for Mixture of Experts Models in Speech Recognition](https://www.microsoft.com/en-us/research/uploads/prod/2022/05/MainzSpeech_Interspeech2022_KD_MoE_Network.pdf)

[Arxiv'22] [One Student Knows All Experts Know: From Sparse to Dense](https://arxiv.org/abs/2201.10890)

[JMLR'22] [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://www.jmlr.org/papers/volume23/21-0998/21-0998.pdf)

[Arxiv'21] [Efficient Large Scale Language Modeling with Mixtures of Experts](https://arxiv.org/pdf/2112.10684)

#### Low Rank Decomposition

[Arxiv'24.11] [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) [[Code](https://github.com/xiaochengsky/MoEI-2)]

[ICLR'24.3] [Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy](https://arxiv.org/abs/2310.01334) [[Code](https://github.com/unites-lab/mc-smoe)]

[Arxiv'22] [Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models](https://arxiv.org/abs/2203.01104) [[Code](https://github.com/RUCAIBox/MPOE)]

### Expert Skip/Adaptive Gating

[Arxiv'24.8] [AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference](https://arxiv.org/abs/2408.10284) [[Code](https://github.com/PKU-SEC-Lab/AdapMoE)]

[ACL'24.8] [XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection](https://aclanthology.org/2024.findings-acl.694/)

[Arxiv'24.5] [Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models](https://arxiv.org/abs/2405.14297) [[Code](https://github.com/LINs-lab/DynMoE)]

[Arxiv'23] [Adaptive Gating in Mixture-of-Experts based Language Models](https://arxiv.org/abs/2310.07188)

[Arxiv'23] [Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference](https://arxiv.org/abs/2303.06182)

[Arxiv'24.8] [AdaMoLE: Fine-Tuning Large Language Models with Adaptive Mixture of Low-Rank Adaptation Experts](https://arxiv.org/abs/2405.00361)

[ICCV'23] [AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts](https://ieeexplore.ieee.org/document/10377734)
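
A recurring idea in this section: instead of a fixed top-k, activate only as many experts as a token needs, e.g. by taking experts until the cumulative router probability passes a threshold. A hedged sketch of that selection rule (the papers differ in thresholds, caps, and training objectives):

```python
import torch

def adaptive_expert_selection(router_logits: torch.Tensor, p: float = 0.6, k_max: int = 4):
    """Per token, pick the smallest expert set whose cumulative router
    probability exceeds p, capped at k_max experts. Returns a boolean
    mask of shape (tokens, experts). Sketch of the idea only."""
    probs = router_logits.softmax(-1)
    vals, idx = probs.sort(-1, descending=True)
    need = vals.cumsum(-1) < p                       # still below the threshold
    take = need.roll(shifts=1, dims=-1)              # include expert j if sum before j < p
    take[..., 0] = True                              # always keep the top expert
    take &= torch.arange(vals.size(-1), device=vals.device) < k_max
    return torch.zeros_like(probs, dtype=torch.bool).scatter(-1, idx, take)

logits = torch.randn(5, 8)                           # 5 tokens, 8 experts
print(adaptive_expert_selection(logits).sum(-1))     # experts used per token varies
```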

### Expert Merging

[Arxiv'24.10] [Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering](https://arxiv.org/abs/2410.08589)

[EMNLP'23] [Merging Experts into One: Improving Computational Efficiency of Mixture of Experts](https://aclanthology.org/2023.emnlp-main.907.pdf)

[Arxiv'24.3] [Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM](https://arxiv.org/abs/2403.07816)

[Arxiv'22] [Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models](https://arxiv.org/abs/2208.03306)

[ICLR'24.5] [Fusing Models with Complementary Expertise](https://openreview.net/pdf?id=PhMrGCMIRL)

[Arxiv'24.5] [Learning More Generalized Experts by Merging Experts in Mixture-of-Experts](https://arxiv.org/abs/2405.11530)

[Arxiv'24.9] [DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models](https://arxiv.org/abs/2409.06669)
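
The shared primitive in these papers is collapsing several experts into one, in the simplest form by (usage- or similarity-weighted) parameter averaging over a cluster of experts. A hedged sketch of that building block, assuming identically shaped experts and ignoring the neuron-permutation alignment that stronger methods apply:

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def merge_experts(e1: nn.Module, e2: nn.Module, w1: float = 0.5) -> nn.Module:
    """Collapse two identically shaped experts into one by weighted parameter
    averaging. Real methods pick clusters from router statistics and must also
    remap the router so merged slots are reused."""
    merged = copy.deepcopy(e1)
    for p, p1, p2 in zip(merged.parameters(), e1.parameters(), e2.parameters()):
        p.copy_(w1 * p1 + (1.0 - w1) * p2)
    return merged

# e.g. merge two FFN experts; the 0.7/0.3 weights are hypothetical usage ratios
a = nn.Sequential(nn.Linear(16, 64), nn.SiLU(), nn.Linear(64, 16))
b = copy.deepcopy(a)
ab = merge_experts(a, b, w1=0.7)
```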

### Sparse to Dense

[ACL'24.6] [XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts](https://aclanthology.org/2024.acl-long.699.pdf)

[Arxiv'23] [ModuleFormer: Learning Modular Large Language Models from Uncurated Data](https://arxiv.org/abs/2306.04640)

[Arxiv'23] [Experts Weights Averaging: A New General Training Scheme for Vision Transformers](https://arxiv.org/pdf/2308.06093)

[JMLR'22] [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://www.jmlr.org/papers/volume23/21-0998/21-0998.pdf)

[Arxiv'22] [One Student Knows All Experts Know: From Sparse to Dense](https://arxiv.org/abs/2201.10890)

[Arxiv'22] [Task-Specific Expert Pruning for Sparse Mixture-of-Experts](https://arxiv.org/abs/2206.00277)

[Arxiv'21] [Efficient Large Scale Language Modeling with Mixtures of Experts](https://arxiv.org/pdf/2112.10684)

## System-Level Optimization

### Expert Parallel

[Arxiv'25.1] [Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing](https://arxiv.org/abs/2501.05313)

[ASPLOS'25] [FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models](https://shaohuais.github.io/publications/index.html)

[OpenReview'24.11] [Toward Efficient Inference for Mixture of Experts](https://openreview.net/forum?id=stXtBqyTWX&noteId=p7ADDxdU8g)

[Arxiv'24.10] [EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference](https://arxiv.org/abs/2410.12247)

[IPDPS'24.1] [Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference](https://arxiv.org/abs/2401.08383)

[Arxiv'24.10] [Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling](https://arxiv.org/abs/2410.17043)

[IEEE'24.5] [WDMoE: Wireless Distributed Large Language Models with Mixture of Experts](https://arxiv.org/abs/2405.03131)

[Arxiv'24.11] [Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection](https://arxiv.org/abs/2411.08982)

[Arxiv'24.4] [Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing](https://arxiv.org/abs/2404.16914)

[Arxiv'24.10] [MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts](https://arxiv.org/abs/2410.07348) [[Code](https://github.com/SkyworkAI/MoE-plus-plus)] [MoE Module Design]

[Arxiv'24.11] [Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts](https://arxiv.org/abs/2404.05019)

[TSC'24.5] [MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10528887)

[Arxiv'24.11] [HEXA-MoE: Efficient and Heterogeneous-aware MoE Acceleration with ZERO Computation Redundancy](https://arxiv.org/abs/2411.01288) [[Code](https://github.com/UNITES-Lab/HEXA-MoE)]

[Arxiv'24.5] [LocMoE: A Low-Overhead MoE for Large Language Model Training](https://arxiv.org/abs/2401.13920)

[Arxiv'24.7] [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https://arxiv.org/abs/2407.04656)

[Arxiv'24.10] [Scattered Mixture-of-Experts Implementation](https://arxiv.org/abs/2403.08245) [[Code](https://github.com/shawntan/scattermoe)]

[TPDS'24.4] [MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism](https://ieeexplore.ieee.org/abstract/document/10494556)

[INFOCOM'24.5] [Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules](https://ieeexplore.ieee.org/abstract/document/10621327)

[EuroSys'24.4] [ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling](https://dl.acm.org/doi/10.1145/3627703.3650083)

[SIGCOMM'23] [Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models](https://dl.acm.org/doi/10.1145/3603269.3604869)

[INFOCOM'23] [PipeMoE: Accelerating Mixture-of-Experts through Adaptive Pipelining](https://ieeexplore.ieee.org/abstract/document/10228874)

[ATC'23] [Accelerating Distributed MoE Training and Inference with Lina](https://www.usenix.org/conference/atc23/presentation/li-jiamin)

[ATC'23] [SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization](https://www.usenix.org/conference/atc23/presentation/zhai) [[Code](https://github.com/zms1999/SmartMoE)]

[Arxiv'23] [Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference](https://arxiv.org/abs/2303.06182)

[SIGMOD'23] [FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement](https://arxiv.org/abs/2304.03946) [[Code](https://github.com/UNITES-Lab/flex-moe)]

[MLSys'23] [Tutel: Adaptive Mixture-of-Experts at Scale](https://arxiv.org/abs/2206.03382) [[Code](https://github.com/microsoft/tutel)]

[OSDI'23] [Optimizing Dynamic Neural Networks with Brainstorm](https://www.usenix.org/conference/osdi23/presentation/cui) [[Code](https://github.com/Raphael-Hao/brainstorm)]

[ICS'23] [A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training](https://arxiv.org/abs/2303.06318)

[CLUSTER'23] [Prophet: Fine-grained Load Balancing for Parallel Training of Large-scale MoE Models](https://ieeexplore.ieee.org/abstract/document/10319949)

[OSDI'22] [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning](https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin) [[Code](https://github.com/alpa-projects/alpa)]

[NeurIPS'22] [TA-MoE: Topology-Aware Large Scale Mixture-of-Expert Training](https://arxiv.org/abs/2302.09915) [[Code](https://github.com/chen-chang/ta-moe)]

[NeurIPS'22] [Mixture-of-Experts with Expert Choice Routing](https://proceedings.neurips.cc/paper_files/paper/2022/file/2f00ecd787b432c1d36f3de9800728eb-Paper-Conference.pdf)

[PPoPP'22] [FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models](https://dl.acm.org/doi/10.1145/3503221.3508418) [[Code](https://github.com/thu-pacman/FasterMoE)]

[PPoPP'22] [BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores](https://dl.acm.org/doi/10.1145/3503221.3508417)

[SoCC'22] [Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism](https://dl.acm.org/doi/10.1145/3542929.3563487)

[PMLR'22] [Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers](https://proceedings.mlr.press/v162/liu22g/liu22g.pdf)

[ICML'22] [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://proceedings.mlr.press/v162/rajbhandari22a.html) [[Code](https://github.com/microsoft/DeepSpeed)]

[Arxiv'22] [HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System](https://arxiv.org/abs/2203.14685) [[Code](https://github.com/PKU-DAIR/Hetu)]

[Arxiv'21] [FastMoE: A Fast Mixture-of-Expert Training System](https://arxiv.org/abs/2103.13262) [[Code](https://github.com/laekov/fastmoe)]

[PMLR'21] [BASE Layers: Simplifying Training of Large, Sparse Models](https://proceedings.mlr.press/v139/lewis21a/lewis21a.pdf) [[Code](https://github.com/pytorch/fairseq/)]

[Arxiv'20] [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
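
In expert parallelism, experts are sharded across devices and each MoE layer performs an all-to-all: tokens are routed, bucketed by destination expert, exchanged, processed, and exchanged back. A minimal sketch of the dispatch bookkeeping with `torch.distributed`, assuming one expert per rank and an initialized process group whose backend supports all-to-all (e.g. NCCL):

```python
import torch
import torch.distributed as dist

def moe_dispatch(x, expert_idx, world_size):
    """Dispatch step of expert parallelism (sketch). Assumes one expert per
    rank; returns received tokens plus metadata to route results back."""
    order = expert_idx.argsort()                      # bucket tokens by destination rank
    send = x[order].contiguous()
    in_splits = torch.bincount(expert_idx, minlength=world_size)
    out_splits = torch.empty_like(in_splits)
    dist.all_to_all_single(out_splits, in_splits)     # exchange per-rank token counts
    recv = send.new_empty(int(out_splits.sum()), x.size(1))
    dist.all_to_all_single(recv, send,
                           output_split_sizes=out_splits.tolist(),
                           input_split_sizes=in_splits.tolist())
    return recv, order, in_splits, out_splits
```

The combine step mirrors the exchange with the split sizes swapped and then inverts the `argsort` permutation; overlapping these two all-to-alls with expert computation is what much of the scheduling and pipelining work above targets.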

### Expert Offloading

[Arxiv'25.2] [Accurate Expert Predictions in MoE Inference via Cross-Layer Gate](https://arxiv.org/abs/2502.12224v1)

[Arxiv'25.2] [fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving](https://www.arxiv.org/abs/2502.05370)

[Arxiv'24.12] [DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference](https://arxiv.org/abs/2501.10375)

[Arxiv'24.11] [Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference](https://arxiv.org/abs/2412.00099)

[Arxiv'24.10] [ProMoE: Fast MoE-based LLM Serving using Proactive Caching](https://arxiv.org/abs/2410.22134)

[NeurIPS'24.10] [Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design](https://arxiv.org/abs/2410.19123) [[Code](https://github.com/VITA-Group/READ-ME)]

[Arxiv'24.11] [Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts](https://arxiv.org/abs/2404.05019)

[Arxiv'24.11] [MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs](https://arxiv.org/abs/2411.11217)

[Arxiv'24.11] [HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference](https://arxiv.org/abs/2411.01433) [Quantization, Skip Expert]

[Arxiv'24.10] [ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference](https://arxiv.org/abs/2410.17954)

[Arxiv'24.8] [AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference](https://arxiv.org/abs/2408.10284) [[Code](https://github.com/PKU-SEC-Lab/AdapMoE)] [Adaptive Gating]

[Arxiv'24.9] [Mixture of Experts with Mixture of Precisions for Tuning Quality of Service](https://arxiv.org/abs/2407.14417)

[MLSys'24.5] [SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models](https://proceedings.mlsys.org/paper_files/paper/2024/hash/698cfaf72a208aef2e78bcac55b74328-Abstract-Conference.html) [[Code](https://github.com/timlee0212/SiDA-MoE)]

[Arxiv'24.8] [MoE-Infinity: Offloading-Efficient MoE Model Serving](https://arxiv.org/abs/2401.14361) [[Code](https://github.com/TorchMoE/MoE-Infinity)]

[Arxiv'24.2] [Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models](https://arxiv.org/abs/2402.07033) [[Code](https://github.com/efeslab/fiddler)]
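
Most systems in this section keep a small working set of experts in GPU memory and fetch the rest from CPU or SSD on demand, hiding latency with prediction and prefetching. The baseline they all improve on is a plain LRU expert cache; a sketch with a hypothetical `load_fn` standing in for the host-to-GPU copy:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU cache over expert weights: a fixed number of experts stay
    GPU-resident, the rest are fetched on demand. Baseline sketch only;
    the systems above add prediction, prefetching, and mixed precision."""
    def __init__(self, capacity: int, load_fn):
        self.capacity, self.load_fn = capacity, load_fn
        self.cache = OrderedDict()                    # expert_id -> GPU weights

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)         # hit: mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)        # evict least recently used
            self.cache[expert_id] = self.load_fn(expert_id)
        return self.cache[expert_id]

cache = ExpertCache(capacity=8, load_fn=lambda i: f"weights[{i}]")  # stand-in loader
_ = cache.get(3)
```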

[Electronics'24.5] [Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things](https://www.mdpi.com/2079-9292/13/11/2077)

[ISCA'24.4] [Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference](https://arxiv.org/abs/2308.12066) [[Code](https://github.com/ranggihwang/Pregated_MoE)] [MoE Module]

[HPCA'24.3] [Enabling Large Dynamic Neural Network Training with Learning-based Memory Management](https://ieeexplore.ieee.org/document/10476398)

[SC'24.11] [APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes](https://dl.acm.org/doi/10.1109/SC41406.2024.00096)

[Arxiv'23] [Fast Inference of Mixture-of-Experts Language Models with Offloading](https://arxiv.org/abs/2312.17238) [[Code](https://github.com/dvmazur/mixtral-offloading)]

[Arxiv'23] [Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference](https://arxiv.org/abs/2303.06182) [Adaptive Gating]

[Arxiv'23] [EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models](https://arxiv.org/abs/2308.14352) [Quantization]

[ACL'24.5] [SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget](https://arxiv.org/abs/2308.15030)

### Others

[SoCC'24.11] [MoEsaic: Shared Mixture of Experts]

## Hardware-Level Optimization

[MICRO'24.9] [Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching](https://arxiv.org/abs/2409.01141)

[DAC'24.5] [MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models](https://dl.acm.org/doi/pdf/10.1145/3649329.3655951)

[DAC'24.11] [FLAME: Fully Leveraging MoE Sparsity for Transformer on FPGA](https://dl.acm.org/doi/pdf/10.1145/3649329.3656507)

[ISSCC'24.2] [Space-Mate: A 303.5mW Real-Time Sparse Mixture-of-Experts-Based NeRF-SLAM Processor for Mobile Spatial Computing](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10454487)

[ICCAD'23] [Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-Level Sparsity via Mixture-of-Experts](https://ieeexplore.ieee.org/abstract/document/10323651) [[Code](https://github.com/sharc-lab/Edge-MoE)]

[NeurIPS'22] [M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design](https://proceedings.neurips.cc/paper_files/paper/2022/file/b653f34d576d1790481e3797cb740214-Paper-Conference.pdf) [[Code](https://github.com/VITA-Group/M3ViT)]

## Citation

If you find this repo useful, please cite our paper:

```
@misc{liu2024moeinf,
      title={A Survey on Inference Optimization Techniques for Mixture of Experts Models},
      author={Jiacheng Liu and Peng Tang and Wenfeng Wang and Yuhang Ren and Xiaofeng Hou and Pheng-Ann Heng and Minyi Guo and Chao Li},
      year={2024},
      archivePrefix={arXiv},
}
```