├── LICENSE
├── README.md
├── assets
    ├── MOE Expert Choice.png
    ├── MOE Expert Distribution.png
    ├── MOE Expert Routing.png
    ├── MOE Instruction Tuning.png
    ├── MOE Mixtral Tokens.png
    ├── MOE Parallel.png
    ├── MOE RNN.png
    ├── MOE Scaling Law.png
    ├── MOE vs Dense.png
    └── MOE-Block.png
└── notes
    └── Mixture-Of-Experts-Hugging-Face.md


/LICENSE:
--------------------------------------------------------------------------------
  1 |                                  Apache License
  2 |                            Version 2.0, January 2004
  3 |                         http://www.apache.org/licenses/
  4 | 
  5 |    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
  6 | 
  7 |    1. Definitions.
  8 | 
  9 |       "License" shall mean the terms and conditions for use, reproduction,
 10 |       and distribution as defined by Sections 1 through 9 of this document.
 11 | 
 12 |       "Licensor" shall mean the copyright owner or entity authorized by
 13 |       the copyright owner that is granting the License.
 14 | 
 15 |       "Legal Entity" shall mean the union of the acting entity and all
 16 |       other entities that control, are controlled by, or are under common
 17 |       control with that entity. For the purposes of this definition,
 18 |       "control" means (i) the power, direct or indirect, to cause the
 19 |       direction or management of such entity, whether by contract or
 20 |       otherwise, or (ii) ownership of fifty percent (50%) or more of the
 21 |       outstanding shares, or (iii) beneficial ownership of such entity.
 22 | 
 23 |       "You" (or "Your") shall mean an individual or Legal Entity
 24 |       exercising permissions granted by this License.
 25 | 
 26 |       "Source" form shall mean the preferred form for making modifications,
 27 |       including but not limited to software source code, documentation
 28 |       source, and configuration files.
 29 | 
 30 |       "Object" form shall mean any form resulting from mechanical
 31 |       transformation or translation of a Source form, including but
 32 |       not limited to compiled object code, generated documentation,
 33 |       and conversions to other media types.
 34 | 
 35 |       "Work" shall mean the work of authorship, whether in Source or
 36 |       Object form, made available under the License, as indicated by a
 37 |       copyright notice that is included in or attached to the work
 38 |       (an example is provided in the Appendix below).
 39 | 
 40 |       "Derivative Works" shall mean any work, whether in Source or Object
 41 |       form, that is based on (or derived from) the Work and for which the
 42 |       editorial revisions, annotations, elaborations, or other modifications
 43 |       represent, as a whole, an original work of authorship. For the purposes
 44 |       of this License, Derivative Works shall not include works that remain
 45 |       separable from, or merely link (or bind by name) to the interfaces of,
 46 |       the Work and Derivative Works thereof.
 47 | 
 48 |       "Contribution" shall mean any work of authorship, including
 49 |       the original version of the Work and any modifications or additions
 50 |       to that Work or Derivative Works thereof, that is intentionally
 51 |       submitted to Licensor for inclusion in the Work by the copyright owner
 52 |       or by an individual or Legal Entity authorized to submit on behalf of
 53 |       the copyright owner. For the purposes of this definition, "submitted"
 54 |       means any form of electronic, verbal, or written communication sent
 55 |       to the Licensor or its representatives, including but not limited to
 56 |       communication on electronic mailing lists, source code control systems,
 57 |       and issue tracking systems that are managed by, or on behalf of, the
 58 |       Licensor for the purpose of discussing and improving the Work, but
 59 |       excluding communication that is conspicuously marked or otherwise
 60 |       designated in writing by the copyright owner as "Not a Contribution."
 61 | 
 62 |       "Contributor" shall mean Licensor and any individual or Legal Entity
 63 |       on behalf of whom a Contribution has been received by Licensor and
 64 |       subsequently incorporated within the Work.
 65 | 
 66 |    2. Grant of Copyright License. Subject to the terms and conditions of
 67 |       this License, each Contributor hereby grants to You a perpetual,
 68 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 69 |       copyright license to reproduce, prepare Derivative Works of,
 70 |       publicly display, publicly perform, sublicense, and distribute the
 71 |       Work and such Derivative Works in Source or Object form.
 72 | 
 73 |    3. Grant of Patent License. Subject to the terms and conditions of
 74 |       this License, each Contributor hereby grants to You a perpetual,
 75 |       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
 76 |       (except as stated in this section) patent license to make, have made,
 77 |       use, offer to sell, sell, import, and otherwise transfer the Work,
 78 |       where such license applies only to those patent claims licensable
 79 |       by such Contributor that are necessarily infringed by their
 80 |       Contribution(s) alone or by combination of their Contribution(s)
 81 |       with the Work to which such Contribution(s) was submitted. If You
 82 |       institute patent litigation against any entity (including a
 83 |       cross-claim or counterclaim in a lawsuit) alleging that the Work
 84 |       or a Contribution incorporated within the Work constitutes direct
 85 |       or contributory patent infringement, then any patent licenses
 86 |       granted to You under this License for that Work shall terminate
 87 |       as of the date such litigation is filed.
 88 | 
 89 |    4. Redistribution. You may reproduce and distribute copies of the
 90 |       Work or Derivative Works thereof in any medium, with or without
 91 |       modifications, and in Source or Object form, provided that You
 92 |       meet the following conditions:
 93 | 
 94 |       (a) You must give any other recipients of the Work or
 95 |           Derivative Works a copy of this License; and
 96 | 
 97 |       (b) You must cause any modified files to carry prominent notices
 98 |           stating that You changed the files; and
 99 | 
100 |       (c) You must retain, in the Source form of any Derivative Works
101 |           that You distribute, all copyright, patent, trademark, and
102 |           attribution notices from the Source form of the Work,
103 |           excluding those notices that do not pertain to any part of
104 |           the Derivative Works; and
105 | 
106 |       (d) If the Work includes a "NOTICE" text file as part of its
107 |           distribution, then any Derivative Works that You distribute must
108 |           include a readable copy of the attribution notices contained
109 |           within such NOTICE file, excluding those notices that do not
110 |           pertain to any part of the Derivative Works, in at least one
111 |           of the following places: within a NOTICE text file distributed
112 |           as part of the Derivative Works; within the Source form or
113 |           documentation, if provided along with the Derivative Works; or,
114 |           within a display generated by the Derivative Works, if and
115 |           wherever such third-party notices normally appear. The contents
116 |           of the NOTICE file are for informational purposes only and
117 |           do not modify the License. You may add Your own attribution
118 |           notices within Derivative Works that You distribute, alongside
119 |           or as an addendum to the NOTICE text from the Work, provided
120 |           that such additional attribution notices cannot be construed
121 |           as modifying the License.
122 | 
123 |       You may add Your own copyright statement to Your modifications and
124 |       may provide additional or different license terms and conditions
125 |       for use, reproduction, or distribution of Your modifications, or
126 |       for any such Derivative Works as a whole, provided Your use,
127 |       reproduction, and distribution of the Work otherwise complies with
128 |       the conditions stated in this License.
129 | 
130 |    5. Submission of Contributions. Unless You explicitly state otherwise,
131 |       any Contribution intentionally submitted for inclusion in the Work
132 |       by You to the Licensor shall be under the terms and conditions of
133 |       this License, without any additional terms or conditions.
134 |       Notwithstanding the above, nothing herein shall supersede or modify
135 |       the terms of any separate license agreement you may have executed
136 |       with Licensor regarding such Contributions.
137 | 
138 |    6. Trademarks. This License does not grant permission to use the trade
139 |       names, trademarks, service marks, or product names of the Licensor,
140 |       except as required for reasonable and customary use in describing the
141 |       origin of the Work and reproducing the content of the NOTICE file.
142 | 
143 |    7. Disclaimer of Warranty. Unless required by applicable law or
144 |       agreed to in writing, Licensor provides the Work (and each
145 |       Contributor provides its Contributions) on an "AS IS" BASIS,
146 |       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 |       implied, including, without limitation, any warranties or conditions
148 |       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 |       PARTICULAR PURPOSE. You are solely responsible for determining the
150 |       appropriateness of using or redistributing the Work and assume any
151 |       risks associated with Your exercise of permissions under this License.
152 | 
153 |    8. Limitation of Liability. In no event and under no legal theory,
154 |       whether in tort (including negligence), contract, or otherwise,
155 |       unless required by applicable law (such as deliberate and grossly
156 |       negligent acts) or agreed to in writing, shall any Contributor be
157 |       liable to You for damages, including any direct, indirect, special,
158 |       incidental, or consequential damages of any character arising as a
159 |       result of this License or out of the use or inability to use the
160 |       Work (including but not limited to damages for loss of goodwill,
161 |       work stoppage, computer failure or malfunction, or any and all
162 |       other commercial damages or losses), even if such Contributor
163 |       has been advised of the possibility of such damages.
164 | 
165 |    9. Accepting Warranty or Additional Liability. While redistributing
166 |       the Work or Derivative Works thereof, You may choose to offer,
167 |       and charge a fee for, acceptance of support, warranty, indemnity,
168 |       or other liability obligations and/or rights consistent with this
169 |       License. However, in accepting such obligations, You may act only
170 |       on Your own behalf and on Your sole responsibility, not on behalf
171 |       of any other Contributor, and only if You agree to indemnify,
172 |       defend, and hold each Contributor harmless for any liability
173 |       incurred by, or claims asserted against, such Contributor by reason
174 |       of your accepting any such warranty or additional liability.
175 | 
176 |    END OF TERMS AND CONDITIONS
177 | 
178 |    APPENDIX: How to apply the Apache License to your work.
179 | 
180 |       To apply the Apache License to your work, attach the following
181 |       boilerplate notice, with the fields enclosed by brackets "[]"
182 |       replaced with your own identifying information. (Don't include
183 |       the brackets!)  The text should be enclosed in the appropriate
184 |       comment syntax for the file format. We also recommend that a
185 |       file or class name and description of purpose be included on the
186 |       same "printed page" as the copyright notice for easier
187 |       identification within third-party archives.
188 | 
189 |    Copyright [yyyy] [name of copyright owner]
190 | 
191 |    Licensed under the Apache License, Version 2.0 (the "License");
192 |    you may not use this file except in compliance with the License.
193 |    You may obtain a copy of the License at
194 | 
195 |        http://www.apache.org/licenses/LICENSE-2.0
196 | 
197 |    Unless required by applicable law or agreed to in writing, software
198 |    distributed under the License is distributed on an "AS IS" BASIS,
199 |    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 |    See the License for the specific language governing permissions and
201 |    limitations under the License.
202 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # llm-paper-notes
  2 | Notes from the [Latent Space paper club](https://www.latent.space/about#§components). Follow along or start your own!
  3 | 
  4 | ---
  5 | 
  6 | 1. **[Attention Is All You Need](https://arxiv.org/abs/1706.03762):** Query, Key, and Value are all you need\* (\*Also position embeddings, multiple heads, feed-forward layers, skip-connections, etc.)
  7 | 
  8 | 2. **[GPT: Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf):** Decoder is all you need\* (\*Also, pre-training + finetuning)
  9 | 
 10 | 3. **[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805):** Encoder is all you need\*. Left-to-right language modeling is NOT all you need. (\*Also, pre-training + finetuning)
 11 | 
 12 | 4. **[T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683):** Encoder-only or decoder-only is NOT all you need, though text-to-text is all you need\* (\*Also, pre-training + finetuning)
 13 | 
 14 | 5. **[GPT2: Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf):** Unsupervised pre-training is all you need?!
 15 | 
 16 | 6. **[GPT3: Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165):** Unsupervised pre-training + a few\* examples is all you need. (\*From 5 examples, in Conversational QA, to 50 examples in Winogrande, PhysicalQA, and TriviaQA)
 17 | 
 18 | 7. **[Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361):** Larger models trained on less data\* are what you need. (\*10x more compute should be spent on 5.5x larger model and 1.8x more tokens)
 19 | 
 20 | 8. **[Chinchilla: Training Compute-Optimal Large Language Models](https://arxiv.org/abs/2203.15556):** Smaller models trained on more data\* are what you need. (\*10x more compute should be spent on 3.2x larger model and 3.2x more tokens)
 21 | 
 22 | 9. **[LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971):** Smoler models trained longer—on public data—is all you need
 23 | 
 24 | 10. **[InstructGPT: Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155):** 40 labelers are all you need\* (\*Plus supervised fine-tuning, reward modeling, and PPO)
 25 | 
 26 | 11. **[LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685):** One rank is all you need
 27 | 
 28 | 12. **[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314):** 4-bit is all you need\* (\*Plus double quantization and paged optimizers)
 29 | 
 30 | 13. **[DPR: Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906):** Dense embeddings are all you need\* (\*Also, high precision retrieval)
 31 | 
 32 | 14. **[RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401):** Semi-parametric models\* are all you need (\*Dense vector retrieval as non-parametric component; pre-trained LLM as parametric component)
 33 | 
 34 | 15. **[RETRO: Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426):** Retrieving based on input chunks and chunked cross attention are all you need
 35 | 
 36 | 16. **[Internet-augmented language models through few-shot prompting for open-domain question answering](https://arxiv.org/abs/2203.05115):** Google Search as retrieval is all you need
 37 | 
 38 | 17. **[HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels](https://arxiv.org/abs/2212.10496):** LLM-generated, hypothetical documents are all you need
 39 | 
 40 | 18. **[FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135):** For-loops in SRAM are all you need
 41 | 
 42 | 19. **[ALiBi; Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409):** Constant bias on the query-key dot-product is all you need\* (\*Also hyperparameter m and cached Q, K, V representations)
 43 | 
 44 | 20. **[Codex: Evaluating Large Language Models Trained on Code](https://arxiv.org/abs/2107.03374):** Finetuning on code is all you need
 45 | 
 46 | 21. **[Layer Normalization](https://arxiv.org/abs/1607.06450):** Consistent mean and variance at each layer is all you need
 47 | 
 48 | 22. **[On Layer Normalization in the Transformer Architecture](https://arxiv.org/abs/2002.04745):** Pre-layer norm, instead of post-layer norm, is all you need
 49 | 
 50 | 23. **[PPO: Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347):** Clipping your surrogate function is all you need
 51 | 
 52 | 24. **[WizardCoder: Empowering Code Large Language Models with Evol-Instruct](https://arxiv.org/abs/2306.08568):** Asking the model to make the question harder is all you need\* (\*Where do they get the responses to these harder questions though?!)
 53 | 
 54 | 25. **[Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288)**: Iterative finetuning, PPO, rejection sampling, and ghost attention is all you need\* (\*Also, 27,540 SFT annotations and more than 1 million binary comparison preference data)
 55 | 
 56 | 26. **[RWKV: Reinventing RNNs for the Transformer Era](https://arxiv.org/abs/2305.13048):** Linear attention during inference, via RNNs, is what you need
 57 | 
 58 | 27. **[RLAIF; Constitutional AI: Harmlessness from AI Feedback](https://arxiv.org/abs/2212.08073):** A natural language constitution\* and model feedback on harmlessness is all you need (\*16 different variants of harmlessness principles)
 59 | 
 60 | 28. **[Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538):** Noise in your softmax and expert regularization are all you need
 61 | 
 62 | 29. **[CLIP: Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/abs/2103.00020):** A projection layer between text and image embeddings is all you need\* (\*Also, 400 million image-text pairs)
 63 | 
 64 | 30. **[ViT; An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929):** Flattened 2D patches are all you need
 65 | 
 66 | 31. **[Generative Agents: Interactive Simulacra of Human Behavior](https://arxiv.org/abs/2304.03442):** Reflection, memory, and retrieval are all you need
 67 | 
 68 | 32. **[Out-of-Domain Finetuning to Bootstrap Hallucination Detection](https://eugeneyan.com/writing/finetuning/):** Open-source, permissive-use data is what you need
 69 | 
 70 | 33. **[DPO; Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290):** A separate reward model is NOT what you need
 71 | 
 72 | 34. **[Consistency Models](https://arxiv.org/abs/2303.01469):** Mapping to how diffusion adds gaussian noise to images is all you need
 73 | 
 74 | 35. **[LCM; Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378):** Consistency modeling in latent space is all you need\* (\*Also, a diffusion model to distill from)
 75 | 
 76 | 36. **[LCM-LoRA: A Universal Stable-Diffusion Acceleration Module](https://arxiv.org/abs/2311.05556):** Combining LoRAs is all you need
 77 | 
 78 | 37. **[Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models](https://arxiv.org/abs/2311.09210):** Asking the LLM to reflect on retrieved documents is all you need
 79 | 
 80 | 38. **[Emergent Abilities of Large Language Models](https://arxiv.org/abs/2206.07682):** The Bitter Lesson is all you need
 81 | 
 82 | 39. **[Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions](https://arxiv.org/abs/2309.10150):** The Bellman equation and replay buffers are all you need
 83 | 
 84 | 40. **[Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations](https://arxiv.org/abs/2312.06674):** Classification guidelines and the multiple-choice response are all you need
 85 | 
 86 | 41. **[REST^EM; Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models](https://arxiv.org/abs/2312.06585)**: Synthetic data and a reward function are all you need
 87 | 
 88 | 42. **[Mixture Of Experts Explained](https://huggingface.co/blog/moe#capacity-factor-and-communication-costs)**: MOE is an architectural choice that routes observations to subnetworks within a block. This lets us scale up parameter counts, and hence the capabilities of our network, by introducing more experts. However, it introduces new challenges: a higher parameter count to run inference with, training instabilities, and inference-time provisioning of experts across devices.
 89 | 
 90 | 43. **[Self-Instruct: Aligning Language Models with Self-Generated Instructions](https://arxiv.org/abs/2212.10560)**: ([GitHub Repo](https://discord.com/channels/822583790773862470/1197350122112168006), [LS Discord chat](https://discord.com/channels/822583790773862470/1197350122112168006/1199809444897361970))
 91 | 
 92 | 44. **[Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling](https://arxiv.org/abs/2304.01373)**: ([Presentation](https://docs.google.com/presentation/d/1EfNdcpRWqns8DvNIL-1bZdOI5tkbmcoBvMQxoMSbbI4/edit?usp=sharing)) A series of open source LLMs with completely reproducible datasets and checkpoints for LLM research. Novel studies (incl negative results) in memorization, data deduplication and data order, and gender debiasing.
 93 | 
 94 | 46. **[Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020)**: ([GitHub Repo](https://github.com/lucidrains/self-rewarding-lm-pytorch), [LS Discord Chat](https://discord.com/channels/822583790773862470/1197350122112168006/1204880366993674251)): Instead of training reward models from human preferences, LLMs can be used to provide their own rewards during training (aka WITHOUT distillation from GPT4). Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
 95 | 
 96 | 47. **[Building Your Own Product Copilot - Challenges, Opportunities, and Needs](https://arxiv.org/abs/2312.14231)**: Prompt engineering LLMs is NOT all you need.
 97 | 
 98 | 48. **[Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147)**: Aggregated losses across $2^n$-dim embeddings is all you need.
 99 | 
100 | 49. **[Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems](https://arxiv.org/abs/2312.15234)**: Bigger GPUs is not all you need.
101 | 
102 | 50. **[How to Generate and Use Synthetic Data for Finetuning](https://eugeneyan.com/writing/synthetic/)**: Synthetic data is almost all you need.
103 | 
104 | 51. **[Whisper: Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)**: 680k hrs of audio and multitask formulated as a sequence is all you need.
105 | 
106 | 52. **[Leveraging Large Language Models for NLG Evaluation: A Survey](https://arxiv.org/abs/2401.07103)**: [Slides](https://docs.google.com/presentation/d/17mbqFh8KFgn4SqRiKZmtjDn83N6d_jhqep8oPGbtB50/edit#slide=id.g2b89c4d23f6_0_329), [LS Discord Chat](https://discord.com/channels/822583790773862470/1197350122112168006/1207416409383108659) a survey paper of model and task eval techniques. Includes using Auto-J correlation instead of AlpacaEval which we liked.
107 | 


--------------------------------------------------------------------------------
/assets/MOE Expert Choice.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Expert Choice.png


--------------------------------------------------------------------------------
/assets/MOE Expert Distribution.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Expert Distribution.png


--------------------------------------------------------------------------------
/assets/MOE Expert Routing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Expert Routing.png


--------------------------------------------------------------------------------
/assets/MOE Instruction Tuning.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Instruction Tuning.png


--------------------------------------------------------------------------------
/assets/MOE Mixtral Tokens.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Mixtral Tokens.png


--------------------------------------------------------------------------------
/assets/MOE Parallel.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Parallel.png


--------------------------------------------------------------------------------
/assets/MOE RNN.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE RNN.png


--------------------------------------------------------------------------------
/assets/MOE Scaling Law.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE Scaling Law.png


--------------------------------------------------------------------------------
/assets/MOE vs Dense.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE vs Dense.png


--------------------------------------------------------------------------------
/assets/MOE-Block.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eugeneyan/llm-paper-notes/3ae0296cd6f6142fc3649cdcefd22c8014260f6d/assets/MOE-Block.png


--------------------------------------------------------------------------------
/notes/Mixture-Of-Experts-Hugging-Face.md:
--------------------------------------------------------------------------------
  1 | # Mixture Of Experts Explained
  2 | 
  3 | > URL : https://huggingface.co/blog/moe
  4 | 
  5 | ## Introduction
  6 | 
  7 | A Mixture of Experts (MOE) model involves two main things: training sub-networks that eventually specialize in certain tokens/tasks, and a router that learns which sub-network to route each token to. For transformers, we implement this using a MOE layer which replaces the traditional FFN component of a transformer block.
  8 | 
  9 | Fundamentally, using a MOE model means we are trading VRAM for compute: it allows us to scale up the total number of parameters in our model while keeping the number of active parameters constant. Therefore, the compute needed to run inference (in terms of FLOPs) remains constant.
 10 | 
 11 | ![Mixture Of Expert Layer](../assets/MOE-Block.png)
 12 | 
 13 | Here are some important characteristics of Mixture of Experts networks:
 14 | 
 15 | - **Sparse**: Not all of the network's weights are connected to one another (experts are separate sub-networks that don't share parameters).
 16 | - **Training**: They can be trained faster because the computational graph for each forward pass is smaller: fewer nodes are involved. [MOE-Mamba](https://arxiv.org/abs/2401.04081) reaches the same performance as Mamba in 2.2x fewer training steps while preserving the inference performance gains of Mamba against the Transformer.
 17 | - **High VRAM**: Every expert must be loaded into memory for the model to run an inference step.
 18 | - **Difficult to Fine-Tune**: These models seem to overfit quite easily, so fine-tuning them for specific tasks has been difficult.
 19 | - **Challenging to Optimize**: Load balancing is complex. If a token is routed to the wrong expert because the appropriate expert is overwhelmed, the quality of the response degrades. Capacity is also wasted if specific experts are never activated.
 20 | 
 21 | We can also see that the performance of a MOE network increases as the number of **total** parameters increases. Note here that **total** parameters is not the same as **active** parameters. Total parameters refer to the number of parameters in the MOE model while active parameters only refer to the number of parameters involved in an inference step.
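
To make the total vs. active distinction concrete, here is a rough sketch that counts the FFN parameters of a single MOE layer. The layer sizes, the two-matrix expert shape and the function name `moe_ffn_params` are illustrative assumptions, not taken from any specific model.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int) -> dict:
    """Count parameters in one MOE layer (biases ignored, hypothetical sizes)."""
    # Assumption: each expert is a plain two-matrix FFN (up- and down-projection).
    per_expert = 2 * d_model * d_ff
    # The router produces one logit per expert from the token representation.
    router = d_model * num_experts
    total = num_experts * per_expert + router
    # Only the top_k selected experts run for a given token.
    active = top_k * per_expert + router
    return {"total": total, "active": active}


counts = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(counts)
```

With these made-up sizes, all 8 experts must sit in memory, but each token only pays the compute cost of 2 of them.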
 22 | 
 23 | ![MOE Parameter Scaling Relationship](../assets/MOE%20Scaling%20Law.png)
 24 | 
 25 | 
 26 | 
 27 | MOEs can also be implemented in other architectures, even RNNs and LSTMs.
 28 | 
 29 | ![MOE RNN](../assets/MOE%20RNN.png)
 30 | 
 31 | ## Architecture
 32 | 
 33 | The key thing to note in a MOE model is the routing mechanism, which controls the expert we eventually dispatch each token to.
 34 | 
 35 | ### Routing
 36 | 
 37 | > Most networks tend to set k = 2, which means that we combine the outputs of at most 2 experts. Some infrastructure providers such as Fireworks have chosen to provide k = 3 Mixtral in certain cases.
 38 | >
 39 | > Fundamentally, k is a hyper-parameter that needs to be tuned. There are diminishing gains and latency trade-offs as we increase the number of experts used per token. This is a topic of active study.
 40 | 
 41 | We utilise a learned gate network $G$ to determine which experts $E_i$ the token $x$ is sent to, and how much weight each expert's output receives:
 42 | $$
 43 | y = \sum_{i=1}^n G(x)_i E_i(x)
 44 | $$
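
The gated sum above can be sketched in plain Python. This is a minimal illustration, assuming the gate keeps the k largest router logits and applies a softmax over just those (some implementations apply the softmax before selecting). The toy experts, the router weights and the name `moe_forward` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def moe_forward(x, router_weights, experts, k=2):
    """Compute y = sum_i G(x)_i * E_i(x), with G(x)_i = 0 outside the top-k experts."""
    # Router logits: one dot product between x and each expert's router row.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep the indices of the k largest logits, then softmax over only those.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])
    y = [0.0] * len(x)
    for g, i in zip(gates, top):
        out = experts[i](x)  # only the selected experts are evaluated
        y = [acc + g * o for acc, o in zip(y, out)]
    return y, dict(zip(top, gates))


# Toy experts that just scale their input (stand-ins for real FFNs).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
y, gates = moe_forward([1.0, 2.0], router_weights, experts, k=2)
print(gates, y)
```

Because the gate is zero for unselected experts, their outputs never need to be computed, which is where the compute savings come from.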
 45 | 
 46 | 
 47 | There are a few different ways to decide which experts are chosen:
 48 | 
 49 | 1. Top-k : Use a softmax function based on the output of the addition and normalization component of the attention block
 50 | 
 51 | 2. Top-k with noise: Add some noise before applying softmax and sampling
 52 | 
 53 | 3. Random Routing : Softmax for the first expert and then random sampling of the second based on softmax outputs
 54 | 
 55 | 4. Expert Capacity : Calculate which experts are available based on the average number of tokens to process per expert, scaled by a capacity multiple (e.g. each expert has a capacity limit of 1.5x the average) - see below, where C represents the capacity multiple, T the number of tokens, N the number of experts and LF the token capacity of an expert
 56 | 
 57 |    $$
 58 |    LF = \frac{C\times T}{N}
 59 |    $$
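
As a worked example of the capacity formula, the sketch below computes LF and then dispatches tokens to experts until each expert is full. Rounding the capacity up and dropping overflow tokens are assumptions made for illustration; real implementations differ in how they handle overflow (for example by passing the token through the residual connection). The function names and the sample `assignments` are hypothetical.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    """LF = C * T / N, rounded up to a whole number of tokens (assumption)."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity_factor):
    """Send each token to its chosen expert unless that expert is already full."""
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(buckets[expert_idx]) < capacity:
            buckets[expert_idx].append(token_idx)
        else:
            dropped.append(token_idx)  # overflow: expert has reached capacity
    return capacity, buckets, dropped

# 8 tokens, 4 experts, capacity multiple 1.5; expert 0 is heavily oversubscribed.
assignments = [0, 0, 0, 0, 1, 2, 0, 3]
capacity, buckets, dropped = dispatch(assignments, num_experts=4, capacity_factor=1.5)
print(capacity, buckets, dropped)
```

With T = 8, N = 4 and C = 1.5 each expert can hold 3 tokens, so two of the five tokens routed to expert 0 overflow.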
 60 | 
 61 | Note that we want to make sure each expert receives a roughly equal share of the tokens to process, for two main reasons
 62 | 
 63 | - Experts can be overwhelmed if they keep getting chosen to process tokens
 64 | - Experts will not learn if they never receive tokens to process
 65 | 
 66 | Other methods that have been proposed include
 67 | 
 68 | - Expert Choice by Zhou et al. (2022): Allow experts to select the top t tokens from each sequence and then process them
 69 | - Soft mixtures of experts by Puigcerver et al.: In this model, experts act on *sequences* not tokens: each expert processes a weighted combination of all of the tokens in the input sequence.
 70 | 
 71 | #### Loss Functions
 72 | 
 73 | There are two main loss functions which we use when training a MOE network
 74 | 
 75 | 1. Auxiliary Loss : Encourage each expert to have equal importance and to receive an equal number of training examples
 76 | 2. Z-Loss : Penalize large logits entering the softmax function, therefore reducing potential rounding errors in routing
 77 | 
 78 | We want to penalise large logits because of rounding errors in the routers: the softmax exponentiates its inputs, so small errors in large logits are amplified. Switch Transformers experiment with casting different parameters to different datatypes in order to deal with this more efficiently.
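
The notes do not give formulas for these losses, so the sketch below uses formulations I believe match the Switch Transformer auxiliary loss (N times the sum over experts of f_i * P_i, where f_i is the fraction of tokens whose top choice is expert i and P_i is the mean router probability for expert i) and the ST-MoE router z-loss (the mean squared log-sum-exp of the router logits). Treat both as assumptions to check against the papers. The scaling coefficients that weight each loss during training are omitted, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, num_experts):
    """Auxiliary loss N * sum_i f_i * P_i over a batch of per-token router logits."""
    n_tokens = len(router_logits)
    probs = [softmax(row) for row in router_logits]
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1  # top-1 expert for this token
    f = [c / n_tokens for c in counts]  # fraction of tokens per expert
    P = [sum(p[i] for p in probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))

def router_z_loss(router_logits):
    """Mean of (log sum_j exp(logit_j))^2, which grows with logit magnitude."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse ** 2
    return total / len(router_logits)

balanced = [[5.0, 0.0], [0.0, 5.0]]   # tokens split evenly across 2 experts
collapsed = [[5.0, 0.0], [5.0, 0.0]]  # every token picks expert 0
print(load_balancing_loss(balanced, 2), load_balancing_loss(collapsed, 2))
print(router_z_loss([[0.0, 0.0]]), router_z_loss([[10.0, 10.0]]))
```

The auxiliary loss is lowest when routing is uniform and rises as tokens collapse onto one expert, while the z-loss rises with the magnitude of the logits regardless of which expert wins.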
 79 | 
 80 | ## Training
 81 | 
 82 | ### Fine-Tuning
 83 | 
 84 | Sparse models tend to benefit more from smaller batch sizes and higher learning rates, but they also tend to overfit easily.
 85 | 
 86 | Models seem to memorise the training data - hence performing well on knowledge-heavy tasks such as TriviaQA while struggling with reasoning-heavy tasks such as SuperGLUE. 
 87 | 
 88 | ![MOE vs Dense Model](../assets/MOE%20vs%20Dense.png)
 89 | 
 90 | However, when trained on instruction-tuning data, MOEs see a greater increase in performance than their dense counterparts. This can be seen below, where we observe a greater increase in the eval score of the MOE model when it is finetuned on single-task instruction data.
 91 | 
 92 | ![Instruction Tuning MOEs](../assets/MOE%20Instruction%20Tuning.png)
 93 | 
 94 | Data also seems to suggest that a higher dropout probability within each expert has a moderately positive effect. Other interesting tricks include upcycling - where we initialise each expert from the weights of a dense model's feed-forward network.
 95 | 
 96 | This seems to speed up the training process by a significant proportion.
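
The upcycling idea can be sketched in a few lines: every expert starts as an independent copy of the dense feed-forward weights, and only the router is new. The weight layout (a dict of nested lists), the router initialisation (small Gaussian noise) and the function name `upcycle` are assumptions for illustration, not the procedure of any specific paper or library.

```python
import copy
import random

def upcycle(dense_ffn_weights, num_experts, hidden_size, seed=0):
    """Build MoE layer weights from a dense FFN: copied experts plus a fresh router."""
    rng = random.Random(seed)
    # Each expert gets its own deep copy so experts can diverge during training.
    experts = [copy.deepcopy(dense_ffn_weights) for _ in range(num_experts)]
    # The router has no dense counterpart, so it is randomly initialised (assumption).
    router = [[rng.gauss(0.0, 0.02) for _ in range(num_experts)]
              for _ in range(hidden_size)]
    return {"experts": experts, "router": router}

# Hypothetical tiny dense FFN with hidden size 2.
dense = {"w_in": [[0.1, 0.2], [0.3, 0.4]], "w_out": [[0.5, 0.6], [0.7, 0.8]]}
moe = upcycle(dense, num_experts=4, hidden_size=2)
print(len(moe["experts"]), len(moe["router"]), len(moe["router"][0]))
```

Since every expert initially computes the same function as the dense layer, the upcycled model starts close to the dense model's behaviour, and training then lets the experts specialise.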
 97 | 
 98 | ## Inference
 99 | 
100 | Note that when we refer to MOE inference, we mean inference per token.
101 | 
102 | MOE systems are challenging to run inference for because we cannot predict the load on each expert ahead of time. If there are multiple consecutive tokens that are related to each other in the sequence, this will result in an oversubscribed expert. Alternatively, if an expert is never chosen, then we have wasted compute that is just idly waiting for use.
103 | 
104 | ### Running Things in Parallel
105 | 
106 | There are four ways to achieve parallelism.
107 | 
108 | - **Model parallelism:** the model is partitioned across cores, and the data is replicated across cores.
109 | - **Data parallelism:** the same weights are replicated across all cores, and the data is partitioned across cores.
110 | - **Model and data parallelism:** we can partition the model and the data across cores. Note that different cores process different batches of data.
111 | - **Expert parallelism:** experts are placed on different workers. If combined with data parallelism, each core has a different expert and the data is partitioned across all cores.
112 | 
113 | ![MOE Parallelism](../assets/MOE%20Parallel.png)
114 | 
115 | ### Other Approaches
116 | 
117 | 1. **Distillation**: Distil our MOE model into a dense equivalent. With this approach, we can keep ~30-40% of the sparsity gains. Fedus et al. (2021), for example, distil a sparse MOE T5 model into a dense T5-Base model that is about 100 times smaller while still preserving part of the sparsity gains.
118 | 2. **Modify Routing**: Route full sentences or tasks to an expert so that more information/context can be extracted
119 | 3. **Aggregation of MOE**: Merging the weights of the expert, reducing parameters at inference time.
120 | 4. **Custom Kernels**: Explore new ways to batch the operations to take advantage of GPU parallelism. This is explored in MegaBlocks, which expresses MoE layers as block-sparse operations that can accommodate imbalanced assignments in matrix multiplication (when experts have different utilisation).
121 | 
122 | ## Challenges
123 | 
124 | ### Expert Specialisation 
125 | 
126 | Expert specialisation seems to happen at the token level rather than the sequence level.
127 | 
128 | ![Token Specific MOE](../assets/MOE%20Expert%20Routing.png)
129 | 
130 | We can see a similar example in the Mixtral MOE paper where they show the following diagrams
131 | 
132 | ![Mixtral MOE Expert Routing](../assets/MOE%20Mixtral%20Tokens.png)
133 | 
134 | ![Mixtral MOE Expert Routing](../assets/MOE%20Expert%20Choice.png)
135 | 
136 | 
137 | 
138 | ## Examples
139 | 
140 | There are a variety of different MOE models that have recently been developed such as Mixtral 8x7b and Phixtral. 
141 | 
142 | ### Mixtral 8x7b
143 | 
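144 | Mixtral 8x7b uses a collection of feed-forward networks: each decoder layer has 8 experts, each built from three linear projections (`w1`, `w2`, `w3`), as the module printout below shows. It is not 8 copies of a 7B model: only the feed-forward blocks are replicated, while the attention layers and embeddings are shared.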
145 | 
146 | ```
147 | MixtralForCausalLM(
148 |   (model): MixtralModel(
149 |     (embed_tokens): Embedding(32000, 4096)
150 |     (layers): ModuleList(
151 |       (0-31): 32 x MixtralDecoderLayer(
152 |         (self_attn): MixtralAttention(
153 |           (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
154 |           (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
155 |           (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
156 |           (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
157 |           (rotary_emb): MixtralRotaryEmbedding()
158 |         )
159 |         (block_sparse_moe): MixtralSparseMoeBlock(
160 |           (gate): Linear4bit(in_features=4096, out_features=8, bias=False)
161 |           (experts): ModuleList(
162 |             (0-7): 8 x MixtralBLockSparseTop2MLP(
163 |               (w1): Linear4bit(in_features=4096, out_features=14336, bias=False)
164 |               (w2): Linear4bit(in_features=14336, out_features=4096, bias=False)
165 |               (w3): Linear4bit(in_features=4096, out_features=14336, bias=False)
166 |               (act_fn): SiLU()
167 |             )
168 |           )
169 |         )
170 |         (input_layernorm): MixtralRMSNorm()
171 |         (post_attention_layernorm): MixtralRMSNorm()
172 |       )
173 |     )
174 |     (norm): MixtralRMSNorm()
175 |   )
176 |   (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
177 | )
178 | ```
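
To see why "8x7b" does not mean 56B parameters, the shapes in the printout above can be turned into a rough parameter count. This is a back-of-the-envelope sketch: it counts only the linear and embedding weights shown in the printout, ignores the RMSNorm weights, and assumes two experts are active per token (taken from the `Top2` in the expert class name).

```python
# Dimensions read off the module printout.
HIDDEN, FFN, VOCAB = 4096, 14336, 32000
KV_DIM = 1024            # out_features of k_proj and v_proj
LAYERS, EXPERTS, TOP_K = 32, 8, 2

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
router = HIDDEN * EXPERTS                              # gate
expert = 3 * HIDDEN * FFN                              # w1, w2, w3
embeddings = 2 * VOCAB * HIDDEN                        # embed_tokens and lm_head

def count(experts_used):
    """Parameters touched when `experts_used` experts run in every layer."""
    per_layer = attention + router + experts_used * expert
    return LAYERS * per_layer + embeddings

total_params = count(EXPERTS)   # every expert counted
active_params = count(TOP_K)    # only the experts used for one token
print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
```

The attention layers, router and embeddings are counted once in both cases, which is why the total is well under 8 x 7B and the active count is well under 2 x 7B.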
179 | 
180 | 
181 | 
182 | ### Mixtral Bits
183 | 
184 | Mixtral upsampled the proportion of multilingual data during pre-training. This in turn increases the ability of the model to perform well on multilingual benchmarks while maintaining a high accuracy in English.
185 | 
186 | # Relevant Resources/Useful Links
187 | 
188 | 1. [MOEs by Javid Lakha](https://blog.javid.io/p/mixtures-of-experts) : Good walkthrough and links to more papers
189 | 2. [MOE hardware requirements](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/discussions/3) : Forum page to understand hardware requirements to run an MOE system for inference
190 | 3. [FasterMOE](https://dl.acm.org/doi/10.1145/3503221.3508418) : Explores how to speed up MOE inference by examining a variety of different factors and blockers in normal MOE inference, resulting in a 17x speedup with their suggested changes.
191 | 4. [Upcycling MOEs](https://arxiv.org/abs/2212.05055): Explores how to speed up MOE training by initialising experts from original FFN network weights
192 | 5. [Instruction Tuned MOEs](https://arxiv.org/abs/2305.14705): Using Instruction Tuning and doing some comparisons against a dense model


--------------------------------------------------------------------------------