├── .gitignore ├── LICENSE └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | *$py.class 5 | 6 | # C extensions 7 | *.so 8 | 9 | # Distribution / packaging 10 | .Python 11 | build/ 12 | develop-eggs/ 13 | dist/ 14 | downloads/ 15 | eggs/ 16 | .eggs/ 17 | lib/ 18 | lib64/ 19 | parts/ 20 | sdist/ 21 | var/ 22 | wheels/ 23 | share/python-wheels/ 24 | *.egg-info/ 25 | .installed.cfg 26 | *.egg 27 | MANIFEST 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .nox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *.cover 49 | *.py,cover 50 | .hypothesis/ 51 | .pytest_cache/ 52 | cover/ 53 | 54 | # Translations 55 | *.mo 56 | *.pot 57 | 58 | # Django stuff: 59 | *.log 60 | local_settings.py 61 | db.sqlite3 62 | db.sqlite3-journal 63 | 64 | # Flask stuff: 65 | instance/ 66 | .webassets-cache 67 | 68 | # Scrapy stuff: 69 | .scrapy 70 | 71 | # Sphinx documentation 72 | docs/_build/ 73 | 74 | # PyBuilder 75 | .pybuilder/ 76 | target/ 77 | 78 | # Jupyter Notebook 79 | .ipynb_checkpoints 80 | 81 | # IPython 82 | profile_default/ 83 | ipython_config.py 84 | 85 | # pyenv 86 | # For a library or package, you might want to ignore these files since the code is 87 | # intended to run in multiple environments; otherwise, check them in: 88 | # .python-version 89 | 90 | # pipenv 91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 94 | # install all needed dependencies. 95 | #Pipfile.lock 96 | 97 | # poetry 98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. 99 | # This is especially recommended for binary packages to ensure reproducibility, and is more 100 | # commonly ignored for libraries. 101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control 102 | #poetry.lock 103 | 104 | # pdm 105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. 106 | #pdm.lock 107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it 108 | # in version control. 109 | # https://pdm.fming.dev/#use-with-ide 110 | .pdm.toml 111 | 112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm 113 | __pypackages__/ 114 | 115 | # Celery stuff 116 | celerybeat-schedule 117 | celerybeat.pid 118 | 119 | # SageMath parsed files 120 | *.sage.py 121 | 122 | # Environments 123 | .env 124 | .venv 125 | env/ 126 | venv/ 127 | ENV/ 128 | env.bak/ 129 | venv.bak/ 130 | 131 | # Spyder project settings 132 | .spyderproject 133 | .spyproject 134 | 135 | # Rope project settings 136 | .ropeproject 137 | 138 | # mkdocs documentation 139 | /site 140 | 141 | # mypy 142 | .mypy_cache/ 143 | .dmypy.json 144 | dmypy.json 145 | 146 | # Pyre type checker 147 | .pyre/ 148 | 149 | # pytype static type analyzer 150 | .pytype/ 151 | 152 | # Cython debug symbols 153 | cython_debug/ 154 | 155 | # PyCharm 156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can 157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore 158 | # and can be added to the global gitignore or merged into 
this file. For a more nuclear 159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder. 160 | #.idea/ 161 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2023 Mark Saroufim and Andreas Köpf (gpu-mode) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # GPU MODE Resource Stream 2 | [![](https://dcbadge.limes.pink/api/server/gpumode?style=flat)](https://discord.gg/gpumode) 3 | 4 | [https://discord.gg/gpumode](https://discord.gg/gpumode) 5 | 6 | Here you will find a collection of CUDA-related material (books, papers, blog posts, YouTube videos, tweets, implementations, etc.). We also collect information on higher-level tools for performance optimization and kernel development, like [Triton](https://triton-lang.org) and `torch.compile()` ... whatever makes the GPUs go brrrr. 7 | 8 | Know a great resource we should add? Please see [How to contribute](#how-to-contribute). 9 | 10 | 11 | ## Lectures / Reading Group Live Sessions 12 | 13 | You can find a list of upcoming lectures under Events in the channel list (sidebar) of our [Discord server](https://discord.gg/gpumode). 14 | 15 | Recordings of the weekly lectures are published on our [YouTube channel](https://www.youtube.com/@GPUMODE). Material (code, slides) for the individual lectures can be found in the [lectures](https://github.com/gpu-mode/lectures) repository. 
16 | 17 | 18 | ## 1st Contact with CUDA 19 | - [An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/) 20 | - [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/) 21 | - [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/) 22 | - Basic terminology: Thread block, Warp, Streaming Multiprocessor: [Wiki: Thread Block](https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)), [A tour of CUDA](https://tbetcke.github.io/hpc_lecture_notes/cuda_introduction.html) 23 | - [GPU Performance Background User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html) 24 | - [OLCF NVIDIA CUDA Training Series](https://www.olcf.ornl.gov/cuda-training-series/), talk recordings can be found under the presentation footer for each lecture; [exercises](https://github.com/olcf/cuda-training-series) 25 | - [GTC 2022 - CUDA: New Features and Beyond - Stephen Jones](https://www.youtube.com/watch?v=SAm4gwkj2Ko) 26 | - Intro video: [Writing Code That Runs FAST on a GPU](https://youtu.be/8sDg-lD1fZQ) 27 | - 12-hour CUDA tutorial: [Introduction of CUDA and writing kernels in CUDA](https://www.youtube.com/watch?v=86FAWCzIe_4) 28 | 29 | 30 | ## 2nd Contact 31 | - [CUDA Refresher](https://developer.nvidia.com/blog/tag/cuda-refresher/) 32 | 33 | ## Hazy Research 34 | 35 | The MLSys-oriented research group at Stanford led by Chris Re, with 36 | alumni Tri Dao, Dan Fu, and many others. A goldmine. 
37 | 38 | - [Building Blocks for AI 39 | Systems](https://github.com/HazyResearch/aisys-building-blocks): 40 | Their collection of resources similar to this one, with many great links 41 | - [Data-Centric AI](https://github.com/HazyResearch/data-centric-ai): 42 | An older such collection 43 | - [Blog](https://hazyresearch.stanford.edu/blog) 44 | - [ThunderKittens](https://hazyresearch.stanford.edu/blog/2024-05-12-tk): 45 | (May 2024) A DSL within CUDA; this blog post has good background on 46 | getting good H100 performance 47 | - [Systems for Foundation Models, and Foundation Models for 48 | Systems](https://neurips.cc/virtual/2023/invited-talk/73990): Chris 49 | Re's keynote from NeurIPS Dec 2023 50 | 51 | ## Papers, Case Studies 52 | - [A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library](https://arxiv.org/abs/2312.11918) 53 | - [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM) 54 | - [Anatomy of high-performance matrix multiplication](https://dl.acm.org/doi/10.1145/1356052.1356053) 55 | 56 | 57 | ## Books 58 | - [Programming Massively Parallel Processors: A Hands-on Approach](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311) 59 | - [CUDA by Example: An Introduction to General-Purpose GPU Programming](https://edoras.sdsu.edu/~mthomas/docs/cuda/cuda_by_example.book.pdf); [code](https://github.com/tpn/cuda-by-example) 60 | - [The CUDA Handbook](https://www.cudahandbook.com/) 61 | - [The Book of Shaders](https://thebookofshaders.com/): a guide through the abstract and complex universe of fragment shaders (not CUDA, but GPU-related) 62 | - [Art of HPC](https://theartofhpc.com/): four books on HPC more generally; they do not specifically cover GPUs, but the lessons broadly apply 63 | 64 | ## CUDA Courses 65 | - [HetSys: Programming Heterogeneous Computing Systems with GPUs and other 
Accelerators](https://safari.ethz.ch/projects_and_seminars/fall2022/doku.php?id%253Dheterogeneous_systems) 66 | - [Heterogeneous Parallel Programming Class](https://www.youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6SrgdeSiuf9Tb) (YouTube playlist) Prof. Wen-mei Hwu, University of Illinois 67 | - [Official YouTube channel for "Programming Massively Parallel Processors: A Hands-on Approach"](https://www.youtube.com/@pmpp-book), course playlist: [Applied Parallel Programming](https://www.youtube.com/playlist?list=PLRRuQYjFhpmvu5ODQoY2l7D0ADgWEcYAX) 68 | - [Programming Parallel Computers](https://ppc-exercises.cs.aalto.fi/courses); covers both CUDA and CPU-parallelism. Use [Open Course Version](https://ppc-exercises.cs.aalto.fi/course/open2024a) and you can even submit your own solutions to the exercises for testing and benchmarking. 69 | 70 | 71 | ## CUDA Grandmasters 72 | 73 | ### Tri Dao 74 | - x: [@tri_dao](https://twitter.com/tri_dao), gh: [tridao](https://github.com/tridao) 75 | - [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention), [paper](https://arxiv.org/abs/2205.14135) 76 | - [state-spaces/mamba](https://github.com/state-spaces/mamba), paper: [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752), minimal impl: [mamba-minimal](https://github.com/johnma2006/mamba-minimal) 77 | 78 | 79 | ### Tim Dettmers 80 | - x: [@Tim_Dettmers](https://twitter.com/Tim_Dettmers), gh: [TimDettmers](https://github.com/TimDettmers) 81 | - [TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes), docs: [docs](https://bitsandbytes.readthedocs.io/en/latest/) 82 | - [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) 83 | 84 | 85 | ### Sasha Rush 86 | - x: [@srush_nlp](https://twitter.com/srush_nlp), gh: [srush](https://github.com/srush) 87 | - [Sasha Rush's GPU Puzzles](https://github.com/srush/GPU-Puzzles), dshah3's [CUDA C++ 
version](https://github.com/dshah3/GPU-Puzzles) & [walkthrough video](https://www.youtube.com/watch?v=3frRR6fycgM) 88 | - [Mamba: The Hard Way](https://srush.github.io/annotated-mamba/hard.html), code: [srush/annotated-mamba](https://github.com/srush/annotated-mamba) 89 | 90 | 91 | ## Practice 92 | - [Adnan Aziz and Anupam Bhatnagar GPU Puzzlers](http://www.gpupuzzlers.com/) 93 | 94 | 95 | ## PyTorch Performance Optimization 96 | - [Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/) 97 | - [Accelerating Generative AI with PyTorch II: GPT, Fast](https://pytorch.org/blog/accelerating-generative-ai-2/) 98 | - [Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning](https://blog.fireworks.ai/speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning-353bf6241248) 99 | - [Performance Debugging of Production PyTorch Models at Meta](https://pytorch.org/blog/performance-debugging-of-production-pytorch-models-at-meta/) 100 | 101 | 102 | ## PyTorch Internals & Debugging 103 | - [TorchDynamo Deep Dive](https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html) 104 | - [PyTorch Compiler Troubleshooting](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_troubleshooting.rst) 105 | - [PyTorch internals](http://blog.ezyang.com/2019/05/pytorch-internals/) 106 | - [PyTorch 2 internals](https://drive.google.com/file/d/1XBox0G3FI-71efQQjmqGh0-VkCd-AHPL/view) 107 | - Understanding GPU memory: [1: Visualizing All Allocations over Time](https://pytorch.org/blog/understanding-gpu-memory-1/), [2: Finding and Removing Reference Cycles](https://pytorch.org/blog/understanding-gpu-memory-2/) 108 | - Debugging memory using snapshots: [Debugging PyTorch memory use with snapshots](https://zdevito.github.io/2022/08/16/memory-snapshots.html) 109 | - CUDA caching allocator: 
[https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html) 110 | - Trace Analyzer: [PyTorch Trace Analysis for the Masses](https://pytorch.org/blog/trace-analysis-for-masses/) 111 | - [Holistic Trace Analysis (HTA)](https://hta.readthedocs.io/en/latest/), gh: [facebookresearch/HolisticTraceAnalysis](https://github.com/facebookresearch/HolisticTraceAnalysis) 112 | 113 | 114 | ## Code / Libs 115 | - [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass) 116 | 117 | 118 | ## Essentials 119 | - [Triton compiler tutorials](https://triton-lang.org/main/getting-started/tutorials/index.html) 120 | - [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) 121 | - [PyTorch: Custom C++ and CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html), Code: [pytorch/extension-cpp](https://github.com/pytorch/extension-cpp/tree/master) 122 | - [PyTorch C++ API](https://pytorch.org/cppdocs/index.html) 123 | - [pybind11 documentation](https://pybind11.readthedocs.io/en/stable/) 124 | - [NVIDIA Tensor Core Programming](https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/) 125 | - [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/#) 126 | - [How GPU Computing Works | GTC 2021](https://youtu.be/3l10o0DYJXg?si=t5FHswnibAbo3s0t) (more basic than the 2022 version) 127 | - [How CUDA Programming Works | GTC 2022](https://youtu.be/n6M8R8-PlnE?si=cJ4dWtpYaPoIuJ0q) 128 | - [CUDA Kernel optimization Part 1](https://www.youtube.com/watch?v=hOi3NWOPVR8) [Part 2](https://www.youtube.com/watch?v=NrWhZMHrP4w) 129 | - [PTX and ISA Programming Guide](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) (V8.3) 130 | - Compiler Explorer: Inspect PTX: [div 256 -> shr 8 example](https://godbolt.org/z/odb3191vK) 131 | 132 | 133 | ## Profiling 134 | - [Nsight Compute Profiling 
Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html) 135 | - [mcarilli/nsight.sh](https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223) - Favorite Nsight Systems profiling commands for PyTorch scripts 136 | - [Profiling GPU Applications with Nsight Systems](https://www.youtube.com/watch?v=kKANP0kL_hk) 137 | 138 | 139 | ## Python GPU Computing 140 | - [PyTorch](https://pytorch.org/) 141 | - [Triton](https://triton-lang.org/main/index.html), github: [openai/triton](https://github.com/openai/triton/) 142 | - [numba @cuda.jit](https://numba.readthedocs.io/en/stable/cuda/kernels.html) 143 | - [Apache TVM](https://tvm.apache.org/) 144 | - [JAX Pallas](https://jax.readthedocs.io/en/latest/pallas/index.html) 145 | - [CuPy](https://cupy.dev/): NumPy-compatible GPU computing 146 | - [NVIDIA Fuser](https://github.com/NVIDIA/Fuser/) 147 | - [Codon @gpu.kernel](https://docs.exaloop.io/codon/advanced/gpu), github: [exaloop/codon](https://github.com/exaloop/codon) 148 | - [Mojo](https://docs.modular.com/mojo/manual/) (part of the commercial [MAX Platform](https://www.modular.com/max) by [Modular](https://www.modular.com)) 149 | - NVIDIA Python Bindings: [CUDA Python](https://github.com/NVIDIA/cuda-python) (calling NVRTC to compile kernels, malloc, copy, launching kernels, ..), [cuDNN FrontEnd(FE) API](https://github.com/NVIDIA/cudnn-frontend), [CUTLASS Python Interface](https://github.com/NVIDIA/cutlass/tree/main/python) 150 | 151 | 152 | ## Advanced Topics, Research, Compilers 153 | - [TACO](http://tensor-compiler.org/): The Tensor Algebra Compiler, gh: [tensor-compiler/taco](https://github.com/tensor-compiler/taco) 154 | - [Mosaic compiler](https://github.com/manya-bansal/mosaic) C++ DSL for sparse and dense tensor algebra (built on top of TACO), [paper](https://dl.acm.org/doi/10.1145/3591236), [presentation](https://aha.stanford.edu/mosaic-interoperable-compiler-tensor-algebra) 155 | 156 | 157 | ## News 158 | - 
[SemiAnalysis](https://www.semianalysis.com/) 159 | 160 | 161 | ## Technical Blog Posts 162 | - [Cooperative Groups: Flexible CUDA Thread Programming](https://developer.nvidia.com/blog/cooperative-groups/) (Oct 04, 2017) 163 | - [A friendly introduction to machine learning compilers and optimizers](https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html) (Sep 7, 2021) 164 | 165 | 166 | ## Hardware Architecture 167 | - [NVIDIA H100 Whitepaper](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper) 168 | - [NVIDIA GH200 Whitepaper](https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper) 169 | - [AMD CDNA 3 Whitepaper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf) 170 | - [AMD MI300X Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf) 171 | - Video: [Can SRAM Keep Shrinking?](https://youtu.be/2G4_RZo41Zw) (by [Asianometry](https://www.asianometry.com/)) 172 | 173 | 174 | ## GPU-MODE Community Projects 175 | 176 | ### ring-attention 177 | - see our [ring-attention](https://github.com/gpu-mode/ring-attention) repo 178 | 179 | ### pscan 180 | - GPU Gems: [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), [PDF version (2007)](https://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/scan/doc/scan.pdf), impl: [stack overflow](https://stackoverflow.com/a/30835030/387870), nicer impl: [mattdean1/cuda](https://github.com/mattdean1/cuda) 181 | - [Accelerating Reduction and Scan Using Tensor Core Units](https://arxiv.org/abs/1811.09736) 182 | - Thrust: [Prefix Sums](https://docs.nvidia.com/cuda/thrust/index.html#prefix-sums), Reference: [scan variants](https://thrust.github.io/doc/group__prefixsums.html) 183 | - 
[CUB](https://nvlabs.github.io/cub/), part of cccl: [NVIDIA/cccl/tree/main/cub](https://github.com/NVIDIA/cccl/tree/main/cub) 184 | - SAM Algorithm: [Higher-Order and Tuple-Based Massively-Parallel Prefix Sums](https://userweb.cs.txstate.edu/~mb92/papers/pldi16.pdf) (licensed for non-commercial use only) 185 | - CUB Algorithm: [Single-pass Parallel Prefix Scan with Decoupled Look-back](https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back) 186 | - Group Experiments: [johnryan465/pscan](https://github.com/johnryan465/pscan), [andreaskoepf/pscan_kernel](https://github.com/andreaskoepf/pscan_kernel) 187 | 188 | 189 | ## Triton Kernels / Examples 190 | 191 | - [`unsloth`](https://github.com/unslothai/unsloth), which implements custom kernels in Triton for faster QLoRA training 192 | - Custom implementation of relative position attention ([link](https://github.com/pytorch-labs/segment-anything-fast/blob/main/segment_anything_fast/flash_4.py)) 193 | - Tri Dao's Triton implementation of Flash Attention: [flash_attn_triton.py](https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py) 194 | - YouTube playlist: [Triton Conference 2023](https://www.youtube.com/watch?v=ZGU0Yw7mORE&list=PLc_vA1r0qoiRZfUC3o4_yjj0FtWvodKAz) 195 | - [LightLLM](https://github.com/ModelTC/lightllm) with different Triton kernels for different LLMs 196 | 197 | 198 | ## How to contribute 199 | To share interesting CUDA-related links, please create a pull request for this file. See [editing files](https://docs.github.com/en/repositories/working-with-files/managing-files/editing-files) in the GitHub documentation. 200 | 201 | Or contact us on the **GPU MODE** Discord server: [https://discord.gg/gpumode](https://discord.gg/gpumode) 202 | --------------------------------------------------------------------------------