├── .gitignore
├── screenshot.png
├── run.bat
├── setup.bat
└── readme.md


/.gitignore:
--------------------------------------------------------------------------------
env
models

--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nistvan86/continuedev-llamacpp-gpu-llm-server/HEAD/screenshot.png

--------------------------------------------------------------------------------
/run.bat:
--------------------------------------------------------------------------------
@echo off
call .\env\Scripts\activate.bat
python -m llama_cpp.server --model models/phind-codellama-34b-v2.Q4_K_M.gguf --n_gpu_layers 99999 --verbose True

--------------------------------------------------------------------------------
/setup.bat:
--------------------------------------------------------------------------------
python -m venv env
call .\env\Scripts\activate.bat

set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python[server]==0.1.83 --force-reinstall --upgrade --no-cache-dir --verbose

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
This repo is unmaintained; please use Ollama with continue.dev instead, as described here: https://github.com/nistvan86/continuedev-llamacpp-gpu-llm-server/issues/1#issuecomment-1822391864

---

# continue.dev GGML client + llama-cpp-python server with GGUF model and GPU offloading
This is a sample Windows helper repo which uses llama-cpp-python with a GGUF model and NVIDIA GPU offloading to serve as a local LLM for the continue.dev VSCode extension.

This is based on the https://continue.dev/docs/customization#local-models-with-ggml example.

![Alt text](screenshot.png?raw=true "Running")

## Install
Clone this repo somewhere.

You need Python (tested with 3.10.11 x64 on Windows 11), the NVIDIA CUDA Toolkit (tested with v11.7) and the Visual Studio Build Tools 2022 installed.

Run `setup.bat`. It should create a virtualenv and install the llama-cpp-python server with CUBLAS support compiled in.

## Model download
You need a GGUF model that fits into your GPU's memory. TheBloke's model pages each have a table detailing the memory requirements of the different quantizations.
By default this repo expects `phind-codellama-34b-v2.Q4_K_M.gguf` (https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF) placed in a `models` subfolder. This fits into a 24 GB GPU (e.g. RTX 3090/4090); `run.bat` is configured to offload all layers to VRAM.
You need to fine-tune this based on your PC's specs. If you prefer to script the download, see the optional snippet at the end of this readme.

## Running and using in continue.dev
Execute `run.bat`; the model should be loaded onto the GPU and the Uvicorn server prompt shown with the host and port. (A quick way to verify the server responds is shown in the smoke-test snippet at the end of this readme.)

Install the continue.dev extension in VSCode, then on its tab enter `/config`.
In `config.py` add the necessary GGML config.

Add the import at the top:

    from continuedev.src.continuedev.libs.llm.ggml import GGML

Change the model declaration in the config:

    config = ContinueConfig(
        ...
        models=Models(
            default=GGML(max_context_length=2048)
        )
        ...
    )

Reload VSCode, and have fun.
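
## Optional: scripting the model download
If you would rather script the download than fetch the file manually, here is a minimal sketch using the `huggingface_hub` package. This package is not installed by `setup.bat` (run `pip install huggingface_hub` in the activated venv first), and the file name `download_model.py` is just a suggestion:

    # download_model.py - hypothetical helper, not part of this repo.
    # Downloads the default GGUF model into the models/ subfolder.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="TheBloke/Phind-CodeLlama-34B-v2-GGUF",
        filename="phind-codellama-34b-v2.Q4_K_M.gguf",
        local_dir="models",
    )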
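
## Optional: smoke testing the server
Once `run.bat` is up, you can check that the server answers before wiring up continue.dev. This is a minimal sketch, assuming the llama-cpp-python server's default host and port (`localhost:8000`) and its OpenAI-compatible completions endpoint; it also needs the `requests` package (`pip install requests`), which `setup.bat` does not install. The file name `smoke_test.py` is just a suggestion:

    # smoke_test.py - hypothetical helper, not part of this repo.
    # Sends one completion request to the local server and prints the reply.
    import requests

    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={"prompt": "def fibonacci(n):", "max_tokens": 64},
        timeout=300,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["text"])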
--------------------------------------------------------------------------------