├── .gitignore
├── screenshot.png
├── run.bat
├── setup.bat
└── readme.md


/.gitignore:
--------------------------------------------------------------------------------
env
models

--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/nistvan86/continuedev-llamacpp-gpu-llm-server/HEAD/screenshot.png

--------------------------------------------------------------------------------
/run.bat:
--------------------------------------------------------------------------------
@echo off
call .\env\Scripts\activate.bat
python -m llama_cpp.server --model models/phind-codellama-34b-v2.Q4_K_M.gguf --n_gpu_layers 99999 --verbose True

--------------------------------------------------------------------------------
/setup.bat:
--------------------------------------------------------------------------------
python -m venv env
call .\env\Scripts\activate.bat

set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python[server]==0.1.83 --force-reinstall --upgrade --no-cache-dir --verbose

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
This repo is unmaintained; please use Ollama with continue.dev instead, as described here: https://github.com/nistvan86/continuedev-llamacpp-gpu-llm-server/issues/1#issuecomment-1822391864

---

# continue.dev GGML client + llama-cpp-python server with GGUF model and GPU offloading
This is a sample Windows helper repo which uses llama-cpp-python with a GGUF model and NVIDIA GPU offloading to serve as a local LLM for the continue.dev VSCode extension.

This is based on the https://continue.dev/docs/customization#local-models-with-ggml example.

![Alt text](screenshot.png?raw=true "Running")

## Install
Clone this repo somewhere.

You need Python (tested with 3.10.11 x64 on Windows 11), the NVIDIA CUDA Toolkit (tested with v11.7) and the Visual Studio Build Tools 2022 installed.

Run `setup.bat`. It should create a virtualenv and install the llama-cpp-python server with CUBLAS support compiled in.

## Model download
You need a GGUF model that fits into your GPU's memory. TheBloke's model pages each have a table detailing the memory requirements of the different quantizations.
By default this repo expects `phind-codellama-34b-v2.Q4_K_M.gguf` (https://huggingface.co/TheBloke/Phind-CodeLlama-34B-v2-GGUF) placed in a `models` subfolder. This fits into a 24 GB GPU (e.g. RTX 3090/4090); `run.bat` is configured to offload all layers to VRAM.
You need to fine-tune this based on your PC's specs. If you prefer to script the download, see the optional snippet at the end of this readme.

## Running and using in continue.dev
Execute `run.bat`; the model should be loaded onto the GPU and the Uvicorn server prompt shown with the host and port. (A quick way to verify the server responds is shown in the smoke-test snippet at the end of this readme.)

Install the continue.dev extension in VSCode, then on its tab enter `/config`.
In `config.py` add the necessary GGML config.

Add the import at the top:

    from continuedev.src.continuedev.libs.llm.ggml import GGML

Change the model declaration in the config:

    config = ContinueConfig(
        ...
        models=Models(
            default=GGML(max_context_length=2048)
        )
        ...
    )

Reload VSCode, and have fun.
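
## Optional: scripting the model download
If you would rather script the download than fetch the file manually, here is a minimal sketch using the `huggingface_hub` package. This package is not installed by `setup.bat` (run `pip install huggingface_hub` in the activated venv first), and the file name `download_model.py` is just a suggestion:

    # download_model.py - hypothetical helper, not part of this repo.
    # Downloads the default GGUF model into the models/ subfolder.
    from huggingface_hub import hf_hub_download

    hf_hub_download(
        repo_id="TheBloke/Phind-CodeLlama-34B-v2-GGUF",
        filename="phind-codellama-34b-v2.Q4_K_M.gguf",
        local_dir="models",
    )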
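
## Optional: smoke testing the server
Once `run.bat` is up, you can check that the server answers before wiring up continue.dev. This is a minimal sketch, assuming the llama-cpp-python server's default host and port (`localhost:8000`) and its OpenAI-compatible completions endpoint; it also needs the `requests` package (`pip install requests`), which `setup.bat` does not install. The file name `smoke_test.py` is just a suggestion:

    # smoke_test.py - hypothetical helper, not part of this repo.
    # Sends one completion request to the local server and prints the reply.
    import requests

    response = requests.post(
        "http://localhost:8000/v1/completions",
        json={"prompt": "def fibonacci(n):", "max_tokens": 64},
        timeout=300,
    )
    response.raise_for_status()
    print(response.json()["choices"][0]["text"])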
--------------------------------------------------------------------------------