├── LICENSE └── README.md /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Varun Vasudeva 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Local LLaMA Server Setup Documentation 2 | 3 | _TL;DR_: A comprehensive guide to setting up a fully local and private language model server equipped with the following: 4 | - Inference Engine ([Ollama](https://github.com/ollama/ollama), [llama.cpp](https://github.com/ggml-org/llama.cpp), [vLLM](https://github.com/vllm-project/vllm)) 5 | - Chat Platform ([Open WebUI](https://github.com/open-webui/open-webui)) 6 | - Text-to-Speech Server ([OpenedAI Speech](https://github.com/matatonic/openedai-speech), [Kokoro FastAPI](https://github.com/remsky/Kokoro-FastAPI)) 7 | - Text-to-Image Server ([ComfyUI](https://github.com/comfyanonymous/ComfyUI)) 8 | 9 | ## Table of Contents 10 | 11 | - [Local LLaMA Server Setup Documentation](#local-llama-server-setup-documentation) 12 | - [Table of Contents](#table-of-contents) 13 | - [About](#about) 14 | - [Priorities](#priorities) 15 | - [System Requirements](#system-requirements) 16 | - [Prerequisites](#prerequisites) 17 | - [Docker](#docker) 18 | - [HuggingFace CLI](#huggingface-cli) 19 | - [Managing Models](#managing-models) 20 | - [Download Models](#download-models) 21 | - [Delete Models](#delete-models) 22 | - [General](#general) 23 | - [Startup Script](#startup-script) 24 | - [Scheduling Startup Script](#scheduling-startup-script) 25 | - [Configuring Script Permissions](#configuring-script-permissions) 26 | - [Auto-Login](#auto-login) 27 | - [Inference Engine](#inference-engine) 28 | - [Ollama](#ollama) 29 | - [llama.cpp](#llamacpp) 30 | - [vLLM](#vllm) 31 | - [Creating a Service](#creating-a-service) 32 | - [Open WebUI Integration](#open-webui-integration) 33 | - [Ollama vs. llama.cpp](#ollama-vs-llamacpp) 34 | - [vLLM vs. 
Ollama/llama.cpp](#vllm-vs-ollamallamacpp) 35 | - [Chat Platform](#chat-platform) 36 | - [Open WebUI](#open-webui) 37 | - [Text-to-Speech Server](#text-to-speech-server) 38 | - [OpenedAI Speech](#openedai-speech) 39 | - [Downloading Voices](#downloading-voices) 40 | - [Kokoro FastAPI](#kokoro-fastapi) 41 | - [Open WebUI Integration](#open-webui-integration-1) 42 | - [Comparison](#comparison) 43 | - [Text-to-Image Server](#text-to-image-server) 44 | - [ComfyUI](#comfyui) 45 | - [Open WebUI Integration](#open-webui-integration-2) 46 | - [SSH](#ssh) 47 | - [Firewall](#firewall) 48 | - [Remote Access](#remote-access) 49 | - [Tailscale](#tailscale) 50 | - [Installation](#installation) 51 | - [Exit Nodes](#exit-nodes) 52 | - [Local DNS](#local-dns) 53 | - [Third-Party VPN Integration](#third-party-vpn-integration) 54 | - [Verifying](#verifying) 55 | - [Inference Engine](#inference-engine-1) 56 | - [Open WebUI](#open-webui-1) 57 | - [Text-to-Speech Server](#text-to-speech-server-1) 58 | - [ComfyUI](#comfyui-1) 59 | - [Updating](#updating) 60 | - [General](#general-1) 61 | - [Nvidia Drivers \& CUDA](#nvidia-drivers--cuda) 62 | - [Ollama](#ollama-1) 63 | - [llama.cpp](#llamacpp-1) 64 | - [vLLM](#vllm-1) 65 | - [Open WebUI](#open-webui-2) 66 | - [OpenedAI Speech](#openedai-speech-1) 67 | - [Kokoro FastAPI](#kokoro-fastapi-1) 68 | - [ComfyUI](#comfyui-2) 69 | - [Troubleshooting](#troubleshooting) 70 | - [`ssh`](#ssh-1) 71 | - [Nvidia Drivers](#nvidia-drivers) 72 | - [Ollama](#ollama-2) 73 | - [vLLM](#vllm-2) 74 | - [Open WebUI](#open-webui-3) 75 | - [OpenedAI Speech](#openedai-speech-2) 76 | - [Monitoring](#monitoring) 77 | - [Notes](#notes) 78 | - [Software](#software) 79 | - [Hardware](#hardware) 80 | - [References](#references) 81 | - [Acknowledgements](#acknowledgements) 82 | 83 | ## About 84 | 85 | This repository outlines the steps to run a server for running local language models. It uses Debian specifically, but most Linux distros should follow a very similar process. It aims to be a guide for Linux beginners like me who are setting up a server for the first time. 86 | 87 | The process involves installing the requisite drivers, setting the GPU power limit, setting up auto-login, and scheduling the `init.bash` script to run at boot. All these settings are based on my ideal setup for a language model server that runs most of the day but a lot can be customized to suit your needs. 88 | 89 | ## Priorities 90 | 91 | - **Simplicity of setup process**: It should be relatively straightforward to set up the components of the solution. 92 | - **Stability of runtime**: The components should be stable and capable of running for weeks at a time without any intervention necessary. 93 | - **Ease of maintenance**: The components and their interactions should be uncomplicated enough that you know enough to maintain them as they evolve (because they *will* evolve). 94 | - **Aesthetics**: The result should be as close to a cloud provider's chat platform as possible. A homelab solution doesn't necessarily need to feel like it was cobbled together haphazardly. 95 | - **Open source**: The code should be able to be verified by a community of engineers. Chat platforms and LLMs involve large amounts of personal data conveyed in natural language and it's important to know that data isn't going outside your machine. 96 | 97 | ## System Requirements 98 | 99 | Any modern CPU and GPU combination should work for this guide. 
Previously, compatibility with AMD GPUs was an issue but the latest releases of Ollama have worked through this and [AMD GPUs are now supported natively](https://ollama.com/blog/amd-preview). 100 | 101 | For reference, this guide was built around the following system: 102 | - **CPU**: Intel Core i5-12600KF 103 | - **Memory**: 32GB 6000 MHz DDR5 RAM 104 | - **Storage**: 1TB M.2 NVMe SSD 105 | - **GPU**: Nvidia RTX 3090 24GB 106 | 107 | > [!NOTE] 108 | > **AMD GPUs**: Power limiting is skipped for AMD GPUs as [AMD has recently made it difficult to set power limits on their GPUs](https://www.reddit.com/r/linux_gaming/comments/1b6l1tz/no_more_power_limiting_for_amd_gpus_because_it_is/). Naturally, skip any steps involving `nvidia-smi` or `nvidia-persistenced` and the power limit in the `init.bash` script. 109 | > 110 | > **CPU-only**: You can skip the GPU driver installation and power limiting steps. The rest of the guide should work as expected. 111 | 112 | ## Prerequisites 113 | 114 | - Fresh install of Debian 115 | - Internet connection 116 | - Basic understanding of the Linux terminal 117 | - Peripherals like a monitor, keyboard, and mouse 118 | 119 | To install Debian on your newly built server hardware: 120 | 121 | - Download the [Debian ISO](https://www.debian.org/distrib/) from the official website. 122 | - Create a bootable USB using a tool like [Rufus](https://rufus.ie/en/) for Windows or [Balena Etcher](https://etcher.balena.io) for MacOS. 123 | - Boot into the USB and install Debian. 124 | 125 | For a more detailed guide on installing Debian, refer to the [official documentation](https://www.debian.org/releases/buster/amd64/). For those who aren't yet experienced with Linux, I recommend using the graphical installer - you will be given an option between the text-based installer and graphical installer. 126 | 127 | I also recommend installing a lightweight desktop environment like XFCE for ease of use. Other options like GNOME or KDE are also available - GNOME may be a better option for those using their server as a primary workstation as it is more feature-rich (and, as such, heavier) than XFCE. 128 | 129 | ### Docker 130 | 131 | Docker is a containerization platform that allows you to run applications in isolated environments. This subsection follows [this guide](https://docs.docker.com/engine/install/debian/) to install Docker Engine on Debian. 132 | 133 | - Run the following commands: 134 | ``` 135 | # Add Docker's official GPG key: 136 | sudo apt-get update 137 | sudo apt-get install ca-certificates curl 138 | sudo install -m 0755 -d /etc/apt/keyrings 139 | sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc 140 | sudo chmod a+r /etc/apt/keyrings/docker.asc 141 | 142 | # Add the repository to Apt sources: 143 | echo \ 144 | "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \ 145 | $(. 
/etc/os-release && echo "$VERSION_CODENAME") stable" | \ 146 | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null 147 | sudo apt-get update 148 | ``` 149 | - Install the Docker packages: 150 | ``` 151 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin 152 | ``` 153 | - Verify the installation: 154 | ``` 155 | sudo docker run hello-world 156 | ``` 157 | 158 | ### HuggingFace CLI 159 | 160 | 📖 [**Documentation**](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) 161 | 162 | > [!NOTE] 163 | > Only needed for llama.cpp/vLLM. 164 | 165 | - Create a new virtual environment: 166 | ``` 167 | python3 -m venv hf-env 168 | source hf-env/bin/activate 169 | ``` 170 | - Download the `huggingface_hub[cli]` package using `pip`: 171 | ``` 172 | pip install -U "huggingface_hub[cli]" 173 | ``` 174 | - Create an authentication token on https://huggingface.com 175 | - Log in to HF Hub: 176 | ``` 177 | huggingface-cli login 178 | ``` 179 | - Enter your token when prompted. 180 | - Run the following to verify your login: 181 | ``` 182 | huggingface-cli whoami 183 | ``` 184 | 185 | The result should be your username. 186 | 187 | #### Managing Models 188 | 189 | Models can be downloaded either to the default location (`.cache/huggingface/hub`) or to any local directory you specify. Where the model is stored can be defined using the `--local-dir` command line flag. Not specifying this will result in the model being stored in the default location. Storing the model in the folder where the packages for the inference engine are stored is good practice - this way, everything you need to run inference on a model is stored in the same place. However, if you use the same models with multiple backends frequently (e.g. using Qwen_QwQ-32B-Q4_K_M.gguf with both llama.cpp and vLLM), the default location is probably best. 190 | 191 | First, activate the virtual environment that contains `huggingface_hub`: 192 | ``` 193 | source hf-env/bin/activate 194 | ``` 195 | 196 | #### Download Models 197 | 198 | Models are downloaded using their HuggingFace tag. Here, we'll use bartowski/Qwen_QwQ-32B-GGUF as an example. To download a model, run: 199 | ``` 200 | huggingface-cli download bartowski/Qwen_QwQ-32B-GGUF Qwen_QwQ-32B-Q4_K_M.gguf --local-dir models 201 | ``` 202 | Ensure that you are in the correct directory when you run this. 203 | 204 | #### Delete Models 205 | 206 | To delete a model in the specified location, run: 207 | ``` 208 | rm 209 | ``` 210 | 211 | To delete a model in the default location, run: 212 | ``` 213 | huggingface-cli delete-cache 214 | ``` 215 | 216 | This will start an interactive session where you can remove models from the HuggingFace directory. In case you've been saving models in a different location than `.cache/huggingface`, deleting models from there will free up space but the metadata will remain in the HF cache until it is deleted properly. This can be done via the above command but you can also simply delete the model directory from `.cache/huggingface/hub`. 217 | 218 | ## General 219 | Update the system by running the following commands: 220 | ``` 221 | sudo apt update 222 | sudo apt upgrade 223 | ``` 224 | 225 | Now, we'll install the required GPU drivers that allow programs to utilize their compute capabilities. 226 | 227 | **Nvidia GPUs** 228 | - Follow Nvidia's [guide on downloading CUDA Toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian). 
The instructions are specific to your machine and the website will lead you to them interactively.
- Run the following commands:
```
sudo apt install linux-headers-amd64
sudo apt install nvidia-driver firmware-misc-nonfree
```
- Reboot the server.
- Run the following command to verify the installation:
```
nvidia-smi
```

**AMD GPUs**
- Add the `contrib non-free-firmware` components to `/etc/apt/sources.list`, then install the firmware and Mesa packages:
```
deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
apt-get install firmware-amd-graphics libgl1-mesa-dri libglx-mesa0 mesa-vulkan-drivers xserver-xorg-video-all
```
- Reboot the server.

## Startup Script

In this step, we'll create a script called `init.bash`. This script runs at boot to set the GPU power limit; later sections optionally add commands to it to start services like vLLM or ComfyUI. We set the GPU power limit lower than stock because testing shows only a 5-15% performance decrease for a 30% reduction in power consumption. This is especially important for servers that are running 24/7.

- Run the following commands:
```
touch init.bash
nano init.bash
```
- Add the following lines to the script:
```
#!/bin/bash
sudo nvidia-smi -pm 1
sudo nvidia-smi -pl <power_limit>
```
> Replace `<power_limit>` with the desired power limit in watts. For example, `sudo nvidia-smi -pl 250`.

For multiple GPUs, modify the script to set the power limit for each GPU:
```
sudo nvidia-smi -i 0 -pl <power_limit>
sudo nvidia-smi -i 1 -pl <power_limit>
```
- Save and exit the script.
- Make the script executable:
```
chmod +x init.bash
```

### Scheduling Startup Script

Adding the `init.bash` script to the crontab will schedule it to run at boot.

- Run the following command:
```
crontab -e
```
- Add the following line to the file:
```
@reboot /path/to/init.bash
```
> Replace `/path/to/init.bash` with the path to the `init.bash` script.

- (Optional) Add the following line to shut down the server at 12 AM:
```
0 0 * * * /sbin/shutdown -h now
```
- Save and exit the file.

### Configuring Script Permissions

We want `init.bash` to run the `nvidia-smi` commands without having to enter a password. This is done by giving `nvidia-persistenced` and `nvidia-smi` passwordless `sudo` permissions, which can be achieved by editing the `sudoers` file.

AMD users can skip this step as power limiting is not supported on AMD GPUs.

- Run the following command:
```
sudo visudo
```
- Add the following lines to the file:
```
<username> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-persistenced
<username> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi
```
> Replace `<username>` with your username.
- Save and exit the file.

> [!IMPORTANT]
> Ensure that you add these lines AFTER `%sudo ALL=(ALL:ALL) ALL`. The order of the lines in the file matters - the last matching line wins, so if you add these lines before `%sudo ALL=(ALL:ALL) ALL`, they will be ignored.

## Auto-Login

When the server boots up, we want it to automatically log in to a user account and run the `init.bash` script. This is done by configuring the `lightdm` display manager.
- Run the following command:
```
sudo nano /etc/lightdm/lightdm.conf
```
- Find the following commented line. It should be in the `[Seat:*]` section.
```
# autologin-user=
```
- Uncomment the line and add your username:
```
autologin-user=<username>
```
> Replace `<username>` with your username.
- Save and exit the file.

## Inference Engine

The inference engine is one of the primary components of this setup. It is the software that takes model files containing weights and makes it possible to get useful outputs from them. This guide allows a choice between llama.cpp, vLLM, and Ollama - all of these are popular inference engines with different priorities and strengths (note: Ollama uses llama.cpp under the hood and is essentially a CLI wrapper). It can be daunting to jump straight into the deep end with command line arguments in llama.cpp and vLLM. If you're a power user and enjoy the flexibility afforded by tight control over serving parameters, using either llama.cpp or vLLM will be a wonderful experience, and the choice between them largely comes down to the quantization format you prefer. However, if you're a beginner or aren't yet comfortable with this, Ollama can be a convenient stopgap while you build the skills you need - or the end of the line, if you decide it already does everything you need!

### Ollama

🌟 [**GitHub**](https://github.com/ollama/ollama)
📖 [**Documentation**](https://github.com/ollama/ollama/tree/main/docs)
🔧 [**Engine Arguments**](https://github.com/ollama/ollama/blob/main/docs/modelfile.md)

Ollama will be installed as a service, so it runs automatically at boot.

- Download Ollama from the official repository:
```
curl -fsSL https://ollama.com/install.sh | sh
```

We want our API endpoint to be reachable by the rest of the LAN. For Ollama, this means setting `OLLAMA_HOST=0.0.0.0` in `ollama.service`.

- Run the following command to edit the service:
```
systemctl edit ollama.service
```
- Find the `[Service]` section and add `Environment="OLLAMA_HOST=0.0.0.0"` under it. It should look like this:
```
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```
- Save and exit.
- Reload the daemon and restart Ollama:
```
systemctl daemon-reload
systemctl restart ollama
```

> [!TIP]
> If you installed Ollama manually or don't use it as a service, remember to run `ollama serve` to properly start the server. Refer to [Ollama's troubleshooting steps](#ollama-2) if you encounter an error.
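With `OLLAMA_HOST=0.0.0.0` set, it's worth confirming the API is actually reachable from another machine before wiring anything else up. A minimal check, assuming Ollama's default port of 11434 (`<server_ip>` is a placeholder for your server's LAN address):

```bash
# Run from another device on the same network; lists the models Ollama currently has available
curl http://<server_ip>:11434/api/tags
```

If this returns a JSON list (even an empty one), the endpoint is exposed correctly; if it times out, check your firewall rules.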
373 | 374 | ### llama.cpp 375 | 376 | 🌟 [**GitHub**](https://github.com/ggml-org/llama.cpp) 377 | 📖 [**Documentation**](https://github.com/ggml-org/llama.cpp/tree/master/docs) 378 | 🔧 [**Engine Arguments**](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) 379 | 380 | - Clone the llama.cpp GitHub repository: 381 | ``` 382 | git clone https://github.com/ggml-org/llama.cpp.git 383 | cd llama.cpp 384 | ``` 385 | - Build the binary: 386 | 387 | **CPU** 388 | ``` 389 | cmake -B build 390 | cmake --build build --config Release 391 | ``` 392 | 393 | **CUDA** 394 | ``` 395 | cmake -B build -DGGML_CUDA=ON 396 | cmake --build build --config Release 397 | ``` 398 | For other systems looking to use Metal, Vulkan and other low-level graphics APIs, view the complete [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) to leverage accelerated inference. 399 | 400 | ### vLLM 401 | 402 | 🌟 [**GitHub**](https://github.com/vllm-project/vllm) 403 | 📖 [**Documentation**](https://docs.vllm.ai/en/stable/index.html) 404 | 🔧 [**Engine Arguments**](https://docs.vllm.ai/en/stable/serving/engine_args.html) 405 | 406 | vLLM comes with its own OpenAI-compatible API that we can use just like Ollama. Where Ollama runs GGUF model files, vLLM can run AWQ, GPTQ, GGUF, BitsAndBytes, and safetensors (the default release type) natively. 407 | 408 | **Manual Installation (Recommended)** 409 | 410 | - Create a directory and virtual environment for vLLM: 411 | ``` 412 | mkdir vllm 413 | cd vllm 414 | python3 -m venv .venv 415 | source .venv/bin/activate 416 | ``` 417 | 418 | - Install vLLM using `pip`: 419 | ``` 420 | pip install vllm 421 | ``` 422 | 423 | - Serve with your desired flags. It uses port 8000 by default, but I'm using port 8556 here so it doesn't conflict with any other services: 424 | ``` 425 | vllm serve --port 8556 426 | ``` 427 | 428 | - To use as a service, add the following block to `init.bash` to serve vLLM on startup: 429 | ``` 430 | source .venv/bin/activate 431 | vllm serve --port 8556 432 | ``` 433 | > Replace `` with your desired model tag, copied from HuggingFace. 434 | 435 | **Docker Installation** 436 | 437 | - Run: 438 | ``` 439 | sudo docker run --gpus all \ 440 | -v ~/.cache/huggingface:/root/.cache/huggingface \ 441 | --env "HUGGING_FACE_HUB_TOKEN=" \ 442 | -p 8556:8000 \ 443 | --ipc=host \ 444 | vllm/vllm-openai:latest \ 445 | --model 446 | ``` 447 | > Replace `` with your HuggingFace Hub token and `` with your desired model tag, copied from HuggingFace. 448 | 449 | To serve a different model: 450 | 451 | - First stop the existing container: 452 | ``` 453 | sudo docker ps -a 454 | sudo docker stop 455 | ``` 456 | 457 | - If you want to run the exact same setup again in the future, skip this step. Otherwise, run the following to delete the container and not clutter your Docker container environment: 458 | ``` 459 | sudo docker rm 460 | ``` 461 | 462 | - Rerun the Docker command from the installation with the desired model. 463 | ``` 464 | sudo docker run --gpus all \ 465 | -v ~/.cache/huggingface:/root/.cache/huggingface \ 466 | --env "HUGGING_FACE_HUB_TOKEN=" \ 467 | -p 8556:8000 \ 468 | --ipc=host \ 469 | vllm/vllm-openai:latest \ 470 | --model 471 | ``` 472 | 473 | ### Creating a Service 474 | > [!NOTE] 475 | > Only needed for manual installations of llama.cpp/vLLM. 
While the above steps will help you get up and running with an OpenAI-compatible LLM server, they will not help with this server persisting after you close your terminal window or restart your physical server. Docker handles this with the `-d` ("detach") flag and a restart policy, but the manual llama.cpp and vLLM installations above are plain processes. To fix this, we start the inference engine from a `.service` file that runs alongside the Linux operating system when booting, ensuring that it is available whenever the server is on.

Let's call the service we're about to build `llm-server.service`. We'll assume all models are in the `models` child directory - you can change this as you need to.

1. Create the `systemd` service file:
```bash
sudo nano /etc/systemd/system/llm-server.service
```

2. Configure the service file:

**llama.cpp**
```ini
[Unit]
Description=LLM Server Service
After=network.target

[Service]
User=<username>
Group=<username>
WorkingDirectory=/home/<username>/llama.cpp/build/bin/
ExecStart=/home/<username>/llama.cpp/build/bin/llama-server \
    --port <port> \
    --host 0.0.0.0 \
    -m /home/<username>/llama.cpp/models/<model> \
    --no-webui # [other engine arguments]
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
```

**vLLM**
```ini
[Unit]
Description=LLM Server Service
After=network.target

[Service]
User=<username>
Group=<username>
WorkingDirectory=/home/<username>/vllm/
ExecStart=/bin/bash -c 'source .venv/bin/activate && vllm serve /home/<username>/vllm/models/<model> --port <port> --host 0.0.0.0'
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target
```
> Replace `<username>`, `<port>`, and `<model>` with your Linux username, desired port for serving, and desired model respectively.

3. Reload the `systemd` daemon:
```bash
sudo systemctl daemon-reload
```
4. Run the service:

If `llm-server.service` hasn't been enabled before:
```
sudo systemctl enable llm-server.service
sudo systemctl start llm-server
```

If `llm-server.service` already exists and you've just edited it:
```
sudo systemctl restart llm-server
```
5. (Optional) Check the service's status:
```bash
sudo systemctl status llm-server
```

### Open WebUI Integration
> [!NOTE]
> Only needed for llama.cpp/vLLM.

Navigate to `Admin Panel > Settings > Connections` and set the following values:

- Enable OpenAI API
- API Base URL: `http://host.docker.internal:<port>/v1`
- API Key: `anything-you-like`

> Replace `<port>` with the port your inference engine is serving on.

### Ollama vs.
llama.cpp 561 | 562 | | **Aspect** | **Ollama (Wrapper)** | **llama.cpp (Vanilla)** | 563 | | -------------------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | 564 | | **Installation/Setup** | One-click install & CLI model management | Requires manual setup/configuration | 565 | | **Open WebUI Integration** | First-class citizen | Requires OpenAI-compatible endpoint setup | 566 | | **Model Switching** | Native model-switching via server | Requires manual port management or [llama-swap](https://github.com/mostlygeek/llama-swap) | 567 | | **Customizability** | Limited: Modelfiles are cumbersome | Full control over parameters via CLI | 568 | | **Transparency** | Defaults may override model parameters (e.g., context length) | Full transparency in parameter settings | 569 | | **GGUF Support** | Inherits llama.cpp's best-in-class implementation | Best GGUF implementation | 570 | | **GPU-CPU Splitting** | Inherits llama.cpp's efficient splitting | Trivial GPU-CPU splitting out-of-the-box | 571 | 572 | --- 573 | 574 | ### vLLM vs. Ollama/llama.cpp 575 | | **Feature** | **vLLM** | **Ollama/llama.cpp** | 576 | | ----------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------- | 577 | | **Vision Models** | Supports Qwen 2.5 VL, Llama 3.2 Vision, etc. | Ollama supports some vision models, llama.cpp does not support any (via llama-server) | 578 | | **Quantization** | Supports AWQ, GPTQ, BnB, etc. | Only supports GGUF | 579 | | **Multi-GPU Inference** | Yes | Yes | 580 | | **Tensor Parallelism** | Yes | No | 581 | 582 | In summary, 583 | 584 | - **Ollama**: Best for those who want an "it just works" experience. 585 | - **llama.cpp**: Best for those who want total control over their inference servers and are familiar with engine arguments. 586 | - **vLLM**: Best for those who want (i) to run non-GGUF quantizations of models, (ii) multi-GPU inference using tensor parallelism, or (iii) to use vision models. 587 | 588 | Using Ollama as a service offers no degradation in experience because unused models are offloaded from VRAM after some time. Using vLLM or llama.cpp as a service keeps a model in memory, so I wouldn't use this alongside Ollama in an automated, always-on fashion unless it was your primary inference engine. Essentially, 589 | 590 | | Primary Engine | Secondary Engine | Run SE as service? | 591 | | -------------- | ---------------- | ------------------ | 592 | | Ollama | llama.cpp/vLLM | No | 593 | | llama.cpp/vLLM | Ollama | Yes | 594 | 595 | 596 | ## Chat Platform 597 | 598 | ### Open WebUI 599 | 600 | 🌟 [**GitHub**](https://github.com/open-webui/open-webui) 601 | 📖 [**Documentation**](https://docs.openwebui.com) 602 | 603 | Open WebUI is a web-based interface for managing models and chats, and provides a beautiful, performant UI for communicating with your models. You will want to do this if you want to access your models from a web interface. If you're fine with using the command line or want to consume models through a plugin/extension, you can skip this step. 
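The containers below publish Open WebUI on host port 3000 (the `-p 3000:8080` mapping). If something else on the server already uses that port, the container will fail to start, so it can be worth checking first; a quick sketch using the `ss` tool that ships with Debian:

```bash
# Show any listener already bound to port 3000 on the host; no output means the port is free
sudo ss -ltnp | grep ':3000'
```

If the port is taken, change the left-hand side of the `-p 3000:8080` mapping in the commands below.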
To install without Nvidia GPU support, run the following command:
```
sudo docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
```

For Nvidia GPUs, run the following command:
```
sudo docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda
```

You can access it by navigating to `http://localhost:3000` in your browser or `http://<server_ip>:3000` from another device on the same network. There's no need to add this to the `init.bash` script as Open WebUI will start automatically at boot via Docker Engine.

Read more about Open WebUI [here](https://github.com/open-webui/open-webui).

## Text-to-Speech Server

> [!NOTE]
> `host.docker.internal` is a magic hostname that resolves to the internal IP address assigned to the host by Docker. This allows containers to communicate with services running on the host, such as databases or web servers, without needing to know the host's IP address. It simplifies communication between containers and host-based services, making it easier to develop and deploy applications.

> [!NOTE]
> The TTS engine is set to `OpenAI` because OpenedAI Speech is OpenAI-compatible. There is no data transfer between OpenAI and OpenedAI Speech - the API is simply a wrapper around Piper and XTTS.

### OpenedAI Speech

🌟 [**GitHub**](https://github.com/matatonic/openedai-speech)

OpenedAI Speech is a text-to-speech server that wraps [Piper TTS](https://github.com/rhasspy/piper) and [Coqui XTTS v2](https://docs.coqui.ai/en/latest/models/xtts.html) in an OpenAI-compatible API. This is great because it plugs in easily to the Open WebUI interface, giving your models the ability to speak their responses.

> As of v0.17 (compared to v0.10), OpenedAI Speech features a far more straightforward and automated Docker installation, making it easy to get up and running.

Piper TTS is a lightweight model that is great for quick responses - it can also run CPU-only inference, which may be a better fit for systems that need to reserve as much VRAM for language models as possible. XTTS is a more performant model that requires a GPU for inference. Piper:

1) is generally easier to set up, with out-of-the-box CUDA acceleration, and
2) has a plethora of voices that can be found [here](https://rhasspy.github.io/piper-samples/),

so it's what I would suggest starting with.

- To install OpenedAI Speech, first clone the repository and navigate to the directory:
```
git clone https://github.com/matatonic/openedai-speech
cd openedai-speech
```
- Copy the `sample.env` file to `speech.env`:
```
cp sample.env speech.env
```
- Run the following command to start the server.
  - Nvidia GPUs
    ```
    sudo docker compose up -d
    ```
  - AMD GPUs
    ```
    sudo docker compose -f docker-compose.rocm.yml up -d
    ```
  - CPU only
    ```
    sudo docker compose -f docker-compose.min.yml up -d
    ```

OpenedAI Speech runs on `0.0.0.0:8000` by default.
You can access it by navigating to `http://localhost:8000` in your browser or `http://:8000` from another device on the same network without any additional changes. 664 | 665 | #### Downloading Voices 666 | 667 | We'll use Piper here because I haven't found any good resources for high quality .wav files for XTTS. The process is the same for both models, just replace `tts-1` with `tts-1-hd` in the following commands. We'll download the `en_GB-alba-medium` voice as an example. 668 | 669 | - Create a new virtual environment named `speech` and activate it. Then, install `piper-tts`: 670 | ``` 671 | python3 -m venv speech 672 | source speech/bin/activate 673 | pip install piper-tts 674 | ``` 675 | This is a minimal virtual environment that is only required to run the script that downloads voices. 676 | - Download the voice: 677 | ``` 678 | bash download_voices_tts-1.sh en_GB-alba-medium 679 | ``` 680 | - Update the `voice_to_speaker.yaml` file to include the voice you downloaded. This file maps the voice to a speaker name that can be used in the Open WebUI interface. For example, to map the `en_GB-alba-medium` voice to the speaker name `alba`, add the following lines to the file: 681 | ``` 682 | alba: 683 | model: voices/en_GB-alba-medium.onnx 684 | speaker: # default speaker 685 | ``` 686 | - Run the following command: 687 | ``` 688 | sudo docker ps -a 689 | ``` 690 | Identify the container IDs of 691 | 1) OpenedAI Speech 692 | 2) Open WebUI 693 | 694 | Restart both containers: 695 | ``` 696 | sudo docker restart 697 | sudo docker restart 698 | ``` 699 | > Replace `` and `` with the container IDs you identified. 700 | 701 | ### Kokoro FastAPI 702 | 703 | 🌟 [**GitHub**](https://github.com/remsky/Kokoro-FastAPI) 704 | 705 | Kokoro FastAPI is a text-to-speech server that wraps around and provides OpenAI-compatible API inference for [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), a state-of-the-art TTS model. The documentation for this project is fantastic and covers most, if not all, of the use cases for the project itself. 706 | 707 | To install Kokoro-FastAPI, run 708 | ``` 709 | git clone https://github.com/remsky/Kokoro-FastAPI.git 710 | cd Kokoro-FastAPI 711 | sudo docker compose up --build 712 | ``` 713 | 714 | The server can be used in two ways: an API and a UI. By default, the API is served on port 8880 and the UI is served on port 7860. 715 | 716 | ### Open WebUI Integration 717 | 718 | Navigate to `Admin Panel > Settings > Audio` and set the following values: 719 | 720 | **OpenedAI Speech** 721 | - Text-to-Speech Engine: `OpenAI` 722 | - API Base URL: `http://host.docker.internal:8000/v1` 723 | - API Key: `anything-you-like` 724 | - Set Model: `tts-1` (for Piper) or `tts-1-hd` (for XTTS) 725 | 726 | **Kokoro FastAPI** 727 | - Text-to-Speech Engine: `OpenAI` 728 | - API Base URL: `http://host.docker.internal:8880/v1` 729 | - API Key: `anything-you-like` 730 | - Set Model: `kokoro` 731 | - Response Splitting: None (this is crucial - Kokoro uses a novel audio splitting system) 732 | 733 | The server can be used in two ways: an API and a UI. By default, the API is served on port 8880 and the UI is served on port 7860. 734 | 735 | ### Comparison 736 | 737 | You may choose OpenedAI Speech over Kokoro because: 738 | 739 | 1) **Voice Cloning**: xTTS v2 offers extensive support for cloning voices with small samples of audio. 740 | 2) **Choice of Voices**: Piper offers a very large variety of voices across multiple languages, dialects, and accents. 
You may choose Kokoro over OpenedAI Speech because:

1) **Natural Tone**: Kokoro's voices are very natural sounding and offer a better experience than Piper. While Piper has high quality voices, the text can sound robotic when reading out complex words/sentences.
2) **Advanced Splitting**: Kokoro splits responses up in a better format, making any pauses in speech feel more real. It also natively skips over Markdown formatting like lists and asterisks for bold/italics.

Kokoro's performance makes it an ideal candidate for regular use as a voice assistant chained to a language model in Open WebUI.

## Text-to-Image Server

### ComfyUI

🌟 [**GitHub**](https://github.com/comfyanonymous/ComfyUI)
📖 [**Documentation**](https://docs.comfy.org)

ComfyUI is a popular open-source graph-based tool for generating images using image generation models such as Stable Diffusion XL, Stable Diffusion 3, and the Flux family of models.

- Clone and navigate to the repository:
```
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
```
- Set up a new virtual environment:
```
python3 -m venv comfyui-env
source comfyui-env/bin/activate
```
- Download the platform-specific dependencies:
  - Nvidia GPUs
    ```
    pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
    ```
  - AMD GPUs
    ```
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
    ```
  - Intel GPUs

    Read the installation instructions from [ComfyUI's GitHub](https://github.com/comfyanonymous/ComfyUI?tab=readme-ov-file#intel-gpus).

- Download the general dependencies:
```
pip install -r requirements.txt
```

Now, we have to download and load a model. Here, we'll use FLUX.1 [dev], a new, state-of-the-art medium-tier model by Black Forest Labs that fits well on an RTX 3090 24GB. Since we want this to be set up as easily as possible, we'll use a complete checkpoint that can be loaded directly into ComfyUI. For a completely customized workflow, CLIPs, VAEs, and models can be downloaded separately. Follow [this guide](https://comfyanonymous.github.io/ComfyUI_examples/flux/#simple-to-use-fp8-checkpoint-version) by ComfyUI's creator to install the FLUX.1 models in a fully customizable way.

> [!NOTE]
> [FLUX.1 [schnell] HuggingFace](https://huggingface.co/Comfy-Org/flux1-schnell/blob/main/flux1-schnell-fp8.safetensors) (smaller, ideal for <24GB VRAM)
>
> [FLUX.1 [dev] HuggingFace](https://huggingface.co/Comfy-Org/flux1-dev/blob/main/flux1-dev-fp8.safetensors) (larger, ideal for 24GB VRAM)

- Download your desired model into `/models/checkpoints`.

- If you want ComfyUI to be served at boot and effectively run as a service, add the following lines to `init.bash`:
```
cd /path/to/comfyui
source comfyui-env/bin/activate
python main.py --listen
```
> Replace `/path/to/comfyui` with the path to your ComfyUI directory (relative to where `init.bash` runs, or an absolute path).

Otherwise, to run it just once, simply execute the above lines in a terminal window.
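If you'd rather script the checkpoint download than grab it through a browser, the HuggingFace CLI set up earlier can place it straight into ComfyUI's checkpoint folder. A sketch for the [schnell] FP8 checkpoint linked above, assuming you're in the directory that contains both `hf-env` and `ComfyUI` (adjust paths to your layout):

```bash
# Activate the huggingface_hub environment, then pull the single checkpoint file
source hf-env/bin/activate
huggingface-cli download Comfy-Org/flux1-schnell flux1-schnell-fp8.safetensors \
  --local-dir ComfyUI/models/checkpoints
```

The same pattern works for the [dev] checkpoint by swapping in the `Comfy-Org/flux1-dev` repository and the `flux1-dev-fp8.safetensors` filename.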
#### Open WebUI Integration

Navigate to `Admin Panel > Settings > Images` and set the following values:

- Image Generation Engine: `ComfyUI`
- API Base URL: `http://localhost:8188`

> [!TIP]
> You'll either need more than 24GB of VRAM or to use a small language model mostly on CPU to use Open WebUI with FLUX.1 [dev]. FLUX.1 [schnell] and a small language model, however, should fit cleanly in 24GB of VRAM, making for a faster experience if you intend to regularly use both text and image generation together.

## SSH

Enabling SSH allows you to connect to the server remotely. After configuring SSH, you can connect to the server from another device on the same network using an SSH client like PuTTY or the terminal. This lets you run your server headlessly without needing a monitor, keyboard, or mouse after the initial setup.

On the server:
- Run the following command:
```
sudo apt install openssh-server
```
- Start the SSH service:
```
sudo systemctl start ssh
```
- Enable the SSH service to start at boot:
```
sudo systemctl enable ssh
```
- Find the server's IP address:
```
ip a
```

On the client:
- Connect to the server using SSH:
```
ssh <username>@<server_ip>
```
> Replace `<username>` with your username and `<server_ip>` with the server's IP address.

> [!NOTE]
> If you expect to tunnel into your server often, I highly recommend following [this guide](https://www.raspberrypi.com/documentation/computers/remote-access.html#configure-ssh-without-a-password) to enable passwordless SSH using `ssh-keygen` and `ssh-copy-id`. It worked perfectly on my Debian system despite having been written for Raspberry Pi OS.

## Firewall

Setting up a firewall is essential for securing your server. The Uncomplicated Firewall (UFW) is a simple and easy-to-use firewall for Linux. You can use UFW to allow or deny incoming and outgoing traffic to and from your server.

- Install UFW:
```bash
sudo apt install ufw
```
- Allow SSH, HTTPS, and any other ports you need (UFW takes one rule per command):
```bash
sudo ufw allow ssh
sudo ufw allow http
sudo ufw allow https
sudo ufw allow 8080 # [your other services' ports]
```
Here, we're allowing SSH (port 22), HTTP (port 80), HTTPS (port 443), and port 8080 to start with. You can add or remove ports as needed. Be sure to allow the ports of any services you end up using (for example, 3000 for Open WebUI) - both from this guide and in general.
- Enable UFW:
```bash
sudo ufw enable
```
- Check the status of UFW:
```bash
sudo ufw status
```

> [!WARNING]
> Enabling UFW without allowing access to port 22 will disrupt your existing SSH connections. If you run a headless setup, fixing this means connecting a monitor to your server and then allowing SSH access through UFW. Be careful to ensure that this port is allowed when making changes to UFW's configuration.

Refer to [this guide](https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10) for more information on setting up UFW.

## Remote Access

Remote access refers to the ability to access your server outside of your home network.
For example, when you leave the house, you aren't going to be able to access `http://<server_ip>`, because your network has changed from your home network to some other network (either your mobile carrier's or a local network in some other place). This means that you won't be able to access the services running on your server. There are many solutions on the web that solve this problem and we'll explore some of the easiest-to-use here.

### Tailscale

Tailscale is a peer-to-peer VPN service that combines many services into one. Its most common use case is to bind many different devices of many different kinds (Windows, Linux, macOS, iOS, Android, etc.) on one virtual network. This way, all these devices can be connected to different networks but still be able to communicate with each other as if they were all on the same local network. Tailscale is not completely open source (its GUI is proprietary), but it is based on the [Wireguard](https://www.wireguard.com) VPN protocol and the remainder of the actual service is open source. Comprehensive documentation on the service can be found [here](https://tailscale.com/kb) and goes into many topics not mentioned here - I would recommend reading it to get the most out of the service.

On Tailscale, networks are referred to as tailnets. Creating and managing tailnets requires creating an account with Tailscale (an expected scenario with a VPN service) but connections are peer-to-peer and happen without any routing to Tailscale servers. This connection being based on Wireguard means 100% of your traffic is encrypted and cannot be accessed by anyone but the devices on your tailnet.

#### Installation

First, create a tailnet through the Admin Console on Tailscale. Download the Tailscale app on any client you want to access your tailnet from. For Windows, macOS, iOS, and Android, the apps can be found on their respective OS app stores. After signing in, your device will be added to the tailnet.

For Linux, the steps required are as follows.

1) Install Tailscale
```
curl -fsSL https://tailscale.com/install.sh | sh
```

2) Start the service
```
sudo tailscale up
```

For SSH, run `sudo tailscale up --ssh`.

#### Exit Nodes

An exit node allows access to a different network while still being on your tailnet. For example, you can use this to allow a server on your network to act as a tunnel for other devices. This way, you can not only access that device (by virtue of your tailnet) but also all the devices on the host network it's on. This is useful to access non-Tailscale devices on a network.

To advertise a device as an exit node, run `sudo tailscale up --advertise-exit-node`. To allow access to the local network via this device, add the `--exit-node-allow-lan-access` flag.

#### Local DNS

If one of the devices on your tailnet runs a [DNS-sinkhole](https://en.wikipedia.org/wiki/DNS_sinkhole) service like [Pi-hole](https://pi-hole.net), you'll probably want other devices to use it as their DNS server. Assume this device is named `poplar`. This means every networking request made by any device on your tailnet will be sent to `poplar`, which will in turn decide whether that request will be answered or rejected according to your Pi-hole configuration.
However, since `poplar` is also one of the devices on your tailnet, it will send networking requests to itself in accordance with this rule and not to somewhere that will actually resolve the request. Thus, we don't want such devices to accept the DNS settings of the tailnet, but to follow their otherwise preconfigured rules.

To reject the tailnet's DNS settings, run `sudo tailscale up --accept-dns=false`.

#### Third-Party VPN Integration

Tailscale offers a [Mullvad VPN](https://mullvad.net/en) exit node add-on with their service. This add-on allows for a traditional VPN experience that will route your requests through a proxy server in some other location, effectively masking your IP and allowing the circumvention of geolocation restrictions on web services. Assigned devices can be configured from the Admin Console. Mullvad VPN has [proven their no-log policy](https://mullvad.net/en/blog/2023/4/20/mullvad-vpn-was-subject-to-a-search-warrant-customer-data-not-compromised) and offers a fixed $5/month price no matter what duration you choose to pay for.

To use a Mullvad exit node on one of your devices, first find the exit node you want to use by running `sudo tailscale exit-node list`. Note the IP and run `sudo tailscale up --exit-node=<exit_node_ip>`.

> [!WARNING]
> Ensure the device is allowed to use the Mullvad add-on through the Admin Console first.

## Verifying

This section isn't strictly necessary by any means - if you use all the elements in the guide, a good experience in Open WebUI means you've succeeded with the goal of the guide. However, it can be helpful to test the disparate installations at different stages in this process.

### Inference Engine

To test your OpenAI-compatible server endpoint, run:
```
curl http://localhost:<port>/v1/completions -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
```
> Replace `<port>` with the actual port of your server and `llama2` with your preferred model. If your physical server is different from the machine you're executing the above command on, replace `localhost` with the IP of the physical server.

### Open WebUI

Visit `http://localhost:3000`. If you're greeted by the authentication page, you've successfully installed Open WebUI.

### Text-to-Speech Server

To test your TTS server, run the following command:
```
curl -s http://localhost:<port>/v1/audio/speech -H "Content-Type: application/json" -d '{
  "input": "The quick brown fox jumped over the lazy dog."}' > speech.mp3
```
> Replace `<port>` with your TTS server's port (8000 for OpenedAI Speech, 8880 for Kokoro FastAPI).

If you see the `speech.mp3` file in the directory you ran the command from, you should be good to go. If you're paranoid, test it using a player that can decode MP3, like `mpg123` (`aplay` only plays WAV files and isn't a standalone package). Run the following commands:
```
sudo apt install mpg123
mpg123 speech.mp3
```

Kokoro FastAPI: To test the web UI, visit `http://localhost:7860`.

### ComfyUI

Visit `http://localhost:8188`. If you're greeted by the workflow page, you've successfully installed ComfyUI.

## Updating

Updating your system is a good idea to keep software running optimally and with the latest security patches. Updates to Ollama allow for inference from new model architectures and updates to Open WebUI enable new features like voice calling, function calling, pipelines, and more.
965 | 966 | I've compiled steps to update these "primary function" installations in a standalone section because I think it'd be easier to come back to one section instead of hunting for update instructions in multiple subsections. 967 | 968 | ### General 969 | 970 | Upgrade Debian packages by running the following commands: 971 | ``` 972 | sudo apt update 973 | sudo apt upgrade 974 | ``` 975 | 976 | ### Nvidia Drivers & CUDA 977 | 978 | Follow Nvidia's guide [here](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian) to install the latest CUDA drivers. 979 | 980 | > [!WARNING] 981 | > Don't skip this step. Not installing the latest drivers after upgrading Debian packages will throw your installations out of sync, leading to broken functionality. When updating, target everything important at once. Also, rebooting after this step is a good idea to ensure that your system is operating as expected after upgrading these crucial drivers. 982 | 983 | ### Ollama 984 | 985 | Rerun the command that installs Ollama - it acts as an updater too: 986 | ``` 987 | curl -fsSL https://ollama.com/install.sh | sh 988 | ``` 989 | 990 | ### llama.cpp 991 | 992 | Enter your llama.cpp folder and run the following commands: 993 | ``` 994 | cd llama.cpp 995 | git pull 996 | # Rebuild according to your setup - uncomment `-DGGML_CUDA=ON` for CUDA support 997 | cmake -B build # -DGGML_CUDA=ON 998 | cmake --build build --config Release 999 | ``` 1000 | 1001 | ### vLLM 1002 | 1003 | For a manual installation, enter your virtual environment and update via `pip`: 1004 | ``` 1005 | source vllm/.venv/bin/activate 1006 | pip install vllm --upgrade 1007 | ``` 1008 | 1009 | For a Docker installation, you're good to go when you re-run your Docker command, because it pulls the latest Docker image for vLLM. 1010 | 1011 | ### Open WebUI 1012 | 1013 | To update Open WebUI once, run the following command: 1014 | ``` 1015 | docker run --rm --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --run-once open-webui 1016 | ``` 1017 | 1018 | To keep it updated automatically, run the following command: 1019 | ``` 1020 | docker run -d --name watchtower --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower open-webui 1021 | ``` 1022 | 1023 | ### OpenedAI Speech 1024 | 1025 | Navigate to the directory and pull the latest image from Docker: 1026 | ``` 1027 | cd openedai-speech 1028 | sudo docker compose pull 1029 | sudo docker compose up -d 1030 | ``` 1031 | 1032 | ### Kokoro FastAPI 1033 | 1034 | Navigate to the directory and pull the latest image from Docker: 1035 | ``` 1036 | cd Kokoro-FastAPI 1037 | sudo docker compose pull 1038 | sudo docker compose up -d 1039 | ``` 1040 | 1041 | ### ComfyUI 1042 | 1043 | Navigate to the directory, pull the latest changes, and update dependencies: 1044 | ``` 1045 | cd ComfyUI 1046 | git pull 1047 | source comfyui-env/bin/activate 1048 | pip install -r requirements.txt 1049 | ``` 1050 | 1051 | ## Troubleshooting 1052 | 1053 | For any service running in a container, you can check the logs by running `sudo docker logs -f (container_ID)`. If you're having trouble with a service, this is a good place to start. 1054 | 1055 | ### `ssh` 1056 | - If you encounter an issue using `ssh-copy-id` to set up passwordless SSH, try running `ssh-keygen -t rsa` on the client before running `ssh-copy-id`. This generates the RSA key pair that `ssh-copy-id` needs to copy to the server. 
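For reference, the full client-side sequence is short; a sketch with illustrative `<username>`/`<server_ip>` placeholders:

```bash
# All three commands run on the client, not the server
ssh-keygen -t rsa                     # generate a key pair (skip if you already have one)
ssh-copy-id <username>@<server_ip>    # copy the public key to the server
ssh <username>@<server_ip>            # should now log in without a password prompt
```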
1057 | 1058 | ### Nvidia Drivers 1059 | - Disable Secure Boot in the BIOS if you're having trouble with the Nvidia drivers not working. For me, all packages were at the latest versions and `nvidia-detect` was able to find my GPU correctly, but `nvidia-smi` kept returning the `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver` error. [Disabling Secure Boot](https://askubuntu.com/a/927470) fixed this for me. Better practice than disabling Secure Boot is to sign the Nvidia drivers yourself but I didn't want to go through that process for a non-critical server that can afford to have Secure Boot disabled. 1060 | - If you run into `docker: Error response from daemon: unknown or invalid runtime name: nvidia.`, you probably have `--runtime nvidia` in your Docker statement. This is meant for `nvidia-docker`, [which is deprecated now](https://stackoverflow.com/questions/52865988/nvidia-docker-unknown-runtime-specified-nvidia). Removing this flag from your command should get rid of this error. 1061 | 1062 | ### Ollama 1063 | - If you receive the `could not connect to ollama app, is it running?` error, your Ollama instance wasn't served properly. This could be because of a manual installation or the desire to use it at-will and not as a service. To run the Ollama server once, run: 1064 | ``` 1065 | ollama serve 1066 | ``` 1067 | Then, **in a new terminal**, you should be able to access your models regularly by running: 1068 | ``` 1069 | ollama run 1070 | ``` 1071 | For detailed instructions on _manually_ configuring Ollama to run as a service (to run automatically at boot), read the official documentation [here](https://github.com/ollama/ollama/blob/main/docs/linux.md). You shouldn't need to do this unless your system faces restrictions using Ollama's automated installer. 1072 | 1073 | - If you receive the `Failed to open "/etc/systemd/system/ollama.service.d/.#override.confb927ee3c846beff8": Permission denied` error from Ollama after running `systemctl edit ollama.service`, simply creating the file works to eliminate it. Use the following steps to edit the file. 1074 | - Run: 1075 | ``` 1076 | sudo mkdir -p /etc/systemd/system/ollama.service.d 1077 | sudo nano /etc/systemd/system/ollama.service.d/override.conf 1078 | ``` 1079 | - Retry the remaining steps. 1080 | - If you still can't connect to your API endpoint, check your firewall settings. [This guide to UFW (Uncomplicated Firewall) on Debian](https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10) is a good resource. 1081 | 1082 | ### vLLM 1083 | - If you encounter ```RuntimeError: An error occurred while downloading using `hf_transfer`. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.```, add `HF_HUB_ENABLE_HF_TRANSFER=0` to the `--env` flag after your HuggingFace Hub token. If this still doesn't fix the issue - 1084 | - Ensure your user has all the requisite permissions for HuggingFace to be able to write to the cache. To give read+write access over the HF cache to your user (and, thus, `huggingface-cli`), run: 1085 | ``` 1086 | sudo chmod 777 ~/.cache/huggingface 1087 | sudo chmod 777 ~/.cache/huggingface/hub 1088 | ``` 1089 | - Manually download a model via the HuggingFace CLI and specify `--download-dir=~/.cache/huggingface/hub` in the engine arguments. 
If your `.cache/huggingface` directory is being troublesome, specify another directory to the `--download-dir` in the engine arguments and remember to do the same with the `--local-dir` flag in any `huggingface-cli` commands. 1090 | 1091 | ### Open WebUI 1092 | - If you encounter `Ollama: llama runner process has terminated: signal: killed`, check your `Advanced Parameters`, under `Settings > General > Advanced Parameters`. For me, bumping the context length past what certain models could handle was breaking the Ollama server. Leave it to the default (or higher, but make sure it's still under the limit for the model you're using) to fix this issue. 1093 | 1094 | ### OpenedAI Speech 1095 | - If you encounter `docker: Error response from daemon: Unknown runtime specified nvidia.` when running `docker compose up -d`, ensure that you have `nvidia-container-toolkit` installed (this was previously `nvidia-docker2`, which is now deprecated). If not, installation instructions can be found [here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html). Make sure to reboot the server after installing the toolkit. If you still encounter issues, ensure that your system has a valid CUDA installation by running `nvcc --version`. 1096 | - If `nvcc --version` doesn't return a valid response despite following Nvidia's installation guide, the issue is likely that CUDA is not in your PATH variable. 1097 | Run the following command to edit your `.bashrc` file: 1098 | ``` 1099 | sudo nano /home//.bashrc 1100 | ``` 1101 | > Replace `` with your username. 1102 | Add the following to your `.bashrc` file: 1103 | ``` 1104 | export PATH="/usr/local//bin:$PATH" 1105 | export LD_LIBRARY_PATH="/usr/local//lib64:$LD_LIBRARY_PATH" 1106 | ``` 1107 | > Replace `` with your installation's version. If you're unsure of which version, run `ls /usr/local` to find the CUDA directory. It is the directory with the `cuda` prefix, followed by the version number. 1108 | 1109 | Save and exit the file, then run `source /home//.bashrc` to apply the changes (or close the current terminal and open a new one). Run `nvcc --version` again to verify that CUDA is now in your PATH. You should see something like the following: 1110 | ``` 1111 | nvcc: NVIDIA (R) Cuda compiler driver 1112 | Copyright (c) 2005-2024 NVIDIA Corporation 1113 | Built on Thu_Mar_28_02:18:24_PDT_2024 1114 | Cuda compilation tools, release 12.4, V12.4.131 1115 | Build cuda_12.4.r12.4/compiler.34097967_0 1116 | ``` 1117 | If you see this, CUDA is now in your PATH and you can run `docker compose up -d` again. 1118 | - If you run into a `VoiceNotFoundError`, you may either need to download the voices again or the voices may not be compatible with the model you're using. Make sure to check your `speech.env` file to ensure that the `PRELOAD_MODEL` and `CLI_COMMAND` lines are configured correctly. 1119 | 1120 | ## Monitoring 1121 | 1122 | To monitor GPU usage, power draw, and temperature, you can use the `nvidia-smi` command. To monitor GPU usage, run: 1123 | ``` 1124 | watch -n 1 nvidia-smi 1125 | ``` 1126 | This will update the GPU usage every second without cluttering the terminal environment. Press `Ctrl+C` to exit. 1127 | 1128 | ## Notes 1129 | 1130 | This is my first foray into setting up a server and ever working with Linux so there may be better ways to do some of the steps. I will update this repository as I learn more. 
1127 | 1128 | ## Notes 1129 | 1130 | This is my first foray into setting up a server and working with Linux, so there may be better ways to do some of these steps. I will update this repository as I learn more. 1131 | 1132 | ### Software 1133 | 1134 | - I chose Debian because it is, apparently, one of the most stable Linux distros. I also went with an XFCE desktop environment because it is lightweight and I wasn't yet comfortable going full command line. 1135 | - Use a regular user for auto-login; don't log in as root unless you have a specific reason to. 1136 | - To switch to root in the command line without switching users, run `sudo -i`. 1137 | - If something running in a Docker container doesn't work, run `sudo docker ps -a` to see if the container is running. If it isn't, try running `sudo docker compose up -d` again. If it is running but still isn't working, try `sudo docker restart <container_name>` to restart the container. 1138 | - If something isn't working no matter what you do, try rebooting the server. It's a common solution to many problems. Try this before spending hours troubleshooting. Sigh. 1139 | - While it takes some time to get comfortable with, using an inference engine like llama.cpp or vLLM (as compared to Ollama) is really the way to go to squeeze the maximum performance out of your hardware. If you're reading this guide in the first place and haven't already thrown up your hands and used a cloud provider, it's a safe assumption that you care about the ethos of hosting all this stuff locally. Thus, get your experience as close to a cloud provider's as it can be by optimizing your server. 1140 | 1141 | ### Hardware 1142 | 1143 | - The power draw of my EVGA FTW3 Ultra RTX 3090 was 350W at stock settings. I set the power limit to 250W and the performance decrease was negligible for my use case, which is primarily code completion in VS Code and Q&A via chat (the command to set the limit is sketched after this list). 1144 | - Using a power monitor, I measured the power draw of my server over multiple days; the running average is ~60W. Power can spike to 350W during prompt processing and token generation, but only for a few seconds; for the rest of generation it tends to stay at the 250W power limit, then drops back to the average draw about 20 seconds after the model stops being used. 1145 | - Ensure your power supply has enough headroom for transient spikes (particularly in multi-GPU setups) or you may face random shutdowns. Transient spikes can briefly exceed both the GPU's rated power draw and any software power limit you set, because they reflect the chip's actual instantaneous draw. I usually aim for about 50% more capacity than my setup's estimated total power draw.
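For reference, the power limit described in the first bullet above is set with `nvidia-smi`. A minimal sketch; 250 is the wattage from the RTX 3090 example and should be adjusted for your card, and the limit does not persist across reboots, so reapply it at startup (e.g., from your startup script) if you want it to stick.
```
# Sketch: cap GPU power draw (value in watts; 250 matches the RTX 3090 example above).
sudo nvidia-smi -pm 1     # enable persistence mode so settings are retained while the system is up
sudo nvidia-smi -pl 250   # set the power limit in watts (requires root)
```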
1146 | 1147 | ## References 1148 | 1149 | Downloading Nvidia drivers: 1150 | - https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian 1151 | - https://wiki.debian.org/NvidiaGraphicsDrivers 1152 | 1153 | Downloading AMD drivers: 1154 | - https://wiki.debian.org/AtiHowTo 1155 | 1156 | Secure Boot: 1157 | - https://askubuntu.com/a/927470 1158 | 1159 | Monitoring GPU usage, power draw: 1160 | - https://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda/78203#78203 1161 | 1162 | Passwordless `sudo`: 1163 | - https://stackoverflow.com/questions/25215604/use-sudo-without-password-inside-a-script 1164 | - https://www.reddit.com/r/Fedora/comments/11lh9nn/set_nvidia_gpu_power_and_temp_limit_on_boot/ 1165 | - https://askubuntu.com/questions/100051/why-is-sudoers-nopasswd-option-not-working 1166 | 1167 | Auto-login: 1168 | - https://forums.debian.net/viewtopic.php?t=149849 1169 | - https://wiki.archlinux.org/title/LightDM#Enabling_autologin 1170 | 1171 | Expose Ollama to LAN: 1172 | - https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux 1173 | - https://github.com/ollama/ollama/issues/703 1174 | 1175 | Firewall: 1176 | - https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10 1177 | 1178 | Passwordless `ssh`: 1179 | - https://www.raspberrypi.com/documentation/computers/remote-access.html#configure-ssh-without-a-password 1180 | 1181 | Adding CUDA to PATH: 1182 | - https://askubuntu.com/questions/885610/nvcc-version-command-says-nvcc-is-not-installed 1183 | 1184 | Docs: 1185 | 1186 | - [Debian](https://www.debian.org/releases/buster/amd64/) 1187 | - [Docker](https://docs.docker.com/engine/install/debian/) 1188 | - [Ollama](https://github.com/ollama/ollama/blob/main/docs/api.md) 1189 | - [vLLM](https://docs.vllm.ai/en/stable/index.html) 1190 | - [Open WebUI](https://github.com/open-webui/open-webui) 1191 | - [OpenedAI Speech](https://github.com/matatonic/openedai-speech) 1192 | - [ComfyUI](https://github.com/comfyanonymous/ComfyUI) 1193 | 1194 | ## Acknowledgements 1195 | 1196 | Cheers to all the fantastic work done by the open-source community. This guide wouldn't exist without the effort of the many contributors to the projects and guides referenced here. To stay up-to-date on the latest developments in the field of machine learning, LLMs, and other vision/speech models, check out [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/). 1197 | 1198 | > [!NOTE] 1199 | > Please star any projects you find useful and consider contributing to them if you can. Stars on this guide would also be appreciated if you found it helpful, as it helps others find it too. --------------------------------------------------------------------------------