├── llm-server-architecture.png ├── LICENSE └── README.md /llm-server-architecture.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/varunvasudeva1/llm-server-docs/HEAD/llm-server-architecture.png -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Varun Vasudeva 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Local LLaMA Server Setup Documentation 2 | 3 | _TL;DR_: End-to-end documentation to set up your own local & fully private LLM server on Debian. Equipped with chat, web search, RAG, model management, MCP servers, image generation, and TTS, along with steps for configuring SSH, firewall, and secure remote access via Tailscale. 
4 | 5 | Software Stack: 6 | 7 | - Inference Engine ([Ollama](https://github.com/ollama/ollama), [llama.cpp](https://github.com/ggml-org/llama.cpp), [vLLM](https://github.com/vllm-project/vllm)) 8 | - Search Engine ([SearXNG](https://github.com/searxng/searxng)) 9 | - Model Server ([llama-swap](https://github.com/mostlygeek/llama-swap), `systemd` service) 10 | - Chat Platform ([Open WebUI](https://github.com/open-webui/open-webui)) 11 | - MCP Proxy Server ([mcp-proxy](https://github.com/sparfenyuk/mcp-proxy), [MCPJungle](https://github.com/mcpjungle/MCPJungle)) 12 | - Text-to-Speech Server ([Kokoro FastAPI](https://github.com/remsky/Kokoro-FastAPI)) 13 | - Image Generation Server ([ComfyUI](https://github.com/comfyanonymous/ComfyUI)) 14 | 15 | ![Software Stack Architectural Diagram](llm-server-architecture.png) 16 | 17 | ## Table of Contents 18 | 19 | - [Local LLaMA Server Setup Documentation](#local-llama-server-setup-documentation) 20 | - [Table of Contents](#table-of-contents) 21 | - [About](#about) 22 | - [Priorities](#priorities) 23 | - [Prerequisites](#prerequisites) 24 | - [General](#general) 25 | - [Schedule Startup Script](#schedule-startup-script) 26 | - [Configure Script Permissions](#configure-script-permissions) 27 | - [Configure Auto-Login (optional)](#configure-auto-login-optional) 28 | - [Docker](#docker) 29 | - [Nvidia Container Toolkit](#nvidia-container-toolkit) 30 | - [Create a Network](#create-a-network) 31 | - [Helpful Commands](#helpful-commands) 32 | - [HuggingFace CLI](#huggingface-cli) 33 | - [Manage Models](#manage-models) 34 | - [Download Models](#download-models) 35 | - [Delete Models](#delete-models) 36 | - [Search Engine](#search-engine) 37 | - [SearXNG](#searxng) 38 | - [Open WebUI Integration](#open-webui-integration) 39 | - [Inference Engine](#inference-engine) 40 | - [Ollama](#ollama) 41 | - [llama.cpp](#llamacpp) 42 | - [vLLM](#vllm) 43 | - [Open WebUI Integration](#open-webui-integration-1) 44 | - [Ollama vs. llama.cpp](#ollama-vs-llamacpp) 45 | - [vLLM vs. 
Ollama/llama.cpp](#vllm-vs-ollamallamacpp) 46 | - [Model Server](#model-server) 47 | - [llama-swap](#llama-swap) 48 | - [`systemd` Service](#systemd-service) 49 | - [Open WebUI Integration](#open-webui-integration-2) 50 | - [llama-swap](#llama-swap-1) 51 | - [`systemd` Service](#systemd-service-1) 52 | - [Chat Platform](#chat-platform) 53 | - [Open WebUI](#open-webui) 54 | - [MCP Proxy Server](#mcp-proxy-server) 55 | - [mcp-proxy](#mcp-proxy) 56 | - [MCPJungle](#mcpjungle) 57 | - [Comparison](#comparison) 58 | - [Open WebUI Integration](#open-webui-integration-3) 59 | - [mcp-proxy](#mcp-proxy-1) 60 | - [MCPJungle](#mcpjungle-1) 61 | - [VS Code/Claude Desktop Integration](#vs-codeclaude-desktop-integration) 62 | - [Text-to-Speech Server](#text-to-speech-server) 63 | - [Kokoro FastAPI](#kokoro-fastapi) 64 | - [Open WebUI Integration](#open-webui-integration-4) 65 | - [Image Generation Server](#image-generation-server) 66 | - [ComfyUI](#comfyui) 67 | - [Open WebUI Integration](#open-webui-integration-5) 68 | - [SSH](#ssh) 69 | - [Firewall](#firewall) 70 | - [Remote Access](#remote-access) 71 | - [Tailscale](#tailscale) 72 | - [Installation](#installation) 73 | - [Exit Nodes](#exit-nodes) 74 | - [Local DNS](#local-dns) 75 | - [Third-Party VPN Integration](#third-party-vpn-integration) 76 | - [Updating](#updating) 77 | - [General](#general-1) 78 | - [Nvidia Drivers \& CUDA](#nvidia-drivers--cuda) 79 | - [Ollama](#ollama-1) 80 | - [llama.cpp](#llamacpp-1) 81 | - [vLLM](#vllm-1) 82 | - [llama-swap](#llama-swap-2) 83 | - [Open WebUI](#open-webui-1) 84 | - [mcp-proxy/MCPJungle](#mcp-proxymcpjungle) 85 | - [Kokoro FastAPI](#kokoro-fastapi-1) 86 | - [ComfyUI](#comfyui-1) 87 | - [Troubleshooting](#troubleshooting) 88 | - [`ssh`](#ssh-1) 89 | - [Nvidia Drivers](#nvidia-drivers) 90 | - [Ollama](#ollama-2) 91 | - [vLLM](#vllm-2) 92 | - [Open WebUI](#open-webui-2) 93 | - [Monitoring](#monitoring) 94 | - [Notes](#notes) 95 | - [Software](#software) 96 | - [Hardware](#hardware) 97 | - [References](#references) 98 | - [Acknowledgements](#acknowledgements) 99 | 100 | ## About 101 | 102 | This repository outlines the steps to run a server for running local language models. It uses Debian specifically, but most Linux distros should follow a very similar process. It aims to be a guide for Linux beginners like me who are setting up a server for the first time. 103 | 104 | The process involves installing the requisite drivers, setting the GPU power limit, setting up auto-login, and scheduling the `init.bash` script to run at boot. All these settings are based on my ideal setup for a language model server that runs most of the day but a lot can be customized to suit your needs. 105 | 106 | > [!IMPORTANT] 107 | > No part of this guide was written using AI - any hallucinations are the good old human kind. While I've done my absolute best to ensure correctness in every step/command, check **everything** you execute in a terminal. Enjoy! 108 | 109 | ## Priorities 110 | 111 | - **Simplicity**: It should be relatively straightforward to set up the components of the solution. 112 | - **Stability**: The components should be stable and capable of running for weeks at a time without any intervention necessary. 113 | - **Maintainability**: The components and their interactions should be uncomplicated enough that you know enough to maintain them as they evolve (because they *will* evolve). 114 | - **Aesthetics**: The result should be as close to a cloud provider's chat platform as possible. 
A homelab solution doesn't necessarily need to feel like it was cobbled together haphazardly. 115 | - **Modularity**: Components in the setup should be able to be swapped out for newer/more performant/better maintained alternatives easily. Standard protocols (OpenAI-compatibility, MCPs, etc.) help with this a lot and, in this guide, they are always preferred over bundled solutions. 116 | - **Open source**: The code should be able to be verified by a community of engineers. Chat platforms and LLMs involve large amounts of personal data conveyed in natural language and it's important to know that data isn't going outside your machine. 117 | 118 | ## Prerequisites 119 | 120 | Any modern CPU and GPU combination should work for this guide. Previously, compatibility with AMD GPUs was an issue but recent releases of Ollama have addressed this and [AMD GPUs are now supported natively](https://ollama.com/blog/amd-preview). 121 | 122 | For reference, this guide was built around the following system: 123 | - **CPU**: Intel Core i5-12600KF 124 | - **Memory**: 64GB 3200MHz DDR4 RAM 125 | - **Storage**: 1TB M.2 NVMe SSD 126 | - **GPU**: Nvidia RTX 3090 (24GB), Nvidia RTX 3060 (12GB) 127 | 128 | > [!NOTE] 129 | > **AMD GPUs**: Power limiting is skipped for AMD GPUs as [AMD has recently made it difficult to set power limits on their GPUs](https://www.reddit.com/r/linux_gaming/comments/1b6l1tz/no_more_power_limiting_for_amd_gpus_because_it_is/). Naturally, skip any steps involving `nvidia-smi` or `nvidia-persistenced` and the power limit in the `init.bash` script. 130 | > 131 | > **CPU-only**: You can skip the GPU driver installation and power limiting steps. The rest of the guide should work as expected. 132 | 133 | > [!NOTE] 134 | > This guide uses `~/` (or `/home/<username>/`) as the base directory. If you're working in a different directory, please modify all your commands accordingly. 135 | 136 | To begin the process of setting up your server, you will need the following: 137 | 138 | - Fresh install of Debian 139 | - Internet connection 140 | - Basic understanding of the Linux terminal 141 | - Peripherals like a monitor, keyboard, and mouse 142 | 143 | To install Debian on your newly built server hardware: 144 | 145 | - Download the [Debian ISO](https://www.debian.org/distrib/) from the official website. 146 | - Create a bootable USB using a tool like [Rufus](https://rufus.ie/en/) for Windows or [Balena Etcher](https://etcher.balena.io) for MacOS. 147 | - Boot into the USB and install Debian. 148 | 149 | For a more detailed guide on installing Debian, refer to the [official documentation](https://www.debian.org/releases/buster/amd64/). For those who aren't yet experienced with Linux, I recommend using the graphical installer - you will be given an option between the text-based installer and the graphical installer. 150 | 151 | I also recommend installing a lightweight desktop environment like XFCE for ease of use. Other options like GNOME or KDE are also available - GNOME may be a better option for those using their server as a primary workstation as it is more feature-rich (and, as such, heavier) than XFCE. 152 | 153 | ## General 154 | 155 | Update the system by running the following commands: 156 | ``` 157 | sudo apt update 158 | sudo apt upgrade 159 | ``` 160 | 161 | Now, we'll install the required GPU drivers that allow programs to utilize their compute capabilities.
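If you're not sure what GPU the operating system actually detects, here is a quick optional check before picking a path below (this assumes the `pciutils` package, which provides `lspci`, is installed - `sudo apt install pciutils` if it isn't):

```bash
# List the GPUs the system can see before installing any drivers
lspci -nn | grep -Ei 'vga|3d|display'
```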
162 | 163 | **Nvidia GPUs** 164 | - Follow Nvidia's [guide on downloading CUDA Toolkit](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian). The instructions are specific to your machine and the website will lead you to them interactively. 165 | - Run the following commands: 166 | ``` 167 | sudo apt install linux-headers-amd64 168 | sudo apt install nvidia-driver firmware-misc-nonfree 169 | ``` 170 | - Reboot the server. 171 | - Run the following command to verify the installation: 172 | ``` 173 | nvidia-smi 174 | ``` 175 | 176 | **AMD GPUs** 177 | - Add the following line to `/etc/apt/sources.list` (it enables the `contrib` and `non-free-firmware` components): 178 | ``` 179 | deb http://deb.debian.org/debian bookworm main contrib non-free-firmware 180 | ``` 181 | - Run `sudo apt update`, then `sudo apt-get install firmware-amd-graphics libgl1-mesa-dri libglx-mesa0 mesa-vulkan-drivers xserver-xorg-video-all`. 182 | - Reboot the server. 183 | 184 | We'll also install some packages that are not installed on Debian by default but may be required later: 185 | ``` 186 | sudo apt install libcurl4-openssl-dev cmake 187 | ``` 188 | 189 | ### Schedule Startup Script 190 | 191 | In this step, we'll create a script called `init.bash`. This script will be run at boot to set the GPU power limit; later sections optionally add other startup commands (e.g., vLLM or ComfyUI) to it. We set the GPU power limit lower because testing shows only a 5-15% decrease in inference performance for a 30% reduction in power consumption. This is especially important for servers that are running 24/7. 192 | 193 | - Run the following commands: 194 | ```bash 195 | touch init.bash 196 | nano init.bash 197 | ``` 198 | - Add the following lines to the script: 199 | ```bash 200 | #!/bin/bash 201 | sudo nvidia-smi -pm 1 202 | sudo nvidia-smi -pl <power_limit> 203 | ``` 204 | > Replace `<power_limit>` with the desired power limit in watts. For example, `sudo nvidia-smi -pl 250`. 205 | 206 | For multiple GPUs, modify the script to set the power limit for each GPU: 207 | ```bash 208 | sudo nvidia-smi -i 0 -pl <power_limit> 209 | sudo nvidia-smi -i 1 -pl <power_limit> 210 | ``` 211 | - Save and exit the script. 212 | - Make the script executable: 213 | ```bash 214 | chmod +x init.bash 215 | ``` 216 | 217 | Adding the `init.bash` script to the crontab will schedule it to run at boot. 218 | 219 | - Run the following command: 220 | ```bash 221 | crontab -e 222 | ``` 223 | - Add the following line to the file: 224 | ```bash 225 | @reboot /path/to/init.bash 226 | ``` 227 | > Replace `/path/to/init.bash` with the path to the `init.bash` script. 228 | 229 | - (Optional) Add the following line to shut down the server at 12am: 230 | ```bash 231 | 0 0 * * * /sbin/shutdown -h now 232 | ``` 233 | - Save and exit the file. 234 | 235 | ### Configure Script Permissions 236 | 237 | We want `init.bash` to run the `nvidia-smi` commands without having to enter a password. This is done by giving `nvidia-persistenced` and `nvidia-smi` passwordless `sudo` permissions, and can be achieved by editing the `sudoers` file. 238 | 239 | AMD users can skip this step as power limiting is not supported on AMD GPUs. 240 | 241 | - Run the following command: 242 | ```bash 243 | sudo visudo 244 | ``` 245 | - Add the following lines to the file: 246 | ``` 247 | <username> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-persistenced 248 | <username> ALL=(ALL) NOPASSWD: /usr/bin/nvidia-smi 249 | ``` 250 | > Replace `<username>` with your username. 251 | - Save and exit the file. 252 | 253 | > [!IMPORTANT] 254 | > Ensure that you add these lines AFTER `%sudo ALL=(ALL:ALL) ALL`.
The order of the lines in the file matters - the last matching line will be used, so if you add these lines before `%sudo ALL=(ALL:ALL) ALL`, they will be ignored. 255 | 256 | ### Configure Auto-Login (optional) 257 | 258 | When the server boots up, we want it to automatically log in to a user account and run the `init.bash` script. This is done by configuring the `lightdm` display manager. 259 | 260 | - Run the following command: 261 | ```bash 262 | sudo nano /etc/lightdm/lightdm.conf 263 | ``` 264 | - Find the following commented line. It should be in the `[Seat:*]` section. 265 | ``` 266 | # autologin-user= 267 | ``` 268 | - Uncomment the line and add your username: 269 | ``` 270 | autologin-user=<username> 271 | ``` 272 | > Replace `<username>` with your username. 273 | - Save and exit the file. 274 | 275 | ## Docker 276 | 277 | 📖 [**Documentation**](https://docs.docker.com/engine/) 278 | 279 | Docker is a containerization platform that allows you to run applications in isolated environments. This subsection follows [Docker's guide](https://docs.docker.com/engine/install/debian/) to install Docker Engine on Debian. The commands are listed below, but visiting the guide is recommended in case instructions have changed. 280 | 281 | - If you already have a Docker installation on your system, it's a good idea to re-install so there are no broken/out-of-date dependencies. The command below will iterate through your system's installed packages and remove the ones associated with Docker. 282 | ``` 283 | for pkg in docker.io docker-doc docker-compose podman-docker containerd runc; do sudo apt-get remove $pkg; done 284 | ``` 285 | 286 | - Run the following commands: 287 | ```bash 288 | # Add Docker's official GPG key: 289 | sudo apt-get update 290 | sudo apt-get install ca-certificates curl 291 | sudo install -m 0755 -d /etc/apt/keyrings 292 | sudo curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc 293 | sudo chmod a+r /etc/apt/keyrings/docker.asc 294 | 295 | # Add the repository to Apt sources: 296 | echo \ 297 | "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \ 298 | $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \ 299 | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null 300 | sudo apt-get update 301 | ``` 302 | - Install the Docker packages: 303 | ```bash 304 | sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin 305 | ``` 306 | - Verify the installation: 307 | ```bash 308 | sudo docker run hello-world 309 | ``` 310 | 311 | ### Nvidia Container Toolkit 312 | 313 | You will most likely want to use GPUs via Docker - this will require Nvidia Container Toolkit, which allows Docker to allocate/de-allocate memory on Nvidia GPUs. The steps for installing this are listed below, but it is recommended to reference [Nvidia's documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) for the most up-to-date commands. 314 | 315 | 1.
Configure the repository: 316 | ```bash 317 | curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \ 318 | && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \ 319 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \ 320 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list 321 | ``` 322 | 323 | 2. Update packages: 324 | ```bash 325 | sudo apt-get update 326 | ``` 327 | 328 | 3. Install Nvidia Container Toolkit packages: 329 | ```bash 330 | export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.17.8-1 331 | sudo apt-get install -y \ 332 | nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ 333 | nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ 334 | libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \ 335 | libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION} 336 | ``` 337 | 338 | ### Create a Network 339 | 340 | We'll be running most services via Docker containers. To allow multiple containers to communicate with each other, we could open up ports via UFW (which we'll [configure later](#firewall)), but this is less optimal than creating a Docker network. This way, all containers on the network can talk to each other without needing to open ports for the many services we could run via UFW, inherently creating a more secure setup. 341 | 342 | We'll call this network `app-net`: you can call it anything you like, just be sure to update the commands that use it later. 343 | 344 | Run the following: 345 | 346 | ```bash 347 | sudo docker network create app-net 348 | ``` 349 | 350 | That's it! Now, when we create containers, we can reference it as follows: 351 | 352 | **Docker Run** 353 | ```bash 354 | sudo docker run --network app-net <image> 355 | ``` 356 | 357 | **Docker Compose** 358 | ```yaml 359 | services: 360 | <service_name>: 361 | # add this 362 | networks: 363 | - app-net 364 | 365 | # add this 366 | networks: 367 | app-net: 368 | external: true 369 | ``` 370 | 371 | > Replace `<service_name>` with the actual service - don't forget to add other parameters too. 372 | 373 | With this configured, we can now call containers by name and port. Let's pretend we're calling the /health endpoint in `llama-swap` from `open-webui` (two actual containers we'll be creating later on) to ensure that the containers can see and speak to each other. Run (`CTRL+C` to quit): 374 | 375 | ```bash 376 | sudo docker exec -i open-webui curl http://llama-swap:8080/health 377 | ``` 378 | 379 | You could also run it the other way to be extra sure: 380 | 381 | ```bash 382 | sudo docker exec -it llama-swap curl http://open-webui:8080 383 | ``` 384 | 385 | > [!IMPORTANT] 386 | > The port is always the **internal port** the container is running on. If a container runs on 1111:8080, for example, 1111 is the port on the host (where you might access it, like `http://localhost:1111` or `http://<server-ip>:1111`) and 8080 is the internal port the container is running on. Thus, trying to access the container on 1111 via `app-net` will not work. Remembering this when specifying URLs in services will save you a lot of unnecessary "why isn't this working?" pains. 387 | 388 | ### Helpful Commands 389 | 390 | In the process of setting up this server (or anywhere down the rabbit hole of setting up services), you're likely to use Docker often.
For the uninitiated, here are helpful commands that will make navigating and troubleshooting containers easier: 391 | 392 | - See available/running containers: `sudo docker ps -a` 393 | - Restart a container: `sudo docker restart <container_name>` 394 | - View a container's logs (live): `sudo docker logs -f <container_name>` (`CTRL+C` to quit) 395 | - Rename a container: `sudo docker rename <current_name> <new_name>` 396 | - Sometimes, a single service will spin up multiple containers, e.g. `xyz-server` and `xyz-db`. To restart both simultaneously, run the following command **from inside the directory containing the Compose file**: `sudo docker compose restart` 397 | 398 | > [!TIP] 399 | > There are no rules when it comes to how you set up your Docker containers/services. However, here are my two cents: 400 | > It's cleanest to use Docker Compose (`sudo docker compose up -d` with a `docker-compose.yaml` file as opposed to `sudo docker run -d <flags> <image>`). Unless you take copious notes on your homelab and its setup, this method is almost self-documenting and keeps a neat trail of the services you run via their compose files. One compose file per directory is standard. 401 | 402 | ## HuggingFace CLI 403 | 404 | 📖 [**Documentation**](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) 405 | 406 | HuggingFace is the leading open-source ML/AI platform - it hosts models (including LLMs), datasets, and demo apps that can be used to test models. For the purpose of this guide, we'll be using HuggingFace to download popular open-source LLMs. 407 | 408 | > [!NOTE] 409 | > Only needed for llama.cpp/vLLM. 410 | 411 | - Create a new virtual environment: 412 | ```bash 413 | python3 -m venv hf-env 414 | source hf-env/bin/activate 415 | ``` 416 | - Install the `huggingface_hub` package using `pip`: 417 | ```bash 418 | pip install -U "huggingface_hub[cli]" 419 | ``` 420 | - Create an authentication token on https://huggingface.co 421 | - Log in to HF Hub: 422 | ```bash 423 | hf auth login 424 | ``` 425 | - Enter your token when prompted. 426 | - Run the following to verify your login: 427 | ```bash 428 | hf auth whoami 429 | ``` 430 | 431 | The result should be your username. 432 | 433 | ### Manage Models 434 | 435 | Models can be downloaded either to the default location (`~/.cache/huggingface/hub`) or to any local directory you specify. Where the model is stored can be defined using the `--local-dir` command line flag. Not specifying this will result in the model being stored in the default location. Storing the model in the folder where the packages for the inference engine are stored is good practice - this way, everything you need to run inference on a model is stored in the same place. However, if you use the same models with multiple backends frequently (e.g. using Qwen_QwQ-32B-Q4_K_M.gguf with both llama.cpp and vLLM), either set a common model directory or use the default HF option without specifying this flag. 436 | 437 | First, activate the virtual environment that contains `huggingface_hub`: 438 | ``` 439 | source hf-env/bin/activate 440 | ``` 441 | 442 | ### Download Models 443 | 444 | Models are downloaded using their HuggingFace tag. Here, we'll use bartowski/Qwen_QwQ-32B-GGUF as an example. To download a model, run: 445 | ``` 446 | hf download bartowski/Qwen_QwQ-32B-GGUF Qwen_QwQ-32B-Q4_K_M.gguf --local-dir models 447 | ``` 448 | Ensure that you are in the correct directory when you run this.
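As another concrete example, the Qwen3 4B GGUF that the llama-swap section later assumes lives in a `models` directory can be fetched the same way (repository and filename taken from that section - swap in whichever quant you prefer):

```bash
# Pull the single GGUF file used in the llama-swap examples into ./models
hf download unsloth/Qwen3-4B-Instruct-2507-GGUF \
  Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf \
  --local-dir models
```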
449 | 450 | ### Delete Models 451 | 452 | To delete a model in the specified location, run: 453 | ``` 454 | rm <path_to_model> 455 | ``` 456 | 457 | To delete a model in the default location, run: 458 | ``` 459 | hf delete-cache 460 | ``` 461 | 462 | This will start an interactive session where you can remove models from the HuggingFace directory. In case you've been saving models in a different location than `~/.cache/huggingface`, deleting models from there will free up space but the metadata will remain in the HF cache until it is deleted properly. This can be done via the above command but you can also simply delete the model directory from `~/.cache/huggingface/hub`. 463 | 464 | ## Search Engine 465 | 466 | > [!NOTE] 467 | > This step is optional but highly recommended for grounding LLMs with relevant search results from reputable sources. Targeted web searches via MCP tool calls make reports generated by LLMs much less prone to random hallucinations. 468 | 469 | ### SearXNG 470 | 471 | 🌟 [**GitHub**](https://github.com/searxng/searxng) 472 | 📖 [**Documentation**](https://docs.searxng.org) 473 | 474 | To power our search-based workflows, we don't want to rely on a search provider that can monitor searches. While using any search engine has this problem, metasearch engines like SearXNG mitigate it to a decent degree. SearXNG aggregates results from over 245 [search services](https://docs.searxng.org/user/configured_engines.html#configured-engines) and does not track/profile users. You can use a hosted instance on the Internet but, considering the priorities of this guide and how trivial it is to set one up, we'll be spinning up our own instance on port 5050. 475 | 476 | 1. Start the container: 477 | ```bash 478 | mkdir searxng 479 | cd searxng 480 | sudo docker pull searxng/searxng 481 | export PORT=5050 482 | sudo docker run \ 483 | -d -p ${PORT}:8080 \ 484 | --name searxng \ 485 | --network app-net \ 486 | -v "${PWD}/searxng:/etc/searxng" \ 487 | -e "BASE_URL=http://0.0.0.0:$PORT/" \ 488 | -e "INSTANCE_NAME=searxng" \ 489 | --restart unless-stopped \ 490 | searxng/searxng 491 | ``` 492 | 493 | 2. Edit `settings.yml` (created by the volume mount above inside the `searxng` subdirectory) to include JSON format support: 494 | ```bash 495 | sudo nano searxng/settings.yml 496 | ``` 497 | 498 | Add the following: 499 | ```yaml 500 | search: 501 | # ...other parameters here... 502 | formats: 503 | - html 504 | - json # add this line 505 | ``` 506 | 507 | 3. Restart the container with `sudo docker restart searxng` 508 | 509 | ### Open WebUI Integration 510 | 511 | In case you want a simple web search workflow and want to skip MCP servers/agentic setups, Open WebUI supports web search functionality natively. Navigate to `Admin Panel > Settings > Web Search` and set the following values: 512 | 513 | - Enable `Web Search` 514 | - Web Search Engine: `searxng` 515 | - Searxng Query URL: `http://searxng:8080/search?q=<query>` 516 | - API Key: `anything-you-like` 517 | 518 | ## Inference Engine 519 | 520 | The inference engine is one of the primary components of this setup. It is code that takes model files containing weights and makes it possible to get useful outputs from them. This guide allows a choice between llama.cpp, vLLM, and Ollama - all of these are popular inference engines with different priorities and strengths (note: Ollama uses llama.cpp under the hood and is simply a CLI wrapper). It can be daunting to jump straight into the deep end with command line arguments in llama.cpp and vLLM.
If you're a power user and enjoy the flexibility afforded by tight control over serving parameters, using either llama.cpp or vLLM will be a wonderful experience, and the choice between them really comes down to the quantization format you want to run. However, if you're a beginner or aren't yet comfortable with this, Ollama can be a convenient stopgap while you build the skills you need - or the very end of the line, if you decide your current level of knowledge is enough! 521 | 522 | ### Ollama 523 | 524 | 🌟 [**GitHub**](https://github.com/ollama/ollama) 525 | 📖 [**Documentation**](https://github.com/ollama/ollama/tree/main/docs) 526 | 🔧 [**Engine Arguments**](https://github.com/ollama/ollama/blob/main/docs/modelfile.md) 527 | 528 | Ollama will be installed as a service, so it runs automatically at boot. 529 | 530 | - Download Ollama from the official repository: 531 | ``` 532 | curl -fsSL https://ollama.com/install.sh | sh 533 | ``` 534 | 535 | We want our API endpoint to be reachable by the rest of the LAN. For Ollama, this means setting `OLLAMA_HOST=0.0.0.0` in the `ollama.service`. 536 | 537 | - Run the following command to edit the service: 538 | ``` 539 | sudo systemctl edit ollama.service 540 | ``` 541 | - Find the `[Service]` section and add `Environment="OLLAMA_HOST=0.0.0.0"` under it. It should look like this: 542 | ``` 543 | [Service] 544 | Environment="OLLAMA_HOST=0.0.0.0" 545 | ``` 546 | - Save and exit. 547 | - Reload the `systemd` daemon and restart Ollama. 548 | ``` 549 | sudo systemctl daemon-reload 550 | sudo systemctl restart ollama 551 | ``` 552 | 553 | > [!TIP] 554 | > If you installed Ollama manually or don't use it as a service, remember to run `ollama serve` to properly start the server. Refer to [Ollama's troubleshooting steps](#ollama-2) if you encounter an error. 555 | 556 | ### llama.cpp 557 | 558 | 🌟 [**GitHub**](https://github.com/ggml-org/llama.cpp) 559 | 📖 [**Documentation**](https://github.com/ggml-org/llama.cpp/tree/master/docs) 560 | 🔧 [**Engine Arguments**](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) 561 | 562 | - Clone the llama.cpp GitHub repository: 563 | ``` 564 | git clone https://github.com/ggml-org/llama.cpp.git 565 | cd llama.cpp 566 | ``` 567 | - Build the binary: 568 | 569 | **CPU** 570 | ``` 571 | cmake -B build 572 | cmake --build build --config Release 573 | ``` 574 | 575 | **CUDA** 576 | ``` 577 | cmake -B build -DGGML_CUDA=ON 578 | cmake --build build --config Release 579 | ``` 580 | For other systems looking to use Metal, Vulkan and other low-level graphics APIs, view the complete [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) to leverage accelerated inference. 581 | 582 | ### vLLM 583 | 584 | 🌟 [**GitHub**](https://github.com/vllm-project/vllm) 585 | 📖 [**Documentation**](https://docs.vllm.ai/en/stable/index.html) 586 | 🔧 [**Engine Arguments**](https://docs.vllm.ai/en/stable/serving/engine_args.html) 587 | 588 | vLLM comes with its own OpenAI-compatible API that we can use just like Ollama. Where Ollama runs GGUF model files, vLLM can run AWQ, GPTQ, GGUF, BitsAndBytes, and safetensors (the default release type) natively. 589 | 590 | **Manual Installation (Recommended)** 591 | 592 | - Create a directory and virtual environment for vLLM: 593 | ``` 594 | mkdir vllm 595 | cd vllm 596 | python3 -m venv .venv 597 | source .venv/bin/activate 598 | ``` 599 | 600 | - Install vLLM using `pip`: 601 | ``` 602 | pip install vllm 603 | ``` 604 | 605 | - Serve with your desired flags.
It uses port 8000 by default, but I'm using port 8556 here so it doesn't conflict with any other services: 606 | ``` 607 | vllm serve <model_tag> --port 8556 608 | ``` 609 | 610 | - To use as a service, add the following block to `init.bash` to serve vLLM on startup: 611 | ``` 612 | source .venv/bin/activate 613 | vllm serve <model_tag> --port 8556 614 | ``` 615 | > Replace `<model_tag>` with your desired model tag, copied from HuggingFace. 616 | 617 | **Docker Installation** 618 | 619 | - Run: 620 | ``` 621 | sudo docker run --gpus all \ 622 | -v ~/.cache/huggingface:/root/.cache/huggingface \ 623 | --env "HUGGING_FACE_HUB_TOKEN=<secret>" \ 624 | -p 8556:8000 \ 625 | --ipc=host \ 626 | vllm/vllm-openai:latest \ 627 | --model <model_tag> 628 | ``` 629 | > Replace `<secret>` with your HuggingFace Hub token and `<model_tag>` with your desired model tag, copied from HuggingFace. 630 | 631 | To serve a different model: 632 | 633 | - First stop the existing container: 634 | ``` 635 | sudo docker ps -a 636 | sudo docker stop <container_id> 637 | ``` 638 | 639 | - If you want to run the exact same setup again in the future, skip this step. Otherwise, run the following to delete the container and not clutter your Docker container environment: 640 | ``` 641 | sudo docker rm <container_id> 642 | ``` 643 | 644 | - Rerun the Docker command from the installation with the desired model. 645 | ``` 646 | sudo docker run --gpus all \ 647 | -v ~/.cache/huggingface:/root/.cache/huggingface \ 648 | --env "HUGGING_FACE_HUB_TOKEN=<secret>" \ 649 | -p 8556:8000 \ 650 | --ipc=host \ 651 | vllm/vllm-openai:latest \ 652 | --model <model_tag> 653 | ``` 654 | 655 | ### Open WebUI Integration 656 | > [!NOTE] 657 | > Only needed for llama.cpp/vLLM. 658 | 659 | Navigate to `Admin Panel > Settings > Connections` and set the following values: 660 | 661 | - Enable `OpenAI API` 662 | - API Base URL: `http://host.docker.internal:<port>/v1` 663 | - API Key: `anything-you-like` 664 | 665 | > [!NOTE] 666 | > `host.docker.internal` is a magic hostname that resolves to the internal IP address assigned to the host by Docker. This allows containers to communicate with services running on the host, such as databases or web servers, without needing to know the host's IP address. It simplifies communication between containers and host-based services, making it easier to develop and deploy applications. 667 | 668 | ### Ollama vs. llama.cpp 669 | 670 | | **Aspect** | **Ollama (Wrapper)** | **llama.cpp (Vanilla)** | 671 | | -------------------------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------- | 672 | | **Installation/Setup** | One-click install & CLI model management | Requires manual setup/configuration | 673 | | **Open WebUI Integration** | First-class citizen | Requires OpenAI-compatible endpoint setup | 674 | | **Model Switching** | Native model-switching via server | Requires manual port management or [llama-swap](https://github.com/mostlygeek/llama-swap) | 675 | | **Customizability** | Limited: Modelfiles are cumbersome | Full control over parameters via CLI | 676 | | **Transparency** | Defaults may override model parameters (e.g., context length) | Full transparency in parameter settings | 677 | | **GGUF Support** | Inherits llama.cpp's best-in-class implementation | Best GGUF implementation | 678 | | **GPU-CPU Splitting** | Inherits llama.cpp's efficient splitting | Trivial GPU-CPU splitting out-of-the-box | 679 | 680 | --- 681 | 682 | ### vLLM vs.
Ollama/llama.cpp 683 | | **Feature** | **vLLM** | **Ollama/llama.cpp** | 684 | | ----------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------- | 685 | | **Vision Models** | Supports Qwen 2.5 VL, Llama 3.2 Vision, etc. | Ollama supports some vision models, llama.cpp does not support any (via llama-server) | 686 | | **Quantization** | Supports AWQ, GPTQ, BnB, etc. | Only supports GGUF | 687 | | **Multi-GPU Inference** | Yes | Yes | 688 | | **Tensor Parallelism** | Yes | No | 689 | 690 | In summary, 691 | 692 | - **Ollama**: Best for those who want an "it just works" experience. 693 | - **llama.cpp**: Best for those who want total control over their inference servers and are familiar with engine arguments. 694 | - **vLLM**: Best for those who want (i) to run non-GGUF quantizations of models, (ii) multi-GPU inference using tensor parallelism, or (iii) to use vision models. 695 | 696 | Using Ollama as a service offers no degradation in experience because unused models are offloaded from VRAM after some time. Using vLLM or llama.cpp as a service keeps a model in memory, so I wouldn't use this alongside Ollama in an automated, always-on fashion unless it was your primary inference engine. Essentially, 697 | 698 | | Primary Engine | Secondary Engine | Run SE as service? | 699 | | -------------- | ---------------- | ------------------ | 700 | | Ollama | llama.cpp/vLLM | No | 701 | | llama.cpp/vLLM | Ollama | Yes | 702 | 703 | ## Model Server 704 | 705 | > [!NOTE] 706 | > Only needed for manual installations of llama.cpp/vLLM. Ollama manages model management via its CLI. 707 | 708 | While the above steps will help you get up and running with an OpenAI-compatible LLM server, they will not help with this server persisting after you close your terminal window or restart your physical server. They also won't allow a chat platform to reliably reference and swap between the various models you have available - a likely use-case in a landscape where different models specialize in different tasks. Running the inference engine via Docker can achieve this persistence with the `-d` (for "detach") flag but (i) services like llama.cpp and vLLM are usually configured without Docker and (ii) it can't swap models on-demand. This necessitates a server that can manage loading/unloading, swapping, and listing available models. 709 | 710 | ### llama-swap 711 | 712 | 🌟 [**GitHub**](https://github.com/mostlygeek/llama-swap) 713 | 📖 [**Documentation**](https://github.com/mostlygeek/llama-swap/wiki) 714 | 715 | > [!TIP] 716 | > This is my recommended way to run llama.cpp/vLLM models. 717 | 718 | llama-swap is a lightweight proxy server for LLMs that solves our pain points from above. It's an extremely configurable tool that allows a single point of entry for models from various backends. Models can be set up in groups, listed/unlisted easily, configured with customized hyperparameters, and monitored using streamed logs in the llama-swap web UI. 719 | 720 | In the installation below, we'll use `Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf` for llama.cpp and `Qwen/Qwen3-4B-Instruct-2507` for vLLM. We'll also use port 7000 to serve the models on. 721 | 722 | 1. Create a new directory with the `config.yaml` file: 723 | ```bash 724 | sudo mkdir llama-swap 725 | cd llama-swap 726 | sudo nano config.yaml 727 | ``` 728 | 729 | 2. 
Enter the following and save: 730 | 731 | **llama.cpp** 732 | ```yaml 733 | models: 734 | "qwen3-4b": 735 | proxy: "http://127.0.0.1:7000" 736 | cmd: | 737 | /app/llama-server 738 | -m /models/Qwen3-4B-Instruct-2507-UD-Q4_K_XL.gguf 739 | # or use `-hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_K_XL` for HuggingFace 740 | --port 7000 741 | ``` 742 | 743 | **vLLM (Docker)** 744 | ```yaml 745 | models: 746 | "qwen3-4b": 747 | proxy: "http://127.0.0.1:7000" 748 | cmd: | 749 | docker run --name qwen-vllm 750 | --init --rm -p 7000:8000 751 | --ipc=host 752 | vllm/vllm-openai:latest 753 | --model /models/Qwen/Qwen3-4B-Instruct-2507 754 | cmdStop: docker stop qwen-vllm 755 | ``` 756 | 757 | **vLLM (local)**: 758 | ```yaml 759 | models: 760 | "qwen3-4b": 761 | proxy: "http://127.0.0.1:7000" 762 | cmd: | 763 | source /app/vllm/.venv/bin/activate && \ 764 | /app/vllm/.venv/bin/vllm serve \ 765 | --port 7000 \ 766 | --host 0.0.0.0 \ 767 | /models/Qwen/Qwen3-4B-Instruct-2507 768 | cmdStop: pkill -f "vllm serve" 769 | ``` 770 | 771 | 3. Run the container: 772 | 773 | We use the `cuda` tag here, but llama-swap offers `cpu`, `intel`, `vulkan`, and `musa` tags as well. Releases can be found [here](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap). 774 | 775 | **llama.cpp** 776 | ```bash 777 | sudo docker run -d --gpus all --restart unless-stopped --network app-net --pull=always --name llama-swap -p 9292:8080 \ 778 | -v /path/to/models:/models \ 779 | -v /home/<username>/llama-swap/config.yaml:/app/config.yaml \ 780 | -v /home/<username>/llama.cpp/build/bin/llama-server:/app/llama-server \ 781 | ghcr.io/mostlygeek/llama-swap:cuda 782 | ``` 783 | 784 | **vLLM (Docker/local)** 785 | ```bash 786 | sudo docker run -d --gpus all --restart unless-stopped --network app-net --pull=always --name llama-swap -p 9292:8080 \ 787 | -v /path/to/models:/models \ 788 | -v /home/<username>/vllm:/app/vllm \ 789 | -v /home/<username>/llama-swap/config.yaml:/app/config.yaml \ 790 | ghcr.io/mostlygeek/llama-swap:cuda 791 | ``` 792 | 793 | > Replace `<username>` with your actual username and `/path/to/models` with the path to your model files. 794 | 795 | > [!NOTE] 796 | > llama-swap prefers Docker-based vLLM due to cleanliness of environments and adherence to SIGTERM signals sent by the server. I've written out both options here. 797 | 798 | This should result in a functioning llama-swap instance running at `http://localhost:9292`, which can be confirmed by running `curl http://localhost:9292/health`. It is **highly recommended** that you read the [configuration documentation](https://github.com/mostlygeek/llama-swap/wiki/Configuration). llama-swap is thoroughly documented and highly configurable - utilizing its capabilities will result in a tailored setup ready to deploy as you need it. 799 | 800 | ### `systemd` Service 801 | 802 | The other way to persist a model across system reboots is to start the inference engine in a `.service` file that will run alongside the Linux operating system when booting, ensuring that it is available whenever the server is on. If you're willing to live with the relative compromise of not being able to swap models/backends and are satisfied with running one model, this is the lowest overhead solution and works great. 803 | 804 | Let's call the service we're about to build `llm-server.service`. We'll assume all models are in the `models` child directory - you can change this as you need to. 805 | 806 | 1. Create the `systemd` service file: 807 | ```bash 808 | sudo nano /etc/systemd/system/llm-server.service 809 | ``` 810 | 811 | 2.
Configure the service file: 812 | 813 | **llama.cpp** 814 | ```ini 815 | [Unit] 816 | Description=LLM Server Service 817 | After=network.target 818 | 819 | [Service] 820 | User=<username> 821 | Group=<username> 822 | WorkingDirectory=/home/<username>/llama.cpp/build/bin/ 823 | ExecStart=/home/<username>/llama.cpp/build/bin/llama-server \ 824 | --port <port> \ 825 | --host 0.0.0.0 \ 826 | -m /home/<username>/llama.cpp/models/<model> \ 827 | --no-webui # [other engine arguments] 828 | Restart=always 829 | RestartSec=10s 830 | 831 | [Install] 832 | WantedBy=multi-user.target 833 | ``` 834 | 835 | **vLLM** 836 | ```ini 837 | [Unit] 838 | Description=LLM Server Service 839 | After=network.target 840 | 841 | [Service] 842 | User=<username> 843 | Group=<username> 844 | WorkingDirectory=/home/<username>/vllm/ 845 | ExecStart=/bin/bash -c 'source .venv/bin/activate && vllm serve /home/<username>/vllm/models/<model> --port <port> --host 0.0.0.0' 846 | Restart=always 847 | RestartSec=10s 848 | 849 | [Install] 850 | WantedBy=multi-user.target 851 | ``` 852 | > Replace `<username>`, `<port>`, and `<model>` with your Linux username, desired port for serving, and desired model respectively. 853 | 854 | 3. Reload the `systemd` daemon: 855 | ```bash 856 | sudo systemctl daemon-reload 857 | ``` 858 | 4. Run the service: 859 | 860 | If `llm-server.service` doesn't exist: 861 | ```bash 862 | sudo systemctl enable llm-server.service 863 | sudo systemctl start llm-server 864 | ``` 865 | 866 | If `llm-server.service` does exist: 867 | ```bash 868 | sudo systemctl restart llm-server 869 | ``` 870 | 5. (Optional) Check the service's status: 871 | ```bash 872 | sudo systemctl status llm-server 873 | ``` 874 | 875 | ### Open WebUI Integration 876 | 877 | #### llama-swap 878 | 879 | Navigate to `Admin Panel > Settings > Connections` and set the following values: 880 | 881 | - Enable OpenAI API 882 | - API Base URL: `http://llama-swap:8080/v1` 883 | - API Key: `anything-you-like` 884 | 885 | #### `systemd` Service 886 | 887 | Follow the same steps as above. 888 | 889 | - Enable OpenAI API 890 | - API Base URL: `http://host.docker.internal:<port>/v1` 891 | - API Key: `anything-you-like` 892 | 893 | > Replace `<port>` with the port your `llm-server` service listens on. Since Open WebUI runs in a container, it reaches the host-level service via `host.docker.internal` rather than `localhost`. 894 | 895 | ## Chat Platform 896 | 897 | ### Open WebUI 898 | 899 | 🌟 [**GitHub**](https://github.com/open-webui/open-webui) 900 | 📖 [**Documentation**](https://docs.openwebui.com) 901 | 902 | Open WebUI is a web-based interface for managing models and chats, and provides a beautiful, performant UI for communicating with your models. Install it if you want to access your models from a web interface; if you're fine with using the command line or want to consume models through a plugin/extension, you can skip this step. 903 | 904 | To install without Nvidia GPU support, run the following command: 905 | ```bash 906 | sudo docker run -d -p 3000:8080 --network app-net --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main 907 | ``` 908 | 909 | For Nvidia GPUs, run the following command: 910 | ```bash 911 | sudo docker run -d -p 3000:8080 --network app-net --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda 912 | ``` 913 | 914 | You can access it by navigating to `http://localhost:3000` in your browser or `http://<server-ip>:3000` from another device on the same network. There's no need to add this to the `init.bash` script as Open WebUI will start automatically at boot via Docker Engine.
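If the page doesn't come up, the container's status and live logs (using the helpful Docker commands from earlier) are the first things to check:

```bash
# Confirm the Open WebUI container is running and watch its startup logs
sudo docker ps --filter name=open-webui
sudo docker logs -f open-webui   # CTRL+C to quit
```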
915 | 916 | Read more about Open WebUI [here](https://github.com/open-webui/open-webui). 917 | 918 | ## MCP Proxy Server 919 | 920 | Model Context Protocol (MCP) is a protocol that exposes tools (functions/scripts written in code) to LLMs in a standardized way. Generally, models are being trained more and more with the ability to natively call tools in order to power agentic tasks - think along the lines of having a model use sequential thinking to formulate multiple thoughts, execute multiple targeted web searches, and provide a response leveraging real-time information. MCP also, probably more importantly for most people, enables models to call third-party tools like those for GitHub, Azure, etc. A complete list, curated and maintained by Anthropic, can be found [here](https://github.com/modelcontextprotocol/servers). 921 | 922 | Most guides on the Internet concerning MCP will have you spin up an MCP server via a client like VS Code or Cline (since most agentic use is for coding), or via Claude Desktop, which is a proprietary app by Anthropic and not at all what we're aiming for in this guide with respect to privacy. There *are* other chat clients that support MCP server management from the UI itself (LobeChat, Cherry Studio, etc.), but we want to be able to manage MCP servers in a central and modular way. That way, (i) they aren't tied to a specific client and are available to every client you may use them with and (ii) if you switch chat platforms in the future, your MCP servers require zero changes because they run as a decoupled service - a little more maintenance for a lot more flexibility down the line. We can do this by setting up an MCP proxy server. 923 | 924 | This proxy server will take the MCP servers running via the stdio (standard IO) protocol (which can only be accessed by an application running on that device) and make them compatible with streamable HTTP. Any MCP-enabled client can use streamable HTTP, so they'll also be able to use all the servers we install on our physical server. This centralizes the management of your MCP servers: create/edit/delete servers in one place, use any of them from your various clients (Open WebUI, VS Code, etc.). 925 | 926 | We'll use the [fetch](https://github.com/zcaceres/fetch-mcp), [sequential-thinking](https://github.com/arben-adm/mcp-sequential-thinking), and [searxng](https://github.com/ihor-sokoliuk/MCP-searxng) MCP servers to get started. The process for adding more servers will be identical. 927 | 928 | ### mcp-proxy 929 | 930 | 🌟 [**GitHub**](https://github.com/sparfenyuk/mcp-proxy) 931 | 932 | mcp-proxy is a server proxy that allows switching between transports (stdio to streamable HTTP and vice versa). I'll be using port 3131 to avoid conflicts - feel free to change this as you need to. I will also be extending mcp-proxy to include `uv`: most MCP servers use either `npx` or `uv`, and setting mcp-proxy up without `uv` is going to hamper your ability to run the MCP servers you'd like. If you don't require `uv`, (i) don't add the `build` section in the compose file and (ii) skip step 4. 933 | 934 | 1. Create a compose file: 935 | ```bash 936 | mkdir mcp-proxy 937 | cd mcp-proxy 938 | sudo nano docker-compose.yaml 939 | ``` 940 | 941 | 2. Enter the following: 942 | ```yaml 943 | services: 944 | mcp-proxy: 945 | container_name: mcp-proxy 946 | build: 947 | context: .
948 | dockerfile: Dockerfile 949 | networks: 950 | - app-net 951 | volumes: 952 | - .:/config 953 | - /:/<hostname>:ro 954 | restart: unless-stopped 955 | ports: 956 | - 3131:3131 957 | command: "--pass-environment --port=3131 --host 0.0.0.0 --transport streamablehttp --named-server-config /config/servers.json" 958 | 959 | networks: 960 | app-net: 961 | external: true 962 | ``` 963 | 964 | > Replace `<hostname>` with your actual server's hostname (or whatever else). This read-only mount is primarily useful when adding `filesystem` or similar MCP servers that read from and write files to the file system. Feel free to skip it if that isn't your goal. 965 | 966 | 3. Create a `servers.json` file: 967 | ```json 968 | { 969 | "mcpServers": { 970 | "fetch": { 971 | "disabled": false, 972 | "timeout": 60, 973 | "command": "uvx", 974 | "args": [ 975 | "mcp-server-fetch" 976 | ], 977 | "transportType": "stdio" 978 | }, 979 | "sequential-thinking": { 980 | "command": "npx", 981 | "args": [ 982 | "-y", 983 | "@modelcontextprotocol/server-sequential-thinking" 984 | ] 985 | }, 986 | "searxng": { 987 | "command": "npx", 988 | "args": ["-y", "mcp-searxng"], 989 | "env": { 990 | "SEARXNG_URL": "http://searxng:8080/search?q=" 991 | } 992 | } 993 | } 994 | } 995 | ``` 996 | 997 | 4. Create a `Dockerfile`: 998 | ```bash 999 | sudo nano Dockerfile 1000 | ``` 1001 | Enter the following: 1002 | ```Dockerfile 1003 | FROM ghcr.io/sparfenyuk/mcp-proxy:latest 1004 | 1005 | # Install npm/Node.js (needed for npx-based MCP servers) 1006 | RUN apk add --update npm 1007 | 1008 | # Install the 'uv' package 1009 | RUN python3 -m ensurepip && pip install --no-cache-dir uv 1010 | 1011 | ENV PATH="/usr/local/bin:/usr/bin:$PATH" \ 1012 | UV_PYTHON_PREFERENCE=only-system 1013 | 1014 | ENTRYPOINT ["catatonit", "--", "mcp-proxy"] 1015 | ``` 1016 | 1017 | 5. Start the container with `sudo docker compose up -d` 1018 | 1019 | Your mcp-proxy container should be up and running! Adding servers is simple: add the relevant server to `servers.json` (you can use the same configuration that the MCP server's developer provides for VS Code, it's identical) and then restart the container with `sudo docker restart mcp-proxy`. 1020 | 1021 | ### MCPJungle 1022 | 1023 | 🌟 [**GitHub**](https://github.com/mcpjungle/MCPJungle?tab=readme-ov-file) 1024 | 1025 | MCPJungle is another MCP proxy server with a different focus: it aims to provide more of a "production-grade" experience, much of which is disabled by default in the application's development mode. We'll use the standard development version of the container here on port 4141. 1026 | 1027 | 1. Create a compose file: 1028 | ```bash 1029 | mkdir mcpjungle 1030 | cd mcpjungle 1031 | sudo nano docker-compose.yaml 1032 | ``` 1033 | 1034 | Enter the following and save: 1035 | 1036 | ```yaml 1037 | # MCPJungle Docker Compose configuration for individual users. 1038 | # Use this compose file if you want to run MCPJungle locally for your personal MCP management & Gateway. 1039 | # The mcpjungle server runs in development mode.
1040 | services: 1041 | db: 1042 | image: postgres:latest 1043 | container_name: mcpjungle-db 1044 | environment: 1045 | POSTGRES_USER: mcpjungle 1046 | POSTGRES_PASSWORD: mcpjungle 1047 | POSTGRES_DB: mcpjungle 1048 | ports: 1049 | - "5432:5432" 1050 | networks: 1051 | - app-net 1052 | volumes: 1053 | - db_data:/var/lib/postgresql/data 1054 | healthcheck: 1055 | test: ["CMD-SHELL", "PGPASSWORD=mcpjungle pg_isready -U mcpjungle"] 1056 | interval: 10s 1057 | timeout: 5s 1058 | retries: 5 1059 | restart: unless-stopped 1060 | 1061 | mcpjungle: 1062 | image: mcpjungle/mcpjungle:${MCPJUNGLE_IMAGE_TAG:-latest-stdio} 1063 | container_name: mcpjungle-server 1064 | environment: 1065 | DATABASE_URL: postgres://mcpjungle:mcpjungle@db:5432/mcpjungle 1066 | SERVER_MODE: ${SERVER_MODE:-development} 1067 | OTEL_ENABLED: ${OTEL_ENABLED:-false} 1068 | ports: 1069 | - "4141:8080" 1070 | networks: 1071 | - app-net 1072 | volumes: 1073 | # Mount host filesystem current directory to enable filesystem MCP server access 1074 | - .:/host/project:ro 1075 | - /home/<username>:/host:ro 1076 | # Other options: 1077 | # - ${HOME}:/host/home:ro 1078 | # - /tmp:/host/tmp:rw 1079 | depends_on: 1080 | db: 1081 | condition: service_healthy 1082 | restart: always 1083 | 1084 | volumes: 1085 | db_data: 1086 | 1087 | networks: 1088 | app-net: 1089 | external: true 1090 | ``` 1091 | 1092 | 2. Start the container with `sudo docker compose up -d` 1093 | 1094 | 3. Create a tool file: 1095 | ```bash 1096 | sudo nano fetch.json 1097 | ``` 1098 | 1099 | Enter the following and save: 1100 | ```json 1101 | { 1102 | "name": "fetch", 1103 | "transport": "stdio", 1104 | "command": "npx", 1105 | "args": ["mcp-server-fetch"] 1106 | } 1107 | ``` 1108 | 1109 | 4. Register the tool: 1110 | ```bash 1111 | sudo docker exec -i mcpjungle-server /mcpjungle register -c /host/project/fetch.json 1112 | ``` 1113 | 1114 | Repeat steps 3 and 4 for every tool mentioned. The configurations for `sequential-thinking` and `searxng` can be found below. 1115 | 1116 | **sequential-thinking** 1117 | ```json 1118 | { 1119 | "name": "sequential-thinking", 1120 | "transport": "stdio", 1121 | "command": "npx", 1122 | "args": ["-y", "@modelcontextprotocol/server-sequential-thinking"] 1123 | } 1124 | ``` 1125 | 1126 | **searxng** 1127 | ```json 1128 | { 1129 | "name": "searxng", 1130 | "transport": "stdio", 1131 | "command": "npx", 1132 | "args": ["-y", "mcp-searxng"], 1133 | "env": { 1134 | "SEARXNG_URL": "http://searxng:8080/search?q=" 1135 | } 1136 | } 1137 | ``` 1138 | 1139 | ### Comparison 1140 | 1141 | The choice between the two services is yours entirely: I use mcp-proxy because I find the workflow slightly less cumbersome than MCPJungle. Here's a comparison with the strengths of each service.
1142 | 1143 | **mcp-proxy > MCPJungle** 1144 | 1145 | - Servers can just be added to `servers.json` and will be registered automatically on container restart - MCPJungle requires manual registration of tools via the CLI 1146 | - Uses the standard MCP syntax that most clients accept for configuration 1147 | - Lighter footprint in that it doesn't need to spin up a separate database container 1148 | - Uses stateful connections - MCPJungle spins up a new connection per tool call, which can lead to some performance overhead 1149 | 1150 | **MCPJungle > mcp-proxy** 1151 | 1152 | - Combines all tools under one endpoint, making it very easy to integrate into a chat frontend 1153 | - Capable of creating a very configurable setup with tool groups, access control, selective tool enabling/disabling 1154 | - Supports enterprise features like telemetry 1155 | 1156 | ### Open WebUI Integration 1157 | 1158 | Open WebUI recently added support for streamable HTTP - where once you may have had to use [mcpo](https://github.com/open-webui/mcpo), Open WebUI's way of automatically generating an OpenAPI-compatible HTTP server, you can now use the MCP servers you've set up as-is with no changes. 1159 | 1160 | #### mcp-proxy 1161 | 1162 | Navigate to `Admin Panel > Settings > External Tools`. Click the `+` button to add a new tool and enter the following information: 1163 | 1164 | - URL: `http://mcp-proxy:<port>/servers/<tool_name>/mcp` 1165 | - API Key: `anything-you-like` 1166 | - ID: `<tool_name>` 1167 | - Name: `<tool_name>` 1168 | 1169 | > Replace `<port>` with the port of the MCP service and `<tool_name>` with the specific tool you're adding. 1170 | 1171 | #### MCPJungle 1172 | 1173 | Follow the same steps as above. By design, MCPJungle exposes all tools via one endpoint, so you should only have to add it once: 1174 | 1175 | - URL: `http://mcpjungle-server:8080/mcp` 1176 | - API Key: `anything-you-like` 1177 | - ID: `<id>` 1178 | - Name: `<name>` 1179 | 1180 | > [!IMPORTANT] 1181 | > When configuring models in Open WebUI (via `Admin Panel > Settings > Models > my-cool-model > Advanced Params`), change the `Function Calling` parameter from `Default` to `Native`. This step will allow the model to use multiple tool calls to formulate a single response instead of just one. 1182 | 1183 | ### VS Code/Claude Desktop Integration 1184 | 1185 | The steps for integrating your MCP proxy server in another client such as VS Code (Claude Desktop, Zed, etc.) will be similar, if not exactly the same. 1186 | 1187 | Add the following key and value to your `mcp.json` file: 1188 | 1189 | ```json 1190 | "your-mcp-proxy-name": { 1191 | "timeout": 60, 1192 | "type": "stdio", 1193 | "command": "npx", 1194 | "args": [ 1195 | "mcp-remote", 1196 | "http://<server-ip>:<port>/mcp", 1197 | "--allow-http" 1198 | ] 1199 | } 1200 | ``` 1201 | 1202 | ## Text-to-Speech Server 1203 | 1204 | ### Kokoro FastAPI 1205 | 1206 | 🌟 [**GitHub**](https://github.com/remsky/Kokoro-FastAPI) 1207 | 1208 | Kokoro FastAPI is a text-to-speech server that wraps [Kokoro-82M](https://huggingface.co/hexgrad/Kokoro-82M), a state-of-the-art TTS model, and provides OpenAI-compatible API inference for it. The documentation for this project is fantastic and covers most, if not all, of the use cases for the project itself. 1209 | 1210 | To install Kokoro-FastAPI, run: 1211 | ```bash 1212 | git clone https://github.com/remsky/Kokoro-FastAPI.git 1213 | cd Kokoro-FastAPI 1214 | sudo docker compose up --build 1215 | ``` 1216 | 1217 | The server can be used in two ways: an API and a UI. By default, the API is served on port 8880 and the UI is served on port 7860.
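Before wiring it into Open WebUI, you can sanity-check the API with a request against the OpenAI-style speech endpoint. This is just a quick smoke test - the `af_bella` voice used here is an assumption, so check the Kokoro FastAPI docs for the current list of voices:

```bash
# Ask the Kokoro FastAPI server (default port 8880) for a short test clip
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "kokoro", "voice": "af_bella", "input": "Hello from the LLM server."}' \
  -o test.mp3
```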
1218 | 1219 | ### Open WebUI Integration 1220 | 1221 | Navigate to `Admin Panel > Settings > Audio` and set the following values: 1222 | 1223 | - Text-to-Speech Engine: `OpenAI` 1224 | - API Base URL: `http://host.docker.internal:8880/v1` 1225 | - API Key: `anything-you-like` 1226 | - Set Model: `kokoro` 1227 | - Response Splitting: None (this is crucial - Kokoro uses a novel audio splitting system) 1228 | 1229 | 1230 | 1231 | ## Image Generation Server 1232 | 1233 | ### ComfyUI 1234 | 1235 | 🌟 [**GitHub**](https://github.com/comfyanonymous/ComfyUI) 1236 | 📖 [**Documentation**](https://docs.comfy.org) 1237 | 1238 | ComfyUI is a popular open-source graph-based tool for generating images using image generation models such as Stable Diffusion XL, Stable Diffusion 3, and the Flux family of models. 1239 | 1240 | - Clone and navigate to the repository: 1241 | ``` 1242 | git clone https://github.com/comfyanonymous/ComfyUI 1243 | cd ComfyUI 1244 | ``` 1245 | - Set up a new virtual environment: 1246 | ``` 1247 | python3 -m venv comfyui-env 1248 | source comfyui-env/bin/activate 1249 | ``` 1250 | - Download the platform-specific dependencies: 1251 | - Nvidia GPUs 1252 | ``` 1253 | pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121 1254 | ``` 1255 | - AMD GPUs 1256 | ``` 1257 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0 1258 | ``` 1259 | - Intel GPUs 1260 | 1261 | Read the installation instructions from [ComfyUI's GitHub](https://github.com/comfyanonymous/ComfyUI?tab=readme-ov-file#intel-gpus). 1262 | 1263 | - Download the general dependencies: 1264 | ``` 1265 | pip install -r requirements.txt 1266 | ``` 1267 | 1268 | Now, we have to download and load a model. Here, we'll use FLUX.1 [dev], a state-of-the-art medium-tier model by Black Forest Labs that fits well on a 24GB RTX 3090. Since we want this to be set up as easily as possible, we'll use a complete checkpoint that can be loaded directly into ComfyUI. For a completely customized workflow, CLIPs, VAEs, and models can be downloaded separately. Follow [this guide](https://comfyanonymous.github.io/ComfyUI_examples/flux/#simple-to-use-fp8-checkpoint-version) by ComfyUI's creator to install the FLUX.1 models in a fully customizable way. 1269 | 1270 | > [!NOTE] 1271 | > [FLUX.1 [schnell] HuggingFace](https://huggingface.co/Comfy-Org/flux1-schnell/blob/main/flux1-schnell-fp8.safetensors) (smaller, ideal for <24GB VRAM) 1272 | > 1273 | > [FLUX.1 [dev] HuggingFace](https://huggingface.co/Comfy-Org/flux1-dev/blob/main/flux1-dev-fp8.safetensors) (larger, ideal for 24GB VRAM) 1274 | 1275 | - Download your desired model into ComfyUI's `models/checkpoints` directory. 1276 | 1277 | - If you want ComfyUI to be served at boot and effectively run as a service, add the following lines to `init.bash`: 1278 | ``` 1279 | cd /path/to/comfyui 1280 | source comfyui-env/bin/activate 1281 | python main.py --listen 1282 | ``` 1283 | > Replace `/path/to/comfyui` with the path to your ComfyUI directory, relative to the location of `init.bash`. 1284 | 1285 | Otherwise, to run it just once, simply execute the above lines in a terminal window.
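Before wiring ComfyUI into Open WebUI, it's worth confirming the server is actually reachable on its default port (8188). A minimal sketch, assuming a recent ComfyUI build that exposes the `/system_stats` endpoint:

```bash
# Quick reachability check - assumes ComfyUI's default port (8188).
# /system_stats reports devices and VRAM; any HTTP response at all (even a 404
# on older builds) confirms the server itself is listening.
curl -s http://localhost:8188/system_stats | python3 -m json.tool
```

If you started ComfyUI with `python main.py --listen`, the same check works from other machines on your network by swapping `localhost` for the server's IP.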
1286 | 1287 | ### Open WebUI Integration 1288 | 1289 | Navigate to `Admin Panel > Settings > Images` and set the following values: 1290 | 1291 | - Image Generation Engine: `ComfyUI` 1292 | - API Base URL: `http://localhost:8188` 1293 | 1294 | > [!TIP] 1295 | > To use Open WebUI with FLUX.1 [dev], you'll either need more than 24GB of VRAM or a small language model running mostly on CPU. FLUX.1 [schnell] and a small language model, however, should fit cleanly in 24GB of VRAM, making for a faster experience if you intend to regularly use both text and image generation together. 1296 | 1297 | ## SSH 1298 | 1299 | Enabling SSH allows you to connect to the server remotely. After configuring SSH, you can connect to the server from another device on the same network using an SSH client like PuTTY or the terminal. This lets you run your server headlessly without needing a monitor, keyboard, or mouse after the initial setup. 1300 | 1301 | On the server: 1302 | - Install the OpenSSH server: 1303 | ``` 1304 | sudo apt install openssh-server 1305 | ``` 1306 | - Start the SSH service: 1307 | ``` 1308 | sudo systemctl start ssh 1309 | ``` 1310 | - Enable the SSH service to start at boot: 1311 | ``` 1312 | sudo systemctl enable ssh 1313 | ``` 1314 | - Find the server's IP address: 1315 | ``` 1316 | ip a 1317 | ``` 1318 | 1319 | On the client: 1320 | - Connect to the server using SSH: 1321 | ``` 1322 | ssh <username>@<ip-address> 1323 | ``` 1324 | > Replace `<username>` with your username and `<ip-address>` with the server's IP address. 1325 | 1326 | > [!NOTE] 1327 | > If you expect to tunnel into your server often, I highly recommend following [this guide](https://www.raspberrypi.com/documentation/computers/remote-access.html#configure-ssh-without-a-password) to enable passwordless SSH using `ssh-keygen` and `ssh-copy-id`. It worked perfectly on my Debian system despite having been written for Raspberry Pi OS. 1328 | 1329 | ## Firewall 1330 | 1331 | Setting up a firewall is essential for securing your server. The Uncomplicated Firewall (UFW) is a simple and easy-to-use firewall for Linux. You can use UFW to allow or deny incoming and outgoing traffic to and from your server. 1332 | 1333 | - Install UFW: 1334 | ```bash 1335 | sudo apt install ufw 1336 | ``` 1337 | 1338 | - Allow SSH, HTTPS, and HTTP to your local network: 1339 | ```bash 1340 | # Allow hosts in <ip-range> access to port <port> 1341 | sudo ufw allow from <ip-range> to any port <port> proto tcp 1342 | ``` 1343 | 1344 | Start by running the above command once per port to open ports 22 (SSH), 80 (HTTP), and 443 (HTTPS) to your local network. Since we use the `app-net` Docker network for our containers, there's no need to open anything else up. Open ports carefully, ideally only to specific IPs or to your local network. To allow a port for a specific IP, you can replace the IP range with a single IP and it'll work exactly the same way. 1345 | 1346 | > [!TIP] 1347 | > You can find your local network's IP address range by running `ip route show`. The result will be something like this: 1348 | > ``` 1349 | > me@my-cool-server:~$ ip route show 1350 | > default via <router-ip> dev enp3s0 proto dhcp src <server-ip> metric 100 1351 | > <ip-range> dev enp3s0 proto kernel scope link src <server-ip> metric 100 1352 | > # more routes 1353 | > ``` 1354 | 1355 | - Enable UFW: 1356 | ```bash 1357 | sudo ufw enable 1358 | ``` 1359 | 1360 | - Check the status of UFW: 1361 | ```bash 1362 | sudo ufw status verbose 1363 | ``` 1364 | 1365 | > [!WARNING] 1366 | > Enabling UFW without allowing access to port 22 will disrupt your existing SSH connections.
If you run a headless setup, recovering from this means connecting a monitor and keyboard to your server and then allowing SSH access through UFW. Be careful to ensure that this port is allowed when making changes to UFW's configuration. 1367 | 1368 | Refer to [this guide](https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10) for more information on setting up UFW. 1369 | 1370 | ## Remote Access 1371 | 1372 | Remote access refers to the ability to access your server outside of your home network. For example, when you leave the house, you won't be able to access `http://<server-ip>:<port>`, because your network has changed from your home network to some other network (either your mobile carrier's or a local network in some other place). This means that you won't be able to access the services running on your server. There are many solutions that address this problem; we'll explore one of the easiest to use here. 1373 | 1374 | ### Tailscale 1375 | 1376 | Tailscale is a peer-to-peer VPN service that combines many services into one. Its most common use-case is to bind many different devices of many different kinds (Windows, Linux, macOS, iOS, Android, etc.) into one virtual network. This way, all these devices can be connected to different networks but still be able to communicate with each other as if they were all on the same local network. Tailscale is not completely open source (its GUI is proprietary), but it is based on the [Wireguard](https://www.wireguard.com) VPN protocol and the remainder of the actual service is open source. Comprehensive documentation on the service can be found [here](https://tailscale.com/kb) and goes into many topics not mentioned here - I would recommend reading it to get the most out of the service. 1377 | 1378 | On Tailscale, networks are referred to as tailnets. Creating and managing tailnets requires creating an account with Tailscale (an expected scenario with a VPN service) but connections are peer-to-peer and happen without any routing through Tailscale's servers. This connection being based on Wireguard means 100% of your traffic is encrypted and cannot be accessed by anyone but the devices on your tailnet. 1379 | 1380 | #### Installation 1381 | 1382 | First, create a tailnet through the Admin Console on Tailscale. Download the Tailscale app on any client you want to access your tailnet from. For Windows, macOS, iOS, and Android, the apps can be found on their respective OS app stores. After signing in, your device will be added to the tailnet. 1383 | 1384 | For Linux, the steps required are as follows. 1385 | 1386 | 1) Install Tailscale: 1387 | ``` 1388 | curl -fsSL https://tailscale.com/install.sh | sh 1389 | ``` 1390 | 1391 | 2) Start the service: 1392 | ``` 1393 | sudo tailscale up 1394 | ``` 1395 | 1396 | To also enable Tailscale SSH, run `sudo tailscale up --ssh` instead. 1397 | 1398 | #### Exit Nodes 1399 | 1400 | An exit node allows access to a different network while still being on your tailnet. For example, you can use this to allow a server on your network to act as a tunnel for other devices. This way, you can not only access that device (by virtue of your tailnet) but also all the devices on the host network it's on. This is useful for accessing non-Tailscale devices on a network. 1401 | 1402 | To advertise a device as an exit node, run `sudo tailscale up --advertise-exit-node`. To allow access to the local network via this device, add the `--exit-node-allow-lan-access` flag.
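Putting both halves together, the flow looks roughly like the sketch below. The exit-node IP is a placeholder - use whatever `sudo tailscale exit-node list` or the Admin Console reports for your server:

```bash
# On the server - advertise it as an exit node, then approve it in the
# Tailscale Admin Console.
sudo tailscale up --advertise-exit-node

# On a client - send traffic through the exit node; --exit-node-allow-lan-access
# keeps the local network reachable, as mentioned above.
sudo tailscale up --exit-node=<exit-node-ip> --exit-node-allow-lan-access
```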
1403 | 1404 | #### Local DNS 1405 | 1406 | If one of the devices on your tailnet runs a [DNS-sinkhole](https://en.wikipedia.org/wiki/DNS_sinkhole) service like [Pi-hole](https://pi-hole.net), you'll probably want other devices to use it as their DNS server. Assume this device is named `poplar`. This means every DNS request made by any device on your tailnet will be sent to `poplar`, which will in turn decide whether that request is answered or rejected according to your Pi-hole configuration. However, since `poplar` is also one of the devices on your tailnet, it will send DNS requests to itself in accordance with this rule and not to somewhere that will actually resolve them. Thus, we don't want such devices to accept the tailnet's DNS settings; they should follow their otherwise preconfigured rules. 1407 | 1408 | To reject the tailnet's DNS settings on such a device, run `sudo tailscale up --accept-dns=false`. 1409 | 1410 | #### Third-Party VPN Integration 1411 | 1412 | Tailscale offers a [Mullvad VPN](https://mullvad.net/en) exit node add-on with their service. This add-on allows for a traditional VPN experience that routes your requests through a proxy server in some other location, effectively masking your IP and allowing the circumvention of geolocation restrictions on web services. Assigned devices can be configured from the Admin Console. Mullvad VPN has [proven their no-log policy](https://mullvad.net/en/blog/2023/4/20/mullvad-vpn-was-subject-to-a-search-warrant-customer-data-not-compromised) and offers a fixed $5/month price no matter what duration you choose to pay for. 1413 | 1414 | To use a Mullvad exit node on one of your devices, first find the exit node you want to use by running `sudo tailscale exit-node list`. Note the IP and run `sudo tailscale up --exit-node=<exit-node-ip>`. 1415 | 1416 | > [!WARNING] 1417 | > Ensure the device is allowed to use the Mullvad add-on through the Admin Console first. 1418 | 1419 | ## Updating 1420 | 1421 | Updating your system is a good idea to keep software running optimally and with the latest security patches. Updates to Ollama allow for inference from new model architectures and updates to Open WebUI enable new features like voice calling, function calling, pipelines, and more. 1422 | 1423 | I've compiled steps to update these "primary function" installations in a standalone section because I think it'd be easier to come back to one section instead of hunting for update instructions in multiple subsections. 1424 | 1425 | ### General 1426 | 1427 | Upgrade Debian packages by running the following commands: 1428 | ``` 1429 | sudo apt update 1430 | sudo apt upgrade 1431 | ``` 1432 | 1433 | ### Nvidia Drivers & CUDA 1434 | 1435 | Follow Nvidia's guide [here](https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian) to install the latest CUDA drivers. 1436 | 1437 | > [!WARNING] 1438 | > Don't skip this step. Not installing the latest drivers after upgrading Debian packages can throw your installations out of sync, leading to broken functionality. When updating, target everything important at once. Also, rebooting after this step is a good idea to ensure that your system is operating as expected after upgrading these crucial drivers.
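After the driver and CUDA update (and a reboot), a quick check confirms the kernel module and userspace tools agree - a minimal sketch, assuming the CUDA toolkit is on your `PATH`:

```bash
# Verify the Nvidia driver and CUDA toolkit after upgrading and rebooting.
nvidia-smi       # header shows the driver version and the CUDA version it supports
nvcc --version   # CUDA toolkit version (requires CUDA's bin directory on your PATH)
```

If `nvidia-smi` errors out here, jump to the [Troubleshooting](#troubleshooting) section before updating anything else.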
1439 | 1440 | ### Ollama 1441 | 1442 | Rerun the command that installs Ollama - it acts as an updater too: 1443 | ``` 1444 | curl -fsSL https://ollama.com/install.sh | sh 1445 | ``` 1446 | 1447 | ### llama.cpp 1448 | 1449 | Enter your llama.cpp folder and run the following commands: 1450 | ``` 1451 | cd llama.cpp 1452 | git pull 1453 | # Rebuild according to your setup - uncomment `-DGGML_CUDA=ON` for CUDA support 1454 | cmake -B build # -DGGML_CUDA=ON 1455 | cmake --build build --config Release 1456 | ``` 1457 | 1458 | ### vLLM 1459 | 1460 | For a manual installation, enter your virtual environment and update via `pip`: 1461 | ``` 1462 | source vllm/.venv/bin/activate 1463 | pip install vllm --upgrade 1464 | ``` 1465 | 1466 | For a Docker installation, simply re-run your Docker command - it pulls the latest vLLM image automatically. 1467 | 1468 | ### llama-swap 1469 | 1470 | Delete the current container: 1471 | ```bash 1472 | sudo docker stop llama-swap 1473 | sudo docker rm llama-swap 1474 | ``` 1475 | 1476 | Re-run the container command from the [llama-swap section](#llama-swap). 1477 | 1478 | ### Open WebUI 1479 | 1480 | To update Open WebUI once, run the following command: 1481 | ``` 1482 | docker run --rm --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower --run-once open-webui 1483 | ``` 1484 | 1485 | To keep it updated automatically, run the following command: 1486 | ``` 1487 | docker run -d --name watchtower --volume /var/run/docker.sock:/var/run/docker.sock containrrr/watchtower open-webui 1488 | ``` 1489 | 1490 | ### mcp-proxy/MCPJungle 1491 | 1492 | Navigate to the directory and pull the latest container image: 1493 | ```bash 1494 | cd mcp-proxy # or mcpjungle 1495 | sudo docker compose down 1496 | sudo docker compose pull 1497 | sudo docker compose up -d 1498 | ``` 1499 | 1500 | ### Kokoro FastAPI 1501 | 1502 | Navigate to the directory and pull the latest container image: 1503 | ``` 1504 | cd Kokoro-FastAPI 1505 | sudo docker compose pull 1506 | sudo docker compose up -d 1507 | ``` 1508 | 1509 | ### ComfyUI 1510 | 1511 | Navigate to the directory, pull the latest changes, and update dependencies: 1512 | ``` 1513 | cd ComfyUI 1514 | git pull 1515 | source comfyui-env/bin/activate 1516 | pip install -r requirements.txt 1517 | ``` 1518 | 1519 | ## Troubleshooting 1520 | 1521 | For any service running in a container, you can check the logs by running `sudo docker logs -f <container-id>`. If you're having trouble with a service, this is a good place to start. 1522 | 1523 | ### `ssh` 1524 | - If you encounter an issue using `ssh-copy-id` to set up passwordless SSH, try running `ssh-keygen -t rsa` on the client before running `ssh-copy-id`. This generates the RSA key pair that `ssh-copy-id` needs to copy to the server. 1525 | 1526 | ### Nvidia Drivers 1527 | - Disable Secure Boot in the BIOS if you're having trouble with the Nvidia drivers not working. For me, all packages were at the latest versions and `nvidia-detect` was able to find my GPU correctly, but `nvidia-smi` kept returning the `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver` error. [Disabling Secure Boot](https://askubuntu.com/a/927470) fixed this for me. Better practice than disabling Secure Boot is to sign the Nvidia drivers yourself, but I didn't want to go through that process for a non-critical server that can afford to have Secure Boot disabled.
1528 | - If you run into `docker: Error response from daemon: unknown or invalid runtime name: nvidia.`, you probably have `--runtime nvidia` in your Docker statement. This is meant for `nvidia-docker`, [which is deprecated now](https://stackoverflow.com/questions/52865988/nvidia-docker-unknown-runtime-specified-nvidia). Removing this flag from your command should get rid of this error. 1529 | 1530 | ### Ollama 1531 | - If you receive the `could not connect to ollama app, is it running?` error, your Ollama instance isn't being served. This can happen after a manual installation or if you intend to use it at-will and not as a service. To run the Ollama server once, run: 1532 | ``` 1533 | ollama serve 1534 | ``` 1535 | Then, **in a new terminal**, you should be able to access your models regularly by running: 1536 | ``` 1537 | ollama run <model-name> 1538 | ``` 1539 | For detailed instructions on _manually_ configuring Ollama to run as a service (to run automatically at boot), read the official documentation [here](https://github.com/ollama/ollama/blob/main/docs/linux.md). You shouldn't need to do this unless your system faces restrictions using Ollama's automated installer. 1540 | 1541 | - If you receive the `Failed to open "/etc/systemd/system/ollama.service.d/.#override.confb927ee3c846beff8": Permission denied` error from Ollama after running `systemctl edit ollama.service`, simply creating the file works to eliminate it. Use the following steps to edit the file. 1542 | - Run: 1543 | ``` 1544 | sudo mkdir -p /etc/systemd/system/ollama.service.d 1545 | sudo nano /etc/systemd/system/ollama.service.d/override.conf 1546 | ``` 1547 | - Retry the remaining steps. 1548 | - If you still can't connect to your API endpoint, check your firewall settings. [This guide to UFW (Uncomplicated Firewall) on Debian](https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10) is a good resource. 1549 | 1550 | ### vLLM 1551 | - If you encounter ```RuntimeError: An error occurred while downloading using `hf_transfer`. Consider disabling HF_HUB_ENABLE_HF_TRANSFER for better error handling.```, add `HF_HUB_ENABLE_HF_TRANSFER=0` to the `--env` flag after your HuggingFace Hub token. If this still doesn't fix the issue: 1552 | - Ensure your user has all the requisite permissions for HuggingFace to be able to write to the cache. To give read+write access over the HF cache to your user (and, thus, `huggingface-cli`), run: 1553 | ``` 1554 | sudo chmod 777 ~/.cache/huggingface 1555 | sudo chmod 777 ~/.cache/huggingface/hub 1556 | ``` 1557 | - Manually download a model via the HuggingFace CLI and specify `--download-dir=~/.cache/huggingface/hub` in the engine arguments. If your `.cache/huggingface` directory is being troublesome, pass another directory to `--download-dir` in the engine arguments and remember to do the same with the `--local-dir` flag in any `huggingface-cli` commands. 1558 | 1559 | ### Open WebUI 1560 | - If you encounter `Ollama: llama runner process has terminated: signal: killed`, check your `Advanced Parameters` under `Settings > General > Advanced Parameters`. For me, bumping the context length past what certain models could handle was breaking the Ollama server. Leave it at the default (or higher, but make sure it's still under the limit for the model you're using) to fix this issue. 1561 | 1562 | ## Monitoring 1563 | 1564 | To monitor GPU usage, power draw, and temperature, you can use the `nvidia-smi` command.
To monitor GPU usage, run: 1565 | ``` 1566 | watch -n 1 nvidia-smi 1567 | ``` 1568 | This refreshes the `nvidia-smi` output every second without cluttering the terminal. Press `Ctrl+C` to exit. 1569 | 1570 | ## Notes 1571 | 1572 | This is my first foray into setting up a server, and my first time working with Linux, so there may be better ways to do some of these steps. I will update this repository as I learn more. 1573 | 1574 | ### Software 1575 | 1576 | - I chose Debian because it is, apparently, one of the most stable Linux distros. I also went with an XFCE desktop environment because it is lightweight and I wasn't yet comfortable going full command line. 1577 | - Use a regular user for auto-login; don't log in as root unless you have a specific reason to. 1578 | - To switch to root in the command line without switching users, run `sudo -i`. 1579 | - If something using a Docker container doesn't work, try running `sudo docker ps -a` to see if the container is running. If it isn't, try running `sudo docker compose up -d` again. If it's running but still not working, try running `sudo docker restart <container-name>` to restart the container. 1580 | - If something isn't working no matter what you do, try rebooting the server. It's a common solution to many problems. Try this before spending hours troubleshooting. Sigh. 1581 | - While it takes some time to get comfortable with, using an inference engine like llama.cpp or vLLM (as compared to Ollama) is really the way to go to squeeze the maximum performance out of your hardware. If you're reading this guide in the first place and haven't already thrown up your hands and used a cloud provider, it's a safe assumption that you care about the ethos of hosting all this stuff locally. Thus, get your experience as close to a cloud provider's as it can be by optimizing your server. 1582 | 1583 | ### Hardware 1584 | 1585 | - The power draw of my EVGA FTW3 Ultra RTX 3090 was 350W at stock settings. I set the power limit to 250W (see the sketch after this list) and the performance decrease was negligible for my use case, which is primarily code completion in VS Code and Q&A via chat. 1586 | - Using a power monitor, I measured the power draw of my server for multiple days - the running average is ~60W. The power can spike to 350W during prompt processing and token generation, but this only lasts for a few seconds. For the remainder of the generation time, it tended to stay at the 250W power limit and dropped back to the average power draw after the model wasn't in use for about 20 seconds. 1587 | - Ensure your power supply has enough headroom for transient spikes (particularly in multi-GPU setups) or you may face random shutdowns. Your GPU can momentarily blow past its rated power draw, and any software limit you set for it, based on the chip's actual draw. I usually aim for +50% of my setup's estimated total power draw.
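For reference, here's a minimal sketch of how I'd apply that power limit - the 250W value is specific to my RTX 3090, so adjust it for your card, and note that the limit resets on reboot unless you reapply it (e.g. from `init.bash`):

```bash
# Cap the GPU's power draw (value in watts - 250 suits my RTX 3090).
sudo nvidia-smi -pm 1     # keep the driver loaded so the setting isn't dropped when idle
sudo nvidia-smi -pl 250   # set the power limit; verify with `nvidia-smi -q -d POWER`
```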
1588 | 1589 | ## References 1590 | 1591 | Downloading Nvidia drivers: 1592 | - https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian 1593 | - https://wiki.debian.org/NvidiaGraphicsDrivers 1594 | 1595 | Downloading AMD drivers: 1596 | - https://wiki.debian.org/AtiHowTo 1597 | 1598 | Secure Boot: 1599 | - https://askubuntu.com/a/927470 1600 | 1601 | Monitoring GPU usage, power draw: 1602 | - https://unix.stackexchange.com/questions/38560/gpu-usage-monitoring-cuda/78203#78203 1603 | 1604 | Passwordless `sudo`: 1605 | - https://stackoverflow.com/questions/25215604/use-sudo-without-password-inside-a-script 1606 | - https://www.reddit.com/r/Fedora/comments/11lh9nn/set_nvidia_gpu_power_and_temp_limit_on_boot/ 1607 | - https://askubuntu.com/questions/100051/why-is-sudoers-nopasswd-option-not-working 1608 | 1609 | Auto-login: 1610 | - https://forums.debian.net/viewtopic.php?t=149849 1611 | - https://wiki.archlinux.org/title/LightDM#Enabling_autologin 1612 | 1613 | Expose Ollama to LAN: 1614 | - https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux 1615 | - https://github.com/ollama/ollama/issues/703 1616 | 1617 | Firewall: 1618 | - https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-with-ufw-on-debian-10 1619 | 1620 | Passwordless `ssh`: 1621 | - https://www.raspberrypi.com/documentation/computers/remote-access.html#configure-ssh-without-a-password 1622 | 1623 | Adding CUDA to PATH: 1624 | - https://askubuntu.com/questions/885610/nvcc-version-command-says-nvcc-is-not-installed 1625 | 1626 | Docs: 1627 | 1628 | - [Debian](https://www.debian.org/releases/buster/amd64/) 1629 | - [Docker](https://docs.docker.com/engine/install/debian/) 1630 | - [Ollama](https://github.com/ollama/ollama/blob/main/docs/api.md) 1631 | - [vLLM](https://docs.vllm.ai/en/stable/index.html) 1632 | - [Open WebUI](https://github.com/open-webui/open-webui) 1633 | - [ComfyUI](https://github.com/comfyanonymous/ComfyUI) 1634 | 1635 | ## Acknowledgements 1636 | 1637 | Cheers to all the fantastic work done by the open-source community. This guide wouldn't exist without the effort of the many contributors to the projects and guides referenced here. To stay up-to-date on the latest developments in the field of machine learning, LLMs, and other vision/speech models, check out [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/). 1638 | 1639 | > [!NOTE] 1640 | > Please star any projects you find useful and consider contributing to them if you can. Stars on this guide would also be appreciated if you found it helpful, as it helps others find it too. 1641 | --------------------------------------------------------------------------------