├── .gitignore ├── LICENSE ├── README.md ├── gppmc ├── gppmc.py ├── gppmc_config.yaml └── requirements.txt ├── gppmd ├── config.yaml ├── gppmd.py ├── gppmd_config.yaml.example ├── llamacpp_configs │ └── examples.yaml ├── requirements.txt └── templates │ └── home.html └── tools ├── build_gppmc_deb.sh ├── build_gppmd_deb.sh ├── run_instance_1.sh ├── run_instance_2.sh ├── start_over.sh └── sync_to_remote.sh /.gitignore: -------------------------------------------------------------------------------- 1 | gppmd_config.yaml 2 | gppm_config.yaml 3 | *.swp 4 | src 5 | debian 6 | .env 7 | venv 8 | *.spec 9 | build/ 10 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Roni 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # gppm 2 | ![gppm-banner](https://github.com/user-attachments/assets/af0a6d7b-818c-476f-b3e3-9217b848c5c7) 3 | 4 | 5 | gppm power process manager 6 | 7 | gppm is designed for use with llama.cpp and NVIDIA Tesla P40 GPUs. The standalone llama.cpp currently lacks functionality to reduce the power consumption of these GPUs in idle mode. Although there is a patch for llama.cpp, it switches the performance mode for all GPUs simultaneously, which can disrupt setups where multiple llama.cpp instances share one or more GPUs. Implementing a communication mechanism within llama.cpp to manage task distribution and GPU status is complex. gppm addresses this challenge externally, providing a more efficient solution. 8 | gppm allows you to define llama.cpp instances as code, enabling automatic spawning, termination, and respawning. 9 | 10 | > [!NOTE] 11 | > Both the configuration and the API will most likely continue to change for a while. When changing to a newer version, please always take a look at the current documentation. 
12 | 13 | ## Table of Contents 14 | 15 | - [How it works](#how-it-works) 16 | - [Quickstart](#quickstart) 17 | - [Installation](#installation) 18 | - [Command line interface](#command-line-interface) 19 | - [Configuration](#configuration) 20 | 21 | ## How it works 22 | 23 | gppm uses [nvidia-pstate](https://github.com/sasha0552/nvidia-pstate) under the hood, which is what makes it possible to switch the performance state of P40 GPUs in the first place. gppm must be installed on the host where the GPUs are installed and llama.cpp is running. gppm monitors llama.cpp's output to recognize tasks and the GPUs llama.cpp runs them on, and uses this information to change the performance modes of the installed P40 GPUs accordingly. It can manage any number of GPUs and llama.cpp instances. gppm switches each GPU to a low performance state as soon as none of the existing llama.cpp instances is running a task on that particular GPU, and sets it to high performance mode as soon as the next task is about to run. In doing so, gppm is able to control all GPUs independently of each other. gppm is designed as a wrapper, and as such you have all llama.cpp instances configured in one place. 24 | 25 | ## Quickstart 26 | 27 | Clone the repository and cd into it: 28 | 29 | ```shell 30 | git clone https://github.com/crashr/gppm 31 | cd gppm 32 | ``` 33 | 34 | Edit the following files to fit your needs: 35 | 36 | * gppmd/config.yaml 37 | * gppmd/llamacpp_configs/examples.yaml 38 | 39 | In a separate terminal, run nvidia-smi to monitor the llama.cpp instances we are going to run: 40 | 41 | ```shell 42 | watch -n 0.1 nvidia-smi 43 | ``` 44 | 45 | Run the gppm daemon: 46 | 47 | ```shell 48 | python3 gppmd/gppmd.py --config ./gppmd/config.yaml --llamacpp_configs_dir ./gppmd/llamacpp_configs 49 | ``` 50 | 51 | Wait for the instances to show up in the nvidia-smi terminal. 52 | gppm ships with a command line client (see details below). In another terminal, run the cli like this to list the instances you just started: 53 | 54 | ```shell 55 | python3 gppmc/gppmc.py get instances 56 | ``` 57 | 58 | 59 | ## Installation 60 | 61 | ### Build binaries and DEB packages 62 | 63 | ```shell 64 | ./tools/build_gppmd_deb.sh 65 | ./tools/build_gppmc_deb.sh 66 | ``` 67 | 68 | You should now find binaries for the daemon and the cli in the build folder: 69 | 70 | ```shell 71 | ls -1 build/gppmd-$(git describe --tags --abbrev=0)-amd64/usr/bin/gppmd 72 | ls -1 build/gppmc-$(git describe --tags --abbrev=0)-amd64/usr/bin/gppmc 73 | ``` 74 | 75 | Copy them wherever you want, or install the DEB packages (described in the next step): 76 | 77 | ```shell 78 | ls -1 build/*.deb 79 | ``` 80 | 81 | ### Install DEB packages 82 | 83 | The DEB packages are tested on the following distributions: 84 | 85 | * Ubuntu 22.04 86 | 87 | Install the DEB packages like this: 88 | 89 | ```sh 90 | sudo dpkg -i build/gppmd-$(git describe --tags --abbrev=0)-amd64.deb 91 | sudo dpkg -i build/gppmc-$(git describe --tags --abbrev=0)-amd64.deb 92 | ``` 93 | 94 | gppmd expects its config file at /etc/gppmd/config.yaml, so put your config there. It can be as minimal as this: 95 | 96 | ```yaml 97 | host: '0.0.0.0' 98 | port: 5001 99 | ``` 100 | 101 | gppmd looks for llama.cpp config files in /etc/gppmd/llamacpp_configs, so put your configs there (see below for a detailed explanation of how the configuration works). 102 | 103 | Enable and run the daemon: 104 | 105 | ```shell 106 | sudo systemctl enable --now gppmd.service 107 | ``` 108 | 
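Once the service is running, a quick way to confirm that the daemon answers is to query its HTTP API directly. This is only a sanity check, not a required step; 5002 is the daemon's built-in default port, so substitute the port from your own setup if it differs:

```shell
# List the llama.cpp instances gppmd currently manages (empty right after a fresh install)
curl http://localhost:5002/get_llamacpp_instances

# Show the llama.cpp configurations gppmd has loaded from /etc/gppmd/llamacpp_configs
curl http://localhost:5002/get_llamacpp_configs
```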
109 | ## Command line interface 110 | 111 | gppm comes with a cli client. It provides basic functionality to interact with the daemon: 112 | 113 | ```sh 114 | $ gppmc 115 | Usage: gppmc [OPTIONS] COMMAND [ARGS]... 116 | 117 | Group of commands for managing llama.cpp instances and configurations. 118 | 119 | Options: 120 | --host TEXT The host to connect to. 121 | --port INTEGER The port to connect to. 122 | --help Show this message and exit. 123 | 124 | Commands: 125 | apply Apply LlamaCpp configurations from a YAML file. 126 | disable Disable a LlamaCpp instance. 127 | enable Enable a LlamaCpp instance. 128 | get Get various resources. 129 | reload Reload LlamaCpp configurations. 130 | ``` 131 | 132 | For some usage examples, take a look at the configuration section. 133 | 134 | ## Configuration 135 | 136 | After changing llama.cpp instance configuration files, they can be reloaded with the cli: 137 | 138 | ```shell 139 | gppmc reload 140 | ``` 141 | 142 | This affects only instances whose configs were changed. All other instances remain untouched. 143 | 144 | The most basic configuration for a llama.cpp instance looks like this: 145 | 146 | ```yaml 147 | - name: Biggie_SmolLM_0.15B_Base_q8_0_01 148 | enabled: True 149 | env: 150 | CUDA_VISIBLE_DEVICES: "0" 151 | command: 152 | "/usr/local/bin/llama-server \ 153 | --host 0.0.0.0 \ 154 | -ngl 100 \ 155 | -m /models/Biggie_SmolLM_0.15B_Base_q8_0.gguf \ 156 | --port 8061 \ 157 | -sm none \ 158 | --no-mmap \ 159 | --log-format json" # Remove this for version >=1.2.0 160 | ``` 161 | 162 | To enable gppmd to perform power state switching with NVIDIA Tesla P40 GPUs, it is essential to specify CUDA_VISIBLE_DEVICES and the JSON log format. 163 | 164 | gppm allows you to configure post-launch hooks, which makes it possible to bundle complex setups. As an example, the following configuration creates a setup consisting of two llama.cpp instances running Codestral on three GPUs behind a load balancer. 
For the load balancer [Paddler](https://github.com/distantmagic/paddler) is used: 165 | 166 | ```yaml 167 | - name: "Codestral-22B-v0.1-Q8_0 (paddler balancer)" 168 | enabled: True 169 | command: 170 | "/usr/local/bin/paddler balancer \ 171 | --management-host 0.0.0.0 \ 172 | --management-port 8085 \ 173 | --management-dashboard-enable=true \ 174 | --reverseproxy-host 192.168.178.56 \ 175 | --reverseproxy-port 8081" 176 | 177 | - name: "Codestral-22B-v0.1-Q8_0 (llama.cpp 01)" 178 | enabled: True 179 | env: 180 | CUDA_VISIBLE_DEVICES: "0,1,2" 181 | command: 182 | "/usr/local/bin/llama-server \ 183 | --host 0.0.0.0 \ 184 | -ngl 100 \ 185 | -m /models/Codestral-22B-v0.1-Q8_0.gguf \ 186 | --port 8082 \ 187 | -fa \ 188 | -sm row \ 189 | -mg 0 \ 190 | --no-mmap \ 191 | --slots \ 192 | --log-format json" # Remove this for version >=1.2.0 193 | post_launch_hooks: 194 | - name: Codestral-22B-v0.1-Q8_0_(paddler_01) 195 | enabled: True 196 | command: 197 | "/usr/local/bin/paddler agent \ 198 | --name 'Codestral-22B-v0.1-Q8_0 (llama.cpp 01)' \ 199 | --external-llamacpp-host 192.168.178.56 \ 200 | --external-llamacpp-port 8082 \ 201 | --local-llamacpp-host 192.168.178.56 \ 202 | --local-llamacpp-port 8082 \ 203 | --management-host 192.168.178.56 \ 204 | --management-port 8085" 205 | 206 | - name: "Codestral-22B-v0.1-Q8_0_(llama.cpp_02)" 207 | enabled: True 208 | env: 209 | CUDA_VISIBLE_DEVICES: "0,1,2" 210 | command: 211 | "/usr/local/bin/llama-server \ 212 | --host 0.0.0.0 \ 213 | -ngl 100 \ 214 | -m /models/Codestral-22B-v0.1-Q8_0.gguf \ 215 | --port 8083 \ 216 | -fa \ 217 | -sm row \ 218 | -mg 1 \ 219 | --no-mmap \ 220 | --log-format json" # Remove this for version >=1.2.0 221 | post_launch_hooks: 222 | - name: "Codestral-22B-v0.1-Q8_0_Paddler_02" 223 | enabled: True 224 | command: 225 | "/usr/local/bin/paddler agent \ 226 | --name 'Codestral-22B-v0.1-Q8_0 (llama.cpp 02)' \ 227 | --external-llamacpp-host 192.168.178.56 \ 228 | --external-llamacpp-port 8083 \ 229 | --local-llamacpp-host 192.168.178.56 \ 230 | --local-llamacpp-port 8083 \ 231 | --management-host 192.168.178.56 \ 232 | --management-port 8085" 233 | ``` 234 | 235 | ![image](https://github.com/user-attachments/assets/777e4c96-b960-449e-8647-6f28753d3d8b) 236 | 237 | 238 | ***More to come soon*** 239 | -------------------------------------------------------------------------------- /gppmc/gppmc.py: -------------------------------------------------------------------------------- 1 | import click 2 | import requests 3 | import json 4 | import yaml 5 | from halo import Halo 6 | import click_completion 7 | 8 | 9 | click_completion.init() 10 | 11 | 12 | @click.group() 13 | @click.option("--host", default="localhost", help="The host to connect to.") 14 | @click.option("--port", default=5002, type=int, help="The port to connect to.") 15 | @click.pass_context 16 | def gppmc(ctx, host, port): 17 | """Group of commands for managing LlamaCpp instances and configurations.""" 18 | ctx.obj["BASE_URL"] = f"http://{host}:{port}" 19 | 20 | 21 | @gppmc.group("get") 22 | def get_group(): 23 | """Get various resources.""" 24 | pass 25 | 26 | 27 | @get_group.command("instances") 28 | @click.option( 29 | "--format", 30 | default="text", 31 | type=click.Choice(["json", "text"]), 32 | help="Print output in JSON or text format.", 33 | ) 34 | @click.pass_context 35 | def get_instances(ctx, format): 36 | """Get all LlamaCpp instances.""" 37 | base_url = ctx.obj["BASE_URL"] 38 | if format == "text": 39 | with Halo(text="Loading instances", spinner="dots"): 40 | 
response = requests.get(f"{base_url}/get_llamacpp_instances") 41 | else: 42 | response = requests.get(f"{base_url}/get_llamacpp_instances") 43 | 44 | if format == "json": 45 | print(response.json()) 46 | else: 47 | instances = response.json()["llamacpp_instances"] 48 | for instance in instances: 49 | print(instance) 50 | 51 | 52 | @get_group.command("configs") 53 | @click.option( 54 | "--format", 55 | default="text", 56 | type=click.Choice(["json", "text"]), 57 | help="Print output in JSON or text format.", 58 | ) 59 | @click.pass_context 60 | def get_configs(ctx, format): 61 | """Get all LlamaCpp configurations.""" 62 | base_url = ctx.obj["BASE_URL"] 63 | if format == "text": 64 | with Halo(text="Loading configurations", spinner="dots"): 65 | response = requests.get(f"{base_url}/get_llamacpp_configs") 66 | 67 | if format == "json": 68 | print(response.json()) 69 | else: 70 | # print(response.json()) 71 | configs = response.json()["llamacpp_configs"] 72 | for config in configs: 73 | print(config) 74 | 75 | 76 | @gppmc.command("apply") 77 | @click.argument("file", type=click.File("rb")) 78 | @click.option( 79 | "--format", 80 | default="text", 81 | type=click.Choice(["json", "text"]), 82 | help="Print output in JSON or text format.", 83 | ) 84 | @click.pass_context 85 | def apply_configs(ctx, file, format): 86 | """Apply LlamaCpp configurations from a YAML file.""" 87 | data = yaml.safe_load(file) 88 | base_url = ctx.obj["BASE_URL"] 89 | if format == "text": 90 | with Halo(text="Applying configurations", spinner="dots"): 91 | response = requests.post(f"{base_url}/apply_llamacpp_configs", json=data) 92 | else: 93 | pass 94 | # response = requests.post(f"{base_url}/apply_llamacpp_configs", json=data) 95 | # print(response.json()) 96 | 97 | 98 | @get_group.command("subprocesses") 99 | @click.pass_context 100 | def get_subprocesses(ctx): 101 | """Get all LlamaCpp subprocesses.""" 102 | base_url = ctx.obj["BASE_URL"] 103 | with Halo(text="Loading subprocesses", spinner="dots"): 104 | response = requests.get(f"{base_url}/get_llamacpp_subprocesses") 105 | print(json.dumps(response.json(), indent=4)) 106 | # print(response) 107 | 108 | 109 | @gppmc.command("reload") 110 | @click.pass_context 111 | def reload_configs(ctx): 112 | """Reload LlamaCpp configurations.""" 113 | base_url = ctx.obj["BASE_URL"] 114 | with Halo(text="Reloading configurations", spinner="dots"): 115 | response = requests.get(f"{base_url}/reload_llamacpp_configs") 116 | # print(json.dumps(response.json(), indent=4)) # TODO 117 | # print(response) 118 | 119 | 120 | @gppmc.command("enable") 121 | @click.argument("name") 122 | @click.pass_context 123 | def enable_instance(ctx, name): 124 | """Enable a LlamaCpp instance.""" 125 | base_url = ctx.obj["BASE_URL"] 126 | with Halo(text=f"Enabling instance {name}", spinner="dots"): 127 | response = requests.post( 128 | f"{base_url}/enable_llamacpp_instance", json={"name": name} 129 | ) 130 | # print(response.json()) 131 | # print(response) 132 | 133 | 134 | @gppmc.command("disable") 135 | @click.argument("name") 136 | @click.pass_context 137 | def disable_instance(ctx, name): 138 | """Disable a LlamaCpp instance.""" 139 | base_url = ctx.obj["BASE_URL"] 140 | with Halo(text=f"Disabling instance {name}", spinner="dots"): 141 | response = requests.post( 142 | f"{base_url}/disable_llamacpp_instance", json={"name": name} 143 | ) 144 | # print(response.json()) 145 | # print(response) 146 | 147 | 148 | if __name__ == "__main__": 149 | gppmc(obj={}) 150 | 
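Taken together, each gppmc subcommand above maps to one HTTP endpoint of gppmd. A few illustrative invocations are sketched below; host, port, file and instance names are placeholders, and `python3 gppmc/gppmc.py` can stand in for the installed `gppmc` binary:

```shell
# List the managed instances of a remote daemon as JSON
gppmc --host 192.168.178.56 --port 5002 get instances --format json

# Apply additional instance definitions from a local YAML file
gppmc apply my_llamacpp_configs.yaml

# Temporarily stop an instance by name and bring it back later
gppmc disable Biggie_SmolLM_0.15B_Base_q8_0_01
gppmc enable Biggie_SmolLM_0.15B_Base_q8_0_01
```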
-------------------------------------------------------------------------------- /gppmc/gppmc_config.yaml: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/crashr/gppm/ff920f77d64c5769f9b5778cc44b78ce3368d34f/gppmc/gppmc_config.yaml -------------------------------------------------------------------------------- /gppmc/requirements.txt: -------------------------------------------------------------------------------- 1 | blinker==1.8.2 2 | certifi==2024.6.2 3 | charset-normalizer==3.3.2 4 | click==8.1.7 5 | Flask==3.0.3 6 | idna==3.7 7 | itsdangerous==2.2.0 8 | Jinja2==3.1.4 9 | MarkupSafe==2.1.5 10 | nvidia_pstate==1.0.5 11 | PyYAML==6.0.1 12 | requests==2.32.3 13 | urllib3==2.2.2 14 | Werkzeug==3.0.3 15 | pyinstaller==6.9.0 16 | click-completion 17 | halo -------------------------------------------------------------------------------- /gppmd/config.yaml: -------------------------------------------------------------------------------- 1 | host: '127.0.0.1' 2 | port: 5002 3 | -------------------------------------------------------------------------------- /gppmd/gppmd.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import yaml 3 | import logging 4 | import time 5 | import threading 6 | import json 7 | import os 8 | from nvidia_pstate import set_pstate_low, set_pstate_high, set_pstate 9 | from flask import Flask 10 | from flask import jsonify 11 | from flask import render_template 12 | from flask import request 13 | import subprocess 14 | import tempfile 15 | import re 16 | import shlex 17 | from werkzeug.serving import make_server 18 | import select 19 | import signal 20 | import sys 21 | 22 | 23 | global llamacpp_configs_dir 24 | global configs 25 | global threads 26 | configs = [] 27 | threads = [] 28 | subprocesses = {} 29 | 30 | app = Flask(__name__, template_folder=os.path.abspath("/etc/gppmd/templates")) 31 | app.config["DEBUG"] = True 32 | 33 | parser = argparse.ArgumentParser(description="gppm power process manager") 34 | parser.add_argument( 35 | "--config", 36 | type=str, 37 | default="/etc/gppmd/config.yaml", 38 | help="Path to the configuration file", 39 | ) 40 | parser.add_argument( 41 | "--llamacpp_configs_dir", 42 | type=str, 43 | default="/etc/gppmd/llamacpp_configs", 44 | help="Path to the llama.cpp configuration file", 45 | ) 46 | parser.add_argument( 47 | "--port", 48 | type=int, 49 | default=5002, 50 | help="Port number for the API to listen on", 51 | ) 52 | args = parser.parse_args() 53 | 54 | with open(args.config, "r") as file: 55 | config = yaml.safe_load(file) 56 | 57 | # for key, value in config.items(): 58 | # parser.add_argument(f"--{key}", type=type(value), default=value, help=f"Set {key}") 59 | for key, value in config.items(): 60 | if key != "port": # Exclude "--port" from the loop 61 | parser.add_argument( 62 | f"--{key}", type=type(value), default=value, help=f"Set {key}" 63 | ) 64 | 65 | args = parser.parse_args() 66 | 67 | for key, value in vars(args).items(): 68 | config[key] = value 69 | 70 | # logging.basicConfig( 71 | # filename=config.get("log_file", "/var/log/gppmd/gppmd.log"), level=logging.INFO 72 | # ) 73 | 74 | result = subprocess.run( 75 | ["nvidia-smi", "-L"], stdout=subprocess.PIPE, stderr=subprocess.PIPE 76 | ) 77 | num_gpus = len(result.stdout.decode("utf-8").strip().split("\n")) 78 | 79 | gpu_semaphores = {} 80 | for gpu in range(num_gpus): 81 | set_pstate([gpu], int(os.getenv("NVIDIA_PSTATE_LOW", "8")), 
silent=True) 82 | gpu_semaphores[gpu] = threading.Semaphore(config.get("max_llamacpp_instances", 10)) 83 | 84 | 85 | def run_post_launch_hooks(config, subprocesses): 86 | if "post_launch_hooks" in config: 87 | for post_launch_hook in config["post_launch_hooks"]: 88 | if post_launch_hook["enabled"]: 89 | with open("/dev/null", "w") as devnull: 90 | new_subprocess = subprocess.Popen( 91 | shlex.split(post_launch_hook["command"]), 92 | shell=False, 93 | stdout=devnull, 94 | stderr=devnull, 95 | ) 96 | subprocesses.append(new_subprocess) 97 | run_post_launch_hooks(post_launch_hook, subprocesses) 98 | 99 | 100 | def list_thread_names(): 101 | thread_names = "\n".join([thread._args[0]["name"] for thread in threads]) 102 | return thread_names 103 | 104 | 105 | def process_line(data, config): # Need config for hooks 106 | 107 | gpus = [int(x) for x in data["gppm"]["gppm_cvd"].split(",")] 108 | 109 | pid = data["gppm"][ 110 | "llamacpp_pid" 111 | ] # TODO This needs to be changed to work with ollama 112 | 113 | tid = data["tid"] 114 | 115 | if "processing task" in data["msg"]: 116 | # logging.info(f"Task {tid} started") 117 | for gpu in gpus: 118 | gpu_semaphores[gpu].acquire(blocking=True) 119 | # logging.info(f"Aquired semaphore for GPU {gpu}") 120 | for gpu in gpus: 121 | # logging.info(f"Setting GPU {gpu} into high performance mode") 122 | set_pstate([gpu], int(os.getenv("NVIDIA_PSTATE_HIGH", "16")), silent=True) 123 | elif "stop processing: " in data["msg"]: 124 | # logging.info(f"Task {tid} terminated") 125 | for gpu in gpus: 126 | gpu_semaphores[gpu].release() 127 | # logging.info(f"Released semaphore for GPU {gpu}") 128 | if gpu_semaphores[gpu]._value is config.get("max_llamacpp_instances", 10): 129 | # logging.info(f"Setting GPU {gpu} into low performance mode") 130 | set_pstate([gpu], int(os.getenv("NVIDIA_PSTATE_LOW", "8")), silent=True) 131 | 132 | # for gpu, semaphore in gpu_semaphores.items(): 133 | # logging.info(f"Semaphore value for GPU {gpu}: {semaphore._value}") 134 | 135 | 136 | def launch_llamacpp(llamacpp_config, stop_event): 137 | """ 138 | tmp_dir = tempfile.TemporaryDirectory(dir="/tmp") 139 | os.makedirs(tmp_dir.name, exist_ok=True) 140 | pipe = os.path.join(tmp_dir.name, "pipe") 141 | os.mkfifo(pipe) 142 | """ 143 | 144 | env = os.environ.copy() 145 | # env["CUDA_VISIBLE_DEVICES"] = llamacpp_config[ 146 | # "cuda_visible_devices" 147 | # ] # TODO remove this 148 | if "env" in llamacpp_config: 149 | for key, value in llamacpp_config["env"].items(): 150 | # print(f"ENV: {key}:{value}") 151 | env[key] = value 152 | 153 | llamacpp_cmd = shlex.split(llamacpp_config["command"]) 154 | 155 | llamacpp_process = subprocess.Popen( 156 | llamacpp_cmd, 157 | env=env, 158 | stdout=subprocess.PIPE, 159 | stderr=subprocess.PIPE, 160 | shell=False, 161 | bufsize=1, 162 | universal_newlines=True, 163 | ) 164 | 165 | if llamacpp_process.pid not in subprocesses: 166 | subprocesses[llamacpp_process.pid] = [] 167 | 168 | run_post_launch_hooks(llamacpp_config, subprocesses[llamacpp_process.pid]) 169 | 170 | pattern = re.compile(r"slot .* \| .* \| .+") 171 | 172 | while not stop_event.is_set(): 173 | # Wait for data to be available for reading 174 | ready_to_read, _, _ = select.select([llamacpp_process.stderr], [], [], 0.1) 175 | if ready_to_read: 176 | # New data available, read it 177 | line = llamacpp_process.stderr.readline() 178 | # if line: 179 | # print(f"DEBUG line: {line}", end="") 180 | # pass 181 | if pattern.search(line): 182 | # FIXME 183 | line = line.strip() 184 | parts = 
line.split(" | ") 185 | id_slot = 0 # TODO 186 | id_task = 0 # TODO 187 | 188 | data = { 189 | "tid": "", 190 | "timestamp": "", 191 | "msg": parts[2], 192 | "id_slot": id_slot, 193 | "id_task": id_task, 194 | "gppm": { 195 | "llamacpp_pid": llamacpp_process.pid, 196 | "gppm_cvd": env["CUDA_VISIBLE_DEVICES"], 197 | }, 198 | } 199 | 200 | process_line(data, llamacpp_config) 201 | else: 202 | # No new data available, check if the subprocess has terminated 203 | if llamacpp_process.poll() is not None: 204 | break 205 | 206 | # Check if subprocesses are still running 207 | for llamacpp_subprocess in subprocesses[llamacpp_process.pid]: 208 | if llamacpp_subprocess.poll() is None: 209 | llamacpp_subprocess.terminate() 210 | while llamacpp_subprocess.poll() is None: 211 | pass 212 | 213 | llamacpp_process.terminate() 214 | llamacpp_process.wait() 215 | 216 | 217 | # WIP with very low prio 218 | def launch_ollama(ollama_config, stop_event): 219 | tmp_dir = tempfile.TemporaryDirectory(dir="/tmp") 220 | os.makedirs(tmp_dir.name, exist_ok=True) 221 | pipe = os.path.join(tmp_dir.name, "pipe") 222 | os.mkfifo(pipe) 223 | 224 | ollama_options = [] 225 | 226 | for option in ollama_config["options"]: 227 | if isinstance(option, dict): 228 | pass 229 | else: 230 | ollama_options.append(str(option)) 231 | 232 | ollama_cmd = shlex.split(ollama_config["command"] + " " + " ".join(ollama_options)) 233 | 234 | env = os.environ.copy() 235 | env["CUDA_VISIBLE_DEVICES"] = ollama_config["cuda_visible_devices"] 236 | env["OLLAMA_HOST"] = ollama_config["ollama_host"] 237 | env["OLLAMA_DEBUG"] = "1" 238 | 239 | # for env_var in ollama_config["env_vars"]: 240 | # for k, v in env_var.items(): 241 | # env[k] = v 242 | 243 | ollama_process = subprocess.Popen( 244 | ollama_cmd, 245 | env=env, 246 | stdout=subprocess.PIPE, 247 | stderr=subprocess.PIPE, 248 | shell=False, 249 | bufsize=1, 250 | universal_newlines=True, 251 | ) 252 | 253 | data = {"model": ollama_config["model"], "keep_alive": -1} 254 | response = requests.post( 255 | f"http://" + ollama_config["ollama_host"] + "/api/generate", data 256 | ) 257 | 258 | pattern = re.compile(r"slot is processing task|slot released") 259 | 260 | while not stop_event.is_set(): 261 | # Wait for data to be available for reading 262 | ready_to_read, _, _ = select.select([ollama_process.stdout], [], [], 0.1) 263 | if ready_to_read: 264 | # New data available, read it 265 | line = ollama_process.stdout.readline() 266 | # print(line) # FIXME 267 | if pattern.search(line): 268 | try: 269 | data = json.loads(line) 270 | data["gppm"] = { 271 | "ollama_pid": ollama_process.pid, 272 | "gppm_cvd": env["CUDA_VISIBLE_DEVICES"], 273 | } 274 | except: 275 | data = {} 276 | data["gppm"] = { # FIXME 277 | "llamacpp_pid": ollama_process.pid, 278 | "ollama_pid": ollama_process.pid, 279 | "gppm_cvd": env["CUDA_VISIBLE_DEVICES"], 280 | } 281 | data["tid"] = 0 282 | data["msg"] = line 283 | process_line(data) 284 | else: 285 | # No new data available, check if the subprocess has terminated 286 | if ollama_process.poll() is not None: 287 | break 288 | 289 | ollama_process.terminate() 290 | ollama_process.wait() 291 | 292 | 293 | llamacpp_configs_dir = config.get("llamacpp_configs_dir", "/etc/gppmd/llamacpp_configs") 294 | 295 | 296 | def load_llamacpp_configs(llamacpp_configs_dir=llamacpp_configs_dir): 297 | new_configs = [] 298 | for filename in os.listdir(llamacpp_configs_dir): 299 | if filename.endswith(".yaml"): 300 | with open(os.path.join(llamacpp_configs_dir, filename), "r") as f: 301 | configs = 
yaml.safe_load(f) 302 | for config in configs: 303 | new_configs.append(config) 304 | return new_configs 305 | 306 | 307 | def purge_thread(thread): 308 | thread._args[1].set() # Signal to stop 309 | thread.join() # Wait for the thread to finish 310 | threads.remove(thread) # Remove the thread from the list 311 | 312 | 313 | def sync_threads_with_configs(threads, configs, launch_llamacpp): 314 | existing_config_names = [thread._args[0]["name"] for thread in threads] 315 | 316 | # new_config_names = [config['name'] for config in configs] 317 | new_config_names = [] 318 | for config in configs: 319 | new_config_names.append(config["name"]) 320 | 321 | # Remove threads that are not in the configs 322 | for thread in threads[:]: 323 | if thread._args[0]["name"] not in new_config_names: 324 | purge_thread(thread) 325 | # threads.remove(thread) 326 | 327 | # Add or update threads based on the configs 328 | for config in configs: 329 | if config["name"] in existing_config_names: # FIXME is that needed? 330 | # Update existing thread 331 | for thread in threads: 332 | if thread._args[0]["name"] == config["name"]: 333 | if ( 334 | thread._args[0] != config or config["enabled"] == False 335 | ): # FIXME does order of options in config file matter? 336 | # print("Thread config has changed. Purging thread.") 337 | purge_thread(thread) 338 | # print("DEBUG: Purged.") 339 | if config["enabled"] == True: 340 | stop_event = threading.Event() 341 | new_thread = threading.Thread( 342 | target=launch_llamacpp, args=(config, stop_event) 343 | ) 344 | new_thread.start() 345 | threads.append(new_thread) 346 | else: 347 | if config["enabled"] == True: 348 | # Add new thread 349 | stop_event = threading.Event() # FIXME 350 | new_thread = threading.Thread() # FIXME 351 | 352 | if config.get("type", "llamacpp") == "llamacpp": 353 | new_thread = threading.Thread( 354 | target=launch_llamacpp, args=(config, stop_event) 355 | ) 356 | elif config.get("type") == "ollama": 357 | new_thread = threading.Thread( 358 | target=launch_ollama, args=(config, stop_event) 359 | ) 360 | else: 361 | # Handle unknown type or default to llamacpp 362 | new_thread = threading.Thread( 363 | target=launch_llamacpp, args=(config, stop_event) 364 | ) 365 | 366 | new_thread.start() 367 | threads.append(new_thread) 368 | 369 | return threads 370 | 371 | 372 | @app.route("/apply_llamacpp_configs", methods=["POST"]) 373 | def api_apply_llamacpp_configs(): 374 | global configs 375 | global threads 376 | configs = load_llamacpp_configs() 377 | configs_to_apply = request.json 378 | for config in configs_to_apply: 379 | configs.append(config) 380 | threads = sync_threads_with_configs(threads, configs, launch_llamacpp) 381 | return {"status": "OK"} 382 | 383 | 384 | @app.route("/reload_llamacpp_configs", methods=["GET"]) 385 | def api_reload_llamacpp_configs(): 386 | global configs 387 | global threads 388 | configs = load_llamacpp_configs() 389 | threads = sync_threads_with_configs(threads, configs, launch_llamacpp) 390 | return {"status": "OK"} 391 | 392 | 393 | @app.route("/get_llamacpp_instances", methods=["GET"]) 394 | def api_get_llamacpp_instances(): 395 | instances = {"llamacpp_instances": []} 396 | for thread in threads: 397 | thread_name = thread._args[0]["name"] 398 | instances["llamacpp_instances"].append(thread_name) 399 | return jsonify(instances) 400 | 401 | 402 | @app.route("/get_llamacpp_subprocesses", methods=["GET"]) 403 | def api_get_llamacpp_subprocesses(): 404 | serialized_subprocesses = { 405 | pid: [ 406 | {"pid": p.pid, 
"args": p.args, "returncode": p.returncode} 407 | for p in processes 408 | ] 409 | for pid, processes in subprocesses.items() 410 | } 411 | print(serialized_subprocesses) 412 | return jsonify(serialized_subprocesses) 413 | 414 | 415 | @app.route("/get_llamacpp_configs", methods=["GET"]) 416 | def api_get_llamacpp_configs(): 417 | return jsonify(configs) 418 | 419 | 420 | @app.route("/gui", methods=["GET"]) 421 | def gui(): 422 | return render_template("home.html") 423 | 424 | 425 | @app.route("/get_instances", methods=["GET"]) 426 | def api_get_instances(): 427 | instances = {"instances": []} 428 | for thread in threads: 429 | thread_data = { 430 | "name": thread._args[0]["name"], 431 | "pid": thread._args[0]["pid"], 432 | "cvd": thread._args[0]["cuda_visible_devices"], 433 | # add more data if needed 434 | } 435 | instances["instances"].append(thread_data) 436 | return jsonify(instances) 437 | 438 | 439 | @app.route("/enable_llamacpp_instance", methods=["POST"]) 440 | def api_enable_llamacpp_instance(): 441 | global configs 442 | global threads 443 | name = request.json.get("name") 444 | 445 | for config in configs: 446 | print(config) 447 | if config["name"] == name: 448 | config["enabled"] = True 449 | threads = sync_threads_with_configs(threads, configs, launch_llamacpp) 450 | return {"status": "OK", "message": f"Instance {name} enabled"} 451 | return {"status": "ERROR", "message": f"Instance {name} not found"} 452 | 453 | 454 | @app.route("/disable_llamacpp_instance", methods=["POST"]) 455 | def api_disable_llamacpp_instance(): 456 | global configs 457 | global threads 458 | name = request.json.get("name") 459 | 460 | for config in configs: 461 | print(config) 462 | if config["name"] == name: 463 | config["enabled"] = False 464 | threads = sync_threads_with_configs(threads, configs, launch_llamacpp) 465 | return {"status": "OK", "message": f"Instance {name} disabled"} 466 | return {"status": "ERROR", "message": f"Instance {name} not found"} 467 | 468 | 469 | server = make_server("0.0.0.0", args.port, app) 470 | server_thread = threading.Thread(target=server.serve_forever) 471 | server_thread.start() 472 | 473 | 474 | if __name__ == "__main__": 475 | 476 | logging.info(f"Reading llama.cpp configs") 477 | 478 | configs = load_llamacpp_configs() 479 | threads = sync_threads_with_configs(threads, configs, launch_llamacpp) 480 | 481 | logging.info(f"All llama.cpp instances launched") 482 | 483 | for thread in threads: 484 | thread.join() 485 | -------------------------------------------------------------------------------- /gppmd/gppmd_config.yaml.example: -------------------------------------------------------------------------------- 1 | log_file: '/var/log/gppm/gppmd.log' 2 | sleep_time: 0.1 3 | timeout_time: 0.0 4 | log_file_to_monitor: '/var/log/llama.cpp/llama-server.log' 5 | host: '0.0.0.0' 6 | port: 5000 7 | max_llamacpp_instances: 10 -------------------------------------------------------------------------------- /gppmd/llamacpp_configs/examples.yaml: -------------------------------------------------------------------------------- 1 | # To try the following examples edit it's configuration and set enabled to True. 2 | # Multiple can be activated at the same time as long as their configurations do not interfere with each other. 
3 | 4 | - name: Biggie_SmolLM_0.15B_Base_q8_0_01 5 | enabled: False 6 | env: 7 | CUDA_VISIBLE_DEVICES: "0" 8 | command: 9 | "/usr/local/bin/llama-server \ 10 | --host 0.0.0.0 \ 11 | -ngl 100 \ 12 | -m /models/Biggie_SmolLM_0.15B_Base_q8_0.gguf \ 13 | --port 8061 \ 14 | -sm none \ 15 | --no-mmap \ 16 | --log-format json" 17 | 18 | - name: Biggie_SmolLM_0.15B_Base_q8_0_02 19 | enabled: False 20 | env: 21 | CUDA_VISIBLE_DEVICES: "1" 22 | command: 23 | "/usr/local/bin/llama-server \ 24 | --host 0.0.0.0 \ 25 | -ngl 100 \ 26 | -m /models/Biggie_SmolLM_0.15B_Base_q8_0.gguf \ 27 | --port 8062 \ 28 | -sm none \ 29 | --no-mmap \ 30 | --log-format json" 31 | -------------------------------------------------------------------------------- /gppmd/requirements.txt: -------------------------------------------------------------------------------- 1 | blinker==1.8.2 2 | certifi==2024.6.2 3 | charset-normalizer==3.3.2 4 | click==8.1.7 5 | Flask==3.0.3 6 | idna==3.7 7 | itsdangerous==2.2.0 8 | Jinja2==3.1.4 9 | MarkupSafe==2.1.5 10 | nvidia_pstate==1.0.5 11 | PyYAML==6.0.1 12 | requests==2.32.3 13 | urllib3==2.2.2 14 | Werkzeug==3.0.3 15 | pyinstaller==6.9.0 16 | -------------------------------------------------------------------------------- /gppmd/templates/home.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Hello, World! 5 | 26 | 27 | 28 | -------------------------------------------------------------------------------- /tools/build_gppmc_deb.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Ensure necessary tools are installed 4 | for tool in python3 pip dpkg-deb 5 | do 6 | if ! command -v $tool &> /dev/null 7 | then 8 | echo "$tool could not be found" 9 | exit 10 | fi 11 | done 12 | 13 | PACKAGE_NAME="gppmc" 14 | VERSION="$(git describe --tags --abbrev=0)" 15 | MAINTAINER="Roni" 16 | DESCRIPTION="gppm power process manager command line tool" 17 | ARCHITECTURE="amd64" 18 | 19 | DEST_DIR="build/$PACKAGE_NAME-$VERSION-$ARCHITECTURE" 20 | 21 | # Clean build dir 22 | sudo rm -rf $DEST_DIR 23 | 24 | # Ensure the Python project file exists 25 | if [ ! -f "$PACKAGE_NAME/$PACKAGE_NAME.py" ] 26 | then 27 | echo "$PACKAGE_NAME/$PACKAGE_NAME.py could not be found" 28 | exit 29 | fi 30 | 31 | # Ensure the configuration file exists 32 | if [ ! -f "$PACKAGE_NAME/${PACKAGE_NAME}_config.yaml" ] 33 | then 34 | echo "$PACKAGE_NAME/${PACKAGE_NAME}_config.yaml could not be found" 35 | exit 36 | fi 37 | 38 | ## Ensure the requirements.txt file exists 39 | #if [ ! -f "$PACKAGE_NAME/requirements.txt" ] 40 | #then 41 | # echo "$PACKAGE_NAME/requirements.txt could not be found" 42 | # exit 43 | #fi 44 | 45 | # Ensure PyInstaller is installed 46 | #if ! 
pip show PyInstaller &> /dev/null 47 | #then 48 | # echo "PyInstaller is not installed" 49 | # exit 50 | #fi 51 | 52 | # Create the directory structure 53 | mkdir -p $DEST_DIR/DEBIAN 54 | mkdir -p $DEST_DIR/etc/bash_completion.d 55 | mkdir -p $DEST_DIR/usr/bin 56 | mkdir -p $DEST_DIR/usr/share/bash-completion/completions 57 | mkdir -p $DEST_DIR/usr/share/$PACKAGE_NAME 58 | 59 | # Create the control file 60 | cat > $DEST_DIR/DEBIAN/control < $DEST_DIR/usr/share/bash-completion/completions/$PACKAGE_NAME 81 | 82 | # Deactivate the virtual environment 83 | deactivate 84 | 85 | # Create a basic configuration file 86 | cp $PACKAGE_NAME/${PACKAGE_NAME}_config.yaml $DEST_DIR/usr/share/$PACKAGE_NAME/$PACKAGE_NAME.yaml 87 | 88 | sudo chown -R root:root build/* 89 | 90 | # Build the package 91 | sudo dpkg-deb --build $DEST_DIR 92 | 93 | # Ensure the package is built successfully 94 | if [ ! -f "$DEST_DIR.deb" ] 95 | then 96 | echo "Failed to build $PACKAGE_NAME.deb" 97 | exit 98 | fi 99 | 100 | # Cleanup 101 | sudo rm -rf ./build/$PACKAGE_NAME 102 | rm -rf ./$PACKAGE_NAME/venv 103 | rm -rf ./*.spec 104 | -------------------------------------------------------------------------------- /tools/build_gppmd_deb.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Ensure necessary tools are installed 4 | for tool in python3 pip dpkg-deb 5 | do 6 | if ! command -v $tool &> /dev/null 7 | then 8 | echo "$tool could not be found" 9 | exit 10 | fi 11 | done 12 | 13 | PACKAGE_NAME="gppmd" 14 | VERSION="$(git describe --tags --abbrev=0)" 15 | MAINTAINER="Roni" 16 | DESCRIPTION="gppm power process manager daemon" 17 | ARCHITECTURE="amd64" 18 | 19 | DEST_DIR="build/$PACKAGE_NAME-$VERSION-$ARCHITECTURE" 20 | 21 | # Clean build dir 22 | sudo rm -rf $DEST_DIR 23 | 24 | # Ensure the Python project file exists 25 | if [ ! -f "$PACKAGE_NAME/$PACKAGE_NAME.py" ] 26 | then 27 | echo "$PACKAGE_NAME/$PACKAGE_NAME.py could not be found" 28 | exit 29 | fi 30 | 31 | ## Ensure the configuration file exists 32 | #if [ ! -f "$PACKAGE_NAME/${PACKAGE_NAME}_config.yaml" ] 33 | #then 34 | # echo "$PACKAGE_NAME/${PACKAGE_NAME}_config.yaml could not be found" 35 | # exit 36 | #fi 37 | 38 | # Ensure the requirements.txt file exists 39 | if [ ! -f "$PACKAGE_NAME/requirements.txt" ] 40 | then 41 | echo "$PACKAGE_NAME/requirements.txt could not be found" 42 | exit 43 | fi 44 | 45 | ## Ensure PyInstaller is installed 46 | #if ! pip show PyInstaller &> /dev/null 47 | #then 48 | # echo "PyInstaller is not installed" 49 | # exit 50 | #fi 51 | 52 | # Create the directory structure 53 | mkdir -p $DEST_DIR/DEBIAN 54 | mkdir -p $DEST_DIR/usr/bin 55 | mkdir -p $DEST_DIR/etc/$PACKAGE_NAME 56 | mkdir -p $DEST_DIR/etc/$PACKAGE_NAME/llamacpp_configs 57 | mkdir -p $DEST_DIR/etc/$PACKAGE_NAME/templates 58 | mkdir -p $DEST_DIR/var/log/$PACKAGE_NAME 59 | mkdir -p $DEST_DIR/lib/systemd/system 60 | 61 | # Create the control file 62 | cat > $DEST_DIR/DEBIAN/control < $DEST_DIR/DEBIAN/postinst < $DEST_DIR/DEBIAN/postrm < $DEST_DIR/lib/systemd/system/$PACKAGE_NAME.service < llama_server_output_1 & 6 | llamacpp_pid=$! 
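# The pipeline below tags each JSON log line emitted by llama-server with the server's PID and its
# CUDA_VISIBLE_DEVICES value (as .gppm.llamacpp_pid / .gppm.gppm_cvd), keeps only the
# "slot is processing task" / "slot released" events, and appends them to ~/llama-server.log.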
7 | jq --unbuffered -c --arg pid "$llamacpp_pid" --arg cvd "$CUDA_VISIBLE_DEVICES" ".gppm = {\"llamacpp_pid\":\$pid,\"gppm_cvd\":\$cvd}" < llama_server_output_1 \ 8 | | egrep "slot is processing task|slot released" --line-buffered \ 9 | | tee -a ~/llama-server.log 10 | rm llama_server_output_1 11 | 12 | -------------------------------------------------------------------------------- /tools/run_instance_2.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | mkfifo llama_server_output_2 4 | CUDA_VISIBLE_DEVICES=1,2 5 | CUDA_VISIBLE_DEVICES=1,2 llama-server --host 0.0.0.0 -ngl 100 -m ~/models/Replete-Coder-Llama3-8B-Q4_K_M.gguf --port 8082 -fa -sm row -mg 0 --no-mmap --log-format json > llama_server_output_2 & 6 | llamacpp_pid=$! 7 | jq --unbuffered -c --arg pid "$llamacpp_pid" --arg cvd "$CUDA_VISIBLE_DEVICES" ".gppm = {\"llamacpp_pid\":\$pid,\"gppm_cvd\":\$cvd}" < llama_server_output_2 \ 8 | | egrep "slot is processing task|slot released" --line-buffered \ 9 | | tee -a ~/llama-server.log 10 | rm llama_server_output_2 11 | 12 | -------------------------------------------------------------------------------- /tools/start_over.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | SRC_DIR="./" 4 | 5 | COMMAND="python3 gppmd.py --config gppmd_config.yaml" 6 | 7 | start_command() { 8 | echo "Starting command..." 9 | $COMMAND & 10 | COMMAND_PID=$! 11 | echo "Command started with PID $COMMAND_PID" 12 | } 13 | 14 | stop_command() { 15 | if [ -n "$COMMAND_PID" ]; then 16 | echo "Stopping command with PID $COMMAND_PID..." 17 | kill $COMMAND_PID 18 | wait $COMMAND_PID 2>/dev/null 19 | echo "Command stopped." 20 | fi 21 | } 22 | 23 | trap stop_command EXIT 24 | 25 | start_command 26 | 27 | inotifywait --exclude ".*\.log|\..*" -m -r -e modify,create,delete,move "$SRC_DIR" | 28 | while read -r directory events filename; do 29 | echo "Change detected in $directory$filename: $events" 30 | stop_command 31 | start_command 32 | done 33 | -------------------------------------------------------------------------------- /tools/sync_to_remote.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Description: 4 | # This script synchronizes a source directory to a remote destination using rsync. 5 | # It uses inotifywait to monitor the source directory for changes and then synchronizes the changes to the destination. 6 | # The source and destination directories are specified as command line arguments. 7 | 8 | # Check if source and destination are provided 9 | if [ $# -ne 2 ]; then 10 | echo "Usage: $0 " 11 | exit 1 12 | fi 13 | 14 | SRC="$1" 15 | DEST="$2" 16 | #LOGFILE="/var/log/rsync.log" 17 | 18 | inotifywait -m -r -e modify,create,delete,move "$SRC" --format '%w%f' | 19 | while read file; do 20 | #rsync -avz --include='build/*.deb' --exclude='build/*' "$SRC" "$DEST" #>> "$LOGFILE" 2>&1 21 | rsync -avz --include='build/*' "$SRC" "$DEST" #>> "$LOGFILE" 2>&1 22 | done 23 | --------------------------------------------------------------------------------