├── .github ├── FUNDING.yml └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SECURITY.md ├── autoscale.py ├── config.yaml ├── host_resource_checker.py ├── install.sh ├── logging_config.json ├── ssh_utils.py ├── vm_autoscale.service └── vm_manager.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [fabriziosalmi] 4 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 
11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | fabrizio.salmi@gmail.com. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. 
No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 129 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | PR are welcome here! 2 | 3 | Fix issues, don't do like me re-introducing fixed issues, propose and provide full usable magic SOTA solid PRs and get free beers! 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 fab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 VM Autoscale 2 | 3 | ## 🌟 Overview 4 | **Proxmox VM Autoscale** is a dynamic scaling service that automatically adjusts virtual machine (VM) resources (CPU cores and RAM) on your Proxmox Virtual Environment (VE) based on real-time metrics and user-defined thresholds. This solution helps ensure efficient resource usage, optimizing performance and resource availability dynamically. 5 | 6 | The service supports multiple Proxmox hosts via SSH connections and can be easily installed and managed as a **systemd** service for seamless automation. 7 | 8 | > [!IMPORTANT] 9 | > To enable scaling of VM resources, make sure NUMA and hotplug features are enabled: 10 | > - **Enable NUMA**: VM > Hardware > Processors > Enable NUMA ☑️ 11 | > - **Enable CPU Hotplug**: VM > Options > Hotplug > CPU ☑️ 12 | > - **Enable Memory Hotplug**: VM > Options > Hotplug > Memory ☑️ 13 | 14 | ## ✨ Features 15 | - 🔄 **Auto-scaling of VM CPU and RAM** based on real-time resource metrics. 16 | - 🛠️ **Configuration-driven** setup using an easy-to-edit YAML file. 17 | - 🌐 **Multi-host support** via SSH (compatible with both password and key-based authentication). 18 | - 📲 **Gotify Notifications** for alerting you whenever scaling actions are performed. 19 | - ⚙️ **Systemd Integration** for effortless setup, management, and monitoring as a Linux service. 20 | 21 | ## 📋 Prerequisites 22 | - 🖥️ **Proxmox VE** must be installed on the target hosts. 23 | - 🐍 **Python 3.x** should be installed on the Proxmox host(s). 24 | - 💻 Familiarity with Proxmox `qm` commands and SSH is recommended. 25 | 26 | ## 🤝 Contributing 27 | Contributions are **more** than welcome! If you encounter a bug or have suggestions for improvement, please [open an issue](https://github.com/fabriziosalmi/proxmox-vm-autoscale/issues/new/choose) or submit a pull request. 28 | 29 | ### Contributors 30 | Code improvements by: **[Specimen67](https://github.com/Specimen67)**, **[brianread108](https://github.com/brianread108)** 31 | 32 | ### Want to scale LXC containers instead of VM on Proxmox hosts? 33 | To autoscale LXC containers on Proxmox hosts, you may be interested in [this related project](https://github.com/fabriziosalmi/proxmox-lxc-autoscale). 34 | 35 | ## 🚀 Quick Start 36 | 37 | To install **Proxmox VM Autoscale**, execute the following `curl bash` command. This command will automatically clone the repository, execute the installation script, and set up the service for you: 38 | 39 | ```bash 40 | bash <(curl -s https://raw.githubusercontent.com/fabriziosalmi/proxmox-vm-autoscale/main/install.sh) 41 | ``` 42 | 43 | 🎯 **This installation script will:** 44 | - Clone the repository into `/usr/local/bin/vm_autoscale`. 45 | - Copy all necessary files to the installation directory. 46 | - Install the required Python dependencies. 47 | - Set up a **systemd unit file** to manage the autoscaling service. 48 | 49 | > [!NOTE] 50 | > The service is enabled but not started automatically at the end of the installation. To start it manually, use the following command. 
51 | 52 | ```bash 53 | systemctl start vm_autoscale.service 54 | ``` 55 | 56 | > [!IMPORTANT] 57 | > Make sure to review the official [Proxmox documentation](https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory)) for the hotplug feature requirements to enable scaling virtual machines on the fly. 58 | 59 | ## ⚡ Usage 60 | 61 | ### ▶️ Start/Stop the Service 62 | To **start** the autoscaling service: 63 | 64 | ```bash 65 | systemctl start vm_autoscale.service 66 | ``` 67 | 68 | To **stop** the service: 69 | 70 | ```bash 71 | systemctl stop vm_autoscale.service 72 | ``` 73 | 74 | ### 🔍 Check the Status 75 | To view the service status: 76 | 77 | ```bash 78 | systemctl status vm_autoscale.service 79 | ``` 80 | 81 | ### 📜 Logs 82 | Logs are saved to `/var/log/vm_autoscale.log`. You can monitor the logs in real-time using: 83 | 84 | ```bash 85 | tail -f /var/log/vm_autoscale.log 86 | ``` 87 | 88 | Or by using `journalctl`: 89 | 90 | ```bash 91 | journalctl -u vm_autoscale.service -f 92 | ``` 93 | 94 | ## ⚙️ Configuration 95 | 96 | The configuration file (`config.yaml`) is located at `/usr/local/bin/vm_autoscale/config.yaml`. This file contains settings for scaling thresholds, resource limits, Proxmox hosts, and VM information. 97 | 98 | ### Example Configuration 99 | ```yaml 100 | scaling_thresholds: 101 | cpu: 102 | high: 80 103 | low: 20 104 | ram: 105 | high: 85 106 | low: 25 107 | 108 | scaling_limits: 109 | min_cores: 1 110 | max_cores: 8 111 | min_ram_mb: 512 112 | max_ram_mb: 16384 113 | 114 | check_interval: 60 # Check every 60 seconds 115 | 116 | proxmox_hosts: 117 | - name: host1 118 | host: 192.168.1.10 119 | ssh_user: root 120 | ssh_password: your_password_here 121 | ssh_key: /path/to/ssh_key 122 | 123 | virtual_machines: 124 | - vm_id: 101 125 | proxmox_host: host1 126 | scaling_enabled: true 127 | cpu_scaling: true 128 | ram_scaling: true 129 | 130 | logging: 131 | level: INFO 132 | log_file: /var/log/vm_autoscale.log 133 | 134 | gotify: 135 | enabled: true 136 | server_url: https://gotify.example.com 137 | app_token: your_gotify_app_token_here 138 | priority: 5 139 | ``` 140 | 141 | ### ⚙️ Configuration Details 142 | - **`scaling_thresholds`**: Defines the CPU and RAM usage thresholds that trigger scaling actions (e.g., when CPU > 80%, scale up). 143 | - **`scaling_limits`**: Specifies the **minimum** and **maximum** resources (CPU cores and RAM) each VM can have. 144 | - **`proxmox_hosts`**: Contains the details of Proxmox hosts, including SSH credentials. 145 | - **`virtual_machines`**: Lists the VMs to be managed by the autoscaling script, allowing per-VM scaling customization. 146 | - **`logging`**: Specifies the logging level and log file path for activity tracking and debugging. 147 | - **`gotify`**: Configures **Gotify notifications** to send alerts when scaling actions are performed. 148 | 149 | ## 📲 Gotify Notifications 150 | Gotify is used to send real-time notifications regarding scaling actions. Configure Gotify in the `config.yaml` file: 151 | - **`enabled`**: Set to `true` to enable notifications. 152 | - **`server_url`**: URL of the Gotify server. 153 | - **`app_token`**: Authentication token for accessing Gotify. 154 | - **`priority`**: Notification priority level (1-10). 
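Before enabling notifications for the service, you can verify the `server_url` and `app_token` values by posting a test message to the same endpoint, with the same authorization header, that the autoscaler itself uses. The URL and token below are the placeholders from the example configuration; substitute your own:

```bash
curl -s -X POST "https://gotify.example.com/message" \
  -H "Authorization: Bearer your_gotify_app_token_here" \
  -d "title=VM Autoscale Alert" \
  -d "message=Test notification from vm_autoscale" \
  -d "priority=5"
```

A successful request returns the created message as JSON, and it should appear immediately in your Gotify clients.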
155 | 156 | ## 👨‍💻 Development 157 | 158 | ### 🔧 Requirements 159 | - **Python 3.x** 160 | - Required Python Packages: `paramiko`, `requests`, `PyYAML` 161 | 162 | ### 🐛 Running Manually 163 | To run the script manually for debugging or testing: 164 | 165 | ```bash 166 | python3 /usr/local/bin/vm_autoscale/autoscale.py 167 | ``` 168 | 169 | ### Other projects 170 | 171 | If you like my projects, you may also like these: 172 | 173 | - [caddy-waf](https://github.com/fabriziosalmi/caddy-waf) Caddy WAF (Regex Rules, IP and DNS filtering, Rate Limiting, GeoIP, Tor, Anomaly Detection) 174 | - [patterns](https://github.com/fabriziosalmi/patterns) Automated OWASP CRS and Bad Bot Detection for Nginx, Apache, Traefik and HAProxy 175 | - [blacklists](https://github.com/fabriziosalmi/blacklists) Hourly updated domains blacklist 🚫 176 | - [UglyFeed](https://github.com/fabriziosalmi/UglyFeed) Retrieve, aggregate, filter, evaluate, rewrite and serve RSS feeds using Large Language Models for fun, research and learning purposes 177 | - [proxmox-lxc-autoscale](https://github.com/fabriziosalmi/proxmox-lxc-autoscale) Automatically scale LXC container resources on Proxmox hosts 178 | - [DevGPT](https://github.com/fabriziosalmi/DevGPT) Code together, right now! GPT-powered code assistant to build projects in minutes 179 | - [websites-monitor](https://github.com/fabriziosalmi/websites-monitor) Website monitoring via GitHub Actions (expiration, security, performance, privacy, SEO) 180 | - [caddy-mib](https://github.com/fabriziosalmi/caddy-mib) Track and ban client IPs generating repetitive errors on Caddy 181 | - [zonecontrol](https://github.com/fabriziosalmi/zonecontrol) Cloudflare Zones Settings Automation using GitHub Actions 182 | - [lws](https://github.com/fabriziosalmi/lws) linux (containers) web services 183 | - [cf-box](https://github.com/fabriziosalmi/cf-box) cf-box is a set of Python tools to play with APIs and multiple Cloudflare accounts. 184 | - [limits](https://github.com/fabriziosalmi/limits) Automated rate limiting for web servers 185 | - [dnscontrol-actions](https://github.com/fabriziosalmi/dnscontrol-actions) Automate DNS updates and rollbacks across multiple providers using DNSControl and GitHub Actions 186 | - [proxmox-lxc-autoscale-ml](https://github.com/fabriziosalmi/proxmox-lxc-autoscale-ml) Automatically scale LXC container resources on Proxmox hosts with AI 187 | - [csv-anonymizer](https://github.com/fabriziosalmi/csv-anonymizer) CSV fuzzer/anonymizer 188 | - [iamnotacoder](https://github.com/fabriziosalmi/iamnotacoder) AI code generation and improvement 189 | 190 | 191 | ### ⚠️ Disclaimer 192 | > [!CAUTION] 193 | > The author assumes no responsibility for any damage or issues that may arise from using this tool. 194 | 195 | ### 📜 License 196 | This project is licensed under the **MIT License**. See the LICENSE file for complete details. 197 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Supported Versions 4 | 5 | Alpha version, barely tested. Testers are welcome!
6 | -------------------------------------------------------------------------------- /autoscale.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | import json 3 | import requests 4 | import smtplib 5 | import logging 6 | import logging.config 7 | import time 8 | import re 9 | import sys 10 | from ssh_utils import SSHClient 11 | from pathlib import Path 12 | from email.mime.text import MIMEText 13 | from email.mime.multipart import MIMEMultipart 14 | from vm_manager import VMResourceManager 15 | from host_resource_checker import HostResourceChecker 16 | from functools import wraps 17 | from typing import Union, List, Optional, Dict, Any 18 | 19 | class ConfigurationError(Exception): 20 | """Custom exception for configuration-related errors.""" 21 | pass 22 | 23 | class NotificationManager: 24 | def __init__(self, config: Dict[str, Any], logger: logging.Logger): 25 | self.config = config 26 | self.logger = logger 27 | self.validate_notification_config() 28 | 29 | def validate_notification_config(self) -> None: 30 | """Validate notification configuration at startup.""" 31 | notification_enabled = False 32 | 33 | if self.config.get('gotify', {}).get('enabled', False): 34 | notification_enabled = True 35 | gotify_config = self.config.get('gotify', {}) 36 | if not all([gotify_config.get('server_url'), gotify_config.get('app_token')]): 37 | raise ConfigurationError("Gotify is enabled but configuration is incomplete") 38 | 39 | if self.config.get('alerts', {}).get('email_enabled', False): 40 | notification_enabled = True 41 | alerts_config = self.config.get('alerts', {}) 42 | required_fields = ['smtp_server', 'smtp_user', 'email_recipient'] 43 | missing_fields = [field for field in required_fields if not alerts_config.get(field)] 44 | if missing_fields: 45 | raise ConfigurationError(f"Email alerts are enabled but missing configuration: {', '.join(missing_fields)}") 46 | 47 | if not notification_enabled: 48 | self.logger.warning("No notification method is enabled in configuration") 49 | 50 | def _format_message(self, message: Union[str, tuple, Any]) -> str: 51 | """Format message to ensure it's a string.""" 52 | if isinstance(message, tuple): 53 | # If it's a tuple, join non-empty parts 54 | return ' '.join(str(part) for part in message if part) 55 | elif isinstance(message, str): 56 | return message 57 | else: 58 | return str(message) 59 | 60 | def send_gotify_notification(self, message: str, priority: Optional[int] = None) -> None: 61 | """Send notification via Gotify with retry logic.""" 62 | try: 63 | gotify_config = self.config.get('gotify', {}) 64 | server_url = gotify_config['server_url'].rstrip('/') # Remove trailing slash if present 65 | app_token = gotify_config['app_token'] 66 | final_priority = priority or gotify_config.get('priority', 5) 67 | 68 | formatted_message = self._format_message(message) 69 | 70 | response = requests.post( 71 | f"{server_url}/message", 72 | data={ 73 | "title": "VM Autoscale Alert", 74 | "message": formatted_message, 75 | "priority": final_priority 76 | }, 77 | headers={"Authorization": f"Bearer {app_token}"}, 78 | timeout=10 79 | ) 80 | response.raise_for_status() 81 | self.logger.info("Gotify notification sent successfully") 82 | except requests.exceptions.RequestException as e: 83 | self.logger.error(f"Failed to send Gotify notification: {str(e)}") 84 | raise 85 | 86 | def send_smtp_notification(self, message: str) -> None: 87 | """Send notification via email with retry logic.""" 88 | try: 89 | 
alerts_config = self.config['alerts'] 90 | smtp_config = { 91 | 'host': alerts_config['smtp_server'], 92 | 'port': alerts_config.get('smtp_port', 587), 93 | 'user': alerts_config['smtp_user'], 94 | 'password': alerts_config['smtp_password'], 95 | 'recipient': alerts_config['email_recipient'] 96 | } 97 | 98 | to_emails = [smtp_config['recipient']] if isinstance(smtp_config['recipient'], str) else smtp_config['recipient'] 99 | if not all(isinstance(email, str) for email in to_emails): 100 | raise ValueError("Invalid email format in recipients") 101 | formatted_message = self._format_message(message) 102 | # Updated regex to capture the VM number 103 | pattern = r"VM\s+(\d+)" 104 | result = re.search(pattern, formatted_message) 105 | if result: 106 | vm_id = result.group(1) 107 | else: 108 | vm_id = "" 109 | msg = MIMEMultipart() 110 | msg['From'] = smtp_config['user'] 111 | msg['To'] = ", ".join(to_emails) 112 | msg['Subject'] = f"VM Autoscale Alert for VM {vm_id}" 113 | msg.attach(MIMEText(formatted_message, 'plain')) 114 | 115 | with smtplib.SMTP(smtp_config['host'], smtp_config['port']) as server: 116 | server.starttls() 117 | if smtp_config['password']: 118 | server.login(smtp_config['user'], smtp_config['password']) 119 | server.sendmail(smtp_config['user'], to_emails, msg.as_string()) 120 | 121 | self.logger.info("Email notification sent successfully") 122 | except Exception as e: 123 | self.logger.error(f"Failed to send email notification: {str(e)}") 124 | raise 125 | 126 | def send_notification(self, message: Union[str, tuple, Any], priority: Optional[int] = None) -> None: 127 | """Send notification through configured channels.""" 128 | sent = False 129 | errors = [] 130 | formatted_message = self._format_message(message) 131 | 132 | if self.config.get('gotify', {}).get('enabled', False): 133 | try: 134 | self.send_gotify_notification(formatted_message, priority) 135 | sent = True 136 | except Exception as e: 137 | error_msg = f"Failed to send Gotify notification: {str(e)}" 138 | errors.append(error_msg) 139 | self.logger.error(error_msg) 140 | 141 | if self.config.get('alerts', {}).get('email_enabled', False): 142 | try: 143 | self.send_smtp_notification(formatted_message) 144 | sent = True 145 | except Exception as e: 146 | error_msg = f"Failed to send email notification: {str(e)}" 147 | errors.append(error_msg) 148 | self.logger.error(error_msg) 149 | 150 | if not sent: 151 | error_summary = f" Errors: {'; '.join(errors)}" if errors else "" 152 | self.logger.warning( 153 | f"Failed to send notification through any channel. 
Message: {formatted_message}.{error_summary}" 154 | ) 155 | 156 | class VMAutoscaler: 157 | def __init__(self, config_path: str, logging_config_path: Optional[str] = None): 158 | self.config = self._load_config(config_path) 159 | self.logger = self._setup_logging(logging_config_path) 160 | self.notification_manager = NotificationManager(self.config, self.logger) 161 | 162 | @staticmethod 163 | def _load_config(config_path: str) -> Dict[str, Any]: 164 | """Load and validate configuration file.""" 165 | if not Path(config_path).exists(): 166 | raise FileNotFoundError(f"Configuration file not found at {config_path}") 167 | 168 | with open(config_path, 'r') as config_file: 169 | config = yaml.safe_load(config_file) 170 | 171 | # Validate essential configuration 172 | required_sections = ['scaling_thresholds', 'scaling_limits', 'proxmox_hosts', 'virtual_machines'] 173 | missing_sections = [section for section in required_sections if section not in config] 174 | if missing_sections: 175 | raise ConfigurationError(f"Missing required configuration sections: {', '.join(missing_sections)}") 176 | 177 | return config 178 | 179 | def _setup_logging(self, logging_config_path: Optional[str]) -> logging.Logger: 180 | """Setup logging configuration.""" 181 | if logging_config_path and Path(logging_config_path).exists(): 182 | with open(logging_config_path, 'r') as logging_file: 183 | logging_config = json.load(logging_file) 184 | logging.config.dictConfig(logging_config) 185 | else: 186 | logging.basicConfig( 187 | level=self.config.get('logging', {}).get('level', 'INFO'), 188 | format="%(asctime)s [%(levelname)s] %(message)s", 189 | handlers=[ 190 | logging.FileHandler(self.config.get('logging', {}).get('log_file', '/var/log/vm_autoscale.log')), 191 | logging.StreamHandler() 192 | ] 193 | ) 194 | return logging.getLogger("vm_autoscale") 195 | 196 | def process_vm(self, host: Dict[str, Any], vm: Dict[str, Any]) -> None: 197 | """Process a single VM for autoscaling.""" 198 | ssh_client = None 199 | try: 200 | ssh_client = SSHClient( 201 | host=host['host'], 202 | port=host.get('ssh_port', 22), 203 | user=host['ssh_user'], 204 | password=host.get('ssh_password'), 205 | key_path=host.get('ssh_key') 206 | ) 207 | ssh_client.connect() 208 | 209 | vm_manager = VMResourceManager(ssh_client, vm['vm_id'], self.config) 210 | 211 | # First check if VM is running 212 | if not vm_manager.is_vm_running(): 213 | self.logger.info(f"VM {vm['vm_id']} is not running. Skipping scaling.") 214 | return 215 | 216 | # Check host resources first 217 | host_checker = HostResourceChecker(ssh_client) 218 | if not host_checker.check_host_resources( 219 | self.config['host_limits']['max_host_cpu_percent'], 220 | self.config['host_limits']['max_host_ram_percent']): 221 | self.logger.warning(f"Host {host['name']} resources maxed out.
Skipping scaling.") 222 | return 223 | 224 | # Get current resource usage once to avoid multiple calls 225 | current_cpu_usage, current_ram_usage = vm_manager.get_resource_usage() 226 | self.logger.info(f"VM {vm['vm_id']} current usage - CPU: {current_cpu_usage}%, RAM: {current_ram_usage}%") 227 | 228 | # Handle CPU scaling if enabled 229 | if vm.get('cpu_scaling', False): 230 | try: 231 | self._handle_cpu_scaling(vm_manager, vm['vm_id'], current_cpu_usage) 232 | self.logger.debug(f"CPU scaling completed for VM {vm['vm_id']}") 233 | except Exception as e: 234 | self.logger.error(f"CPU scaling failed for VM {vm['vm_id']}: {str(e)}") 235 | # Continue to RAM scaling even if CPU scaling fails 236 | 237 | # Handle RAM scaling if enabled 238 | if vm.get('ram_scaling', False): 239 | try: 240 | self._handle_ram_scaling(vm_manager, vm['vm_id'], current_ram_usage) 241 | self.logger.debug(f"RAM scaling completed for VM {vm['vm_id']}") 242 | except Exception as e: 243 | self.logger.error(f"RAM scaling failed for VM {vm['vm_id']}: {str(e)}") 244 | 245 | except Exception as e: 246 | self.logger.error(f"Error processing VM {vm['vm_id']} on host {host['name']}: {e}") 247 | self.notification_manager.send_notification( 248 | f"Error processing VM {vm['vm_id']} on host {host['name']}: {e}", 249 | priority=9 250 | ) 251 | finally: 252 | if ssh_client: 253 | ssh_client.close() 254 | 255 | def _handle_cpu_scaling(self, vm_manager: VMResourceManager, vm_id: int, cpu_usage: float) -> None: 256 | """Handle CPU scaling decisions.""" 257 | thresholds = self.config['scaling_thresholds']['cpu'] 258 | if cpu_usage > thresholds['high']: 259 | if vm_manager.scale_cpu('up'): 260 | self.notification_manager.send_notification( 261 | f"Scaled up CPU for VM {vm_id} due to high usage ({cpu_usage}%).", 262 | priority=7 263 | ) 264 | elif cpu_usage < thresholds['low']: 265 | if vm_manager.scale_cpu('down'): 266 | self.notification_manager.send_notification( 267 | f"Scaled down CPU for VM {vm_id} due to low usage ({cpu_usage}%).", 268 | priority=5 269 | ) 270 | 271 | def _handle_ram_scaling(self, vm_manager: VMResourceManager, vm_id: int, ram_usage: float) -> None: 272 | """Handle RAM scaling decisions.""" 273 | thresholds = self.config['scaling_thresholds']['ram'] 274 | if ram_usage > thresholds['high']: 275 | if vm_manager.scale_ram('up'): 276 | self.notification_manager.send_notification( 277 | f"Scaled up RAM for VM {vm_id} due to high usage ({ram_usage}%).", 278 | priority=7 279 | ) 280 | elif ram_usage < thresholds['low']: 281 | if vm_manager.scale_ram('down'): 282 | self.notification_manager.send_notification( 283 | f"Scaled down RAM for VM {vm_id} due to low usage ({ram_usage}%).", 284 | priority=5 285 | ) 286 | 287 | def run(self) -> None: 288 | """Main execution loop.""" 289 | self.logger.info("Starting VM Autoscaler") 290 | while True: 291 | try: 292 | for host in self.config['proxmox_hosts']: 293 | for vm in self.config['virtual_machines']: 294 | if vm['proxmox_host'] == host['name'] and vm.get('scaling_enabled', False): 295 | self.process_vm(host, vm) 296 | 297 | check_interval = self.config.get('check_interval', 300) # Default to 5 minutes 298 | time.sleep(check_interval) 299 | 300 | except KeyboardInterrupt: 301 | self.logger.info("Shutting down VM Autoscaler") 302 | break 303 | except Exception as e: 304 | self.logger.error(f"Unexpected error in main loop: {e}") 305 | self.notification_manager.send_notification( 306 | f"Unexpected error in VM Autoscaler: {e}", 307 | priority=10 308 | ) 309 | time.sleep(60) # Wait 
before retrying 310 | 311 | def main(): 312 | """Entry point of the application.""" 313 | try: 314 | autoscaler = VMAutoscaler( 315 | config_path="/usr/local/bin/vm_autoscale/config.yaml", 316 | logging_config_path="/usr/local/bin/vm_autoscale/logging_config.json" 317 | ) 318 | autoscaler.run() 319 | except Exception as e: 320 | logging.critical(f"Failed to start VM Autoscaler: {e}") 321 | sys.exit(1) 322 | 323 | if __name__ == "__main__": 324 | main() 325 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | # Configuration file for vm_autoscale 2 | 3 | # Global thresholds for scaling VMs 4 | scaling_thresholds: 5 | cpu: 6 | high: 80 # Percentage CPU usage at which scaling up is triggered 7 | low: 20 # Percentage CPU usage at which scaling down is considered 8 | ram: 9 | high: 85 # Percentage RAM usage at which scaling up is triggered 10 | low: 25 # Percentage RAM usage at which scaling down is considered 11 | 12 | # Scaling limits for VMs 13 | scaling_limits: 14 | min_cores: 1 # Minimum number of CPU cores that a VM can have 15 | max_cores: 8 # Maximum number of CPU cores that a VM can have 16 | min_ram_mb: 1024 # Minimum RAM (in MB) that a VM can have (updated since NUMA doeasnt like less than 1024MB) 17 | max_ram_mb: 16384 # Maximum RAM (in MB) that a VM can have 18 | 19 | # Time intervals for checking VM resources and performing actions (in seconds) 20 | check_interval: 300 # How often to check VM stats and take action if needed 21 | 22 | # List of Proxmox hosts to manage 23 | proxmox_hosts: 24 | - name: host1 25 | host: 192.168.1.10 26 | ssh_user: root 27 | ssh_password: your_password_here # SSH password or key must be provided 28 | ssh_key: /path/to/ssh_key # Path to SSH private key (optional, use if no password) 29 | ssh_port: 22 # SSH Port (default 22) 30 | 31 | - name: host2 32 | host: 192.168.1.11 33 | ssh_user: root 34 | ssh_password: your_password_here 35 | ssh_key: /path/to/ssh_key 36 | 37 | # Virtual machines to be monitored and scaled 38 | virtual_machines: 39 | - vm_id: "101" 40 | proxmox_host: "host1" 41 | scaling_enabled: true 42 | cpu_scaling: true # Enable/disable CPU scaling 43 | ram_scaling: true # Enable/disable RAM scaling 44 | thresholds: 45 | cpu_high: 80 46 | cpu_low: 20 47 | ram_high: 80 48 | ram_low: 20 49 | 50 | - vm_id: 102 51 | proxmox_host: host2 52 | scaling_enabled: true 53 | cpu_scaling: true 54 | ram_scaling: true 55 | thresholds: 56 | cpu_high: 80 57 | cpu_low: 20 58 | ram_high: 80 59 | ram_low: 20 60 | 61 | # Logging configuration 62 | logging: 63 | level: INFO # Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) 64 | log_file: /var/log/vm_autoscale.log # Path to the log file 65 | 66 | # Alerts configuration (Optional) 67 | alerts: 68 | email_enabled: false 69 | email_recipient: admin@example.com 70 | smtp_server: smtp.example.com 71 | smtp_port: 587 72 | smtp_user: your_smtp_user 73 | smtp_password: your_smtp_password 74 | 75 | # Gotify notifications configuration (Optional) 76 | gotify: 77 | enabled: false 78 | server_url: https://gotify.example.com # Base URL of the Gotify server 79 | app_token: your_gotify_app_token_here # Application token for authentication 80 | priority: 5 # Notification priority level (1-10) 81 | 82 | # Safety checks for host resource limits 83 | host_limits: 84 | max_host_cpu_percent: 90 # Max CPU usage percentage for the host before scaling is restricted 85 | max_host_ram_percent: 90 # Max RAM 
usage percentage for the host before scaling is restricted 86 | -------------------------------------------------------------------------------- /host_resource_checker.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import json 3 | 4 | class HostResourceChecker: 5 | """ 6 | Class to check and monitor host resource usage via SSH. 7 | """ 8 | 9 | def __init__(self, ssh_client): 10 | """ 11 | Initialize the HostResourceChecker with an SSH client. 12 | :param ssh_client: Instance of SSH client for executing remote commands. 13 | """ 14 | self.ssh_client = ssh_client 15 | self.logger = logging.getLogger("host_resource_checker") 16 | 17 | def check_host_resources(self, max_host_cpu_percent, max_host_ram_percent): 18 | """ 19 | Check host CPU and RAM usage against specified thresholds. 20 | :param max_host_cpu_percent: Maximum allowable CPU usage percentage. 21 | :param max_host_ram_percent: Maximum allowable RAM usage percentage. 22 | :return: True if resources are within limits, False otherwise. 23 | """ 24 | try: 25 | # Command to retrieve host resource status 26 | command = "pvesh get /nodes/$(hostname)/status --output-format json" 27 | output, error, exit_status = self.ssh_client.execute_command(command) # Properly unpack the tuple 28 | 29 | # Debug logging 30 | self.logger.debug(f"Raw command output: {output}") 31 | self.logger.debug(f"Error output: {error}") 32 | self.logger.debug(f"Exit status: {exit_status}") 33 | 34 | # Check for error output 35 | if error: 36 | raise Exception(f"Command execution error: {error}") 37 | 38 | # Make sure output is a string 39 | if not isinstance(output, str): 40 | output = output.decode() if isinstance(output, bytes) else str(output) 41 | 42 | # Parse JSON response 43 | data = json.loads(output.strip()) # Add strip() to remove any whitespace 44 | 45 | # Rest of your code remains the same 46 | if 'cpu' not in data or 'memory' not in data: 47 | raise KeyError("Missing 'cpu' or 'memory' in the command output.") 48 | 49 | # Extract and calculate CPU usage 50 | host_cpu_usage = data['cpu'] * 100 # Convert to percentage 51 | 52 | # Extract memory details 53 | memory_data = data['memory'] 54 | total_mem = memory_data.get('total', 1) # Avoid division by zero 55 | used_mem = memory_data.get('used', 0) 56 | cached_mem = memory_data.get('cached', 0) 57 | free_mem = memory_data.get('free', 0) 58 | 59 | # Calculate RAM usage as a percentage 60 | available_mem = free_mem + cached_mem 61 | host_ram_usage = ((total_mem - available_mem) / total_mem) * 100 62 | 63 | # Log resource usage 64 | self.logger.info(f"Host CPU Usage: {host_cpu_usage:.2f}%, " 65 | f"Host RAM Usage: {host_ram_usage:.2f}%") 66 | 67 | # Check CPU usage threshold 68 | if host_cpu_usage > max_host_cpu_percent: 69 | self.logger.warning(f"Host CPU usage exceeds maximum allowed limit: " 70 | f"{host_cpu_usage:.2f}% > {max_host_cpu_percent}%") 71 | return False 72 | 73 | # Check RAM usage threshold 74 | if host_ram_usage > max_host_ram_percent: 75 | self.logger.warning(f"Host RAM usage exceeds maximum allowed limit: " 76 | f"{host_ram_usage:.2f}% > {max_host_ram_percent}%") 77 | return False 78 | 79 | # Resources are within limits 80 | return True 81 | 82 | except json.JSONDecodeError as json_err: 83 | self.logger.error(f"Failed to parse JSON output: {str(json_err)}") 84 | self.logger.error(f"Raw output causing JSON error: {output}") 85 | raise 86 | except KeyError as key_err: 87 | self.logger.error(f"Missing data in the response: {str(key_err)}") 
88 | raise 89 | except Exception as e: 90 | self.logger.error(f"Failed to check host resources: {str(e)}") 91 | raise -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Install script for Proxmox VM Autoscale project 3 | # Repository: https://github.com/fabriziosalmi/proxmox-vm-autoscale 4 | 5 | # Variables 6 | INSTALL_DIR="/usr/local/bin/vm_autoscale" 7 | BACKUP_DIR="/etc/vm_autoscale" # New separate backup directory 8 | REPO_URL="https://github.com/fabriziosalmi/proxmox-vm-autoscale" 9 | SERVICE_FILE="vm_autoscale.service" 10 | CONFIG_FILE="$INSTALL_DIR/config.yaml" 11 | BACKUP_FILE="$BACKUP_DIR/config.yaml.backup" # Updated backup location 12 | REQUIREMENTS_FILE="$INSTALL_DIR/requirements.txt" 13 | PYTHON_CMD="/usr/bin/python3" 14 | 15 | # Ensure the script is run as root 16 | if [ "$EUID" -ne 0 ]; then 17 | echo "ERROR: Please run this script as root." 18 | exit 1 19 | fi 20 | 21 | # Create backup directory if it doesn't exist 22 | if [ ! -d "$BACKUP_DIR" ]; then 23 | echo "Creating backup directory..." 24 | mkdir -p "$BACKUP_DIR" || { echo "ERROR: Failed to create backup directory"; exit 1; } 25 | fi 26 | 27 | # Backup existing config.yaml if it exists 28 | if [ -f "$CONFIG_FILE" ]; then 29 | echo "Backing up existing config.yaml to $BACKUP_FILE..." 30 | cp "$CONFIG_FILE" "$BACKUP_FILE" || { echo "ERROR: Failed to backup config.yaml"; exit 1; } 31 | fi 32 | 33 | # Install necessary dependencies 34 | echo "Installing necessary dependencies..." 35 | apt-get update || { echo "ERROR: Failed to update package lists"; exit 1; } 36 | apt-get install -y python3 curl bash git python3-paramiko python3-yaml python3-requests python3-cryptography \ 37 | || { echo "ERROR: Failed to install required packages"; exit 1; } 38 | 39 | # Clone the repository 40 | echo "Cloning the repository..." 41 | if [ -d "$INSTALL_DIR" ]; then 42 | echo "Removing existing installation directory..." 43 | rm -rf "$INSTALL_DIR" || { echo "ERROR: Failed to remove existing directory $INSTALL_DIR"; exit 1; } 44 | fi 45 | 46 | git clone "$REPO_URL" "$INSTALL_DIR" || { echo "ERROR: Failed to clone the repository from $REPO_URL"; exit 1; } 47 | 48 | # Restore backup if it exists 49 | if [ -f "$BACKUP_FILE" ]; then 50 | echo "Restoring config.yaml from backup..." 51 | cp "$BACKUP_FILE" "$CONFIG_FILE" || { echo "ERROR: Failed to restore config.yaml from backup"; exit 1; } 52 | fi 53 | 54 | # Install Python dependencies 55 | if [ -f "$REQUIREMENTS_FILE" ]; then 56 | echo "Installing Python dependencies..." 57 | pip3 install -r "$REQUIREMENTS_FILE" || { echo "ERROR: Failed to install Python dependencies"; exit 1; } 58 | else 59 | echo "WARNING: Requirements file not found. Skipping Python dependency installation." 60 | fi 61 | 62 | # Set permissions 63 | echo "Setting permissions for installation directory..." 64 | chmod -R 755 "$INSTALL_DIR" || { echo "ERROR: Failed to set permissions on $INSTALL_DIR"; exit 1; } 65 | chmod -R 755 "$BACKUP_DIR" || { echo "ERROR: Failed to set permissions on $BACKUP_DIR"; exit 1; } 66 | 67 | # Create the systemd service file 68 | echo "Creating the systemd service file..." 
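# Note: the heredoc below is intentionally left unquoted so that $PYTHON_CMD and $INSTALL_DIR are expanded now, at install time, and end up as absolute paths in the generated unit file.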
69 | cat <<EOF > /etc/systemd/system/$SERVICE_FILE 70 | [Unit] 71 | Description=Proxmox VM Autoscale Service 72 | After=network.target 73 | 74 | [Service] 75 | ExecStart=$PYTHON_CMD $INSTALL_DIR/autoscale.py 76 | WorkingDirectory=$INSTALL_DIR 77 | Restart=always 78 | User=root 79 | Environment=PYTHONUNBUFFERED=1 80 | 81 | [Install] 82 | WantedBy=multi-user.target 83 | EOF 84 | 85 | if [ $? -ne 0 ]; then 86 | echo "ERROR: Failed to create systemd service file at /etc/systemd/system/$SERVICE_FILE" 87 | exit 1 88 | fi 89 | 90 | # Reload systemd, enable the service, and ensure it's not started 91 | echo "Reloading systemd and enabling the service..." 92 | systemctl daemon-reload || { echo "ERROR: Failed to reload systemd"; exit 1; } 93 | systemctl enable "$SERVICE_FILE" || { echo "ERROR: Failed to enable the service"; exit 1; } 94 | 95 | # Post-installation instructions 96 | echo "Installation complete. The service is enabled but not started." 97 | echo "To start the service, use: sudo systemctl start $SERVICE_FILE" 98 | echo "Logs can be monitored using: journalctl -u $SERVICE_FILE -f" 99 | echo "Config backup location: $BACKUP_FILE" 100 | -------------------------------------------------------------------------------- /logging_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": 1, 3 | "disable_existing_loggers": false, 4 | "formatters": { 5 | "detailed": { 6 | "format": "%(asctime)s [%(levelname)s] %(name)s: %(message)s" 7 | }, 8 | "simple": { 9 | "format": "%(levelname)s: %(message)s" 10 | } 11 | }, 12 | "handlers": { 13 | "console": { 14 | "class": "logging.StreamHandler", 15 | "level": "INFO", 16 | "formatter": "simple", 17 | "stream": "ext://sys.stdout" 18 | }, 19 | "file": { 20 | "class": "logging.FileHandler", 21 | "level": "DEBUG", 22 | "formatter": "detailed", 23 | "filename": "/var/log/vm_autoscale.log", 24 | "mode": "a", 25 | "encoding": "utf8" 26 | } 27 | }, 28 | "loggers": { 29 | "": { 30 | "level": "DEBUG", 31 | "handlers": ["console", "file"] 32 | }, 33 | "ssh_utils": { 34 | "level": "INFO", 35 | "handlers": ["console", "file"], 36 | "propagate": false 37 | }, 38 | "vm_resource_manager": { 39 | "level": "INFO", 40 | "handlers": ["console", "file"], 41 | "propagate": false 42 | }, 43 | "host_resource_checker": { 44 | "level": "INFO", 45 | "handlers": ["console", "file"], 46 | "propagate": false 47 | } 48 | }, 49 | "root": { 50 | "level": "DEBUG", 51 | "handlers": ["console", "file"] 52 | } 53 | } 54 | -------------------------------------------------------------------------------- /ssh_utils.py: -------------------------------------------------------------------------------- 1 | import paramiko 2 | import logging 3 | import time 4 | from paramiko.ssh_exception import SSHException, AuthenticationException 5 | 6 | class SSHClient: 7 | def __init__(self, host, user, password=None, key_path=None, port=22): 8 | """ 9 | Initializes the SSH client with given credentials. 10 | :param host: Hostname or IP address of the server. 11 | :param user: Username to connect with. 12 | :param password: Password for SSH (optional). 13 | :param key_path: Path to the private SSH key (optional). 14 | :param port: Port for SSH connection (default: 22).
15 | """ 16 | self.host = host 17 | self.user = user 18 | self.password = password 19 | self.key_path = key_path 20 | self.port = port 21 | self.logger = logging.getLogger("ssh_utils") 22 | self.client = None 23 | # Added max retries and backoff factor for connection attempts 24 | self.max_retries = 5 25 | self.backoff_factor = 1 26 | 27 | def connect(self): 28 | """ 29 | Establish an SSH connection to the host. 30 | """ 31 | if self.client is not None and self.client.get_transport() and self.client.get_transport().is_active(): 32 | self.logger.info(f"Already connected to {self.host}. Reusing the connection.") 33 | return 34 | 35 | attempt = 0 36 | while attempt < self.max_retries: 37 | try: 38 | self.client = paramiko.SSHClient() 39 | self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 40 | 41 | # Connect using password or private key 42 | if self.password: 43 | self.client.connect( 44 | hostname=self.host, 45 | username=self.user, 46 | password=self.password, 47 | port=self.port, 48 | timeout=10 49 | ) 50 | elif self.key_path: 51 | private_key = paramiko.RSAKey.from_private_key_file(self.key_path) 52 | self.client.connect( 53 | hostname=self.host, 54 | username=self.user, 55 | pkey=private_key, 56 | port=self.port, 57 | timeout=10 58 | ) 59 | else: 60 | raise ValueError("Either password or key_path must be provided for SSH connection.") 61 | 62 | self.logger.info(f"Successfully connected to {self.host} on port {self.port}") 63 | break # successful connection: exit loop 64 | 65 | except AuthenticationException: 66 | self.logger.error(f"Authentication failed for {self.host}. Check credentials or key file.") 67 | raise 68 | except (SSHException, Exception) as e: 69 | attempt += 1 70 | if attempt >= self.max_retries: 71 | self.logger.error(f"Failed to connect to {self.host} after {attempt} attempts.") 72 | raise e 73 | sleep_time = self.backoff_factor * (2 ** (attempt - 1)) 74 | self.logger.info(f"Retrying connection to {self.host} in {sleep_time} seconds (attempt {attempt}/{self.max_retries})") 75 | time.sleep(sleep_time) 76 | 77 | def execute_command(self, command, timeout=30): 78 | """Execute a command on the remote server with retry logic.""" 79 | attempts = 0 80 | while attempts < self.max_retries: 81 | try: 82 | # ...existing code before try... 83 | stdin, stdout, stderr = self.client.exec_command(command, timeout=timeout) 84 | exit_status = stdout.channel.recv_exit_status() 85 | 86 | output = stdout.read().decode('utf-8').strip() 87 | error = stderr.read().decode('utf-8').strip() 88 | 89 | if exit_status == 0: 90 | self.logger.info(f"Command executed successfully on {self.host}: {command}") 91 | return output, error, exit_status 92 | else: 93 | self.logger.warning(f"Command execution failed on {self.host} with exit status {exit_status}") 94 | return output, error, exit_status 95 | except Exception as e: 96 | attempts += 1 97 | self.logger.error(f"Error executing command on {self.host} (attempt {attempts}): {str(e)}") 98 | self.close() 99 | try: 100 | self.connect() 101 | except Exception as connect_err: 102 | self.logger.error(f"Reconnection failed on {self.host}: {str(connect_err)}") 103 | time.sleep(self.backoff_factor * (2 ** (attempts - 1))) 104 | raise Exception(f"Failed to execute command on {self.host} after {attempts} attempts.") 105 | 106 | def close(self): 107 | """ 108 | Close the SSH connection. 
109 | """ 110 | if self.client: 111 | try: 112 | self.client.close() 113 | self.logger.info(f"SSH connection closed for {self.host}") 114 | except Exception as e: 115 | self.logger.error(f"Error while closing SSH connection to {self.host}: {str(e)}") 116 | finally: 117 | self.client = None 118 | 119 | def __enter__(self): 120 | """ 121 | Context manager entry. 122 | """ 123 | self.connect() 124 | return self 125 | 126 | def __exit__(self, exc_type, exc_value, traceback): 127 | """ 128 | Context manager exit - ensure the SSH connection is closed. 129 | """ 130 | self.close() 131 | 132 | def is_connected(self): 133 | """ 134 | Check if the SSH client is connected and transport is active. 135 | :return: True if connected, False otherwise. 136 | """ 137 | return self.client is not None and self.client.get_transport() and self.client.get_transport().is_active() 138 | -------------------------------------------------------------------------------- /vm_autoscale.service: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Proxmox VM Autoscale Service 3 | Documentation=https://github.com/fabriziosalmi/proxmox-vm-autoscale 4 | After=network.target 5 | 6 | [Service] 7 | ExecStart=/usr/bin/python3 /usr/local/bin/vm_autoscale/autoscale.py 8 | WorkingDirectory=/usr/local/bin/vm_autoscale 9 | Restart=always 10 | RestartSec=10 11 | User=root 12 | Environment=PYTHONUNBUFFERED=1 13 | 14 | [Install] 15 | WantedBy=multi-user.target 16 | -------------------------------------------------------------------------------- /vm_manager.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import re 3 | import time 4 | import threading 5 | 6 | 7 | class VMResourceManager: 8 | def __init__(self, ssh_client, vm_id, config): 9 | self.ssh_client = ssh_client 10 | self.vm_id = vm_id 11 | self.config = config 12 | self.logger = logging.getLogger("vm_resource_manager") 13 | self.last_scale_time = 0 14 | self.scale_cooldown = self.config.get("scale_cooldown", 300) # Default to 5 minutes 15 | self.scale_lock = threading.Lock() # Added lock for scaling control 16 | 17 | def _get_command_output(self, output): 18 | """Helper method to properly handle command output that might be a tuple.""" 19 | if isinstance(output, tuple): 20 | # Assuming the first element contains the stdout 21 | return str(output[0]).strip() if output and output[0] is not None else "" 22 | return str(output).strip() if output is not None else "" 23 | 24 | def is_vm_running(self, retries=3, delay=5): 25 | """Check if the VM is running with retries and improved error handling.""" 26 | for attempt in range(1, retries + 1): 27 | try: 28 | command = f"qm status {self.vm_id} --verbose" 29 | self.logger.debug(f"Executing command to check VM status: {command}") 30 | output = self.ssh_client.execute_command(command) 31 | output_str = self._get_command_output(output) 32 | self.logger.debug(f"Command output: {output_str}") 33 | 34 | if "status: running" in output_str.lower(): 35 | self.logger.info(f"VM {self.vm_id} is running.") 36 | return True 37 | elif "status:" in output_str.lower(): 38 | self.logger.info(f"VM {self.vm_id} is not running.") 39 | return False 40 | else: 41 | self.logger.warning( 42 | f"Unexpected output while checking VM status: {output_str}" 43 | ) 44 | except Exception as e: 45 | self.logger.warning( 46 | f"Attempt {attempt}/{retries} failed to check VM status: {e}. Retrying..." 
47 | ) 48 | time.sleep(delay * attempt) # Exponential backoff 49 | 50 | self.logger.error( 51 | f"Unable to determine status of VM {self.vm_id} after {retries} attempts." 52 | ) 53 | return False 54 | 55 | def get_resource_usage(self): 56 | """Retrieve CPU and RAM usage as percentages.""" 57 | try: 58 | if not self.is_vm_running(): 59 | return 0.0, 0.0 60 | #command = f"qm status {self.vm_id} --verbose" 61 | # Updated command - this might well be refinable to simpler and faster. 62 | vmid = self.vm_id 63 | command = f"pvesh get /cluster/resources | grep 'qemu/{vmid}' | awk -F '│' '{{print $6, $15, $16}}'" 64 | output = self.ssh_client.execute_command(command) 65 | # example output: " 3.17% 5.00 GiB 3.82 GiB " 66 | self.logger.info(f"VM status output: {output}") 67 | cpu_usage = self._parse_cpu_usage(output) 68 | ram_usage = self._parse_ram_usage(output) 69 | return cpu_usage, ram_usage 70 | except Exception as e: 71 | self.logger.error(f"Failed to retrieve resource usage: {e}") 72 | return 0.0, 0.0 73 | 74 | def can_scale(self): 75 | """Determine if scaling can occur using a lock to avoid race conditions.""" 76 | with self.scale_lock: 77 | current_time = time.time() 78 | if current_time - self.last_scale_time < self.scale_cooldown: 79 | return False 80 | self.last_scale_time = current_time 81 | return True 82 | 83 | def scale_cpu(self, direction): 84 | """Scale the CPU cores and vCPUs of the VM.""" 85 | if not self.can_scale(): 86 | return False 87 | 88 | try: 89 | current_cores = self._get_current_cores() 90 | max_cores = self._get_max_cores() 91 | min_cores = self._get_min_cores() 92 | current_vcpus = self._get_current_vcpus() 93 | 94 | self.last_scale_time = time.time() 95 | if direction == "up" and current_cores < max_cores: 96 | self._scale_cpu_up(current_cores, current_vcpus) 97 | return True 98 | elif direction == "down" and current_cores > min_cores: 99 | self._scale_cpu_down(current_cores, current_vcpus) 100 | return True 101 | else: 102 | self.logger.info("No CPU scaling required.") 103 | return False 104 | except Exception as e: 105 | self.logger.error(f"Failed to scale CPU: {e}") 106 | raise 107 | 108 | def scale_ram(self, direction): 109 | """Scale the RAM of the VM.""" 110 | if not self.can_scale(): 111 | return False 112 | 113 | try: 114 | current_ram = self._get_current_ram() 115 | max_ram = self._get_max_ram() 116 | min_ram = self._get_min_ram() 117 | 118 | self.last_scale_time = time.time() 119 | if direction == "up" and current_ram < max_ram: 120 | new_ram = min(current_ram + 512, max_ram) 121 | self._set_ram(new_ram) 122 | return True 123 | elif direction == "down" and current_ram > min_ram: 124 | new_ram = max(current_ram - 512, min_ram) 125 | self._set_ram(new_ram) 126 | return True 127 | else: 128 | self.logger.info("No RAM scaling required.") 129 | return False 130 | except Exception as e: 131 | self.logger.error(f"Failed to scale RAM: {e}") 132 | raise 133 | 134 | def _parse_cpu_usage(self, output): 135 | """Parse CPU usage from VM status output.""" 136 | try: 137 | output_str = self._get_command_output(output) 138 | percentage_cpu_match = re.search(r"^\s*(\d+(?:\.\d+)?)%", output_str) 139 | if percentage_cpu_match: 140 | return float(percentage_cpu_match.group(1)) 141 | self.logger.warning("CPU usage not found in output.") 142 | return 0.0 143 | except Exception as e: 144 | self.logger.error(f"Error parsing CPU usage: {e}") 145 | return 0.0 146 | 147 | def _convert_to_gib(self, value, unit): 148 | """ Converts memory units to GiB. 
""" 149 | unit = unit.lower() 150 | if unit == 'gib': 151 | return value 152 | elif unit == 'mib': 153 | return value / 1024 # Convert MiB to GiB 154 | else: 155 | self.logger.warning(f"Unknown memory unit '{unit}'. Assuming GiB.") 156 | return value # Assume GiB if unit is unknown 157 | 158 | def _parse_ram_usage(self, output): 159 | """ Parses RAM usage from VM status output. """ 160 | try: 161 | output_str = self._get_command_output(output) 162 | self.logger.debug(f"Processing output: '{output_str}'") 163 | # ---------------------------- 164 | # Extract Memory Values 165 | # ---------------------------- 166 | # Pattern Explanation: 167 | # - (\d+(?:\.\d+)?)\s+(GiB|MiB) : Capture first memory value and its unit 168 | # - \s+ : Match one or more whitespace characters 169 | # - (\d+(?:\.\d+)?)\s+(GiB|MiB) : Capture second memory value and its unit 170 | pattern_memory = r"(\d+(?:\.\d+)?)\s+(GiB|MiB)\s+(\d+(?:\.\d+)?)\s+(GiB|MiB)" 171 | memory_match = re.search(pattern_memory, output_str) 172 | if memory_match: 173 | max_mem_value = float(memory_match.group(1)) 174 | max_mem_unit = memory_match.group(2) 175 | used_mem_value = float(memory_match.group(3)) 176 | used_mem_unit = memory_match.group(4) 177 | 178 | self.logger.debug(f"Extracted Max Memory: {max_mem_value} {max_mem_unit}") 179 | self.logger.debug(f"Extracted Used Memory: {used_mem_value} {used_mem_unit}") 180 | 181 | # Convert memory values to GiB 182 | max_mem_gib = self._convert_to_gib(max_mem_value, max_mem_unit) 183 | used_mem_gib = self._convert_to_gib(used_mem_value, used_mem_unit) 184 | 185 | self.logger.debug(f"Converted Max Memory: {max_mem_gib} GiB") 186 | self.logger.debug(f"Converted Used Memory: {used_mem_gib} GiB") 187 | 188 | if max_mem_gib == 0: 189 | self.logger.warning("Maximum memory is zero. 
Cannot compute usage percentage.") 190 | return 0.0 191 | 192 | # Calculate RAM usage percentage based on memory values 193 | usage_percentage = (used_mem_gib / max_mem_gib) * 100 194 | self.logger.debug(f"Calculated RAM Usage: {usage_percentage:.2f}%") 195 | return usage_percentage 196 | else: 197 | self.logger.warning("RAM memory values not found in output.") 198 | return 0.0 199 | 200 | except Exception as e: 201 | self.logger.error(f"Error parsing RAM usage: {e}") 202 | return 0.0 203 | 204 | def _get_current_vcpus(self): 205 | """Retrieve current vCPUs assigned to the VM.""" 206 | try: 207 | command = f"qm config {self.vm_id}" 208 | output = self.ssh_client.execute_command(command) 209 | output_str = self._get_command_output(output) 210 | match = re.search(r"vcpus:\s*(\d+)", output_str) 211 | return int(match.group(1)) if match else 1 212 | except Exception as e: 213 | self.logger.error(f"Failed to retrieve vCPUs: {e}") 214 | return 1 215 | 216 | def _get_current_cores(self): 217 | """Retrieve current CPU cores assigned to the VM.""" 218 | try: 219 | command = f"qm config {self.vm_id}" 220 | output = self.ssh_client.execute_command(command) 221 | output_str = self._get_command_output(output) 222 | match = re.search(r"cores:\s*(\d+)", output_str) 223 | return int(match.group(1)) if match else 1 224 | except Exception as e: 225 | self.logger.error(f"Failed to retrieve CPU cores: {e}") 226 | return 1 227 | 228 | def _get_max_cores(self): 229 | """Retrieve maximum allowed CPU cores.""" 230 | return self.config.get("max_cores", 8) 231 | 232 | def _get_min_cores(self): 233 | """Retrieve minimum allowed CPU cores.""" 234 | return self.config.get("min_cores", 1) 235 | 236 | def _get_current_ram(self): 237 | """Retrieve current RAM assigned to the VM.""" 238 | try: 239 | command = f"qm config {self.vm_id}" 240 | output = self.ssh_client.execute_command(command) 241 | output_str = self._get_command_output(output) 242 | match = re.search(r"memory:\s*(\d+)", output_str) 243 | return int(match.group(1)) if match else 512 244 | except Exception as e: 245 | self.logger.error(f"Failed to retrieve current RAM: {e}") 246 | return 512 247 | 248 | def _get_max_ram(self): 249 | """Retrieve maximum allowed RAM.""" 250 | return self.config.get("max_ram", 16384) 251 | 252 | def _get_min_ram(self): 253 | """Retrieve minimum allowed RAM.""" 254 | return self.config.get("min_ram", 512) 255 | 256 | def _set_ram(self, ram): 257 | """Set the RAM for the VM.""" 258 | try: 259 | command = f"qm set {self.vm_id} -memory {ram}" 260 | output = self.ssh_client.execute_command(command) 261 | self._get_command_output(output) # Process output to catch any errors 262 | self.logger.info(f"RAM set to {ram} MB for VM {self.vm_id}.") 263 | except Exception as e: 264 | self.logger.error(f"Failed to set RAM to {ram}: {e}") 265 | raise 266 | 267 | def _scale_cpu_up(self, current_cores, current_vcpus): 268 | """Helper method to scale CPU up.""" 269 | new_cores = current_cores + 1 270 | self._set_cores(new_cores) 271 | new_vcpus = min(current_vcpus + 1, new_cores) 272 | self._set_vcpus(new_vcpus) 273 | 274 | def _scale_cpu_down(self, current_cores, current_vcpus): 275 | """Helper method to scale CPU down.""" 276 | new_vcpus = max(current_vcpus - 1, 1) 277 | self._set_vcpus(new_vcpus) 278 | new_cores = current_cores - 1 279 | self._set_cores(new_cores) 280 | 281 | def _set_cores(self, cores): 282 | """Set the CPU cores for the VM.""" 283 | try: 284 | command = f"qm set {self.vm_id} -cores {cores}" 285 | output = 
self.ssh_client.execute_command(command) 286 | self._get_command_output(output) # Process output to catch any errors 287 | self.logger.info(f"CPU cores set to {cores} for VM {self.vm_id}.") 288 | except Exception as e: 289 | self.logger.error(f"Failed to set CPU cores to {cores}: {e}") 290 | raise 291 | 292 | def _set_vcpus(self, vcpus): 293 | """Set the vCPUs for the VM.""" 294 | try: 295 | command = f"qm set {self.vm_id} -vcpus {vcpus}" 296 | output = self.ssh_client.execute_command(command) 297 | self._get_command_output(output) # Process output to catch any errors 298 | self.logger.info(f"vCPUs set to {vcpus} for VM {self.vm_id}.") 299 | except Exception as e: 300 | self.logger.error(f"Failed to set vCPUs to {vcpus}: {e}") 301 | raise --------------------------------------------------------------------------------
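For quick experiments outside the systemd service, the modules above can also be driven directly from an interactive session. The following is a minimal, hypothetical smoke test (not a file shipped in the repository): it assumes it is run from the installation directory so `ssh_utils` and `vm_manager` are importable, and it reuses the first host and VM entries from `config.yaml` to check SSH connectivity and print one VM's current usage.

```python
# smoke_test.py - hypothetical helper, not part of the repository
import yaml

from ssh_utils import SSHClient
from vm_manager import VMResourceManager

with open("/usr/local/bin/vm_autoscale/config.yaml") as f:
    config = yaml.safe_load(f)

host = config["proxmox_hosts"][0]    # first host from the example config (e.g. host1)
vm = config["virtual_machines"][0]   # first VM from the example config (e.g. 101)

# SSHClient is a context manager: it connects on entry and closes the session on exit
with SSHClient(host=host["host"], user=host["ssh_user"],
               password=host.get("ssh_password"), key_path=host.get("ssh_key"),
               port=host.get("ssh_port", 22)) as ssh:
    manager = VMResourceManager(ssh, vm["vm_id"], config)
    if manager.is_vm_running():
        cpu, ram = manager.get_resource_usage()
        print(f"VM {vm['vm_id']}: CPU {cpu:.1f}%, RAM {ram:.1f}%")
    else:
        print(f"VM {vm['vm_id']} is not running")
```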