├── .github ├── FUNDING.yml └── ISSUE_TEMPLATE │ ├── bug_report.md │ └── feature_request.md ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── SECURITY.md ├── autoscale.py ├── config.yaml ├── host_resource_checker.py ├── install.sh ├── logging_config.json ├── ssh_utils.py ├── vm_autoscale.service └── vm_manager.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | # These are supported funding model platforms 2 | 3 | github: [fabriziosalmi] 4 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/bug_report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature_request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: '' 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | __pycache__ 2 | .DS_Store 3 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Contributor Covenant Code of Conduct 2 | 3 | ## Our Pledge 4 | 5 | We as members, contributors, and leaders pledge to make participation in our 6 | community a harassment-free experience for everyone, regardless of age, body 7 | size, visible or invisible disability, ethnicity, sex characteristics, gender 8 | identity and expression, level of experience, education, socio-economic status, 9 | nationality, personal appearance, race, religion, or sexual identity 10 | and orientation. 
11 | 12 | We pledge to act and interact in ways that contribute to an open, welcoming, 13 | diverse, inclusive, and healthy community. 14 | 15 | ## Our Standards 16 | 17 | Examples of behavior that contributes to a positive environment for our 18 | community include: 19 | 20 | * Demonstrating empathy and kindness toward other people 21 | * Being respectful of differing opinions, viewpoints, and experiences 22 | * Giving and gracefully accepting constructive feedback 23 | * Accepting responsibility and apologizing to those affected by our mistakes, 24 | and learning from the experience 25 | * Focusing on what is best not just for us as individuals, but for the 26 | overall community 27 | 28 | Examples of unacceptable behavior include: 29 | 30 | * The use of sexualized language or imagery, and sexual attention or 31 | advances of any kind 32 | * Trolling, insulting or derogatory comments, and personal or political attacks 33 | * Public or private harassment 34 | * Publishing others' private information, such as a physical or email 35 | address, without their explicit permission 36 | * Other conduct which could reasonably be considered inappropriate in a 37 | professional setting 38 | 39 | ## Enforcement Responsibilities 40 | 41 | Community leaders are responsible for clarifying and enforcing our standards of 42 | acceptable behavior and will take appropriate and fair corrective action in 43 | response to any behavior that they deem inappropriate, threatening, offensive, 44 | or harmful. 45 | 46 | Community leaders have the right and responsibility to remove, edit, or reject 47 | comments, commits, code, wiki edits, issues, and other contributions that are 48 | not aligned to this Code of Conduct, and will communicate reasons for moderation 49 | decisions when appropriate. 50 | 51 | ## Scope 52 | 53 | This Code of Conduct applies within all community spaces, and also applies when 54 | an individual is officially representing the community in public spaces. 55 | Examples of representing our community include using an official e-mail address, 56 | posting via an official social media account, or acting as an appointed 57 | representative at an online or offline event. 58 | 59 | ## Enforcement 60 | 61 | Instances of abusive, harassing, or otherwise unacceptable behavior may be 62 | reported to the community leaders responsible for enforcement at 63 | fabrizio.salmi@gmail.com. 64 | All complaints will be reviewed and investigated promptly and fairly. 65 | 66 | All community leaders are obligated to respect the privacy and security of the 67 | reporter of any incident. 68 | 69 | ## Enforcement Guidelines 70 | 71 | Community leaders will follow these Community Impact Guidelines in determining 72 | the consequences for any action they deem in violation of this Code of Conduct: 73 | 74 | ### 1. Correction 75 | 76 | **Community Impact**: Use of inappropriate language or other behavior deemed 77 | unprofessional or unwelcome in the community. 78 | 79 | **Consequence**: A private, written warning from community leaders, providing 80 | clarity around the nature of the violation and an explanation of why the 81 | behavior was inappropriate. A public apology may be requested. 82 | 83 | ### 2. Warning 84 | 85 | **Community Impact**: A violation through a single incident or series 86 | of actions. 87 | 88 | **Consequence**: A warning with consequences for continued behavior. 
No 89 | interaction with the people involved, including unsolicited interaction with 90 | those enforcing the Code of Conduct, for a specified period of time. This 91 | includes avoiding interactions in community spaces as well as external channels 92 | like social media. Violating these terms may lead to a temporary or 93 | permanent ban. 94 | 95 | ### 3. Temporary Ban 96 | 97 | **Community Impact**: A serious violation of community standards, including 98 | sustained inappropriate behavior. 99 | 100 | **Consequence**: A temporary ban from any sort of interaction or public 101 | communication with the community for a specified period of time. No public or 102 | private interaction with the people involved, including unsolicited interaction 103 | with those enforcing the Code of Conduct, is allowed during this period. 104 | Violating these terms may lead to a permanent ban. 105 | 106 | ### 4. Permanent Ban 107 | 108 | **Community Impact**: Demonstrating a pattern of violation of community 109 | standards, including sustained inappropriate behavior, harassment of an 110 | individual, or aggression toward or disparagement of classes of individuals. 111 | 112 | **Consequence**: A permanent ban from any sort of public interaction within 113 | the community. 114 | 115 | ## Attribution 116 | 117 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], 118 | version 2.0, available at 119 | https://www.contributor-covenant.org/version/2/0/code_of_conduct.html. 120 | 121 | Community Impact Guidelines were inspired by [Mozilla's code of conduct 122 | enforcement ladder](https://github.com/mozilla/diversity). 123 | 124 | [homepage]: https://www.contributor-covenant.org 125 | 126 | For answers to common questions about this code of conduct, see the FAQ at 127 | https://www.contributor-covenant.org/faq. Translations are available at 128 | https://www.contributor-covenant.org/translations. 129 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | PR are welcome here! 2 | 3 | Fix issues, don't do like me re-introducing fixed issues, propose and provide full usable magic SOTA solid PRs and get free beers! 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 fab 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 🚀 VM Autoscale 2 | 3 | ## 🌟 Overview 4 | **Proxmox VM Autoscale** is a dynamic scaling service that automatically adjusts virtual machine (VM) resources (CPU cores and RAM) on your Proxmox Virtual Environment (VE) based on real-time metrics and user-defined thresholds. This solution helps ensure efficient resource usage, optimizing performance and resource availability dynamically. 5 | 6 | The service supports multiple Proxmox hosts via SSH connections and can be easily installed and managed as a **systemd** service for seamless automation. 7 | 8 | > [!IMPORTANT] 9 | > To enable scaling of VM resources, make sure NUMA and hotplug features are enabled: 10 | > - **Enable NUMA**: VM > Hardware > Processors > Enable NUMA ☑️ 11 | > - **Enable CPU Hotplug**: VM > Options > Hotplug > CPU ☑️ 12 | > - **Enable Memory Hotplug**: VM > Options > Hotplug > Memory ☑️ 13 | 14 | ## ✨ Features 15 | - 🔄 **Auto-scaling of VM CPU and RAM** based on real-time resource metrics. 16 | - 🛠️ **Configuration-driven** setup using an easy-to-edit YAML file. 17 | - 🌐 **Multi-host support** via SSH (compatible with both password and key-based authentication). 18 | - 📲 **Gotify Notifications** for alerting you whenever scaling actions are performed. 19 | - ⚙️ **Systemd Integration** for effortless setup, management, and monitoring as a Linux service. 20 | 21 | ## 📋 Prerequisites 22 | - 🖥️ **Proxmox VE** must be installed on the target hosts. 23 | - 🐍 **Python 3.x** should be installed on the Proxmox host(s). 24 | - 💻 Familiarity with Proxmox `qm` commands and SSH is recommended. 25 | 26 | ## 🤝 Contributing 27 | Contributions are **more** than welcome! If you encounter a bug or have suggestions for improvement, please [open an issue](https://github.com/fabriziosalmi/proxmox-vm-autoscale/issues/new/choose) or submit a pull request. 28 | 29 | ### Contributors 30 | Code improvements by: **[Specimen67](https://github.com/Specimen67)**, **[brianread108](https://github.com/brianread108)** 31 | 32 | ### Want to scale LXC containers instead of VM on Proxmox hosts? 33 | To autoscale LXC containers on Proxmox hosts, you may be interested in [this related project](https://github.com/fabriziosalmi/proxmox-lxc-autoscale). 34 | 35 | ## 🚀 Quick Start 36 | 37 | To install **Proxmox VM Autoscale**, execute the following `curl bash` command. This command will automatically clone the repository, execute the installation script, and set up the service for you: 38 | 39 | ```bash 40 | bash <(curl -s https://raw.githubusercontent.com/fabriziosalmi/proxmox-vm-autoscale/main/install.sh) 41 | ``` 42 | 43 | 🎯 **This installation script will:** 44 | - Clone the repository into `/usr/local/bin/vm_autoscale`. 45 | - Copy all necessary files to the installation directory. 46 | - Install the required Python dependencies. 47 | - Set up a **systemd unit file** to manage the autoscaling service. 48 | 49 | > [!NOTE] 50 | > The service is enabled but not started automatically at the end of the installation. To start it manually, use the following command. 
51 | 52 | ```bash 53 | systemctl start vm_autoscale.service 54 | ``` 55 | 56 | > [!IMPORTANT] 57 | > Make sure to review the official [Proxmox documentation](https://pve.proxmox.com/wiki/Hotplug_(qemu_disk,nic,cpu,memory)) for the hotplug feature requirements to enable scaling virtual machines on the fly. 58 | 59 | ## ⚡ Usage 60 | 61 | ### ▶️ Start/Stop the Service 62 | To **start** the autoscaling service: 63 | 64 | ```bash 65 | systemctl start vm_autoscale.service 66 | ``` 67 | 68 | To **stop** the service: 69 | 70 | ```bash 71 | systemctl stop vm_autoscale.service 72 | ``` 73 | 74 | ### 🔍 Check the Status 75 | To view the service status: 76 | 77 | ```bash 78 | systemctl status vm_autoscale.service 79 | ``` 80 | 81 | ### 📜 Logs 82 | Logs are saved to `/var/log/vm_autoscale.log`. You can monitor the logs in real-time using: 83 | 84 | ```bash 85 | tail -f /var/log/vm_autoscale.log 86 | ``` 87 | 88 | Or by using `journalctl`: 89 | 90 | ```bash 91 | journalctl -u vm_autoscale.service -f 92 | ``` 93 | 94 | ## ⚙️ Configuration 95 | 96 | The configuration file (`config.yaml`) is located at `/usr/local/bin/vm_autoscale/config.yaml`. This file contains settings for scaling thresholds, resource limits, Proxmox hosts, and VM information. 97 | 98 | ### Example Configuration 99 | ```yaml 100 | scaling_thresholds: 101 | cpu: 102 | high: 80 103 | low: 20 104 | ram: 105 | high: 85 106 | low: 25 107 | 108 | scaling_limits: 109 | min_cores: 1 110 | max_cores: 8 111 | min_ram_mb: 512 112 | max_ram_mb: 16384 113 | 114 | check_interval: 60 # Check every 60 seconds 115 | 116 | proxmox_hosts: 117 | - name: host1 118 | host: 192.168.1.10 119 | ssh_user: root 120 | ssh_password: your_password_here 121 | ssh_key: /path/to/ssh_key 122 | 123 | virtual_machines: 124 | - vm_id: 101 125 | proxmox_host: host1 126 | scaling_enabled: true 127 | cpu_scaling: true 128 | ram_scaling: true 129 | 130 | logging: 131 | level: INFO 132 | log_file: /var/log/vm_autoscale.log 133 | 134 | gotify: 135 | enabled: true 136 | server_url: https://gotify.example.com 137 | app_token: your_gotify_app_token_here 138 | priority: 5 139 | ``` 140 | 141 | ### ⚙️ Configuration Details 142 | - **`scaling_thresholds`**: Defines the CPU and RAM usage thresholds that trigger scaling actions (e.g., when CPU > 80%, scale up). 143 | - **`scaling_limits`**: Specifies the **minimum** and **maximum** resources (CPU cores and RAM) each VM can have. 144 | - **`proxmox_hosts`**: Contains the details of Proxmox hosts, including SSH credentials. 145 | - **`virtual_machines`**: Lists the VMs to be managed by the autoscaling script, allowing per-VM scaling customization. 146 | - **`logging`**: Specifies the logging level and log file path for activity tracking and debugging. 147 | - **`gotify`**: Configures **Gotify notifications** to send alerts when scaling actions are performed. 148 | 149 | ## 📲 Gotify Notifications 150 | Gotify is used to send real-time notifications regarding scaling actions. Configure Gotify in the `config.yaml` file: 151 | - **`enabled`**: Set to `true` to enable notifications. 152 | - **`server_url`**: URL of the Gotify server. 153 | - **`app_token`**: Authentication token for accessing Gotify. 154 | - **`priority`**: Notification priority level (1-10). 
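Before enabling notifications for the service, you can verify the `server_url` and `app_token` values by posting a test message to the same endpoint, with the same authorization header, that the autoscaler itself uses. The URL and token below are the placeholders from the example configuration; substitute your own:

```bash
curl -s -X POST "https://gotify.example.com/message" \
  -H "Authorization: Bearer your_gotify_app_token_here" \
  -d "title=VM Autoscale Alert" \
  -d "message=Test notification from vm_autoscale" \
  -d "priority=5"
```

A successful request returns the created message as JSON, and it should appear immediately in your Gotify clients.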
155 | 156 | ## 👨‍💻 Development 157 | 158 | ### 🔧 Requirements 159 | - **Python 3.x** 160 | - Required Python Packages: `paramiko`, `requests`, `PyYAML` 161 | 162 | ### 🐛 Running Manually 163 | To run the script manually for debugging or testing: 164 | 165 | ```bash 166 | python3 /usr/local/bin/vm_autoscale/autoscale.py 167 | ``` 168 | 169 | ### Other projects 170 | 171 | If you like my projects, you may also like these: 172 | 173 | - [caddy-waf](https://github.com/fabriziosalmi/caddy-waf) Caddy WAF (Regex Rules, IP and DNS filtering, Rate Limiting, GeoIP, Tor, Anomaly Detection) 174 | - [patterns](https://github.com/fabriziosalmi/patterns) Automated OWASP CRS and Bad Bot Detection for Nginx, Apache, Traefik and HAProxy 175 | - [blacklists](https://github.com/fabriziosalmi/blacklists) Hourly updated domains blacklist 🚫 176 | - [UglyFeed](https://github.com/fabriziosalmi/UglyFeed) Retrieve, aggregate, filter, evaluate, rewrite and serve RSS feeds using Large Language Models for fun, research and learning purposes 177 | - [proxmox-lxc-autoscale](https://github.com/fabriziosalmi/proxmox-lxc-autoscale) Automatically scale LXC container resources on Proxmox hosts 178 | - [DevGPT](https://github.com/fabriziosalmi/DevGPT) Code together, right now! GPT-powered code assistant to build projects in minutes 179 | - [websites-monitor](https://github.com/fabriziosalmi/websites-monitor) Website monitoring via GitHub Actions (expiration, security, performance, privacy, SEO) 180 | - [caddy-mib](https://github.com/fabriziosalmi/caddy-mib) Track and ban client IPs generating repetitive errors on Caddy 181 | - [zonecontrol](https://github.com/fabriziosalmi/zonecontrol) Cloudflare Zones Settings Automation using GitHub Actions 182 | - [lws](https://github.com/fabriziosalmi/lws) linux (containers) web services 183 | - [cf-box](https://github.com/fabriziosalmi/cf-box) cf-box is a set of Python tools to play with APIs and multiple Cloudflare accounts. 184 | - [limits](https://github.com/fabriziosalmi/limits) Automated rate limiting for web servers 185 | - [dnscontrol-actions](https://github.com/fabriziosalmi/dnscontrol-actions) Automate DNS updates and rollbacks across multiple providers using DNSControl and GitHub Actions 186 | - [proxmox-lxc-autoscale-ml](https://github.com/fabriziosalmi/proxmox-lxc-autoscale-ml) Automatically scale LXC container resources on Proxmox hosts with AI 187 | - [csv-anonymizer](https://github.com/fabriziosalmi/csv-anonymizer) CSV fuzzer/anonymizer 188 | - [iamnotacoder](https://github.com/fabriziosalmi/iamnotacoder) AI code generation and improvement 189 | 190 | 191 | ### ⚠️ Disclaimer 192 | > [!CAUTION] 193 | > The author assumes no responsibility for any damage or issues that may arise from using this tool. 194 | 195 | ### 📜 License 196 | This project is licensed under the **MIT License**. See the LICENSE file for complete details. 197 | -------------------------------------------------------------------------------- /SECURITY.md: -------------------------------------------------------------------------------- 1 | # Security Policy 2 | 3 | ## Supported Versions 4 | 5 | Alpha version, barely tested. Testers are welcome!
6 | -------------------------------------------------------------------------------- /autoscale.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | import json 3 | import requests 4 | import smtplib 5 | import logging 6 | import logging.config 7 | import time 8 | import re 9 | import sys 10 | from ssh_utils import SSHClient 11 | from pathlib import Path 12 | from email.mime.text import MIMEText 13 | from email.mime.multipart import MIMEMultipart 14 | from vm_manager import VMResourceManager 15 | from host_resource_checker import HostResourceChecker 16 | from functools import wraps 17 | from typing import Union, List, Optional, Dict, Any 18 | 19 | class ConfigurationError(Exception): 20 | """Custom exception for configuration-related errors.""" 21 | pass 22 | 23 | class NotificationManager: 24 | def __init__(self, config: Dict[str, Any], logger: logging.Logger): 25 | self.config = config 26 | self.logger = logger 27 | self.validate_notification_config() 28 | 29 | def validate_notification_config(self) -> None: 30 | """Validate notification configuration at startup.""" 31 | notification_enabled = False 32 | 33 | if self.config.get('gotify', {}).get('enabled', False): 34 | notification_enabled = True 35 | gotify_config = self.config.get('gotify', {}) 36 | if not all([gotify_config.get('server_url'), gotify_config.get('app_token')]): 37 | raise ConfigurationError("Gotify is enabled but configuration is incomplete") 38 | 39 | if self.config.get('alerts', {}).get('email_enabled', False): 40 | notification_enabled = True 41 | alerts_config = self.config.get('alerts', {}) 42 | required_fields = ['smtp_server', 'smtp_user', 'email_recipient'] 43 | missing_fields = [field for field in required_fields if not alerts_config.get(field)] 44 | if missing_fields: 45 | raise ConfigurationError(f"Email alerts are enabled but missing configuration: {', '.join(missing_fields)}") 46 | 47 | if not notification_enabled: 48 | self.logger.warning("No notification method is enabled in configuration") 49 | 50 | def _format_message(self, message: Union[str, tuple, Any]) -> str: 51 | """Format message to ensure it's a string.""" 52 | if isinstance(message, tuple): 53 | # If it's a tuple, join non-empty parts 54 | return ' '.join(str(part) for part in message if part) 55 | elif isinstance(message, str): 56 | return message 57 | else: 58 | return str(message) 59 | 60 | def send_gotify_notification(self, message: str, priority: Optional[int] = None) -> None: 61 | """Send notification via Gotify with retry logic.""" 62 | try: 63 | gotify_config = self.config.get('gotify', {}) 64 | server_url = gotify_config['server_url'].rstrip('/') # Remove trailing slash if present 65 | app_token = gotify_config['app_token'] 66 | final_priority = priority or gotify_config.get('priority', 5) 67 | 68 | formatted_message = self._format_message(message) 69 | 70 | response = requests.post( 71 | f"{server_url}/message", 72 | data={ 73 | "title": "VM Autoscale Alert", 74 | "message": formatted_message, 75 | "priority": final_priority 76 | }, 77 | headers={"Authorization": f"Bearer {app_token}"}, 78 | timeout=10 79 | ) 80 | response.raise_for_status() 81 | self.logger.info("Gotify notification sent successfully") 82 | except requests.exceptions.RequestException as e: 83 | self.logger.error(f"Failed to send Gotify notification: {str(e)}") 84 | raise 85 | 86 | def send_smtp_notification(self, message: str) -> None: 87 | """Send notification via email with retry logic.""" 88 | try: 89 | 
alerts_config = self.config['alerts'] 90 | smtp_config = { 91 | 'host': alerts_config['smtp_server'], 92 | 'port': alerts_config.get('smtp_port', 587), 93 | 'user': alerts_config['smtp_user'], 94 | 'password': alerts_config['smtp_password'], 95 | 'recipient': alerts_config['email_recipient'] 96 | } 97 | 98 | to_emails = [smtp_config['recipient']] if isinstance(smtp_config['recipient'], str) else smtp_config['recipient'] 99 | if not all(isinstance(email, str) for email in to_emails): 100 | raise ValueError("Invalid email format in recipients") 101 | formatted_message = self._format_message(message) 102 | # Updated regex to capture the VM number 103 | pattern = r"VM\s+(\d+)" 104 | result = re.search(pattern, formatted_message) 105 | if result: 106 | vm_id = result.group(1) 107 | else: 108 | vm_id = "" 109 | msg = MIMEMultipart() 110 | msg['From'] = smtp_config['user'] 111 | msg['To'] = ", ".join(to_emails) 112 | msg['Subject'] = f"VM Autoscale Alert for VM {vm_id}" 113 | msg.attach(MIMEText(formatted_message, 'plain')) 114 | 115 | with smtplib.SMTP(smtp_config['host'], smtp_config['port']) as server: 116 | server.starttls() 117 | if smtp_config['password']: 118 | server.login(smtp_config['user'], smtp_config['password']) 119 | server.sendmail(smtp_config['user'], to_emails, msg.as_string()) 120 | 121 | self.logger.info("Email notification sent successfully") 122 | except Exception as e: 123 | self.logger.error(f"Failed to send email notification: {str(e)}") 124 | raise 125 | 126 | def send_notification(self, message: Union[str, tuple, Any], priority: Optional[int] = None) -> None: 127 | """Send notification through configured channels.""" 128 | sent = False 129 | errors = [] 130 | formatted_message = self._format_message(message) 131 | 132 | if self.config.get('gotify', {}).get('enabled', False): 133 | try: 134 | self.send_gotify_notification(formatted_message, priority) 135 | sent = True 136 | except Exception as e: 137 | error_msg = f"Failed to send Gotify notification: {str(e)}" 138 | errors.append(error_msg) 139 | self.logger.error(error_msg) 140 | 141 | if self.config.get('alerts', {}).get('email_enabled', False): 142 | try: 143 | self.send_smtp_notification(formatted_message) 144 | sent = True 145 | except Exception as e: 146 | error_msg = f"Failed to send email notification: {str(e)}" 147 | errors.append(error_msg) 148 | self.logger.error(error_msg) 149 | 150 | if not sent: 151 | error_summary = f" Errors: {'; '.join(errors)}" if errors else "" 152 | self.logger.warning( 153 | f"Failed to send notification through any channel. 
Message: {formatted_message}.{error_summary}" 154 | ) 155 | 156 | class VMAutoscaler: 157 | def __init__(self, config_path: str, logging_config_path: Optional[str] = None): 158 | self.config = self._load_config(config_path) 159 | self.logger = self._setup_logging(logging_config_path) 160 | self.notification_manager = NotificationManager(self.config, self.logger) 161 | 162 | @staticmethod 163 | def _load_config(config_path: str) -> Dict[str, Any]: 164 | """Load and validate configuration file.""" 165 | if not Path(config_path).exists(): 166 | raise FileNotFoundError(f"Configuration file not found at {config_path}") 167 | 168 | with open(config_path, 'r') as config_file: 169 | config = yaml.safe_load(config_file) 170 | 171 | # Validate essential configuration 172 | required_sections = ['scaling_thresholds', 'scaling_limits', 'proxmox_hosts', 'virtual_machines'] 173 | missing_sections = [section for section in required_sections if section not in config] 174 | if missing_sections: 175 | raise ConfigurationError(f"Missing required configuration sections: {', '.join(missing_sections)}") 176 | 177 | return config 178 | 179 | def _setup_logging(self, logging_config_path: Optional[str]) -> logging.Logger: 180 | """Setup logging configuration.""" 181 | if logging_config_path and Path(logging_config_path).exists(): 182 | with open(logging_config_path, 'r') as logging_file: 183 | logging_config = json.load(logging_file) 184 | logging.config.dictConfig(logging_config) 185 | else: 186 | logging.basicConfig( 187 | level=self.config.get('logging', {}).get('level', 'INFO'), 188 | format="%(asctime)s [%(levelname)s] %(message)s", 189 | handlers=[ 190 | logging.FileHandler(self.config.get('logging', {}).get('log_file', '/var/log/vm_autoscale.log')), 191 | logging.StreamHandler() 192 | ] 193 | ) 194 | return logging.getLogger("vm_autoscale") 195 | 196 | def process_vm(self, host: Dict[str, Any], vm: Dict[str, Any]) -> None: 197 | """Process a single VM for autoscaling.""" 198 | ssh_client = None 199 | try: 200 | ssh_client = SSHClient( 201 | host=host['host'], 202 | port=host.get('ssh_port', 22), 203 | user=host['ssh_user'], 204 | password=host.get('ssh_password'), 205 | key_path=host.get('ssh_key') 206 | ) 207 | ssh_client.connect() 208 | 209 | vm_manager = VMResourceManager(ssh_client, vm['vm_id'], self.config) 210 | 211 | # First check if VM is running 212 | if not vm_manager.is_vm_running(): 213 | self.logger.info(f"VM {vm['vm_id']} is not running. Skipping scaling.") 214 | return 215 | 216 | # Check host resources first 217 | host_checker = HostResourceChecker(ssh_client) 218 | if not host_checker.check_host_resources( 219 | self.config['host_limits']['max_host_cpu_percent'], 220 | self.config['host_limits']['max_host_ram_percent']): 221 | self.logger.warning(f"Host {host['name']} resources maxed out.
Skipping scaling.") 222 | return 223 | 224 | # Get current resource usage once to avoid multiple calls 225 | current_cpu_usage, current_ram_usage = vm_manager.get_resource_usage() 226 | self.logger.info(f"VM {vm['vm_id']} current usage - CPU: {current_cpu_usage}%, RAM: {current_ram_usage}%") 227 | 228 | # Handle CPU scaling if enabled 229 | if vm.get('cpu_scaling', False): 230 | try: 231 | self._handle_cpu_scaling(vm_manager, vm['vm_id'], current_cpu_usage) 232 | self.logger.debug(f"CPU scaling completed for VM {vm['vm_id']}") 233 | except Exception as e: 234 | self.logger.error(f"CPU scaling failed for VM {vm['vm_id']}: {str(e)}") 235 | # Continue to RAM scaling even if CPU scaling fails 236 | 237 | # Handle RAM scaling if enabled 238 | if vm.get('ram_scaling', False): 239 | try: 240 | self._handle_ram_scaling(vm_manager, vm['vm_id'], current_ram_usage) 241 | self.logger.debug(f"RAM scaling completed for VM {vm['vm_id']}") 242 | except Exception as e: 243 | self.logger.error(f"RAM scaling failed for VM {vm['vm_id']}: {str(e)}") 244 | 245 | except Exception as e: 246 | self.logger.error(f"Error processing VM {vm['vm_id']} on host {host['name']}: {e}") 247 | self.notification_manager.send_notification( 248 | f"Error processing VM {vm['vm_id']} on host {host['name']}: {e}", 249 | priority=9 250 | ) 251 | finally: 252 | if ssh_client: 253 | ssh_client.close() 254 | 255 | def _handle_cpu_scaling(self, vm_manager: VMResourceManager, vm_id: int, cpu_usage: float) -> None: 256 | """Handle CPU scaling decisions.""" 257 | thresholds = self.config['scaling_thresholds']['cpu'] 258 | if cpu_usage > thresholds['high']: 259 | if vm_manager.scale_cpu('up'): 260 | self.notification_manager.send_notification( 261 | f"Scaled up CPU for VM {vm_id} due to high usage ({cpu_usage}%).", 262 | priority=7 263 | ) 264 | elif cpu_usage < thresholds['low']: 265 | if vm_manager.scale_cpu('down'): 266 | self.notification_manager.send_notification( 267 | f"Scaled down CPU for VM {vm_id} due to low usage ({cpu_usage}%).", 268 | priority=5 269 | ) 270 | 271 | def _handle_ram_scaling(self, vm_manager: VMResourceManager, vm_id: int, ram_usage: float) -> None: 272 | """Handle RAM scaling decisions.""" 273 | thresholds = self.config['scaling_thresholds']['ram'] 274 | if ram_usage > thresholds['high']: 275 | if vm_manager.scale_ram('up'): 276 | self.notification_manager.send_notification( 277 | f"Scaled up RAM for VM {vm_id} due to high usage ({ram_usage}%).", 278 | priority=7 279 | ) 280 | elif ram_usage < thresholds['low']: 281 | if vm_manager.scale_ram('down'): 282 | self.notification_manager.send_notification( 283 | f"Scaled down RAM for VM {vm_id} due to low usage ({ram_usage}%).", 284 | priority=5 285 | ) 286 | 287 | def run(self) -> None: 288 | """Main execution loop.""" 289 | self.logger.info("Starting VM Autoscaler") 290 | while True: 291 | try: 292 | for host in self.config['proxmox_hosts']: 293 | for vm in self.config['virtual_machines']: 294 | if vm['proxmox_host'] == host['name'] and vm.get('scaling_enabled', False): 295 | self.process_vm(host, vm) 296 | 297 | check_interval = self.config.get('check_interval', 300) # Default to 5 minutes 298 | time.sleep(check_interval) 299 | 300 | except KeyboardInterrupt: 301 | self.logger.info("Shutting down VM Autoscaler") 302 | break 303 | except Exception as e: 304 | self.logger.error(f"Unexpected error in main loop: {e}") 305 | self.notification_manager.send_notification( 306 | f"Unexpected error in VM Autoscaler: {e}", 307 | priority=10 308 | ) 309 | time.sleep(60) # Wait 
before retrying 310 | 311 | def main(): 312 | """Entry point of the application.""" 313 | try: 314 | autoscaler = VMAutoscaler( 315 | config_path="/usr/local/bin/vm_autoscale/config.yaml", 316 | logging_config_path="/usr/local/bin/vm_autoscale/logging_config.json" 317 | ) 318 | autoscaler.run() 319 | except Exception as e: 320 | logging.critical(f"Failed to start VM Autoscaler: {e}") 321 | sys.exit(1) 322 | 323 | if __name__ == "__main__": 324 | main() 325 | -------------------------------------------------------------------------------- /config.yaml: -------------------------------------------------------------------------------- 1 | # Configuration file for vm_autoscale 2 | 3 | # Global thresholds for scaling VMs 4 | scaling_thresholds: 5 | cpu: 6 | high: 80 # Percentage CPU usage at which scaling up is triggered 7 | low: 20 # Percentage CPU usage at which scaling down is considered 8 | ram: 9 | high: 85 # Percentage RAM usage at which scaling up is triggered 10 | low: 25 # Percentage RAM usage at which scaling down is considered 11 | 12 | # Scaling limits for VMs 13 | scaling_limits: 14 | min_cores: 1 # Minimum number of CPU cores that a VM can have 15 | max_cores: 8 # Maximum number of CPU cores that a VM can have 16 | min_ram_mb: 1024 # Minimum RAM (in MB) that a VM can have (updated since NUMA doeasnt like less than 1024MB) 17 | max_ram_mb: 16384 # Maximum RAM (in MB) that a VM can have 18 | 19 | # Time intervals for checking VM resources and performing actions (in seconds) 20 | check_interval: 300 # How often to check VM stats and take action if needed 21 | 22 | # List of Proxmox hosts to manage 23 | proxmox_hosts: 24 | - name: host1 25 | host: 192.168.1.10 26 | ssh_user: root 27 | ssh_password: your_password_here # SSH password or key must be provided 28 | ssh_key: /path/to/ssh_key # Path to SSH private key (optional, use if no password) 29 | ssh_port: 22 # SSH Port (default 22) 30 | 31 | - name: host2 32 | host: 192.168.1.11 33 | ssh_user: root 34 | ssh_password: your_password_here 35 | ssh_key: /path/to/ssh_key 36 | 37 | # Virtual machines to be monitored and scaled 38 | virtual_machines: 39 | - vm_id: "101" 40 | proxmox_host: "host1" 41 | scaling_enabled: true 42 | cpu_scaling: true # Enable/disable CPU scaling 43 | ram_scaling: true # Enable/disable RAM scaling 44 | thresholds: 45 | cpu_high: 80 46 | cpu_low: 20 47 | ram_high: 80 48 | ram_low: 20 49 | 50 | - vm_id: 102 51 | proxmox_host: host2 52 | scaling_enabled: true 53 | cpu_scaling: true 54 | ram_scaling: true 55 | thresholds: 56 | cpu_high: 80 57 | cpu_low: 20 58 | ram_high: 80 59 | ram_low: 20 60 | 61 | # Logging configuration 62 | logging: 63 | level: INFO # Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) 64 | log_file: /var/log/vm_autoscale.log # Path to the log file 65 | 66 | # Alerts configuration (Optional) 67 | alerts: 68 | email_enabled: false 69 | email_recipient: admin@example.com 70 | smtp_server: smtp.example.com 71 | smtp_port: 587 72 | smtp_user: your_smtp_user 73 | smtp_password: your_smtp_password 74 | 75 | # Gotify notifications configuration (Optional) 76 | gotify: 77 | enabled: false 78 | server_url: https://gotify.example.com # Base URL of the Gotify server 79 | app_token: your_gotify_app_token_here # Application token for authentication 80 | priority: 5 # Notification priority level (1-10) 81 | 82 | # Safety checks for host resource limits 83 | host_limits: 84 | max_host_cpu_percent: 90 # Max CPU usage percentage for the host before scaling is restricted 85 | max_host_ram_percent: 90 # Max RAM 
usage percentage for the host before scaling is restricted 86 | -------------------------------------------------------------------------------- /host_resource_checker.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import json 3 | 4 | class HostResourceChecker: 5 | """ 6 | Class to check and monitor host resource usage via SSH. 7 | """ 8 | 9 | def __init__(self, ssh_client): 10 | """ 11 | Initialize the HostResourceChecker with an SSH client. 12 | :param ssh_client: Instance of SSH client for executing remote commands. 13 | """ 14 | self.ssh_client = ssh_client 15 | self.logger = logging.getLogger("host_resource_checker") 16 | 17 | def check_host_resources(self, max_host_cpu_percent, max_host_ram_percent): 18 | """ 19 | Check host CPU and RAM usage against specified thresholds. 20 | :param max_host_cpu_percent: Maximum allowable CPU usage percentage. 21 | :param max_host_ram_percent: Maximum allowable RAM usage percentage. 22 | :return: True if resources are within limits, False otherwise. 23 | """ 24 | try: 25 | # Command to retrieve host resource status 26 | command = "pvesh get /nodes/$(hostname)/status --output-format json" 27 | output, error, exit_status = self.ssh_client.execute_command(command) # Properly unpack the tuple 28 | 29 | # Debug logging 30 | self.logger.debug(f"Raw command output: {output}") 31 | self.logger.debug(f"Error output: {error}") 32 | self.logger.debug(f"Exit status: {exit_status}") 33 | 34 | # Check for error output 35 | if error: 36 | raise Exception(f"Command execution error: {error}") 37 | 38 | # Make sure output is a string 39 | if not isinstance(output, str): 40 | output = output.decode() if isinstance(output, bytes) else str(output) 41 | 42 | # Parse JSON response 43 | data = json.loads(output.strip()) # Add strip() to remove any whitespace 44 | 45 | # Rest of your code remains the same 46 | if 'cpu' not in data or 'memory' not in data: 47 | raise KeyError("Missing 'cpu' or 'memory' in the command output.") 48 | 49 | # Extract and calculate CPU usage 50 | host_cpu_usage = data['cpu'] * 100 # Convert to percentage 51 | 52 | # Extract memory details 53 | memory_data = data['memory'] 54 | total_mem = memory_data.get('total', 1) # Avoid division by zero 55 | used_mem = memory_data.get('used', 0) 56 | cached_mem = memory_data.get('cached', 0) 57 | free_mem = memory_data.get('free', 0) 58 | 59 | # Calculate RAM usage as a percentage 60 | available_mem = free_mem + cached_mem 61 | host_ram_usage = ((total_mem - available_mem) / total_mem) * 100 62 | 63 | # Log resource usage 64 | self.logger.info(f"Host CPU Usage: {host_cpu_usage:.2f}%, " 65 | f"Host RAM Usage: {host_ram_usage:.2f}%") 66 | 67 | # Check CPU usage threshold 68 | if host_cpu_usage > max_host_cpu_percent: 69 | self.logger.warning(f"Host CPU usage exceeds maximum allowed limit: " 70 | f"{host_cpu_usage:.2f}% > {max_host_cpu_percent}%") 71 | return False 72 | 73 | # Check RAM usage threshold 74 | if host_ram_usage > max_host_ram_percent: 75 | self.logger.warning(f"Host RAM usage exceeds maximum allowed limit: " 76 | f"{host_ram_usage:.2f}% > {max_host_ram_percent}%") 77 | return False 78 | 79 | # Resources are within limits 80 | return True 81 | 82 | except json.JSONDecodeError as json_err: 83 | self.logger.error(f"Failed to parse JSON output: {str(json_err)}") 84 | self.logger.error(f"Raw output causing JSON error: {output}") 85 | raise 86 | except KeyError as key_err: 87 | self.logger.error(f"Missing data in the response: {str(key_err)}") 
88 | raise 89 | except Exception as e: 90 | self.logger.error(f"Failed to check host resources: {str(e)}") 91 | raise -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Install script for Proxmox VM Autoscale project 3 | # Repository: https://github.com/fabriziosalmi/proxmox-vm-autoscale 4 | 5 | # Variables 6 | INSTALL_DIR="/usr/local/bin/vm_autoscale" 7 | BACKUP_DIR="/etc/vm_autoscale" # New separate backup directory 8 | REPO_URL="https://github.com/fabriziosalmi/proxmox-vm-autoscale" 9 | SERVICE_FILE="vm_autoscale.service" 10 | CONFIG_FILE="$INSTALL_DIR/config.yaml" 11 | BACKUP_FILE="$BACKUP_DIR/config.yaml.backup" # Updated backup location 12 | REQUIREMENTS_FILE="$INSTALL_DIR/requirements.txt" 13 | PYTHON_CMD="/usr/bin/python3" 14 | 15 | # Ensure the script is run as root 16 | if [ "$EUID" -ne 0 ]; then 17 | echo "ERROR: Please run this script as root." 18 | exit 1 19 | fi 20 | 21 | # Create backup directory if it doesn't exist 22 | if [ ! -d "$BACKUP_DIR" ]; then 23 | echo "Creating backup directory..." 24 | mkdir -p "$BACKUP_DIR" || { echo "ERROR: Failed to create backup directory"; exit 1; } 25 | fi 26 | 27 | # Backup existing config.yaml if it exists 28 | if [ -f "$CONFIG_FILE" ]; then 29 | echo "Backing up existing config.yaml to $BACKUP_FILE..." 30 | cp "$CONFIG_FILE" "$BACKUP_FILE" || { echo "ERROR: Failed to backup config.yaml"; exit 1; } 31 | fi 32 | 33 | # Install necessary dependencies 34 | echo "Installing necessary dependencies..." 35 | apt-get update || { echo "ERROR: Failed to update package lists"; exit 1; } 36 | apt-get install -y python3 curl bash git python3-paramiko python3-yaml python3-requests python3-cryptography \ 37 | || { echo "ERROR: Failed to install required packages"; exit 1; } 38 | 39 | # Clone the repository 40 | echo "Cloning the repository..." 41 | if [ -d "$INSTALL_DIR" ]; then 42 | echo "Removing existing installation directory..." 43 | rm -rf "$INSTALL_DIR" || { echo "ERROR: Failed to remove existing directory $INSTALL_DIR"; exit 1; } 44 | fi 45 | 46 | git clone "$REPO_URL" "$INSTALL_DIR" || { echo "ERROR: Failed to clone the repository from $REPO_URL"; exit 1; } 47 | 48 | # Restore backup if it exists 49 | if [ -f "$BACKUP_FILE" ]; then 50 | echo "Restoring config.yaml from backup..." 51 | cp "$BACKUP_FILE" "$CONFIG_FILE" || { echo "ERROR: Failed to restore config.yaml from backup"; exit 1; } 52 | fi 53 | 54 | # Install Python dependencies 55 | if [ -f "$REQUIREMENTS_FILE" ]; then 56 | echo "Installing Python dependencies..." 57 | pip3 install -r "$REQUIREMENTS_FILE" || { echo "ERROR: Failed to install Python dependencies"; exit 1; } 58 | else 59 | echo "WARNING: Requirements file not found. Skipping Python dependency installation." 60 | fi 61 | 62 | # Set permissions 63 | echo "Setting permissions for installation directory..." 64 | chmod -R 755 "$INSTALL_DIR" || { echo "ERROR: Failed to set permissions on $INSTALL_DIR"; exit 1; } 65 | chmod -R 755 "$BACKUP_DIR" || { echo "ERROR: Failed to set permissions on $BACKUP_DIR"; exit 1; } 66 | 67 | # Create the systemd service file 68 | echo "Creating the systemd service file..." 
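# Note: the heredoc below is intentionally left unquoted so that $PYTHON_CMD and $INSTALL_DIR are expanded now, at install time, and end up as absolute paths in the generated unit file.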
69 | cat <<EOF > /etc/systemd/system/$SERVICE_FILE 70 | [Unit] 71 | Description=Proxmox VM Autoscale Service 72 | After=network.target 73 | 74 | [Service] 75 | ExecStart=$PYTHON_CMD $INSTALL_DIR/autoscale.py 76 | WorkingDirectory=$INSTALL_DIR 77 | Restart=always 78 | User=root 79 | Environment=PYTHONUNBUFFERED=1 80 | 81 | [Install] 82 | WantedBy=multi-user.target 83 | EOF 84 | 85 | if [ $? -ne 0 ]; then 86 | echo "ERROR: Failed to create systemd service file at /etc/systemd/system/$SERVICE_FILE" 87 | exit 1 88 | fi 89 | 90 | # Reload systemd, enable the service, and ensure it's not started 91 | echo "Reloading systemd and enabling the service..." 92 | systemctl daemon-reload || { echo "ERROR: Failed to reload systemd"; exit 1; } 93 | systemctl enable "$SERVICE_FILE" || { echo "ERROR: Failed to enable the service"; exit 1; } 94 | 95 | # Post-installation instructions 96 | echo "Installation complete. The service is enabled but not started." 97 | echo "To start the service, use: sudo systemctl start $SERVICE_FILE" 98 | echo "Logs can be monitored using: journalctl -u $SERVICE_FILE -f" 99 | echo "Config backup location: $BACKUP_FILE" 100 | -------------------------------------------------------------------------------- /logging_config.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": 1, 3 | "disable_existing_loggers": false, 4 | "formatters": { 5 | "detailed": { 6 | "format": "%(asctime)s [%(levelname)s] %(name)s: %(message)s" 7 | }, 8 | "simple": { 9 | "format": "%(levelname)s: %(message)s" 10 | } 11 | }, 12 | "handlers": { 13 | "console": { 14 | "class": "logging.StreamHandler", 15 | "level": "INFO", 16 | "formatter": "simple", 17 | "stream": "ext://sys.stdout" 18 | }, 19 | "file": { 20 | "class": "logging.FileHandler", 21 | "level": "DEBUG", 22 | "formatter": "detailed", 23 | "filename": "/var/log/vm_autoscale.log", 24 | "mode": "a", 25 | "encoding": "utf8" 26 | } 27 | }, 28 | "loggers": { 29 | "": { 30 | "level": "DEBUG", 31 | "handlers": ["console", "file"] 32 | }, 33 | "ssh_utils": { 34 | "level": "INFO", 35 | "handlers": ["console", "file"], 36 | "propagate": false 37 | }, 38 | "vm_resource_manager": { 39 | "level": "INFO", 40 | "handlers": ["console", "file"], 41 | "propagate": false 42 | }, 43 | "host_resource_checker": { 44 | "level": "INFO", 45 | "handlers": ["console", "file"], 46 | "propagate": false 47 | } 48 | }, 49 | "root": { 50 | "level": "DEBUG", 51 | "handlers": ["console", "file"] 52 | } 53 | } 54 | -------------------------------------------------------------------------------- /ssh_utils.py: -------------------------------------------------------------------------------- 1 | import paramiko 2 | import logging 3 | import time 4 | from paramiko.ssh_exception import SSHException, AuthenticationException 5 | 6 | class SSHClient: 7 | def __init__(self, host, user, password=None, key_path=None, port=22): 8 | """ 9 | Initializes the SSH client with given credentials. 10 | :param host: Hostname or IP address of the server. 11 | :param user: Username to connect with. 12 | :param password: Password for SSH (optional). 13 | :param key_path: Path to the private SSH key (optional). 14 | :param port: Port for SSH connection (default: 22).
15 | """ 16 | self.host = host 17 | self.user = user 18 | self.password = password 19 | self.key_path = key_path 20 | self.port = port 21 | self.logger = logging.getLogger("ssh_utils") 22 | self.client = None 23 | # Added max retries and backoff factor for connection attempts 24 | self.max_retries = 5 25 | self.backoff_factor = 1 26 | 27 | def connect(self): 28 | """ 29 | Establish an SSH connection to the host. 30 | """ 31 | if self.client is not None and self.client.get_transport() and self.client.get_transport().is_active(): 32 | self.logger.info(f"Already connected to {self.host}. Reusing the connection.") 33 | return 34 | 35 | attempt = 0 36 | while attempt < self.max_retries: 37 | try: 38 | self.client = paramiko.SSHClient() 39 | self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 40 | 41 | # Connect using password or private key 42 | if self.password: 43 | self.client.connect( 44 | hostname=self.host, 45 | username=self.user, 46 | password=self.password, 47 | port=self.port, 48 | timeout=10 49 | ) 50 | elif self.key_path: 51 | private_key = paramiko.RSAKey.from_private_key_file(self.key_path) 52 | self.client.connect( 53 | hostname=self.host, 54 | username=self.user, 55 | pkey=private_key, 56 | port=self.port, 57 | timeout=10 58 | ) 59 | else: 60 | raise ValueError("Either password or key_path must be provided for SSH connection.") 61 | 62 | self.logger.info(f"Successfully connected to {self.host} on port {self.port}") 63 | break # successful connection: exit loop 64 | 65 | except AuthenticationException: 66 | self.logger.error(f"Authentication failed for {self.host}. Check credentials or key file.") 67 | raise 68 | except (SSHException, Exception) as e: 69 | attempt += 1 70 | if attempt >= self.max_retries: 71 | self.logger.error(f"Failed to connect to {self.host} after {attempt} attempts.") 72 | raise e 73 | sleep_time = self.backoff_factor * (2 ** (attempt - 1)) 74 | self.logger.info(f"Retrying connection to {self.host} in {sleep_time} seconds (attempt {attempt}/{self.max_retries})") 75 | time.sleep(sleep_time) 76 | 77 | def execute_command(self, command, timeout=30): 78 | """Execute a command on the remote server with retry logic.""" 79 | attempts = 0 80 | while attempts < self.max_retries: 81 | try: 82 | # ...existing code before try... 83 | stdin, stdout, stderr = self.client.exec_command(command, timeout=timeout) 84 | exit_status = stdout.channel.recv_exit_status() 85 | 86 | output = stdout.read().decode('utf-8').strip() 87 | error = stderr.read().decode('utf-8').strip() 88 | 89 | if exit_status == 0: 90 | self.logger.info(f"Command executed successfully on {self.host}: {command}") 91 | return output, error, exit_status 92 | else: 93 | self.logger.warning(f"Command execution failed on {self.host} with exit status {exit_status}") 94 | return output, error, exit_status 95 | except Exception as e: 96 | attempts += 1 97 | self.logger.error(f"Error executing command on {self.host} (attempt {attempts}): {str(e)}") 98 | self.close() 99 | try: 100 | self.connect() 101 | except Exception as connect_err: 102 | self.logger.error(f"Reconnection failed on {self.host}: {str(connect_err)}") 103 | time.sleep(self.backoff_factor * (2 ** (attempts - 1))) 104 | raise Exception(f"Failed to execute command on {self.host} after {attempts} attempts.") 105 | 106 | def close(self): 107 | """ 108 | Close the SSH connection. 
109 | """ 110 | if self.client: 111 | try: 112 | self.client.close() 113 | self.logger.info(f"SSH connection closed for {self.host}") 114 | except Exception as e: 115 | self.logger.error(f"Error while closing SSH connection to {self.host}: {str(e)}") 116 | finally: 117 | self.client = None 118 | 119 | def __enter__(self): 120 | """ 121 | Context manager entry. 122 | """ 123 | self.connect() 124 | return self 125 | 126 | def __exit__(self, exc_type, exc_value, traceback): 127 | """ 128 | Context manager exit - ensure the SSH connection is closed. 129 | """ 130 | self.close() 131 | 132 | def is_connected(self): 133 | """ 134 | Check if the SSH client is connected and transport is active. 135 | :return: True if connected, False otherwise. 136 | """ 137 | return self.client is not None and self.client.get_transport() and self.client.get_transport().is_active() 138 | -------------------------------------------------------------------------------- /vm_autoscale.service: -------------------------------------------------------------------------------- 1 | [Unit] 2 | Description=Proxmox VM Autoscale Service 3 | Documentation=https://github.com/fabriziosalmi/proxmox-vm-autoscale 4 | After=network.target 5 | 6 | [Service] 7 | ExecStart=/usr/bin/python3 /usr/local/bin/vm_autoscale/autoscale.py 8 | WorkingDirectory=/usr/local/bin/vm_autoscale 9 | Restart=always 10 | RestartSec=10 11 | User=root 12 | Environment=PYTHONUNBUFFERED=1 13 | 14 | [Install] 15 | WantedBy=multi-user.target 16 | -------------------------------------------------------------------------------- /vm_manager.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import re 3 | import time 4 | import threading 5 | 6 | 7 | class VMResourceManager: 8 | def __init__(self, ssh_client, vm_id, config): 9 | self.ssh_client = ssh_client 10 | self.vm_id = vm_id 11 | self.config = config 12 | self.logger = logging.getLogger("vm_resource_manager") 13 | self.last_scale_time = 0 14 | self.scale_cooldown = self.config.get("scale_cooldown", 300) # Default to 5 minutes 15 | self.scale_lock = threading.Lock() # Added lock for scaling control 16 | 17 | def _get_command_output(self, output): 18 | """Helper method to properly handle command output that might be a tuple.""" 19 | if isinstance(output, tuple): 20 | # Assuming the first element contains the stdout 21 | return str(output[0]).strip() if output and output[0] is not None else "" 22 | return str(output).strip() if output is not None else "" 23 | 24 | def is_vm_running(self, retries=3, delay=5): 25 | """Check if the VM is running with retries and improved error handling.""" 26 | for attempt in range(1, retries + 1): 27 | try: 28 | command = f"qm status {self.vm_id} --verbose" 29 | self.logger.debug(f"Executing command to check VM status: {command}") 30 | output = self.ssh_client.execute_command(command) 31 | output_str = self._get_command_output(output) 32 | self.logger.debug(f"Command output: {output_str}") 33 | 34 | if "status: running" in output_str.lower(): 35 | self.logger.info(f"VM {self.vm_id} is running.") 36 | return True 37 | elif "status:" in output_str.lower(): 38 | self.logger.info(f"VM {self.vm_id} is not running.") 39 | return False 40 | else: 41 | self.logger.warning( 42 | f"Unexpected output while checking VM status: {output_str}" 43 | ) 44 | except Exception as e: 45 | self.logger.warning( 46 | f"Attempt {attempt}/{retries} failed to check VM status: {e}. Retrying..." 
47 | ) 48 | time.sleep(delay * attempt) # Exponential backoff 49 | 50 | self.logger.error( 51 | f"Unable to determine status of VM {self.vm_id} after {retries} attempts." 52 | ) 53 | return False 54 | 55 | def get_resource_usage(self): 56 | """Retrieve CPU and RAM usage as percentages.""" 57 | try: 58 | if not self.is_vm_running(): 59 | return 0.0, 0.0 60 | #command = f"qm status {self.vm_id} --verbose" 61 | # Updated command - this might well be refinable to simpler and faster. 62 | vmid = self.vm_id 63 | command = f"pvesh get /cluster/resources | grep 'qemu/{vmid}' | awk -F '│' '{{print $6, $15, $16}}'" 64 | output = self.ssh_client.execute_command(command) 65 | # example output: " 3.17% 5.00 GiB 3.82 GiB " 66 | self.logger.info(f"VM status output: {output}") 67 | cpu_usage = self._parse_cpu_usage(output) 68 | ram_usage = self._parse_ram_usage(output) 69 | return cpu_usage, ram_usage 70 | except Exception as e: 71 | self.logger.error(f"Failed to retrieve resource usage: {e}") 72 | return 0.0, 0.0 73 | 74 | def can_scale(self): 75 | """Determine if scaling can occur using a lock to avoid race conditions.""" 76 | with self.scale_lock: 77 | current_time = time.time() 78 | if current_time - self.last_scale_time < self.scale_cooldown: 79 | return False 80 | self.last_scale_time = current_time 81 | return True 82 | 83 | def scale_cpu(self, direction): 84 | """Scale the CPU cores and vCPUs of the VM.""" 85 | if not self.can_scale(): 86 | return False 87 | 88 | try: 89 | current_cores = self._get_current_cores() 90 | max_cores = self._get_max_cores() 91 | min_cores = self._get_min_cores() 92 | current_vcpus = self._get_current_vcpus() 93 | 94 | self.last_scale_time = time.time() 95 | if direction == "up" and current_cores < max_cores: 96 | self._scale_cpu_up(current_cores, current_vcpus) 97 | return True 98 | elif direction == "down" and current_cores > min_cores: 99 | self._scale_cpu_down(current_cores, current_vcpus) 100 | return True 101 | else: 102 | self.logger.info("No CPU scaling required.") 103 | return False 104 | except Exception as e: 105 | self.logger.error(f"Failed to scale CPU: {e}") 106 | raise 107 | 108 | def scale_ram(self, direction): 109 | """Scale the RAM of the VM.""" 110 | if not self.can_scale(): 111 | return False 112 | 113 | try: 114 | current_ram = self._get_current_ram() 115 | max_ram = self._get_max_ram() 116 | min_ram = self._get_min_ram() 117 | 118 | self.last_scale_time = time.time() 119 | if direction == "up" and current_ram < max_ram: 120 | new_ram = min(current_ram + 512, max_ram) 121 | self._set_ram(new_ram) 122 | return True 123 | elif direction == "down" and current_ram > min_ram: 124 | new_ram = max(current_ram - 512, min_ram) 125 | self._set_ram(new_ram) 126 | return True 127 | else: 128 | self.logger.info("No RAM scaling required.") 129 | return False 130 | except Exception as e: 131 | self.logger.error(f"Failed to scale RAM: {e}") 132 | raise 133 | 134 | def _parse_cpu_usage(self, output): 135 | """Parse CPU usage from VM status output.""" 136 | try: 137 | output_str = self._get_command_output(output) 138 | percentage_cpu_match = re.search(r"^\s*(\d+(?:\.\d+)?)%", output_str) 139 | if percentage_cpu_match: 140 | return float(percentage_cpu_match.group(1)) 141 | self.logger.warning("CPU usage not found in output.") 142 | return 0.0 143 | except Exception as e: 144 | self.logger.error(f"Error parsing CPU usage: {e}") 145 | return 0.0 146 | 147 | def _convert_to_gib(self, value, unit): 148 | """ Converts memory units to GiB. 
""" 149 | unit = unit.lower() 150 | if unit == 'gib': 151 | return value 152 | elif unit == 'mib': 153 | return value / 1024 # Convert MiB to GiB 154 | else: 155 | self.logger.warning(f"Unknown memory unit '{unit}'. Assuming GiB.") 156 | return value # Assume GiB if unit is unknown 157 | 158 | def _parse_ram_usage(self, output): 159 | """ Parses RAM usage from VM status output. """ 160 | try: 161 | output_str = self._get_command_output(output) 162 | self.logger.debug(f"Processing output: '{output_str}'") 163 | # ---------------------------- 164 | # Extract Memory Values 165 | # ---------------------------- 166 | # Pattern Explanation: 167 | # - (\d+(?:\.\d+)?)\s+(GiB|MiB) : Capture first memory value and its unit 168 | # - \s+ : Match one or more whitespace characters 169 | # - (\d+(?:\.\d+)?)\s+(GiB|MiB) : Capture second memory value and its unit 170 | pattern_memory = r"(\d+(?:\.\d+)?)\s+(GiB|MiB)\s+(\d+(?:\.\d+)?)\s+(GiB|MiB)" 171 | memory_match = re.search(pattern_memory, output_str) 172 | if memory_match: 173 | max_mem_value = float(memory_match.group(1)) 174 | max_mem_unit = memory_match.group(2) 175 | used_mem_value = float(memory_match.group(3)) 176 | used_mem_unit = memory_match.group(4) 177 | 178 | self.logger.debug(f"Extracted Max Memory: {max_mem_value} {max_mem_unit}") 179 | self.logger.debug(f"Extracted Used Memory: {used_mem_value} {used_mem_unit}") 180 | 181 | # Convert memory values to GiB 182 | max_mem_gib = self._convert_to_gib(max_mem_value, max_mem_unit) 183 | used_mem_gib = self._convert_to_gib(used_mem_value, used_mem_unit) 184 | 185 | self.logger.debug(f"Converted Max Memory: {max_mem_gib} GiB") 186 | self.logger.debug(f"Converted Used Memory: {used_mem_gib} GiB") 187 | 188 | if max_mem_gib == 0: 189 | self.logger.warning("Maximum memory is zero. 
Cannot compute usage percentage.") 190 | return 0.0 191 | 192 | # Calculate RAM usage percentage based on memory values 193 | usage_percentage = (used_mem_gib / max_mem_gib) * 100 194 | self.logger.debug(f"Calculated RAM Usage: {usage_percentage:.2f}%") 195 | return usage_percentage 196 | else: 197 | self.logger.warning("RAM memory values not found in output.") 198 | return 0.0 199 | 200 | except Exception as e: 201 | self.logger.error(f"Error parsing RAM usage: {e}") 202 | return 0.0 203 | 204 | def _get_current_vcpus(self): 205 | """Retrieve current vCPUs assigned to the VM.""" 206 | try: 207 | command = f"qm config {self.vm_id}" 208 | output = self.ssh_client.execute_command(command) 209 | output_str = self._get_command_output(output) 210 | match = re.search(r"vcpus:\s*(\d+)", output_str) 211 | return int(match.group(1)) if match else 1 212 | except Exception as e: 213 | self.logger.error(f"Failed to retrieve vCPUs: {e}") 214 | return 1 215 | 216 | def _get_current_cores(self): 217 | """Retrieve current CPU cores assigned to the VM.""" 218 | try: 219 | command = f"qm config {self.vm_id}" 220 | output = self.ssh_client.execute_command(command) 221 | output_str = self._get_command_output(output) 222 | match = re.search(r"cores:\s*(\d+)", output_str) 223 | return int(match.group(1)) if match else 1 224 | except Exception as e: 225 | self.logger.error(f"Failed to retrieve CPU cores: {e}") 226 | return 1 227 | 228 | def _get_max_cores(self): 229 | """Retrieve maximum allowed CPU cores.""" 230 | return self.config.get("max_cores", 8) 231 | 232 | def _get_min_cores(self): 233 | """Retrieve minimum allowed CPU cores.""" 234 | return self.config.get("min_cores", 1) 235 | 236 | def _get_current_ram(self): 237 | """Retrieve current RAM assigned to the VM.""" 238 | try: 239 | command = f"qm config {self.vm_id}" 240 | output = self.ssh_client.execute_command(command) 241 | output_str = self._get_command_output(output) 242 | match = re.search(r"memory:\s*(\d+)", output_str) 243 | return int(match.group(1)) if match else 512 244 | except Exception as e: 245 | self.logger.error(f"Failed to retrieve current RAM: {e}") 246 | return 512 247 | 248 | def _get_max_ram(self): 249 | """Retrieve maximum allowed RAM.""" 250 | return self.config.get("max_ram", 16384) 251 | 252 | def _get_min_ram(self): 253 | """Retrieve minimum allowed RAM.""" 254 | return self.config.get("min_ram", 512) 255 | 256 | def _set_ram(self, ram): 257 | """Set the RAM for the VM.""" 258 | try: 259 | command = f"qm set {self.vm_id} -memory {ram}" 260 | output = self.ssh_client.execute_command(command) 261 | self._get_command_output(output) # Process output to catch any errors 262 | self.logger.info(f"RAM set to {ram} MB for VM {self.vm_id}.") 263 | except Exception as e: 264 | self.logger.error(f"Failed to set RAM to {ram}: {e}") 265 | raise 266 | 267 | def _scale_cpu_up(self, current_cores, current_vcpus): 268 | """Helper method to scale CPU up.""" 269 | new_cores = current_cores + 1 270 | self._set_cores(new_cores) 271 | new_vcpus = min(current_vcpus + 1, new_cores) 272 | self._set_vcpus(new_vcpus) 273 | 274 | def _scale_cpu_down(self, current_cores, current_vcpus): 275 | """Helper method to scale CPU down.""" 276 | new_vcpus = max(current_vcpus - 1, 1) 277 | self._set_vcpus(new_vcpus) 278 | new_cores = current_cores - 1 279 | self._set_cores(new_cores) 280 | 281 | def _set_cores(self, cores): 282 | """Set the CPU cores for the VM.""" 283 | try: 284 | command = f"qm set {self.vm_id} -cores {cores}" 285 | output = 
self.ssh_client.execute_command(command) 286 | self._get_command_output(output) # Process output to catch any errors 287 | self.logger.info(f"CPU cores set to {cores} for VM {self.vm_id}.") 288 | except Exception as e: 289 | self.logger.error(f"Failed to set CPU cores to {cores}: {e}") 290 | raise 291 | 292 | def _set_vcpus(self, vcpus): 293 | """Set the vCPUs for the VM.""" 294 | try: 295 | command = f"qm set {self.vm_id} -vcpus {vcpus}" 296 | output = self.ssh_client.execute_command(command) 297 | self._get_command_output(output) # Process output to catch any errors 298 | self.logger.info(f"vCPUs set to {vcpus} for VM {self.vm_id}.") 299 | except Exception as e: 300 | self.logger.error(f"Failed to set vCPUs to {vcpus}: {e}") 301 | raise --------------------------------------------------------------------------------
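For quick experiments outside the systemd service, the modules above can also be driven directly from an interactive session. The following is a minimal, hypothetical smoke test (not a file shipped in the repository): it assumes it is run from the installation directory so `ssh_utils` and `vm_manager` are importable, and it reuses the first host and VM entries from `config.yaml` to check SSH connectivity and print one VM's current usage.

```python
# smoke_test.py - hypothetical helper, not part of the repository
import yaml

from ssh_utils import SSHClient
from vm_manager import VMResourceManager

with open("/usr/local/bin/vm_autoscale/config.yaml") as f:
    config = yaml.safe_load(f)

host = config["proxmox_hosts"][0]    # first host from the example config (e.g. host1)
vm = config["virtual_machines"][0]   # first VM from the example config (e.g. 101)

# SSHClient is a context manager: it connects on entry and closes the session on exit
with SSHClient(host=host["host"], user=host["ssh_user"],
               password=host.get("ssh_password"), key_path=host.get("ssh_key"),
               port=host.get("ssh_port", 22)) as ssh:
    manager = VMResourceManager(ssh, vm["vm_id"], config)
    if manager.is_vm_running():
        cpu, ram = manager.get_resource_usage()
        print(f"VM {vm['vm_id']}: CPU {cpu:.1f}%, RAM {ram:.1f}%")
    else:
        print(f"VM {vm['vm_id']} is not running")
```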