├── images
│   ├── 2021-02-04-04-07-46.png
│   └── 2021-02-04-04-53-08.png
├── requirements.txt
├── server.py
├── master.py
├── README.md
├── .gitignore
├── device.py
└── templates
    └── index.html

/images/2021-02-04-04-07-46.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kehanlu/server-monitor/HEAD/images/2021-02-04-04-07-46.png
--------------------------------------------------------------------------------
/images/2021-02-04-04-53-08.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kehanlu/server-monitor/HEAD/images/2021-02-04-04-53-08.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | click==7.1.2
2 | fastapi==0.63.0
3 | gunicorn==20.0.4
4 | h11==0.12.0
5 | httptools==0.1.1
6 | pydantic==1.7.3
7 | starlette==0.13.6
8 | uvicorn==0.13.3
9 | uvloop==0.15.2
10 | 
--------------------------------------------------------------------------------
/server.py:
--------------------------------------------------------------------------------
1 | from fastapi import FastAPI
2 | from fastapi.middleware.cors import CORSMiddleware
3 | 
4 | from device import NvidiaSMI, RAM
5 | 
6 | app = FastAPI()
7 | 
8 | origins = [
9 |     "http://localhost:3000",
10 |     "http://140.118.127.80:3000",
11 | ]
12 | 
13 | app.add_middleware(
14 |     CORSMiddleware,
15 |     allow_origins=origins,
16 |     # Raw string: the backslash escapes the dot in the regex,
17 |     # not in the Python string literal ("\/" is an invalid escape).
18 |     allow_origin_regex=r"https?://.*\.ntust\.edu\.tw",
19 |     allow_methods=["*"],
20 |     allow_headers=["*"],
21 | )
22 | 
23 | 
24 | @app.get("/")
25 | async def root():
26 |     return {
27 |         "nvidia_smi": NvidiaSMI(),
28 |         "ram": RAM()
29 |     }
30 | 
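The `allow_origin_regex` in `server.py` is matched against the request's `Origin` header. The pattern can be sanity-checked outside the app with Python's `re` module (a standalone sketch; the origin URLs below are illustrative, not from the project):

```python
import re

# The origin pattern used in server.py, written as a raw string so the
# backslash escapes the regex dot rather than acting as a string escape.
ORIGIN_RE = re.compile(r"https?://.*\.ntust\.edu\.tw")

# Origins under the ntust.edu.tw domain match in full.
assert ORIGIN_RE.fullmatch("https://lab.ntust.edu.tw")
assert ORIGIN_RE.fullmatch("http://gpu.cs.ntust.edu.tw")

# An unrelated origin does not match.
assert ORIGIN_RE.fullmatch("https://example.com") is None
```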
--------------------------------------------------------------------------------
/master.py:
--------------------------------------------------------------------------------
1 | from flask import Flask
2 | from flask import render_template
3 | import requests
4 | import json
5 | from datetime import datetime
6 | import pytz
7 | from config import CONFIG
8 | 
9 | tz = pytz.timezone("Asia/Taipei")
10 | 
11 | app = Flask(__name__)
12 | app.config["TEMPLATES_AUTO_RELOAD"] = True
13 | 
14 | SITE_TITLE = CONFIG.get("site_title", "Server status")
15 | TOP_MESSAGE = CONFIG.get("top_message", "Hello world")
16 | 
17 | if CONFIG.get("server_ips") is None:
18 |     raise ValueError("config.py must define a 'server_ips' list")
19 | SERVER_IPS = CONFIG.get("server_ips")
20 | 
21 | 
22 | @app.route('/')
23 | def server():
24 |     servers = list()
25 |     now = datetime.now(tz=tz).strftime("%Y-%m-%d %T")
26 |     for ip in SERVER_IPS:
27 |         try:
28 |             resp = requests.get(f"http://{ip}:23333", timeout=5)
29 |             resp.raise_for_status()
30 |             data = json.loads(resp.text)
31 |             data["ip"] = ip
32 |             data["active"] = True
33 |         except (requests.RequestException, json.JSONDecodeError):
34 |             # An unreachable or failing server is shown as inactive
35 |             # instead of crashing the whole status page.
36 |             data = {"ip": ip, "active": False}
37 |         servers.append(data)
38 | 
39 |     context = {"title": SITE_TITLE,
40 |                "top_message": TOP_MESSAGE,
41 |                "now": now,
42 |                "servers": servers}
43 |     return render_template("index.html", **context)
44 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | To monitor RAM and GPU usage on multiple servers.
2 | 
3 | In a computer science lab or company, you usually have multiple servers and GPUs running many deep learning experiments. You want to see at a glance, with minimal setup, which devices are busy and which are available.
4 | 
5 | ## Screenshots
6 | 
7 | ![](images/2021-02-04-04-07-46.png)
8 | 
9 | ## Installation
10 | 
11 | ```shell
12 | git clone https://github.com/kehanlu/server-monitor
13 | cd server-monitor
14 | pip install -r requirements.txt
15 | ```
16 | 
17 | - `nvidia-smi` (ships with the NVIDIA driver): https://www.nvidia.com
18 | 
19 | 
20 | ## Usage
21 | 
22 | ![](images/2021-02-04-04-53-08.png)
23 | 
24 | ### Server
25 | 
26 | A "server" is a machine you want to monitor.
27 | 
28 | 1. Go to the server you want to monitor.
29 |    - Make sure the `nvidia-smi` command is installed.
30 | 
31 | 2. Run the following command to start the API:
32 | 
33 | ```shell
34 | gunicorn -w 1 -k uvicorn.workers.UvicornWorker -b 0.0.0.0:23333 server:app --daemon
35 | ```
36 | 
37 | ### Master
38 | 
39 | The "master" is the web server that fetches data from each server. You can run it on any computer. In some cases, you might want the master to be reachable from the public network while keeping the monitored servers behind a firewall.
40 | 
41 | 1. Create a file named `config.py`.
42 | 
43 | 2. In `config.py`, define a list of server IPs. The master iterates over the list and issues a GET request to `http://{ip}:23333` for each one.
44 | 
45 |    - `server_ips`: the servers to monitor
46 |    - `site_title` (optional): the title of the website
47 |    - `top_message` (optional): the message shown at the top
48 | 
49 | ```python
50 | CONFIG = {
51 |     "site_title": "Server status",
52 |     "top_message": "Hello world",
53 |     "server_ips": [
54 |         "192.168.0.2",
55 |         "192.168.0.3",
56 |         "192.168.0.4",
57 |     ],
58 | }
59 | ```
60 | 
61 | 
62 | 3. Run the following command to start the master:
63 | 
64 | ```shell
65 | gunicorn -w 1 -b 0.0.0.0:8787 master:app
66 | ```
67 | 
68 | 4. Visit `127.0.0.1:8787` (or `<master-ip>:8787` from another machine) to see the website.
69 | 
70 | ## Contribution
71 | 
72 | Pull requests are welcome. This is still an early project (and just for fun).
73 | 
74 | TODOs:
75 | 
76 | - Fast installation script.
77 | - Error handling.
78 | - Use Nginx to serve the sites.
79 | - Use CI/CD to automatically update projects on servers. -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | config.py 2 | 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | .eggs/ 19 | lib/ 20 | lib64/ 21 | parts/ 22 | sdist/ 23 | var/ 24 | wheels/ 25 | pip-wheel-metadata/ 26 | share/python-wheels/ 27 | *.egg-info/ 28 | .installed.cfg 29 | *.egg 30 | MANIFEST 31 | 32 | # PyInstaller 33 | # Usually these files are written by a python script from a template 34 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 35 | *.manifest 36 | *.spec 37 | 38 | # Installer logs 39 | pip-log.txt 40 | pip-delete-this-directory.txt 41 | 42 | # Unit test / coverage reports 43 | htmlcov/ 44 | .tox/ 45 | .nox/ 46 | .coverage 47 | .coverage.* 48 | .cache 49 | nosetests.xml 50 | coverage.xml 51 | *.cover 52 | *.py,cover 53 | .hypothesis/ 54 | .pytest_cache/ 55 | 56 | # Translations 57 | *.mo 58 | *.pot 59 | 60 | # Django stuff: 61 | *.log 62 | local_settings.py 63 | db.sqlite3 64 | db.sqlite3-journal 65 | 66 | # Flask stuff: 67 | instance/ 68 | .webassets-cache 69 | 70 | # Scrapy stuff: 71 | .scrapy 72 | 73 | # Sphinx documentation 74 | docs/_build/ 75 | 76 | # PyBuilder 77 | target/ 78 | 79 | # Jupyter Notebook 80 | .ipynb_checkpoints 81 | 82 | # IPython 83 | profile_default/ 84 | ipython_config.py 85 | 86 | # pyenv 87 | .python-version 88 | 89 | # pipenv 90 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 
91 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 92 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 93 | # install all needed dependencies. 94 | #Pipfile.lock 95 | 96 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 97 | __pypackages__/ 98 | 99 | # Celery stuff 100 | celerybeat-schedule 101 | celerybeat.pid 102 | 103 | # SageMath parsed files 104 | *.sage.py 105 | 106 | # Environments 107 | .env 108 | .venv 109 | env/ 110 | venv/ 111 | ENV/ 112 | env.bak/ 113 | venv.bak/ 114 | 115 | # Spyder project settings 116 | .spyderproject 117 | .spyproject 118 | 119 | # Rope project settings 120 | .ropeproject 121 | 122 | # mkdocs documentation 123 | /site 124 | 125 | # mypy 126 | .mypy_cache/ 127 | .dmypy.json 128 | dmypy.json 129 | 130 | # Pyre type checker 131 | .pyre/ 132 | -------------------------------------------------------------------------------- /device.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | 4 | class GPU(): 5 | def __init__(self, index, name, memory_total, memory_free, memory_used, utilization_gpu): 6 | self.index = index 7 | self.name = name 8 | self.memory_total = int(memory_total.split()[0]) 9 | self.memory_free = int(memory_free.split()[0]) 10 | self.memory_used = int(memory_used.split()[0]) 11 | self.utilization_gpu = int(utilization_gpu.split()[0]) 12 | def __repr__(self): 13 | return f"GPU {self.index}: {self.name:20.20},{self.memory_used:>10}/{self.memory_total:<10},{self.utilization_gpu}" 14 | 15 | def get_info(self): 16 | return { 17 | "index": self.index, 18 | "name": self.name, 19 | "memory_total": self.memory_total, 20 | "memory_free": self.memory_free, 21 | "memory_used": self.memory_used, 22 | "utilization_gpu": self.utilization_gpu 23 | } 24 | 25 | class NvidiaSMI(): 26 | def __init__(self): 27 | query = [ 28 | "index", 29 | "name", 30 | "memory.total", 31 | 
"memory.free", 32 | "memory.used", 33 | "utilization.gpu", 34 | ] 35 | self.gpu_list = list() 36 | for gpu_info in os.popen(f"nvidia-smi --query-gpu={','.join(query)} --format=csv,noheader").readlines(): 37 | gpu_info = gpu_info.strip().split(',') 38 | gpu = dict() 39 | for c,info in zip(query, gpu_info): 40 | c = c.replace(".","_") 41 | gpu[c] = info.strip() 42 | self.gpu_list.append(GPU(**gpu)) 43 | 44 | self.processes = list() 45 | for process in os.popen("nvidia-smi --query-compute-apps=gpu_name,name,used_gpu_memory --format=csv,noheader").readlines(): 46 | p = process.split(',') 47 | self.processes.append({ 48 | "gpu_name": p[0].strip(), 49 | "process_name": p[1].strip(), 50 | "memory_used": p[2].strip() 51 | }) 52 | 53 | def show(self): 54 | for gpu in self.gpu_list: 55 | print(gpu) 56 | 57 | def to_json(self): 58 | return json.dumps({ 59 | "gpus": [gpu.get_info() for gpu in self.gpu_list], 60 | "processes": self.processes 61 | }) 62 | def get_info(self): 63 | return { 64 | "gpus": [gpu.get_info() for gpu in self.gpu_list], 65 | "processes": self.processes 66 | } 67 | 68 | class RAM(): 69 | def __init__(self): 70 | info = os.popen("free -g").read().split('\n')[1].split() 71 | self.info = { 72 | 'total': int(info[1]), 73 | 'used': int(info[2]), 74 | 'available': int(info[6]) 75 | } 76 | 77 | def to_json(self): 78 | return self.info 79 | 80 | def get_info(self): 81 | return self.info -------------------------------------------------------------------------------- /templates/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | {{title}} 8 | 9 | 33 | 34 | 35 | 36 | 37 |
38 |

39 | {{top_message}} 40 |

41 |

Request time: {{now}}

42 | {% for server in servers %} 43 | 44 |
45 |

46 | {{server.ip}} 47 |

48 | 49 | {% if server.active %} 50 |

RAM: {{server.ram.info.used}}GB / {{server.ram.info.total}}GB

51 |
52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | {% for gpu in server.nvidia_smi.gpu_list %} 63 | 64 | 65 | 66 | 71 | 73 | 74 | {% endfor %} 75 | 76 |
IDNameMemory usedUtil gpu
{{gpu.index}}{{gpu.name}} 67 |
{{"%6sMiB /%6sMiB"|format(gpu.memory_used,gpu.memory_total) }}
68 | 70 |
{{gpu.utilization_gpu}} %
77 |
78 |
79 | {% for process in server.nvidia_smi.processes%} 80 |
{{ "%-20s %-30.30s %-10s" | format(process.gpu_name, process.process_name,
81 |                             process.memory_used)}}
82 | {% endfor %} 83 | {% else %} 84 | service down! 85 | 86 | {% endif %} 87 |
88 | {% endfor %} 89 | 90 | 91 | 92 |
93 | 94 | 95 | 96 | 97 | --------------------------------------------------------------------------------
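For reference, the CSV parsing that `device.py` performs on `nvidia-smi --query-gpu` output can be exercised without a GPU by feeding it a canned line (a standalone sketch; the sample GPU and its values are invented for illustration):

```python
# Standalone sketch of the parsing done in device.py's NvidiaSMI class,
# run against a canned line instead of real `nvidia-smi` output.
QUERY = ["index", "name", "memory.total", "memory.free",
         "memory.used", "utilization.gpu"]

# One row of `--format=csv,noheader` output (sample values, made up).
sample_line = "0, GeForce RTX 2080 Ti, 11019 MiB, 10000 MiB, 1019 MiB, 35 %"

# Split on commas and map each field to its query key,
# replacing dots with underscores as device.py does.
fields = [f.strip() for f in sample_line.split(",")]
gpu = {key.replace(".", "_"): value for key, value in zip(QUERY, fields)}

# Numeric fields carry units ("MiB", "%"); keep only the leading number.
memory_total = int(gpu["memory_total"].split()[0])
memory_used = int(gpu["memory_used"].split()[0])
utilization = int(gpu["utilization_gpu"].split()[0])

print(gpu["name"], memory_used, memory_total, utilization)
```

Note that this simple comma split assumes GPU names never contain commas, which holds for NVIDIA product names in practice.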