├── .github
    └── workflows
    │   ├── build-stage.yml
    │   └── test-stage.yml
├── .gitignore
├── Makefile
├── README.md
├── assets
    ├── create-repo-from-template.png
    ├── create-repo-success.png
    ├── result.png
    └── setting.png
├── book.toml
└── src
    ├── SUMMARY.md
    ├── architecture
        └── index.md
    ├── assets
        └── architecture.png
    ├── devops
        ├── GithubWorkflow.md
        ├── LXCContainer.md
        └── index.md
    ├── introduction
        └── index.md
    ├── sensor-manager
        └── index.md
    ├── user-guides
        └── index.md
    └── virtual-sensor
        └── index.md

/.github/workflows/build-stage.yml:
--------------------------------------------------------------------------------
name: Build and deploy
on:
  push:
    branches:
      - master

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: write
  pages: write
  id-token: write

jobs:
  build:
    concurrency: ci-${{ github.ref }} # Recommended if you intend to make multiple deployments in quick succession.
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Install mdbook 🏁
        run: |
          mkdir mdbook
          curl -sSL https://github.com/rust-lang/mdBook/releases/download/v0.4.14/mdbook-v0.4.14-x86_64-unknown-linux-gnu.tar.gz | tar -xz --directory=./mdbook
          echo "$(pwd)/mdbook" >> $GITHUB_PATH
      - name: Build book 🪛
        run: |
          mdbook build

      - name: Deploy 🚀
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          folder: book # The folder the action should deploy.

--------------------------------------------------------------------------------
/.github/workflows/test-stage.yml:
--------------------------------------------------------------------------------
name: Test
on:
  pull_request:
    branches:
      - master

jobs:
  test:
    name: Test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Install Rust
        run: |
          rustup set profile minimal
          rustup toolchain install stable
          rustup default stable
      - name: Install mdbook
        run: |
          mkdir bin
          curl -sSL https://github.com/rust-lang/mdBook/releases/download/v0.4.14/mdbook-v0.4.14-x86_64-unknown-linux-gnu.tar.gz | tar -xz --directory=bin
          echo "$(pwd)/bin" >> $GITHUB_PATH
      - name: Run tests
        run: mdbook test

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
book

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
start:
	mdbook serve --port 3001
clean_branch:
	git branch --merged >/tmp/merged-branches && nano /tmp/merged-branches && xargs git branch -D ]
```

If the `--port` option is not provided, the default port is `3000`.

For the book structure guide, visit .

## Use this template to create your own document page

This section shows how to use `gh-pages` to deploy this book on your own domain.

First, click the `Use this template` button and choose `Create a new repository`.

![image1](./assets/create-repo-from-template.png)

You will then be redirected to GitHub's new-repository page. Fill in the information and submit. The new repository should look like this:

![image2](./assets/create-repo-success.png)

Second, in the newly created repository, go to `Settings` -> `Pages` and configure it like this:

![image3](./assets/setting.png)

Finally, wait a few minutes and refresh the page. You will see the result:

![image4](./assets/result.png)

--------------------------------------------------------------------------------
/assets/create-repo-from-template.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HPCMonitoring/docs/cb2125996ff044406c1cc5a8a551f475e506f25f/assets/create-repo-from-template.png

--------------------------------------------------------------------------------
/assets/create-repo-success.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HPCMonitoring/docs/cb2125996ff044406c1cc5a8a551f475e506f25f/assets/create-repo-success.png

--------------------------------------------------------------------------------
/assets/result.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HPCMonitoring/docs/cb2125996ff044406c1cc5a8a551f475e506f25f/assets/result.png

--------------------------------------------------------------------------------
/assets/setting.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HPCMonitoring/docs/cb2125996ff044406c1cc5a8a551f475e506f25f/assets/setting.png

--------------------------------------------------------------------------------
/book.toml:
--------------------------------------------------------------------------------
[book]
authors = ["phucvinh57"]
language = "en"
multilingual = false
src = "src"
title = "HPC Monitoring Documentation"

--------------------------------------------------------------------------------
/src/SUMMARY.md:
--------------------------------------------------------------------------------
# Summary

- [Introduction](./introduction/index.md)
- [Architecture](./architecture/index.md)
- [Sensor Manager](./sensor-manager/index.md)
- [Virtual Sensor](./virtual-sensor/index.md)
- [Devops](./devops/index.md)
  - [Github workflows settings](./devops/GithubWorkflow.md)
  - [Linux container (LXC)](./devops/LXCContainer.md)
- [User guides](./user-guides/index.md)

--------------------------------------------------------------------------------
/src/architecture/index.md:
--------------------------------------------------------------------------------
# System Architecture

![Architecture](./../assets/architecture.png)

--------------------------------------------------------------------------------
/src/assets/architecture.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/HPCMonitoring/docs/cb2125996ff044406c1cc5a8a551f475e506f25f/src/assets/architecture.png

--------------------------------------------------------------------------------
/src/devops/GithubWorkflow.md:
--------------------------------------------------------------------------------
# Github workflows settings

We recommend running GitHub runners inside a [Linux container](https://linuxcontainers.org/lxc/introduction/) (`LXC`).

For details about self-hosted GitHub runners, visit .

For LXC installation and usage, see our guide [here](./LXCContainer.md).

--------------------------------------------------------------------------------
/src/devops/LXCContainer.md:
--------------------------------------------------------------------------------
# Linux container (LXC)

## Install LXC

The full installation documentation is [here](https://linuxcontainers.org/lxc/getting-started).
For security reasons, we create an unprivileged container as a regular user with the following steps:

### Init configurations

```bash
# Start from the system-wide default configuration
mkdir -p ~/.config/lxc
cp /etc/lxc/default.conf ~/.config/lxc/default.conf
# Look up the current user's subordinate ID ranges (format: name:start:count)
MS_UID="$(grep "^$(id -un):" /etc/subuid | cut -d : -f 2)"
ME_UID="$(grep "^$(id -un):" /etc/subuid | cut -d : -f 3)"
MS_GID="$(grep "^$(id -un):" /etc/subgid | cut -d : -f 2)"
ME_GID="$(grep "^$(id -un):" /etc/subgid | cut -d : -f 3)"
# Map container IDs starting at 0 onto those ranges
echo "lxc.idmap = u 0 $MS_UID $ME_UID" >> ~/.config/lxc/default.conf
echo "lxc.idmap = g 0 $MS_GID $ME_GID" >> ~/.config/lxc/default.conf
```

### Download container

Run this command to start the download:

```bash
systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-create -t download -n hpc-container
```

The console then prints a list of distributions. Choose distribution `centos`, release `7`, and the host computer's architecture. After a successful download, the terminal output should look like this:

```script
Downloading the image index

---
DIST       RELEASE  ARCH     VARIANT  BUILD
---
almalinux  8        amd64    default  20230123_23:10
almalinux  8        arm64    default  20230123_23:14
almalinux  8        ppc64el  default  20230123_23:08
..... Other distribution
---

Distribution:
centos
Release:
7
Architecture:
amd64

Downloading the image index
Downloading the rootfs
Downloading the metadata
The image cache is now ready
Unpacking the rootfs

---
You just created a Centos 7 x86_64 (20230123_22:38) container.
```

### Start container

Start the LXC container with an empty delegated cgroup allocated to it:

```bash
systemd-run --unit=hpc-unit --user --scope -p "Delegate=yes" -- lxc-start hpc-container
```

To confirm its status:

```bash
lxc-info -n hpc-container
lxc-ls -f
```

And get a shell inside it with:

```bash
lxc-attach -n hpc-container
```

Stopping it can be done with:

```bash
lxc-stop -n hpc-container
```

And finally removing it with:

```bash
lxc-destroy -n hpc-container
```

--------------------------------------------------------------------------------
/src/devops/index.md:
--------------------------------------------------------------------------------
# Appendix

The following sections show how to deploy and operate the entire system.

--------------------------------------------------------------------------------
/src/introduction/index.md:
--------------------------------------------------------------------------------
# Introduction

From a thesis in [Ho Chi Minh University of Technology](https://hcmut.edu.vn/) ...

Hello world!

## Our HPC system problem

Specify problems in HCMUT's HPC system, its user story, demands ...

## Why this project

Should compare to other monitoring tools such as `Zabbix`, `Prometheus` ...

Which problems does this project resolve? ...

--------------------------------------------------------------------------------
/src/sensor-manager/index.md:
--------------------------------------------------------------------------------
# Sensor manager

--------------------------------------------------------------------------------
/src/user-guides/index.md:
--------------------------------------------------------------------------------
# User guide

User guide here ... (Web UI, how to interact ...)

--------------------------------------------------------------------------------
/src/virtual-sensor/index.md:
--------------------------------------------------------------------------------
# Virtual Sensor

Virtual sensor descriptions.

## Helpful libraries

- [JSON in modern C++](https://github.com/nlohmann/json)
- [Boost C++ Libraries](https://www.boost.org/)
- [Kafka Client](https://docs.confluent.io/kafka-clients/librdkafka/current/overview.html)

## Data collection

### Payload's content interfaces

```typescript
interface Process {
  name: string
  pid: number
  parentPid: number
  uid: number
  gid: number
  executePath: string
  command: string
  virtualMemoryUsage: number // In KB
  physicalMemoryUsage: number // In KB
  cpuTime: number // In ms
  cpuUsage: number // In %
  networkInBandwidth: number // Which interface?
  networkOutBandwidth: number
  ioWrite: number // In KB
  ioRead: number // In KB
}

interface NetworkInterface {
  name: string
  inBandwidth: number
  outBandwidth: number
}

interface Memory {
  used: number
  available: number
  swapUsed?: number
  swapFree?: number
}

interface Cpu {
  user: number
  nice: number
  system: number
  iowait: number
  steal: number
  idle: number
}

interface IOUsage {
  deviceName: string
  readPerSecond: number
  writePerSecond: number
}

interface DiskUsage {
  filesystemName: string
  used: number // In KB
  available: number // In KB
}
```

### Full payload interface

```typescript
interface KafkaMessage {
  nodeId: number
  timestamp: number
  payload: Process[] | NetworkInterface[] | Memory | Cpu | IOUsage | DiskUsage
  type: 'PROCESS' | 'NETWORK_INTERFACE' | 'MEMORY' | 'CPU' | 'IO_USAGE' | 'DISK_USAGE'
}
```

### Data in namespace

Note that a process running in a container (such as `Docker` or `LXC`) or in a VM has its own namespace.

### Sample data from `/proc/$PID/net/dev` file

|Interface name |Receive ||||||| |Transmit ||||||| |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |bytes|packets|errs|drop|fifo|frame|compressed|multicast|bytes|packets|errs|drop|fifo|colls|carrier|compressed|
|lo|2469224|19558|0|0|0|0|0|0|2469224|19558|0|0|0|0|0|0|

## Helpful tools

Some useful command-line tools:

- `sysstat`
- `df`
- `free`

## Build `systemd` service

--------------------------------------------------------------------------------
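Referring back to the `/proc/$PID/net/dev` sample in the Data collection section: that file reports cumulative counters, while the `NetworkInterface` payload carries bandwidth values. The following is a minimal sketch of one possible way to bridge the two, by parsing two samples and dividing the byte deltas by the elapsed time. The names `NetDevCounters`, `parseNetDev` and `toBandwidth` are hypothetical, the bytes-per-second unit is an assumption (the payload definition does not state one), and the second sample's values are invented for illustration.

```typescript
// Cumulative byte counters for one interface (hypothetical helper type).
interface NetDevCounters {
  name: string
  rxBytes: number
  txBytes: number
}

// Same shape as the NetworkInterface payload defined in the Data collection section.
interface NetworkInterface {
  name: string
  inBandwidth: number
  outBandwidth: number
}

// Parse the raw text of /proc/$PID/net/dev. The first two lines are headers;
// each remaining line is "<name>: <8 receive fields> <8 transmit fields>".
function parseNetDev(text: string): NetDevCounters[] {
  const counters: NetDevCounters[] = []
  for (const line of text.split('\n').slice(2)) {
    const sep = line.indexOf(':')
    if (sep === -1) continue // blank or malformed line
    const fields = line.slice(sep + 1).trim().split(/\s+/).map(Number)
    if (fields.length < 16 || fields.some((f) => isNaN(f))) continue
    counters.push({
      name: line.slice(0, sep).trim(),
      rxBytes: fields[0], // receive "bytes" column
      txBytes: fields[8] // transmit "bytes" column
    })
  }
  return counters
}

// Turn two samples taken `elapsedSeconds` apart into per-second rates.
// Assumption: bandwidth is expressed in bytes per second.
function toBandwidth(
  prev: NetDevCounters[],
  curr: NetDevCounters[],
  elapsedSeconds: number
): NetworkInterface[] {
  const result: NetworkInterface[] = []
  for (const now of curr) {
    const before = prev.filter((p) => p.name === now.name)[0]
    if (before === undefined || elapsedSeconds <= 0) continue
    result.push({
      name: now.name,
      inBandwidth: (now.rxBytes - before.rxBytes) / elapsedSeconds,
      outBandwidth: (now.txBytes - before.txBytes) / elapsedSeconds
    })
  }
  return result
}

const header =
  'Inter-|   Receive                                                |  Transmit\n' +
  ' face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed\n'
// First sample: the `lo` row from the table in this document.
const earlier = parseNetDev(header + '    lo: 2469224 19558 0 0 0 0 0 0 2469224 19558 0 0 0 0 0 0\n')
// Hypothetical second sample, taken 2 seconds later.
const later = parseNetDev(header + '    lo: 2471224 19570 0 0 0 0 0 0 2470224 19566 0 0 0 0 0 0\n')
// For `lo`: inBandwidth is 1000 and outBandwidth is 500 (bytes per second).
console.log(toBandwidth(earlier, later, 2))
```

Keeping the parser separate from the rate calculation means the same counters could also feed the `networkInBandwidth` and `networkOutBandwidth` fields of `Process`, once it is decided which interface those fields refer to.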