├── Assignments └── Exercise.md ├── Lectures ├── Lecture01.pdf ├── Lecture02.pdf ├── Lecture03.pdf ├── Lecture04.pdf ├── Lecture05.pdf └── Software containers intro and hands-on.pdf ├── README.md └── Tutorials ├── Containers └── SlurmID │ ├── .gitignore │ ├── LICENSE │ ├── README.md │ ├── docker-compose.yml │ ├── services │ ├── slurmbase │ │ ├── Dockerfile │ │ ├── entrypoint.sh │ │ ├── keys │ │ │ ├── authorized_keys │ │ │ ├── id_rsa │ │ │ └── id_rsa.pub │ │ ├── munge.key │ │ ├── prestartup.py │ │ ├── prestartup_slurmbase.sh │ │ ├── slurm.conf │ │ ├── sudoers │ │ ├── supervisord.conf │ │ ├── supervisord_munge.conf │ │ └── supervisord_sshd.conf │ ├── slurmcluster │ │ ├── Dockerfile │ │ └── supervisord_slurmd.conf │ ├── slurmclustermaster │ │ ├── Dockerfile │ │ ├── prestartup_slurmclustermaster.sh │ │ ├── supervisord_slurmctld.conf │ │ └── test_job.sh │ └── slurmclusterworker │ │ └── Dockerfile │ └── slurmid │ ├── build │ ├── clean │ ├── logs │ ├── ps │ ├── rerun │ ├── run │ ├── shell │ └── status └── VirtualMachine ├── README.md ├── UTMSlurmCluster.md └── slurm.conf /Assignments/Exercise.md: -------------------------------------------------------------------------------- 1 | # Exercise for the course Cloud Computing Basic 2 | 3 | This is the exercise for the 2023/2024 Cloud Computing course of Prof. Taffoni and Ruggero. 4 | 5 | Version `1.0`: this document may be revised several times over the next few days to improve the clarity of the information and to better explain what we are asking for. 6 | 7 | ## Rules 8 | 9 | - The exercise should be done individually: no groups, please! 10 | - Materials (code/scripts/pictures and the final report) should be prepared in a GitHub repository, starting from this one, and shared with the teachers. 11 | - The report should be sent by e-mail to the teachers at least five days in advance: the file name should be `YOURSURNAME_report.pdf` 12 | - The results and numbers of the exercise should be presented (also with the help of slides) in a presentation of at most 10 minutes: this will be part of the exam. A few more questions on the topics of the course will be asked at the end of the presentation. 13 | 14 | ***deadlines*** 15 | 16 | You should send us the e-mail at least one week before the exam. For the first two scheduled "appelli" this means: 17 | - exam scheduled on 1.02.2024 **deadline 28.01.2024 at midnight** 18 | - exam scheduled on 23.02.2024 **deadline 20.02.2024 at midnight** 19 | The report should clearly explain which software stack we should use to deploy the developed infrastructure and run all the programs you used in your exercise. Providing well-crafted Makefiles/Dockerfiles/scripts to automate the work is highly appreciated. 20 | 21 | # The exercise: Cloud-Based File Storage System 22 | 23 | You are tasked with identifying, deploying, and implementing a cloud-based file storage system. The system should allow users to upload, download, and delete files. Each user should have a private storage space. The system should be scalable, secure, and cost-efficient. Suggested solutions to use for the exam are Nextcloud and MinIO. 24 | 25 | ## Requirements 26 | 27 | The deployed platform should be able to: 28 | 29 | Manage User Authentication and Authorization: 30 | - Users should be able to sign up, log in, and log out. 31 | - Users should have different roles (e.g., regular user and admin). 32 | - Regular users should have their own private storage space. 33 | - Admins should have the ability to manage users.
34 | 35 | Manage File Operations: 36 | - Users should be able to upload files to their private storage. 37 | - Users should be able to download files from their private storage. 38 | - Users should be able to delete files from their private storage. 39 | 40 | Address Scalability: 41 | - Design the system to handle a growing number of users and files. 42 | - Discuss theoretically how you would handle increased load and traffic. 43 | 44 | Address Security: 45 | - Implement secure file storage and transmission. 46 | - Discuss how you would secure user authentication. 47 | - Discuss measures to prevent unauthorized access. 48 | 49 | Discuss Cost-Efficiency: 50 | - Consider the cost implications of your design. 51 | - Discuss how you would optimize the system for cost efficiency. 52 | 53 | Deployment: 54 | - Provide a deployment plan for your system in a containerized environment on your laptop, based on Docker and Docker Compose (a minimal compose sketch is given at the end of this document). 55 | - Discuss how you would monitor and manage the deployed system. 56 | - Choose a cloud provider that could be used to deploy the system in production and justify your choice. 57 | 58 | Test your infrastructure: 59 | - Consider the performance of your system in terms of load and I/O operations. 60 | 61 | ## Submission details 62 | 63 | Documentation: 64 | - Submit a detailed design document explaining your choices and describing the platform's architecture, including components, databases, and their interactions. 65 | - Include a section on the security measures taken. 66 | 67 | Code: 68 | - Submit the Dockerfiles and any code you developed or modified for your cloud-based file storage system. 69 | - Include a README file with instructions on how to deploy and use your system. 70 | 71 | Presentation: 72 | - Prepare a short presentation summarizing your design, implementation, and any interesting challenges you faced. 73 | - Be ready to answer questions about your design choices and on the topics discussed during the Cloud course lectures. 74 | 75 | ## Evaluation Criteria 76 | 77 | - Design Clarity: Is the system design well-documented and clear? 78 | - Functionality: Does the system meet the specified requirements? 79 | - Scalability: How well does the system handle increased load? How does the system perform on small files (a few KB), large files (GBs), and average-sized files (MBs)? 80 | - Security: Are appropriate security measures implemented? 81 | - Cost-Efficiency: Has the student considered cost implications and optimized the system accordingly?
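As an illustration of the kind of containerized deployment requested above (a minimal sketch only, not a prescribed solution), a single-node MinIO service could be described with a compose file along these lines; the service name, ports, credentials and data path are placeholders to adapt:

```yaml
version: '3'
services:
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    environment:
      # Placeholder credentials: change them before any real use
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=change-me-please
    volumes:
      - ./data/minio:/data   # persist objects on the laptop's disk
```

A Nextcloud-based deployment would follow the same pattern, typically adding a database service (e.g. MariaDB) and a volume for the Nextcloud data directory.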
82 | 83 | 84 | -------------------------------------------------------------------------------- /Lectures/Lecture01.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Lecture01.pdf -------------------------------------------------------------------------------- /Lectures/Lecture02.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Lecture02.pdf -------------------------------------------------------------------------------- /Lectures/Lecture03.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Lecture03.pdf -------------------------------------------------------------------------------- /Lectures/Lecture04.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Lecture04.pdf -------------------------------------------------------------------------------- /Lectures/Lecture05.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Lecture05.pdf -------------------------------------------------------------------------------- /Lectures/Software containers intro and hands-on.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Foundations-of-HPC/Cloud-Basic-2023/218ee4bc46d278a1edfc2fef670969ef4352b56e/Lectures/Software containers intro and hands-on.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Cloud-Computing, base module 2023 2 | 3 | Lecture slides, code, and materials for the 3-credit base module of the Cloud Computing course 4 | 5 | 6 | #### Teachers 7 | 8 | **Giuliano Taffoni**, INAF - National Institute for Astrophysics, Astronomical Observatory of Trieste 9 | 10 | **Stefano Alberto Russo**, INAF - National Institute for Astrophysics, Astronomical Observatory of Trieste 11 | 12 | Link to the teams channel: to come 13 | 14 | Computational resource to be used: the ORFEO cluster, see https://orfeo-doc.areasciencepark.it/ 15 | 16 | #### Google Drive for the recordings 17 | 18 | The recordings of the lectures will be transferred [here](https://drive.google.com/drive/folders/1x2tOBLZtr99eCy8o0wRfOkYEErShw2UZ?usp=sharing) 19 | 20 | 21 | #### Prerequisites 22 | 23 | - decent knowledge of the Linux command-line interface 24 | - decent knowledge of a programming language (C and/or C++, but even Fortran is fine) 25 | - decent knowledge of a scripting language (Python is fine, but bash, awk and perl are also welcome) 26 | 27 | #### CALENDAR 28 | 29 | We plan to provide 10-12 hours of frontal lectures, 10 hours of labs/tutorials, and a few more hours of seminars. 30 | The schedule is given below. 31 | Seminars will be announced during the lectures; they will generally be held on Wednesdays from 1 to 2 PM. 32 | 33 | Lecture rooms: 34 | 35 | 36 | The course should be completed by the end of November.
37 | The first week of January is spare time, in case something has been left behind. 38 | 39 | 40 | | DATE | LECTURE | TUTORIALS | 41 | | :---------- | :---------------------------------------------| :--------------------------------------------- | 42 | | Fri, Sep 29 | Introduction to Cloud | | 43 | | Fri, Oct 06 | Cloud Architecture | | 44 | | Fri, Oct 13 | Virtualization | | 45 | | Fri, Oct 20 | Virtualization | | 46 | | Fri, Oct 27 | | Lab: using AWS cloud | 47 | | Mon, Oct 30 | | Virtual Machines | 48 | | Mon, Nov 06 | Seminar: Cloud in Industry | | 49 | | Mon, Nov 13 | Seminar 3: TBD | | 50 | | Thu, Nov 16 | Service Model | Virtual Machine | 51 | | Fri, Nov 17 | Lab: Containers | | 52 | | Fri, Nov 17 | | Lab: Containers | 53 | | Mon, Nov 20 | Data Cloud | | 54 | | Fri, Nov 24 | Containers | | 55 | | Fri, Nov 24 | | Lab: Containers | 56 | | Thu, Nov 30 | Cloud Security | | 57 | 58 | 59 | Exam sessions of the first semester 60 | 61 | | DATE | TIME | LOCATION | 62 | | :---------- | :---------------------------------------------| :--------------------------------------------- | 63 | | Thu, Feb 1 2024 | 09:00-14:00 | Osservatorio Astronomico di Trieste, via Bazzoni 2 | 64 | | Fri, Feb 23 2024 | 13:00-18:00 | Osservatorio Astronomico di Trieste, via Bazzoni 2 | 65 | 66 | 67 | 68 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/.gitignore: -------------------------------------------------------------------------------- 1 | data 2 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below).
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "{}" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright {yyyy} {name of copyright owner} 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | 203 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/README.md: -------------------------------------------------------------------------------- 1 | # Slurm-in-Docker (SlurmID) 2 | 3 | A Slurm cluster in a set of Docker containers. Requires Docker, Docker Compose and Bash. 
4 | 5 | ## Quickstart 6 | 7 | Build 8 | 9 | $ slurmid/build 10 | 11 | Run 12 | 13 | $ slurmid/run 14 | 15 | List running services 16 | 17 | $ slurmid/ps 18 | 19 | Check status 20 | 21 | $ slurmid/status 22 | 23 | Execute demo job 24 | 25 | $ slurmid/shell slurmclustermaster 26 | $ sudo su -l testuser 27 | $ sinfo # Check cluster status 28 | $ sbatch -p partition1 /examples/test_job.sh 29 | $ squeue # Check queue status 30 | $ cat results.txt 31 | 32 | Clean 33 | 34 | $ slurmid/clean 35 | 36 | ## Configuration 37 | 38 | By default, the `/shared` folder is shared between all services, and so is the `/home/testuser` folder. Both are persistent and stored locally in the `data` folder in the project's root directory (where this file is located). 39 | 40 | ## Logs 41 | 42 | Check out logs for Docker containers (including entrypoints): 43 | 44 | 45 | $ slurmid/logs slurmclustermaster 46 | 47 | $ slurmid/logs slurmclusterworker-one 48 | 49 | 50 | Check out logs for supervisord services: 51 | 52 | $ slurmid/logs slurmclustermaster slurmctld 53 | 54 | $ slurmid/logs slurmclusterworker-one munged 55 | 56 | $ slurmid/logs slurmclusterworker-one slurmd 57 | 58 | ## Building errors 59 | 60 | It is common for the build process to fail with a "404 not found" error on an apt-get instruction, as apt repositories often change their IP addresses. In such a case, try: 61 | 62 | $ slurmid/build nocache 63 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/docker-compose.yml: -------------------------------------------------------------------------------- 1 | version: '3' 2 | services: 3 | 4 | slurmclustermaster: 5 | image: "slurmid/slurmclustermaster" 6 | container_name: slurmclustermaster 7 | hostname: slurmclustermaster 8 | restart: unless-stopped 9 | environment: 10 | - SAFEMODE=False 11 | volumes: 12 | - ./data/slurmclustermaster/log:/var/log/supervisord 13 | - ./data/shared:/shared 14 | 15 | slurmclusterworker-one: 16 | image: "slurmid/slurmclusterworker" 17 | container_name: slurmclusterworker-one 18 | hostname: slurmclusterworker-one 19 | restart: unless-stopped 20 | environment: 21 | - SAFEMODE=False 22 | volumes: 23 | - ./data/slurmclusterworker-one/log:/var/log/supervisord 24 | - ./data/shared:/shared 25 | 26 | slurmclusterworker-two: 27 | image: "slurmid/slurmclusterworker" 28 | container_name: slurmclusterworker-two 29 | hostname: slurmclusterworker-two 30 | restart: unless-stopped 31 | environment: 32 | - SAFEMODE=False 33 | volumes: 34 | - ./data/slurmclusterworker-two/log:/var/log/supervisord 35 | - ./data/shared:/shared -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM ubuntu:18.04 2 | MAINTAINER Stefano Alberto Russo 3 | 4 | #---------------------- 5 | # Basics 6 | #---------------------- 7 | 8 | # Set non-interactive 9 | ENV DEBIAN_FRONTEND noninteractive 10 | 11 | # Update 12 | RUN apt-get update 13 | 14 | # Utilities 15 | RUN apt-get install -y nano telnet unzip wget supervisor openssh-server 16 | 17 | # Devel 18 | RUN apt-get install -y build-essential python-dev git-core 19 | 20 | # Java 21 | RUN apt-get install -y openjdk-8-jre 22 | 23 | # IP utilities (mandatory for DNS!) 24 | RUN apt-get install net-tools iproute2 iputils-ping -y 25 | 26 | 27 | #------------------------ 28 | # SlurmID user 29 | #------------------------ 30 | 31 | # Add group.
We chose GID 65527 to try to avoid conflicts. 32 | RUN groupadd -g 65527 slurmid 33 | 34 | # Add user. We chose UID 65527 to try to avoid conflicts. 35 | RUN useradd slurmid -d /slurmid -u 65527 -g 65527 -m -s /bin/bash 36 | 37 | # Add slurmid user to sudoers 38 | RUN adduser slurmid sudo 39 | 40 | # Keys 41 | RUN mkdir /slurmid/.ssh 42 | COPY keys/authorized_keys /slurmid/.ssh/ 43 | COPY keys/id_rsa /slurmid/.ssh/ 44 | RUN chmod 0600 /slurmid/.ssh/id_rsa 45 | COPY keys/id_rsa.pub /slurmid/.ssh/ 46 | RUN chown -R slurmid:slurmid /slurmid/.ssh 47 | 48 | # Install sudo 49 | RUN apt-get install sudo -y 50 | 51 | # No pass sudo (for everyone, actually) 52 | COPY sudoers /etc/sudoers 53 | 54 | # bash_profile for loading correct env (/env.sh created by entrypoint.sh) 55 | RUN echo "source /env.sh" > /slurmid/.bash_profile 56 | RUN chown slurmid:slurmid /slurmid/.bash_profile 57 | 58 | #------------------------ 59 | # Data, Logs and opt dirs 60 | #------------------------ 61 | 62 | # Create dirs 63 | RUN mkdir /data && mkdir /var/log/slurmid 64 | 65 | # Give right permissions 66 | RUN chown -R slurmid:slurmid /data && chown -R slurmid:slurmid /var/log/slurmid 67 | 68 | 69 | #---------------------- 70 | # Supervisord conf 71 | #---------------------- 72 | 73 | COPY supervisord.conf /etc/supervisor/ 74 | 75 | 76 | #---------------------- 77 | # SSH conf 78 | #---------------------- 79 | 80 | RUN mkdir /var/run/sshd && chmod 0755 /var/run/sshd 81 | COPY supervisord_sshd.conf /etc/supervisor/conf.d/ 82 | 83 | 84 | #---------------------- 85 | # Prestartup scripts 86 | #---------------------- 87 | 88 | # Create dir for prestartup scripts and copy main script 89 | RUN mkdir /prestartup 90 | COPY prestartup.py / 91 | 92 | 93 | #---------------------- 94 | # Slurm 95 | #---------------------- 96 | 97 | # Install Slurm 98 | RUN apt-get -y install slurm-wlm 99 | 100 | # Explicitly create /var/run/ dirs 101 | RUN mkdir -p /var/run/munge 102 | RUN mkdir -p /var/run/slurm-wlm 103 | 104 | # Add munge key and set permissions 105 | COPY munge.key /etc/munge/munge.key 106 | RUN chown munge:munge /etc/munge/munge.key 107 | RUN chmod 0400 /etc/munge/munge.key 108 | 109 | # Add munge daemon supervisord conf 110 | COPY supervisord_munge.conf /etc/supervisor/conf.d/ 111 | 112 | # Add Slurm conf 113 | COPY slurm.conf /etc/slurm-llnl/slurm.conf 114 | 115 | # TODO: why do we need this?
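# A likely reason (assumption, not verified by the original author): the Ubuntu slurm-wlm packaging and this slurm.conf mix the older "slurm-llnl" and the newer "slurm-wlm" directory names, so the symlinks below make both paths resolve to the same state and log directories.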
116 | RUN ln -s /var/lib/slurm-llnl /var/lib/slurm-wlm 117 | RUN ln -s /var/log/slurm-llnl /var/log/slurm-wlm 118 | 119 | 120 | #---------------------- 121 | # Test user and 122 | # prestartup 123 | #---------------------- 124 | 125 | # Add testuser user 126 | RUN useradd testuser 127 | RUN mkdir -p /home/testuser/.ssh 128 | RUN cat /slurmid/.ssh/id_rsa.pub >> /home/testuser/.ssh/authorized_keys 129 | RUN chown -R testuser:testuser /home/testuser 130 | RUN usermod -s /bin/bash testuser 131 | 132 | # Add prestartup 133 | COPY prestartup_slurmbase.sh /prestartup/ 134 | RUN touch -m /prestartup/prestartup_slurmbase.sh 135 | 136 | 137 | #---------------------- 138 | # Entrypoint 139 | #---------------------- 140 | 141 | # Copy entrypoint 142 | COPY entrypoint.sh / 143 | 144 | # Give right permissions 145 | RUN chmod 755 /entrypoint.sh 146 | 147 | # Set entrypoint 148 | ENTRYPOINT ["/entrypoint.sh"] 149 | 150 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/entrypoint.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Exit on any error. More complex logics could be implemented in future 4 | # (see https://stackoverflow.com/questions/4381618/exit-a-script-on-error) 5 | set -e 6 | 7 | echo "" 8 | echo "[INFO] Executing entrypoint..." 9 | 10 | #--------------------- 11 | # Prestartup scripts 12 | #--------------------- 13 | 14 | if [ "x$SAFEMODE" == "xFalse" ]; then 15 | echo "[INFO] Executing prestartup scripts (parents + current):" 16 | python /prestartup.py 17 | else 18 | echo "[INFO] Not executing prestartup scripts as we are in safemode" 19 | fi 20 | 21 | 22 | #--------------------- 23 | # Save env 24 | #--------------------- 25 | echo "[INFO] Dumping env" 26 | 27 | # Save env vars for later usage (e.g. ssh) 28 | 29 | env | \ 30 | while read env_var; do 31 | if [[ $env_var == HOME\=* ]]; then 32 | : # Skip HOME var 33 | elif [[ $env_var == PWD\=* ]]; then 34 | : # Skip PWD var 35 | else 36 | echo "export $env_var" >> /env.sh 37 | fi 38 | done 39 | 40 | #--------------------- 41 | # Entrypoint command 42 | #--------------------- 43 | # Start! 
44 | 45 | 46 | if [[ "x$@" == "x" ]] ; then 47 | ENTRYPOINT_COMMAND="supervisord" 48 | else 49 | ENTRYPOINT_COMMAND=$@ 50 | fi 51 | 52 | echo -n "[INFO] Executing Docker entrypoint command: " 53 | echo $ENTRYPOINT_COMMAND 54 | exec "$ENTRYPOINT_COMMAND" 55 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/keys/authorized_keys: -------------------------------------------------------------------------------- 1 | ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC2n4wiLiRmE1sla5+w0IW3wwPW/mqhhkm7IyCBS+rGTgnts7xsWcxobvamNdD6KSLNnjFZbBb7Yaf/BvWrwQgdqIFVU3gRWHYzoU6js+lKtBjd0e2DAVGivWCKEkSGLx7zhx7uH/Jt8kyZ4NaZq0p5+SFHBzePdR/1rURd8G8+G3OaCPKqP+JQT4RMUQHC5SNRJLcK1piYdmhDiYEyuQG4FlStKCWLCXeUY2EVirNMeQIfOgbUHJsVjH07zm1y8y7lTWDMWVZOnkG6Ap5kB+n4l1eWbslOKgDv29JTFOMU+bvGvYZh70lmLK7Hg4CMpXVgvw5VF9v97YiiigLwvC7wasBHaASwH7wUqakXYhdGFxJ23xVMSLnvJn4S++4L8t8bifRIVqhT6tZCPOU4fdOvJKCRjKrf7gcW/E33ovZFgoOCJ2vBLIh9N9ME0v7tG15JpRtgIBsCXwLcl3tVyCZJ/eyYMbc3QJGsbcPGb2CYRjDbevPCQlNavcMdlyrNIke7VimM5aW8OBJKVh5wCNRpd9XylrKo1cZHYxu/c5Lr6VUZjLpxDlSz+IuTn4VE7vmgHNPnXdlxRKjLHG/FZrZTSCWFEBcRoSa/hysLSFwwDjKd9nelOZRNBvJ+NY48vA8ixVnk4WAMlR/5qhjTRam66BVysHeRcbjJ2IGjwTJC5Q== docker@dev.ops 2 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/keys/id_rsa: -------------------------------------------------------------------------------- 1 | -----BEGIN RSA PRIVATE KEY----- 2 | MIIJKQIBAAKCAgEAtp+MIi4kZhNbJWufsNCFt8MD1v5qoYZJuyMggUvqxk4J7bO8 3 | bFnMaG72pjXQ+ikizZ4xWWwW+2Gn/wb1q8EIHaiBVVN4EVh2M6FOo7PpSrQY3dHt 4 | gwFRor1gihJEhi8e84ce7h/ybfJMmeDWmatKefkhRwc3j3Uf9a1EXfBvPhtzmgjy 5 | qj/iUE+ETFEBwuUjUSS3CtaYmHZoQ4mBMrkBuBZUrSgliwl3lGNhFYqzTHkCHzoG 6 | 1BybFYx9O85tcvMu5U1gzFlWTp5BugKeZAfp+JdXlm7JTioA79vSUxTjFPm7xr2G 7 | Ye9JZiyux4OAjKV1YL8OVRfb/e2IoooC8Lwu8GrAR2gEsB+8FKmpF2IXRhcSdt8V 8 | TEi57yZ+EvvuC/LfG4n0SFaoU+rWQjzlOH3TrySgkYyq3+4HFvxN96L2RYKDgidr 9 | wSyIfTfTBNL+7RteSaUbYCAbAl8C3Jd7VcgmSf3smDG3N0CRrG3Dxm9gmEYw23rz 10 | wkJTWr3DHZcqzSJHu1YpjOWlvDgSSlYecAjUaXfV8payqNXGR2Mbv3OS6+lVGYy6 11 | cQ5Us/iLk5+FRO75oBzT513ZcUSoyxxvxWa2U0glhRAXEaEmv4crC0hcMA4ynfZ3 12 | pTmUTQbyfjWOPLwPIsVZ5OFgDJUf+aoY00WpuugVcrB3kXG4ydiBo8EyQuUCAwEA 13 | AQKCAgEAh0Vm52qGS5XKzc0KXE4YviUVkwqgsURnGNbMHPm+zWTAtfGMgDWD01de 14 | G3+Ba8tMnEGxDCukWk/bwGvHTZGOEWnfYvSQ20hLRbMWLOv2wf7k7GmzJHa1oXXl 15 | LGCboUkGBBzyLDA9wnLXiqOgUfMvF2oR3CrcXMbFBZVyLqMJw1dSKaa3GKR5XkOI 16 | G39lbpeLsW8gpkaOgWAzmtMfgBLJ0zG3RwuVw4cfrCpwnyQ960c26ypwJG2L8ko9 17 | +S7Oo3a+JdtK+BK0e0d+J+oIqM+z3w87MZKeSeeTChgpkqDGE6NoE64O/DvigmxW 18 | ijI95fApIaBjXWRu74gizUKtKuQ5X1pvo1zyQXWqhcaFnB4fv7+kI4L7JwlY4QIf 19 | CLEjYfZFXCtmRo6QPn/09OPiU8xgimqVdIfr7JYjDMoEyMW9vfy5EJmtwS9M41tJ 20 | 2gDbhw1fhwUVW1MsJjLuboMXudsubGvGUy+jB48YPQs2Yx13NgUu15jtvPxVCC9v 21 | CdnaL6PJtloSXh5zYpapUg2UN5oH48BLw1hWFoDBcgzTxlCjyEJGtem9QM1Y997e 22 | z561gw8iu1vw0XDuv5zd7qzyIgAYuB8b3Pe6Rg+V2jennKvymMrtCvUNcLRs1pF8 23 | LV0t9rTQzQWP5d8AmxywZfgXaQ0zcrTTd2rkjwf/yBH5yNIhDAECggEBAOl6K4pA 24 | EHsWjGy1IrvhoDztbLkkLzxLvVbkjKT9NJwHucL+oekfxLt/c+1+8mNUjiMHyLd8 25 | cH+R2Lyy1YhfBrT92gPfRRHUBLx+XS0P3p0dj3N+U+C//WAaMS5mb+pkTUFGLQ8g 26 | vRHPHt0rAjvzpMUCNUtO+o11srZIOjLOLYkxSIDqwFXFWDyCgfqYev1jkNDivILk 27 | HjeNrz3G5XpIBQdclZtX1f9yII5EfA6ChUGOLIAMwY1Mr6gTJTKtE3Q6anC0AgoW 28 | ugw5oTSZpKySCKjf20AVcKvPBA3Tq+TBR10XmSTwL6r0bzuptXJBr+teOsnvs1+g 29 | qhwgqExgFrkLf30CggEBAMg9g5VtYtmSfKxFR/59YSwjgRUSz7xFzsdUnbYN71X1 30 | fd7o5htmEIitGRshzbRLCE85cW6TlGC02RiH9wQctc288BkYPYMW7LXRluKSXm+J 31 | 
WlEwiWsGAJzsNK8U6fxCM0ZsX2AQ3tVSRHnhl7a/CByUQZFS/M+aTLWuOZQZElyK 32 | PqsCw4eD6GbTk2qtlkxp8Gc/eAnii4KWfb6yvx5dgJXz1Nuu/ueZI+lmEP+EgubD 33 | /f9vUzNFHgcU0+z2bH49gvUJ6t9nIAJ4HsHvoI6L286YVzR7qCP5inVksRspVLPP 34 | iH8EDr4QhLnCh4GZiWy1JBpm/Zg+YcibQKxacs/nfYkCggEAXby3DmJ6O3DqIBr5 35 | PwVvGAcax5pHfKXL9r772KHwJVTUt/0TdE1U5xJcsNVu64JfLqFJbKGBaTZdFiWW 36 | pZHBV5kzlqplSKse267AKf9dGSdtGKl3c5yhVZwucrqd5DUw7ywFmzVBs4y8j39c 37 | /kTruk0QqJOk9HZ0scp90zgEADjRKzEU11rL+j9LgBkICAOZeMQPe12q5BL2cI8S 38 | Qu33VuVNC3lQaaage33zcL/mUFOMejyk2N4ZCBnnrVjfnqJ1aZpb10EYoR/iIQQu 39 | oTpgT6zQkgIJonES55o8QTN4O1/mFHZ6LODGZ+XS+3Rz9MN4Rur90T7oDTLvXvqV 40 | JOYA4QKCAQEAluueKFq4nUnGQ8U3/Pyc57qeyLZT8hAfSKdi8ttP31bXFtIs1Mu5 41 | fHoSqRtyQggnbCbccr4yoCzOT6nyqJvG/xj/UbquagY2RNeCRKSTHrfEZdsSR6LP 42 | hXaWQrudm659nP+DZxFwEhIeYEqCoY8b2wZ24MROnV4roOd+qDu5VhwwHY5ItvPZ 43 | jt66hjXtSQyzz+3LWI/yHGu2vKtWVtmcV+jeLvGXWBFZOsnd1+gVDT79Sq+qYsMe 44 | XbH6BOi6Xu+Xq35dEyJTwuisLfmg5q9M7Uput7TXxr2G+PH6doFRQPETbMAvKFuk 45 | 3albnneNV2yzmF61ljC2XI9/UCgfzskoGQKCAQBcgsPCQREaEiMvfmWjoDeip/Cy 46 | c0QzTJ6Oy5kVxfjHxRhEZyjKPBbXLGjewLoUfuBJvOJ7Iqadv5vP2AOUS0KMkmwt 47 | w0rIUhk9WaLo+f4Fci1d14CPs59w2GYhSniGOT/qiPprUZVUr+J0fJ6q2i7kRUTR 48 | gLmSxLEKbHUTKJVTJ0wviIHZYHA+WIQzK1j2NdVIjpLNRXaV4+g0vDBnmCovbBgy 49 | VkyXcPF8q/aDjPcDb9cyCxt4PJQRrP7n959Y2sIjyVwAIEg5wzFuPp3LG+ITnLpG 50 | TtrkLRzqxPKqAY0p4D/7exFyk4SeUHFWWifs7uYeflw3vxN+VmazFE4WdXh3 51 | -----END RSA PRIVATE KEY----- 52 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/keys/id_rsa.pub: -------------------------------------------------------------------------------- 1 | ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQC2n4wiLiRmE1sla5+w0IW3wwPW/mqhhkm7IyCBS+rGTgnts7xsWcxobvamNdD6KSLNnjFZbBb7Yaf/BvWrwQgdqIFVU3gRWHYzoU6js+lKtBjd0e2DAVGivWCKEkSGLx7zhx7uH/Jt8kyZ4NaZq0p5+SFHBzePdR/1rURd8G8+G3OaCPKqP+JQT4RMUQHC5SNRJLcK1piYdmhDiYEyuQG4FlStKCWLCXeUY2EVirNMeQIfOgbUHJsVjH07zm1y8y7lTWDMWVZOnkG6Ap5kB+n4l1eWbslOKgDv29JTFOMU+bvGvYZh70lmLK7Hg4CMpXVgvw5VF9v97YiiigLwvC7wasBHaASwH7wUqakXYhdGFxJ23xVMSLnvJn4S++4L8t8bifRIVqhT6tZCPOU4fdOvJKCRjKrf7gcW/E33ovZFgoOCJ2vBLIh9N9ME0v7tG15JpRtgIBsCXwLcl3tVyCZJ/eyYMbc3QJGsbcPGb2CYRjDbevPCQlNavcMdlyrNIke7VimM5aW8OBJKVh5wCNRpd9XylrKo1cZHYxu/c5Lr6VUZjLpxDlSz+IuTn4VE7vmgHNPnXdlxRKjLHG/FZrZTSCWFEBcRoSa/hysLSFwwDjKd9nelOZRNBvJ+NY48vA8ixVnk4WAMlR/5qhjTRam66BVysHeRcbjJ2IGjwTJC5Q== slurmid@slurmid.platform 2 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/munge.key: -------------------------------------------------------------------------------- 1 | nVSHnJzgIVuvfxPf2AJtHlEA6jWeJdLhxdujd04/ZPysw3MZk7QaURlIOMArbXpH8w37bsuAFw9G 2 | ulS6bWPIaja8JiLYqjdQQApLBhCaqK27nid3siIiZkzU2J4IEaCV3KneOPZax9xuIJyDIHb5ailI 3 | V7YQXddTprZt/uluhkAYNwaVQt6PvXLH3Kofa2M/rEPVMc8VaYgsmHq3GGkwqZ3tqNyE0GienWal 4 | zM56vJJrUZtEPc/IK8Sl/0QRkCWXxLOr1XRrNf6w9CbI3Wx44LF22L0NcaV6WLYDHbIn2raTod+5 5 | nSHD0Mrnvmvx70kcKCsDOftcoj4d8eRlRQqnY0hM2NevziI4U0d9Ejo0CscTc0YiGsntmGh8SgPl 6 | YHUhwjsj5DznVkyTj2ilDVkXDMdnKH36cd/Ti7rI9xhGwWiWroJs8GZwwltoNKrLZz5hvpGJRjza 7 | N4FRPRfTNV9el2USWM1qSMM1Y/7NRXaaonBN8gJch5DjD0u4v2qFMXnN8/phbfYtvHzjjwT0+iuU 8 | 7ZerEj5DV83FPCuPu3neM8vqLO0O2ZdYIhh8N/b+mtVfh1BGmdBeqlwbQzupyi2jtYa8UsRJNryJ 9 | aQNAI1nqE7WkQNi4jnfR3c1mpqrpJm9A0iz0IIK2o3sjIVAY1nB1DvOAql4uV2+L9cSs8YjEiGEo 10 | KG35YpgUoOdQZCSAppLebh6rTHdjmovruZpV28o2pc/Oy336pr4cq5pWJCbrpHOKgF7Zcqw7aT4g 11 | 01OxZVG4iurQUBIhovnbHxdfHZBtNKS0ArwffI7sbu7LG2vwJOdvqAnnEc1B+VI9kLqqZRlCNAqE 12 | 
kJ4j/cKu0mlq4nxv4wxpsPeN6Uf8Lj+9yHoDPxnoy9DeibBGvgTenUCS2KC0hHgbfnGNkmd0fpoi 13 | RQhvUxRoEpPDlfjnWAk8g/IPetdAEKoo9N/xZpmmxIz9M8JFZWhAkmoMc/Gr05iSYBD1gwCfZkRZ 14 | 5LJSITBvOi78pRN1fMct5dW8K7bbefN3s+EmS3kmhB12o0t2X08rv7a4nHwnlNQ3SfirkWaQlfL2 15 | QV7HjU6X5atv072ArDX9fJYxjdG22he+Fk1eYDx63oS7LdXwcETBaDO7z60tQSUp1YSmFEpnocFt 16 | R5dqK/U8dnxUo80Cjm/DerWXkUPtfLpQqIV4DpNrQBTEc5TUlfKNUa61N2QvJXyDME4A4Ynm1vdR 17 | XZJcUAVu03nuxvlUFolusR/Qu0LOPLMqD5pX1cjgEGQkkS9OWjp0YsvQiVVDItpE3k2C4ETWC8MW 18 | wnHFuhVF1UB5o5BC6wy3/wgejsJ84lyMtD1vF3/OZftjIW94ksFUdS0mJ75Jj0wKKdXHclhv9A== 19 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/prestartup.py: -------------------------------------------------------------------------------- 1 | 2 | import os 3 | import sys 4 | import datetime 5 | import subprocess 6 | from collections import namedtuple 7 | 8 | def shell(command, interactive=False): 9 | '''Execute a command in the shell. By default prints everything. If the capture switch is set, 10 | then it returns a namedtuple with stdout, stderr, and exit code.''' 11 | 12 | if interactive: 13 | exit_code = subprocess.call(command, shell=True) 14 | if exit_code == 0: 15 | return True 16 | else: 17 | return False 18 | 19 | process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True) 20 | (stdout, stderr) = process.communicate() 21 | exit_code = process.wait() 22 | 23 | # Convert to str (Python 3) 24 | stdout = stdout.decode(encoding='UTF-8') 25 | stderr = stderr.decode(encoding='UTF-8') 26 | 27 | # Output namedtuple 28 | Output = namedtuple('Output', 'stdout stderr exit_code') 29 | 30 | # Return 31 | return Output(stdout, stderr, exit_code) 32 | 33 | 34 | prestartup_scripts_path='/prestartup' 35 | def sorted_ls(path): 36 | mtime = lambda f: os.stat(os.path.join(path, f)).st_mtime 37 | file_list = list(sorted(os.listdir(path), key=mtime)) 38 | return file_list 39 | 40 | for item in sorted_ls(prestartup_scripts_path): 41 | if item.endswith('.sh'): 42 | 43 | # Execute this startup script 44 | print('[INFO] Executing prestartup script "{}"...'.format(item)) 45 | script = prestartup_scripts_path+'/'+item 46 | 47 | # Use bash and not chmod + execute, see https://github.com/moby/moby/issues/9547 48 | out = shell('bash {}'.format(script)) 49 | 50 | # Set date 51 | date_str = str(datetime.datetime.now()).split('.')[0] 52 | 53 | # Print and log stdout and stderr 54 | for line in out.stdout.strip().split('\n'): 55 | print(' out: {}'.format(line)) 56 | 57 | for line in out.stderr.strip().split('\n'): 58 | print(' err: {}'.format(line)) 59 | 60 | # Handle error in the startup script 61 | if out.exit_code: 62 | print('[ERROR] Exit code "{}" for "{}"'.format(out.exit_code, item)) 63 | 64 | # Exit with error code 1 65 | sys.exit(1) 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/prestartup_slurmbase.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | # "Deactivate" local testuser home 5 | mv /home/testuser /home_testuser_vanilla 6 | 7 | # Link testuser against the home in the shared folder (which will be setup by the master node) 8 | ln -s /shared/home_testuser /home/testuser 9 | 
-------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/slurm.conf: -------------------------------------------------------------------------------- 1 | # slurm.conf file generated by configurator.html. 2 | # Put this file on all nodes of your cluster. 3 | # See the slurm.conf man page for more information. 4 | # 5 | ControlMachine=slurmclustermaster 6 | #ControlAddr= 7 | #BackupController= 8 | #BackupAddr= 9 | # 10 | AuthType=auth/munge 11 | CacheGroups=0 12 | #CheckpointType=checkpoint/none 13 | CryptoType=crypto/munge 14 | #DisableRootJobs=NO 15 | #EnforcePartLimits=NO 16 | #Epilog= 17 | #EpilogSlurmctld= 18 | #FirstJobId=1 19 | #MaxJobId=999999 20 | #GresTypes= 21 | #GroupUpdateForce=0 22 | #GroupUpdateTime=600 23 | JobCheckpointDir=/var/lib/slurm-llnl/checkpoint 24 | #JobCredentialPrivateKey= 25 | #JobCredentialPublicCertificate= 26 | #JobFileAppend=0 27 | #JobRequeue=1 28 | #JobSubmitPlugins=1 29 | #KillOnBadExit=0 30 | #LaunchType=launch/slurm 31 | #Licenses=foo*4,bar 32 | #MailProg=/usr/bin/mail 33 | #MaxJobCount=5000 34 | #MaxStepCount=40000 35 | #MaxTasksPerNode=128 36 | MpiDefault=none 37 | #MpiParams=ports=#-# 38 | #PluginDir= 39 | #PlugStackConfig= 40 | #PrivateData=jobs 41 | ProctrackType=proctrack/pgid 42 | #Prolog= 43 | #PrologSlurmctld= 44 | #PropagatePrioProcess=0 45 | #PropagateResourceLimits= 46 | #PropagateResourceLimitsExcept= 47 | #RebootProgram= 48 | ReturnToService=1 49 | #SallocDefaultCommand= 50 | SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid 51 | SlurmctldPort=6817 52 | SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid 53 | SlurmdPort=6818 54 | SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd 55 | SlurmUser=slurm 56 | #SlurmdUser=root 57 | #SrunEpilog= 58 | #SrunProlog= 59 | StateSaveLocation=/var/lib/slurm-llnl/slurmctld 60 | SwitchType=switch/none 61 | #TaskEpilog= 62 | TaskPlugin=task/none 63 | #TaskPluginParam= 64 | #TaskProlog= 65 | #TopologyPlugin=topology/tree 66 | #TmpFS=/tmp 67 | #TrackWCKey=no 68 | #TreeWidth= 69 | #UnkillableStepProgram= 70 | #UsePAM=0 71 | # 72 | # 73 | # TIMERS 74 | #BatchStartTimeout=10 75 | #CompleteWait=0 76 | #EpilogMsgTime=2000 77 | #GetEnvTimeout=2 78 | #HealthCheckInterval=0 79 | #HealthCheckProgram= 80 | InactiveLimit=0 81 | KillWait=30 82 | #MessageTimeout=10 83 | #ResvOverRun=0 84 | MinJobAge=300 85 | #OverTimeLimit=0 86 | SlurmctldTimeout=120 87 | SlurmdTimeout=300 88 | #UnkillableStepTimeout=60 89 | #VSizeFactor=0 90 | Waittime=0 91 | # 92 | # 93 | # SCHEDULING 94 | #DefMemPerCPU=0 95 | FastSchedule=1 96 | #MaxMemPerCPU=0 97 | #SchedulerRootFilter=1 98 | #SchedulerTimeSlice=30 99 | SchedulerType=sched/builtin 100 | SchedulerPort=7321 101 | SelectType=select/linear 102 | #SelectTypeParameters= 103 | # 104 | # 105 | # JOB PRIORITY 106 | #PriorityFlags= 107 | #PriorityType=priority/basic 108 | #PriorityDecayHalfLife= 109 | #PriorityCalcPeriod= 110 | #PriorityFavorSmall= 111 | #PriorityMaxAge= 112 | #PriorityUsageResetPeriod= 113 | #PriorityWeightAge= 114 | #PriorityWeightFairshare= 115 | #PriorityWeightJobSize= 116 | #PriorityWeightPartition= 117 | #PriorityWeightQOS= 118 | # 119 | # 120 | # LOGGING AND ACCOUNTING 121 | #AccountingStorageEnforce=0 122 | #AccountingStorageHost= 123 | #AccountingStorageLoc= 124 | #AccountingStoragePass= 125 | #AccountingStoragePort= 126 | AccountingStorageType=accounting_storage/none 127 | #AccountingStorageUser= 128 | AccountingStoreJobComment=YES 129 | ClusterName=cluster 130 | #DebugFlags= 131 | #JobCompHost= 132 | 
JobCompLoc=/var/log/slurm-llnl/jobs.log 133 | #JobCompPass= 134 | #JobCompPort= 135 | JobCompType=jobcomp/filetxt 136 | #JobCompUser= 137 | JobAcctGatherFrequency=30 138 | JobAcctGatherType=jobacct_gather/none 139 | SlurmctldDebug=3 140 | SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log 141 | SlurmdDebug=3 142 | SlurmdLogFile=/var/log/slurm-llnl/slurmd.log 143 | #SlurmSchedLogFile= 144 | #SlurmSchedLogLevel= 145 | # 146 | # 147 | # POWER SAVE SUPPORT FOR IDLE NODES (optional) 148 | #SuspendProgram= 149 | #ResumeProgram= 150 | #SuspendTimeout= 151 | #ResumeTimeout= 152 | #ResumeRate= 153 | #SuspendExcNodes= 154 | #SuspendExcParts= 155 | #SuspendRate= 156 | #SuspendTime= 157 | # 158 | # Must add controller node explicitly but don't place it into any partition 159 | NodeName=slurmclustermaster CPUs=1 State=UNKNOWN 160 | # 161 | # COMPUTE NODES 162 | NodeName=slurmclusterworker-one CPUs=1 State=UNKNOWN 163 | NodeName=slurmclusterworker-two CPUs=1 State=UNKNOWN 164 | PartitionName=partition1 Nodes=slurmclusterworker-one,slurmclusterworker-two MaxTime=INFINITE State=UP 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/sudoers: -------------------------------------------------------------------------------- 1 | # 2 | # This file MUST be edited with the 'visudo' command as root. 3 | # 4 | # Please consider adding local content in /etc/sudoers.d/ instead of 5 | # directly modifying this file. 6 | # 7 | # See the man page for details on how to write a sudoers file. 8 | # 9 | Defaults env_reset 10 | Defaults mail_badpass 11 | Defaults secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" 12 | 13 | # Host alias specification 14 | 15 | # User alias specification 16 | 17 | # Cmnd alias specification 18 | 19 | # User privilege specification 20 | root ALL=(ALL:ALL) ALL 21 | 22 | # Members of the admin group may gain root privileges 23 | %admin ALL=(ALL) ALL 24 | 25 | # Allow members of group sudo to execute any command 26 | %sudo ALL=(ALL:ALL) NOPASSWD:ALL 27 | 28 | # See sudoers(5) for more information on "#include" directives: 29 | 30 | #includedir /etc/sudoers.d 31 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/supervisord.conf: -------------------------------------------------------------------------------- 1 | ; supervisor config file 2 | 3 | [unix_http_server] 4 | file=/var/run/supervisor.sock ; (the path to the socket file) 5 | chmod=0700 ; sockef file mode (default 0700) 6 | 7 | [supervisord] 8 | logfile=/var/log/supervisor/supervisord.log ; (main log file;default $CWD/supervisord.log) 9 | pidfile=/var/run/supervisord.pid ; (supervisord pidfile;default supervisord.pid) 10 | childlogdir=/var/log/supervisor ; ('AUTO' child log dir, default $TEMP) 11 | nodaemon=true ; Mandatory to run Supervisor in foreground and avoid Docker to exit! 
12 | 13 | ; The below section must remain in the config file for RPC 14 | ; (supervisorctl/web interface) to work, additional interfaces may be 15 | ; added by defining them in separate rpcinterface: sections 16 | [rpcinterface:supervisor] 17 | supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface 18 | 19 | [supervisorctl] 20 | serverurl=unix:///var/run/supervisor.sock ; use a unix:// URL for a unix socket 21 | 22 | ; The [include] section can just contain the "files" setting. This 23 | ; setting can list multiple files (separated by whitespace or 24 | ; newlines). It can also contain wildcards. The filenames are 25 | ; interpreted as relative to this file. Included files *cannot* 26 | ; include files themselves. 27 | 28 | [include] 29 | files = /etc/supervisor/conf.d/*.conf 30 | 31 | 32 | 33 | 34 | 35 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/supervisord_munge.conf: -------------------------------------------------------------------------------- 1 | [program:munged] 2 | 3 | ; Process definition 4 | process_name = munged 5 | command = /usr/sbin/munged -f --key-file /etc/munge/munge.key -F 6 | autostart = true 7 | autorestart = true 8 | startsecs = 5 9 | stopwaitsecs = 10 10 | priority = 100 11 | 12 | ; Log files 13 | stdout_logfile = /var/log/supervisord/munged.log 14 | stdout_logfile_maxbytes = 100MB 15 | stdout_logfile_backups = 5 16 | stderr_logfile = /var/log/supervisord/munged.log 17 | stderr_logfile_maxbytes = 100MB 18 | stderr_logfile_backups = 5 19 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmbase/supervisord_sshd.conf: -------------------------------------------------------------------------------- 1 | [program:sshd] 2 | 3 | ; Process definition 4 | process_name = sshd 5 | command = /usr/sbin/sshd -D 6 | autostart = true 7 | autorestart = true 8 | startsecs = 5 9 | stopwaitsecs = 10 10 | 11 | ; Log files 12 | stdout_logfile = /var/log/supervisord/sshd.log 13 | stdout_logfile_maxbytes = 10MB 14 | stdout_logfile_backups = 5 15 | stderr_logfile = /var/log/supervisord/sshd.log 16 | stderr_logfile_maxbytes = 10MB 17 | stderr_logfile_backups = 5 18 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmcluster/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM slurmid/slurmbase 2 | MAINTAINER Stefano Alberto Russo 3 | 4 | # Add Slurm supervisord conf 5 | COPY supervisord_slurm* /etc/supervisor/conf.d/ 6 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmcluster/supervisord_slurmd.conf: -------------------------------------------------------------------------------- 1 | [program:slurmd] 2 | 3 | ; Process definition 4 | process_name = slurmd 5 | command = /usr/sbin/slurmd -D -f /etc/slurm-llnl/slurm.conf 6 | autostart = true 7 | autorestart = true 8 | startsecs = 5 9 | stopwaitsecs = 10 10 | priority = 200 11 | 12 | ; Log files 13 | stdout_logfile = /var/log/supervisord/slurmd.log 14 | stdout_logfile_maxbytes = 100MB 15 | stdout_logfile_backups = 5 16 | stderr_logfile = /var/log/supervisord/slurmd.log 17 | stderr_logfile_maxbytes = 100MB 18 | stderr_logfile_backups = 5 19 | -------------------------------------------------------------------------------- 
/Tutorials/Containers/SlurmID/services/slurmclustermaster/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM slurmid/slurmcluster 2 | MAINTAINER Stefano Alberto Russo 3 | 4 | # Configure supervisord to run SLURM 5 | COPY supervisord_slurm* /etc/supervisor/conf.d/ 6 | 7 | # Add sample job script 8 | RUN mkdir /examples 9 | COPY test_job.sh /examples/test_job.sh 10 | 11 | # Add prestartup 12 | COPY prestartup_slurmclustermaster.sh /prestartup/ 13 | RUN touch -m /prestartup/prestartup_slurmclustermaster.sh 14 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmclustermaster/prestartup_slurmclustermaster.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | # Generic slurmid user shared folder 5 | mkdir -p /shared/slurmid && chown slurmid:slurmid /shared/slurmid 6 | 7 | # Shared home for testuser to simulate a shared home folders filesystem 8 | cp -a /home_testuser_vanilla /shared/home_testuser 9 | 10 | # Create shared data directories 11 | mkdir -p /shared/scratch 12 | chmod 777 /shared/scratch 13 | 14 | mkdir -p /shared/data/shared 15 | chmod 777 /shared/data/shared 16 | 17 | mkdir -p /shared/data/users/testuser 18 | chown testuser:testuser /shared/data/users/testuser 19 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmclustermaster/supervisord_slurmctld.conf: -------------------------------------------------------------------------------- 1 | [program:slurmctld] 2 | 3 | ; Process definition 4 | process_name = slurmctld 5 | command = /usr/sbin/slurmctld -D -f /etc/slurm-llnl/slurm.conf 6 | autostart = true 7 | autorestart = true 8 | startsecs = 5 9 | stopwaitsecs = 10 10 | priority = 300 11 | 12 | ; Log files 13 | stdout_logfile = /var/log/supervisord/slurmctld.log 14 | stdout_logfile_maxbytes = 100MB 15 | stdout_logfile_backups = 5 16 | stderr_logfile = /var/log/supervisord/slurmctld.log 17 | stderr_logfile_maxbytes = 100MB 18 | stderr_logfile_backups = 5 19 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmclustermaster/test_job.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #SBATCH --job-name=test_job 4 | #SBATCH --output=results.txt 5 | #SBATCH --ntasks=2 6 | 7 | srun bash -c "printf 'Started on %s\n' \$(hostname)" 8 | sleep 60 9 | srun bash -c "printf 'Done on %s\n' \$(hostname)" 10 | 11 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/services/slurmclusterworker/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM slurmid/slurmcluster 2 | MAINTAINER Stefano Alberto Russo 3 | 4 | # Add testuser user to sudoers 5 | RUN adduser testuser sudo 6 | 7 | # Custom stuff (e.g. installing dependencies) here... -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/build: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | set -e 3 | 4 | # Check if we are in the right place 5 | if [ ! -d ./services ]; then 6 | echo "You must run this command from the project's root folder."
7 | exit 1 8 | fi 9 | 10 | # Set service and caching switch 11 | if [[ "x$1" == "xnocache" ]] ; then 12 | NOCACHE=true 13 | SERVICE="" 14 | elif [[ "x$2" == "xnocache" ]] ; then 15 | NOCACHE=true 16 | SERVICE=$1 17 | else 18 | if [[ "x$NOCACHE" == "x" ]] ; then 19 | # Set the default only if we did not get any NOCACHE env var 20 | NOCACHE=false 21 | fi 22 | SERVICE=$1 23 | fi 24 | 25 | if [[ "x$NOCACHE" == "xtrue" ]] ; then 26 | BUILD_COMMAND="docker build --no-cache" 27 | else 28 | BUILD_COMMAND="docker build" 29 | fi 30 | 31 | if [[ "x$SERVICE" == "x" ]] ; then 32 | 33 | # Build all services 34 | NOCACHE=$NOCACHE slurmid/build slurmbase 35 | NOCACHE=$NOCACHE slurmid/build slurmcluster 36 | NOCACHE=$NOCACHE slurmid/build slurmclustermaster 37 | NOCACHE=$NOCACHE slurmid/build slurmclusterworker 38 | 39 | else 40 | 41 | # Build a specific image 42 | echo "" 43 | if [[ "x$NOCACHE" == "xtrue" ]] ; then 44 | echo "-> Building $SERVICE (without cache)..." 45 | else 46 | echo "-> Building $SERVICE..." 47 | fi 48 | echo "" 49 | $BUILD_COMMAND services/$SERVICE -t slurmid/$SERVICE 50 | 51 | fi 52 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/clean: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | docker-compose down 11 | else 12 | docker-compose rm -s -v -f $@ 13 | fi 14 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/logs: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | echo "Please tell me which service to get logs from." 11 | exit 1 12 | fi 13 | 14 | if [[ "x$2" != "x" ]] ; then 15 | tail -f -n 1000 data/$1/log/$2.log 16 | else 17 | docker-compose logs -f $1 18 | fi 19 | 20 | 21 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/ps: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | docker-compose ps 11 | else 12 | echo "This command does not support any argument." 13 | exit 1 14 | fi 15 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/rerun: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 
6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | docker-compose down 11 | docker-compose up -d 12 | else 13 | slurmid/clean $@ 14 | slurmid/run $@ 15 | fi 16 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/run: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | docker-compose up -d 11 | else 12 | docker-compose up -d $@ 13 | fi 14 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/shell: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | echo "Please tell me on which service to open the shell in." 11 | exit 1 12 | 13 | elif [[ $# -gt 2 ]] ; then 14 | echo "Use double quotes to wrap commands with spaces" 15 | exit 1 16 | else 17 | 18 | COMMAND=$2 19 | if [[ "x$COMMAND" == "x" ]] ; then 20 | echo "" 21 | echo "Executing: /bin/bash" 22 | echo "" 23 | docker-compose exec $1 sudo -i -u slurmid /bin/bash 24 | else 25 | echo "" 26 | echo "Executing: \"$COMMAND\"" 27 | echo "" 28 | docker-compose exec $1 sudo -i -u slurmid /bin/bash -c "$COMMAND" 29 | fi 30 | 31 | fi 32 | -------------------------------------------------------------------------------- /Tutorials/Containers/SlurmID/slurmid/status: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Check if we are in the right place 4 | if [ ! -d ./services ]; then 5 | echo "You must run this command from the project's root folder." 6 | exit 1 7 | fi 8 | 9 | if [[ $# -eq 0 ]] ; then 10 | 11 | declare -a container_names 12 | OUT=$(slurmid/ps) 13 | 14 | while read -r line; do 15 | 16 | if [[ $line == *"Up"* ]]; then 17 | container_name=$(echo $line | cut -d ' ' -f1) 18 | container_names+=($container_name); 19 | fi 20 | 21 | done <<< "$OUT" 22 | 23 | for container_name in ${container_names[@]} 24 | do 25 | echo "" 26 | echo "Container \"$container_name\":" 27 | docker-compose exec $container_name /bin/bash -c "supervisorctl status" 28 | done 29 | echo "" 30 | 31 | else 32 | docker-compose exec $@ /bin/bash -c "supervisorctl status" 33 | fi 34 | -------------------------------------------------------------------------------- /Tutorials/VirtualMachine/README.md: -------------------------------------------------------------------------------- 1 | # Virtualization Tutorial 2 | 3 | In this tutorial, we will learn how to build a cluster of Linux machines on our local environment using Virtualbox. 4 | Each machine will have two NICs one internal and one to connect to WAN . 5 | We will connect to them from our host windows machine via SSH. 6 | 7 | This configuration is useful to try out a clustered application which requires multiple Linux machines like kubernetes or an HPC cluster on your local environment. 8 | The primary goal will be to test the guest virtual machine performances using standard benckmaks as HPL, STREAM or iozone and compare with the host performances. 
9 | 10 | Then we will installa a slurm based cluster to test parallel applications 11 | 12 | ## GOALs 13 | In this tutorial, we are going to create a cluster of four Linux virtual machines. 14 | 15 | * Each machine is capable of connecting to the internet and able to connect with each other privately as well as can be reached from the host machine. 16 | * Our machines will be named cluster01, cluster02, ..., cluster0X. 17 | * The first machine: cluster01 will act as a master node and will have 1vCPUs, 2GB of RAM and 25 GB hard disk. 18 | * The other machines will act as worker nodes will have 1vCPUs, 1GB of RAM and 10 GB hard disk. 19 | * We will assign our machines static IP address in the internal network: 192.168.0.1, 192.168.0.22, 192.168.0.23, .... 192.168.0.XX. 20 | 21 | ## Prerequisite 22 | 23 | * VirtualBOX installed in your linux/windows/Apple (UTM in Arm based Mac) 24 | * ubuntu server 22.04 LTS image to install 25 | * SSH client to connect 26 | 27 | ## Create virtual machines on Virtualbox 28 | We create one template that we will use then to deply the cluster and to make some performance tests and comparisons 29 | 30 | Create the template virtual machine which we will name "template" with 1vCPUs, 1GB of RAM and 25 GB hard disk. 31 | 32 | You can use Ubuntu 22.04 LTS server (https://ubuntu.com/download/server) 33 | 34 | Make sure to set up the network as follows: 35 | 36 | * Attach the downloaded Ubuntu ISO to the "ISO Image". 37 | * Type: is Lunux 38 | * Version Ubuntu 22.04 LTS 39 | When you start the virtual machines for the first time you are prompted to instal and setup Ubuntu. 40 | Follow through with the installation until you get to the “Network Commections”. As the VM network protocol is NAT, the virtual machine will be assinged to an automatic IP (internal) and it will be able to access internet for software upgrades. 41 | 42 | The VM is now accessing the network to download the software and updates for the LTS. 43 | 44 | When you are prompted for the "Guided storage configuration" panel keep the default installation method: use an entire disk. 45 | 46 | When you are prompted for the Profile setup, you will be requested to define a server name (template) and super user (e.g. user01) and his administrative password. 47 | 48 | 49 | Also, enable open ssh server in the software selection prompt. 50 | 51 | Follow the installation and then shutdown the VM. 52 | 53 | Inspect the VM and in particular the Network, you will find only one adapter attached to NAT. If you look at the advanced tab you will find the Adapter Type (Intel) and the MAC address. 54 | 55 | Start the newly created machine and make sure it is running. 56 | 57 | Login and update the software: 58 | 59 | ``` 60 | $ sudo apt update 61 | ... 62 | 63 | $ sudo apt upgrade 64 | ``` 65 | 66 | 67 | When your VM is all set up, log in to the VM and test the network by pinging any public site. e.g `ping google.com`. If all works well, this should be successful. 68 | 69 | You can check the DHCP-assigned IP address by entering the following command: 70 | 71 | ```shell 72 | hostname -I 73 | ``` 74 | 75 | You will get an output similar to this: 76 | ``` 77 | 10.0.2.15 78 | ``` 79 | 80 | This is the default IP address assigned by your network DHCP. Note that this IP address is dynamic and can change or worst still, get assigned to another machine. But for now, you can connect to this IP from your host machine via SSH. 
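Under VirtualBox NAT, reaching the guest from the host actually requires a port-forwarding rule (this is set up in the "Port forwarding" section later in this tutorial). If you prefer the command line on the host, the same rule can also be created with VBoxManage (a sketch, assuming the VM is still named "template" and is powered off):

```
# Forward host port 2222 to guest port 22 on the NAT adapter of the "template" VM.
VBoxManage modifyvm "template" --natpf1 "ssh,tcp,127.0.0.1,2222,,22"
```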
81 | 82 | Now install some useful additional packages: 83 | 84 | ``` 85 | $ sudo apt install net-tools 86 | ``` 87 | 88 | If everything is ok, we can proceed cloning this template. We will create 3 clones (you can create more than 3 89 | according to the amount of RAM and cores available in your laptop). 90 | 91 | You must shutdown the node to clone it, using VirtualBox interface (select VM and right click) create 3 new VMs. 92 | 93 | ``` 94 | $ sudo shutdown -h now 95 | ``` 96 | 97 | 98 | Right click on the name of the VM and clone it. The first clone will be the login/master node the other twos will be computing nodes. 99 | 100 | ## Configure the cluster 101 | 102 | Once the 2 machines has been cloned we can bootstrap the login/master node and configure it. 103 | Add a new network adapter on each machine: enable "Adapter 2" "Attached to" internal network and name it "clustervimnet" 104 | 105 | ### Login/master node 106 | 107 | Bootstrap the VM and configure the secondary network adapter with a static IP. 108 | 109 | In the example below the interface is enp0s8, to find your own one: 110 | 111 | ``` 112 | $ ip link show 113 | 1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 114 | link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 115 | 2: enp0s3: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 116 | link/ether 08:00:27:2b:e5:36 brd ff:ff:ff:ff:ff:ff 117 | 3: enp0s8: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 118 | link/ether 08:00:27:6e:cf:82 brd ff:ff:ff:ff:ff:ff 119 | ``` 120 | 121 | You are interested to link 2 and 3. Link 2 is the NAT device, Link 3 is the internal network device. 122 | You are interested in Link 3. 123 | 124 | Now we configure the adapter. To do this we will edit the netplan file: 125 | 126 | ``` 127 | $ sudo vim /etc/netplan/00-installer-config.yaml 128 | 129 | # This is the network config written by 'subiquity' 130 | network: 131 | ethernets: 132 | enp0s1: 133 | dhcp4: true 134 | enp0s8: 135 | dhcp4: no 136 | addresses: [192.168.0.1/24] 137 | version: 2 138 | ``` 139 | 140 | and apply the configuration 141 | 142 | ``` 143 | $ sudo netplan apply 144 | ``` 145 | We change the hostname: 146 | ``` 147 | $ sudo vim /etc/hostname 148 | 149 | cluster01 150 | ``` 151 | 152 | 153 | Edit the hosts file to assign names to the cluster that should include names for each node as follows: 154 | 155 | ``` 156 | $ sudo vim /etc/hosts 157 | 158 | 127.0.0.1 localhost 159 | 192.168.0.1 cluster01 160 | 161 | 192.168.0.22 cluster02 162 | 192.168.0.23 cluster03 163 | 192.168.0.24 cluster04 164 | 192.168.0.25 cluster05 165 | 192.168.0.26 cluster06 166 | 192.168.0.27 cluster07 167 | 192.168.0.28 cluster08 168 | 169 | 170 | # The following lines are desirable for IPv6 capable hosts 171 | ::1 ip6-localhost ip6-loopback 172 | fe00::0 ip6-localnet 173 | ff00::0 ip6-mcastprefix 174 | ff02::1 ip6-allnodes 175 | ff02::2 ip6-allrouters 176 | 177 | ``` 178 | 179 | 180 | Then we install a DNSMASQ server to dynamically assign the IP and hostname to the other nodes on the internal interface and create a cluster [1]. 181 | 182 | ``` 183 | $ sudo systemctl disable systemd-resolved 184 | Removed /etc/systemd/system/multi-user.target.wants/systemd-resolved.service. 185 | Removed /etc/systemd/system/dbus-org.freedesktop.resolve1.service. 
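# systemd-resolved is disabled because its stub resolver occupies port 53,
# which dnsmasq (installed below) needs in order to serve DNS and DHCP on the internal network.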
186 | $ sudo systemctl stop systemd-resolved 187 | ``` 188 | Then 189 | 190 | ``` 191 | $ ls -lh /etc/resolv.conf 192 | lrwxrwxrwx 1 root root 39 Jul 26 2018 /etc/resolv.conf ../run/systemd/resolve/stub-resolv.conf 193 | $ sudo unlink /etc/resolv.conf 194 | ``` 195 | Create a new resolv.conf file and add public DNS servers you wish. In my case am going to use google DNS. 196 | 197 | ``` 198 | $ echo nameserver 8.8.8.8 | sudo tee /etc/resolv.conf 199 | ``` 200 | 201 | 202 | Install dnsmasq 203 | 204 | ``` 205 | $ sudo apt install dnsmasq -y 206 | ``` 207 | 208 | To find and configuration file for Dnsmasq, navigate to /etc/dnsmasq.conf. Edit the file by modifying it with your desired configs. Below is minimal configurations for it to run and support minimum operations. 209 | 210 | ``` 211 | $ vim /etc/dnsmasq.conf 212 | 213 | port=53 214 | bogus-priv 215 | strict-order 216 | expand-hosts 217 | dhcp-range=192.168.0.22,192.168.0.28,255.255.255.0,12h 218 | dhcp-option=option:dns-server,192.168.0.1 219 | dhcp-option=3 220 | 221 | ``` 222 | 223 | 224 | When done with editing the file, close it and restart Dnsmasq to apply the changes. 225 | ``` 226 | $ sudo systemctl restart dnsmasq 227 | ``` 228 | 229 | Check if it is working 230 | 231 | ``` 232 | $ host cluster01 233 | 234 | ``` 235 | 236 | Shutdown the VM. 237 | 238 | ### Port forwarding on login/master node 239 | To enable ssh from host to guest VM you need to create a port forwarding rule in VirtualBox. 240 | To do this open 241 | ``` 242 | VM settings -> Network -> Advanced -> Port Forwarding 243 | ``` 244 | 245 | and create a forwarding rule from host to the VM: 246 | * Name --> ssh 247 | * Protocol --> TCP 248 | * HostIP --> 127.0.0.1 249 | * Host Port --> 2222 250 | * Guest Port --> 22 251 | 252 | Now you should be able to ssh to your VM. Startup the VM, then 253 | 254 | ``` 255 | ssh -p 2222 yury@127.0.0.1 256 | ``` 257 | 258 | but you will have to enter the password. 259 | If you want a passwordless access you need to generate a ssh key or use an ssh key if you already have it. 260 | 261 | If you don’t have public/private key pair already, run ssh-keygen and agree to all defaults. 262 | This will create id_rsa (private key) and id_rsa.pub (public key) in ~/.ssh directory. 263 | 264 | Copy host public key to your VM: 265 | 266 | ``` 267 | scp -P 2222 ~/.ssh/id_rsa.pub user01@127.0.0.1:~ 268 | ``` 269 | 270 | Connect to the VM and add host public key to ~/.ssh/authorized_keys: 271 | 272 | ``` 273 | ssh -p 2222 user01@127.0.0.1 274 | mkdir ~/.ssh 275 | chmod 700 ~/.ssh 276 | cat ~/id_rsa.pub >> ~/.ssh/authorized_keys 277 | chmod 644 ~/.ssh/authorized_keys 278 | exit 279 | ``` 280 | 281 | Now you should be able to ssh to the VM without password. 282 | 283 | To build a cluster we aldo need a distributed filesystem accessible from all nodes. 284 | We use NFS. 
285 | 
286 | ```
287 | $ sudo apt install nfs-kernel-server
288 | $ sudo mkdir /shared
289 | $ sudo chmod 777 /shared
290 | ```
291 | 
292 | Modify the NFS config file:
293 | 
294 | ```
295 | $ sudo vim /etc/exports
296 | 
297 | /shared/ 192.168.0.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check)
298 | 
299 | ```
300 | Restart the server:
301 | 
302 | ```
303 | $ sudo systemctl enable nfs-kernel-server
304 | $ sudo systemctl restart nfs-kernel-server
305 | ```
306 | 
307 | ### Computing nodes
308 | Bootstrap the cloned compute VM and configure the secondary network adapter with a dynamic IP (this should already be the default configuration and nothing should need to be modified; in any case, check the names of the adapters with the "ip link show" command). Keep cluster01 running, since it provides DHCP and DNS on the internal network.
309 | 
310 | To do this we will edit the netplan file:
311 | 
312 | ```
313 | $ sudo vim /etc/netplan/00-installer-config.yaml
314 | 
315 | # This is the network config written by 'subiquity'
316 | network:
317 |   ethernets:
318 |     enp0s3:
319 |       dhcp4: true
320 |       dhcp4-overrides:
321 |         use-dns: no
322 |     enp0s8:
323 |       dhcp4: true
324 |   version: 2
325 | ```
326 | 
327 | and apply the configuration:
328 | 
329 | ```
330 | $ sudo netplan apply
331 | ```
332 | We empty the hostname:
333 | ```
334 | $ sudo vim /etc/hostname
335 | 
336 | 
337 | ```
338 | 
339 | Set the proper DNS server (assigned with DHCP):
340 | 
341 | ```
342 | $ sudo rm /etc/resolv.conf
343 | ```
344 | 
345 | then
346 | 
347 | ```
348 | $ sudo ln -s /run/systemd/resolve/resolv.conf /etc/resolv.conf
349 | ```
350 | 
351 | Reboot the machine.
352 | At reboot you will see that the machine has a new IP address:
353 | 
354 | ```
355 | $ hostname -I
356 | 10.0.2.15 192.168.0.23
357 | ```
358 | Install dnsmasq (a trick needed to install the cluster later):
359 | ```
360 | $ sudo apt install dnsmasq -y
361 | $ sudo systemctl disable dnsmasq
362 | ```
363 | 
364 | Now, from cluster01 you will be able to connect to the new compute node (cluster03 in this example) with ssh.
365 | 
366 | ```
367 | $ ssh user01@cluster03
368 | user01@cluster03:~$
369 | 
370 | ```
371 | To access the new machine without a password you can proceed as described above. Run ssh-keygen and agree to all defaults.
372 | This will create id_rsa (private key) and id_rsa.pub (public key) in the ~/.ssh directory.
373 | 
374 | Copy the public key to the new VM:
375 | 
376 | ```
377 | scp ~/.ssh/id_rsa.pub user01@cluster03:~
378 | ```
379 | 
380 | Connect to the VM and add the public key to ~/.ssh/authorized_keys:
381 | 
382 | ```
383 | ssh user01@cluster03
384 | mkdir ~/.ssh
385 | chmod 700 ~/.ssh
386 | cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
387 | chmod 644 ~/.ssh/authorized_keys
388 | exit
389 | ```
390 | 
391 | Configure the shared filesystem:
392 | 
393 | ```
394 | $ sudo apt install nfs-common
395 | $ sudo mkdir /shared
396 | ```
397 | 
398 | Mount the shared directory and test it:
399 | ```
400 | $ sudo mount 192.168.0.1:/shared /shared
401 | $ touch /shared/pippo
402 | ```
403 | If everything is ok you will see the "pippo" file on all the nodes.
404 | 
405 | To automatically mount it at boot, edit the /etc/fstab file:
406 | 
407 | ```
408 | $ sudo vim /etc/fstab
409 | 
410 | ```
411 | 
412 | Append the following line at the end of the file:
413 | 
414 | ```
415 | 192.168.0.1:/shared /shared nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
416 | ```
417 | 
418 | 
419 | 
420 | ## Install a SLURM based cluster
421 | Here I will describe a simple configuration of the Slurm management tool for launching jobs in a really simple virtual cluster.
I will assume the following configuration: a main node (cluster01) and 3 compute nodes (cluster03, ... VMs). I also assume there is ping access between the nodes and some sort of mechanism for you to know the IP of each node at all times (the most basic being a local NAT with static IPs).
422 | 
423 | The Slurm management tool works on a set of nodes, one of which is considered the master node and has the slurmctld daemon running; all other compute nodes run the slurmd daemon.
424 | 
425 | All communications are authenticated via the munge service and all nodes need to share the same authentication key. Slurm by default holds a journal of activities in a directory configured in the slurm.conf file; however, a database management system can be configured instead. All in all, what we will try to do is:
426 | 
427 | * Install munge on all nodes and configure the same authentication key in each of them
428 | * Install gcc and OpenMPI and configure them
429 | * Configure the slurmctld service on the master node
430 | * Configure the slurmd service on the compute nodes
431 | * Create a basic file structure for storing jobs and job results that is identical on all the nodes of the cluster
432 | * Manipulate the state of the nodes, and learn to resume them if they are down
433 | * Run some simple jobs as a test
434 | * Set up an MPI task on the cluster
435 | 
436 | ### Install gcc and OpenMPI
437 | 
438 | ```
439 | $ sudo apt install gcc-12 openmpi-bin openmpi-common
440 | ```
441 | Configure gcc-12 as the default:
442 | ```
443 | $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
444 | ```
445 | Test the installation:
446 | 
447 | ```
448 | $ gcc --version
449 | gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
450 | Copyright (C) 2022 Free Software Foundation, Inc.
451 | This is free software; see the source for copying conditions. There is NO
452 | warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
453 | ```
454 | and MPI:
455 | 
456 | ```
457 | $ mpicc --version
458 | gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
459 | Copyright (C) 2022 Free Software Foundation, Inc.
460 | This is free software; see the source for copying conditions. There is NO
461 | warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
462 | ```
463 | 
464 | ### Install MUNGE
465 | 
466 | Let's start by installing the munge authentication tool using the system package manager, on both cluster01 and cluster03:
467 | 
468 | ```
469 | $ sudo apt-get install -y libmunge-dev libmunge2 munge
470 | ```
471 | 
472 | munge requires that we generate a key file for authentication; for this we use the dd utility with the pseudo-random device /dev/urandom. On the cluster01 node do:
473 | ```
474 | $ sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
475 | $ sudo chown munge:munge /etc/munge/munge.key
476 | $ sudo chmod 400 /etc/munge/munge.key
477 | ```
478 | 
479 | Copy the key to cluster03 with scp, and check on cluster03 as well that the file permissions and owner are correct.
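Since the key is readable only by root and the munge user, and /etc/munge on cluster03 is not writable over a plain scp as user01, the direct copy shown below may fail with permission errors. A workaround sketch (assuming user01 has sudo rights on both nodes, as set up during installation) is to stage the key through the user's home directory and restart munge on both nodes afterwards:

```
# On cluster01: stage a copy of the key that user01 can read.
sudo cp /etc/munge/munge.key /home/user01/munge.key
sudo chown user01:user01 /home/user01/munge.key

# Copy it to cluster03 and move it into place with the correct owner and mode.
scp /home/user01/munge.key user01@cluster03:~
ssh -t user01@cluster03 "sudo mv ~/munge.key /etc/munge/munge.key && \
  sudo chown munge:munge /etc/munge/munge.key && \
  sudo chmod 400 /etc/munge/munge.key && \
  sudo systemctl restart munge"

# Clean up the staged copy and restart munge on cluster01 too.
rm -f /home/user01/munge.key
sudo systemctl restart munge
```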
480 | 
481 | ```
482 | scp /etc/munge/munge.key user01@cluster03:/etc/munge/munge.key
483 | ```
484 | Test the munge authentication locally and remotely with these two commands, respectively:
485 | 
486 | ```
487 | $ munge -n | unmunge
488 | $ munge -n | ssh cluster03 unmunge
489 | ```
490 | ### Install Slurm
491 | 
492 | ```
493 | $ sudo apt-get install -y slurmd slurmctld
494 | ```
495 | 
496 | Copy the slurm configuration file from this Git repository to the '/etc/slurm' directory on both cluster01 and cluster03.
497 | 
498 | On cluster01:
499 | 
500 | ```
501 | $ sudo systemctl enable slurmctld
502 | $ sudo systemctl start slurmctld
503 | 
504 | $ sudo systemctl enable slurmd
505 | $ sudo systemctl start slurmd
506 | ```
507 | 
508 | On cluster03:
509 | ```
510 | $ sudo systemctl disable slurmctld
511 | $ sudo systemctl stop slurmctld
512 | 
513 | $ sudo systemctl enable slurmd
514 | $ sudo systemctl start slurmd
515 | ```
516 | 
517 | Now you can test the environment:
518 | 
519 | ```
520 | $ sinfo
521 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
522 | debug* up infinite 1 idle cluster03
523 | debug* up infinite 2 unk* cluster[04-05]
524 | ```
525 | 
526 | Test a job:
527 | 
528 | ```
529 | $ srun hostname
530 | cluster03
531 | ```
532 | 
533 | ### Clone the node
534 | 
535 | Shutdown cluster03, clone it to cluster04 and then start the two VMs.
536 | 
537 | After startup you should find:
538 | 
539 | ```
540 | $ sinfo
541 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
542 | debug* up infinite 2 idle cluster[03-04]
543 | debug* up infinite 1 unk* cluster05
544 | ```
545 | 
546 | 
547 | ## Testing the VMM environment
548 | 
549 | ### Test VM performance
550 | We propose three different tests to study VM performance and VMM capabilities.
551 | 
552 | * Test the network isolation: clone a new VM from the template, change the name of its "Internal Network" to "testnetwork", then try to access the new VM from the cluster
553 | * Using HPCC, compare the host and the guest performance in terms of CPU flops and memory bandwidth
554 | * Using iozone, compare host IO performance with guest IO performance
555 | 
556 | For Windows-based systems it may be necessary to install the Windows Subsystem for Linux.
557 | 
558 | 
559 | 
560 | Install and use hpcc:
561 | 
562 | ```
563 | sudo apt install hpcc
564 | ```
565 | 
566 | HPCC is a suite of benchmarks that measures the performance of the processor,
567 | the memory subsystem, and the interconnect. For details refer to the
568 | HPC Challenge web site (http://icl.cs.utk.edu/hpcc/).
569 | 
570 | In essence, HPC Challenge consists of a number of tests, each
571 | of which measures the performance of a different aspect of the system: HPL, STREAM, DGEMM, PTRANS, RandomAccess, FFT.
572 | 
573 | If you are familiar with the High Performance Linpack (HPL) benchmark
574 | code (see the HPL web site: http://www.netlib.org/benchmark/hpl/) then you can reuse the input
575 | file that you already have for HPL.
576 | See http://www.netlib.org/benchmark/hpl/tuning.html for a description of this file and its parameters.
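As a rough rule of thumb (our own assumption, not something the HPL documentation mandates), the problem size N is usually chosen so that the N x N matrix of doubles fills about 80% of the available RAM; a quick sketch to estimate it on a node:

```
# Estimate an HPL problem size N using ~80% of this machine's RAM
# (8 bytes per double-precision matrix element).
awk '/MemTotal/ { printf "Suggested N ~ %d\n", sqrt(0.8 * $2 * 1024 / 8) }' /proc/meminfo
```

In practice N is then rounded to a multiple of NB, and P x Q is matched to the number of MPI processes, as discussed below.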
577 | You can use the following sites for finding the appropriate values:
578 | 
579 | * Tweak HPL parameters: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/
580 | * HPL Calculator: https://hpl-calculator.sourceforge.net/
581 | 
582 | The main parameters to play with for optimizing the HPL runs are:
583 | 
584 | * NB: depends on the CPU architecture; use the recommended blocking sizes (NB in HPL.dat) listed, after loading the toolchain/intel module, under $EBROOTIMKL/compilers_and_libraries/linux/mkl/benchmarks/mp_linpack/readme.txt, i.e.
585 |   * NB=192 for the broadwell processors available on iris
586 |   * NB=384 on the skylake processors available on iris
587 | * P and Q, knowing that the product P x Q should typically be equal to the number of MPI processes.
588 | * Of course N, the problem size.
589 | 
590 | To run the HPCC benchmark, first create the HPL input file and then simply execute the hpcc command from the CLI.
591 | 
592 | Install and use IOzone:
593 | 
594 | ```
595 | $ sudo apt install iozone3
596 | ```
597 | 
598 | IOzone performs the following 13 types of test. If you are running iozone on a database server, you can focus on the first 6 tests, as they directly impact database performance.
599 | 
600 | * Read – Indicates the performance of reading a file that already exists in the filesystem.
601 | * Write – Indicates the performance of writing a new file to the filesystem.
602 | * Re-read – After reading a file, this indicates the performance of reading the file again.
603 | * Re-write – Indicates the performance of writing to an existing file.
604 | * Random Read – Indicates the performance of reading a file by reading random information from the file, i.e. this is not a sequential read.
605 | * Random Write – Indicates the performance of writing to a file in various random locations, i.e. this is not a sequential write.
606 | * Backward Read
607 | * Record Re-Write
608 | * Stride Read
609 | * Fread
610 | * Fwrite
611 | * Freread
612 | * Frewrite
613 | 
614 | IOzone can be run in parallel over multiple threads, and can use different output file sizes to stress performance.
615 | 
616 | ```
617 | $ ./iozone -a -b output.xls
618 | ```
619 | 
620 | This executes all tests and creates an XLS output to simplify the analysis of the results.
621 | 
622 | Here you will find an introduction to IOzone with some examples: https://www.cyberciti.biz/tips/linux-filesystem-benchmarking-with-iozone.html
623 | 
624 | 
625 | 
626 | ### Test the Slurm cluster
627 | 
628 | * Run a simple MPI program on the cluster
629 | * Run an interactive job
630 | * Use the OSU ping-pong benchmark to test the VM interconnect.
631 | 
632 | Install the OSU MPI benchmarks: download the latest tarball from http://mvapich.cse.ohio-state.edu/benchmarks/.
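For example, for the 7.3 release used in the build steps below (the exact download path is an assumption on our part; check the benchmarks page if it has moved):

```
# Fetch the OSU micro-benchmarks tarball (version 7.3, as used below).
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.3.tar.gz
```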
633 | 634 | 635 | ``` 636 | $ tar zxvf osu-micro-benchmarks-7.3.tar.gz 637 | 638 | $ cd osu-micro-benchmarks-7.3/ 639 | 640 | $ sudo apt install make g++-12 641 | 642 | $ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100 643 | $ sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++-12 100 644 | $ ./configure CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx --prefix=/shared/OSU/ 645 | $ make 646 | $ make install 647 | ``` 648 | 649 | ## References 650 | 651 | [1] Configure DNSMASQ https://computingforgeeks.com/install-and-configure-dnsmasq-on-ubuntu/?expand_article=1 652 | 653 | [2] Configure NFS Mounts https://www.howtoforge.com/how-to-install-nfs-server-and-client-on-ubuntu-22-04/ 654 | 655 | [3] Configure network with netplan https://linuxconfig.org/netplan-network-configuration-tutorial-for-beginners 656 | 657 | [4] SLURM Quick Start Administrator Guide https://slurm.schedmd.com/quickstart_admin.html 658 | 659 | [5] Simple SLURM configuration on Debian systems: https://gist.github.com/asmateus/301b0cb86700cbe74c269b27f2ecfbef 660 | 661 | [6] HPC Challege https://hpcchallenge.org/hpcc/ 662 | 663 | [7] IO Zone Benchmarks https://www.iozone.org/ 664 | -------------------------------------------------------------------------------- /Tutorials/VirtualMachine/UTMSlurmCluster.md: -------------------------------------------------------------------------------- 1 | # UTM based Slurm Cluster Tutorial 2 | 3 | In this tutorial, we will learn how to build a cluster of Linux machines on our local environment using UTM virtualization system. 4 | Each machine will have only one NIC to connect to WAN and between cluster nodes. 5 | We will connect to them from our host windows machine via SSH. 6 | 7 | This configuration is useful to try out a clustered application which requires multiple Linux machines like kubernetes or an HPC cluster on your local environment. 8 | The primary goal will be to test the guest virtual machine performances using standard benckmaks as HPL, STREAM or iozone and compare with the host performances. 9 | 10 | Then we will installa a slurm based cluster to test parallel applications 11 | 12 | ## GOALs 13 | In this tutorial, we are going to create a cluster of four Linux virtual machines. 14 | 15 | * Each machine is capable of connecting to the internet and able to connect with each other privately as well as can be reached from the host machine. 16 | * Our machines will be named login, cluster0X, ..., cluster0X. 17 | * The first machine: login will act as a master node and will have 1vCPUs, 2GB of RAM and 25 GB hard disk. 18 | * The other machines will act as worker nodes will have 1vCPUs, 1GB of RAM and 10 GB hard disk. 19 | 20 | ## Prerequisite 21 | 22 | * UTM installed in your Apple machine 23 | * ubuntu server 22.04 LTS image to install 24 | * SSH client to connect 25 | 26 | ## Create virtual machines on Virtualbox 27 | We create one template that we will use then to deply the cluster and to make some performance tests and comparisons 28 | 29 | Create the template virtual machine which we will name "template" with 1vCPUs, 1GB of RAM and 25 GB hard disk. 30 | 31 | You can use Ubuntu 22.04 LTS server (https://ubuntu.com/download/server) 32 | 33 | Make sure to set up the network as follows: 34 | 35 | * Attach the downloaded Ubuntu ISO to the "ISO Image". 36 | * Type: is Lunux 37 | * Version Ubuntu 22.04 LTS 38 | When you start the virtual machines for the first time you are prompted to instal and setup Ubuntu. 
39 | Follow through with the installation until you get to the “Network Connections” step. As the VM network mode is NAT, the virtual machine will be assigned an automatic (internal) IP and it will be able to access the internet for software upgrades.
40 | 
41 | The VM is now accessing the network to download the software and updates for the LTS.
42 | 
43 | When you are prompted with the "Guided storage configuration" panel, keep the default installation method: use an entire disk.
44 | 
45 | When you are prompted for the Profile setup, you will be requested to define a server name (template), a superuser (e.g. user01) and its administrative password.
46 | 
47 | 
48 | Also, enable the OpenSSH server in the software selection prompt.
49 | 
50 | Follow the installation and then shutdown the VM.
51 | 
52 | Inspect the VM, and in particular the Network section: you will find only one adapter, attached to NAT. If you look at the advanced tab you will find the Adapter Type (Intel) and the MAC address.
53 | 
54 | Start the newly created machine and make sure it is running.
55 | 
56 | Login and update the software:
57 | 
58 | ```
59 | $ sudo apt update
60 | ...
61 | 
62 | $ sudo apt upgrade
63 | ```
64 | 
65 | 
66 | When your VM is all set up, log in to the VM and test the network by pinging any public site, e.g. `ping google.com`. If all works well, this should be successful.
67 | 
68 | You can check the DHCP-assigned IP address by entering the following command:
69 | 
70 | ```shell
71 | hostname -I
72 | ```
73 | 
74 | You will get an output similar to this:
75 | ```
76 | 192.168.64.8
77 | ```
78 | 
79 | This is the default IP address assigned by your network DHCP. Note that this IP address is dynamic and can change or, worse still, get assigned to another machine. But for now, you can connect to this IP from your host machine via SSH.
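For example, assuming the user created during installation is user01, as above:

```
# From the host (macOS) terminal, using the address printed by `hostname -I`.
ssh user01@192.168.64.8
```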
80 | 81 | Now install some useful additional packages: 82 | 83 | ``` 84 | $ sudo apt install net-tools 85 | ``` 86 | 87 | Edit the hosts file to assign names to the cluster that should include names for each node as follows: 88 | 89 | ``` 90 | 127.0.0.1 localhost 91 | 127.0.1.1 localhost 92 | 93 | 192.168.64.2 cluster02 94 | 192.168.64.3 cluster03 95 | 192.168.64.4 cluster04 96 | 192.168.64.5 cluster05 97 | 192.168.64.6 cluster06 98 | 192.168.64.7 cluster07 99 | 192.168.64.8 cluster08 100 | 192.168.64.9 cluster09 101 | 192.168.64.10 cluster10 102 | 192.168.64.11 cluster11 103 | 192.168.64.12 cluster12 104 | 192.168.64.13 cluster13 105 | 192.168.64.14 cluster14 106 | 192.168.64.15 cluster15 107 | 192.168.64.16 cluster16 108 | 192.168.64.17 cluster17 109 | 192.168.64.18 cluster18 110 | 192.168.64.19 cluster19 111 | 192.168.64.20 cluster20 112 | 192.168.64.21 cluster21 113 | 192.168.64.22 cluster22 114 | 192.168.64.23 cluster23 115 | 192.168.64.24 cluster24 116 | 192.168.64.25 cluster25 117 | 192.168.64.26 cluster26 118 | 192.168.64.27 cluster27 119 | 192.168.64.28 cluster28 120 | 192.168.64.29 cluster29 121 | 192.168.64.30 cluster30 122 | 192.168.64.31 cluster31 123 | 192.168.64.32 cluster32 124 | 192.168.64.33 cluster33 125 | 192.168.64.34 cluster34 126 | 192.168.64.35 cluster35 127 | 192.168.64.36 cluster36 128 | 192.168.64.37 cluster37 129 | 192.168.64.38 cluster38 130 | 192.168.64.39 cluster39 131 | 192.168.64.40 cluster40 132 | 192.168.64.41 cluster41 133 | 192.168.64.42 cluster42 134 | 192.168.64.43 cluster43 135 | 192.168.64.44 cluster44 136 | 192.168.64.45 cluster45 137 | 192.168.64.46 cluster46 138 | 192.168.64.47 cluster47 139 | 192.168.64.48 cluster48 140 | 192.168.64.49 cluster49 141 | 192.168.64.50 cluster50 142 | 192.168.64.51 cluster51 143 | 192.168.64.52 cluster52 144 | 192.168.64.53 cluster53 145 | 192.168.64.54 cluster54 146 | 192.168.64.55 cluster55 147 | 192.168.64.56 cluster56 148 | 192.168.64.57 cluster57 149 | 192.168.64.58 cluster58 150 | 192.168.64.59 cluster59 151 | 192.168.64.60 cluster60 152 | 192.168.64.61 cluster61 153 | 192.168.64.62 cluster62 154 | 192.168.64.63 cluster63 155 | 192.168.64.64 cluster64 156 | 192.168.64.65 cluster65 157 | 192.168.64.66 cluster66 158 | 192.168.64.67 cluster67 159 | 192.168.64.68 cluster68 160 | 192.168.64.69 cluster69 161 | 192.168.64.70 cluster70 162 | 192.168.64.71 cluster71 163 | 192.168.64.72 cluster72 164 | 192.168.64.73 cluster73 165 | 192.168.64.74 cluster74 166 | 192.168.64.75 cluster75 167 | 192.168.64.76 cluster76 168 | 192.168.64.77 cluster77 169 | 192.168.64.78 cluster78 170 | 192.168.64.79 cluster79 171 | 192.168.64.80 cluster80 172 | 192.168.64.81 cluster81 173 | 192.168.64.82 cluster82 174 | 192.168.64.83 cluster83 175 | 192.168.64.84 cluster84 176 | 192.168.64.85 cluster85 177 | 192.168.64.86 cluster86 178 | 192.168.64.87 cluster87 179 | 192.168.64.88 cluster88 180 | 192.168.64.89 cluster89 181 | 192.168.64.90 cluster90 182 | 192.168.64.91 cluster91 183 | 192.168.64.92 cluster92 184 | 192.168.64.93 cluster93 185 | 192.168.64.94 cluster94 186 | 192.168.64.95 cluster95 187 | 192.168.64.96 cluster96 188 | 192.168.64.97 cluster97 189 | 192.168.64.98 cluster98 190 | 192.168.64.99 cluster99 191 | 192 | # The following lines are desirable for IPv6 capable hosts 193 | ::1 ip6-localhost ip6-loopback 194 | fe00::0 ip6-localnet 195 | ff00::0 ip6-mcastprefix 196 | ff02::1 ip6-allnodes 197 | ff02::2 ip6-allrouters 198 | 199 | ``` 200 | 201 | If everything is ok, we can proceed cloning this template. 
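As a side note, the long /etc/hosts block above does not have to be typed by hand; a small loop can generate the cluster entries (a convenience sketch for the 192.168.64.0/24 UTM network used here):

```
# Append cluster02..cluster99 entries to /etc/hosts for the UTM shared network.
for i in $(seq 2 99); do
  printf '192.168.64.%d cluster%02d\n' "$i" "$i"
done | sudo tee -a /etc/hosts > /dev/null
```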
We will create 3 clones (you can create more than 3 202 | according to the amount of RAM and cores available in your laptop). 203 | 204 | You must shutdown the node to clone it, using VirtualBox interface (select VM and right click) create 3 new VMs. 205 | 206 | ``` 207 | $ sudo shutdown -h now 208 | ``` 209 | 210 | 211 | Right click on the name of the VM and clone it. The first clone will be the login/master node the other twos will be computing nodes. 212 | 213 | ## Configure the cluster 214 | 215 | Once the 2 machines has been cloned we can bootstrap the login/master node and configure it. 216 | Add a new network adapter on each machine: enable "Adapter 2" "Attached to" internal network and name it "clustervimnet" 217 | 218 | ### Login/master node 219 | 220 | Bootstrap the VM. 221 | 222 | In the example below the interface is enp0s1, to find your own one: 223 | 224 | ``` 225 | $ ip link show 226 | 1: lo: mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 227 | link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 228 | 2: enp0s1: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 229 | link/ether 0e:2d:57:48:8b:90 brd ff:ff:ff:ff:ff:ff 230 | ``` 231 | 232 | You are interested to link 2. Link 2 is the NAT device with IP assigned dynamucally by the UTM server. 233 | 234 | Now we configure the adapter. 235 | 236 | ``` 237 | $ hostname -I 238 | 192.168.64.8 fdc4:f46c:bb17:931b:c2d:57ff:fe48:8b90 239 | ``` 240 | 241 | Edit the /etc/hosts files to assign the machine with this IP address the name login 242 | ``` 243 | $vim /etc/hosts 244 | 245 | 127.0.0.1 localhost 246 | 127.0.1.1 localhost 247 | 248 | 192.168.64.2 cluster02 249 | 192.168.64.3 cluster03 250 | 192.168.64.4 cluster04 251 | 192.168.64.5 cluster05 252 | 192.168.64.6 cluster06 253 | 192.168.64.7 cluster07 254 | 192.168.64.8 login 255 | 192.168.64.9 cluster09 256 | 192.168.64.10 cluster10 257 | ``` 258 | 259 | 260 | 261 | We change the hostname: 262 | ``` 263 | $ sudo vim /etc/hostname 264 | 265 | login 266 | ``` 267 | 268 | Shutdown the VM. 269 | 270 | ### Accessing the login/master node 271 | UTM allows to access the VM from the host directly. 272 | A bridged interface is created on the Host at the UTM installation time. For example on the host machine: 273 | 274 | ``` 275 | HOSTMACHINE_NAME ~ % ifconfig 276 | bridge100: flags=8a63 mtu 1500 277 | options=3 278 | ether 6e:7e:67:eb:5b:64 279 | inet 192.168.64.1 netmask 0xffffff00 broadcast 192.168.64.255 280 | inet6 fe80::6c7e:67ff:feeb:5b64%bridge100 prefixlen 64 scopeid 0x16 281 | inet6 fdc4:f46c:bb17:931b:1099:c71:e430:9090 prefixlen 64 autoconf secured 282 | Configuration: 283 | id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0 284 | maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200 285 | root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0 286 | ipfilter disabled flags 0x0 287 | member: vmenet0 flags=3 288 | ifmaxaddr 0 port 21 priority 0 path cost 0 289 | member: vmenet1 flags=3 290 | ifmaxaddr 0 port 26 priority 0 path cost 0 291 | nd6 options=201 292 | media: autoselect 293 | status: active 294 | ``` 295 | 296 | To enable ssh from host to guest VM you need to use ssh from the host machine directly 297 | 298 | ``` 299 | ssh user01@192.168.64.8 300 | ``` 301 | 302 | but you will have to enter the password. 303 | If you want a passwordless access you need to generate a ssh key or use an ssh key if you already have it. 304 | 305 | If you don’t have public/private key pair already, run ssh-keygen and agree to all defaults. 
306 | This will create id_rsa (private key) and id_rsa.pub (public key) in ~/.ssh directory. 307 | 308 | Copy host public key to your VM (adapt the ip address to the ip address of the virtual machine): 309 | 310 | ``` 311 | scp ~/.ssh/id_rsa.pub user01@192.168.64.8:~ 312 | ``` 313 | 314 | Connect to the VM and add host public key to ~/.ssh/authorized_keys: 315 | 316 | ``` 317 | ssh -p user01@192.168.64.8 318 | mkdir ~/.ssh 319 | chmod 700 ~/.ssh 320 | cat ~/id_rsa.pub >> ~/.ssh/authorized_keys 321 | chmod 644 ~/.ssh/authorized_keys 322 | exit 323 | ``` 324 | 325 | Now you should be able to ssh to the VM without password. 326 | 327 | 328 | ### Login Node services 329 | 330 | We install a DNSMASQ server to dynamically assign the IP and hostname to the other nodes on the internal interface and create a cluster [1]. 331 | 332 | ``` 333 | $ sudo systemctl disable systemd-resolved 334 | Removed /etc/systemd/system/multi-user.target.wants/systemd-resolved.service. 335 | Removed /etc/systemd/system/dbus-org.freedesktop.resolve1.service. 336 | $ sudo systemctl stop systemd-resolved 337 | ``` 338 | Then 339 | 340 | ``` 341 | $ ls -lh /etc/resolv.conf 342 | lrwxrwxrwx 1 root root 39 Jul 26 2018 /etc/resolv.conf ../run/systemd/resolve/resolv.conf 343 | $ sudo unlink /etc/resolv.conf 344 | ``` 345 | Create a new resolv.conf file and add public DNS servers you wish. In my case am going to use google DNS. 346 | 347 | ``` 348 | $ sudo echo nameserver 192.168.64.1 | sudo tee /etc/resolv.conf 349 | ``` 350 | 351 | 352 | Install dnsmasq 353 | 354 | ``` 355 | $ sudo apt install dnsmasq -y 356 | ``` 357 | 358 | To find and configuration file for Dnsmasq, navigate to /etc/dnsmasq.conf. Edit the file by modifying it with your desired configs. Below is minimal configurations for it to run and support minimum operations. 359 | 360 | ``` 361 | $ vim /etc/dnsmasq.conf 362 | 363 | port=53 364 | bogus-priv 365 | strict-order 366 | expand-hosts 367 | ``` 368 | 369 | 370 | When done with editing the file, close it and restart Dnsmasq to apply the changes. 371 | ``` 372 | $ sudo systemctl restart dnsmasq 373 | ``` 374 | 375 | Add localhost as DNS: 376 | ``` 377 | $ sudo vim etc/resolv.conf 378 | nameserver 192.168.64.1 379 | nameserver 127.0.0.1 380 | ``` 381 | 382 | Check if it is working 383 | 384 | ``` 385 | $ host cluster88 386 | 387 | ``` 388 | 389 | To build a cluster we aldo need a distributed filesystem accessible from all nodes. 390 | We use NFS. 391 | 392 | ``` 393 | $ sudo apt install nfs-kernel-server 394 | $ sudo mkdir /shared 395 | $ sudo chmod 777 /shared 396 | ``` 397 | 398 | Modify the NFS config file: 399 | 400 | ``` 401 | $ sudo vim /etc/exports 402 | 403 | /shared/ 192.168.64.0/255.255.255.0(rw,sync,no_root_squash,no_subtree_check) 404 | 405 | ``` 406 | Restart the server 407 | 408 | ``` 409 | $ sudo systemctl enable nfs-kernel-server 410 | $ sudo systemctl restart nfs-kernel-server 411 | ``` 412 | 413 | ### Computing nodes 414 | Clone the template to create a new VM. Edit the VM and assign a new name (e.g. Cluster02). 415 | Randomize the mac address of the new machine otherwhise it will be the same of the other VM. 416 | ``` 417 | Network-> MAC Address -> Random 418 | ``` 419 | Save. 420 | 421 | Bootstrap the machine. 
422 | 423 | We change the hostname to empty: 424 | ``` 425 | $ sudo vim /etc/hostname 426 | 427 | 428 | ``` 429 | 430 | Set the proper dns server (assigned with dhcp): 431 | 432 | ``` 433 | $ sudo rm /etc/resolv.conf 434 | ``` 435 | 436 | then point the DNS to the ip address of the login node (in this case 192.168.64.8) 437 | 438 | ``` 439 | $ sudo vim /etc/resolv.conf 440 | nameserver 192.168.64.8 441 | ``` 442 | 443 | Configure hostname at boot 444 | 445 | ``` 446 | $ sudo vim /etc/rc.local 447 | #!/bin/sh -e 448 | # 449 | # rc.local 450 | # 451 | # This script is executed at the end of each multiuser runlevel. 452 | # Make sure that the script will "exit 0" on success or any other 453 | # value on error. 454 | # 455 | # In order to enable or disable this script just change the execution 456 | # bits. 457 | # 458 | # By default this script does nothing. 459 | 460 | IPADDR=`hostname -I | awk '{print$1}'` 461 | HOSTNAME=`dig -x $IPADDR +short | sed 's/.$//'` 462 | /bin/hostname $HOSTNAME 463 | 464 | exit 0 465 | ``` 466 | ``` 467 | $ sudo chmod +x /etc/rc.local 468 | ``` 469 | 470 | 471 | Reboot the machine. 472 | At reboot you will see that the machine will have a new ip address 473 | 474 | ``` 475 | $ hostname -I 476 | 192.168.64.12 477 | ``` 478 | Install dnsmasq (trick necessary to install the cluster later) 479 | ``` 480 | $ sudo apt install dnsmasq -y 481 | $ sudo systemctl disable dnsmasq 482 | ``` 483 | 484 | Now, from the cluster01 you will be able to connect to cluster02 machine with ssh. 485 | 486 | ``` 487 | $ ssh user01@cluster12 488 | user01@cluster12:~$ 489 | 490 | ``` 491 | To access the new machine without password you can proceed described above. Run ssh-keygen and agree to all defaults. 492 | This will create id_rsa (private key) and id_rsa.pub (public key) in ~/.ssh directory. 493 | 494 | Copy host public key to your VM: 495 | 496 | ``` 497 | scp ~/.ssh/id_rsa.pub user01@cluster03:~ 498 | ``` 499 | 500 | Connect to the VM and add host public key to ~/.ssh/authorized_keys: 501 | 502 | ``` 503 | ssh user01@cluster12 504 | mkdir ~/.ssh 505 | chmod 700 ~/.ssh 506 | cat ~/id_rsa.pub >> ~/.ssh/authorized_keys 507 | chmod 644 ~/.ssh/authorized_keys 508 | exit 509 | ``` 510 | 511 | Configure the shared filesystem 512 | 513 | ``` 514 | $ sudo apt install nfs-common 515 | $ sudo mkdir /shared 516 | ``` 517 | 518 | Mount the shared directory adn test it 519 | ``` 520 | $ sudo mount 192.168.64.8:/shared /shared 521 | $ touch /shared/pippo 522 | ``` 523 | If everything will be ok you will see the "pippo" file in all the nodes. 524 | 525 | To authomatically mount at boot edit the /etc/fstab file: 526 | 527 | ``` 528 | $ sudo vim /etc/fstab 529 | 530 | ``` 531 | 532 | Append the following line at the end of the file 533 | 534 | ``` 535 | 192.168.64.8:/shared /shared nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0 536 | ``` 537 | 538 | 539 | 540 | ## Install a SLURM based cluster 541 | Here I will describe a simple configuration of the slurm management tool for launching jobs in a really simplistic Virtual cluster. I will assume the following configuration: a main node (cluster01) and 3 compute nodes (cluster03 ... VMs). 
I also assume there is ping access between the nodes and some sort of mechanism for you to know the IP of each node at all times (most basic should be a local NAT with static IPs) 542 | 543 | Slurm management tool work on a set of nodes, one of which is considered the master node, and has the slurmctld daemon running; all other compute nodes have the slurmd daemon. 544 | 545 | All communications are authenticated via the munge service and all nodes need to share the same authentication key. Slurm by default holds a journal of activities in a directory configured in the slurm.conf file, however a Database management system can be set. All in all what we will try to do is: 546 | 547 | * Install munge in all nodes and configure the same authentication key in each of them 548 | * Install gcc, openmpi and configure them 549 | * Configure the slurmctld service in the master node 550 | * Configure the slurmd service in the compute nodes 551 | * Create a basic file structure for storing jobs and jobs result that is equal in all the nodes of the cluster 552 | * Manipulate the state of the nodes, and learn to resume them if they are down 553 | * Run some simple jobs as test 554 | * Set up MPI task on the cluster 555 | 556 | ### Install gcc and OpenMPI 557 | 558 | ``` 559 | $ sudo apt install gcc-12 openmpi-bin openmpi-common 560 | ``` 561 | Configure gcc-12 as default: 562 | ``` 563 | $ sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100 564 | ``` 565 | Test the installation: 566 | 567 | ``` 568 | $ gcc --version 569 | gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 570 | Copyright (C) 2022 Free Software Foundation, Inc. 571 | This is free software; see the source for copying conditions. There is NO 572 | warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 573 | ``` 574 | and MPI 575 | 576 | ``` 577 | $ mpicc --version 578 | gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 579 | Copyright (C) 2022 Free Software Foundation, Inc. 580 | This is free software; see the source for copying conditions. There is NO 581 | warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 582 | ``` 583 | 584 | ### Install MUNGE 585 | 586 | Lets start installing munge authentication tool using the system package manager, for cluster01 and cluster03: 587 | 588 | ``` 589 | $ sudo apt-get install -y libmunge-dev libmunge2 munge 590 | ``` 591 | 592 | munge requires that we generate a key file for testing authentication, for this we use the dd utility, with the fast pseudo-random device /dev/urandom. At cluster01 node do: 593 | ``` 594 | $ sudo dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key 595 | $ chown munge:munge /etc/munge/munge.key 596 | $ chmod 400 /etc/munge/munge.key 597 | ``` 598 | 599 | Copy the key on cluster03 with scp and chech also on cluster03 the file permissions and user. 
600 | 601 | ``` 602 | scp /etc/munge/munge.key user01@cluster03:/etc/munge/munge.key 603 | ``` 604 | Test communication with, locally and remotely with these commands respectively: 605 | 606 | ``` 607 | $ munge -n | unmunge 608 | $ munge -n | ssh cluster03 unmunge 609 | ``` 610 | ### Install Slurm 611 | 612 | ``` 613 | $ sudo apt-get install -y slurmd slurmctld 614 | ``` 615 | 616 | Copy the slurm configuration file of GIT repository to '/etc/slurm' directory of in cluster01 and cluster03 617 | 618 | On login 619 | 620 | ``` 621 | $ sudo systemctl enable slurmctld 622 | $ sudo systemctl start slurmctld 623 | 624 | $ sudo systemctl enable slurmd 625 | $ sudo systemctl start slurmd 626 | ``` 627 | 628 | On cluster12 629 | ``` 630 | $ sudo systemctl disable slurmctld 631 | $ sudo systemctl stop slurmctld 632 | 633 | $ sudo systemctl enable slurmd 634 | $ sudo systemctl start slurmd 635 | ``` 636 | 637 | Now you can test the envoroment: 638 | 639 | ``` 640 | $ sinfo 641 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 642 | debug* up infinite 1 idle cluster12 643 | debug* up infinite 2 unk* cluster[04-05] 644 | ``` 645 | 646 | Test a job: 647 | 648 | ``` 649 | $ srun hostname 650 | cluste03 651 | ``` 652 | 653 | ### Clone the node 654 | 655 | Shutdown cluster03 and clone it to cluster04 and then start the two VMs. 656 | 657 | Afther startup you sould find 658 | 659 | ``` 660 | $ sinfo 661 | PARTITION AVAIL TIMELIMIT NODES STATE NODELIST 662 | debug* up infinite 2 idle cluster[03-04] 663 | debug* up infinite 1 unk* cluster05 664 | ``` 665 | 666 | 667 | ## Testing VMM enviroment 668 | 669 | ### Test VM performance 670 | We propose 3 different tests to study VMs performances and VMM capabilities. 671 | 672 | * Test the network isolation cloning a new VM from template then change the name of the "Internal Netwok" to "testnetwork", then try to access to the new VM from the cluster 673 | * Using HPCC compare the host and the guest performance in terms of CPU flops and memory bandwidth 674 | * Using iozone compare host IO performance with guest IO performance 675 | 676 | For windows based system it may be necessary to install Linux subsystem. 677 | 678 | 679 | 680 | Install and use hpcc: 681 | 682 | ``` 683 | sudo apt install hpcc 684 | ``` 685 | 686 | HPCC is a suite of benchmarks that measure performance of processor, 687 | memory subsytem, and the interconnect. For details refer to the 688 | HPC~Challenge web site (\url{http://icl.cs.utk.edu/hpcc/}.) 689 | 690 | In essence, HPC Challenge consists of a number of tests each 691 | of which measures performance of a different aspect of the system: HPL, STREAM, DGEMM, PTRANS, RandomAccess, FFT. 692 | 693 | If you are familiar with the High Performance Linpack (HPL) benchmark 694 | code (see the HPL web site: http://www.netlib.org/benchmark/hpl/}) then you can reuse the input 695 | file that you already have for HPL. 696 | See http://www.netlib.org/benchmark/hpl/tuning.html for a description of this file and its parameters. 
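Once you have such an input file, hpcc expects it to be named hpccinf.txt in the working directory and writes its results to hpccoutf.txt. A minimal run from the shared filesystem might look like this (a sketch, assuming two MPI ranks and that the Ubuntu hpcc package puts the hpcc binary in the PATH):

```
# Run the HPC Challenge suite with 2 MPI ranks from a directory visible to all nodes.
mkdir -p /shared/hpcc-run && cd /shared/hpcc-run
cp ~/hpccinf.txt .        # your tuned HPL-style input file (hypothetical location)
mpirun -np 2 hpcc
less hpccoutf.txt
```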
697 | You can use the following sites for finding the appropriate values: 698 | 699 | * Tweak HPL parameters: https://www.advancedclustering.com/act_kb/tune-hpl-dat-file/ 700 | * HPL Calculator: https://hpl-calculator.sourceforge.net/ 701 | 702 | The main parameters to play with for optimizing the HPL runs are: 703 | 704 | * NB: depends on the CPU architecture, use the recommended blocking sizes (NB in HPL.dat) listed after loading the toolchain/intel module under $EBROOTIMKL/compilers_and_libraries/linux/mkl/benchmarks/mp_linpack/readme.txt, i.e 705 | * NB=192 for the broadwell processors available on iris 706 | * NB=384 on the skylake processors available on iris 707 | * P and Q, knowing that the product P x Q SHOULD typically be equal to the number of MPI processes. 708 | * Of course N the problem size. 709 | 710 | To run the HPCC benchmark, first create the HPL input file and then simply exeute the hpcc command from cli. 711 | 712 | Install and use IOZONE: 713 | 714 | ``` 715 | $ sudo apt istall iozone 716 | ``` 717 | 718 | IOzone performs the following 13 types of test. If you are executing iozone test on a database server, you can focus on the 1st 6 tests, as they directly impact the database performance. 719 | 720 | * Read – Indicates the performance of reading a file that already exists in the filesystem. 721 | * Write – Indicates the performance of writing a new file to the filesystem. 722 | * Re-read – After reading a file, this indicates the performance of reading a file again. 723 | * Re-write – Indicates the performance of writing to an existing file. 724 | * Random Read – Indicates the performance of reading a file by reading random information from the file. i.e this is not a sequential read. 725 | * Random Write – Indicates the performance of writing to a file in various random locations. i.e this is not a sequential write. 726 | * Backward Read 727 | * Record Re-Write 728 | * Stride Read 729 | * Fread 730 | * Fwrite 731 | * Freread 732 | * Frewrite 733 | 734 | IOZONE can be run in parallel over multiple threads, and use different output files size to stress performance. 735 | 736 | ``` 737 | $ ./iozone -a -b output.xls 738 | ``` 739 | 740 | Executes all stests and create an XLS output to simplify the analysis of the results. 741 | 742 | Here you will find an introduction to IOZONE with some examples: https://www.cyberciti.biz/tips/linux-filesystem-benchmarking-with-iozone.html 743 | 744 | 745 | 746 | ### Test slurm cluster 747 | 748 | * Run a simple MPI program on the cluster 749 | * Run an interactive job 750 | * Use the OSU ping pong benchmark to test the VM interconnect. 751 | 752 | Install OSU MPI benchmarks: download the latest tarball from http://mvapich.cse.ohio-state.edu/benchmarks/. 
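For the OSU ping-pong test listed above, once the benchmarks have been built and installed under /shared/OSU as shown in the steps below, the latency test can be submitted through Slurm. A sketch of a batch script (the libexec path is an assumption and may differ between OSU releases; depending on how the Ubuntu OpenMPI package integrates with Slurm you may need srun instead of mpirun):

```
#!/bin/bash
#SBATCH --job-name=osu_latency
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=osu_latency.out

# Ping-pong latency between two ranks placed on two different nodes.
mpirun -np 2 /shared/OSU/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency
```

Submit it with sbatch from a directory on /shared so that the output file is visible from the login node.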
753 | 754 | 755 | ``` 756 | $ tar zxvf osu-micro-benchmarks-7.3.tar.gz 757 | 758 | $ cd osu-micro-benchmarks-7.3/ 759 | 760 | $ sudo apt install make g++-12 761 | 762 | $ sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100 763 | $ sudo update-alternatives --install /usr/bin/c++ c++ /usr/bin/g++-12 100 764 | $ ./configure CC=/usr/bin/mpicc CXX=/usr/bin/mpicxx --prefix=/shared/OSU/ 765 | $ make 766 | $ make install 767 | ``` 768 | 769 | ## References 770 | 771 | [1] Configure DNSMASQ https://computingforgeeks.com/install-and-configure-dnsmasq-on-ubuntu/?expand_article=1 772 | 773 | [2] Configure NFS Mounts https://www.howtoforge.com/how-to-install-nfs-server-and-client-on-ubuntu-22-04/ 774 | 775 | [3] Configure network with netplan https://linuxconfig.org/netplan-network-configuration-tutorial-for-beginners 776 | 777 | [4] SLURM Quick Start Administrator Guide https://slurm.schedmd.com/quickstart_admin.html 778 | 779 | [5] Simple SLURM configuration on Debian systems: https://gist.github.com/asmateus/301b0cb86700cbe74c269b27f2ecfbef 780 | 781 | [6] HPC Challege https://hpcchallenge.org/hpcc/ 782 | 783 | [7] IO Zone Benchmarks https://www.iozone.org/ 784 | -------------------------------------------------------------------------------- /Tutorials/VirtualMachine/slurm.conf: -------------------------------------------------------------------------------- 1 | ClusterName=virtual 2 | SlurmctldHost=cluster01 3 | ProctrackType=proctrack/linuxproc 4 | 5 | ReturnToService=2 6 | 7 | SlurmctldPidFile=/run/slurmctld.pid 8 | 9 | SlurmdPidFile=/run/slurmd.pid 10 | 11 | SlurmdSpoolDir=/var/lib/slurm/slurmd 12 | 13 | StateSaveLocation=/var/lib/slurm/slurmctld 14 | 15 | SlurmUser=slurm 16 | 17 | TaskPlugin=task/none 18 | 19 | SchedulerType=sched/backfill 20 | 21 | SelectType=select/cons_tres 22 | 23 | SelectTypeParameters=CR_Core_Memory 24 | 25 | AccountingStorageType=accounting_storage/none 26 | 27 | JobCompType=jobcomp/none 28 | 29 | JobAcctGatherType=jobacct_gather/none 30 | 31 | SlurmctldDebug=info 32 | 33 | SlurmctldLogFile=/var/log/slurm/slurmctld.log 34 | 35 | SlurmdDebug=info 36 | 37 | SlurmdLogFile=/var/log/slurm/slurmd.log 38 | 39 | NodeName=cluster02 NodeAddr=192.168.64.7 CPUs=6 RealMemory=3800 40 | 41 | # PartitionName ################################################################ 42 | # 43 | # Name by which the partition may be referenced (e.g. "Interactive"). This 44 | # name can be specified by users when submitting jobs. If the PartitionName is 45 | # "DEFAULT", the values specified with that record will apply to subsequent 46 | # partition specifications unless explicitly set to other values in that 47 | # partition record or replaced with a different set of default values. Each 48 | # line where PartitionName is "DEFAULT" will replace or add to previous default 49 | # values and not a reinitialize the default values. 
50 | 
51 | PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
--------------------------------------------------------------------------------