Installation of a Slurm Cluster
===============================

A short installation guide for a Slurm cluster that runs each component on a separate node, with an additional backup controller and a database daemon.


Hosts Overview
--------------

The participating hosts and their functions within the cluster.

Function                      |Host                         |FQDN
------------------------------|-----------------------------|-----------------------------
Slurm Primary Controller      |lxcm01                       |lxcm01.devops.test
Slurm Backup Controller       |lxcm02                       |lxcm02.devops.test
Slurm Primary Database Daemon |lxcc01                       |lxcc01.devops.test
Slurm Backup Database Daemon  |lxcc02                       |lxcc02.devops.test
MySQL Database                |lxdb01                       |lxdb01.devops.test
Slurm Worker                  |lxdev01                      |lxdev01.devops.test
Slurm Worker                  |lxdev02                      |lxdev02.devops.test


Installation and Configuration
------------------------------

We set up the MySQL database first, then continue with the Slurm database daemon, the controllers, and finally the worker nodes.

### MySQL Database

##### Installation and Configuration

Install the MySQL server package:
```
apt-get install mysql-server
```

Edit the MySQL server configuration file under: **/etc/mysql/my.cnf**
Comment out the following line to allow remote access from hosts other than localhost (while it is active, the server listens on the loopback interface only):
```
bind-address = 127.0.0.1
```

Restart the MySQL server:
```
systemctl restart mysql
```

Check the log file to confirm that the MySQL server started successfully:
```
less /var/log/mysql/error.log
```

##### Creating a Slurm Database User

The Slurm database user is allowed to connect from the primary and backup Slurm database daemon hosts only.
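
The per-host statements in this section differ only in the host name. As a minimal sketch, a small shell helper can print them for any list of database daemon hosts, so they can be reviewed before being passed to the `mysql` client. The function name `slurm_db_sql` is hypothetical and not part of Slurm or MySQL; the user name and database name are taken from this guide.

```shell
# Hypothetical helper: print the SQL that creates the 'slurm' user for each
# given host and grants it access to the accounting database.
# Usage: slurm_db_sql PASSWORD HOST [HOST...]
slurm_db_sql() {
    password="$1"
    shift
    # One CREATE USER statement per database daemon host
    for host in "$@"; do
        printf "CREATE USER 'slurm'@'%s' IDENTIFIED BY '%s';\n" "$host" "$password"
    done
    # One GRANT statement per database daemon host
    for host in "$@"; do
        printf "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'%s';\n" "$host"
    done
    printf 'FLUSH PRIVILEGES;\n'
}
```

For example, `slurm_db_sql 12345678 lxcc01.devops.test lxcc02.devops.test` prints the statements for the two hosts of this guide; how the output is fed to MySQL (for instance through a pipe into `mysql -u root -p`) depends on how the root account is set up on lxdb01.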

Create the Slurm database user:
```
CREATE USER 'slurm'@'lxcc01.devops.test' IDENTIFIED BY '12345678';
CREATE USER 'slurm'@'lxcc02.devops.test' IDENTIFIED BY '12345678';
```

Grant the Slurm database user access to the Slurm accounting database:
```
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'lxcc01.devops.test';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'lxcc02.devops.test';
```

Finally, flush the privileges:
```
FLUSH PRIVILEGES;
```

### Setting up a Slurm Database Daemon

##### Installing the Munge Authentication Service
[Set up the Munge authentication service](#munge)

##### Installing the Slurm Database Daemon

Install the Slurm database daemon package:
```
apt-get install slurmdbd
```

Create the Slurm database daemon configuration file under: **/etc/slurm-llnl/slurmdbd.conf**
```
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=lxcc01
DbdBackupHost=lxcc02
StorageHost=lxdb01
StorageLoc=slurm_acct_db
StoragePass=12345678
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
```

Set the owner and group of the configuration file to slurm:
```
chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
```

Start the Slurm database daemon:
```
systemctl start slurmdbd
```

Check the log file to confirm that the Slurm database daemon started successfully and reported no errors:
```
less /var/log/slurm-llnl/slurmdbd.log
```

### Slurm Primary Controller

##### Installing the Munge Authentication Service
[Set up the Munge authentication service](#munge)

##### Installing the Slurm Primary Controller

Install the Slurm controller package:
```
apt-get install slurmctld
```

##### Set up the Slurm Controller/Worker configuration file
[Set up the Slurm configuration file](#slurm_conf)

##### Set up the checkpoint directories for the primary controller
[Set up the checkpoint directories](#setup_checkdirs)

##### Starting the Slurm Primary Controller
Start the Slurm controller daemon:
```
systemctl start slurmctld
```

Check the log file to confirm that the Slurm controller started successfully:
```
less /var/log/slurm-llnl/slurmctld.log
```

##### Setting up the logical cluster
Register the cluster and an account in the accounting database:
```
sacctmgr -i add cluster snowflake
sacctmgr -i add account slurm Cluster=snowflake Description='none' Organization='none'
```

### Slurm Backup Controller

##### Installing the Munge Authentication Service
[Set up the Munge authentication service](#munge)

##### Installing the Slurm Backup Controller

Install the Slurm controller package:
```
apt-get install slurmctld
```

##### Set up the Slurm Controller/Worker configuration file
[Set up the Slurm configuration file](#slurm_conf)

##### Set up the checkpoint directories for the backup controller
[Set up the checkpoint directories](#setup_checkdirs)

##### Starting the Slurm Backup Controller
Start the Slurm controller daemon:
```
systemctl start slurmctld
```

Check the log file to confirm that the Slurm controller started successfully:
```
less /var/log/slurm-llnl/slurmctld.log
```

### Slurm Worker

##### Installing the Munge Authentication Service
[Set up the Munge authentication service](#munge)

##### Installing the Slurm Worker

Install the Slurm worker package:
```
apt-get install slurmd
```

##### Set up the Slurm Controller/Worker configuration file
[Set up the Slurm configuration file](#slurm_conf)

Start the Slurm worker daemon:
```
systemctl start slurmd
```

Check the log file to confirm that the Slurm worker daemon started successfully:
```
less /var/log/slurm-llnl/slurmd.log
```


Common Installation and Configuration Tasks
-------------------------------------------
This section describes installation and configuration tasks that have to be repeated on several hosts.

### Munge Authentication Service

Install the Munge package:
```
apt-get install munge
```

Provide a shared key under: **/etc/munge/munge.key**
The same key must be present on every host in the cluster.
After setting the shared key, restart the Munge service to activate the key:
```
systemctl restart munge
```

### Slurm Controller and Worker Configuration File

Create the Slurm Controller/Worker configuration file under: **/etc/slurm-llnl/slurm.conf**
```
# MANAGEMENT POLICIES
ControlMachine=lxcm01
BackupController=lxcm02
AuthType=auth/munge
CryptoType=crypto/munge
SlurmUser=slurm

# NODE CONFIGURATIONS
NodeName=lxdev0[1-4]

# PARTITION CONFIGURATIONS
PartitionName=debug Nodes=lxdev0[1-4] Default=YES

# ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=lxcc01
AccountingStorageBackupHost=lxcc02
JobAcctGatherType=jobacct_gather/linux
ClusterName=snowflake

# CONNECTION
SlurmctldPort=6817
SlurmdPort=6818

# DIRECTORIES
JobCheckpointDir=/var/lib/slurm-llnl/job_checkpoint
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
StateSaveLocation=/var/lib/slurm-llnl/state_checkpoint

# LOGGING
SlurmctldDebug=debug
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=debug
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# STATE INFO
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

# SCHEDULING
FastSchedule=2

# ERROR RECOVERY
ReturnToService=1
```

Set the owner and group of the configuration file to slurm:
```
chown slurm:slurm /etc/slurm-llnl/slurm.conf
```

### Setting up the checkpoint directories shared between the Slurm Controllers

The directories specified by the parameters **JobCheckpointDir** and **StateSaveLocation** in the configuration file **/etc/slurm-llnl/slurm.conf** are shared here via an NFS mount, so that the primary and backup controllers see the same directory contents.

##### Creating the checkpoint directories

Create the checkpoint directories on _both_ controllers:
```
cd /var/lib/slurm-llnl/

mkdir job_checkpoint
mkdir state_checkpoint

chown slurm:slurm job_checkpoint
chown slurm:slurm state_checkpoint
```

##### Setting up the primary controller as NFS server
The primary controller acts as the NFS server and shares the checkpoint directories with the backup controller.

Install the required package:
```
apt-get install nfs-kernel-server
```

Add the following entries to the exports file **/etc/exports**:
```
/var/lib/slurm-llnl/job_checkpoint lxcm02.devops.test(rw,sync)
/var/lib/slurm-llnl/state_checkpoint lxcm02.devops.test(rw,sync)
```

Restart the NFS server:
```
systemctl restart nfs-kernel-server
```

##### Setting up the backup controller

Install the required package:
```
apt-get install nfs-common
```

Mount the checkpoint directories:
```
sudo mount -t nfs -o rw,nosuid lxcm01.devops.test:/var/lib/slurm-llnl/job_checkpoint /var/lib/slurm-llnl/job_checkpoint
sudo mount -t nfs -o rw,nosuid lxcm01.devops.test:/var/lib/slurm-llnl/state_checkpoint /var/lib/slurm-llnl/state_checkpoint
```
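
Before starting the backup controller, it can help to confirm that both checkpoint directories are really NFS mounts. The following is a minimal sketch under the assumption of a Linux host that exposes its mount table in `/proc/mounts`. The function names `is_nfs_mounted` and `check_checkpoint_mounts` are hypothetical, and the optional mount table argument exists only so the check can be tried against a sample file.

```shell
# Hypothetical helper: succeed if DIR appears in the mount table with a
# file system type starting with "nfs" (nfs, nfs4).
# Usage: is_nfs_mounted DIR [MOUNT_TABLE]
is_nfs_mounted() {
    awk -v dir="$1" '$2 == dir && $3 ~ /^nfs/ { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical helper: report the state of both checkpoint directories
# from this guide and return non-zero if either one is not NFS-mounted.
check_checkpoint_mounts() {
    table="${1:-/proc/mounts}"
    status=0
    for dir in /var/lib/slurm-llnl/job_checkpoint /var/lib/slurm-llnl/state_checkpoint; do
        if is_nfs_mounted "$dir" "$table"; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_checkpoint_mounts` on lxcm02 after the mount commands should print one `ok:` line per directory; a `missing:` line means the corresponding mount did not succeed.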