└── README.md
/README.md:
--------------------------------------------------------------------------------
1 |
2 | Installation of a Slurm Cluster
3 | ===============================
4 |
5 | A short installation guide for a Slurm cluster running each component on a different node with an additional backup controller and a database daemon.
6 |
7 |
8 | Hosts Overview
9 | --------------
10 |
11 | The participating hosts and their functions within the cluster:
12 |
13 | Function |Host |FQDN
14 | ------------------------------|-----------------------------|-----------------------------
15 | Slurm Primary Controller |lxcm01 |lxcm01.devops.test
16 | Slurm Backup Controller |lxcm02 |lxcm02.devops.test
17 | Slurm Primary Database Daemon |lxcc01 |lxcc01.devops.test
18 | Slurm Backup Database Daemon |lxcc02 |lxcc02.devops.test
19 | MySQL Database |lxdb01 |lxdb01.devops.test
20 | Slurm Worker |lxdev01 |lxdev01.devops.test
21 | Slurm Worker |lxdev02 |lxdev02.devops.test
22 |
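Slurm refers to the hosts above by name, so every node should be able to resolve every other node. A minimal sketch of a pre-flight check, assuming `getent` is available on the nodes; the helper name `unresolved_hosts` is hypothetical:

```shell
# Hypothetical helper: print every host name that does not resolve.
# Returns non-zero if at least one name fails to resolve.
unresolved_hosts() {
    status=0
    for host in "$@"; do
        if ! getent hosts "$host" >/dev/null; then
            echo "$host"
            status=1
        fi
    done
    return "$status"
}

# Example, run on each node of the cluster:
#   unresolved_hosts lxcm01 lxcm02 lxcc01 lxcc02 lxdb01 lxdev01 lxdev02
```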
23 |
24 | Installation and Configuration
25 | ------------------------------
26 |
27 | We set up the MySQL database first, then the Slurm database daemon, the controllers, and finally the worker nodes.
28 |
29 | ### MySQL Database
30 |
31 | ##### Installation and Configuration
32 |
33 | Install the MySQL server package:
34 | ```
35 | apt-get install mysql-server
36 | ```
37 |
38 | Edit the MySQL server configuration file: **/etc/mysql/my.cnf**
39 | Comment out the following line so that the server accepts connections from hosts other than localhost:
40 | ```
41 | bind-address = 127.0.0.1
42 | ```
43 |
44 | Restart the MySQL service:
45 | ```
46 | systemctl restart mysql
47 | ```
48 |
49 | Check the log file to verify that the MySQL database started successfully:
50 | ```
51 | less /var/log/mysql/error.log
52 | ```
53 |
54 | ##### Creating a Slurm Database User
55 |
56 | The Slurm database user is allowed to connect only from the primary and backup Slurm database daemon hosts.
57 |
58 | Create the Slurm database user:
59 | ```
60 | CREATE USER 'slurm'@'lxcc01.devops.test' IDENTIFIED BY '12345678';
61 | CREATE USER 'slurm'@'lxcc02.devops.test' IDENTIFIED BY '12345678';
62 | ```
63 |
64 | Grant the Slurm database user access to the Slurm accounting database:
65 | ```
66 | GRANT ALL ON slurm_acct_db.* TO 'slurm'@'lxcc01.devops.test';
67 | GRANT ALL ON slurm_acct_db.* TO 'slurm'@'lxcc02.devops.test';
68 | ```
69 |
70 | Finally, flush the privileges:
71 | ```
72 | FLUSH PRIVILEGES;
73 | ```
74 |
75 | ### Setting up a Slurm Database Daemon
76 |
77 | ##### Installing the Munge Authentication Service
78 | [Set up the Munge authentication service](#munge)
79 |
80 | ##### Installing the Slurm Database Daemon
81 |
82 | Install the Slurm database daemon package:
83 | ```
84 | apt-get install slurmdbd
85 | ```
86 |
87 | Create the Slurm database daemon configuration file under: **/etc/slurm-llnl/slurmdbd.conf**
88 | ```
89 | AuthType=auth/munge
90 | AuthInfo=/var/run/munge/munge.socket.2
91 | DbdHost=lxcc01
92 | DbdBackupHost=lxcc02
93 | StorageHost=lxdb01
94 | StorageLoc=slurm_acct_db
95 | StoragePass=12345678
96 | StorageType=accounting_storage/mysql
97 | StorageUser=slurm
98 | LogFile=/var/log/slurm-llnl/slurmdbd.log
99 | PidFile=/var/run/slurm-llnl/slurmdbd.pid
100 | SlurmUser=slurm
101 | ```
102 |
103 | Set the owner and group of the configuration file to slurm:
104 | ```
105 | chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
106 | ```
107 |
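Because **slurmdbd.conf** contains the database password (`StoragePass`) in plain text, it is reasonable to make the file readable by its owner only. A minimal sketch; the helper name `restrict_to_owner` is hypothetical:

```shell
# Hypothetical helper: allow only the file owner to read and write the file.
restrict_to_owner() {
    chmod 600 "$1"
}

# Example, run as root on the database daemon hosts:
#   restrict_to_owner /etc/slurm-llnl/slurmdbd.conf
```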
108 | Start the Slurm database daemon:
109 | ```
110 | systemctl start slurmdbd
111 | ```
112 |
113 | Check the log file to verify that the Slurm database daemon started without errors:
114 | ```
115 | less /var/log/slurm-llnl/slurmdbd.log
116 | ```
117 |
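Instead of paging through a log by hand, the problem lines can be filtered out. A minimal sketch, assuming the log lines have the form `[timestamp] error: message` or `[timestamp] fatal: message`; the helper name `slurm_log_problems` is hypothetical, and the same helper could be pointed at the controller and worker logs:

```shell
# Hypothetical helper: print the error and fatal lines of a Slurm log file.
# Assumes each log line starts with a bracketed timestamp followed by the level.
# Exits non-zero when no such line is found (grep semantics).
slurm_log_problems() {
    grep -E '^\[[^]]*] (error|fatal):' "$1"
}

# Example:
#   slurm_log_problems /var/log/slurm-llnl/slurmdbd.log
```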
118 | ### Slurm Primary Controller
119 |
120 | ##### Installing the Munge Authentication Service
121 | [Set up the Munge authentication service](#munge)
122 |
123 | ##### Installing the Slurm Primary Controller
124 |
125 | Install the Slurm controller package:
126 | ```
127 | apt-get install slurmctld
128 | ```
129 |
130 | ##### Setting up the Slurm Controller/Worker configuration file
131 | [Set up the Slurm configuration file](#slurm_conf)
132 |
133 | ##### Setting up the checkpoint directories for the primary controller
134 | [Set up the checkpoint directories](#setup_checkdirs)
135 |
136 | ##### Starting the Slurm Primary Controller
137 | Start the Slurm controller daemon:
138 | ```
139 | systemctl start slurmctld
140 | ```
141 |
142 | Check the log file to verify that the Slurm controller started successfully:
143 | ```
144 | less /var/log/slurm-llnl/slurmctld.log
145 | ```
146 |
147 | ##### Setting up the logical cluster
148 | ```
149 | sacctmgr add cluster snowflake -i
150 | sacctmgr -i add account slurm Cluster=snowflake Description='none' Organization='none'
151 | ```
152 |
153 | ### Slurm Backup Controller
154 |
155 | ##### Installing the Munge Authentication Service
156 | [Set up the Munge authentication service](#munge)
157 |
158 | ##### Installing the Slurm Backup Controller
159 |
160 | Install the Slurm controller package:
161 | ```
162 | apt-get install slurmctld
163 | ```
164 |
165 | ##### Setting up the Slurm Controller/Worker configuration file
166 | [Set up the Slurm configuration file](#slurm_conf)
167 |
168 | ##### Setting up the checkpoint directories for the backup controller
169 | [Set up the checkpoint directories](#setup_checkdirs)
170 |
171 | ##### Starting the Slurm Backup Controller
172 | Start the Slurm controller daemon:
173 | ```
174 | systemctl start slurmctld
175 | ```
176 |
177 | Check the log file to verify that the Slurm controller started successfully:
178 | ```
179 | less /var/log/slurm-llnl/slurmctld.log
180 | ```
181 |
182 | ### Slurm Worker
183 |
184 | ##### Installing the Munge Authentication Service
185 | [Set up the Munge authentication service](#munge)
186 |
187 | ##### Installing the Slurm Worker
188 |
189 | Install the Slurm worker package:
190 | ```
191 | apt-get install slurmd
192 | ```
193 |
194 | ##### Setting up the Slurm Controller/Worker configuration file
195 | [Set up the Slurm configuration file](#slurm_conf)
196 |
197 | Start the Slurm worker daemon:
198 | ```
199 | systemctl start slurmd
200 | ```
201 |
202 | Check the log file to verify that the Slurm worker daemon started successfully:
203 | ```
204 | less /var/log/slurm-llnl/slurmd.log
205 | ```
206 |
207 |
208 | Common Installation and Configuration Tasks
209 | -------------------------------------------
210 | This section describes the installation and configuration tasks that have to be performed on several hosts.
211 |
212 | ### Munge Authentication Service
213 |
214 | Install the Munge package:
215 | ```
216 | apt-get install munge
217 | ```
218 |
219 | Provide a shared key under: **/etc/munge/munge.key**. The same key must be present on every host of the cluster.
220 | After installing the shared key, restart the Munge service to activate it:
221 | ```
222 | systemctl restart munge
223 | ```
224 |
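The shared key can be any sufficiently random file. A minimal sketch of generating one, assuming a 1024-byte key read from `/dev/urandom`; the helper name `generate_munge_key` is hypothetical. Generate the key on one host only, then copy that file to all other hosts and make sure it is owned by the munge user:

```shell
# Hypothetical helper: write a 1024-byte random key to the given path.
# The subshell sets a restrictive umask so the file is never world-readable.
generate_munge_key() {
    key_file="$1"
    (umask 077 && dd if=/dev/urandom of="$key_file" bs=1 count=1024 2>/dev/null) &&
        chmod 400 "$key_file"
}

# Example, run as root on the first host only:
#   generate_munge_key /etc/munge/munge.key
#   chown munge:munge /etc/munge/munge.key
```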
225 | ### Slurm Controller and Worker Configuration File
226 |
227 | Create the Slurm Controller/Worker configuration file under: **/etc/slurm-llnl/slurm.conf**
228 | ```
229 | # MANAGEMENT POLICIES
230 | ControlMachine=lxcm01
231 | BackupController=lxcm02
232 | AuthType=auth/munge
233 | CryptoType=crypto/munge
234 | SlurmUser=slurm
235 |
236 | # NODE CONFIGURATIONS
237 | NodeName=lxdev0[1-2]
238 |
239 | # PARTITION CONFIGURATIONS
240 | PartitionName=debug Nodes=lxdev0[1-2] Default=YES
241 |
242 | # ACCOUNTING
243 | AccountingStorageType=accounting_storage/slurmdbd
244 | AccountingStorageHost=lxcc01
245 | AccountingStorageBackupHost=lxcc02
246 | JobAcctGatherType=jobacct_gather/linux
247 | ClusterName=snowflake
248 |
249 | # CONNECTION
250 | SlurmctldPort=6817
251 | SlurmdPort=6818
252 |
253 | # DIRECTORIES
254 | JobCheckpointDir=/var/lib/slurm-llnl/job_checkpoint
255 | SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
256 | StateSaveLocation=/var/lib/slurm-llnl/state_checkpoint
257 |
258 | # LOGGING
259 | SlurmctldDebug=debug
260 | SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
261 | SlurmdDebug=debug
262 | SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
263 |
264 | # STATE INFO
265 | SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
266 | SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
267 |
268 | # SCHEDULING
269 | FastSchedule=2
270 |
271 | # ERROR RECOVERY
272 | ReturnToService=1
273 | ```
274 |
275 | Set the owner and group of the configuration file to slurm:
276 | ```
277 | chown slurm:slurm /etc/slurm-llnl/slurm.conf
278 | ```
279 |
280 | ### Setting up the shared checkpoint directories for the Slurm controllers
281 |
282 | The directories specified by the parameters **JobCheckpointDir** and **StateSaveLocation** in the configuration file **/etc/slurm-llnl/slurm.conf** are shared via an NFS mount, so that the primary and the backup controller both see the same contents.
283 |
284 | ##### Creating the checkpoint directories
285 |
286 | Create the checkpoint directories on _both_ controllers:
287 | ```
288 | cd /var/lib/slurm-llnl/
289 |
290 | mkdir job_checkpoint
291 | mkdir state_checkpoint
292 |
293 | chown slurm:slurm job_checkpoint
294 | chown slurm:slurm state_checkpoint
295 | ```
296 |
297 | ##### Setting up the primary controller as NFS server
298 | The primary controller acts as the NFS server and shares the checkpoint directories with the backup controller.
299 |
300 | Install the required package:
301 | ```
302 | apt-get install nfs-kernel-server
303 | ```
304 |
305 | Add the checkpoint directories to the export table in **/etc/exports**:
306 | ```
307 | # /etc/exports
308 | /var/lib/slurm-llnl/job_checkpoint lxcm02.devops.test(rw,sync)
309 | /var/lib/slurm-llnl/state_checkpoint lxcm02.devops.test(rw,sync)
310 | ```
311 |
312 | Restart the NFS server:
313 | ```
314 | systemctl restart nfs-kernel-server
315 | ```
316 |
317 | ##### Setting up the backup controller
318 |
319 | Install the required package:
320 | ```
321 | apt-get install nfs-common
322 | ```
323 |
324 | Mount the checkpoint directories:
325 | ```
326 | sudo mount -t nfs -o rw,nosuid lxcm01.devops.test:/var/lib/slurm-llnl/job_checkpoint /var/lib/slurm-llnl/job_checkpoint
327 | sudo mount -t nfs -o rw,nosuid lxcm01.devops.test:/var/lib/slurm-llnl/state_checkpoint /var/lib/slurm-llnl/state_checkpoint
328 | ```
329 |
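Mounts made with `mount` are lost on reboot. A minimal sketch of making them persistent through `/etc/fstab`, assuming the same mount options as above; the helper name `add_nfs_fstab_entry` is hypothetical and only appends an entry that is not already present:

```shell
# Hypothetical helper: append an NFS entry to an fstab file unless the
# exact same line already exists.
# Arguments: <fstab file> <server:/exported/path> <local mount point>
add_nfs_fstab_entry() {
    fstab_file="$1"
    remote="$2"
    mount_point="$3"
    entry="$remote $mount_point nfs rw,nosuid 0 0"
    grep -qxF "$entry" "$fstab_file" 2>/dev/null ||
        printf '%s\n' "$entry" >> "$fstab_file"
}

# Example, run as root on the backup controller:
#   add_nfs_fstab_entry /etc/fstab lxcm01.devops.test:/var/lib/slurm-llnl/job_checkpoint /var/lib/slurm-llnl/job_checkpoint
#   add_nfs_fstab_entry /etc/fstab lxcm01.devops.test:/var/lib/slurm-llnl/state_checkpoint /var/lib/slurm-llnl/state_checkpoint
```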
--------------------------------------------------------------------------------