├── .gitignore ├── .vscode └── settings.json ├── LICENSE ├── README.md ├── check_if_user_expired.sh ├── create_admin_user.sh ├── create_faculty_admin_sudo_group.sh ├── create_users.sh ├── create_users_no_sync.sh ├── delete_users.sh └── demo_files ├── test.job ├── tf1_load_test.py ├── tf1_test.py ├── tf2_load_test.py └── tf2_test.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.hdf5 2 | Class_List_MSDS684_FW1_2019.csv 3 | -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "spellright.language": [ 3 | "en_US" 4 | ], 5 | "spellright.documentTypes": [ 6 | "markdown", 7 | "latex", 8 | "plaintext" 9 | ] 10 | } -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Nate George 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # slurm_gpu_ubuntu 2 | Instructions for setting up a SLURM cluster using Ubuntu 18.04.3 with GPUs. Go from a pile of hardware to a functional GPU cluster with job queueing and user management. 3 | 4 | OS used: Ubuntu 18.04.3 LTS 5 | 6 | 7 | # Overview 8 | This guide will help you create and install a GPU HPC cluster with a job queue and user management. The idea is to have a GPU cluster which allows use of a few GPUs by many people. Using multiple GPUs at once is not the point here, and hasn't been tested. This guide demonstrates how to create a GPU cluster for neural networks (deep learning) which uses Python and related neural network libraries (Tensorflow, Keras, Pytorch), CUDA, and NVIDIA GPU cards. You can expect this to take you a few days up to a week. 
9 | 10 | ## Outline of steps: 11 | 12 | - Prepare hardware 13 | - Install OSes 14 | - Sync UID/GIDs or create slurm/munge users 15 | - Install software (NVIDIA drivers, Anaconda, and Python packages) 16 | - Install/configure file sharing (NFS here; if using more than one node/computer in the cluster) 17 | - Install munge/SLURM and configure 18 | - User management 19 | 20 | ## Acknowledgements 21 | This wouldn't have been possible without this [github repo](https://github.com/mknoxnv/ubuntu-slurm) from mknoxnv. I don't know who that person is, but they saved me weeks of work trying to figure out all the conf files and services, etc. 22 | 23 | # Preparing Hardware 24 | 25 | If you do not already have hardware, here are some considerations: 26 | 27 | Top-of-the-line commodity motherboards can handle up to 4 GPUs. You should pay attention to PCI lanes in the motherboard and CPU specifications. Usually GPUs can take up to 16 PCI lanes, and work fastest for data transfer when using all 16 lanes. To use 4 GPUs in one machine, your motherboard should support at least 64 PCI lanes, and CPUs should have at least 64 lanes available. M.2 SSDs can use PCI lanes as well, so it can be better to have a little more than 64 lanes if possible. The motherboard and CPU specs usually detail the PCI lanes. 28 | 29 | We used NVIDIA GPU cards in our cluster, but many AMD cards [should now work](https://rocm.github.io/) with Python deep learning libraries. 30 | 31 | Power supply wattage is also important to consider, as GPUs can draw a lot of watts at peak power. 32 | 33 | You only need one computer, but to have more than 4 GPUs you will need at least 2 computers. This guide assumes you are using more than one computer in your cluster. 34 | 35 | # Installing operating systems 36 | 37 | Once you have hardware up and running, you need to install an OS. From my research I've found Ubuntu is the top Linux distribution as of 2019 (both for commodity hardware and servers), and is recommended. Currently the latest long-term support (LTS) version is Ubuntu 18.04.3, which is what was used here. LTS releases are usually better because they are more stable over time. Other Linux distributions may differ in some of the commands. 38 | 39 | I recommend creating a bootable USB stick and installing Ubuntu from that. Often with NVIDIA, the installation freezes upon loading and [this fix](https://askubuntu.com/a/870245/458247) must be implemented. Once the boot menu appears, choose Ubuntu or Install Ubuntu, then press 'e', then add `acpi=off` directly after `quiet splash` (leaving a space between splash and acpi). Then press F10 and it should boot. 40 | 41 | I recommend using [LVM](https://www.howtogeek.com/211937/how-to-use-lvm-on-ubuntu-for-easy-partition-resizing-and-snapshots/) when installing (there is a checkbox for it in the Ubuntu installer), so that you can add and extend storage HDDs if needed. 42 | 43 | **Note**: Along the way I used the package manager to update/upgrade software many times (`sudo apt-get update` and `sudo apt-get upgrade`) followed by reboots. If something is not working, this can be a first step to try to debug it. 44 | 45 | ## Synchronizing GID/UIDs 46 | It's recommended to sync the GIDs and UIDs across machines. This can be done with something like LDAP (install instructions [here](https://computingforgeeks.com/how-to-install-and-configure-openldap-ubuntu-18-04/) and [here](https://www.techrepublic.com/article/how-to-install-openldap-on-ubuntu-18-04/)). 
In my experience, for basic cluster management where all users can read and write to the folders where job files exist, the only GIDs and UIDs that need to be synced are the slurm and munge users. Other users can be created and run SLURM jobs without having usernames on the other machines in the cluster. 47 | 48 | However, if you want to isolate access to users' home folders (best practice I'd say), then you must synchronize users across the cluster. The easiest way I've found to synchronize UIDs and GIDs across an Ubuntu cluster is FreeIPA. Here are installation instructions: 49 | 50 | - [Server (master node)](https://computingforgeeks.com/how-to-install-and-configure-freeipa-server-on-ubuntu-18-04-ubuntu-16-04/) (note: I had to run [this command](https://stackoverflow.com/a/54539428/4549682) to fix an issue after installing) 51 | - [Client (worker nodes)](https://computingforgeeks.com/how-to-configure-freeipa-client-on-ubuntu-18-04-ubuntu-16-04-centos-7/) 52 | 53 | It is important that you set the hostname to an FQDN, otherwise Kerberos/FreeIPA won't work. If you accidentally set the hostname to the wrong value during the Kerberos setup, you can change it in `/etc/krb5.conf`. You could also completely purge Kerberos [like so](https://serverfault.com/a/885525/305991). If you need to redo the IPA configuration, you can run `sudo ipa-server-install --uninstall` then try installing again. I had to do the uninstall twice for it to work. 54 | 55 | ## Synchronizing time 56 | FreeIPA should take care of syncing time, so you shouldn't have to worry about this if you set up FreeIPA. You can see if times are synced with the `date` command on the various machines. 57 | 58 | 59 | Otherwise, it's not a bad idea to sync the time across the servers yourself. [Here's how](https://knowm.org/how-to-synchronize-time-across-a-linux-cluster/). One time when I set up the cluster it was fine, but another time the slurmctld service wouldn't start because the times weren't synced. 60 | 61 | 62 | ## Set up munge and slurm users and groups 63 | Immediately after installing the OSes, you want to create the munge and slurm users and groups on all machines. The GID and UID (group and user IDs) must match for munge and slurm across all machines. If you have a lot of machines, you can use the parallel SSH utilities mentioned below. There are also other options like NIS and NIS+. One other option is to use FreeIPA to create users and groups. 64 | 65 | On all machines we need the munge authentication service and slurm installed. First, we want to have the munge and slurm users/groups with the same UIDs and GIDs. In my experience, these are the only GIDs and UIDs that need synchronization for the cluster to work. On all machines: 66 | 67 | ``` 68 | sudo adduser -u 1111 munge --disabled-password --gecos "" 69 | sudo adduser -u 1121 slurm --disabled-password --gecos "" 70 | ``` 71 | 72 | #### You shouldn’t need to do this, but just in case, you could create the groups first, then create the users 73 | 74 | ``` 75 | sudo addgroup -gid 1111 munge 76 | sudo addgroup -gid 1121 slurm 77 | sudo adduser -u 1111 munge --disabled-password --gecos "" -gid 1111 78 | sudo adduser -u 1121 slurm --disabled-password --gecos "" -gid 1121 79 | ``` 80 | 81 | When a user is created, a group with the same name is created as well. 82 | 83 | The numbers don’t matter as long as they are available for the user and group IDs. These numbers seemed to work with a default Ubuntu 18.04.3 installation. 
It seems like by default Ubuntu sets up a new user with a UID and GID of UID + 1 if that GID is already taken, so this follows that pattern. 84 | 85 | 86 | 87 | ## Installing software/drivers 88 | Next you should install SSH. Open a terminal and install: `sudo apt install openssh-server -y`. 89 | 90 | Once you have SSH on the machines, you may want to use a [parallel SSH utility](https://www.tecmint.com/run-commands-on-multiple-linux-servers/) to execute commands on all machines at once. 91 | 92 | ### Install NVIDIA drivers 93 | You will need the latest NVIDIA drivers installed for your cards. The procedure [currently is](http://ubuntuhandbook.org/index.php/2019/04/nvidia-430-09-gtx-1650-support/): 94 | 95 | ``` 96 | sudo add-apt-repository ppa:graphics-drivers/ppa 97 | sudo apt-get update 98 | sudo apt-get install nvidia-driver-430 99 | ``` 100 | 101 | The 430 driver will probably update soon. You can use `sudo apt-cache search nvidia-driver*` to find the latest one, or go to the "Software & Updates" menu to install it. For some reason, on the latest install I had to use aptitude to install it: 102 | 103 | ``` 104 | sudo apt-get install aptitude -y 105 | sudo aptitude install nvidia-driver-430 106 | ``` 107 | 108 | But that still didn't seem to solve the issue, and I installed it via the "Software & Updates" menu under "Additional Drivers". 109 | 110 | We also use [NoMachine](https://www.nomachine.com/) for remote GUI access. 111 | 112 | ## Install the Anaconda Python distribution 113 | Anaconda makes installing deep learning libraries easier, and doesn’t require installing CUDA/cuDNN libraries (which is a pain). Anaconda handles the CUDA and other dependencies for deep learning libraries. 114 | 115 | Download the distribution file: 116 | 117 | ``` 118 | cd /tmp 119 | wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh 120 | ``` 121 | You may want to visit https://repo.anaconda.com/archive/ to get the latest Anaconda version instead, though you can use: 122 | 123 | 124 | `conda update conda anaconda` 125 | 126 | or 127 | 128 | `conda update --all` 129 | 130 | to update Anaconda once it’s installed. 131 | 132 | Once the .sh file is downloaded, you should make it executable with 133 | 134 | `chmod +x Anaconda3-2019.03-Linux-x86_64.sh` 135 | 136 | then run the file: 137 | 138 | `./Anaconda3-2019.03-Linux-x86_64.sh` 139 | 140 | I chose yes for `Do you wish the installer to initialize Anaconda3 141 | by running conda init?`. 142 | 143 | Then you should do `source ~/.bashrc` to enable `conda` as a command. 144 | 145 | If you chose `no` for the `conda init` portion, you may need to add some aliases to bashrc: 146 | 147 | `nano ~/.bashrc` 148 | 149 | Add the lines: 150 | 151 | ``` 152 | alias conda=~/anaconda3/bin/conda 153 | alias python=~/anaconda3/bin/python 154 | alias ipython=~/anaconda3/bin/ipython 155 | ``` 156 | 157 | Now install some Anaconda packages: 158 | 159 | ``` 160 | conda update conda 161 | conda install anaconda 162 | conda install python=3.6 163 | conda install tensorflow-gpu keras pytorch 164 | ``` 165 | 166 | The 3.6 install can take a while to complete (environment solving with conda is slow; it took about 15 minutes for me even on a fast computer -- the environment solving is definitely a big drawback of Anaconda). Not a bad idea to use tmux and put the `conda install python=3.6` in a `tmux` shell in case an SSH session is interrupted. 167 | 168 | Python 3.6 is the latest version with easy support for TensorFlow and some other packages. 
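Before moving on, it's worth a quick check that the conda packages can actually see the GPU. Below is a minimal sanity-check sketch (assuming the `tensorflow-gpu` 1.x and `pytorch` packages installed above; `tf.test.is_gpu_available()` is the TF 1.x-style check):

```
# quick GPU sanity check -- a minimal sketch, not part of the repo's demo files
import tensorflow as tf
import torch

# prints True if TensorFlow can see a CUDA GPU
print('TensorFlow sees a GPU:', tf.test.is_gpu_available())

# prints True (and the device name) if PyTorch can see a CUDA GPU
print('PyTorch sees a GPU:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU name:', torch.cuda.get_device_name(0))
```

If either of these prints False, recheck the NVIDIA driver installation (`nvidia-smi`) before going further.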
169 | 170 | 171 | At this point you can use this code to test GPU functionality with this [demo code](https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py), you could also use [this](https://stackoverflow.com/a/38580201/4549682). 172 | 173 | 174 | # Install NFS (shared storage) 175 | In order for SLURM to work properly, there must be a storage location present on all computers in the cluster with the same files used for jobs. All computers in the cluster must be able to read and write to this directory. One way to do this is with NFS, although other options such as OCFS2 exist. Here we use NFS. 176 | 177 | For the instructions, we will call the primary server `master` (the one hosting storage and the SLURM controller) and assume we have one worker node (another computer with GPUs) called `worker`. We will also assume the username/groupname for the main administrative account on all machines is `admin:admin`. I used the same username and group for the administrative accounts on all the servers. 178 | 179 | ## Master node 180 | On the master server, do: 181 | 182 | `sudo apt install nfs-kernel-server -y` 183 | 184 | Make a storage location: 185 | 186 | `sudo mkdir /storage` 187 | 188 | In my case, /storage was actually the mount point for a second HDD (LVM, which was expanded to 20TB). 189 | 190 | Change ownership to your administrative username and group: 191 | 192 | `sudo chown admin:admin /storage` 193 | 194 | Next we need to add rules for the shared location. This is done with: 195 | 196 | `sudo nano /etc/exports` 197 | 198 | Then adding the line: 199 | 200 | `/storage *(rw,sync,no_root_squash)` 201 | 202 | The * is for IP addresses or hostnames. In this case we allow anything, but you may want to limit it to your IPs/hostnames in the cluster. In fact, it wasn't working for me unless I explicitly set the IPs of the clients here. You have to have a separate entry for each IP. Mine ended up looking like: 203 | 204 | `/storage 172.xx.224.xx(rw,sync,no_root_squash,all_squash,anonuid=999999,anongid=999999) 172.xx.224.xx(rw,sync,no_root_squash,all_squash,anonuid=999999,anongid=999999)` 205 | 206 | where the 'xx's are actual numbers. 207 | 208 | Then start the NFS service: 209 | 210 | `sudo systemctl start nfs-kernel-server.service` 211 | 212 | It should start automatically upon restarts. 213 | 214 | You should also add a rule to allow for NFS traffic from the workers through port 2049. This is done like so: 215 | 216 | `sudo ufw allow from to any port nfs` 217 | 218 | Check the status with `sudo ufw status`. You should see a rule to allow traffic to port 2049 from your worker nodes' IP addresses. [Here's more info](https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04). 219 | 220 | 221 | ## Client nodes 222 | Now we can set up the clients. On all worker servers: 223 | 224 | ``` 225 | sudo apt install nfs-common -y 226 | sudo mkdir /storage 227 | sudo chown admin:admin /storage 228 | sudo mount master:/storage /storage 229 | ``` 230 | 231 | To make the drive mount upon restarts for the worker nodes, add this to fstab (`sudo nano /etc/fstab`): 232 | 233 | `master:/storage /storage nfs auto,timeo=14,intr 0 0` 234 | 235 | This can be done like so: 236 | 237 | `echo master:/storage /storage nfs auto,timeo=14,intr 0 0 | sudo tee -a /etc/fstab` 238 | 239 | Now any files put into /storage from the master server can be seen on all worker servers connect via NFS. The worker servers MUST be read and write. 
If not, any sbatch jobs will give an exit status of 1:0. 240 | 241 | 242 | # Preparing for SLURM installation 243 | ## Passwordless SSH from master to all workers 244 | 245 | First we need passwordless SSH between the master and compute nodes. We are still using `master` as the master node hostname and `worker` as the worker hostname. On the master: 246 | 247 | ``` 248 | ssh-keygen 249 | ssh-copy-id admin@worker 250 | ``` 251 | 252 | To do this with many worker nodes, you might want to set up a small script to loop through worker hostnames or IPs. 253 | 254 | ## Install munge on the master: 255 | ``` 256 | sudo apt-get install libmunge-dev libmunge2 munge -y 257 | sudo systemctl enable munge 258 | sudo systemctl start munge 259 | ``` 260 | 261 | Test munge if you like: 262 | `munge -n | unmunge | grep STATUS` 263 | 264 | 265 | Copy the munge key to /storage 266 | ``` 267 | sudo cp /etc/munge/munge.key /storage/ 268 | sudo chown munge /storage/munge.key 269 | sudo chmod 400 /storage/munge.key 270 | ``` 271 | 272 | ## Install munge on worker nodes: 273 | ``` 274 | sudo apt-get install libmunge-dev libmunge2 munge 275 | sudo cp /storage/munge.key /etc/munge/munge.key 276 | sudo systemctl enable munge 277 | sudo systemctl start munge 278 | ``` 279 | 280 | If you want, you can test munge: 281 | `munge -n | unmunge | grep STATUS` 282 | 283 | ## Prepare DB for SLURM 284 | 285 | These instructions more or less follow this github repo: https://github.com/mknoxnv/ubuntu-slurm 286 | 287 | First we want to clone the repo: 288 | `cd /storage` 289 | `git clone https://github.com/mknoxnv/ubuntu-slurm.git` 290 | 291 | Install prereqs: 292 | ``` 293 | sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmariadb-client-lgpl-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y 294 | sudo gem install fpm 295 | ``` 296 | 297 | Next we set up MariaDB for storing SLURM data: 298 | ``` 299 | sudo systemctl enable mysql 300 | sudo systemctl start mysql 301 | sudo mysql -u root 302 | ``` 303 | 304 | Within mysql: 305 | ``` 306 | create database slurm_acct_db; 307 | create user 'slurm'@'localhost'; 308 | set password for 'slurm'@'localhost' = password('slurmdbpass'); 309 | grant usage on *.* to 'slurm'@'localhost'; 310 | grant all privileges on slurm_acct_db.* to 'slurm'@'localhost'; 311 | flush privileges; 312 | exit 313 | ``` 314 | 315 | Copy the default db config file: 316 | `cp /storage/ubuntu-slurm/slurmdbd.conf /storage` 317 | 318 | Ideally you want to change the password to something different than `slurmdbpass`. This must also be set in the config file `/storage/slurmdbd.conf`. 319 | 320 | # Install SLURM 321 | ## Download and install SLURM on Master 322 | 323 | ### Build the SLURM .deb install file 324 | It’s best to check the downloads page and use the latest version (right click link for download and use in the wget command). Ideally we’d have a script to scrape the latest version and use that dynamically. 325 | 326 | You can use the -j option to specify the number of CPU cores to use for 'make', like `make -j12`. `htop` is a nice package that will show usage stats and quickly show how many cores you have. 327 | 328 | ``` 329 | cd /storage 330 | wget https://download.schedmd.com/slurm/slurm-19.05.2.tar.bz2 331 | tar xvjf slurm-19.05.2.tar.bz2 332 | cd slurm-19.05.2 333 | ./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm 334 | make 335 | make contrib 336 | make install 337 | cd .. 
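# note: nothing has been installed into system paths yet -- the --prefix above
# stages everything under /tmp/slurm-build, which the next step packages into a
# .deb with fpm and installs with dpkg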
338 | ``` 339 | 340 | ### Install SLURM 341 | ``` 342 | sudo fpm -s dir -t deb -v 1.0 -n slurm-19.05.2 --prefix=/usr -C /tmp/slurm-build . 343 | sudo dpkg -i slurm-19.05.2_1.0_amd64.deb 344 | ``` 345 | 346 | Make all the directories we need: 347 | ``` 348 | sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm 349 | sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm 350 | ``` 351 | 352 | Copy slurm control and db services: 353 | ``` 354 | sudo cp /storage/ubuntu-slurm/slurmdbd.service /etc/systemd/system/ 355 | sudo cp /storage/ubuntu-slurm/slurmctld.service /etc/systemd/system/ 356 | ``` 357 | 358 | The slurmdbd.conf file should be copied before starting the slurm services: 359 | `sudo cp /storage/slurmdbd.conf /etc/slurm/` 360 | 361 | Start the slurm services: 362 | ``` 363 | sudo systemctl daemon-reload 364 | sudo systemctl enable slurmdbd 365 | sudo systemctl start slurmdbd 366 | sudo systemctl enable slurmctld 367 | sudo systemctl start slurmctld 368 | ``` 369 | 370 | If the master is also going to be a worker/compute node, you should do: 371 | 372 | ``` 373 | sudo cp /storage/ubuntu-slurm/slurmd.service /etc/systemd/system/ 374 | sudo systemctl enable slurmd 375 | sudo systemctl start slurmd 376 | ``` 377 | 378 | ## Worker nodes 379 | Now install SLURM on worker nodes: 380 | 381 | ``` 382 | cd /storage 383 | sudo dpkg -i slurm-19.05.2_1.0_amd64.deb 384 | sudo cp /storage/ubuntu-slurm/slurmd.service /etc/systemd/system/ 385 | sudo systemctl enable slurmd 386 | sudo systemctl start slurmd 387 | ``` 388 | 389 | ## Configuring SLURM 390 | 391 | Next we need to set up the configuration file. Copy the default config from the github repo: 392 | 393 | `cp /storage/ubuntu-slurm/slurm.conf /storage/slurm.conf` 394 | 395 | Note: for job limits for users, you should add the [AccountingStorageEnforce=limits](https://slurm.schedmd.com/resource_limits.html) line to the config file. 396 | 397 | Once SLURM is installed on all nodes, we can use the command 398 | 399 | `sudo slurmd -C` 400 | 401 | to print out the machine specs. Then we can copy this line into the config file and modify it slightly. To modify it, we need to add the number of GPUs we have in the system (and remove the last part which show UpTime). Here is an example of a config line: 402 | 403 | `NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846` 404 | 405 | Take this line and put it at the bottom of `slurm.conf`. 406 | 407 | Next, setup the `gres.conf` file. Lines in `gres.conf` should look like: 408 | 409 | ``` 410 | NodeName=master Name=gpu File=/dev/nvidia0 411 | NodeName=master Name=gpu File=/dev/nvidia1 412 | ``` 413 | 414 | If you have multiple GPUs, keep adding lines for each node and increment the last number after nvidia. 415 | 416 | Gres has more options detailed in the docs: https://slurm.schedmd.com/slurm.conf.html (near the bottom). 417 | 418 | Finally, we need to copy .conf files on **all** machines. This includes the `slurm.conf` file, `gres.conf`, `cgroup.conf` , and `cgroup_allowed_devices_file.conf`. Without these files it seems like things don’t work. 
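For reference, the node and partition section at the bottom of `slurm.conf` might end up looking something like the sketch below. The hostnames, GPU counts, and partition name are only illustrative (based on the examples above); use the output of `sudo slurmd -C` from your own machines. Note that `GresTypes=gpu` also needs to be set in `slurm.conf` for the `Gres=gpu:N` entries to be recognized (check that the example config from the ubuntu-slurm repo includes it).

```
GresTypes=gpu
NodeName=master Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
PartitionName=debug Nodes=master,worker1 Default=YES MaxTime=INFINITE State=UP
```

With the config files filled in, copy them to `/etc/slurm/` on every machine: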
419 | 420 | ``` 421 | sudo cp /storage/ubuntu-slurm/cgroup* /etc/slurm/ 422 | sudo cp /storage/slurm.conf /etc/slurm/ 423 | sudo cp /storage/gres.conf /etc/slurm/ 424 | ``` 425 | 426 | This directory should also be created on workers: 427 | ``` 428 | sudo mkdir -p /var/spool/slurm/d 429 | sudo chown slurm /var/spool/slurm/d 430 | ``` 431 | 432 | After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services: 433 | 434 | Workers: 435 | `sudo systemctl restart slurmd` 436 | Master: 437 | ``` 438 | sudo systemctl restart slurmctld 439 | sudo systemctl restart slurmdbd 440 | sudo systemctl restart slurmd 441 | ``` 442 | 443 | Next we just create a cluster: 444 | `sudo sacctmgr add cluster compute-cluster` 445 | 446 | 447 | ## Configure cgroups 448 | 449 | Cgroups allow SLURM to enforce memory limits on jobs and users. Enable memory cgroups on all workers by editing the GRUB config: 450 | 451 | ``` 452 | sudo nano /etc/default/grub 453 | # then change the following variable to: 454 | GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1" 455 | sudo update-grub 456 | ``` 457 | Finally, I did one last `sudo apt update`, `sudo apt upgrade`, and `sudo apt autoremove`, then rebooted the computers: 458 | `sudo reboot` 459 | 460 | 461 | # User Management 462 | It's best to configure the minimum password life to 0, so users can change their passwords immediately and log in. This can be done with: 463 | `ipa pwpolicy-add ipausers --minlife=0 --priority=0` 464 | This is useful for resetting passwords. 465 | 466 | ## Adding users 467 | Since we are using FreeIPA, we can use that to create users. 468 | [Here is an example script](create_users.sh) to add users from a CSV file. There is also a [script for deleting users](delete_users.sh). 469 | 470 | We are adding users within the FreeIPA system, within the SLURM system, and creating a home directory. The user is set to expire a little over a year from creation, and the password is set to expire upon the first login (prompting the user to change their password). 471 | 472 | Adding users can also be done with plain Linux tools and SLURM commands. In that case it's best to create a group for different user groups: 473 | `sudo groupadd normal` 474 | But this would require creating users on all the machines. FreeIPA takes care of that for us, so it's a better solution. 475 | 476 | ## Storage quotas 477 | Next we need to set storage quotas for the user. Follow [this guide](https://www.digitalocean.com/community/tutorials/how-to-set-filesystem-quotas-on-ubuntu-18-04) to set up the quota settings on the machine. 478 | 479 | Then we can set quotas: 480 | 481 | ```bash 482 | sudo setquota -u ngeorge 150G 150G 0 0 /storage 483 | sudo setquota -u ngeorge 5G 5G 0 0 / 484 | ``` 485 | 486 | `/dev/mapper/ubuntu--vg-root` is the LVM partition for the root drive `/`, and `/dev/disk/by-uuid/987d372b-9c96-4e62-af82-2d95dc6655b4` is the device from `/etc/fstab` for the HDD mounted at /storage. 487 | 488 | This sets the soft and hard limits to 150GB for /storage and 5GB for /. 489 | 490 | 491 | To see how much of the quota people are using: 492 | ``` 493 | sudo repquota -s / 494 | sudo repquota -s /storage 495 | ``` 496 | 497 | The new users don’t seem to always show up until they have saved something on the drive. 
You can also specifically look at one user with: 498 | 499 | `sudo quota -vs ngeorge` 500 | 501 | 502 | ## Deleting SLURM users on expiration 503 | The SLURM account manager has no way to set an expiration date for users. So we use [this script](check_if_user_expired.sh) to check if the Linux username has expired, and if so, we delete the slurm username and home directory. This runs on a cronjob once per day. Add it to the crontab file with: 504 | 505 | `sudo crontab -e` 506 | 507 | Add this line to run at 5 am every day on the machine: 508 | 509 | `0 5 * * * bash /home/<username>/slurm_gpu_ubuntu/check_if_user_expired.sh` 510 | 511 | Obviously fix the path to where the script is, and change the username to yours. 512 | 513 | ## Resetting user passwords 514 | To reset a user password: 515 | 516 | `kinit admin` 517 | `ipa user-mod <username> --password --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z'` 518 | This also sets the password to already be expired so they must reset it upon login. 519 | 520 | If the error comes up: `Password change failed. Server message: Current password's minimum life has not expired` 521 | then the min life for passwords needs to be changed: 522 | `ipa pwpolicy-add ipausers --minlife=0 --priority=0` 523 | The `ipausers` above is the group of users whose passwords are being reset. 524 | 525 | # Troubleshooting 526 | 527 | When in doubt, first try updating software with `sudo apt update; sudo apt upgrade -y` and rebooting (`sudo reboot`). 528 | 529 | ## Log files 530 | When in doubt, you can check the log files. The locations are set in the slurm.conf file, and are `/var/log/slurmd.log` and `/var/log/slurmctld.log` by default. Open them with `sudo nano /var/log/slurmctld.log`. To go to the bottom of the file, use ctrl+_ and ctrl+v. I also changed the log paths to `/var/log/slurm/slurmd.log` and so on, and changed the permissions of the folder to be owned by slurm: `sudo chown slurm:slurm /var/log/slurm`. 531 | 532 | ## Checking SLURM states 533 | Some helpful commands: 534 | 535 | `scontrol ping` -- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (`SlurmctldPort` in the slurm.conf file). This is set to 6817 in the config file. To allow connections with the firewall, execute: 536 | 537 | `sudo ufw allow from any to any port 6817` 538 | 539 | and 540 | 541 | `sudo ufw reload` 542 | 543 | ## Error codes 1:0 and 2:0 544 | If trying to run a job with `sbatch` and the exit code is 1:0, this is usually a file writing error. The first thing to check is that your output and error file paths in the .job file are correct. Also check that the .py file you want to run has the correct filepath in your .job file. Then you should go to the logs (`/var/log/slurm/slurmctld.log`) and see which node the job was trying to run on. Then go to that node and open the logs (`/var/log/slurm/slurmd.log`) to see what it says. It may say something about the path for the output/error files, or the path to the .py file being incorrect. 545 | 546 | It could also mean your common storage location is not r/w accessible to all nodes. In the logs, this would show up as something about permissions and being unable to write to the filesystem. Double-check that you can create files in the /storage location on all workers with something like `touch testing.txt`. If you can't create a file from the worker nodes, you probably have some sort of NFS issue. 
Go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account and they should show up as owned by the admin user. If not, you may have some issue in your /etc/exports or with your GID/UIDs not matching. 547 | 548 | If the exit code is 2:0, this can mean there is some problem with either the location of the Python executable, or some other error when running the Python script. Double check that the srun or python script is working as expected with the Python executable specified in the sbatch job file. 549 | 550 | If some workers are 'draining', down, or unavailable, you might try: 551 | 552 | `sudo scontrol update NodeName=worker1 State=RESUME` 553 | 554 | 555 | ## Node is stuck draining (drng from `sinfo`) 556 | This has happened due to the memory size in slurm.conf being higher than the actual memory size. Double check the memory from `free -m` or `sudo slurmd -C` and update slurm.conf on all machines in the cluster. Then run `sudo scontrol update NodeName=worker1 State=RESUME`. 557 | 558 | ## Nodes are not visible upon restart 559 | After restarting the master node, sometimes the workers aren't there. I've found I often have to do `sudo scontrol update NodeName=worker1 State=RESUME` to get them working/available. 560 | 561 | 562 | ## Taking a node offline 563 | The best way to take a node offline for maintenance is to drain it: 564 | `sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'` 565 | 566 | Users can see the reason with `sinfo -R`. 567 | 568 | 569 | ## Testing GPU load 570 | Using `watch -n 0.1 nvidia-smi` will show the GPU load in real time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized. 571 | 572 | 573 | 574 | 575 | ## Setting account options 576 | You may want to limit jobs or submissions. Here is how to set attributes (-1 means no limit): 577 | ```bash 578 | sudo sacctmgr modify account students set GrpJobs=-1 579 | sudo sacctmgr modify account students set GrpSubmitJobs=-1 580 | sudo sacctmgr modify account students set MaxJobs=-1 581 | sudo sacctmgr modify account students set MaxSubmitJobs=-1 582 | ``` 583 | 584 | ## FreeIPA Troubleshooting 585 | 586 | If you can't access the FreeIPA admin web GUI, you may try changing permissions on the Kerberos folder as noted [here](https://scattered.network/2019/04/09/freeipa-webui-login-fails-with-login-failed-due-to-an-unknown-reason/). 587 | 588 | To get the machines to talk to each other with FreeIPA, you may also need to take some or all of these steps: 589 | 590 | - [Install a DNS service on the IPA server](https://docs.fedoraproject.org/en-US/Fedora/18/html/FreeIPA_Guide/enabling-dns.html) 591 | - [configure it to recognize the clients](https://www.howtoforge.com/how-to-install-freeipa-client-on-ubuntu-server-1804/#step-testing-freeipa-client) 592 | - This may also require enabling TCP and UDP traffic on port 53, and changing 593 | the [resolv.conf file on the clients](http://clusterfrak.com/sysops/app_installs/freeipa_clients/) to recognize the new name server on the server computer 594 | - change the /etc/nsswitch.conf file to include the line "initgroups: files sss", and 595 | add several instances of "sss" to [other lines in this file](https://bugzilla.redhat.com/show_bug.cgi?id=1366569) 596 | 597 | 598 | # Better sacct 599 | 600 | This shows all running jobs with the user who is running them. 
601 | 602 | `sacct --format=jobid,jobname,state,exitcode,user,account` 603 | 604 | More on sacct [here](https://slurm.schedmd.com/sacct.html). 605 | 606 | 607 | # Changing IPs 608 | If the IP addresses of your machines change, you will need to update these in the file `/etc/hosts` on all machines and `/etc/exports` on the master node. It's best to restart after making these changes. 609 | 610 | # NFS directory not showing up 611 | Check that the service is running on the master node: 612 | `sudo systemctl status nfs-kernel-server.service` 613 | 614 | If it is not working, you may have a syntax error in your /etc/exports file. Rebooting after getting this working is a good idea. Not a bad idea to reboot the client computers as well. 615 | 616 | Once you have the service running on the master node, then see if you can manually mount the drive on the clients: 617 | 618 | `sudo mount master:/storage /storage` 619 | 620 | If it is hanging here, try mounting on the master server: 621 | 622 | `sudo mkdir /test` 623 | `sudo mount master:/storage /test` 624 | 625 | If this works, you might have an issue with ports being blocked or other connection issues between the master and clients. 626 | 627 | You should check your firewall status with `sudo ufw status`. You should see a rule allowing port 2049 access from your worker nodes. If you don't have it, be sure to add it with `sudo ufw allow from <worker IP address> to any port nfs` then `sudo ufw reload`. You should use the IP and not the hostname. 628 | 629 | 630 | # Node not able to connect to slurmctld 631 | If a node isn't able to connect to the controller (server/master), first check that time is properly synced. Try using the `date` command to see if the times are synced across the servers. 632 | 633 | # Unable to uninstall and reinstall the FreeIPA client 634 | If you are trying to uninstall the FreeIPA client and reinstall it and it fails (e.g. gives an error `The ipa-client-install command failed. See /var/log/ipaclient-install.log for more information`), you can try installing it with: 635 | 636 | ``` 637 | sudo ipa-client-install --hostname=`hostname -f` \ 638 | --mkhomedir --server=copper.regis.edu \ 639 | --domain regis.edu --realm REGIS.EDU --force-join 640 | ``` 641 | 642 | where you should use your own domain instead of regis.edu and your IPA server's hostname instead of 'copper'. 643 | 644 | You might also try removing this file instead: 645 | 646 | `sudo rm /var/lib/ipa-client/sysrestore/sysrestore.state` 647 | 648 | However, when I was having this problem, it appeared to be some issue with LDAP and SSSD not working. I ended up reformatting and reinstalling the OS on the problem machine instead of trying to debug SSSD, which looked extremely time-consuming. 649 | 650 | 651 | # Running a demo file 652 | 653 | To test the cluster, it's useful to run some demo code. Since Keras is within TensorFlow from version 2.0 onward, there are two demo files under the folder 'demo_files'. tf1_test.py is for TensorFlow 1.x, and tf2_test.py is for TensorFlow 2.x. 654 | 655 | To run the demo file, it's best to first just run it directly with Python to see if it works. You can run `python tf2_test.py`, and if it works then you can proceed to the next step. If not, check the Python path (`which python`) to make sure it's using the correct Python executable. 656 | 657 | To ensure you're using the GPU and not the CPU, you can run `nvidia-smi` to watch and make sure the GPU memory is getting used while running the file. 
`watch -n 0.1 nvidia-smi` will show the GPU memory updated every 0.1 second. 658 | 659 | Once the TensorFlow demo file works, you can try submitting it as a SLURM job. This uses the test.job file. Run `sbatch test.job`. Then you can check the status of the job with `sacct`. This should show 'running', and you should see GPU being used on one of the worker nodes. To specify the exact worker node being used, add a line to the .job file: 660 | `#SBATCH --nodelist=worker_name` 661 | where 'worker_name' is the name of the node. 662 | You should also be able to use `sinfo` to check which nodes are running jobs. 663 | -------------------------------------------------------------------------------- /check_if_user_expired.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # checks if a slurm user should be deleted. This happens if the user 4 | # saccmgr options: -n = no header, -P = parsable2 (pipe delimited) 5 | # csvcut is from csvkit; install with conda or pip: conda install csvkit 6 | slurm_users=$(sacctmgr list user -n -P | csvcut -d '|' -c 1) 7 | for u in $slurm_users 8 | do 9 | if id "$u" > /dev/null 2>&1; then 10 | echo $u 11 | echo 'user still exists' 12 | else 13 | # user id doesn't exist anymore; delete from slurm 14 | echo $u 15 | echo 'user no longer exists, deleting slurm username' 16 | sudo sacctmgr -i delete user name=$u 17 | sudo rm -r /storage/$u 18 | fi 19 | done 20 | -------------------------------------------------------------------------------- /create_admin_user.sh: -------------------------------------------------------------------------------- 1 | # used for creating admin accounts, like faculty 2 | 3 | # first authenticate for kerberos 4 | # kpass can be set in ~/.bashrc or as an environment variable 5 | { 6 | echo $kpass | kinit admin 7 | } || { 8 | # if $kpass env variable does not exist, it will ask for the password 9 | kinit admin 10 | } || { 11 | # don't run script if auth fails 12 | echo "couldn't authenticate for kerberos; exiting" 13 | exit 1 14 | } 15 | 16 | id='ksorauf' 17 | sudo mkdir /storage/$id 18 | salt='deepdream' 19 | PASSWORD="$id$salt" 20 | echo $PASSWORD | ipa user-add $id --first='-' --last='-' --homedir=/storage/$id --shell=/bin/bash --password --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z' 21 | # add to admin group (change group name as necessary) 22 | ipa group-add-member faculty --users=$id 23 | # also should have the group have sudo privelages; see create_faculty_admin_sudo_group.sh 24 | # https://serverfault.com/a/560237/305991 25 | 26 | 27 | # create slurm user 28 | sudo sacctmgr -i create user name=$id account=faculty 29 | -------------------------------------------------------------------------------- /create_faculty_admin_sudo_group.sh: -------------------------------------------------------------------------------- 1 | # this is for creating a group for admins/faculty that have sudo priveleges 2 | # note: this still doesn't quite work 3 | # create sudo rule 4 | # https://serverfault.com/a/560237/305991 5 | 6 | ipa sudorule-add --cmdcat=all --hostcat=all --runasuser=all All 7 | 8 | # create faculty group 9 | ipa group-add faculty 10 | 11 | ipa sudorule-add-group --groups=faculty All 12 | -------------------------------------------------------------------------------- /create_users.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # You should first create the students SLURM account: 4 | # sudo sacctmgr 
create account students 5 | 6 | # need to first authenticate for kerberos 7 | # kpass can be set in ~/.bashrc or as an environment variable 8 | 9 | # the CSV file should have a header row, then values like: 10 | # 11 | # Name,ID,Email,User ID 12 | # "George, Nathan C.",2915101,ngeorge@regis.edu,ngeorge01 13 | 14 | # The script is run like: 15 | # create_users.sh -f Class_List_MSDS684_FW1_2019.csv -c User ID -t 16 | # options: 17 | # -f : filename for CSV with usernames 18 | # -c : column name for userid column 19 | # -t : option for testing (if set, does not create users or directories); 20 | # this is for testing if the csv and column name are correct 21 | 22 | { 23 | echo $kpass | kinit admin 24 | } || { 25 | # if $kpass env variable does not exist, it will ask for the password 26 | kinit admin 27 | } || { 28 | # don't run script if auth fails 29 | echo "couldn't authenticate for kerberos; exiting" 30 | exit 1 31 | } 32 | 33 | # test to ensure ipa connection works 34 | res=ip 35 | if ! [ $? -eq 0 ]; then 36 | echo 'connection to ipa failed; exiting' 37 | exit 1 38 | fi 39 | 40 | # default values for args 41 | usernamecol="User ID" 42 | testing=false 43 | 44 | while [ "$1" != "" ]; do 45 | case $1 in 46 | -f | --file ) 47 | shift 48 | csvfile="$1" 49 | shift;; 50 | -t | --testing ) 51 | testing=true 52 | shift;; 53 | -c | --usernamecol ) 54 | shift 55 | usernamecol="$1" 56 | shift;; 57 | esac 58 | done 59 | 60 | defaultsalt="deepdream" 61 | 62 | # gets the fourth column from the csv, which is the id in this case 63 | # you need to install csvtool for this to work: sudo apt install csvtool -y 64 | # sed '1d' deletes the first line (the column label) 65 | ids=$(csvtool namedcol "$usernamecol" $csvfile | sed '1d') 66 | if $testing 67 | then 68 | echo "testing" 69 | echo $ids 70 | else 71 | echo "creating new users" 72 | for id in $ids 73 | do 74 | # in case IDs are uppercase, convert to lowercase 75 | lc_id=$(echo "$id" | tr '[:upper:]' '[:lower:]') 76 | # some bug is causing the users to stick around even after sudo ipa user-del, 77 | # so skip this check to see if they exist 78 | # if id "$id" > /dev/null 2>&1; then 79 | # # if user exists already, do nothing 80 | # echo 'user already exists' 81 | # : 82 | # else 83 | PASSWORD="$lc_id$defaultsalt" 84 | echo $PASSWORD 85 | echo /storage/$id/ 86 | # for testing I also had to set the minimum password life to 0 hours: 87 | # ipa pwpolicy-mod global_policy --minlife 0 88 | # https://serverfault.com/a/609004/305991 89 | # sets user to expire in 1 year + 1 month, and the password is set to have already expired (so they must reset it upon logging in) 90 | echo $PASSWORD | ipa user-add $lc_id --first='-' --last='-' --homedir=/storage/$lc_id --shell=/bin/bash --password --setattr krbprincipalexpiration=$(date '+%Y-%m-%d' -d '+1 year +30 days')$'Z' --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z' 91 | # make their home folder only readable to them and not other students 92 | sudo mkdir /storage/$lc_id 93 | # bashrc and profile were copied from the main accounts' home dir 94 | sudo cp /etc/skel/.profile /storage/$lc_id 95 | sudo cp /etc/skel/.bashrc /storage/$lc_id 96 | # only allow users to see their own directory and not others' 97 | sudo chmod -R 700 /storage/$lc_id 98 | # only allow users to use 4 of 6 GPUs at a time 99 | # -i option: commit without asking for confirmation (no y/N option) 100 | # MaxJobs -- max number of jobs that can run at once 101 | # MaxSubmitJobs -- Max number of jobs that can be submitted to the queue 102 
| # MaxWall -- max number of minutes per job (set to 12 hours) 103 | sudo sacctmgr -i create user name=$lc_id account=students MaxJobs=4 MaxSubmitJobs=30 MaxWall=720 104 | # sudo sacctmgr -i modify user where name=$id set MaxJobs=4 105 | # fi 106 | done 107 | 108 | # for some reason it can't find the newly-created users, so have to put this in another loop 109 | for id in $ids 110 | do 111 | # in case IDs are uppercase, convert to lowercase 112 | lc_id=$(echo "$id" | tr '[:upper:]' '[:lower:]') 113 | sudo chown -R $lc_id:$lc_id /storage/$lc_id 114 | # set quota on storage drive 115 | sudo setquota -u $lc_id 150G 150G 0 0 /storage 116 | sudo setquota -u $lc_id 5G 5G 0 0 / 117 | done 118 | fi -------------------------------------------------------------------------------- /create_users_no_sync.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Creates users with base Linux and no UID/GID synchronization 4 | 5 | # You should first create the students SLURM account: 6 | # sudo sacctmgr create account students 7 | 8 | csvfile=Class_List_MSDS684_FW1_2019.csv 9 | defaultsalt="deepdream" 10 | 11 | # gets the fourth column from the csv, which is the id in this case 12 | ids=$(csvtool namedcol "User ID" $csvfile | sed '1d') 13 | for id in $ids 14 | do 15 | if id "$id" > /dev/null 2>&1; then 16 | # if user exists already, do nothing 17 | : 18 | else 19 | PASSWORD="$id$defaultsalt" 20 | echo $PASSWORD 21 | echo /storage/$id/ 22 | sudo useradd $id -d /storage/$id/ -m -g students -p $(openssl passwd -1 ${PASSWORD}) -e $(date '+%Y-%m-%d' -d '+1 year +30 days') -s /bin/bash 23 | # make their home folder only readable to them and not other students 24 | sudo chmod -R 700 /storage/$id 25 | sudo chown -R $id:$id /storage/$id 26 | sudo sacctmgr -i create user name=$id account=students 27 | # expire their password so it must be changed upon first login 28 | sudo passwd -e $id 29 | # set quota on storage drive 30 | sudo setquota -u $id 150G 150G 0 0 /storage 31 | sudo setquota -u $id 5G 5G 0 0 / 32 | fi 33 | done 34 | -------------------------------------------------------------------------------- /delete_users.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # need to first authenticate for kerberos 4 | # kpass can be set in ~/.bashrc or as an environment variable 5 | { 6 | echo $kpass | sudo kinit admin 7 | } || { 8 | # if $kpass env variable does not exist, it will ask for the password 9 | sudo kinit admin 10 | } || { 11 | # don't run script if auth fails 12 | echo "couldn't authenticate for kerberos; exiting" 13 | exit 1 14 | } 15 | 16 | # this should just be a file with a username on each newline 17 | csvfile=del_users.csv 18 | 19 | ids=$(csvtool col 1 $csvfile) 20 | 21 | for id in $ids 22 | do 23 | ipa user-del $id 24 | # -i option: commit without asking for confirmation 25 | sudo sacctmgr -i delete user $id 26 | sudo rm -r /storage/$id 27 | done 28 | -------------------------------------------------------------------------------- /demo_files/test.job: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 # nodes requested 3 | #SBATCH --job-name=test 4 | #SBATCH --output=/storage/test/test.out 5 | #SBATCH --error=/storage/test/test.err 6 | #SBATCH --time=2-00:00 7 | #SBATCH --mem=36000 8 | #SBATCH --qos=normal 9 | #SBATCH --gres=gpu:1 10 | # the -u option means 'unbuffered', 11 | # which should continuously write output to the 
.out file 12 | srun -u /home/msds/anaconda3/bin/python /storage/test/tf2_test.py 13 | -------------------------------------------------------------------------------- /demo_files/tf1_load_test.py: -------------------------------------------------------------------------------- 1 | # to load saved model 2 | from keras.models import load_model 3 | import keras 4 | from keras.datasets import mnist 5 | from keras import backend as K 6 | 7 | # input image dimensions 8 | img_rows, img_cols = 28, 28 9 | 10 | num_classes = 10 11 | 12 | # the data, split between train and test sets 13 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 14 | 15 | if K.image_data_format() == 'channels_first': 16 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 17 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 18 | input_shape = (1, img_rows, img_cols) 19 | else: 20 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 21 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 22 | input_shape = (img_rows, img_cols, 1) 23 | 24 | x_train = x_train.astype('float32') 25 | x_test = x_test.astype('float32') 26 | x_train /= 255 27 | x_test /= 255 28 | 29 | # convert class vectors to binary class matrices 30 | y_train = keras.utils.to_categorical(y_train, num_classes) 31 | y_test = keras.utils.to_categorical(y_test, num_classes) 32 | 33 | 34 | # to load saved model 35 | filepath = 'tf1_mnist_cnn.hdf5' 36 | model = load_model(filepath) 37 | score = model.evaluate(x_test, y_test, verbose = 0) 38 | print('Test loss:', score[0]) 39 | print('Test accuracy:', score[1]) 40 | -------------------------------------------------------------------------------- /demo_files/tf1_test.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | """ 3 | Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py 4 | This uses keras and tensorflow pre-tensorflow 2.0. 5 | """ 6 | 7 | '''Trains a simple convnet on the MNIST dataset. 8 | 9 | Gets to 99.25% test accuracy after 12 epochs 10 | (there is still a lot of margin for parameter tuning). 11 | 16 seconds per epoch on a GRID K520 GPU. 
12 | ''' 13 | 14 | import keras 15 | from keras.datasets import mnist 16 | from keras.models import Sequential 17 | from keras.layers import Dense, Dropout, Flatten 18 | from keras.layers import Conv2D, MaxPooling2D 19 | from keras import backend as K 20 | from keras.callbacks.callbacks import ModelCheckpoint 21 | 22 | batch_size = 128 23 | num_classes = 10 24 | epochs = 3 25 | 26 | # input image dimensions 27 | img_rows, img_cols = 28, 28 28 | 29 | # the data, split between train and test sets 30 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 31 | 32 | if K.image_data_format() == 'channels_first': 33 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 34 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 35 | input_shape = (1, img_rows, img_cols) 36 | else: 37 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 38 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 39 | input_shape = (img_rows, img_cols, 1) 40 | 41 | x_train = x_train.astype('float32') 42 | x_test = x_test.astype('float32') 43 | x_train /= 255 44 | x_test /= 255 45 | print('x_train shape:', x_train.shape) 46 | print(x_train.shape[0], 'train samples') 47 | print(x_test.shape[0], 'test samples') 48 | 49 | # convert class vectors to binary class matrices 50 | y_train = keras.utils.to_categorical(y_train, num_classes) 51 | y_test = keras.utils.to_categorical(y_test, num_classes) 52 | 53 | model = Sequential() 54 | model.add(Conv2D(32, kernel_size=(3, 3), 55 | activation='relu', 56 | input_shape=input_shape)) 57 | model.add(Conv2D(64, (3, 3), activation='relu')) 58 | model.add(MaxPooling2D(pool_size=(2, 2))) 59 | model.add(Dropout(0.25)) 60 | model.add(Flatten()) 61 | model.add(Dense(128, activation='relu')) 62 | model.add(Dropout(0.5)) 63 | model.add(Dense(num_classes, activation='softmax')) 64 | 65 | model.compile(loss=keras.losses.categorical_crossentropy, 66 | optimizer=keras.optimizers.Adadelta(), 67 | metrics=['accuracy']) 68 | 69 | # add checkpoint to save model with lowest val loss 70 | filepath = 'tf1_mnist_cnn.hdf5' 71 | save_checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=2, \ 72 | save_best_only=False, save_weights_only=False, \ 73 | mode='auto', period=1) 74 | 75 | # verbose=2 is important; it writes one line per epoch 76 | # verbose=1 is the 'live monitoring' version, which doesn't work with slurm 77 | model.fit(x_train, y_train, 78 | batch_size=batch_size, 79 | epochs=epochs, 80 | verbose=2, 81 | validation_data=(x_test, y_test), 82 | callbacks=[save_checkpoint]) 83 | 84 | score = model.evaluate(x_test, y_test, verbose=0) 85 | print('Test loss:', score[0]) 86 | print('Test accuracy:', score[1]) 87 | -------------------------------------------------------------------------------- /demo_files/tf2_load_test.py: -------------------------------------------------------------------------------- 1 | from tensorflow.keras.models import load_model 2 | import tensorflow.keras as keras 3 | from tensorflow.keras.datasets import mnist 4 | from tensorflow.keras import backend as K 5 | 6 | # input image dimensions 7 | img_rows, img_cols = 28, 28 8 | 9 | num_classes = 10 10 | 11 | # the data, split between train and test sets 12 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 13 | 14 | if K.image_data_format() == 'channels_first': 15 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 16 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 17 | input_shape = (1, img_rows, img_cols) 18 | else: 19 
| x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 20 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 21 | input_shape = (img_rows, img_cols, 1) 22 | 23 | x_train = x_train.astype('float32') 24 | x_test = x_test.astype('float32') 25 | x_train /= 255 26 | x_test /= 255 27 | 28 | # convert class vectors to binary class matrices 29 | y_train = keras.utils.to_categorical(y_train, num_classes) 30 | y_test = keras.utils.to_categorical(y_test, num_classes) 31 | 32 | 33 | # to load saved model 34 | filepath = 'tf2_mnist_model.hdf5' 35 | model = load_model(filepath) 36 | score = model.evaluate(x_test, y_test, verbose = 0) 37 | print('Test loss:', score[0]) 38 | print('Test accuracy:', score[1]) 39 | -------------------------------------------------------------------------------- /demo_files/tf2_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py 3 | This uses keras and tensorflow pre-tensorflow 2.0. 4 | """ 5 | 6 | '''Trains a simple convnet on the MNIST dataset. 7 | 8 | Gets to 99.25% test accuracy after 12 epochs 9 | (there is still a lot of margin for parameter tuning). 10 | 16 seconds per epoch on a GRID K520 GPU. 11 | ''' 12 | 13 | import tensorflow.keras as keras 14 | from tensorflow.keras.datasets import mnist 15 | from tensorflow.keras.models import Sequential 16 | from tensorflow.keras.layers import Dense, Dropout, Flatten 17 | from tensorflow.keras.layers import Conv2D, MaxPooling2D 18 | from tensorflow.keras import backend as K 19 | from tensorflow.keras.callbacks import ModelCheckpoint 20 | 21 | batch_size = 128 22 | num_classes = 10 23 | epochs = 3 24 | 25 | # input image dimensions 26 | img_rows, img_cols = 28, 28 27 | 28 | # the data, split between train and test sets 29 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 30 | 31 | if K.image_data_format() == 'channels_first': 32 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 33 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 34 | input_shape = (1, img_rows, img_cols) 35 | else: 36 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 37 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 38 | input_shape = (img_rows, img_cols, 1) 39 | 40 | x_train = x_train.astype('float32') 41 | x_test = x_test.astype('float32') 42 | x_train /= 255 43 | x_test /= 255 44 | print('x_train shape:', x_train.shape) 45 | print(x_train.shape[0], 'train samples') 46 | print(x_test.shape[0], 'test samples') 47 | 48 | # convert class vectors to binary class matrices 49 | y_train = keras.utils.to_categorical(y_train, num_classes) 50 | y_test = keras.utils.to_categorical(y_test, num_classes) 51 | 52 | model = Sequential() 53 | model.add(Conv2D(32, kernel_size=(3, 3), 54 | activation='relu', 55 | input_shape=input_shape)) 56 | model.add(Conv2D(64, (3, 3), activation='relu')) 57 | model.add(MaxPooling2D(pool_size=(2, 2))) 58 | model.add(Dropout(0.25)) 59 | model.add(Flatten()) 60 | model.add(Dense(128, activation='relu')) 61 | model.add(Dropout(0.5)) 62 | model.add(Dense(num_classes, activation='softmax')) 63 | 64 | model.compile(loss=keras.losses.categorical_crossentropy, 65 | optimizer=keras.optimizers.Adadelta(), 66 | metrics=['accuracy']) 67 | 68 | # add a checkpoint to save the lowest validation loss 69 | filepath = 'tf2_mnist_model.hdf5' 70 | 71 | checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=2, \ 72 
| save_best_only=False, save_weights_only=False, \ 73 | mode='auto', save_freq='epoch') 74 | 75 | # verbose=2 is important; it writes one line per epoch 76 | # verbose=1 is the 'live monitoring' version, which doesn't work with slurm 77 | model.fit(x_train, y_train, 78 | batch_size=batch_size, 79 | epochs=epochs, 80 | verbose=2, 81 | validation_data=(x_test, y_test), 82 | callbacks=[checkpoint]) 83 | 84 | score = model.evaluate(x_test, y_test, verbose=0) 85 | print('Test loss:', score[0]) 86 | print('Test accuracy:', score[1]) 87 | --------------------------------------------------------------------------------