├── .gitignore ├── .vscode └── settings.json ├── LICENSE ├── README.md ├── check_if_user_expired.sh ├── create_admin_user.sh ├── create_faculty_admin_sudo_group.sh ├── create_users.sh ├── create_users_no_sync.sh ├── delete_users.sh └── demo_files ├── test.job ├── tf1_load_test.py ├── tf1_test.py ├── tf2_load_test.py └── tf2_test.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.hdf5 2 | Class_List_MSDS684_FW1_2019.csv 3 | -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "spellright.language": [ 3 | "en_US" 4 | ], 5 | "spellright.documentTypes": [ 6 | "markdown", 7 | "latex", 8 | "plaintext" 9 | ] 10 | } -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Nate George 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # slurm_gpu_ubuntu 2 | Instructions for setting up a SLURM cluster using Ubuntu 18.04.3 with GPUs. Go from a pile of hardware to a functional GPU cluster with job queueing and user management. 3 | 4 | OS used: Ubuntu 18.04.3 LTS 5 | 6 | 7 | # Overview 8 | This guide will help you create and install a GPU HPC cluster with a job queue and user management. The idea is to have a GPU cluster which allows use of a few GPUs by many people. Using multiple GPUs at once is not the point here, and hasn't been tested. This guide demonstrates how to create a GPU cluster for neural networks (deep learning) which uses Python and related neural network libraries (Tensorflow, Keras, Pytorch), CUDA, and NVIDIA GPU cards. You can expect this to take you a few days up to a week. 
9 | 10 | ## Outline of steps: 11 | 12 | - Prepare hardware 13 | - Install OSes 14 | - Sync UID/GIDs or create slurm/munge users 15 | - Install software (NVIDIA drivers, Anaconda, and Python packages) 16 | - Install/configure file sharing (NFS here; if using more than one node/computer in the cluster) 17 | - Install munge/SLURM and configure 18 | - User management 19 | 20 | ## Acknowledgements 21 | This wouldn't have been possible without this [github repo](https://github.com/mknoxnv/ubuntu-slurm) from mknoxnv. I don't know who that person is, but they saved me weeks of work trying to figure out all the conf files and services, etc. 22 | 23 | # Preparing Hardware 24 | 25 | If you do not already have hardware, here are some considerations: 26 | 27 | Top-of-the-line commodity motherboards can handle up to 4 GPUs. You should pay attention to PCI lanes in the motherboard and CPU specifications. Usually GPUs can take up to 16 PCI lanes, and work fastest for data transfer when using all 16 lanes. To use 4 GPUs in one machine, your motherboard should support at least 64 PCI lanes, and CPUs should have at least 64 lanes available. M.2 SSDs can use PCI lanes as well, so it can be better to have a little more than 64 lanes if possible. The motherboard and CPU specs usually detail the PCI lanes. 28 | 29 | We used NVIDIA GPU cards in our cluster, but many AMD cards [should now work](https://rocm.github.io/) with Python deep learning libraries. 30 | 31 | Power supply wattage is also important to consider, as GPUs can draw a lot of watts at peak power. 32 | 33 | You only need one computer, but to have more than 4 GPUs you will need at least 2 computers. This guide assumes you are using more than one computer in your cluster. 34 | 35 | # Installing operating systems 36 | 37 | Once you have hardware up and running, you need to install an OS. From my research I've found Ubuntu is the top Linux distribution as of 2019 (both for commodity hardware and servers), and is recommended. Currently the latest long-term support (LTS) version is Ubuntu 18.04.3, which is what was used here. LTS releases are usually better because they are more stable over time. Other Linux distributions may differ in some of the commands. 38 | 39 | I recommend creating a bootable USB stick and installing Ubuntu from that. Often with NVIDIA, the installation freezes upon loading and [this fix](https://askubuntu.com/a/870245/458247) must be implemented. Once the boot menu appears, choose Ubuntu or Install Ubuntu, then press 'e', then add `acpi=off` directly after `quiet splash` (leaving a space between splash and acpi). Then press F10 and it should boot. 40 | 41 | I recommend using [LVM](https://www.howtogeek.com/211937/how-to-use-lvm-on-ubuntu-for-easy-partition-resizing-and-snapshots/) when installing (there is a checkbox for it in the Ubuntu installer), so that you can add and extend storage HDDs if needed. 42 | 43 | **Note**: Along the way I used the package manager to update/upgrade software many times (`sudo apt-get update` and `sudo apt-get upgrade`) followed by reboots. If something is not working, this can be a first step to try to debug it. 44 | 45 | ## Synchronizing GID/UIDs 46 | It's recommended to sync the GIDs and UIDs across machines. This can be done with something like LDAP (install instructions [here](https://computingforgeeks.com/how-to-install-and-configure-openldap-ubuntu-18-04/) and [here](https://www.techrepublic.com/article/how-to-install-openldap-on-ubuntu-18-04/)). 
In my experience, for basic cluster management where all users can read and write to the folders where job files exist, the only GIDs and UIDs that need to be synced are the slurm and munge users. Other users can be created and run SLURM jobs without having usernames on the other machines in the cluster. 47 | 48 | However, if you want to isolate access to users' home folders (best practice I'd say), then you must synchronize users across the cluster. The easiest way I've found to synchronize UIDs and GIDs across an Ubuntu cluster is FreeIPA. Here are installation instructions: 49 | 50 | - [Server (master node)](https://computingforgeeks.com/how-to-install-and-configure-freeipa-server-on-ubuntu-18-04-ubuntu-16-04/) (note: I had to run [this command](https://stackoverflow.com/a/54539428/4549682) to fix an issue after installing) 51 | - [Client (worker nodes)](https://computingforgeeks.com/how-to-configure-freeipa-client-on-ubuntu-18-04-ubuntu-16-04-centos-7/) 52 | 53 | It is important that you set the hostname to an FQDN, otherwise Kerberos/FreeIPA won't work. If you accidentally set the hostname to the wrong value during the Kerberos setup, you can change it in `/etc/krb5.conf`. You could also completely purge Kerberos [like so](https://serverfault.com/a/885525/305991). If you need to redo the IPA configuration, you can run `sudo ipa-server-install --uninstall` then try installing again. I had to do the uninstall twice for it to work. 54 | 55 | ## Synchronizing time 56 | FreeIPA should take care of syncing time, so you shouldn't have to worry about this if you set up FreeIPA. You can see if times are synced with the `date` command on the various machines. 57 | 58 | 59 | Otherwise, it's not a bad idea to sync the time across the servers yourself. [Here's how](https://knowm.org/how-to-synchronize-time-across-a-linux-cluster/). One time when I set up the cluster it was fine, but another time the slurmctld service wouldn't start because the times weren't synced. 60 | 61 | 62 | ## Set up munge and slurm users and groups 63 | Immediately after installing the OSes, you want to create the munge and slurm users and groups on all machines. The GID and UID (group and user IDs) must match for munge and slurm across all machines. If you have a lot of machines, you can use the parallel SSH utilities mentioned below. There are also other options like NIS and NIS+. One other option is to use FreeIPA to create users and groups. 64 | 65 | On all machines we need the munge authentication service and slurm installed. First, we want to have the munge and slurm users/groups with the same UIDs and GIDs. In my experience, these are the only GIDs and UIDs that need synchronization for the cluster to work. On all machines: 66 | 67 | ``` 68 | sudo adduser -u 1111 munge --disabled-password --gecos "" 69 | sudo adduser -u 1121 slurm --disabled-password --gecos "" 70 | ``` 71 | 72 | #### You shouldn’t need to do this, but just in case, you could create the groups first, then create the users 73 | 74 | ``` 75 | sudo addgroup -gid 1111 munge 76 | sudo addgroup -gid 1121 slurm 77 | sudo adduser -u 1111 munge --disabled-password --gecos "" -gid 1111 78 | sudo adduser -u 1121 slurm --disabled-password --gecos "" -gid 1121 79 | ``` 80 | 81 | When a user is created, a group with the same name is created as well. 82 | 83 | The numbers don’t matter as long as they are available for the user and group IDs. These numbers seemed to work with a default Ubuntu 18.04.3 installation. 
It seems like by default Ubuntu sets up a new user with a UID and GID of UID + 1 if that GID is already taken, so this follows that pattern. 84 | 85 | 86 | 87 | ## Installing software/drivers 88 | Next you should install SSH. Open a terminal and install: `sudo apt install openssh-server -y`. 89 | 90 | Once you have SSH on the machines, you may want to use a [parallel SSH utility](https://www.tecmint.com/run-commands-on-multiple-linux-servers/) to execute commands on all machines at once. 91 | 92 | ### Install NVIDIA drivers 93 | You will need the latest NVIDIA drivers installed for your cards. The procedure [currently is](http://ubuntuhandbook.org/index.php/2019/04/nvidia-430-09-gtx-1650-support/): 94 | 95 | ``` 96 | sudo add-apt-repository ppa:graphics-drivers/ppa 97 | sudo apt-get update 98 | sudo apt-get install nvidia-driver-430 99 | ``` 100 | 101 | The 430 driver will probably update soon. You can use `sudo apt-cache search nvidia-driver*` to find the latest one, or go to the "Software & Updates" menu to install it. For some reason, on the latest install I had to use aptitude to install it: 102 | 103 | ``` 104 | sudo apt-get install aptitude -y 105 | sudo aptitude install nvidia-driver-430 106 | ``` 107 | 108 | But that still didn't seem to solve the issue, and I installed it via the "Software & Updates" menu under "Additional Drivers". 109 | 110 | We also use [NoMachine](https://www.nomachine.com/) for remote GUI access. 111 | 112 | ## Install the Anaconda Python distribution 113 | Anaconda makes installing deep learning libraries easier, and doesn’t require installing CUDA/cuDNN libraries (which is a pain). Anaconda handles the CUDA and other dependencies for deep learning libraries. 114 | 115 | Download the distribution file: 116 | 117 | ``` 118 | cd /tmp 119 | wget https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh 120 | ``` 121 | You may want to visit https://repo.anaconda.com/archive/ to get the latest Anaconda version instead, though you can use: 122 | 123 | 124 | `conda update conda anaconda` 125 | 126 | or 127 | 128 | `conda update --all` 129 | 130 | to update Anaconda once it’s installed. 131 | 132 | Once the .sh file is downloaded, you should make it executable with 133 | 134 | `chmod +x Anaconda3-2019.03-Linux-x86_64.sh` 135 | 136 | then run the file: 137 | 138 | `./Anaconda3-2019.03-Linux-x86_64.sh` 139 | 140 | I chose yes for `Do you wish the installer to initialize Anaconda3 141 | by running conda init?`. 142 | 143 | Then you should do `source ~/.bashrc` to enable `conda` as a command. 144 | 145 | If you chose `no` for the `conda init` portion, you may need to add some aliases to bashrc: 146 | 147 | `nano ~/.bashrc` 148 | 149 | Add the lines: 150 | 151 | ``` 152 | alias conda=~/anaconda3/bin/conda 153 | alias python=~/anaconda3/bin/python 154 | alias ipython=~/anaconda3/bin/ipython 155 | ``` 156 | 157 | Now install some Anaconda packages: 158 | 159 | ``` 160 | conda update conda 161 | conda install anaconda 162 | conda install python=3.6 163 | conda install tensorflow-gpu keras pytorch 164 | ``` 165 | 166 | The 3.6 install can take a while to complete (environment solving with conda is slow; it took about 15 minutes for me even on a fast computer -- the environment solving is definitely a big drawback of Anaconda). Not a bad idea to use tmux and put the `conda install python=3.6` in a `tmux` shell in case an SSH session is interrupted. 167 | 168 | Python 3.6 is the latest version with easy support for TensorFlow and some other packages. 
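Before moving on, it's worth a quick check that the conda packages can actually see the GPU. Below is a minimal sanity-check sketch (assuming the `tensorflow-gpu` 1.x and `pytorch` packages installed above; `tf.test.is_gpu_available()` is the TF 1.x-style check):

```
# quick GPU sanity check -- a minimal sketch, not part of the repo's demo files
import tensorflow as tf
import torch

# prints True if TensorFlow can see a CUDA GPU
print('TensorFlow sees a GPU:', tf.test.is_gpu_available())

# prints True (and the device name) if PyTorch can see a CUDA GPU
print('PyTorch sees a GPU:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU name:', torch.cuda.get_device_name(0))
```

If either of these prints False, recheck the NVIDIA driver installation (`nvidia-smi`) before going further.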
169 | 170 | 171 | At this point you can use this code to test GPU functionality with this [demo code](https://raw.githubusercontent.com/keras-team/keras/master/examples/mnist_cnn.py), you could also use [this](https://stackoverflow.com/a/38580201/4549682). 172 | 173 | 174 | # Install NFS (shared storage) 175 | In order for SLURM to work properly, there must be a storage location present on all computers in the cluster with the same files used for jobs. All computers in the cluster must be able to read and write to this directory. One way to do this is with NFS, although other options such as OCFS2 exist. Here we use NFS. 176 | 177 | For the instructions, we will call the primary server `master` (the one hosting storage and the SLURM controller) and assume we have one worker node (another computer with GPUs) called `worker`. We will also assume the username/groupname for the main administrative account on all machines is `admin:admin`. I used the same username and group for the administrative accounts on all the servers. 178 | 179 | ## Master node 180 | On the master server, do: 181 | 182 | `sudo apt install nfs-kernel-server -y` 183 | 184 | Make a storage location: 185 | 186 | `sudo mkdir /storage` 187 | 188 | In my case, /storage was actually the mount point for a second HDD (LVM, which was expanded to 20TB). 189 | 190 | Change ownership to your administrative username and group: 191 | 192 | `sudo chown admin:admin /storage` 193 | 194 | Next we need to add rules for the shared location. This is done with: 195 | 196 | `sudo nano /etc/exports` 197 | 198 | Then adding the line: 199 | 200 | `/storage *(rw,sync,no_root_squash)` 201 | 202 | The * is for IP addresses or hostnames. In this case we allow anything, but you may want to limit it to your IPs/hostnames in the cluster. In fact, it wasn't working for me unless I explicitly set the IPs of the clients here. You have to have a separate entry for each IP. Mine ended up looking like: 203 | 204 | `/storage 172.xx.224.xx(rw,sync,no_root_squash,all_squash,anonuid=999999,anongid=999999) 172.xx.224.xx(rw,sync,no_root_squash,all_squash,anonuid=999999,anongid=999999)` 205 | 206 | where the 'xx's are actual numbers. 207 | 208 | Then start the NFS service: 209 | 210 | `sudo systemctl start nfs-kernel-server.service` 211 | 212 | It should start automatically upon restarts. 213 | 214 | You should also add a rule to allow for NFS traffic from the workers through port 2049. This is done like so: 215 | 216 | `sudo ufw allow from to any port nfs` 217 | 218 | Check the status with `sudo ufw status`. You should see a rule to allow traffic to port 2049 from your worker nodes' IP addresses. [Here's more info](https://www.digitalocean.com/community/tutorials/how-to-set-up-an-nfs-mount-on-ubuntu-16-04). 219 | 220 | 221 | ## Client nodes 222 | Now we can set up the clients. On all worker servers: 223 | 224 | ``` 225 | sudo apt install nfs-common -y 226 | sudo mkdir /storage 227 | sudo chown admin:admin /storage 228 | sudo mount master:/storage /storage 229 | ``` 230 | 231 | To make the drive mount upon restarts for the worker nodes, add this to fstab (`sudo nano /etc/fstab`): 232 | 233 | `master:/storage /storage nfs auto,timeo=14,intr 0 0` 234 | 235 | This can be done like so: 236 | 237 | `echo master:/storage /storage nfs auto,timeo=14,intr 0 0 | sudo tee -a /etc/fstab` 238 | 239 | Now any files put into /storage from the master server can be seen on all worker servers connect via NFS. The worker servers MUST be read and write. 
If not, any sbatch jobs will give an exit status of 1:0. 240 | 241 | 242 | # Preparing for SLURM installation 243 | ## Passwordless SSH from master to all workers 244 | 245 | First we need passwordless SSH between the master and compute nodes. We are still using `master` as the master node hostname and `worker` as the worker hostname. On the master: 246 | 247 | ``` 248 | ssh-keygen 249 | ssh-copy-id admin@worker 250 | ``` 251 | 252 | To do this with many worker nodes, you might want to set up a small script to loop through worker hostnames or IPs. 253 | 254 | ## Install munge on the master: 255 | ``` 256 | sudo apt-get install libmunge-dev libmunge2 munge -y 257 | sudo systemctl enable munge 258 | sudo systemctl start munge 259 | ``` 260 | 261 | Test munge if you like: 262 | `munge -n | unmunge | grep STATUS` 263 | 264 | 265 | Copy the munge key to /storage 266 | ``` 267 | sudo cp /etc/munge/munge.key /storage/ 268 | sudo chown munge /storage/munge.key 269 | sudo chmod 400 /storage/munge.key 270 | ``` 271 | 272 | ## Install munge on worker nodes: 273 | ``` 274 | sudo apt-get install libmunge-dev libmunge2 munge 275 | sudo cp /storage/munge.key /etc/munge/munge.key 276 | sudo systemctl enable munge 277 | sudo systemctl start munge 278 | ``` 279 | 280 | If you want, you can test munge: 281 | `munge -n | unmunge | grep STATUS` 282 | 283 | ## Prepare DB for SLURM 284 | 285 | These instructions more or less follow this github repo: https://github.com/mknoxnv/ubuntu-slurm 286 | 287 | First we want to clone the repo: 288 | `cd /storage` 289 | `git clone https://github.com/mknoxnv/ubuntu-slurm.git` 290 | 291 | Install prereqs: 292 | ``` 293 | sudo apt-get install git gcc make ruby ruby-dev libpam0g-dev libmariadb-client-lgpl-dev libmysqlclient-dev mariadb-server build-essential libssl-dev -y 294 | sudo gem install fpm 295 | ``` 296 | 297 | Next we set up MariaDB for storing SLURM data: 298 | ``` 299 | sudo systemctl enable mysql 300 | sudo systemctl start mysql 301 | sudo mysql -u root 302 | ``` 303 | 304 | Within mysql: 305 | ``` 306 | create database slurm_acct_db; 307 | create user 'slurm'@'localhost'; 308 | set password for 'slurm'@'localhost' = password('slurmdbpass'); 309 | grant usage on *.* to 'slurm'@'localhost'; 310 | grant all privileges on slurm_acct_db.* to 'slurm'@'localhost'; 311 | flush privileges; 312 | exit 313 | ``` 314 | 315 | Copy the default db config file: 316 | `cp /storage/ubuntu-slurm/slurmdbd.conf /storage` 317 | 318 | Ideally you want to change the password to something different than `slurmdbpass`. This must also be set in the config file `/storage/slurmdbd.conf`. 319 | 320 | # Install SLURM 321 | ## Download and install SLURM on Master 322 | 323 | ### Build the SLURM .deb install file 324 | It’s best to check the downloads page and use the latest version (right click link for download and use in the wget command). Ideally we’d have a script to scrape the latest version and use that dynamically. 325 | 326 | You can use the -j option to specify the number of CPU cores to use for 'make', like `make -j12`. `htop` is a nice package that will show usage stats and quickly show how many cores you have. 327 | 328 | ``` 329 | cd /storage 330 | wget https://download.schedmd.com/slurm/slurm-19.05.2.tar.bz2 331 | tar xvjf slurm-19.05.2.tar.bz2 332 | cd slurm-19.05.2 333 | ./configure --prefix=/tmp/slurm-build --sysconfdir=/etc/slurm --enable-pam --with-pam_dir=/lib/x86_64-linux-gnu/security/ --without-shared-libslurm 334 | make 335 | make contrib 336 | make install 337 | cd .. 
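# note: nothing has been installed into system paths yet -- the --prefix above
# stages everything under /tmp/slurm-build, which the next step packages into a
# .deb with fpm and installs with dpkg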
338 | ``` 339 | 340 | ### Install SLURM 341 | ``` 342 | sudo fpm -s dir -t deb -v 1.0 -n slurm-19.05.2 --prefix=/usr -C /tmp/slurm-build . 343 | sudo dpkg -i slurm-19.05.2_1.0_amd64.deb 344 | ``` 345 | 346 | Make all the directories we need: 347 | ``` 348 | sudo mkdir -p /etc/slurm /etc/slurm/prolog.d /etc/slurm/epilog.d /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm 349 | sudo chown slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm 350 | ``` 351 | 352 | Copy slurm control and db services: 353 | ``` 354 | sudo cp /storage/ubuntu-slurm/slurmdbd.service /etc/systemd/system/ 355 | sudo cp /storage/ubuntu-slurm/slurmctld.service /etc/systemd/system/ 356 | ``` 357 | 358 | The slurmdbd.conf file should be copied before starting the slurm services: 359 | `sudo cp /storage/slurmdbd.conf /etc/slurm/` 360 | 361 | Start the slurm services: 362 | ``` 363 | sudo systemctl daemon-reload 364 | sudo systemctl enable slurmdbd 365 | sudo systemctl start slurmdbd 366 | sudo systemctl enable slurmctld 367 | sudo systemctl start slurmctld 368 | ``` 369 | 370 | If the master is also going to be a worker/compute node, you should do: 371 | 372 | ``` 373 | sudo cp /storage/ubuntu-slurm/slurmd.service /etc/systemd/system/ 374 | sudo systemctl enable slurmd 375 | sudo systemctl start slurmd 376 | ``` 377 | 378 | ## Worker nodes 379 | Now install SLURM on worker nodes: 380 | 381 | ``` 382 | cd /storage 383 | sudo dpkg -i slurm-19.05.2_1.0_amd64.deb 384 | sudo cp /storage/ubuntu-slurm/slurmd.service /etc/systemd/system/ 385 | sudo systemctl enable slurmd 386 | sudo systemctl start slurmd 387 | ``` 388 | 389 | ## Configuring SLURM 390 | 391 | Next we need to set up the configuration file. Copy the default config from the github repo: 392 | 393 | `cp /storage/ubuntu-slurm/slurm.conf /storage/slurm.conf` 394 | 395 | Note: for job limits for users, you should add the [AccountingStorageEnforce=limits](https://slurm.schedmd.com/resource_limits.html) line to the config file. 396 | 397 | Once SLURM is installed on all nodes, we can use the command 398 | 399 | `sudo slurmd -C` 400 | 401 | to print out the machine specs. Then we can copy this line into the config file and modify it slightly. To modify it, we need to add the number of GPUs we have in the system (and remove the last part which show UpTime). Here is an example of a config line: 402 | 403 | `NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846` 404 | 405 | Take this line and put it at the bottom of `slurm.conf`. 406 | 407 | Next, setup the `gres.conf` file. Lines in `gres.conf` should look like: 408 | 409 | ``` 410 | NodeName=master Name=gpu File=/dev/nvidia0 411 | NodeName=master Name=gpu File=/dev/nvidia1 412 | ``` 413 | 414 | If you have multiple GPUs, keep adding lines for each node and increment the last number after nvidia. 415 | 416 | Gres has more options detailed in the docs: https://slurm.schedmd.com/slurm.conf.html (near the bottom). 417 | 418 | Finally, we need to copy .conf files on **all** machines. This includes the `slurm.conf` file, `gres.conf`, `cgroup.conf` , and `cgroup_allowed_devices_file.conf`. Without these files it seems like things don’t work. 
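For reference, the node and partition section at the bottom of `slurm.conf` might end up looking something like the sketch below. The hostnames, GPU counts, and partition name are only illustrative (based on the examples above); use the output of `sudo slurmd -C` from your own machines. Note that `GresTypes=gpu` also needs to be set in `slurm.conf` for the `Gres=gpu:N` entries to be recognized (check that the example config from the ubuntu-slurm repo includes it).

```
GresTypes=gpu
NodeName=master Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
NodeName=worker1 Gres=gpu:2 CPUs=12 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 RealMemory=128846
PartitionName=debug Nodes=master,worker1 Default=YES MaxTime=INFINITE State=UP
```

With the config files filled in, copy them to `/etc/slurm/` on every machine: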
419 | 420 | ``` 421 | sudo cp /storage/ubuntu-slurm/cgroup* /etc/slurm/ 422 | sudo cp /storage/slurm.conf /etc/slurm/ 423 | sudo cp /storage/gres.conf /etc/slurm/ 424 | ``` 425 | 426 | This directory should also be created on workers: 427 | ``` 428 | sudo mkdir -p /var/spool/slurm/d 429 | sudo chown slurm /var/spool/slurm/d 430 | ``` 431 | 432 | After the conf files have been copied to all workers and the master node, you may want to reboot the computers, or at least restart the slurm services: 433 | 434 | Workers: 435 | `sudo systemctl restart slurmd` 436 | Master: 437 | ``` 438 | sudo systemctl restart slurmctld 439 | sudo systemctl restart slurmdbd 440 | sudo systemctl restart slurmd 441 | ``` 442 | 443 | Next we just create a cluster: 444 | `sudo sacctmgr add cluster compute-cluster` 445 | 446 | 447 | ## Configure cgroups 448 | 449 | Cgroups allow SLURM to enforce memory limits on jobs and users. Enable memory cgroups on all workers by editing the GRUB config: 450 | 451 | ``` 452 | sudo nano /etc/default/grub 453 | # then change the following variable to: 454 | GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1" 455 | sudo update-grub 456 | ``` 457 | Finally, I did one last `sudo apt update`, `sudo apt upgrade`, and `sudo apt autoremove`, then rebooted the computers: 458 | `sudo reboot` 459 | 460 | 461 | # User Management 462 | It's best to configure the minimum password life to 0, so users can change their passwords immediately and log in. This can be done with: 463 | `ipa pwpolicy-add ipausers --minlife=0 --priority=0` 464 | This is useful for resetting passwords. 465 | 466 | ## Adding users 467 | Since we are using FreeIPA, we can use that to create users. 468 | [Here is an example script](create_users.sh) to add users from a CSV file. There is also a [script for deleting users](delete_users.sh). 469 | 470 | We are adding users within the FreeIPA system, within the SLURM system, and creating a home directory. The user is set to expire a little over a year from creation, and the password is set to expire upon the first login (prompting the user to change their password). 471 | 472 | Adding users can also be done with plain Linux tools and SLURM commands. In that case it's best to create a group for different user groups: 473 | `sudo groupadd normal` 474 | But this would require creating users on all the machines. FreeIPA takes care of that for us, so it's a better solution. 475 | 476 | ## Storage quotas 477 | Next we need to set storage quotas for the user. Follow [this guide](https://www.digitalocean.com/community/tutorials/how-to-set-filesystem-quotas-on-ubuntu-18-04) to set up the quota settings on the machine. 478 | 479 | Then we can set quotas: 480 | 481 | ```bash 482 | sudo setquota -u ngeorge 150G 150G 0 0 /storage 483 | sudo setquota -u ngeorge 5G 5G 0 0 / 484 | ``` 485 | 486 | `/dev/mapper/ubuntu--vg-root` is the LVM partition for the root drive `/`, and `/dev/disk/by-uuid/987d372b-9c96-4e62-af82-2d95dc6655b4` is the device from `/etc/fstab` for the HDD mounted at /storage. 487 | 488 | This sets the soft and hard limits to 150GB for /storage and 5GB for /. 489 | 490 | 491 | To see how much of the quota people are using: 492 | ``` 493 | sudo repquota -s / 494 | sudo repquota -s /storage 495 | ``` 496 | 497 | The new users don’t seem to always show up until they have saved something on the drive. 
You can also specifically look at one user with: 498 | 499 | `sudo quota -vs ngeorge` 500 | 501 | 502 | ## Deleting SLURM users on expiration 503 | The SLURM account manager has no way to set an expiration date for users. So we use [this script](check_if_user_expired.sh) to check if the Linux username has expired, and if so, we delete the slurm username and home directory. This runs on a cronjob once per day. Add it to the crontab file with: 504 | 505 | `sudo crontab -e` 506 | 507 | Add this line to run at 5 am every day on the machine: 508 | 509 | `0 5 * * * bash /home/<username>/slurm_gpu_ubuntu/check_if_user_expired.sh` 510 | 511 | Obviously fix the path to where the script is, and change the username to yours. 512 | 513 | ## Resetting user passwords 514 | To reset a user password: 515 | 516 | `kinit admin` 517 | `ipa user-mod <username> --password --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z'` 518 | This also sets the password to already be expired so they must reset it upon login. 519 | 520 | If the error comes up: `Password change failed. Server message: Current password's minimum life has not expired` 521 | then the min life for passwords needs to be changed: 522 | `ipa pwpolicy-add ipausers --minlife=0 --priority=0` 523 | The `ipausers` above is the group of users whose passwords are being reset. 524 | 525 | # Troubleshooting 526 | 527 | When in doubt, first try updating software with `sudo apt update; sudo apt upgrade -y` and rebooting (`sudo reboot`). 528 | 529 | ## Log files 530 | When in doubt, you can check the log files. The locations are set in the slurm.conf file, and are `/var/log/slurmd.log` and `/var/log/slurmctld.log` by default. Open them with `sudo nano /var/log/slurmctld.log`. To go to the bottom of the file, use ctrl+_ and ctrl+v. I also changed the log paths to `/var/log/slurm/slurmd.log` and so on, and changed the permissions of the folder to be owned by slurm: `sudo chown slurm:slurm /var/log/slurm`. 531 | 532 | ## Checking SLURM states 533 | Some helpful commands: 534 | 535 | `scontrol ping` -- this checks if the controller node can be reached. If this isn't working (i.e. the command returns 'DOWN' and not 'UP'), you might need to allow connections to the slurmctld port (`SlurmctldPort` in the slurm.conf file). This is set to 6817 in the config file. To allow connections with the firewall, execute: 536 | 537 | `sudo ufw allow from any to any port 6817` 538 | 539 | and 540 | 541 | `sudo ufw reload` 542 | 543 | ## Error codes 1:0 and 2:0 544 | If trying to run a job with `sbatch` and the exit code is 1:0, this is usually a file writing error. The first thing to check is that your output and error file paths in the .job file are correct. Also check that the .py file you want to run has the correct filepath in your .job file. Then you should go to the logs (`/var/log/slurm/slurmctld.log`) and see which node the job was trying to run on. Then go to that node and open the logs (`/var/log/slurm/slurmd.log`) to see what it says. It may say something about the path for the output/error files, or the path to the .py file being incorrect. 545 | 546 | It could also mean your common storage location is not r/w accessible to all nodes. In the logs, this would show up as something about permissions and being unable to write to the filesystem. Double-check that you can create files in the /storage location on all workers with something like `touch testing.txt`. If you can't create a file from the worker nodes, you probably have some sort of NFS issue. 
Go back to the NFS section and make sure everything looks ok. You should be able to create directories/files in /storage from any node with the admin account and they should show up as owned by the admin user. If not, you may have some issue in your /etc/exports or with your GID/UIDs not matching. 547 | 548 | If the exit code is 2:0, this can mean there is some problem with either the location of the Python executable, or some other error when running the Python script. Double check that the srun or python script is working as expected with the Python executable specified in the sbatch job file. 549 | 550 | If some workers are 'draining', down, or unavailable, you might try: 551 | 552 | `sudo scontrol update NodeName=worker1 State=RESUME` 553 | 554 | 555 | ## Node is stuck draining (drng from `sinfo`) 556 | This has happened due to the memory size in slurm.conf being higher than the actual memory size. Double check the memory from `free -m` or `sudo slurmd -C` and update slurm.conf on all machines in the cluster. Then run `sudo scontrol update NodeName=worker1 State=RESUME`. 557 | 558 | ## Nodes are not visible upon restart 559 | After restarting the master node, sometimes the workers aren't there. I've found I often have to do `sudo scontrol update NodeName=worker1 State=RESUME` to get them working/available. 560 | 561 | 562 | ## Taking a node offline 563 | The best way to take a node offline for maintenance is to drain it: 564 | `sudo scontrol update NodeName=worker1 State=DRAIN Reason='Maintenance'` 565 | 566 | Users can see the reason with `sinfo -R`. 567 | 568 | 569 | ## Testing GPU load 570 | Using `watch -n 0.1 nvidia-smi` will show the GPU load in real time. You can use this to monitor jobs as they are scheduled to make sure all the GPUs are being utilized. 571 | 572 | 573 | 574 | 575 | ## Setting account options 576 | You may want to limit jobs or submissions. Here is how to set attributes (-1 means no limit): 577 | ```bash 578 | sudo sacctmgr modify account students set GrpJobs=-1 579 | sudo sacctmgr modify account students set GrpSubmitJobs=-1 580 | sudo sacctmgr modify account students set MaxJobs=-1 581 | sudo sacctmgr modify account students set MaxSubmitJobs=-1 582 | ``` 583 | 584 | ## FreeIPA Troubleshooting 585 | 586 | If you can't access the FreeIPA admin web GUI, you may try changing permissions on the Kerberos folder as noted [here](https://scattered.network/2019/04/09/freeipa-webui-login-fails-with-login-failed-due-to-an-unknown-reason/). 587 | 588 | To get the machines to talk to each other with FreeIPA, you may also need to take some or all of these steps: 589 | 590 | - [Install a DNS service on the IPA server](https://docs.fedoraproject.org/en-US/Fedora/18/html/FreeIPA_Guide/enabling-dns.html) 591 | - [configure it to recognize the clients](https://www.howtoforge.com/how-to-install-freeipa-client-on-ubuntu-server-1804/#step-testing-freeipa-client) 592 | - This may also require enabling TCP and UDP traffic on port 53, and changing 593 | the [resolv.conf file on the clients](http://clusterfrak.com/sysops/app_installs/freeipa_clients/) to recognize the new name server on the server computer 594 | - change the /etc/nsswitch.conf file to include the line "initgroups: files sss", and 595 | add several instances of "sss" to [other lines in this file](https://bugzilla.redhat.com/show_bug.cgi?id=1366569) 596 | 597 | 598 | # Better sacct 599 | 600 | This shows all running jobs with the user who is running them. 
601 | 602 | `sacct --format=jobid,jobname,state,exitcode,user,account` 603 | 604 | More on sacct [here](https://slurm.schedmd.com/sacct.html). 605 | 606 | 607 | # Changing IPs 608 | If the IP addresses of your machines change, you will need to update these in the file `/etc/hosts` on all machines and `/etc/exports` on the master node. It's best to restart after making these changes. 609 | 610 | # NFS directory not showing up 611 | Check that the service is running on the master node: 612 | `sudo systemctl status nfs-kernel-server.service` 613 | 614 | If it is not working, you may have a syntax error in your /etc/exports file. Rebooting after getting this working is a good idea. Not a bad idea to reboot the client computers as well. 615 | 616 | Once you have the service running on the master node, then see if you can manually mount the drive on the clients: 617 | 618 | `sudo mount master:/storage /storage` 619 | 620 | If it is hanging here, try mounting on the master server: 621 | 622 | `sudo mkdir /test` 623 | `sudo mount master:/storage /test` 624 | 625 | If this works, you might have an issue with ports being blocked or other connection issues between the master and clients. 626 | 627 | You should check your firewall status with `sudo ufw status`. You should see a rule allowing port 2049 access from your worker nodes. If you don't have it, be sure to add it with `sudo ufw allow from <worker IP address> to any port nfs` then `sudo ufw reload`. You should use the IP and not the hostname. 628 | 629 | 630 | # Node not able to connect to slurmctld 631 | If a node isn't able to connect to the controller (server/master), first check that time is properly synced. Try using the `date` command to see if the times are synced across the servers. 632 | 633 | # Unable to uninstall and reinstall the FreeIPA client 634 | If you are trying to uninstall the FreeIPA client and reinstall it and it fails (e.g. gives an error `The ipa-client-install command failed. See /var/log/ipaclient-install.log for more information`), you can try installing it with: 635 | 636 | ``` 637 | sudo ipa-client-install --hostname=`hostname -f` \ 638 | --mkhomedir --server=copper.regis.edu \ 639 | --domain regis.edu --realm REGIS.EDU --force-join 640 | ``` 641 | 642 | where you should use your own domain instead of regis.edu and your IPA server's hostname instead of 'copper'. 643 | 644 | You might also try removing this file instead: 645 | 646 | `sudo rm /var/lib/ipa-client/sysrestore/sysrestore.state` 647 | 648 | However, when I was having this problem, it appeared to be some issue with LDAP and SSSD not working. I ended up reformatting and reinstalling the OS on the problem machine instead of trying to debug SSSD, which looked extremely time-consuming. 649 | 650 | 651 | # Running a demo file 652 | 653 | To test the cluster, it's useful to run some demo code. Since Keras is within TensorFlow from version 2.0 onward, there are two demo files under the folder 'demo_files'. tf1_test.py is for TensorFlow 1.x, and tf2_test.py is for TensorFlow 2.x. 654 | 655 | To run the demo file, it's best to first just run it directly with Python to see if it works. You can run `python tf2_test.py`, and if it works then you can proceed to the next step. If not, check the Python path (`which python`) to make sure it's using the correct Python executable. 656 | 657 | To ensure you're using the GPU and not the CPU, you can run `nvidia-smi` to watch and make sure the GPU memory is getting used while running the file. 
`watch -n 0.1 nvidia-smi` will show the GPU memory updated every 0.1 second. 658 | 659 | Once the TensorFlow demo file works, you can try submitting it as a SLURM job. This uses the test.job file. Run `sbatch test.job`. Then you can check the status of the job with `sacct`. This should show 'running', and you should see GPU being used on one of the worker nodes. To specify the exact worker node being used, add a line to the .job file: 660 | `#SBATCH --nodelist=worker_name` 661 | where 'worker_name' is the name of the node. 662 | You should also be able to use `sinfo` to check which nodes are running jobs. 663 | -------------------------------------------------------------------------------- /check_if_user_expired.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # checks if a slurm user should be deleted. This happens if the user 4 | # saccmgr options: -n = no header, -P = parsable2 (pipe delimited) 5 | # csvcut is from csvkit; install with conda or pip: conda install csvkit 6 | slurm_users=$(sacctmgr list user -n -P | csvcut -d '|' -c 1) 7 | for u in $slurm_users 8 | do 9 | if id "$u" > /dev/null 2>&1; then 10 | echo $u 11 | echo 'user still exists' 12 | else 13 | # user id doesn't exist anymore; delete from slurm 14 | echo $u 15 | echo 'user no longer exists, deleting slurm username' 16 | sudo sacctmgr -i delete user name=$u 17 | sudo rm -r /storage/$u 18 | fi 19 | done 20 | -------------------------------------------------------------------------------- /create_admin_user.sh: -------------------------------------------------------------------------------- 1 | # used for creating admin accounts, like faculty 2 | 3 | # first authenticate for kerberos 4 | # kpass can be set in ~/.bashrc or as an environment variable 5 | { 6 | echo $kpass | kinit admin 7 | } || { 8 | # if $kpass env variable does not exist, it will ask for the password 9 | kinit admin 10 | } || { 11 | # don't run script if auth fails 12 | echo "couldn't authenticate for kerberos; exiting" 13 | exit 1 14 | } 15 | 16 | id='ksorauf' 17 | sudo mkdir /storage/$id 18 | salt='deepdream' 19 | PASSWORD="$id$salt" 20 | echo $PASSWORD | ipa user-add $id --first='-' --last='-' --homedir=/storage/$id --shell=/bin/bash --password --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z' 21 | # add to admin group (change group name as necessary) 22 | ipa group-add-member faculty --users=$id 23 | # also should have the group have sudo privelages; see create_faculty_admin_sudo_group.sh 24 | # https://serverfault.com/a/560237/305991 25 | 26 | 27 | # create slurm user 28 | sudo sacctmgr -i create user name=$id account=faculty 29 | -------------------------------------------------------------------------------- /create_faculty_admin_sudo_group.sh: -------------------------------------------------------------------------------- 1 | # this is for creating a group for admins/faculty that have sudo priveleges 2 | # note: this still doesn't quite work 3 | # create sudo rule 4 | # https://serverfault.com/a/560237/305991 5 | 6 | ipa sudorule-add --cmdcat=all --hostcat=all --runasuser=all All 7 | 8 | # create faculty group 9 | ipa group-add faculty 10 | 11 | ipa sudorule-add-group --groups=faculty All 12 | -------------------------------------------------------------------------------- /create_users.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # You should first create the students SLURM account: 4 | # sudo sacctmgr 
create account students 5 | 6 | # need to first authenticate for kerberos 7 | # kpass can be set in ~/.bashrc or as an environment variable 8 | 9 | # the CSV file should have a header row, then values like: 10 | # 11 | # Name,ID,Email,User ID 12 | # "George, Nathan C.",2915101,ngeorge@regis.edu,ngeorge01 13 | 14 | # The script is run like: 15 | # create_users.sh -f Class_List_MSDS684_FW1_2019.csv -c User ID -t 16 | # options: 17 | # -f : filename for CSV with usernames 18 | # -c : column name for userid column 19 | # -t : option for testing (if set, does not create users or directories); 20 | # this is for testing if the csv and column name are correct 21 | 22 | { 23 | echo $kpass | kinit admin 24 | } || { 25 | # if $kpass env variable does not exist, it will ask for the password 26 | kinit admin 27 | } || { 28 | # don't run script if auth fails 29 | echo "couldn't authenticate for kerberos; exiting" 30 | exit 1 31 | } 32 | 33 | # test to ensure ipa connection works 34 | res=ip 35 | if ! [ $? -eq 0 ]; then 36 | echo 'connection to ipa failed; exiting' 37 | exit 1 38 | fi 39 | 40 | # default values for args 41 | usernamecol="User ID" 42 | testing=false 43 | 44 | while [ "$1" != "" ]; do 45 | case $1 in 46 | -f | --file ) 47 | shift 48 | csvfile="$1" 49 | shift;; 50 | -t | --testing ) 51 | testing=true 52 | shift;; 53 | -c | --usernamecol ) 54 | shift 55 | usernamecol="$1" 56 | shift;; 57 | esac 58 | done 59 | 60 | defaultsalt="deepdream" 61 | 62 | # gets the fourth column from the csv, which is the id in this case 63 | # you need to install csvtool for this to work: sudo apt install csvtool -y 64 | # sed '1d' deletes the first line (the column label) 65 | ids=$(csvtool namedcol "$usernamecol" $csvfile | sed '1d') 66 | if $testing 67 | then 68 | echo "testing" 69 | echo $ids 70 | else 71 | echo "creating new users" 72 | for id in $ids 73 | do 74 | # in case IDs are uppercase, convert to lowercase 75 | lc_id=$(echo "$id" | tr '[:upper:]' '[:lower:]') 76 | # some bug is causing the users to stick around even after sudo ipa user-del, 77 | # so skip this check to see if they exist 78 | # if id "$id" > /dev/null 2>&1; then 79 | # # if user exists already, do nothing 80 | # echo 'user already exists' 81 | # : 82 | # else 83 | PASSWORD="$lc_id$defaultsalt" 84 | echo $PASSWORD 85 | echo /storage/$id/ 86 | # for testing I also had to set the minimum password life to 0 hours: 87 | # ipa pwpolicy-mod global_policy --minlife 0 88 | # https://serverfault.com/a/609004/305991 89 | # sets user to expire in 1 year + 1 month, and the password is set to have already expired (so they must reset it upon logging in) 90 | echo $PASSWORD | ipa user-add $lc_id --first='-' --last='-' --homedir=/storage/$lc_id --shell=/bin/bash --password --setattr krbprincipalexpiration=$(date '+%Y-%m-%d' -d '+1 year +30 days')$'Z' --setattr krbPasswordExpiration=$(date '+%Y-%m-%d' -d '-1 day')$'Z' 91 | # make their home folder only readable to them and not other students 92 | sudo mkdir /storage/$lc_id 93 | # bashrc and profile were copied from the main accounts' home dir 94 | sudo cp /etc/skel/.profile /storage/$lc_id 95 | sudo cp /etc/skel/.bashrc /storage/$lc_id 96 | # only allow users to see their own directory and not others' 97 | sudo chmod -R 700 /storage/$lc_id 98 | # only allow users to use 4 of 6 GPUs at a time 99 | # -i option: commit without asking for confirmation (no y/N option) 100 | # MaxJobs -- max number of jobs that can run at once 101 | # MaxSubmitJobs -- Max number of jobs that can be submitted to the queue 102 
| # MaxWall -- max number of minutes per job (set to 12 hours) 103 | sudo sacctmgr -i create user name=$lc_id account=students MaxJobs=4 MaxSubmitJobs=30 MaxWall=720 104 | # sudo sacctmgr -i modify user where name=$id set MaxJobs=4 105 | # fi 106 | done 107 | 108 | # for some reason it can't find the newly-created users, so have to put this in another loop 109 | for id in $ids 110 | do 111 | # in case IDs are uppercase, convert to lowercase 112 | lc_id=$(echo "$id" | tr '[:upper:]' '[:lower:]') 113 | sudo chown -R $lc_id:$lc_id /storage/$lc_id 114 | # set quota on storage drive 115 | sudo setquota -u $lc_id 150G 150G 0 0 /storage 116 | sudo setquota -u $lc_id 5G 5G 0 0 / 117 | done 118 | fi -------------------------------------------------------------------------------- /create_users_no_sync.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # Creates users with base Linux and no UID/GID synchronization 4 | 5 | # You should first create the students SLURM account: 6 | # sudo sacctmgr create account students 7 | 8 | csvfile=Class_List_MSDS684_FW1_2019.csv 9 | defaultsalt="deepdream" 10 | 11 | # gets the fourth column from the csv, which is the id in this case 12 | ids=$(csvtool namedcol "User ID" $csvfile | sed '1d') 13 | for id in $ids 14 | do 15 | if id "$id" > /dev/null 2>&1; then 16 | # if user exists already, do nothing 17 | : 18 | else 19 | PASSWORD="$id$defaultsalt" 20 | echo $PASSWORD 21 | echo /storage/$id/ 22 | sudo useradd $id -d /storage/$id/ -m -g students -p $(openssl passwd -1 ${PASSWORD}) -e $(date '+%Y-%m-%d' -d '+1 year +30 days') -s /bin/bash 23 | # make their home folder only readable to them and not other students 24 | sudo chmod -R 700 /storage/$id 25 | sudo chown -R $id:$id /storage/$id 26 | sudo sacctmgr -i create user name=$id account=students 27 | # expire their password so it must be changed upon first login 28 | sudo passwd -e $id 29 | # set quota on storage drive 30 | sudo setquota -u $id 150G 150G 0 0 /storage 31 | sudo setquota -u $id 5G 5G 0 0 / 32 | fi 33 | done 34 | -------------------------------------------------------------------------------- /delete_users.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # need to first authenticate for kerberos 4 | # kpass can be set in ~/.bashrc or as an environment variable 5 | { 6 | echo $kpass | sudo kinit admin 7 | } || { 8 | # if $kpass env variable does not exist, it will ask for the password 9 | sudo kinit admin 10 | } || { 11 | # don't run script if auth fails 12 | echo "couldn't authenticate for kerberos; exiting" 13 | exit 1 14 | } 15 | 16 | # this should just be a file with a username on each newline 17 | csvfile=del_users.csv 18 | 19 | ids=$(csvtool col 1 $csvfile) 20 | 21 | for id in $ids 22 | do 23 | ipa user-del $id 24 | # -i option: commit without asking for confirmation 25 | sudo sacctmgr -i delete user $id 26 | sudo rm -r /storage/$id 27 | done 28 | -------------------------------------------------------------------------------- /demo_files/test.job: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH -N 1 # nodes requested 3 | #SBATCH --job-name=test 4 | #SBATCH --output=/storage/test/test.out 5 | #SBATCH --error=/storage/test/test.err 6 | #SBATCH --time=2-00:00 7 | #SBATCH --mem=36000 8 | #SBATCH --qos=normal 9 | #SBATCH --gres=gpu:1 10 | # the -u option means 'unbuffered', 11 | # which should continuously write output to the 
.out file 12 | srun -u /home/msds/anaconda3/bin/python /storage/test/tf2_test.py 13 | -------------------------------------------------------------------------------- /demo_files/tf1_load_test.py: -------------------------------------------------------------------------------- 1 | # to load saved model 2 | from keras.models import load_model 3 | import keras 4 | from keras.datasets import mnist 5 | from keras import backend as K 6 | 7 | # input image dimensions 8 | img_rows, img_cols = 28, 28 9 | 10 | num_classes = 10 11 | 12 | # the data, split between train and test sets 13 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 14 | 15 | if K.image_data_format() == 'channels_first': 16 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 17 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 18 | input_shape = (1, img_rows, img_cols) 19 | else: 20 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 21 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 22 | input_shape = (img_rows, img_cols, 1) 23 | 24 | x_train = x_train.astype('float32') 25 | x_test = x_test.astype('float32') 26 | x_train /= 255 27 | x_test /= 255 28 | 29 | # convert class vectors to binary class matrices 30 | y_train = keras.utils.to_categorical(y_train, num_classes) 31 | y_test = keras.utils.to_categorical(y_test, num_classes) 32 | 33 | 34 | # to load saved model 35 | filepath = 'tf1_mnist_cnn.hdf5' 36 | model = load_model(filepath) 37 | score = model.evaluate(x_test, y_test, verbose = 0) 38 | print('Test loss:', score[0]) 39 | print('Test accuracy:', score[1]) 40 | -------------------------------------------------------------------------------- /demo_files/tf1_test.py: -------------------------------------------------------------------------------- 1 | from __future__ import print_function 2 | """ 3 | Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py 4 | This uses keras and tensorflow pre-tensorflow 2.0. 5 | """ 6 | 7 | '''Trains a simple convnet on the MNIST dataset. 8 | 9 | Gets to 99.25% test accuracy after 12 epochs 10 | (there is still a lot of margin for parameter tuning). 11 | 16 seconds per epoch on a GRID K520 GPU. 
12 | ''' 13 | 14 | import keras 15 | from keras.datasets import mnist 16 | from keras.models import Sequential 17 | from keras.layers import Dense, Dropout, Flatten 18 | from keras.layers import Conv2D, MaxPooling2D 19 | from keras import backend as K 20 | from keras.callbacks.callbacks import ModelCheckpoint 21 | 22 | batch_size = 128 23 | num_classes = 10 24 | epochs = 3 25 | 26 | # input image dimensions 27 | img_rows, img_cols = 28, 28 28 | 29 | # the data, split between train and test sets 30 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 31 | 32 | if K.image_data_format() == 'channels_first': 33 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 34 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 35 | input_shape = (1, img_rows, img_cols) 36 | else: 37 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 38 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 39 | input_shape = (img_rows, img_cols, 1) 40 | 41 | x_train = x_train.astype('float32') 42 | x_test = x_test.astype('float32') 43 | x_train /= 255 44 | x_test /= 255 45 | print('x_train shape:', x_train.shape) 46 | print(x_train.shape[0], 'train samples') 47 | print(x_test.shape[0], 'test samples') 48 | 49 | # convert class vectors to binary class matrices 50 | y_train = keras.utils.to_categorical(y_train, num_classes) 51 | y_test = keras.utils.to_categorical(y_test, num_classes) 52 | 53 | model = Sequential() 54 | model.add(Conv2D(32, kernel_size=(3, 3), 55 | activation='relu', 56 | input_shape=input_shape)) 57 | model.add(Conv2D(64, (3, 3), activation='relu')) 58 | model.add(MaxPooling2D(pool_size=(2, 2))) 59 | model.add(Dropout(0.25)) 60 | model.add(Flatten()) 61 | model.add(Dense(128, activation='relu')) 62 | model.add(Dropout(0.5)) 63 | model.add(Dense(num_classes, activation='softmax')) 64 | 65 | model.compile(loss=keras.losses.categorical_crossentropy, 66 | optimizer=keras.optimizers.Adadelta(), 67 | metrics=['accuracy']) 68 | 69 | # add checkpoint to save model with lowest val loss 70 | filepath = 'tf1_mnist_cnn.hdf5' 71 | save_checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=2, \ 72 | save_best_only=False, save_weights_only=False, \ 73 | mode='auto', period=1) 74 | 75 | # verbose=2 is important; it writes one line per epoch 76 | # verbose=1 is the 'live monitoring' version, which doesn't work with slurm 77 | model.fit(x_train, y_train, 78 | batch_size=batch_size, 79 | epochs=epochs, 80 | verbose=2, 81 | validation_data=(x_test, y_test), 82 | callbacks=[save_checkpoint]) 83 | 84 | score = model.evaluate(x_test, y_test, verbose=0) 85 | print('Test loss:', score[0]) 86 | print('Test accuracy:', score[1]) 87 | -------------------------------------------------------------------------------- /demo_files/tf2_load_test.py: -------------------------------------------------------------------------------- 1 | from tensorflow.keras.models import load_model 2 | import tensorflow.keras as keras 3 | from tensorflow.keras.datasets import mnist 4 | from tensorflow.keras import backend as K 5 | 6 | # input image dimensions 7 | img_rows, img_cols = 28, 28 8 | 9 | num_classes = 10 10 | 11 | # the data, split between train and test sets 12 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 13 | 14 | if K.image_data_format() == 'channels_first': 15 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 16 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 17 | input_shape = (1, img_rows, img_cols) 18 | else: 19 
| x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 20 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 21 | input_shape = (img_rows, img_cols, 1) 22 | 23 | x_train = x_train.astype('float32') 24 | x_test = x_test.astype('float32') 25 | x_train /= 255 26 | x_test /= 255 27 | 28 | # convert class vectors to binary class matrices 29 | y_train = keras.utils.to_categorical(y_train, num_classes) 30 | y_test = keras.utils.to_categorical(y_test, num_classes) 31 | 32 | 33 | # to load saved model 34 | filepath = 'tf2_mnist_model.hdf5' 35 | model = load_model(filepath) 36 | score = model.evaluate(x_test, y_test, verbose = 0) 37 | print('Test loss:', score[0]) 38 | print('Test accuracy:', score[1]) 39 | -------------------------------------------------------------------------------- /demo_files/tf2_test.py: -------------------------------------------------------------------------------- 1 | """ 2 | Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py 3 | This uses keras and tensorflow pre-tensorflow 2.0. 4 | """ 5 | 6 | '''Trains a simple convnet on the MNIST dataset. 7 | 8 | Gets to 99.25% test accuracy after 12 epochs 9 | (there is still a lot of margin for parameter tuning). 10 | 16 seconds per epoch on a GRID K520 GPU. 11 | ''' 12 | 13 | import tensorflow.keras as keras 14 | from tensorflow.keras.datasets import mnist 15 | from tensorflow.keras.models import Sequential 16 | from tensorflow.keras.layers import Dense, Dropout, Flatten 17 | from tensorflow.keras.layers import Conv2D, MaxPooling2D 18 | from tensorflow.keras import backend as K 19 | from tensorflow.keras.callbacks import ModelCheckpoint 20 | 21 | batch_size = 128 22 | num_classes = 10 23 | epochs = 3 24 | 25 | # input image dimensions 26 | img_rows, img_cols = 28, 28 27 | 28 | # the data, split between train and test sets 29 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 30 | 31 | if K.image_data_format() == 'channels_first': 32 | x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols) 33 | x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols) 34 | input_shape = (1, img_rows, img_cols) 35 | else: 36 | x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1) 37 | x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1) 38 | input_shape = (img_rows, img_cols, 1) 39 | 40 | x_train = x_train.astype('float32') 41 | x_test = x_test.astype('float32') 42 | x_train /= 255 43 | x_test /= 255 44 | print('x_train shape:', x_train.shape) 45 | print(x_train.shape[0], 'train samples') 46 | print(x_test.shape[0], 'test samples') 47 | 48 | # convert class vectors to binary class matrices 49 | y_train = keras.utils.to_categorical(y_train, num_classes) 50 | y_test = keras.utils.to_categorical(y_test, num_classes) 51 | 52 | model = Sequential() 53 | model.add(Conv2D(32, kernel_size=(3, 3), 54 | activation='relu', 55 | input_shape=input_shape)) 56 | model.add(Conv2D(64, (3, 3), activation='relu')) 57 | model.add(MaxPooling2D(pool_size=(2, 2))) 58 | model.add(Dropout(0.25)) 59 | model.add(Flatten()) 60 | model.add(Dense(128, activation='relu')) 61 | model.add(Dropout(0.5)) 62 | model.add(Dense(num_classes, activation='softmax')) 63 | 64 | model.compile(loss=keras.losses.categorical_crossentropy, 65 | optimizer=keras.optimizers.Adadelta(), 66 | metrics=['accuracy']) 67 | 68 | # add a checkpoint to save the lowest validation loss 69 | filepath = 'tf2_mnist_model.hdf5' 70 | 71 | checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=2, \ 72 
| save_best_only=False, save_weights_only=False, \ 73 | mode='auto', save_freq='epoch') 74 | 75 | # verbose=2 is important; it writes one line per epoch 76 | # verbose=1 is the 'live monitoring' version, which doesn't work with slurm 77 | model.fit(x_train, y_train, 78 | batch_size=batch_size, 79 | epochs=epochs, 80 | verbose=2, 81 | validation_data=(x_test, y_test), 82 | callbacks=[checkpoint]) 83 | 84 | score = model.evaluate(x_test, y_test, verbose=0) 85 | print('Test loss:', score[0]) 86 | print('Test accuracy:', score[1]) 87 | --------------------------------------------------------------------------------