├── .gitignore ├── README.md ├── instructions ├── initial-setup.md ├── interactive-jobs.md ├── parallel-jobs.md ├── singularity-workflow.md ├── useful-commands.md └── virtual-env-workflow.md ├── logs └── .gitignore └── src ├── deeplearning.def ├── deploy-with-conda-multiple.sh ├── deploy-with-conda.sh ├── deploy-with-singularity.sh ├── deploy-with-virtualenv.sh ├── multiply-debug.py └── multiply.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.sif 2 | .vscode 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tübingen ML Cloud Hello World Example(s) 2 | 3 | The goal of this repository is to give an easy entry point to people who want to use the [Tübingen ML Cloud](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/home), and in particular [Slurm](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands), to run Python code on GPUs. 4 | 5 | Specifically, this tutorial contains a small Python code sample for multiplying two matrices using PyTorch and instructions for setting up the required environments and executing this code on CPU or GPU on the ML Cloud. 6 | 7 | The tutorial consists of a minimal example for getting started quickly, and a number of slightly more detailed instructions explaining different workflows and aspects of the ML Cloud. 8 | 9 | **This is not an official introduction**, but rather a tutorial put together by the members of the [Machine Learning in Medical Image Analysis (MLMIA) lab](https://www.mlmia-unitue.de). The official ML Cloud Slurm wiki can be found [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands). If you do spot mistakes or have suggestions, comments and pull requests are very welcome. 10 | 11 | ## Contents 12 | 13 | Contents of this tutorial: 14 | * [Who is this for?](#who-is-this-for): Short description of target audience 15 | * [Minimal example](#minimal-example): The shortest path from getting an account to running the `multiply.py` code on GPU. 16 | 17 | In-depth instructions: 18 | * [Initial setup](/instructions/initial-setup.md): Some useful tricks for setting up your SSH connections and mounting Slurm volumes locally for working more efficiently. 19 | * [Virtual environment workflow](/instructions/virtual-env-workflow.md): A more in detailed explanation of the workflow based on virtual environments also used in the minimal example and an alternative workflow using `virtualenv`, instead of `conda`. 20 | * [Singularity workflow](/instructions/singularity-workflow.md): An alternative workflow using Singularity containers, which allows for more flexibility. 21 | * [Running parallel jobs](/instructions/parallel-jobs.md): A short introduction to executing a command for a list of parameters in parallel on multiple GPUs. 22 | * [Interactive jobs and debugging](/instructions/interactive-jobs.md): How to run interactive jobs and how to debug Python code on Slurm. 23 | * [Useful commands](/instructions/useful-commands.md): Some commonly used Slurm commands for shepherding jobs. 24 | 25 | ## Who is this for? 26 | 27 | This repository was compiled with members of the [MLMIA lab](https://www.mlmia-unitue.de) in mind to enable them to have a smooth start with the ML Cloud. 
28 | 29 | You will notice that in many places, paths refer to locations only accessible to the MLMIA group such as `/mnt/qb/baumgartner`. However, there should be equivalent paths for members of all groups. For example, if you are part of the Berens group there would be an equivalent `/mnt/qb/berens` folder to which you *will* have access. 30 | 31 | Generally, we believe that this introduction may also be useful for other ML Cloud users. 32 | 33 | The instructions were written with Linux users in mind. Most of the instructions will translate to Mac as well. If you use Windows, you may have to resort to [PuTTY](https://www.putty.org/), although newer versions of Windows also support `ssh` in the [Windows terminal](https://docs.microsoft.com/en-us/windows/terminal/tutorials/ssh). 34 | 35 | ## Minimal example 36 | 37 | So you have just obtained access to Slurm from the [ML Cloud Masters](mailto:mlcloudmaster@uni-tuebingen.de)? What now? 38 | 39 | ### Access via SSH keys 40 | 41 | Once Slurm access is granted, also switch to SSH-key based authentication as described [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#login-and-access). Password access is disabled after a few days. 42 | 43 | ### Download the tutorial code 44 | 45 | Log in to the Slurm login node using 46 | 47 | ```` 48 | ssh <username>@134.2.168.52 49 | ```` 50 | Switch to your work directory using 51 | ```` 52 | cd $WORK 53 | ```` 54 | Download the contents (including code) of this tutorial into your current working directory using the following line: 55 | ```` 56 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 57 | ```` 58 | 59 | Change to the code directory 60 | 61 | ```` 62 | cd tue-slurm-helloworld 63 | ```` 64 | 65 | The code for `multiply.py`, which we want to execute, is now on Slurm. You can look at it using 66 | ```` 67 | cat src/multiply.py 68 | ```` 69 | or your favourite command line editor. 70 | 71 | ### Setting up a Python environment 72 | 73 | This code depends on a number of specific Python packages that are not installed on the Slurm nodes globally. We also cannot install them ourselves because we lack permissions. Instead we will create a virtual environment using [Conda](https://docs.conda.io/en/latest/). 74 | 75 | Create an environment called `myenv` using the following command 76 | 77 | 78 | ```` 79 | conda create -n myenv 80 | ```` 81 | 82 | and confirm with `y`. 83 | 84 | Activate the new environment using 85 | 86 | ```` 87 | conda activate myenv 88 | ```` 89 | 90 | There will already be a default Python, but we can install a specific Python version (e.g. 3.9.7) using 91 | ```` 92 | conda install python==3.9.7 93 | ```` 94 | 95 | We can install packages using `conda install` or `pip install`. I had a smoother experience with `pip`: 96 | 97 | ```` 98 | pip install numpy torch 99 | ```` 100 | 101 | ### Running the code on GPU 102 | 103 | Code can be deployed to GPUs using the Slurm `sbatch` command and a "deployment" script that specifies which resources we request and which code should be executed. 104 | 105 | Such a deployment script for Conda is provided in [src/deploy-with-conda.sh](src/deploy-with-conda.sh). 
The top part consists of our requested resources: 106 | 107 | ````bash 108 | # Part of deploy-with-conda.sh 109 | #SBATCH --ntasks=1 # Number of tasks (see below) 110 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 111 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 112 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 113 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 114 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 115 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 116 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 117 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 118 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 119 | #SBATCH --mail-user= # Email to which notifications will be sent 120 | ```` 121 | 122 | For example, we are requesting 1 GPU from the `gpu-2080ti-dev` partition. See [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#partitions) for a list of all available partitions and [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#submitting-batch-jobs) for an explanation of the available options. Note: the lines starting with `#SBATCH` are *not* comments. 123 | 124 | The bottom part of the `deploy-with-conda.sh` script consists of bash instructions that will be executed on the assigned work node. In particular, it contains the following three lines, which activate the environment, run the code, and deactivate the environment again. 125 | 126 | ````bash 127 | # Part of deploy-with-conda.sh 128 | conda activate myenv 129 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 130 | conda deactivate 131 | ```` 132 | 133 | The Python code depends on a file mimicking a dependency on a training dataset. It contains a single integer specifying the size of the matrices to be multiplied by `multiply.py`. The file is located at `/mnt/qb/baumgartner/storagetest.txt` and its permissions are set such that everybody can read it, even if you are not a member of the MLMIA group. 134 | 135 | You can submit this job using the following command 136 | ```` 137 | sbatch src/deploy-with-conda.sh 138 | ```` 139 | 140 | ### Checking the job and looking at the results 141 | 142 | After submitting the job, you can check its progress using 143 | ```` 144 | squeue --me 145 | ```` 146 | 147 | After it has finished, you can look at the results in the log files in the `logs` directory, for example using `cat`: 148 | 149 | ```` 150 | ls logs 151 | cat logs/job_<job-id>.out 152 | ```` 153 | 154 | The file ending in `.out` contains the STDOUT, i.e. all print statements and similar output. The file ending in `.err` contains the STDERR, i.e. any error messages that were generated (hopefully none). 155 | 156 | ## Relevant links 157 | 158 | * [Slurm documentation](https://slurm.schedmd.com/) 159 | * [ML Cloud Wiki](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/home) 160 | * [ML Cloud Wiki section on Slurm](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/SLURM) 161 | * [Internal MLMIA Wiki on Computation](https://wiki.mlmia-unitue.de/it-information:start) (only available to MLMIA members) 162 | 163 | ## What now? 164 | 165 | This concludes the minimal example. If you want to learn more, you can have a look at the more detailed instructions described in the [Contents section](#contents) above. 
166 | -------------------------------------------------------------------------------- /instructions/initial-setup.md: -------------------------------------------------------------------------------- 1 | # Initial setup 2 | 3 | The following steps describe some essential SSH and mounting setups. 4 | 5 | ## Create an SSH config 6 | 7 | In order to prevent you from going crazy by having to type the IP address each time, you can set up a shortcut in your SSH config file. 8 | 9 | On your **local machine**, create a file `$HOME/.ssh/config` with the following content: 10 | 11 | ```` 12 | Host slurm 13 | HostName 134.2.168.52 14 | User <username> 15 | ForwardAgent yes 16 | ```` 17 | 18 | Then, on your local machine, run 19 | 20 | ```` 21 | ssh-add 22 | ```` 23 | to enable the forwarding of your identity to different Slurm nodes. This will allow you to ssh into the node on which your job is running. You may need to periodically rerun the `ssh-add` command in the future (if `ssh-add -L` returns nothing, then you need to run `ssh-add` again). 24 | 25 | You can now `ssh` to the Slurm login node using the following command 26 | ```` 27 | ssh slurm 28 | ```` 29 | 30 | and copy files from your local workstation to Slurm using the following syntax 31 | ```` 32 | scp <local-file> slurm:<remote-path> 33 | ```` 34 | 35 | ## Locally mount shared work directory 36 | 37 | Locally mounting the Slurm work directories will facilitate editing your code and moving around data. It could for example make sense to mount your data directory for moving files, or your `$WORK` directory so you can edit code on your local workstation using your favourite editor. 38 | 39 | In the following we will mount the remote shared folder `/mnt/qb/work/baumgartner/<username>` to the same location on your local system. It could also be a different location on your local system, but if we keep the path exactly the same, we do not need to change the paths in the code each time. 40 | 41 | All the following steps need to be executed on your local machine. 42 | 43 | In case you do not have `sshfs` installed, you can install it using (assuming you are on Ubuntu) 44 | ```` 45 | sudo apt-get install sshfs 46 | ```` 47 | This is a program that allows us to mount a remote folder on our local machine using the `ssh` protocol. 48 | 49 | Next, on your local machine create the mount point where you want to mount the remote folder. Obviously, first change the term in the brackets to your own username (use the command `whoami` if you don't know it) 50 | ```` 51 | sudo mkdir -p /mnt/qb/work/baumgartner/<username> 52 | ```` 53 | The `-p` is required because we are creating a whole hierarchy, not just a single folder. `sudo` is required because we are creating the folders in the root directory `/`, where the user does not have write access. 54 | 55 | Next we mount the remote folder using the following command: 56 | 57 | ```` 58 | sudo sshfs -o allow_other,IdentityFile=/home/$USER/.ssh/id_rsa <username>@134.2.168.52:/mnt/qb/work/baumgartner/<username> /mnt/qb/work/baumgartner/<username> 59 | ```` 60 | This is assuming you belong to the `baumgartner` group. Adjust accordingly if you belong to a different group. 61 | 62 | Note: The folder only stays mounted as long as your internet connection doesn't drop. So, if you for instance reboot your machine, you need to re-execute the `sshfs` command. To automatically mount the folder upon rebooting you need to edit your `/etc/fstab` file, which is beyond the scope of this tutorial. 
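For the curious, a minimal sketch of what such an `/etc/fstab` entry could look like is given below. This is only an illustration and not part of the tested tutorial: the `<username>` placeholders and the exact mount options are assumptions, so consult `man sshfs` and `man fstab` before relying on it.

````
# Illustrative /etc/fstab sketch (untested here; adjust username, group folder and key path)
<username>@134.2.168.52:/mnt/qb/work/baumgartner/<username> /mnt/qb/work/baumgartner/<username> fuse.sshfs allow_other,IdentityFile=/home/<username>/.ssh/id_rsa,_netdev,noauto,x-systemd.automount 0 0
````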
63 | 64 | ### Troubleshooting 65 | 66 | You might have to map your local username's user_id and group_id to the mount point by adding the options `` -o uid=`id -u username` -o gid=`id -g username` `` to the `sshfs` command. This is not always necessary, but it may help if you do not end up with your remote user's privileges on the mounted files. 67 | 68 | ## Workflows for editing code remotely 69 | 70 | If you mounted your code directory as described above, you can now open that folder on your local machine using your favourite editor. 71 | 72 | Drawbacks: 73 | * Some people report laggy behaviour due to the slowness of the SSH connection. 74 | * With this setup your code lives *only* on the ML Cloud, so if it goes down you don't have access to your code. 75 | 76 | Instead of mounting the drive, some editors have functionality or plugins specifically for editing code remotely via SSH. For example, Visual Studio Code has the "Remote - SSH" extension, which allows you to do exactly that. 77 | 78 | An alternative workflow is to develop locally and commit+push your changes to a git repository. Before running your code you can `git pull` on the Slurm node. As a side effect this will require you to commit your changes frequently, which is generally a good thing. 79 | 80 | ## What now? 81 | 82 | Go back to the [Contents section](/README.md#contents) or continue with a more detailed description of the [virtual environment](/instructions/virtual-env-workflow.md) or [Singularity](/instructions/singularity-workflow.md) based workflow. -------------------------------------------------------------------------------- /instructions/interactive-jobs.md: -------------------------------------------------------------------------------- 1 | # Running interactive jobs and debugging 2 | 3 | It is also possible to open interactive sessions, which can be very useful for debugging. However, **do not** use this as your main way of developing. Interactive sessions don't automatically exit when your job is finished and so will block resources unnecessarily. 4 | 5 | ## Starting an interactive session on a work node 6 | 7 | When we ssh onto the Slurm login node, we do not have access to any of the computational resources such as GPUs. For that, we usually deploy a job using `sbatch`, which will then be executed on one of the work nodes. 8 | 9 | It is also possible to enter one of the work nodes directly by using the following `srun` command: 10 | 11 | ```` 12 | srun --pty bash 13 | ```` 14 | 15 | This will start an interactive job, log in to an available work node and run the bash shell. However, we will still not have access to GPUs because we didn't request any. 16 | 17 | Requesting resources works exactly the same as in the deployment scripts, but we provide the options as command line arguments (which incidentally you can also do for `sbatch` if you want to). For example: 18 | 19 | ```` 20 | srun --pty --partition=gpu-2080ti-dev --time=0-00:30 --gres=gpu:1 bash 21 | ```` 22 | 23 | This will start an interactive session with a 2080ti GPU for 30 minutes. A description of the meaning of all command line arguments can be found in [this section](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#submitting-batch-jobs) of the ML Cloud Slurm wiki. 
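If your interactive session needs more than the defaults, the same resource options used in the deployment scripts can be passed to `srun` as well. The combination below is only an illustration (partition name and sizes are examples, adjust them to what you actually need):

````
srun --pty --partition=gpu-2080ti-dev --time=0-01:00 --gres=gpu:1 --cpus-per-task=4 --mem=16G bash
````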
24 | 25 | You can now navigate to the source code if you are not already there 26 | ```` 27 | cd $WORK/tue-slurm-helloworld 28 | ```` 29 | 30 | Assuming you are using Conda, activate your environment using 31 | ```` 32 | conda activate myenv 33 | ```` 34 | Otherwise use the appropriate virtualenv or Singularity equivalent. 35 | 36 | 37 | Then you can run Python as you would on your local system 38 | ```` 39 | python src/multiply.py --timer_repetitions 10000 --use-gpu 40 | ```` 41 | 42 | ## Debugging 43 | 44 | This part will require a Python package called `ipdb`. If you followed the minimal example originally, it may not yet be installed. You can do so using 45 | 46 | ```` 47 | pip install ipdb 48 | ```` 49 | 50 | There is a version of the `multiply.py` script, where I added a `ipdb` breakpoint around line 22: [src/multiply-debug.py](/src/multiply-debug.py). 51 | 52 | We can also run that file instead: 53 | ```` 54 | python src/multiply-debug.py --timer_repetitions 10000 --use-gpu 55 | ```` 56 | 57 | This will run the code until line 24 and then open an interactive debug terminal which is similar to the `ipython` console and allows you to execute any Python commands that you want. In there you can for example check the value and shape of the variables that have already been defined 58 | ````python 59 | print(x) 60 | print(x.shape) 61 | ```` 62 | 63 | You can also step through the code line-by-line by typing "n", step into functions using "s", or continue executing the rest of the code using "c". 64 | 65 | ## Starting other interactive jobs 66 | 67 | Starting an interactive session with a `bash` terminal is a very sensible way of doing things. However, you can technically run whatever you want using `srun`. Here is an example of directly running the code using a singularity container if you have been following that workflow: 68 | 69 | ```` 70 | srun --pty --partition=gpu-2080ti-dev --time=0-00:30 --gres=gpu:1 \ 71 | singularity exec \ 72 | --nv \ 73 | --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif \ 74 | python3 src/multiply-debug.py \ 75 | --timer_repetitions 10000 \ 76 | --use-gpu 77 | ```` 78 | 79 | ## SSH-ing into the worknode on which a job is running 80 | 81 | When you run `squeue --me` you can check on which nodes your jobs are running. When I, for example deploy the `deploy-with-conda-multiple.sh` script, my `squeue --me` output is the following: 82 | 83 | ```` 84 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 85 | 1121515_0 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 86 | 1121515_1 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 87 | 1121515_2 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 88 | 1121515_3 gpu-2080t deploy-w cbaumgar R 0:03 1 slurm-bm-64 89 | 1121515_4 gpu-2080t deploy-w cbaumgar R 0:03 1 slurm-bm-63 90 | ```` 91 | 92 | This tells me for example, that job `1121515_4` is running on work node `slurm-bm-63`. I can now simply ssh into this work node using 93 | 94 | ```` 95 | ssh slurm-bm-63 96 | ```` 97 | 98 | A requirement for this to work is that all the identity forwarding stuff in the [initial setup](/instructions/initial-setup.md) section of this tutorial were completed correctly. If it is not working for some reason try running `ssh-add` on your local machine. 99 | 100 | Note also that you can only ssh onto nodes, on which you have jobs running and that you will only have access only to resources requested by that job (not all of the resources of that work node). 
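If you want to double-check what a running job was actually allocated, `scontrol` (which the deploy scripts already use to print job info) can show this. A minimal sketch, noting that the exact field names can vary between Slurm versions:

````
scontrol show job <jobid>   # look for the NumCPUs, mem and TRES/gres fields
````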
101 | 102 | Once we ssh-ed to the work node (in this case node `slurm-bm-63`) we can for example check what the GPU is up to using 103 | ```` 104 | nvidia-smi -l 105 | ```` 106 | 107 | or we can see what is stored on the local temporary storage: 108 | 109 | ```` 110 | ls /scratch_local 111 | ```` 112 | 113 | ## What now? 114 | 115 | Go back to the [Contents section](/README.md#contents) or learn about some [useful Slurm commands](/instructions/useful-commands.md). 116 | -------------------------------------------------------------------------------- /instructions/parallel-jobs.md: -------------------------------------------------------------------------------- 1 | # Running multiple jobs in parallel 2 | 3 | You can of course use `sbatch` to start multiple jobs simultaneously by running it multiple times, for example, with different parameters. 4 | 5 | However, Slurm also offers a convenient possibility to run multiple jobs for an array of parameters which we will explore in this tutorial. 6 | 7 | ## Running multiple jobs sequentially 8 | 9 | Our `multiply.py` script has a `--random-seed` option to start the random matrix generation with different random seeds. Say we want to run the code with 5 different random seeds, a common problem when evaluating machine learning problems. 10 | 11 | Note, that the bottom half of the deploy scripts (e.g. [src/deploy-with-conda.sh](/src/deploy-with-conda.sh), are just regular bash. So there is nothing keeping us from using a bash for loop in our `sbatch` deployment script. 12 | 13 | For example, in [src/deploy-with-conda-multiple.sh](/src/deploy-with-conda-multiple.sh) we could replace 14 | ````bash 15 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 16 | ```` 17 | with 18 | ````bash 19 | for seed in 0 1 2 3 4 20 | do 21 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed $seed 22 | done 23 | ```` 24 | 25 | With this replacement `deploy-with-conda.sh` would start a single Slurm job that would run `multiply.py` 5 times, one after the other, each time with a different random seed. 26 | 27 | However, it would be neat if we could actually run the 5 jobs in parallel on different GPUs. 28 | 29 | ## Running multiple jobs in parallel 30 | 31 | Slurm provides a nice functionality, to explore different parameter ranges like the different random seeds in the example above. Specifically, different parameter ranges can be explored using the `--array` option. 32 | 33 | An example, is given in [src/deploy-with-conda-multiple.sh](/src/deploy-with-conda-multiple.sh). The changes with respect to `deploy-with-conda.sh` are adding the following line to the sbatch arguments: 34 | 35 | ````bash 36 | #SBATCH --array=0,1,2,3,4 37 | ```` 38 | 39 | and replacing the python call by the following 40 | ````bash 41 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed ${SLURM_ARRAY_TASK_ID} 42 | ```` 43 | 44 | This will start 5 separate jobs each with the different values from the specified array which will be contained in the automatically generated bash variable `${SLURM_ARRAY_TASK_ID}`. 45 | 46 | The example script can be deployed using 47 | 48 | ````bash 49 | sbatch src/deploy-with-conda-multiple.sh 50 | ```` 51 | 52 | You can check with `squeue --me` that 5 jobs have in fact been started. 53 | 54 | Learn more about the `--array` option in [this excellent section](https://slurm.schedmd.com/job_array.html) of the Slurm documentation. 55 | 56 | ## What now? 
57 | 58 | Go back to the [Contents section](/README.md#contents) or learn about [interactive jobs and debugging](/instructions/interactive-jobs.md). -------------------------------------------------------------------------------- /instructions/singularity-workflow.md: -------------------------------------------------------------------------------- 1 | # Using Slurm with Singularity 2 | 3 | This section describes an example workflow using Singularity, including instructions on how to build a Singularity container and deploy it on the ML Cloud Slurm host. 4 | 5 | In contrast to the [virtual environment workflow](/instructions/virtual-env-workflow.md) (i.e. Conda or virtualenv), the Singularity workflow offers more flexibility for setting up your environment, but it is also arguably a bit more involved and cumbersome to use. 6 | 7 | ## What is Singularity in a nutshell? 8 | 9 | [Singularity](https://en.wikipedia.org/wiki/Singularity_(software)) is a type of container similar to the more widely used [Docker](https://www.docker.com/resources/what-container). 10 | 11 | It allows you to package everything you need to run your code into a file. It is almost like a virtual machine, in which you can log in, install whatever you want and store whatever data you want. In contrast to an actual virtual machine, no part of your hardware is allocated specifically to run a Singularity container. Rather, it just takes whatever resources it needs, like any other job on your system. 12 | 13 | The cool thing about containers is that you can just give that file to anyone else and they can exactly reproduce your experiment setup without having to do any annoying environment setup like finding the right CUDA or TensorFlow versions. Rather than having instructions with your code on how to get it to run, you could technically just share the Singularity file. In practice, however, these files tend to be very large, so they are not typically shared widely. They are nevertheless very useful for packaging and sending jobs that will run exactly the way you want them to on any system. 14 | 15 | As you will see, the "pure" idea would be to package your code inside the container. However, this is pretty unwieldy for daily use with the ML Cloud, so we will also discuss a sort of hybrid setup where the container only contains the dependencies, but the code lives outside of the container. 16 | 17 | If containers are new to you, make sure to read about [containers in general](https://www.docker.com/resources/what-container) and [Singularity containers](https://en.wikipedia.org/wiki/Singularity_(software)) in particular. 18 | 19 | Singularity (like Conda and virtualenv in the previous tutorial) is already pre-installed on Slurm. 20 | 21 | ## Download code and build the Singularity container 22 | 23 | Make sure you are on the Slurm login node, and download the code if you haven't done so yet in the minimal example or the Conda workflow example: 24 | 25 | ```` 26 | cd $WORK 27 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 28 | cd tue-slurm-helloworld 29 | ```` 30 | 31 | As the first step, build the Singularity container 32 | 33 | ```` 34 | singularity build --fakeroot deeplearning.sif src/deeplearning.def 35 | ```` 36 | 37 | Explanations: 38 | - `deeplearning.def` contains instructions on how to build the container. It's a text file; have a look at it if you want. 
39 | - Importantly, this file contains the specification of the "operating system", a Ubuntu 20.04 with cuda support in this case, and instructions to install Python3 with all required packages for this example. 40 | 41 | 42 | 43 | ## Entering the new container in a shell 44 | 45 | You can now enter a shell in your container using the following command and have a look around. 46 | 47 | ```` 48 | singularity shell --bind /mnt/qb/baumgartner deeplearning.sif 49 | ```` 50 | 51 | The bind option is not necessary to enter the container. It will mount the `baumgartner` folder, so you have access to it from inside the container also. This is important for the `multiply.py` script to execute successfully. 52 | 53 | You are now running an encapsulated operating system as specified in `deeplearning.def`. 54 | Try opening an `ipython` terminal. 55 | Note that your home directory and current working directory automatically get mounted inside the container. This is not true for all paths, however. 56 | 57 | The following lines in the `deeplearning.def` file 58 | ```` 59 | %files 60 | multiply.py /opt/code 61 | ```` 62 | copies your code to `/opt/code` in the container. We copied it there so it would be baked into the container, so when you move the container elsewhere the code is still around. 63 | 64 | The code is also available in your current working directory. So technically it is there twice. However, when you later move your container to another location (e.g. to the Slurm host), the working directory with the code (which only mounted) will no longer be the same. The copied code in `/opt/code`, on the other hand, stays with the container. 65 | 66 | You can navigate to either copy of the code and execute it using 67 | 68 | ```` 69 | python3 multiply.py 70 | ```` 71 | 72 | Note that the code assumes that the following file exists on Slurm: `/mnt/qb/baumgartner/storagetest.txt`. I have put it there, so if nothing has changed it should be there. It contains a single integer specifying the size of the matrices to be multiplied in `multiply.py`. The idea behind this is to simulate dependence on external data. In a real-world example, this could for example be a medical dataset. 73 | 74 | **An important thing to note:** When you change your code after building the container, it will be changed in 75 | your current working directory but not under `/opt/code`. 76 | 77 | You can exit the container using `exit` or `Ctrl+d`. 78 | 79 | ## Executing stuff in the container 80 | 81 | You can execute the Python code without having to enter the container using the `exec` option as follows: 82 | 83 | ```` 84 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif python3 /opt/code/multiply.py --timer_repetitions 10000 --no-gpu 85 | ```` 86 | 87 | Explanations: 88 | - The `--nv` option is required to enable GPU access 89 | - The `--bind` option mounts the `baumgartner` folder and the current working directory (which can be obtained in bash using `pwd`) inside the container. 90 | - Note that we are also passing some options to the Python script at the end. 91 | - The Slurm login nodes do not have GPUs so for now we use the `--no-gpu` option for our python script. This will change once we submit a job further down in this file. 92 | 93 | Further note that the above command references the code at `/opt/code`. This copy of the code is static. This means if you change something in your code, you need to rebuild the container for the changes to take effect. 
94 | 95 | However, as long as the container is still on your local machine, you can change `/opt/code/multiply.py` to the path on your local machine (i.e. perhaps something like `$HOME/tue-slurm-helloworld/multiply.py`). Because this copy of the code is mounted rather than copied into the container, any changes you make will take effect and you do not need to rebuild your container each time (only once you want to deploy it remotely). 96 | 97 | ## Running code outside of the Singularity container 98 | 99 | As alluded to in the beginning, it may be a bit cumbersome to rebuild your entire Singularity container every time you make a change to your code, especially because Singularity really rebuilds the whole thing each time even if only a few parts have changed (this is different in Docker). 100 | 101 | So rather than baking the code into the container, we can also run code that lives outside of the container somewhere on Slurm. We can for example run the identical file that lives in the `src` folder of this tutorial. 102 | 103 | ```` 104 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif python3 src/multiply.py --timer_repetitions 10000 --no-gpu 105 | ```` 106 | 107 | Not all directories are available from inside the container, but the following directories will be there: 108 | - The home directory of your Slurm login node 109 | - All the paths you add using the `--bind` option in the command above. 110 | 111 | ## Running the code on Slurm 112 | 113 | The `deploy-with-singularity.sh` file contains instructions for your Slurm job. Those are a bunch of options specifying which GPUs to use etc., and which code to execute, in this case `multiply.py`. 114 | 115 | We can submit the Slurm job using the following command 116 | ```` 117 | sbatch src/deploy-with-singularity.sh 118 | ```` 119 | 120 | This will run the job on Slurm as specified in `deploy-with-singularity.sh`. Note, by the way, that the leading `#SBATCH` does not mean that those lines are commented out. This is just Slurm syntax. 121 | 122 | Have a look at the outputs written to `logs/`. 123 | Of course this is a very basic example, and you should later tailor `deploy-with-singularity.sh` to your needs. 124 | 125 | More info on how to manage Slurm jobs can be found [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm). 126 | 127 | ## Advantages and disadvantages of baking code into the container 128 | 129 | The following are general remarks not specific to this tutorial. 130 | 131 | You have two options to run code on Slurm using a Singularity container: 132 | 1. Copy the code into the container locally and then move the whole container including code over to Slurm and run it there. For this you would need to install Singularity on your local machine as described below. 133 | 2. Do not copy your code into the container, but rather move it to your `$WORK` directory on the remote Slurm host and access it from there by binding the code directory from Slurm into the container using the `--bind` option. 134 | 135 | The advantage of (1) is that each container is (mostly) self-contained and completely reproducible. (For this to be completely true, the data would also have to be baked into the container, which we could also do.) If you give this container to someone else, they can run it and get exactly the same result. You could also save containers with important experiments for later, so you can reproduce them or check what exactly you did. 
In fact this is what many ML companies do often in conjunction with Kubernetes, but usually using Docker rather than Singularity which is a bit more flexible and faster to build. 136 | 137 | However, using method (1) you will need to rebuild and copy your container to Slurm every time you change something. This, unfortunately, takes a long time because Singularity (in contrast to docker) always rebuilds everything from scratch. So practically you will be able to develop much faster using method (2). This comes at the cost of the above mentioned reproducibility. 138 | 139 | Method (2) seems the preferable one for actual research and development. You can have your code permanently in your Slurm directory and edit it locally by mounting your Slurm home locally using `sshfs` like above. Another option is to use a SSH extension for your IDE such as the "Remote -SSH" extension for Visual Studio Code. 140 | 141 | ## Install Singularity on your local machine (optional) 142 | 143 | You may want to install Singularity also on your local machine so you can develop your environment locally before copying it over to Slurm . 144 | 145 | To install singularity on your local machine, follow the steps described [here](https://sylabs.io/guides/3.7/user-guide/quick_start.html). 146 | 147 | ## What now? 148 | 149 | Go back to the [Contents section](/README.md#contents), learn about doing [parallel parameter sweeps](/instructions/parallel-jobs.md) using `sbatch`, or [interactive jobs and debugging](/instructions/interactive-jobs.md). 150 | -------------------------------------------------------------------------------- /instructions/useful-commands.md: -------------------------------------------------------------------------------- 1 | # Useful commands 2 | 3 | Some common commands are explained in [this section](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands) of the ML Cloud Slurm Wiki. 4 | 5 | They are reproduced here with some useful options. 6 | 7 | ## Show jobs 8 | 9 | Your own jobs 10 | 11 | ```` 12 | squeue --me 13 | ```` 14 | 15 | Any users jobs 16 | 17 | ```` 18 | squeue -u 19 | ```` 20 | 21 | ## Cancel jobs 22 | 23 | Cancel specific job 24 | 25 | ```` 26 | scancel 27 | ```` 28 | 29 | Cancel all my jobs 30 | 31 | ```` 32 | scancel -u 33 | ```` 34 | 35 | ## Check fairshare 36 | 37 | Check on a group level 38 | 39 | ```` 40 | sshare 41 | ```` 42 | 43 | Check on an individual level 44 | 45 | ```` 46 | sshare --all 47 | ```` 48 | 49 | ## What now? 50 | 51 | Go back to the [Contents section](/README.md#contents) or give feed-back about the tutorial on [github](https://github.com/baumgach/tue-slurm-helloworld/issues)! -------------------------------------------------------------------------------- /instructions/virtual-env-workflow.md: -------------------------------------------------------------------------------- 1 | # Using Slurm with Conda or Virtualenv 2 | 3 | In this section, we will describe an example workflow for setting up a virtual python environment and then running the `multiply.py` code which relies on PyTorch and some other libraries. 4 | 5 | For many this may be a more accessible workflow compared to the [Singularity based approach](/instructions/singularity-workflow.md), which was originally recommended by the ML Cloud folks and is described in [the next section](/instructions/singularity-workflow.md). However, for some cases it may be lacking in flexibility, especially if you need special libraries or a custom operating system for your experiments. 
6 | 7 | In the following we will describe a [Conda](https://docs.conda.io/en/latest/)-based workflow in more detail and will then give a brief alternative workflow using [virtualenv](https://virtualenv.pypa.io/en/latest/). 8 | 9 | ## What is Conda in a Nutshell 10 | 11 | [Conda](https://docs.conda.io/en/latest/) is an environment manager for Python and other languages. It allows you to create and activate virtual environments in which you can install Python packages. The advantage is that the packages will not be installed globally on your machine, but only in this environment. This means you can use different versions of packages for different projects. Conda is already installed on the Slurm login nodes. 12 | 13 | One drawback with respect to the [Singularity workflow](/instructions/singularity-workflow.md) is that you can only influence the version of Python and certain packages, but not over things like the operating system or packages not available through Conda. For example, using the Singularity method you can install any package or software available on Linux in your Singularity instance (e.g. using `apt`). Usually, this isn't a problem, however. 14 | 15 | ## Setting up a Conda environment 16 | 17 | Make sure you are on the Slurm login node, and download the code if you haven't done so yet in the minimal example: 18 | 19 | ```` 20 | cd $WORK 21 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 22 | cd tue-slurm-helloworld 23 | ```` 24 | 25 | Make a Conda environment called `myenv` using 26 | 27 | ```` 28 | conda create -n myenv 29 | ```` 30 | 31 | and confirm with `y`. 32 | 33 | Activate the new environment using 34 | 35 | ```` 36 | conda activate myenv 37 | ```` 38 | 39 | Let's see which Python version we have by default. 40 | 41 | ```` 42 | python --version 43 | ```` 44 | 45 | That is the system-wide Python version. At the time of writing the system wide Python defaults to `python2.7`, although `python3` is also installed with version `3.6.8`. Using Conda we can instead install whichever specific Python version we like. For example 46 | 47 | ```` 48 | conda install python==3.9.7 49 | ```` 50 | 51 | `python` now defaults to the specified version, which you can check again with `python --version`. 52 | 53 | We can install packages using `conda install` or `pip install`. `conda install` works for some stuff `pip install` doesn't and vice versa. Writing this tutorial I had a smoother experience with `pip`. 54 | 55 | So in order to install our dependencies we can use: 56 | 57 | ```` 58 | pip install numpy torch ipdb 59 | ```` 60 | 61 | ## Other conda commands 62 | 63 | There are many other useful Conda commands for managing your packages and environments such as 64 | * `conda env list` for displaying all existing environments 65 | * `conda env remove --name myenv` for deleting an environment 66 | 67 | Have a look at this helpful [cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf). 68 | 69 | ## Running the code 70 | 71 | As before we can now run our matrix multiplication code directly on the login node without GPU. 72 | 73 | ```` 74 | cd $WORK/tue-slurm-helloworld 75 | python src/multiply.py --timer_repetitions 10000 --no-gpu 76 | ```` 77 | 78 | But of course we would rather submit this as a job with GPU usage. For this, I prepared another deploy script. 
79 | 80 | ```` 81 | sbatch src/deploy-with-conda.sh 82 | ```` 83 | 84 | The results will be written to the newly created `logs` directory which should have been created in your current working directory. You can list the files sorted by most recent and some useful extra information using 85 | 86 | ```` 87 | ls -ltrh logs 88 | ```` 89 | 90 | You can then look at the output log file and error file corresponding to your job using the `cat` or `less` commands. For example: 91 | 92 | ```` 93 | cat logs/job_.out 94 | ```` 95 | 96 | To deactivate the environment type 97 | 98 | ```` 99 | conda deactivate 100 | ```` 101 | 102 | ## Alternative workflow using virtualenv 103 | 104 | The following describes the equivalent of the above in virtualenv. Virtualenv is yet a bit simpler than Conda, but also a bit less flexible. For example, it does not (easily) allow control over the specific Python version you can use. 105 | 106 | Setting up the environment 107 | 108 | ```` 109 | cd $HOME # go to home directory 110 | virtualenv -p python3 env 111 | source env/bin/activate 112 | pip install --upgrade pip 113 | pip install numpy matplotlib ipython torch torchvision ipdb 114 | ```` 115 | 116 | Deactivating the environment 117 | ```` 118 | deactivate 119 | ```` 120 | 121 | Running the code 122 | 123 | ```` 124 | cd $WORK 125 | sbatch src/deploy-with-virtualenv.sh 126 | ```` 127 | 128 | ## What now? 129 | 130 | Go back to the [Contents section](/README.md#contents), continue with the more flexible [Singularity](/instructions/singularity-workflow.md) based workflow, or learn about [interactive jobs and debugging](/instructions/interactive-jobs.md). 131 | -------------------------------------------------------------------------------- /logs/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /src/deeplearning.def: -------------------------------------------------------------------------------- 1 | BootStrap: docker 2 | From: nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 3 | 4 | %post 5 | # Downloads the latest package lists. 6 | apt-get update -y 7 | 8 | # Install Python3 requirements 9 | # --> python3-tk is required by matplotlib. 10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 11 | python3 \ 12 | python3-tk \ 13 | python3-pip \ 14 | python3-distutils \ 15 | python3-setuptools 16 | 17 | # Reduce the size of the image by deleting the package lists we downloaded, 18 | # which are useless now. 19 | apt-get -y clean 20 | rm -rf /var/lib/apt/lists/* 21 | 22 | # Install Python modules. 
23 | pip3 install numpy ipdb torch 24 | 25 | # Make code directory 26 | mkdir -p /opt/code 27 | 28 | %files 29 | # Copy code from home directory to singularity root 30 | src/multiply.py /opt/code/ 31 | 32 | -------------------------------------------------------------------------------- /src/deploy-with-conda-multiple.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | #SBATCH --array=0,1,2,3,4 # array of cityscapes random seeds 14 | 15 | # print info about current job 16 | echo "---------- JOB INFOS ------------" 17 | scontrol show job $SLURM_JOB_ID 18 | echo -e "---------------------------------\n" 19 | 20 | # Due to a potential bug, we need to manually load our bash configurations first 21 | source $HOME/.bashrc 22 | 23 | # Next activate the conda environment 24 | conda activate myenv 25 | 26 | # Run code with values specified in task array 27 | echo "-------- PYTHON OUTPUT ----------" 28 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed ${SLURM_ARRAY_TASK_ID} 29 | echo "---------------------------------" 30 | 31 | # Deactivate environment again 32 | conda deactivate 33 | -------------------------------------------------------------------------------- /src/deploy-with-conda.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Due to a potential bug, we need to manually load our bash configurations first 20 | source $HOME/.bashrc 21 | 22 | # Next activate the conda environment 23 | conda activate myenv 24 | 25 | # Run our code 26 | echo "-------- PYTHON OUTPUT ----------" 27 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 28 | echo "---------------------------------" 29 | 30 | # Deactivate environment again 31 | conda deactivate 32 | -------------------------------------------------------------------------------- 
/src/deploy-with-singularity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Run singularity command 20 | echo "-------- PYTHON OUTPUT ----------" 21 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif \ 22 | python3 /opt/code/multiply.py \ 23 | --timer_repetitions 10000 \ 24 | --use-gpu 25 | echo "---------------------------------" -------------------------------------------------------------------------------- /src/deploy-with-virtualenv.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Activate virtualenv 20 | source $HOME/env/bin/activate 21 | 22 | # Run code 23 | echo "-------- PYTHON OUTPUT ----------" 24 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 25 | echo "---------------------------------" 26 | 27 | # Deactivate virtualenv again 28 | deactivate 29 | -------------------------------------------------------------------------------- /src/multiply-debug.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from timeit import timeit 3 | import argparse 4 | 5 | def time_multiplication( 6 | timer_repetitions, 7 | use_gpu=True, 8 | data_path="/mnt/qb/baumgartner/storagetest.txt", 9 | random_seed=0 10 | ): 11 | 12 | # Simulate external data access by reading the matrix size from external file 13 | with open(data_path) as f: 14 | s = int(f.read()) 15 | 16 | # Set random seed 17 | torch.manual_seed(random_seed) 18 | 19 | # Create tensors 20 | x = torch.randn(32, s) 21 | y = torch.randn(s, 32) 22 | 23 | import ipdb 24 | ipdb.set_trace() 25 | 26 | # Check if Cuda available and move tensor to Cuda if yes 
27 | cuda_available = torch.cuda.is_available() 28 | print(f"Cuda_available={cuda_available}") 29 | if cuda_available and use_gpu: 30 | device = torch.cuda.current_device() 31 | print(f"Current Cuda device: {device}") 32 | x = x.to(device) 33 | y = y.to(device) 34 | 35 | # Multiply matrix first once for result and then multiple times for measuring elapsed time 36 | mult = torch.matmul(x, y) 37 | elapsed_time = timeit(lambda: torch.matmul(x, y), number=timer_repetitions) 38 | 39 | # Print some results: 40 | print("result:") 41 | print(mult) 42 | print("Output shape", mult.shape) 43 | print(f"elapsed time: {elapsed_time}") 44 | 45 | if __name__ == "__main__": 46 | 47 | parser = argparse.ArgumentParser( 48 | description="Tescript for Singularity and SLURM on GPU. It times the multiplication of two matrices, repeated `timer_repetition` times." 49 | ) 50 | parser.add_argument( 51 | "--timer_repetitions", dest="timer_repetitions", action="store", default=1000, type=int, help="How many times to repeat the multiplication.", 52 | ) 53 | parser.add_argument('--use-gpu', dest='use_gpu', action='store_true') 54 | parser.add_argument('--no-gpu', dest='use_gpu', action='store_false') 55 | parser.set_defaults(use_gpu=True) 56 | parser.add_argument( 57 | "--random-seed", dest="random_seed", action="store", default=0, type=int, help="Random seed for sampling random matrices.", 58 | ) 59 | args = parser.parse_args() 60 | 61 | # Run main function with command line input, to test singularity arguments 62 | time_multiplication( 63 | timer_repetitions=args.timer_repetitions, 64 | use_gpu=args.use_gpu, 65 | random_seed=args.random_seed, 66 | ) 67 | 68 | print("done.") -------------------------------------------------------------------------------- /src/multiply.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from timeit import timeit 3 | import argparse 4 | 5 | def time_multiplication( 6 | timer_repetitions, 7 | use_gpu=True, 8 | data_path="/mnt/qb/baumgartner/storagetest.txt", 9 | random_seed=0 10 | ): 11 | 12 | # Simulate external data access by reading the matrix size from external file 13 | with open(data_path) as f: 14 | s = int(f.read()) 15 | 16 | # Set random seed 17 | torch.manual_seed(random_seed) 18 | 19 | # Create tensors 20 | x = torch.randn(32, s) 21 | y = torch.randn(s, 32) 22 | 23 | # Check if Cuda available and move tensor to Cuda if yes 24 | cuda_available = torch.cuda.is_available() 25 | print(f"Cuda_available={cuda_available}") 26 | if cuda_available and use_gpu: 27 | device = torch.cuda.current_device() 28 | print(f"Current Cuda device: {device}") 29 | x = x.to(device) 30 | y = y.to(device) 31 | 32 | # Multiply matrix first once for result and then multiple times for measuring elapsed time 33 | mult = torch.matmul(x, y) 34 | elapsed_time = timeit(lambda: torch.matmul(x, y), number=timer_repetitions) 35 | 36 | # Print some results: 37 | print("result:") 38 | print(mult) 39 | print("Output shape", mult.shape) 40 | print(f"elapsed time: {elapsed_time}") 41 | 42 | if __name__ == "__main__": 43 | 44 | parser = argparse.ArgumentParser( 45 | description="Tescript for Singularity and SLURM on GPU. It times the multiplication of two matrices, repeated `timer_repetition` times." 
46 | ) 47 | parser.add_argument( 48 | "--timer_repetitions", dest="timer_repetitions", action="store", default=1000, type=int, help="How many times to repeat the multiplication.", 49 | ) 50 | parser.add_argument('--use-gpu', dest='use_gpu', action='store_true') 51 | parser.add_argument('--no-gpu', dest='use_gpu', action='store_false') 52 | parser.set_defaults(use_gpu=True) 53 | parser.add_argument( 54 | "--random-seed", dest="random_seed", action="store", default=0, type=int, help="Random seed for sampling random matrices.", 55 | ) 56 | args = parser.parse_args() 57 | 58 | # Run main function with command line input, to test singularity arguments 59 | time_multiplication( 60 | timer_repetitions=args.timer_repetitions, 61 | use_gpu=args.use_gpu, 62 | random_seed=args.random_seed, 63 | ) 64 | 65 | print("done.") --------------------------------------------------------------------------------