├── .gitignore ├── README.md ├── instructions ├── initial-setup.md ├── interactive-jobs.md ├── parallel-jobs.md ├── singularity-workflow.md ├── useful-commands.md └── virtual-env-workflow.md ├── logs └── .gitignore └── src ├── deeplearning.def ├── deploy-with-conda-multiple.sh ├── deploy-with-conda.sh ├── deploy-with-singularity.sh ├── deploy-with-virtualenv.sh ├── multiply-debug.py └── multiply.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.sif 2 | .vscode 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Tübingen ML Cloud Hello World Example(s) 2 | 3 | The goal of this repository is to give an easy entry point to people who want to use the [Tübingen ML Cloud](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/home), and in particular [Slurm](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands), to run Python code on GPUs. 4 | 5 | Specifically, this tutorial contains a small Python code sample for multiplying two matrices using PyTorch and instructions for setting up the required environments and executing this code on CPU or GPU on the ML Cloud. 6 | 7 | The tutorial consists of a minimal example for getting started quickly, and a number of slightly more detailed instructions explaining different workflows and aspects of the ML Cloud. 8 | 9 | **This is not an official introduction**, but rather a tutorial put together by the members of the [Machine Learning in Medical Image Analysis (MLMIA) lab](https://www.mlmia-unitue.de). The official ML Cloud Slurm wiki can be found [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands). If you do spot mistakes or have suggestions, comments and pull requests are very welcome. 10 | 11 | ## Contents 12 | 13 | Contents of this tutorial: 14 | * [Who is this for?](#who-is-this-for): Short description of target audience 15 | * [Minimal example](#minimal-example): The shortest path from getting an account to running the `multiply.py` code on GPU. 16 | 17 | In-depth instructions: 18 | * [Initial setup](/instructions/initial-setup.md): Some useful tricks for setting up your SSH connections and mounting Slurm volumes locally for working more efficiently. 19 | * [Virtual environment workflow](/instructions/virtual-env-workflow.md): A more in detailed explanation of the workflow based on virtual environments also used in the minimal example and an alternative workflow using `virtualenv`, instead of `conda`. 20 | * [Singularity workflow](/instructions/singularity-workflow.md): An alternative workflow using Singularity containers, which allows for more flexibility. 21 | * [Running parallel jobs](/instructions/parallel-jobs.md): A short introduction to executing a command for a list of parameters in parallel on multiple GPUs. 22 | * [Interactive jobs and debugging](/instructions/interactive-jobs.md): How to run interactive jobs and how to debug Python code on Slurm. 23 | * [Useful commands](/instructions/useful-commands.md): Some commonly used Slurm commands for shepherding jobs. 24 | 25 | ## Who is this for? 26 | 27 | This repository was compiled with members of the [MLMIA lab](https://www.mlmia-unitue.de) in mind to enable them to have a smooth start with the ML Cloud. 
28 | 29 | You will notice that in many places, paths refer to locations only accessible to the MLMIA group such as `/mnt/qb/baumgartner`. However, there should be equivalent paths for members of all groups. For example, if you are part of the Berens group there would be an equivalent `/mnt/qb/berens` folder to which you *will* have access. 30 | 31 | Generally, we believe that this introduction may also be useful for other ML Cloud users. 32 | 33 | The instructions were written with Linux users in mind. Most of the instructions will translate to Mac as well. If you use Windows, you may have to resort to [PuTTY](https://www.putty.org/), although newer versions of Windows also support `ssh` in the [Windows terminal](https://docs.microsoft.com/en-us/windows/terminal/tutorials/ssh). 34 | 35 | ## Minimal example 36 | 37 | So you have just obtained access to Slurm from the [ML Cloud Masters](mailto:mlcloudmaster@uni-tuebingen.de)? What now? 38 | 39 | ### Access via SSH keys 40 | 41 | Once Slurm access is granted, also switch to SSH-key based authentication as described [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#login-and-access). Password access is disabled after a few days. 42 | 43 | ### Download the tutorial code 44 | 45 | Log in to the Slurm login node using 46 | 47 | ```` 48 | ssh <username>@134.2.168.52 49 | ```` 50 | Switch to your work directory using 51 | ```` 52 | cd $WORK 53 | ```` 54 | Download the contents (including code) of this tutorial into your current working directory using the following line: 55 | ```` 56 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 57 | ```` 58 | 59 | Change to the code directory 60 | 61 | ```` 62 | cd tue-slurm-helloworld 63 | ```` 64 | 65 | The code for `multiply.py`, which we want to execute, is now on Slurm. You can look at it using 66 | ```` 67 | cat src/multiply.py 68 | ```` 69 | or your favourite command line editor. 70 | 71 | ### Setting up a Python environment 72 | 73 | This code depends on a number of specific Python packages that are not installed on the Slurm nodes globally. We also cannot install them ourselves because we lack permissions. Instead we will create a virtual environment using [Conda](https://docs.conda.io/en/latest/). 74 | 75 | Create an environment called `myenv` using the following command 76 | 77 | 78 | ```` 79 | conda create -n myenv 80 | ```` 81 | 82 | and confirm with `y`. 83 | 84 | Activate the new environment using 85 | 86 | ```` 87 | conda activate myenv 88 | ```` 89 | 90 | There will already be a default Python, but we can install a specific Python version (e.g. 3.9.7) using 91 | ```` 92 | conda install python==3.9.7 93 | ```` 94 | 95 | We can install packages using `conda install` or `pip install`. I had a smoother experience with `pip`: 96 | 97 | ```` 98 | pip install numpy torch 99 | ```` 100 | 101 | ### Running the code on GPU 102 | 103 | Code can be deployed to GPUs using the Slurm `sbatch` command and a "deployment" script that specifies which resources we request and which code should be executed. 104 | 105 | Such a deployment script for Conda is provided in [src/deploy-with-conda.sh](src/deploy-with-conda.sh). 
The top part consists of our requested resources: 106 | 107 | ````bash 108 | # Part of deploy-with-conda.sh 109 | #SBATCH --ntasks=1 # Number of tasks (see below) 110 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 111 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 112 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 113 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 114 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 115 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 116 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 117 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 118 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 119 | #SBATCH --mail-user= # Email to which notifications will be sent 120 | ```` 121 | 122 | For example, we are requesting 1 GPU from the `gpu-2080ti-dev` partition. See [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#partitions) for a list of all available partitions and [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#submitting-batch-jobs) for an explanation of the available options. Note: the lines starting with `#SBATCH` are *not* comments. 123 | 124 | The bottom part of the `deploy-with-conda.sh` script consists of bash instructions that will be executed on the assigned work node. In particular, it contains the following three lines, which activate the environment, run the code, and deactivate the environment again. 125 | 126 | ````bash 127 | # Part of deploy-with-conda.sh 128 | conda activate myenv 129 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 130 | conda deactivate 131 | ```` 132 | 133 | The Python code depends on a file mimicking a dependency on a training dataset. It contains a single integer specifying the size of the matrices to be multiplied by `multiply.py`. The file is located at `/mnt/qb/baumgartner/storagetest.txt` and its permissions are set such that everybody can read it, even if you are not a member of the MLMIA group. 134 | 135 | You can submit this job using the following command 136 | ```` 137 | sbatch src/deploy-with-conda.sh 138 | ```` 139 | 140 | ### Checking the job and looking at the results 141 | 142 | After submitting the job, you can check its progress using 143 | ```` 144 | squeue --me 145 | ```` 146 | 147 | After it has finished, you can look at the results in the log files in the `logs` directory, for example using `cat`: 148 | 149 | ```` 150 | ls logs 151 | cat logs/job_<job-id>.out 152 | ```` 153 | 154 | The file ending in `.out` contains the STDOUT, i.e. all print statements and similar output. The file ending in `.err` contains the STDERR, i.e. any error messages that were generated (hopefully none). 155 | 156 | ## Relevant links 157 | 158 | * [Slurm documentation](https://slurm.schedmd.com/) 159 | * [ML Cloud Wiki](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/home) 160 | * [ML Cloud Wiki section on Slurm](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/SLURM) 161 | * [Internal MLMIA Wiki on Computation](https://wiki.mlmia-unitue.de/it-information:start) (only available to MLMIA members) 162 | 163 | ## What now? 164 | 165 | This concludes the minimal example. If you want to learn more, you can have a look at the more detailed instructions described in the [Contents section](#contents) above. 
166 | -------------------------------------------------------------------------------- /instructions/initial-setup.md: -------------------------------------------------------------------------------- 1 | # Initial setup 2 | 3 | The following steps describe some essential SSH and mounting setups. 4 | 5 | ## Create an SSH config 6 | 7 | In order to prevent you from going crazy by having to type the IP address each time, you can set up a shortcut in your SSH config file. 8 | 9 | On your **local machine**, create a file `$HOME/.ssh/config` with the following content: 10 | 11 | ```` 12 | Host slurm 13 | HostName 134.2.168.52 14 | User <username> 15 | ForwardAgent yes 16 | ```` 17 | 18 | Then, on your local machine, run 19 | 20 | ```` 21 | ssh-add 22 | ```` 23 | to enable the forwarding of your identity to different Slurm nodes. This will allow you to ssh into the node on which your job is running. You may need to periodically rerun the `ssh-add` command in the future (if `ssh-add -L` returns nothing, then you need to run `ssh-add` again). 24 | 25 | You can now `ssh` to the Slurm login node using the following command 26 | ```` 27 | ssh slurm 28 | ```` 29 | 30 | and copy files from your local workstation to Slurm using the following syntax 31 | ```` 32 | scp <local-file> slurm:<remote-path> 33 | ```` 34 | 35 | ## Locally mount shared work directory 36 | 37 | Locally mounting the Slurm work directories will facilitate editing your code and moving around data. It could for example make sense to mount your data directory for moving files, or your `$WORK` directory so you can edit code on your local workstation using your favourite editor. 38 | 39 | In the following we will mount the remote shared folder `/mnt/qb/work/baumgartner/<username>` to the same location on your local system. It could also be a different location on your local system, but if we keep the path exactly the same, we do not need to change the paths in the code each time. 40 | 41 | All the following steps need to be executed on your local machine. 42 | 43 | In case you do not have `sshfs` installed, you can install it using (assuming you are on Ubuntu) 44 | ```` 45 | sudo apt-get install sshfs 46 | ```` 47 | This is a program that allows us to mount a remote folder on our local machine using the `ssh` protocol. 48 | 49 | Next, on your local machine create the mount point where you want to mount the remote folder. Obviously, first change the term in the brackets to your own username (use the command `whoami` if you don't know it) 50 | ```` 51 | sudo mkdir -p /mnt/qb/work/baumgartner/<username> 52 | ```` 53 | The `-p` is required because we are creating a whole hierarchy, not just a single folder. `sudo` is required because we are creating the folders in the root directory `/`, where the user does not have write access. 54 | 55 | Next we mount the remote folder using the following command: 56 | 57 | ```` 58 | sudo sshfs -o allow_other,IdentityFile=/home/$USER/.ssh/id_rsa <username>@134.2.168.52:/mnt/qb/work/baumgartner/<username> /mnt/qb/work/baumgartner/<username> 59 | ```` 60 | This is assuming you belong to the `baumgartner` group. Adjust accordingly if you belong to a different group. 61 | 62 | Note: The folder only stays mounted as long as your internet connection doesn't drop. So, if you for instance reboot your machine, you need to re-execute the `sshfs` command. To automatically mount the folder upon rebooting you need to edit your `/etc/fstab` file, which is beyond the scope of this tutorial. 
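For the curious, a minimal sketch of what such an `/etc/fstab` entry could look like is given below. This is only an illustration and not part of the tested tutorial: the `<username>` placeholders and the exact mount options are assumptions, so consult `man sshfs` and `man fstab` before relying on it.

````
# Illustrative /etc/fstab sketch (untested here; adjust username, group folder and key path)
<username>@134.2.168.52:/mnt/qb/work/baumgartner/<username> /mnt/qb/work/baumgartner/<username> fuse.sshfs allow_other,IdentityFile=/home/<username>/.ssh/id_rsa,_netdev,noauto,x-systemd.automount 0 0
````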
63 | 64 | ### Troubleshooting 65 | 66 | You might have to map your local username's user_id and group_id to the mount point by adding the options `` -o uid=`id -u username` -o gid=`id -g username` `` to the `sshfs` command. This is not always necessary, but it may help if you do not end up with your remote user's privileges on the mounted files. 67 | 68 | ## Workflows for editing code remotely 69 | 70 | If you mounted your code directory as described above, you can now open that folder on your local machine using your favourite editor. 71 | 72 | Drawbacks: 73 | * Some people report laggy behaviour due to the slowness of the SSH connection. 74 | * With this setup your code lives *only* on the ML Cloud, so if it goes down you don't have access to your code. 75 | 76 | Instead of mounting the drive, some editors have functionality or plugins specifically for editing code remotely via SSH. For example, Visual Studio Code has the "Remote - SSH" extension, which allows you to do exactly that. 77 | 78 | An alternative workflow is to develop locally and commit+push your changes to a git repository. Before running your code you can `git pull` on the Slurm node. As a side effect this will require you to commit your changes frequently, which is generally a good thing. 79 | 80 | ## What now? 81 | 82 | Go back to the [Contents section](/README.md#contents) or continue with a more detailed description of the [virtual environment](/instructions/virtual-env-workflow.md) or [Singularity](/instructions/singularity-workflow.md) based workflow. -------------------------------------------------------------------------------- /instructions/interactive-jobs.md: -------------------------------------------------------------------------------- 1 | # Running interactive jobs and debugging 2 | 3 | It is also possible to open interactive sessions, which can be very useful for debugging. However, **do not** use this as your main way of developing. Interactive sessions don't automatically exit when your job is finished and so will block resources unnecessarily. 4 | 5 | ## Starting an interactive session on a work node 6 | 7 | When we ssh onto the Slurm login node, we do not have access to any of the computational resources such as GPUs. For that, we usually deploy a job using `sbatch`, which will then be executed on one of the work nodes. 8 | 9 | It is also possible to enter one of the work nodes directly by using the following `srun` command: 10 | 11 | ```` 12 | srun --pty bash 13 | ```` 14 | 15 | This will start an interactive job, log in to an available work node and run the bash shell. However, we will still not have access to GPUs because we didn't request any. 16 | 17 | Requesting resources works exactly the same as in the deployment scripts, but we provide the options as command line arguments (which incidentally you can also do for `sbatch` if you want to). For example: 18 | 19 | ```` 20 | srun --pty --partition=gpu-2080ti-dev --time=0-00:30 --gres=gpu:1 bash 21 | ```` 22 | 23 | This will start an interactive session with a 2080ti GPU for 30 minutes. A description of the meaning of all command line arguments can be found in [this section](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#submitting-batch-jobs) of the ML Cloud Slurm wiki. 
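If your interactive session needs more than the defaults, the same resource options used in the deployment scripts can be passed to `srun` as well. The combination below is only an illustration (partition name and sizes are examples, adjust them to what you actually need):

````
srun --pty --partition=gpu-2080ti-dev --time=0-01:00 --gres=gpu:1 --cpus-per-task=4 --mem=16G bash
````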
24 | 25 | You can now navigate to the source code if you are not already there 26 | ```` 27 | cd $WORK/tue-slurm-helloworld 28 | ```` 29 | 30 | Assuming you are using Conda, activate your environment using 31 | ```` 32 | conda activate myenv 33 | ```` 34 | Otherwise use the appropriate virtualenv or Singularity equivalent. 35 | 36 | 37 | Then you can run Python as you would on your local system 38 | ```` 39 | python src/multiply.py --timer_repetitions 10000 --use-gpu 40 | ```` 41 | 42 | ## Debugging 43 | 44 | This part will require a Python package called `ipdb`. If you followed the minimal example originally, it may not yet be installed. You can do so using 45 | 46 | ```` 47 | pip install ipdb 48 | ```` 49 | 50 | There is a version of the `multiply.py` script, where I added a `ipdb` breakpoint around line 22: [src/multiply-debug.py](/src/multiply-debug.py). 51 | 52 | We can also run that file instead: 53 | ```` 54 | python src/multiply-debug.py --timer_repetitions 10000 --use-gpu 55 | ```` 56 | 57 | This will run the code until line 24 and then open an interactive debug terminal which is similar to the `ipython` console and allows you to execute any Python commands that you want. In there you can for example check the value and shape of the variables that have already been defined 58 | ````python 59 | print(x) 60 | print(x.shape) 61 | ```` 62 | 63 | You can also step through the code line-by-line by typing "n", step into functions using "s", or continue executing the rest of the code using "c". 64 | 65 | ## Starting other interactive jobs 66 | 67 | Starting an interactive session with a `bash` terminal is a very sensible way of doing things. However, you can technically run whatever you want using `srun`. Here is an example of directly running the code using a singularity container if you have been following that workflow: 68 | 69 | ```` 70 | srun --pty --partition=gpu-2080ti-dev --time=0-00:30 --gres=gpu:1 \ 71 | singularity exec \ 72 | --nv \ 73 | --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif \ 74 | python3 src/multiply-debug.py \ 75 | --timer_repetitions 10000 \ 76 | --use-gpu 77 | ```` 78 | 79 | ## SSH-ing into the worknode on which a job is running 80 | 81 | When you run `squeue --me` you can check on which nodes your jobs are running. When I, for example deploy the `deploy-with-conda-multiple.sh` script, my `squeue --me` output is the following: 82 | 83 | ```` 84 | JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 85 | 1121515_0 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 86 | 1121515_1 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 87 | 1121515_2 gpu-2080t deploy-w cbaumgar R 0:03 1 bg-slurmb-bm-2 88 | 1121515_3 gpu-2080t deploy-w cbaumgar R 0:03 1 slurm-bm-64 89 | 1121515_4 gpu-2080t deploy-w cbaumgar R 0:03 1 slurm-bm-63 90 | ```` 91 | 92 | This tells me for example, that job `1121515_4` is running on work node `slurm-bm-63`. I can now simply ssh into this work node using 93 | 94 | ```` 95 | ssh slurm-bm-63 96 | ```` 97 | 98 | A requirement for this to work is that all the identity forwarding stuff in the [initial setup](/instructions/initial-setup.md) section of this tutorial were completed correctly. If it is not working for some reason try running `ssh-add` on your local machine. 99 | 100 | Note also that you can only ssh onto nodes, on which you have jobs running and that you will only have access only to resources requested by that job (not all of the resources of that work node). 
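If you want to double-check what a running job was actually allocated, `scontrol` (which the deploy scripts already use to print job info) can show this. A minimal sketch, noting that the exact field names can vary between Slurm versions:

````
scontrol show job <jobid>   # look for the NumCPUs, mem and TRES/gres fields
````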
101 | 102 | Once we ssh-ed to the work node (in this case node `slurm-bm-63`) we can for example check what the GPU is up to using 103 | ```` 104 | nvidia-smi -l 105 | ```` 106 | 107 | or we can see what is stored on the local temporary storage: 108 | 109 | ```` 110 | ls /scratch_local 111 | ```` 112 | 113 | ## What now? 114 | 115 | Go back to the [Contents section](/README.md#contents) or learn about some [useful Slurm commands](/instructions/useful-commands.md). 116 | -------------------------------------------------------------------------------- /instructions/parallel-jobs.md: -------------------------------------------------------------------------------- 1 | # Running multiple jobs in parallel 2 | 3 | You can of course use `sbatch` to start multiple jobs simultaneously by running it multiple times, for example, with different parameters. 4 | 5 | However, Slurm also offers a convenient possibility to run multiple jobs for an array of parameters which we will explore in this tutorial. 6 | 7 | ## Running multiple jobs sequentially 8 | 9 | Our `multiply.py` script has a `--random-seed` option to start the random matrix generation with different random seeds. Say we want to run the code with 5 different random seeds, a common problem when evaluating machine learning problems. 10 | 11 | Note, that the bottom half of the deploy scripts (e.g. [src/deploy-with-conda.sh](/src/deploy-with-conda.sh), are just regular bash. So there is nothing keeping us from using a bash for loop in our `sbatch` deployment script. 12 | 13 | For example, in [src/deploy-with-conda-multiple.sh](/src/deploy-with-conda-multiple.sh) we could replace 14 | ````bash 15 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 16 | ```` 17 | with 18 | ````bash 19 | for seed in 0 1 2 3 4 20 | do 21 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed $seed 22 | done 23 | ```` 24 | 25 | With this replacement `deploy-with-conda.sh` would start a single Slurm job that would run `multiply.py` 5 times, one after the other, each time with a different random seed. 26 | 27 | However, it would be neat if we could actually run the 5 jobs in parallel on different GPUs. 28 | 29 | ## Running multiple jobs in parallel 30 | 31 | Slurm provides a nice functionality, to explore different parameter ranges like the different random seeds in the example above. Specifically, different parameter ranges can be explored using the `--array` option. 32 | 33 | An example, is given in [src/deploy-with-conda-multiple.sh](/src/deploy-with-conda-multiple.sh). The changes with respect to `deploy-with-conda.sh` are adding the following line to the sbatch arguments: 34 | 35 | ````bash 36 | #SBATCH --array=0,1,2,3,4 37 | ```` 38 | 39 | and replacing the python call by the following 40 | ````bash 41 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed ${SLURM_ARRAY_TASK_ID} 42 | ```` 43 | 44 | This will start 5 separate jobs each with the different values from the specified array which will be contained in the automatically generated bash variable `${SLURM_ARRAY_TASK_ID}`. 45 | 46 | The example script can be deployed using 47 | 48 | ````bash 49 | sbatch src/deploy-with-conda-multiple.sh 50 | ```` 51 | 52 | You can check with `squeue --me` that 5 jobs have in fact been started. 53 | 54 | Learn more about the `--array` option in [this excellent section](https://slurm.schedmd.com/job_array.html) of the Slurm documentation. 55 | 56 | ## What now? 
57 | 58 | Go back to the [Contents section](/README.md#contents) or learn about [interactive jobs and debugging](/instructions/interactive-jobs.md). -------------------------------------------------------------------------------- /instructions/singularity-workflow.md: -------------------------------------------------------------------------------- 1 | # Using Slurm with Singularity 2 | 3 | This section describes an example workflow using Singularity, including instructions on how to build a Singularity container and deploy it on the ML Cloud Slurm host. 4 | 5 | In contrast to the [virtual environment workflow](/instructions/virtual-env-workflow.md) (i.e. Conda or virtualenv), the Singularity workflow offers more flexibility for setting up your environment, but it is also arguably a bit more involved and cumbersome to use. 6 | 7 | ## What is Singularity in a nutshell? 8 | 9 | [Singularity](https://en.wikipedia.org/wiki/Singularity_(software)) is a type of container similar to the more widely used [Docker](https://www.docker.com/resources/what-container). 10 | 11 | It allows you to package everything you need to run your code into a file. It is almost like a virtual machine, in which you can log in, install whatever you want and store whatever data you want. In contrast to an actual virtual machine, no part of your hardware is allocated specifically to run a Singularity container. Rather, it just takes whatever resources it needs, like any other job on your system. 12 | 13 | The cool thing about containers is that you can just give that file to anyone else and they can exactly reproduce your experiment setup without having to do any annoying environment setup like finding the right CUDA or TensorFlow versions. Rather than having instructions with your code on how to get it to run, you could technically just share the Singularity file. In practice, however, these files tend to be very large, so they are not typically shared widely. They are nevertheless very useful for packaging and sending jobs that will run exactly the way you want them to on any system. 14 | 15 | As you will see, the "pure" idea would be to package your code inside the container. However, this is pretty unwieldy for daily use with the ML Cloud, so we will also discuss a sort of hybrid setup where the container only contains the dependencies, but the code lives outside of the container. 16 | 17 | If containers are new to you, make sure to read about [containers in general](https://www.docker.com/resources/what-container) and [Singularity containers](https://en.wikipedia.org/wiki/Singularity_(software)) in particular. 18 | 19 | Singularity (like Conda and virtualenv in the previous tutorial) is already pre-installed on Slurm. 20 | 21 | ## Download code and build the Singularity container 22 | 23 | Make sure you are on the Slurm login node, and download the code if you haven't done so yet in the minimal example or the Conda workflow example: 24 | 25 | ```` 26 | cd $WORK 27 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 28 | cd tue-slurm-helloworld 29 | ```` 30 | 31 | As the first step, build the Singularity container 32 | 33 | ```` 34 | singularity build --fakeroot deeplearning.sif src/deeplearning.def 35 | ```` 36 | 37 | Explanations: 38 | - `deeplearning.def` contains instructions on how to build the container. It's a text file; have a look at it if you want. 
39 | - Importantly, this file contains the specification of the "operating system", a Ubuntu 20.04 with cuda support in this case, and instructions to install Python3 with all required packages for this example. 40 | 41 | 42 | 43 | ## Entering the new container in a shell 44 | 45 | You can now enter a shell in your container using the following command and have a look around. 46 | 47 | ```` 48 | singularity shell --bind /mnt/qb/baumgartner deeplearning.sif 49 | ```` 50 | 51 | The bind option is not necessary to enter the container. It will mount the `baumgartner` folder, so you have access to it from inside the container also. This is important for the `multiply.py` script to execute successfully. 52 | 53 | You are now running an encapsulated operating system as specified in `deeplearning.def`. 54 | Try opening an `ipython` terminal. 55 | Note that your home directory and current working directory automatically get mounted inside the container. This is not true for all paths, however. 56 | 57 | The following lines in the `deeplearning.def` file 58 | ```` 59 | %files 60 | multiply.py /opt/code 61 | ```` 62 | copies your code to `/opt/code` in the container. We copied it there so it would be baked into the container, so when you move the container elsewhere the code is still around. 63 | 64 | The code is also available in your current working directory. So technically it is there twice. However, when you later move your container to another location (e.g. to the Slurm host), the working directory with the code (which only mounted) will no longer be the same. The copied code in `/opt/code`, on the other hand, stays with the container. 65 | 66 | You can navigate to either copy of the code and execute it using 67 | 68 | ```` 69 | python3 multiply.py 70 | ```` 71 | 72 | Note that the code assumes that the following file exists on Slurm: `/mnt/qb/baumgartner/storagetest.txt`. I have put it there, so if nothing has changed it should be there. It contains a single integer specifying the size of the matrices to be multiplied in `multiply.py`. The idea behind this is to simulate dependence on external data. In a real-world example, this could for example be a medical dataset. 73 | 74 | **An important thing to note:** When you change your code after building the container, it will be changed in 75 | your current working directory but not under `/opt/code`. 76 | 77 | You can exit the container using `exit` or `Ctrl+d`. 78 | 79 | ## Executing stuff in the container 80 | 81 | You can execute the Python code without having to enter the container using the `exec` option as follows: 82 | 83 | ```` 84 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif python3 /opt/code/multiply.py --timer_repetitions 10000 --no-gpu 85 | ```` 86 | 87 | Explanations: 88 | - The `--nv` option is required to enable GPU access 89 | - The `--bind` option mounts the `baumgartner` folder and the current working directory (which can be obtained in bash using `pwd`) inside the container. 90 | - Note that we are also passing some options to the Python script at the end. 91 | - The Slurm login nodes do not have GPUs so for now we use the `--no-gpu` option for our python script. This will change once we submit a job further down in this file. 92 | 93 | Further note that the above command references the code at `/opt/code`. This copy of the code is static. This means if you change something in your code, you need to rebuild the container for the changes to take effect. 
94 | 95 | However, as long as the container is still on your local machine, you can change `/opt/code/multiply.py` to the path on your local machine (i.e. perhaps something like `$HOME/tue-slurm-helloworld/multiply.py`). Because this copy of the code is mounted rather than copied into the container, any changes you make will take effect and you do not need to rebuild your container each time (only once you want to deploy it remotely). 96 | 97 | ## Running code outside of the Singularity container 98 | 99 | As alluded to in the beginning, it may be a bit cumbersome to rebuild your entire Singularity container every time you make a change to your code, especially because Singularity really rebuilds the whole thing each time even if only a few parts have changed (this is different in Docker). 100 | 101 | So rather than baking the code into the container, we can also run code that lives outside of the container somewhere on Slurm. We can for example run the identical file that lives in the `src` folder of this tutorial. 102 | 103 | ```` 104 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif python3 src/multiply.py --timer_repetitions 10000 --no-gpu 105 | ```` 106 | 107 | Not all directories are available from inside the container, but the following directories will be there: 108 | - The home directory of your Slurm login node 109 | - All the paths you add using the `--bind` option in the command above. 110 | 111 | ## Running the code on Slurm 112 | 113 | The `deploy-with-singularity.sh` file contains instructions for your Slurm job. Those are a bunch of options specifying which GPUs to use etc., and which code to execute, in this case `multiply.py`. 114 | 115 | We can submit the Slurm job using the following command 116 | ```` 117 | sbatch src/deploy-with-singularity.sh 118 | ```` 119 | 120 | This will run the job on Slurm as specified in `deploy-with-singularity.sh`. Note, by the way, that the leading `#SBATCH` does not mean that those lines are commented out. This is just Slurm syntax. 121 | 122 | Have a look at the outputs written to `logs/`. 123 | Of course this is a very basic example, and you should later tailor `deploy-with-singularity.sh` to your needs. 124 | 125 | More info on how to manage Slurm jobs can be found [here](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm). 126 | 127 | ## Advantages and disadvantages of baking code into the container 128 | 129 | The following are general remarks not specific to this tutorial. 130 | 131 | You have two options to run code on Slurm using a Singularity container: 132 | 1. Copy the code into the container locally and then move the whole container including code over to Slurm and run it there. For this you would need to install Singularity on your local machine as described below. 133 | 2. Do not copy your code into the container, but rather move it to your `$WORK` directory on the remote Slurm host and access it from there by binding the code directory from Slurm into the container using the `--bind` option. 134 | 135 | The advantage of (1) is that each container is (mostly) self-contained and completely reproducible. (For this to be completely true, the data would also have to be baked into the container, which we could also do.) If you give this container to someone else, they can run it and get exactly the same result. You could also save containers with important experiments for later, so you can reproduce them or check what exactly you did. 
In fact this is what many ML companies do often in conjunction with Kubernetes, but usually using Docker rather than Singularity which is a bit more flexible and faster to build. 136 | 137 | However, using method (1) you will need to rebuild and copy your container to Slurm every time you change something. This, unfortunately, takes a long time because Singularity (in contrast to docker) always rebuilds everything from scratch. So practically you will be able to develop much faster using method (2). This comes at the cost of the above mentioned reproducibility. 138 | 139 | Method (2) seems the preferable one for actual research and development. You can have your code permanently in your Slurm directory and edit it locally by mounting your Slurm home locally using `sshfs` like above. Another option is to use a SSH extension for your IDE such as the "Remote -SSH" extension for Visual Studio Code. 140 | 141 | ## Install Singularity on your local machine (optional) 142 | 143 | You may want to install Singularity also on your local machine so you can develop your environment locally before copying it over to Slurm . 144 | 145 | To install singularity on your local machine, follow the steps described [here](https://sylabs.io/guides/3.7/user-guide/quick_start.html). 146 | 147 | ## What now? 148 | 149 | Go back to the [Contents section](/README.md#contents), learn about doing [parallel parameter sweeps](/instructions/parallel-jobs.md) using `sbatch`, or [interactive jobs and debugging](/instructions/interactive-jobs.md). 150 | -------------------------------------------------------------------------------- /instructions/useful-commands.md: -------------------------------------------------------------------------------- 1 | # Useful commands 2 | 3 | Some common commands are explained in [this section](https://gitlab.mlcloud.uni-tuebingen.de/doku/public/-/wikis/Slurm#common-slurm-commands) of the ML Cloud Slurm Wiki. 4 | 5 | They are reproduced here with some useful options. 6 | 7 | ## Show jobs 8 | 9 | Your own jobs 10 | 11 | ```` 12 | squeue --me 13 | ```` 14 | 15 | Any users jobs 16 | 17 | ```` 18 | squeue -u 19 | ```` 20 | 21 | ## Cancel jobs 22 | 23 | Cancel specific job 24 | 25 | ```` 26 | scancel 27 | ```` 28 | 29 | Cancel all my jobs 30 | 31 | ```` 32 | scancel -u 33 | ```` 34 | 35 | ## Check fairshare 36 | 37 | Check on a group level 38 | 39 | ```` 40 | sshare 41 | ```` 42 | 43 | Check on an individual level 44 | 45 | ```` 46 | sshare --all 47 | ```` 48 | 49 | ## What now? 50 | 51 | Go back to the [Contents section](/README.md#contents) or give feed-back about the tutorial on [github](https://github.com/baumgach/tue-slurm-helloworld/issues)! -------------------------------------------------------------------------------- /instructions/virtual-env-workflow.md: -------------------------------------------------------------------------------- 1 | # Using Slurm with Conda or Virtualenv 2 | 3 | In this section, we will describe an example workflow for setting up a virtual python environment and then running the `multiply.py` code which relies on PyTorch and some other libraries. 4 | 5 | For many this may be a more accessible workflow compared to the [Singularity based approach](/instructions/singularity-workflow.md), which was originally recommended by the ML Cloud folks and is described in [the next section](/instructions/singularity-workflow.md). However, for some cases it may be lacking in flexibility, especially if you need special libraries or a custom operating system for your experiments. 
6 | 7 | In the following we will describe a [Conda](https://docs.conda.io/en/latest/)-based workflow in more detail and will then give a brief alternative workflow using [virtualenv](https://virtualenv.pypa.io/en/latest/). 8 | 9 | ## What is Conda in a Nutshell 10 | 11 | [Conda](https://docs.conda.io/en/latest/) is an environment manager for Python and other languages. It allows you to create and activate virtual environments in which you can install Python packages. The advantage is that the packages will not be installed globally on your machine, but only in this environment. This means you can use different versions of packages for different projects. Conda is already installed on the Slurm login nodes. 12 | 13 | One drawback with respect to the [Singularity workflow](/instructions/singularity-workflow.md) is that you can only influence the version of Python and certain packages, but not over things like the operating system or packages not available through Conda. For example, using the Singularity method you can install any package or software available on Linux in your Singularity instance (e.g. using `apt`). Usually, this isn't a problem, however. 14 | 15 | ## Setting up a Conda environment 16 | 17 | Make sure you are on the Slurm login node, and download the code if you haven't done so yet in the minimal example: 18 | 19 | ```` 20 | cd $WORK 21 | git clone https://github.com/baumgach/tue-slurm-helloworld.git 22 | cd tue-slurm-helloworld 23 | ```` 24 | 25 | Make a Conda environment called `myenv` using 26 | 27 | ```` 28 | conda create -n myenv 29 | ```` 30 | 31 | and confirm with `y`. 32 | 33 | Activate the new environment using 34 | 35 | ```` 36 | conda activate myenv 37 | ```` 38 | 39 | Let's see which Python version we have by default. 40 | 41 | ```` 42 | python --version 43 | ```` 44 | 45 | That is the system-wide Python version. At the time of writing the system wide Python defaults to `python2.7`, although `python3` is also installed with version `3.6.8`. Using Conda we can instead install whichever specific Python version we like. For example 46 | 47 | ```` 48 | conda install python==3.9.7 49 | ```` 50 | 51 | `python` now defaults to the specified version, which you can check again with `python --version`. 52 | 53 | We can install packages using `conda install` or `pip install`. `conda install` works for some stuff `pip install` doesn't and vice versa. Writing this tutorial I had a smoother experience with `pip`. 54 | 55 | So in order to install our dependencies we can use: 56 | 57 | ```` 58 | pip install numpy torch ipdb 59 | ```` 60 | 61 | ## Other conda commands 62 | 63 | There are many other useful Conda commands for managing your packages and environments such as 64 | * `conda env list` for displaying all existing environments 65 | * `conda env remove --name myenv` for deleting an environment 66 | 67 | Have a look at this helpful [cheat sheet](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf). 68 | 69 | ## Running the code 70 | 71 | As before we can now run our matrix multiplication code directly on the login node without GPU. 72 | 73 | ```` 74 | cd $WORK/tue-slurm-helloworld 75 | python src/multiply.py --timer_repetitions 10000 --no-gpu 76 | ```` 77 | 78 | But of course we would rather submit this as a job with GPU usage. For this, I prepared another deploy script. 
79 | 80 | ```` 81 | sbatch src/deploy-with-conda.sh 82 | ```` 83 | 84 | The results will be written to the newly created `logs` directory which should have been created in your current working directory. You can list the files sorted by most recent and some useful extra information using 85 | 86 | ```` 87 | ls -ltrh logs 88 | ```` 89 | 90 | You can then look at the output log file and error file corresponding to your job using the `cat` or `less` commands. For example: 91 | 92 | ```` 93 | cat logs/job_.out 94 | ```` 95 | 96 | To deactivate the environment type 97 | 98 | ```` 99 | conda deactivate 100 | ```` 101 | 102 | ## Alternative workflow using virtualenv 103 | 104 | The following describes the equivalent of the above in virtualenv. Virtualenv is yet a bit simpler than Conda, but also a bit less flexible. For example, it does not (easily) allow control over the specific Python version you can use. 105 | 106 | Setting up the environment 107 | 108 | ```` 109 | cd $HOME # go to home directory 110 | virtualenv -p python3 env 111 | source env/bin/activate 112 | pip install --upgrade pip 113 | pip install numpy matplotlib ipython torch torchvision ipdb 114 | ```` 115 | 116 | Deactivating the environment 117 | ```` 118 | deactivate 119 | ```` 120 | 121 | Running the code 122 | 123 | ```` 124 | cd $WORK 125 | sbatch src/deploy-with-virtualenv.sh 126 | ```` 127 | 128 | ## What now? 129 | 130 | Go back to the [Contents section](/README.md#contents), continue with the more flexible [Singularity](/instructions/singularity-workflow.md) based workflow, or learn about [interactive jobs and debugging](/instructions/interactive-jobs.md). 131 | -------------------------------------------------------------------------------- /logs/.gitignore: -------------------------------------------------------------------------------- 1 | * 2 | !.gitignore 3 | -------------------------------------------------------------------------------- /src/deeplearning.def: -------------------------------------------------------------------------------- 1 | BootStrap: docker 2 | From: nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 3 | 4 | %post 5 | # Downloads the latest package lists. 6 | apt-get update -y 7 | 8 | # Install Python3 requirements 9 | # --> python3-tk is required by matplotlib. 10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 11 | python3 \ 12 | python3-tk \ 13 | python3-pip \ 14 | python3-distutils \ 15 | python3-setuptools 16 | 17 | # Reduce the size of the image by deleting the package lists we downloaded, 18 | # which are useless now. 19 | apt-get -y clean 20 | rm -rf /var/lib/apt/lists/* 21 | 22 | # Install Python modules. 
23 | pip3 install numpy ipdb torch 24 | 25 | # Make code directory 26 | mkdir -p /opt/code 27 | 28 | %files 29 | # Copy code from home directory to singularity root 30 | src/multiply.py /opt/code/ 31 | 32 | -------------------------------------------------------------------------------- /src/deploy-with-conda-multiple.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | #SBATCH --array=0,1,2,3,4 # array of cityscapes random seeds 14 | 15 | # print info about current job 16 | echo "---------- JOB INFOS ------------" 17 | scontrol show job $SLURM_JOB_ID 18 | echo -e "---------------------------------\n" 19 | 20 | # Due to a potential bug, we need to manually load our bash configurations first 21 | source $HOME/.bashrc 22 | 23 | # Next activate the conda environment 24 | conda activate myenv 25 | 26 | # Run code with values specified in task array 27 | echo "-------- PYTHON OUTPUT ----------" 28 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu --random-seed ${SLURM_ARRAY_TASK_ID} 29 | echo "---------------------------------" 30 | 31 | # Deactivate environment again 32 | conda deactivate 33 | -------------------------------------------------------------------------------- /src/deploy-with-conda.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Due to a potential bug, we need to manually load our bash configurations first 20 | source $HOME/.bashrc 21 | 22 | # Next activate the conda environment 23 | conda activate myenv 24 | 25 | # Run our code 26 | echo "-------- PYTHON OUTPUT ----------" 27 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 28 | echo "---------------------------------" 29 | 30 | # Deactivate environment again 31 | conda deactivate 32 | -------------------------------------------------------------------------------- 
/src/deploy-with-singularity.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Run singularity command 20 | echo "-------- PYTHON OUTPUT ----------" 21 | singularity exec --nv --bind /mnt/qb/baumgartner,`pwd` deeplearning.sif \ 22 | python3 /opt/code/multiply.py \ 23 | --timer_repetitions 10000 \ 24 | --use-gpu 25 | echo "---------------------------------" -------------------------------------------------------------------------------- /src/deploy-with-virtualenv.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --ntasks=1 # Number of tasks (see below) 3 | #SBATCH --cpus-per-task=1 # Number of CPU cores per task 4 | #SBATCH --nodes=1 # Ensure that all cores are on one machine 5 | #SBATCH --time=0-00:05 # Runtime in D-HH:MM 6 | #SBATCH --partition=gpu-2080ti-dev # Partition to submit to 7 | #SBATCH --gres=gpu:1 # optionally type and number of gpus 8 | #SBATCH --mem=50G # Memory pool for all cores (see also --mem-per-cpu) 9 | #SBATCH --output=logs/job_%j.out # File to which STDOUT will be written 10 | #SBATCH --error=logs/job_%j.err # File to which STDERR will be written 11 | #SBATCH --mail-type=FAIL # Type of email notification- BEGIN,END,FAIL,ALL 12 | #SBATCH --mail-user= # Email to which notifications will be sent 13 | 14 | # print info about current job 15 | echo "---------- JOB INFOS ------------" 16 | scontrol show job $SLURM_JOB_ID 17 | echo -e "---------------------------------\n" 18 | 19 | # Activate virtualenv 20 | source $HOME/env/bin/activate 21 | 22 | # Run code 23 | echo "-------- PYTHON OUTPUT ----------" 24 | python3 src/multiply.py --timer_repetitions 10000 --use-gpu 25 | echo "---------------------------------" 26 | 27 | # Deactivate virtualenv again 28 | deactivate 29 | -------------------------------------------------------------------------------- /src/multiply-debug.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from timeit import timeit 3 | import argparse 4 | 5 | def time_multiplication( 6 | timer_repetitions, 7 | use_gpu=True, 8 | data_path="/mnt/qb/baumgartner/storagetest.txt", 9 | random_seed=0 10 | ): 11 | 12 | # Simulate external data access by reading the matrix size from external file 13 | with open(data_path) as f: 14 | s = int(f.read()) 15 | 16 | # Set random seed 17 | torch.manual_seed(random_seed) 18 | 19 | # Create tensors 20 | x = torch.randn(32, s) 21 | y = torch.randn(s, 32) 22 | 23 | import ipdb 24 | ipdb.set_trace() 25 | 26 | # Check if Cuda available and move tensor to Cuda if yes 
27 | cuda_available = torch.cuda.is_available() 28 | print(f"Cuda_available={cuda_available}") 29 | if cuda_available and use_gpu: 30 | device = torch.cuda.current_device() 31 | print(f"Current Cuda device: {device}") 32 | x = x.to(device) 33 | y = y.to(device) 34 | 35 | # Multiply matrix first once for result and then multiple times for measuring elapsed time 36 | mult = torch.matmul(x, y) 37 | elapsed_time = timeit(lambda: torch.matmul(x, y), number=timer_repetitions) 38 | 39 | # Print some results: 40 | print("result:") 41 | print(mult) 42 | print("Output shape", mult.shape) 43 | print(f"elapsed time: {elapsed_time}") 44 | 45 | if __name__ == "__main__": 46 | 47 | parser = argparse.ArgumentParser( 48 | description="Tescript for Singularity and SLURM on GPU. It times the multiplication of two matrices, repeated `timer_repetition` times." 49 | ) 50 | parser.add_argument( 51 | "--timer_repetitions", dest="timer_repetitions", action="store", default=1000, type=int, help="How many times to repeat the multiplication.", 52 | ) 53 | parser.add_argument('--use-gpu', dest='use_gpu', action='store_true') 54 | parser.add_argument('--no-gpu', dest='use_gpu', action='store_false') 55 | parser.set_defaults(use_gpu=True) 56 | parser.add_argument( 57 | "--random-seed", dest="random_seed", action="store", default=0, type=int, help="Random seed for sampling random matrices.", 58 | ) 59 | args = parser.parse_args() 60 | 61 | # Run main function with command line input, to test singularity arguments 62 | time_multiplication( 63 | timer_repetitions=args.timer_repetitions, 64 | use_gpu=args.use_gpu, 65 | random_seed=args.random_seed, 66 | ) 67 | 68 | print("done.") -------------------------------------------------------------------------------- /src/multiply.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from timeit import timeit 3 | import argparse 4 | 5 | def time_multiplication( 6 | timer_repetitions, 7 | use_gpu=True, 8 | data_path="/mnt/qb/baumgartner/storagetest.txt", 9 | random_seed=0 10 | ): 11 | 12 | # Simulate external data access by reading the matrix size from external file 13 | with open(data_path) as f: 14 | s = int(f.read()) 15 | 16 | # Set random seed 17 | torch.manual_seed(random_seed) 18 | 19 | # Create tensors 20 | x = torch.randn(32, s) 21 | y = torch.randn(s, 32) 22 | 23 | # Check if Cuda available and move tensor to Cuda if yes 24 | cuda_available = torch.cuda.is_available() 25 | print(f"Cuda_available={cuda_available}") 26 | if cuda_available and use_gpu: 27 | device = torch.cuda.current_device() 28 | print(f"Current Cuda device: {device}") 29 | x = x.to(device) 30 | y = y.to(device) 31 | 32 | # Multiply matrix first once for result and then multiple times for measuring elapsed time 33 | mult = torch.matmul(x, y) 34 | elapsed_time = timeit(lambda: torch.matmul(x, y), number=timer_repetitions) 35 | 36 | # Print some results: 37 | print("result:") 38 | print(mult) 39 | print("Output shape", mult.shape) 40 | print(f"elapsed time: {elapsed_time}") 41 | 42 | if __name__ == "__main__": 43 | 44 | parser = argparse.ArgumentParser( 45 | description="Tescript for Singularity and SLURM on GPU. It times the multiplication of two matrices, repeated `timer_repetition` times." 
46 | ) 47 | parser.add_argument( 48 | "--timer_repetitions", dest="timer_repetitions", action="store", default=1000, type=int, help="How many times to repeat the multiplication.", 49 | ) 50 | parser.add_argument('--use-gpu', dest='use_gpu', action='store_true') 51 | parser.add_argument('--no-gpu', dest='use_gpu', action='store_false') 52 | parser.set_defaults(use_gpu=True) 53 | parser.add_argument( 54 | "--random-seed", dest="random_seed", action="store", default=0, type=int, help="Random seed for sampling random matrices.", 55 | ) 56 | args = parser.parse_args() 57 | 58 | # Run main function with command line input, to test singularity arguments 59 | time_multiplication( 60 | timer_repetitions=args.timer_repetitions, 61 | use_gpu=args.use_gpu, 62 | random_seed=args.random_seed, 63 | ) 64 | 65 | print("done.") --------------------------------------------------------------------------------