├── README.md
├── lisa.md
├── das5.md
└── ivi.md

/README.md:
--------------------------------------------------------------------------------
# nodes-info

Resources on how to use the GPU clusters: [das5](das5.md), [ivi](ivi.md) and [LISA](lisa.md).

## General things to know

* DO NOT stay idle on a GPU node.

* Learn how to create sessions with either `screen` or `tmux`. Opening a session and then submitting your job allows you to disconnect and later reattach (from a different location) to monitor the progress. `tmux` is not available on LISA.

* `nvidia-smi` is useful to check the usage of the GPUs on a node.

* `CUDA_VISIBLE_DEVICES=0 python myscript.py` will make only GPU 0 visible to Python. Alternatively, you can set it from within your Python script.

* You can create a bash script with multiple parallel jobs that you then submit via `slurm`. Here is an example:

```bash
CUDA_VISIBLE_DEVICES=0 python exp1.py & \
CUDA_VISIBLE_DEVICES=1,2 python exp2.py & \
CUDA_VISIBLE_DEVICES=3 python exp3.py & \
wait
```

A quick tutorial on how to use das4 is available [here](https://goo.gl/Atq9Za).

--------------------------------------------------------------------------------
/lisa.md:
--------------------------------------------------------------------------------
# LISA GPU-island

## Account creation

Ask for an account by email.
Include the following [information](https://userinfo.surfsara.nl/systems/lisa/account) in your email.
Note that a (short) support letter from your advisor is mandatory to get approved.
It should normally take 1-2 days.

## Hostname

`login-gpu.lisa.surfsara.nl` to access the GPU nodes.

The head node has GPUs. You can debug your code there for 15 min.

## Useful things to have in your `.bashrc`

```bash
module load CUDA
module load cuDNN
```

Pick the CUDA version you want by looking at what's available with `module avail CUDA`.

## Get a node with slurm

```bash
srun -u --pty --time=HH:MM:SS -p gpu bash -i
```

* `--time`: maximum of 5 days (120:00:00)
* `-p gpu`: get a node with GPUs

However, the preferred way of interacting with the job queue is to submit a job rather than use an interactive session (a sketch of a possible `myjob.sh` is given below):

```bash
srun --time=HH:MM:SS -p gpu myjob.sh
```

slurm will provide a job id. Use that id if you want to remove yourself from the queue with `scancel [job id]`.

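For reference, here is a minimal sketch of what such a `myjob.sh` might contain; the module names are the ones from the `.bashrc` section above, while the Python script and its argument are just placeholders.

```bash
#!/bin/bash
# myjob.sh: minimal sketch of a job script for the LISA GPU partition.

# Load the CUDA stack (same modules as in the .bashrc section above).
module load CUDA
module load cuDNN

# Optional: log which GPUs were allocated to this job.
nvidia-smi

# Run the actual experiment; replace with your own script and arguments.
python myscript.py --myargument=foo
```
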
## Monitor node usage

See which nodes are available:

```bash
sinfo -p gpu
```

See who is in the queue:

```bash
squeue -p gpu
```

## Disk space

You get 200GB of free space in your home folder.

Temporary storage space and project space are also available. Check this [page](https://userinfo.surfsara.nl/systems/lisa/getting-started#filesystems).

## Credit system

You can monitor your credits with `accinfo`, which gives an overview of your account information and your credit budget, or with `accuse`, which shows your monthly usage.

Make sure your credit balance is not 0; otherwise, ask Boy Menist to fix your account.

New users are initially assigned 10k credits.
If you run out of credits (i.e. your balance is a negative number), ask the helpdesk for more credits and put Boy Menist in cc.

One hour of computation on a GPU node is equivalent to 48 credits.

## More info

Read the [starting guide](https://userinfo.surfsara.nl/systems/lisa/getting-started) for more info.

--------------------------------------------------------------------------------
/das5.md:
--------------------------------------------------------------------------------
# DAS-5 cluster

## Account creation

Ask for an account by email; use your `@uva.nl` email address.
It should normally take 1-2 days.

## Hostname

`fs4.das5.science.uva.nl` to access the GPU nodes.

## Useful things to have in your `.bashrc`

```bash
module load gcc
module load slurm
module load cuda80
module load cuDNN
alias mywatch='/home/koelma/bin/mywatch'
```

Other modules are also available; the full list is available with `module avail`.

Pick the CUDA version you want by looking at what's available with `module avail cuda`.

## Get a node with slurm

* `srun -u --pty bash -i`: get an interactive session on a *CPU* node
* `srun -u --pty --gres=gpu:1 bash -i`: get an interactive session on a *GPU* node
* `srun -u --pty -w node404 bash -i`: get an interactive session on `node404`
* `srun -u --pty -p fatq bash -i`: get an interactive session on the fatq node (more RAM)

Ideally you should submit a job instead of using an interactive session:

`srun --gres=gpu:1 python myscript.py --myargument=foo`

`srun --gres=gpu:4 bash mybashscript.sh` (a sketch of `mybashscript.sh` is given below)

slurm will provide a job id. Use that id if you want to remove yourself from the queue with `scancel [job id]`.

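As a sketch, `mybashscript.sh` in the 4-GPU example above could follow the same pattern as the parallel-jobs example in the [README](README.md), running one experiment per allocated GPU; the module names are the ones from the `.bashrc` section above, while the experiment scripts are placeholders.

```bash
#!/bin/bash
# mybashscript.sh: sketch of a job that uses all 4 requested GPUs in parallel.

# Load the CUDA stack (same modules as in the .bashrc section above).
module load cuda80
module load cuDNN

# One experiment per GPU; replace the scripts and arguments with your own.
CUDA_VISIBLE_DEVICES=0 python exp1.py &
CUDA_VISIBLE_DEVICES=1 python exp2.py &
CUDA_VISIBLE_DEVICES=2 python exp3.py &
CUDA_VISIBLE_DEVICES=3 python exp4.py &

# Wait for all background runs to finish so slurm does not end the job early.
wait
```
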
## Jupyter notebook on a node

For those who want to play with a jupyter notebook on a node, here is the procedure:

1. Get an interactive session on a CPU or GPU node
2. Record which node you got (e.g. `node401`)
3. On the obtained node, run `jupyter notebook --no-browser --port=8888`
4. From your local machine, run `ssh -t -t username@fs4.das5.science.uva.nl -L 8888:localhost:8888 ssh node401 -L 8888:localhost:8888`
5. Open a browser on your local machine and go to http://localhost:8888
6. You should now have access to a jupyter notebook that is running on a node on das5
7. Do **NOT** forget to kill both sessions once you are done

*Note:* if you run into issues launching the jupyter notebook on the GPU node, you might want to unset the `XDG_RUNTIME_DIR` environment variable with `unset XDG_RUNTIME_DIR`.

## Monitor node usage

Either use `squeue` or `mywatch` (see the last line of [.bashrc](#useful-things-to-have-in-your-bashrc)).

## Data storage

Your home folder space is quite limited. Use it only for scripts.

Install your own Python and other software on `/var/scratch/` (there is already a directory for each user). Also store your data on `/var/scratch/`. If you run out of space, ask Dennis.

Check your current quota with `quota -sv`.

## More info

Read more on the [das-5 website](https://www.cs.vu.nl/das5/jobs.shtml).

--------------------------------------------------------------------------------
/ivi.md:
--------------------------------------------------------------------------------
# IvI GPU cluster

There is a channel `#ivi_cluster` in the IvI slack.
Read the pinned posts for more info; this readme only provides the basics.

## Hostname

`ivi-h0.science.uva.nl` to access the cluster. Use your UvAnetID as credentials.

## Useful things to have in your `.bashrc`

```bash
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-8.0/lib64
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda-8.0/cudnn/cuda/lib64
alias mywatch='/home/dkoelma1/bin/mywatch'
```

Pick the CUDA version you want by looking at what's available in `/usr/local/`.

## Get a node with slurm

When submitting a job, you need to specify the number of GPUs, the amount of RAM, the number of CPUs and the duration. Please adjust these arguments according to your needs.
Each node has 4 GPUs, 128GB of RAM, 48 CPU threads (but 2 are used for slurm), and 2x10 TB of local HDD (`/hddstore`). The maximum run time is 7 days.

Here is an example of how to get an interactive session with 1 GPU for 2h30:
``srun -u --pty --gres=gpu:1 --mem=30G --cpus-per-task=10 --time=2:30:00 -D `pwd` bash -i``

The same, but with 4 GPUs for 1 day and 8 hours:
``srun -u --pty --gres=gpu:4 --mem=120G --cpus-per-task=40 --time=1-8 -D `pwd` bash -i``

Ideally you should submit a job instead of using an interactive session:
``srun --gres=gpu:1 --mem=30G --cpus-per-task=10 --time=2:30:00 -D `pwd` python myscript.py --myargument=foo``

Without the ``-D `pwd` `` argument, slurm will start the job in the `/tmp` directory.
Nodes can also be specified; e.g., to get `ivi-cn001`, add the argument `-w ivi-cn001`.

Quoting a wise man: the priority of a job depends on its size (smaller jobs have higher priority) and on the amount of resources you have used in the past (the less you have consumed, the higher your priority).

slurm will provide a job id. Use that id if you want to remove yourself from the queue with `scancel [job id]`.

## Monitor node usage

Either use `squeue` or `mywatch` (see the last line of [.bashrc](#useful-things-to-have-in-your-bashrc)).
Additionally, use `sinfo -h -N -o "%12n %8O %11T"` to monitor CPU usage.

## Jupyter notebook on a node

1. Get an interactive session on a GPU node.
2. On the GPU node, run `jupyter notebook --no-browser --port=20105`. You will get a URL with a token, e.g. `http://localhost:20105/?token=31427b811f3f3fdaef9d5da5cce8c81ba4df2f2ed4535960`.
3. On your local machine, run `ssh -L 20103:localhost:20104 UvAnetID@ivi-h0.science.uva.nl ssh -L 20104:localhost:20105 ivi-cn009` (here `ivi-cn009` is the node obtained in step 1). This forwards local port `20103` to port `20104` on the head node, and port `20104` on the head node to port `20105` on the GPU node.
4. Open your browser, go to http://localhost:20103 and paste the token from step 2 (`31427b811f3f3fdaef9d5da5cce8c81ba4df2f2ed4535960`).

## Data storage

You get 1 TB on your home directory.
Move your data to `/hddstore` or `/sddstore` for faster access.

## Change GCC version

The default GCC version is 4.8.
Use GCC 6 with
`source /opt/rh/devtoolset-6/enable`
or GCC 7 with
`source /opt/rh/devtoolset-7/enable`

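For example, a minimal check that the switch to GCC 7 worked for the current shell (the `devtoolset` paths are the ones listed above):

```bash
# The system default compiler (GCC 4.8).
gcc --version

# Enable GCC 7 for the current shell session and verify.
source /opt/rh/devtoolset-7/enable
gcc --version   # should now report a 7.x release
```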