├── README.md
├── intro
│   └── README.md
├── managing_users
│   └── README.md
├── pyxis
│   ├── README.md
│   └── clean_pyxis.bash
├── slurm_intro
│   └── README.md
└── storage
    └── README.md
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Quick Links

- [User Account Request](https://docs.google.com/forms/d/e/1FAIpQLSeTaLPpwmcvvqbeizdu_MRhhMjZ1L9z54NqHse6BJF9s0DTSg/viewform?usp=sf_link)
- [Cluster Issue Submission](https://docs.google.com/forms/d/e/1FAIpQLScRKQLOcW9J2Pvw4clOho3lv04R3VPfG7F9MZTqbsquv79y3g/viewform?usp=sf_link)

# Tutorials

- [Getting an account](https://github.com/daniilidis-group/cluster_tutorials/tree/master/managing_users)
- [Intro](https://github.com/daniilidis-group/cluster_tutorials/tree/master/intro)
- [Intro to SLURM](https://github.com/daniilidis-group/cluster_tutorials/tree/master/slurm_intro)
- [Some notes about storage](https://github.com/daniilidis-group/cluster_tutorials/tree/master/storage)
- [Docker/pyxis/enroot](https://github.com/daniilidis-group/cluster_tutorials/tree/master/pyxis)
--------------------------------------------------------------------------------
/intro/README.md:
--------------------------------------------------------------------------------
# Cluster Intro

Start here for a quick getting-started tutorial.

## Logging in

You need to be on one of the following:

1) A wired connection on campus
2) A wireless connection on campus
3) The UPenn VPN

## Getting resources

All computation on the cluster must be done after creating a request for resources.

### Direct interactive session

This method gives you an interactive terminal that holds the requested resources:

```
srun --partition debug --qos debug --mem=8G --gres=gpu:1 --pty bash
```

### Direct blocking request

This method submits a job and blocks your terminal until the command has completed:

```
srun --mem-per-gpu=10G --cpus-per-gpu=4 --gpus=1 nvidia-smi
```

### Non-blocking batch request

This method allows you to schedule a job for the cluster to get to later. These jobs can be significantly more complex than `srun` allows for. This file will be run with `sbatch <filename>`:

```
#!/bin/bash
#SBATCH --mem-per-gpu=10G
#SBATCH --cpus-per-gpu=4
#SBATCH --gpus=1
#SBATCH --time=00:10:00
#SBATCH --qos=low
#SBATCH --partition=compute

hostname
nvidia-smi

exit 0
```

## Visualization

Follow the tensorboard tutorial:

[Tensorboard](https://github.com/daniilidis-group/cluster_tutorials/tree/master/tensorboard)
--------------------------------------------------------------------------------
/managing_users/README.md:
--------------------------------------------------------------------------------
# New user creation

To request a new user, please fill out the [Cluster User Request Form](https://forms.gle/97BZLTwX5dMXDV118). The admin team will verify that the account is allowed to be created and then create it.

# Login

We require SSH key-based login from remote locations; see this [SSH Key Guide](https://linuxize.com/post/how-to-set-up-ssh-keys-on-ubuntu-1804/).
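
As a quick reference, a minimal key setup from your local machine looks like the sketch below; the username and hostname are placeholders, and the linked guide covers the details:

```
# On your local machine: generate a key pair (ed25519 is a reasonable default)
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy the public key to the cluster login node (replace the placeholders)
ssh-copy-id <username>@<login-node>

# Test that key-based login works
ssh <username>@<login-node>
```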

If at any point you want to change your password:

```
$ passwd
```

For access from an off-campus location, you will need to set up an SSH key for authentication.

You will *NOT* be able to log in from a remote location if you don't have an SSH key set up.

# SLURM Compute Allocation

Depending upon your group's policies, you will either be allocated compute with respect to your group as a whole, or an existing lab member will need to allocate you under their compute. If it is the former, this will be done for you. If it is the latter, your point of contact will be able to help you.

# Managing your account

This is for those groups that have a larger organizational structure:

- Kostas Group

## Your account
This account will balance your usage equally with other members of your group. It will additionally bill all usage from students working with you against your account. To look at your account:

```
$ sacctmgr show assoc Account=<professor>-account
```

## Adding a student under your account
Giving a user access to your compute will count against your overall account, and thus you are taking responsibility for their overall usage.

```
$ sacctmgr add user Name=<username> DefaultAccount=<professor>-account
```

You're able to remove them from your account at any point in time, as well as modify how they can use your account.
--------------------------------------------------------------------------------
/pyxis/README.md:
--------------------------------------------------------------------------------
# Pyxis

Pyxis manages enroot inside of SLURM. Both are provided by NVIDIA for running containers in unprivileged mode.

- https://github.com/NVIDIA/pyxis
- https://github.com/NVIDIA/enroot

## Running

The following srun command downloads the compressed container image, decompresses it, runs the given command, and removes the decompressed container upon completion:

```
srun --container-image=centos grep PRETTY /etc/os-release
```

If you would like to keep the image (uncompressed) on the machine:

```
srun --container-name=centosstays --container-image=centos grep PRETTY /etc/os-release
```

These are all of the options for Pyxis:

```
$ srun --help
...
      --container-image=[USER@][REGISTRY#]IMAGE[:TAG]|PATH
                              [pyxis] the image to use for the container
                              filesystem. Can be either a docker image given as
                              an enroot URI, or a path to a squashfs file on the
                              remote host filesystem.

      --container-mounts=SRC:DST[:FLAGS][,SRC:DST...]
                              [pyxis] bind mount[s] inside the container. Mount
                              flags are separated with "+", e.g. "ro+rprivate"

      --container-workdir=PATH
                              [pyxis] working directory inside the container
      --container-name=NAME   [pyxis] name to use for saving and loading the
                              container on the host. Unnamed containers are
                              removed after the slurm task is complete; named
                              containers are not. If a container with this name
                              already exists, the existing container is used and
                              the import is skipped.
      --container-mount-home  [pyxis] bind mount the user's home directory.
                              System-level enroot settings might cause this
                              directory to be already-mounted.

      --no-container-mount-home
                              [pyxis] do not bind mount the user's home
                              directory
      --container-remap-root  [pyxis] ask to be remapped to root inside the
                              container. Does not grant elevated system
                              permissions, despite appearances. (default)

      --no-container-remap-root
                              [pyxis] do not remap to root inside the container
```

## Running with sbatch

Unfortunately, srun parameters provided by a SPANK plugin don't migrate to sbatch by default. The workaround is thus to use sbatch to allocate resources and then use srun from within the job.

```
#!/bin/bash
#SBATCH --gpus=1
#SBATCH --cpus-per-gpu=4
#SBATCH --mem-per-gpu=10G
#SBATCH --time=14:00:00
#SBATCH --qos=low

CONTAINER_NAME=""
CONTAINER_IMAGE=""
COMMAND=""
EXTRA_MOUNTS="/Datasets:/Datasets,/scratch:/scratch"

srun --container-mount-home \
     --container-mounts=${EXTRA_MOUNTS} \
     --container-name=${CONTAINER_NAME} \
     --container-image=${CONTAINER_IMAGE} \
     --no-container-remap-root \
     ${COMMAND}
```

## Cleanup

Cleaning up your containers on a particular node should only be done when you have no jobs running on that node. This process can be useful if you're trying to debug why something isn't working.

```
srun -w <node1>,<node2> --cores=1 --mem=1G --time=00:10:00 rm -r /tmp/enroot-data/user-$(id -u)
```

Or to have it scheduled (ensure you edit which nodes you are using):

```
sbatch clean_pyxis.bash
```

## Admin notes
- Pyxis must be installed on all machines
- Enroot must be on all compute nodes
--------------------------------------------------------------------------------
/pyxis/clean_pyxis.bash:
--------------------------------------------------------------------------------
#!/bin/bash
#SBATCH -w node-2080ti-[0-6]
#SBATCH --cores=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00
#SBATCH --qos=low

# Resolve this user's enroot data directory and remove it
ENROOT_DATA_PATH="/tmp/enroot-data/user-$(id -u)"

rm -r "$ENROOT_DATA_PATH"
--------------------------------------------------------------------------------
/slurm_intro/README.md:
--------------------------------------------------------------------------------
# SLURM Intro

## Getting comfortable with SLURM

### Partitions
All nodes in the system are assigned to one or more partitions, which users may access given sufficient permissions. The non-specific partitions are as follows:

| Partition             | Time Limit  | Valid QOSs | Nodes              | Default |
|-----------------------|-------------|------------|--------------------|---------|
| batch                 | 1-00:00:00  | normal     | ALL                | YES     |
| \<professor\>-compute | 14-00:00:00 | varies     | Professor Specific | NO      |

To view the current state of each partition:

```
sinfo
```

In addition to the above partitions, you will see the professor-specific partitions that you may or may not have access to.

## srun

Now that you have some idea of how everything is broken down, you can run your first command. `srun` provides a blocking way to run a single command using remote resources.

```
srun --mem-per-gpu=10G --cpus-per-gpu=4 --gpus=1 nvidia-smi
```

The requested resources will be allocated and then your command will run. In this case you should see the GPU information displayed on your console.
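
`srun` works the same way for any command, not just `nvidia-smi`. For example, a quick sanity check that the system-wide PyTorch install (listed under Deep Learning Packages below) can see the allocated GPU might look like the following sketch; the resource values are just an example:

```
# Request one GPU and run a one-line PyTorch check on the allocated node
srun --mem-per-gpu=10G --cpus-per-gpu=4 --gpus=1 \
    python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
```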

## sbatch

Running individual commands in a blocking manner is often too cumbersome for any large-scale project. `sbatch` provides a way to submit one or more jobs to be scheduled and run at the optimal time.

```
#!/bin/bash
#SBATCH --mem-per-gpu=10G
#SBATCH --cpus-per-gpu=4
#SBATCH --gpus=1
#SBATCH --time=00:10:00

hostname
nvidia-smi

exit 0
```

If the above contents are in `example.bash`, you can submit this job through `sbatch example.bash`. Upon being scheduled, you will see a new file `slurm-<job_id>.out` that contains the contents of stdout from the above bash script.

## A note about accurate estimates

Accurate time and resource estimates are critical to ensure that as many jobs can be scheduled as possible. If you don't request enough, your job could crash in often unpredictable ways. If you request too much, those resources go to waste, as they could have been used for another job at the same time. In addition, your `fairshare` will be billed for the resources that you request and thus could be adversely affected if you request a large number of unused resources.

## Running an interactive session

Interactive sessions are designed to help facilitate efficient debugging of your code. You can request an interactive terminal on any partition or QOS; however, debug is recommended as it will be able to preempt most currently running jobs.

```
srun --partition debug --qos debug --mem=8G --gres=gpu:1 --pty bash
```

## Stopping jobs

To stop a job you will need to know your job ID, which is announced after you use any command that requests resources. You can also look at `squeue` to find a job ID.

```
scancel <job_id>
```

## Monitoring your jobs

### Queue
Checking the work queue:
```
squeue
```
You will see all of the currently running and scheduled jobs when running this command. Use the `sprio` command to check the priority level of each job.


## Advanced scheduling options

### QOS

Depending on your affiliation, you will have at least 3 QOSs available to you:

| QOS    | # GPUs | Preempts | Exempt Time | Max GPU Min Per Job | Max Jobs | Max Submit Jobs | Priority | Usage Factor | Default |
|--------|--------|----------|-------------|---------------------|----------|-----------------|----------|--------------|---------|
| normal |        |          | 00:30:00    |                     | 60       | 120             | 1        | 10           | YES     |

Professors who have their own resources have defined their own QOSs to ensure equitable distribution of resources within their group. Examples are as follows:

| QOS                | # GPUs | Preempts               |
|--------------------|--------|------------------------|
| \<professor\>-med  | 10     | low                    |
| \<professor\>-high | 1      | low, \<professor\>-med |

For more specifics on what currently exists, you can look at:

```
sacctmgr show qos
```

Each QOS attempts to fulfill a separate requirement that people might have and encourages smaller, more manageable chunks of runtime.

- \# GPUs is the total per user for that QOS
- Preempts indicates which QOSs can be preempted when you start a job with that QOS
- Exempt time indicates the amount of wall time that must pass before a job can be preempted (this is considered "safe" time)
- Max GPU Min Per Job is the total number of GPU-minutes your job can use (e.g. using 3 GPUs for 15 minutes is 45 GPU-minutes)
- Max Jobs is the total number of jobs that can be accruing time in that QOS per user
- Max Submit Jobs is the total number of jobs that you can submit in that QOS per user
- Priority is an additional priority factor that gets accounted for in the scheduler
- Usage Factor is how much it "costs" to run in this QOS

The basic QOSs (listed above) provide general access to a large number of resources. More specialized QOSs will be assigned by each group for their specific resources; these take priority over the more generic QOSs.

### Requesting a specific GPU

You are allowed to be more specific about the type of GPU that you want to use:

```
srun --mem-per-gpu=10G --cpus-per-gpu=4 --gpus=geforce_rtx_2080_ti:1 nvidia-smi
```

The list of types:

| Name                | Architecture | VRAM |
|---------------------|--------------|------|
| geforce_rtx_2080_ti | Turing       | 11GB |
| rtx_a6000           | Ampere       | 48GB |
| a40                 | Ampere       | 48GB |
| a10                 | Ampere       | 24GB |
| geforce_rtx_3090    | Ampere       | 24GB |
| l40                 | Lovelace     | 48GB |


### Requesting a specific node

If a specific node has the configuration that you'd like:
```
srun --mem-per-gpu=10G --cpus-per-gpu=4 --gpus=1 -w <node_name> nvidia-smi
```


## Batches

You will need to make a file that contains the parameters of the batch of jobs:

```
#!/bin/bash
#SBATCH --mem-per-gpu=10G
#SBATCH --cpus-per-gpu=4
#SBATCH --gpus=1
#SBATCH --array=0-3
#SBATCH --time=00:10:00

hostname
nvidia-smi

echo "My unique array ID is $SLURM_ARRAY_TASK_ID out of $SLURM_ARRAY_TASK_MAX"

exit 0
```

Write that to a file called `test.bash` and to run it use:

```
sbatch test.bash
```

This will run 4 jobs (`--array=0-3`) that each report the host they ran on and check the GPU itself.

Each `#SBATCH` line denotes a separate option, which is further defined in https://slurm.schedmd.com/sbatch.html

By default, a log file named `slurm-<job_id>.out` should also be generated in the same directory as `test.bash`, which you can view as a live log with:
```
tail -f slurm-<job_id>.out
```
sbatch allows you to change this file name to anything you would like. Check the documentation for more info.

You can also start an interactive terminal for a job that you started with sbatch (e.g., to check CPU or GPU utilization). To do so, you need to start a new step/task within the running job:
```
srun --jobid <job_id> --pty bash
```
With `squeue -s` you can see that your new step has a step ID like `<job_id>.<step_id>`.

## Long running jobs

Jobs that run for a long time (i.e. more time than the QOS allows for) can still be scheduled in blocks and automatically requeued. This is handled by having the correct exit conditions:

1) Exit code 3 from the primary job script
2) No signal was sent to your sub job

This is an example in bash that creates a job array with 4 elements and repeats forever. Please feel free to try this script, but do not allow it to run forever.

```
#!/bin/bash
#SBATCH --mem-per-gpu=10G
#SBATCH --cpus-per-gpu=4
#SBATCH --gpus=1
#SBATCH --array=0-3
#SBATCH --time=00:10:00

hostname
nvidia-smi
exit 3
```

## Handling preemption

Jobs that are in a lower QOS allow other jobs to preempt them. This can be handled in a couple of different ways. The most robust is to catch the signal (SIGTERM) and checkpoint your model. This can be seen further in the `mnist` tutorial. Jobs that are preempted are automatically placed back into the queue to be rescheduled.

# Deep Learning Packages

We provide the following packages installed on the base system of every machine:

| Name       | Version |
|------------|---------|
| Python     | 3.10    |
| Pytorch    | 1.13    |
| Tensorflow | 2.0     |

If the following command:
```
nvcc --version
```
returns `Command 'nvcc' not found`, you may need to update your PATH to include CUDA. To do so temporarily, run:
```
export PATH=$PATH:/usr/local/cuda/bin
```
To do so permanently, add the line above to the bottom of your bashrc, which should be at `~/.bashrc`.

In addition, you can always use venv for your specific use case. It is recommended to keep libraries like that on NVMe for speed of loading. For example, from inside of a SLURM interactive session:

```
cd /scratch/<username>/virtual_envs
python3 -m venv test
```


# Helpful Debugging Tools

If you want to see how many CPUs and GPUs each node has and how many are allocated, you can run the following command.
```
scontrol show node
```
This will display a list of all of the nodes and information about them. An example output is shown below.

```
NodeName=node-2080ti-3 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=10 CPUTot=40 CPULoad=8.42
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:rtx2080ti:4(S:0)
   NodeAddr=node-2080ti-3 NodeHostName=node-2080ti-3 Version=19.05.5
   OS=Linux 4.15.0-108-generic #109-Ubuntu SMP Fri Jun 19 11:33:10 UTC 2020
   RealMemory=92000 AllocMem=0 FreeMem=59361 Sockets=1 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute,kostas-compute
   BootTime=2020-06-29T13:54:09 SlurmdStartTime=2020-07-11T11:24:16
   CfgTRES=cpu=40,mem=92000M,billing=40,gres/gpu=4
   AllocTRES=cpu=10,gres/gpu=3
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
For this node, the second line shows the allocated vs. total CPUs. The total number and type of GPUs are shown under `Gres`. The number of allocated GPUs is shown under `AllocTRES`.
--------------------------------------------------------------------------------
/storage/README.md:
--------------------------------------------------------------------------------
# Storage

We supply two levels of storage that are logically and physically separated. Each has quotas and rules that govern access and usage internally.
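
A couple of standard commands are handy for checking where you stand against these quotas; the paths used below are the defaults described in the sections that follow:

```
# Total size and a rough inode (file and directory) count for your home directory (HDD pool)
du -sh ~
find ~ | wc -l

# Free space on the NVMe pool
df -h /mnt/kostas_graid
```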

## Storage types

Each pool is limited in two ways:

1) total storage size
2) total number of inodes (in general, fewer large files are better for performance than many small files)

| Pool | Size Quota | inode Quota | Default Directories |
|------|------------|-------------|---------------------|
| HDD  | 4TB        | 10M         | /home               |
| NVMe | 100GB      |             | /mnt/kostas_graid   |

## HDD

We maintain a ZFS server based on TrueNAS Scale for serving home folders for our users. These home folders are accessible from every node within the cluster. Users will see this as `/home/<username>`, which is created during the first login.

This is the bulk storage for our users to maintain some history on their experiments. Datasets and log files on HDDs tend to slow down the system if too many people use them that way; try to avoid this usage pattern. This pool is the default for `/home`.

## NVMe

We maintain a GRAID-based server with 40TB of NVMe storage. The uplink for this node into the network is a dual-port 100Gbps card, allowing for remote failover.

This storage is separated into user software environments and datasets.

### User Software

User software can be found in `/mnt/kostas_graid/sw/envs`. Users are free to make their own folders, and each user is allocated 100GB for software.

### Datasets

Datasets can be found in `/mnt/kostas_graid/datsets`. Users are free to make their own folders, and there is not currently a quota in place (if you need more than 1TB, please reach out to a cluster admin!). This data is automatically cleaned up based on file access time. If a file has not been accessed for more than 14 days, it will be removed. Empty folders are also removed. You can check the last time a file has been accessed with:

```
ls -lutr
```
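
If you want to see which of your files are at risk under the 14-day rule, `find` can filter by access time; the directory below is a placeholder for your own dataset folder:

```
# List files not accessed in the last 14 days (candidates for automatic removal)
find /mnt/kostas_graid/datsets/<username> -type f -atime +14
```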