├── LICENSE ├── README.md ├── cassio ├── DATASETS.md ├── README.md ├── gpu_job.slurm └── test_gpu.py ├── client ├── README.md └── config ├── greene ├── README.md ├── gpu_job.slurm ├── submitit_example │ ├── env.sh │ ├── python-greene │ ├── run_with_submitit.py │ └── slurm.py ├── sweep_job.slurm ├── test_gpu.py └── test_sweep.py └── prince ├── README.md ├── gpu_job.slurm └── test_gpu.py /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2020, nyu-dl 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Doing computations at CILVR Lab 2 | 3 | This repo is created to help new CILVR members to perform initial setup in order 4 | to use computational resources at NYU. 5 | 6 | ## Repo structure 7 | 8 | ### greene 9 | 10 | README there helps you to go over main information about NYU HPC Greene cluster (**NEW ONE**). 11 | 12 | ### client 13 | 14 | README there helps you to configure your client such as PC or laptop. 15 | 16 | ### cassio 17 | 18 | README there helps you to go over main information about CILVR cluster. 19 | 20 | ### prince 21 | 22 | README there helps you to go over main information about NYU HPC Prince cluster. 23 | -------------------------------------------------------------------------------- /cassio/DATASETS.md: -------------------------------------------------------------------------------- 1 | **Last update: 01/18/2020** 2 | 3 | Here we store up-to-date information about the datasets stored in CILVR/CDS filesystems. 4 | 5 | Please send a PR with edited file if you want to contribute here! 
6 | 7 | # NLP data 8 | 9 | `/misc/vlgscratch4/BowmanGroup/datasets` : 10 | 11 | * bert_trees 12 | 13 | * crosslingual_wikipedia 14 | 15 | * latent_tree_data 16 | 17 | * nli_data 18 | 19 | * ptb_data 20 | 21 | * snli_1.0 22 | 23 | * SQuAD2 24 | 25 | * task_data 26 | 27 | * wikipedia_corpus_small 28 | 29 | * wikipedia_sop_small 30 | 31 | * crosslingual_combine_wiki 32 | 33 | * eg_word 34 | 35 | * multinli_1.0.zip 36 | 37 | * probing-dep 38 | 39 | * ptb_trees 40 | 41 | * snli_1.0.zip 42 | 43 | * SST 44 | 45 | * wikipedia_corpus 46 | 47 | * wikipedia_corpus_tiny 48 | 49 | * WikiText103 50 | -------------------------------------------------------------------------------- /cassio/README.md: -------------------------------------------------------------------------------- 1 | # Cassio CILVR cluster 2 | 3 | Troubleshooting: helpdesk@courant.nyu.edu 4 | 5 | In this part we will touch the following aspects of our own cluster: 6 | 7 | * CIMS computing nodes 8 | 9 | * Courant / CILVR filesystems 10 | 11 | * Software management 12 | 13 | * Slurm Workload Manager 14 | 15 | ## CIMS computing nodes 16 | 17 | Courant has their own park of machines which everyone is welcome to use, learn more about it [here](https://cims.nyu.edu/webapps/content/systems/resources/computeservers). 18 | These machines do not have any workload manager. 19 | 20 | ### Connecting between nodes once you are logged on Access node 21 | 22 | Add your public key to `~/.ssh/authorized_keys` in order to jump between nodes without password typing each time. **Set restricted permissions on authorized_keys file**: `chmod 644 ~/.ssh/authorized_keys`. 23 | 24 | ## Filesystems 25 | 26 | ### Home filesystem 27 | 28 | Everyone has their own home fs mounted in `/home/${USER}`. 29 | 30 | Home fs has a limitation of 8GB (yes, eight) and one can check current usage with a command `showusage`: 31 | 32 | ```text 33 | [kulikov@cassio ~]$ showusage 34 | You are using 56% of your 8.0G quota for /home/kulikov. 35 | You are using 0% of your .4G quota for /web/kulikov. 36 | ``` 37 | 38 | There is a large benefit of using home fs: reliable backups. There are always 2 previous days backups of your fs located in `~/.zfs/`. In case if you can not locate the folder, fill a request [here](https://cims.nyu.edu/webapps/content/systems/forms/restore_file). 39 | 40 | Home fs is a good place to keep important files such as project notes, code, experiment configurations. 41 | 42 | ### Big RAIDs 43 | 44 | We have multiple big RAID filesystems, where one can keep checkpoints, datasets etc. 45 | 46 | * `/misc/kcgscratch1` 47 | 48 | * `/misc/vlgscratch4` 49 | 50 | * `/misc/vlgscratch5` 51 | 52 | We follow some simple folder structure as you will notice there. 53 | 54 | Afaik there are no backups there, use `rm` very carefully. 55 | 56 | ### Transfer files 57 | 58 | One can transfer some files to/from any location avaialable to your CIMS account via: 59 | 60 | #### SSH (scp, rsync etc.) 61 | 62 | Example command (assuming you have a working ssh config): 63 | 64 | to cims: 65 | `rsync -chaPvz cims:` 66 | 67 | from cims: 68 | `rsync -chaPvz cims: ` 69 | 70 | #### Using GLOBUS transfer service: [learn more](https://www.globus.org/) 71 | 72 | **NYU HPC has disjoint filesystems from Courant/CILVR** 73 | 74 | ## Software management 75 | 76 | **There is no sudo.** If you need to install some software (which is not possible to build locally), email helpdesk to assistance. 
77 | 
78 | All computing machines, including the cassio node, support dynamic modification of the user's environment variables (`env`) via *environment modules* (learn more about it [here](http://modules.sourceforge.net/)).
79 | 
80 | Main command: `module`.
81 | 
82 | Why: imagine that we need several different versions of `gcc` installed on the machines for various reasons. Every user can control which version of gcc to use via environment variables holding the path to the needed version.
83 | 
84 | Useful commands:
85 | 
86 | `module avail`: show all available modules to load
87 | 
88 | `module load <module_name>`
89 | 
90 | `module unload <module_name>`
91 | 
92 | `module purge`: unload all modules
93 | 
94 | Typical use case: loading a specific CUDA toolkit (with nvcc) or gcc.
95 | 
96 | *Machine learning packages and libraries there are typically outdated.* Afaik it is a common practice to make your own python environment with either conda or the built-in venv.
97 | 
98 | ### Conda environment
99 | 
100 | Everyone has their own taste on how to build their DL/ML environment; here I will briefly discuss how to get Miniconda running. Miniconda ships just a python with a small set of packages + conda, while Anaconda is a monster.
101 | 
102 | Miniconda homepage: [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html)
103 | 
104 | 1. Copy the link to the install script to your clipboard and run `wget <link_to_install_script>` in the CIMS terminal.
105 | 
106 | There is no easy way to keep your conda env in the home fs (8G quota), so choose some location in your big RAID folder as the installation path.
107 | 
108 | After installation type Y to agree to append the necessary lines to your `.bashrc`. After reconnection or re-login you should be able to run the `conda` command.
109 | 
110 | 2. Create a conda env: `conda create -n tutorial python=3`. Activate the tutorial env:
111 | 
112 | `conda activate tutorial`
113 | 
114 | 3. Install several packages:
115 | 
116 | `conda install pytorch torchvision cudatoolkit=10.2 -c pytorch`
117 | 
118 | `conda install jupyterlab`
119 | 
120 | ## Slurm Workload Manager
121 | 
122 | ### Terminal multiplexer / screen / tmux
123 | 
124 | Tmux or any other multiplexer helps you run a shell session on the Cassio (or any other) node which you can detach from and attach to anytime later *without losing the state.*
125 | 
126 | Learn more about it here: [https://github.com/tmux/tmux/wiki](https://github.com/tmux/tmux/wiki)
127 | 
128 | ### Slurm quotas
129 | 
130 | 1. Hot season (conference deadlines, dense queue)
131 | 
132 | 2. Cold season (summer, sparse queue)
133 | 
134 | ### Cassio node
135 | 
136 | `cassio.cs.nyu.edu` is the head node from where one can submit a job or request some resources.
137 | 
138 | **Do not run any intensive jobs on the cassio node.**
139 | 
140 | Popular job management commands:
141 | 
142 | `sinfo --long --Node --format="%.8N %.8T %.4c %.10m %.20f %.30G"` - shows all available nodes with the corresponding GPUs installed. **Note the features – they allow you to specify the desired GPU.**
143 | 
144 | `squeue -u ${USER}` - shows the state of your jobs in the queue.
145 | 
146 | `scancel <job_id>` - cancel the job with the specified id. You can only cancel your own jobs.
147 | 
148 | `scancel -u ${USER}` - cancel *all* your current jobs, use this one very carefully.
149 | 
150 | `scancel --name myJobName` - cancel a job given the job name.
151 | 
152 | `scontrol hold <job_id>` - hold a pending job from being scheduled. This may be helpful if you notice that some data/code/files are not ready yet for the particular job.
153 | 
154 | `scontrol release <job_id>` - release the job from hold.
155 | 
156 | `scontrol requeue <job_id>` - cancel and submit the job again.
157 | 
158 | ### Running an interactive job
159 | 
160 | By an interactive job we mean a shell on a machine (possibly with a GPU) where you can interactively run/debug code or run software such as JupyterLab or TensorBoard.
161 | 
162 | In order to request a machine and instantly (after Slurm assignment) connect to the assigned machine, run:
163 | 
164 | `srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`
165 | 
166 | Explanation:
167 | 
168 | * `--qos=interactive` means your job will have a special QoS labeled 'interactive'. In our case this means your time limit will be longer than for a usual job (7 days?), but there is a maximum of 2 jobs per user with this QoS.
169 | 
170 | * `--mem 16G` is the upper limit of RAM you expect your job to use. The machine will show all its RAM; however, Slurm kills the job if it exceeds the requested RAM. **Do not set the max possible RAM here, this may decrease your priority over time.** Instead, try to estimate a reasonable amount.
171 | 
172 | * `--gres=gpu:1` is the number of GPUs you will see in the requested instance. No GPUs if you do not use this arg.
173 | 
174 | * `--constraint=gpu_12gb` each node has features assigned based on what kind of GPU it has. Check the `sinfo` command above to output all nodes with all possible features. Features may be combined using the logical OR operator, e.g. `gpu_12gb|gpu_6gb`.
175 | 
176 | * `--pty bash` means that after connecting to the instance you will be given a bash shell.
177 | 
178 | You may remove the `--qos` arg and run as many interactive jobs as you wish, if you need that.
179 | 
180 | #### Port forwarding from the client to a Cassio node
181 | 
182 | As an example of port forwarding we will launch JupyterLab from an interactive GPU job shell and connect to it from the client browser.
183 | 
184 | 1. Start an interactive job (you may exclude the GPU to get it faster if your priority is low at the moment):
185 | 
186 | `srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`
187 | 
188 | Note the hostname of the machine you got, e.g. lion4 (it will be needed for port forwarding).
189 | 
190 | 2. Activate the conda environment with JupyterLab installed:
191 | 
192 | `conda activate tutorial`
193 | 
194 | 3. Start JupyterLab:
195 | 
196 | `jupyter lab --no-browser --port <port>`
197 | 
198 | Explanation:
199 | 
200 | * `--no-browser` means it will not invoke the default OS browser (you don't want a CLI browser).
201 | 
202 | * `--port <port>` is the port JupyterLab will listen on for requests. Usually we choose some 4-digit number to make sure we do not select a reserved port like 80 or 443.
203 | 
204 | 4. Open another tab in your terminal client and run:
205 | 
206 | `ssh -L <port>:localhost:<port> -J cims <job_hostname> -N` (the job hostname may be short, e.g. lion4)
207 | 
208 | Explanation:
209 | 
210 | * `-L <port>:localhost:<port>` specifies that the given port on the local (client) host is to be forwarded to the given host and port on the remote side.
211 | 
212 | * `-J cims <job_hostname>` means jump over cims to the other host. This uses your ssh config to resolve what cims means.
213 | 
214 | * `-N` means no shell will be given upon connection, only the tunnel will be started.
215 | 
216 | 5. Go to your browser and open `localhost:<port>`. You should be able to open the JupyterLab page. It may ask you for a security token: get it from the stdout of the interactive job instance.
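
For concreteness, here is the same sequence with hypothetical values filled in (port 8888, job landed on lion4); adjust both to your case:

```bash
# on the job node, inside the interactive job shell with the tutorial env active
jupyter lab --no-browser --port 8888

# on your client, in a second terminal (uses the cims entry from your ssh config)
ssh -L 8888:localhost:8888 -J cims lion4 -N

# then open http://localhost:8888 in the client browser
```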
217 | 218 | **Disclaimer:** there are many other ways to get set this up: one may use ssh SOCKS proxy, initialize tunnel from the interactive job itself etc. And all the methods are OK if you can run it. 219 | 220 | ### Submitting a batch job 221 | 222 | Batch jobs can be used for any computations where you do not expect your code to crash. In other words, there is no easy way to interrupt or debug running batch job. 223 | 224 | Main command to submit a batch job: `sbatch `. 225 | 226 | The first part of the script consist of slurm preprocessing directives such as: 227 | 228 | ```bash 229 | #SBATCH --job-name=job_wgpu 230 | #SBATCH --open-mode=append 231 | #SBATCH --output=./%j_%x.out 232 | #SBATCH --error=./%j_%x.err 233 | #SBATCH --export=ALL 234 | #SBATCH --time=00:10:00 235 | #SBATCH --gres=gpu:1 236 | #SBATCH --constraint=gpu_12gb 237 | #SBATCH --mem=64G 238 | #SBATCH -c 4 239 | ``` 240 | 241 | **Important: do not forget to activate conda env before submitting a job, or make sure you do so in the script.** 242 | 243 | Similar to arguments we passed to `srun` during interactive job request, here we specify requirements for the batch job. 244 | 245 | After `#SBATCH` block one may execute any shell commands or run any script of your choice. 246 | 247 | **You can not mix `#SBATCH` lines with other commands, Slurm will not register any `#SBATCH` after the first regular (non-comment) command in the script.** 248 | 249 | To submit `job_wgpu` located in `gpu_job.slurm`, go to Cassio node and run: 250 | 251 | `sbatch gpu_job.slurm` 252 | 253 | #### What happens when you hit enter 254 | 255 | Slurm registers your job in the database with corresponding job id. The allocation may not happen instantly and the job will be positioned in the queue. 256 | 257 | One can get all available information about the job using: 258 | 259 | `scontrol show jobid -dd ` 260 | 261 | **One can only get information about pending or running job with the command above.** 262 | 263 | While a job is in the queue, one can hold it from allocation (and later release) using corresponding commands we checked above. 264 | 265 | ### Managing experiments, running grid search etc 266 | 267 | This is out of scope for today tutorial. 
268 | -------------------------------------------------------------------------------- /cassio/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --constraint=gpu_12gb 10 | #SBATCH --mem=64G 11 | #SBATCH -c 4 12 | 13 | python ./test_gpu.py 14 | -------------------------------------------------------------------------------- /cassio/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == '__main__': 5 | 6 | print(f"Torch cuda available: {torch.cuda.is_available()}") 7 | print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 | t1 = torch.randn(100,1000) 10 | t2 = torch.randn(1000,10000) 11 | 12 | cpu_start = time.time() 13 | 14 | for i in range(100): 15 | t = t1 @ t2 16 | 17 | cpu_end = time.time() 18 | 19 | print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 | t1 = t1.to('cuda') 22 | t2 = t2.to('cuda') 23 | 24 | gpu_start = time.time() 25 | 26 | for i in range(100): 27 | t = t1 @ t2 28 | 29 | gpu_end = time.time() 30 | 31 | print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | 33 | 34 | -------------------------------------------------------------------------------- /client/README.md: -------------------------------------------------------------------------------- 1 | # Client setup 2 | 3 | We define client as your own machine (PC, laptop etc.) which has a terminal with CLI and 4 | SSH client. 5 | 6 | You should be fine using any operating system as long as you are comfortable with it. 7 | 8 | ### MacOS, Linux 9 | SSH client should be ready to be used by default. 10 | 11 | ### Windows 10 12 | If you are using Windows 10, there are many options which you can google about, we highlight some of them here: 13 | 14 | 1. OpenSSH installed natively on Windows 10: [learn more](https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview) 15 | 16 | 2. Windows Subsystem for Linux (this way you get linux shell with some limitations): [learn more](https://docs.microsoft.com/en-us/windows/wsl/install-win10) 17 | 18 | 3. Third-party software with SSH functionality (e.g. PuTTY): [learn more](https://www.putty.org/) 19 | 20 | ## Pre-requisites for tutorial: 21 | 22 | * Make sure you have OpenSSH installed: commands `ssh`, `ssh-keygen` should work! 23 | 24 | * Create SSH keypair: [learn more](https://www.ssh.com/ssh/keygen/) 25 | 26 | * Make sure your `~/.ssh` folder exists and it has correct system permissions: [learn more](https://wiki.ruanbekker.com/index.php/Linux_SSH_Directory_Permissions) 27 | 28 | * Make sure you can connect to both CIMS access nodes and Prince login nodes: [CIMS](https://cims.nyu.edu/webapps/content/systems/userservices/netaccess/secure) [Prince](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/hpc-access) 29 | 30 | *We will not spend time on these pre-requisites during the tutorial* 31 | 32 | ## NYU MFA 33 | 34 | You should be aware of NYU multi-factor authentication already (which is used when you login to NYU services). 35 | 36 | CIMS uses MFA authorization when you connect to access node via SSH. Follow necessary steps to login. 
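
If you have not created a keypair yet (one of the pre-requisites above), a minimal sketch (the key type, path and comment are up to you):

```bash
# generate a keypair on the client; accept the default path or choose your own
ssh-keygen -t ed25519 -C "my-laptop"

# print the public key so you can paste it where the following sections ask for it
cat ~/.ssh/id_ed25519.pub
```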
37 | 38 | ### Using ssh key instead of MFA 39 | 40 | In order to login to access node without using MFA, append your public ssh key in `~/.ssh/authorized_keys_access` **ON ACCESS NODE** (create this file if needed). Set 600 permission mask to that file: `chmod 600 ~/.ssh/authorized_keys_access`. 41 | 42 | ## Creating SSH config file 43 | 44 | SSH config is used to simplify ssh connection's configuration (such as usernames, ssh-key paths etc). 45 | 46 | Make necessary edits to `config` file in current folder and copy it to `~/.ssh/config`. 47 | 48 | After this step you should be able to directly connect to cassio host by typing `ssh cassio` or to access node via `ssh cims`. 49 | -------------------------------------------------------------------------------- /client/config: -------------------------------------------------------------------------------- 1 | Host cims 2 | HostName access.cims.nyu.edu 3 | Port 22 4 | Compression yes 5 | User 6 | IdentityFile 7 | ServerAliveInterval 60 8 | 9 | Host cassio 10 | HostName cassio.cs.nyu.edu 11 | Port 22 12 | Compression yes 13 | User 14 | IdentityFile 15 | ForwardAgent yes 16 | ServerAliveInterval 60 17 | ProxyCommand ssh cims -W %h:%p 18 | 19 | Host prince 20 | HostName prince.hpc.nyu.edu 21 | Port 22 22 | Compression yes 23 | User # DIFFERENT USERNAME HERE GIVEN BY HPC NOT CIMS 24 | IdentityFile 25 | ServerAliveInterval 60 26 | ProxyCommand ssh cims -W %h:%p 27 | 28 | Host greene 29 | HostName log-2.hpc.nyu.edu 30 | Port 22 31 | Compression yes 32 | User # DIFFERENT USERNAME HERE GIVEN BY HPC NOT CIMS 33 | IdentityFile 34 | ForwardAgent yes 35 | ServerAliveInterval 60 36 | ProxyCommand ssh cims -W %h:%p 37 | -------------------------------------------------------------------------------- /greene/README.md: -------------------------------------------------------------------------------- 1 | # Greene tutorial 2 | 3 | This tutorial aims to help with your migration from the Prince cluster (retires early 2021). It assumes some knowledge of Slurm, ssh and terminal. It assumes that one can connect to Greene ([same way](https://github.com/nyu-dl/cluster-support/tree/master/client) as to Prince). 4 | 5 | Official Greene docs & overview: 6 | https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/greene 7 | https://sites.google.com/a/nyu.edu/nyu-hpc/systems/greene-cluster 8 | 9 | **Content:** 10 | 1. Greene login nodes 11 | 2. Cluster overview 12 | 3. Data migration from Prince 13 | 4. Singularity - running jobs within a container 14 | 5. Running a batch job with singularity 15 | 6. Setting up a simple sweep over a hyper-parameter using Slurm array job. 16 | 7. Port forwarding to Jupyter Lab 17 | 8. Making squashfs out of your dataset and accessing it while running your scripts. 18 | 9. Using a python-based [Submitit](https://github.com/facebookincubator/submitit) framework on Greene 19 | 20 | 21 | ## Greene login nodes 22 | 23 | * greene.hpc.nyu.edu (balancer) 24 | * log-1.hpc.nyu.edu 25 | * log-2.hpc.nyu.edu 26 | 27 | You have to use NYU HPC account to login, username is the same as your NetID. 28 | 29 | Login nodes are accessible from within NYU network, i.e. one can connect from CIMS `access1` node, check [ssh config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) for details. 30 | 31 | Login node should not be used to run anything related to your computations, use it for file management (`git`, `rsync`), jobs management (`srun`, `salloc`, `sbatch`). 
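
Assuming the `greene` entry from the [client config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) is in your `~/.ssh/config`, a typical session from your client looks like this (the destination path is only an example):

```bash
# log in (jumps through the CIMS access node via ProxyCommand)
ssh greene

# copy a local project folder to your scratch
rsync -chaPvz ./my_project greene:/scratch/<netid>/my_project/
```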
32 | 
33 | ## Cluster overview
34 | 
35 | |Number of nodes|CPU cores|GPU|RAM|
36 | |---------------|---------|---|---|
37 | |524|48|-|180G|
38 | |40|48|-|369G|
39 | |4|96|-|3014G **(3T)**|
40 | |10|48|4 x V100 32G (PCIe)|369G|
41 | |65|48|4 x **RTX8000 48G**|384G|
42 | 
43 | **quotas**:
44 | 
45 | |filesystem|env var|what for|flushed|quota|
46 | |----------|-------|--------|-------|-----|
47 | |/archive|$ARCHIVE|long term storage|NO|2TB/20K inodes|
48 | |/home|$HOME|probably nothing|NO|50GB/30K inodes|
49 | |/scratch|$SCRATCH|experiments/stuff|YES (60 days)|5TB/1M inodes|
50 | 
51 | ```
52 | echo $SCRATCH
53 | /scratch/ik1147
54 | ```
55 | 
56 | ### Differences with Prince
57 | 
58 | * Strong preference **for Singularity** / not modules.
59 | * GPU specification through `--gres=gpu:{rtx8000|v100}:1` (Prince uses/used partitions)
60 | * Low inode quota on home fs (~30k files)
61 | * No beegfs
62 | 
63 | ## Data migration from Prince
64 | 
65 | **All filesystems on Greene are brand new, i.e. nothing will be migrated from Prince automatically.**
66 | 
67 | ### Copying a folder from Prince to Greene
68 | 
69 | 1. Identify the absolute path of the dir on Prince you need to transfer: `realpath <prince_dir>`
70 | 2. Create the destination folder on Greene where you wish all the content from `<prince_dir>` to be copied to: `mkdir <greene_dir>`
71 | 
72 | You may start the transfer from either Prince or Greene (as you wish):
73 | * from Prince: `rsync -chaPvz <prince_dir>/ log-2.hpc.nyu.edu:<greene_dir>/` (notice the trailing slashes)
74 | * from Greene: `rsync -chaPvz prince.hpc.nyu.edu:<prince_dir>/ <greene_dir>/`
75 | 
76 | Start such a transfer within tmux/screen to avoid connectivity issues.
77 | 
78 | ## Singularity
79 | 
80 | One can read the white paper here: https://arxiv.org/pdf/1709.10140.pdf
81 | 
82 | The main idea of using a container is to provide an isolated user space on a compute node and to simplify the node management (security and updates).
83 | 
84 | ### Getting a container image
85 | 
86 | Each singularity container has a definition file (`.def`) and an image file (`.sif`). Let's consider the following container:
87 | 
88 | `/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.def`
89 | `/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif`
90 | 
91 | The definition file contains all the commands which are executed along the way when the image OS is created, take a look!
92 | 
93 | This particular image has CUDA 10.1 with cudnn 7 libs within Ubuntu 18.04.
94 | 
95 | Let's execute this container with singularity:
96 | 
97 | ```bash
98 | [ik1147@log-2 ~]$ singularity exec /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash
99 | Singularity> uname -a
100 | Linux log-2.nyu.cluster 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Fri Oct 16 13:38:49 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
101 | Singularity> lsb_release -a
102 | LSB Version: core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
103 | Distributor ID: Ubuntu
104 | Description: Ubuntu 18.04.5 LTS
105 | Release: 18.04
106 | Codename: bionic
107 | ```
108 | 
109 | Notice that we are still on the login node but within the Ubuntu container!
110 | 
111 | One can find more containers here:
112 | 
113 | `/scratch/work/public/singularity/`
114 | 
115 | Container files are read-only images, i.e. there is no need to copy the container over. Instead one may create a symlink for convenience. Please contact HPC if you need a container image with some specific software.
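
For example, a symlink with a shorter name (the link name and location are just a suggestion):

```bash
# keep a short alias to the public image in your scratch; nothing is copied
ln -s /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif $SCRATCH/cuda10.1.sif

singularity exec $SCRATCH/cuda10.1.sif /bin/bash
```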
116 | 117 | ### Setting up an overlay filesystem with your computing environment 118 | 119 | **Why?** The whole reason of using Singularity with overlay filesystem is *to reduce the impact on scratch filesystem* (remember how slow is it on Prince sometimes?). 120 | 121 | **High-level idea**: make a read-only filesystem which will be used exclusively to host your conda environment and other static files which you constantly re-use for each job. 122 | 123 | **How is it different from scratch?** overlayfs is a separate fs mounted on each compute node when you start the container while scratch is a shared GPFS accessed via network. 124 | 125 | There are two different modes when mounting the overlayfs: 126 | * read-write: use this one when setting up env (installing conda, libs, other static files) 127 | * read-only: use this one when running your jobs. It has to be read-only since multiple processes will access the same image. It will crash if any job has already mounted it as read-write. 128 | 129 | Setting up your fs image: 130 | 1. Copy the empty fs gzip to your scratch path (e.g. `/scratch//` or `$SCRATCH` for your root scratch): `cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz $SCRATCH/` 131 | 2. Unzip the archive: `gunzip -v $SCRATCH/overlay-50G-10M.ext3.gz` (can take a while to unzip...) 132 | 3. Execute container with overlayfs (check comment below about `rw` arg): `singularity exec --overlay $SCRATCH/overlay-50G-10M.ext3:rw /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 133 | 4. Check file systems: `df -h`. There will be a record: `overlay 53G 52M 50G 1% /`. The size equals to the filesystem image you chose. **The actual content of the image is mounted in `/ext3`.** 134 | 5. Create a file in overlayfs: `touch /ext3/testfile` 135 | 6. Exit from Singularity 136 | 137 | One has permission for file creation since the fs was mounted with `rw` arg. In contrast `ro` will mount it as read-only. 138 | 139 | Setting up conda environment: 140 | 141 | 1. Start a CPU (GPU if you want/need) job: `srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --gres=gpu:1 --pty /bin/bash` 142 | 2. Start singularity (notice `--nv` for GPU propagation): `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:rw /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 143 | 3. Install your conda env in `/ext3`: https://github.com/nyu-dl/cluster-support/tree/master/cassio#conda-environment. 144 | 3.1. `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh` 145 | 3.2. `bash ./Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3` 146 | 3.3. Install packages you are using (torch, jupyter, tensorflow etc.) 147 | 4. Exit singularity container (*not the CPU/GPU job*) 148 | 149 | By now you have a working conda environment located in a ext3 image filesystem here: `$SCRATCH/overlay-50G-10M.ext3`. 150 | 151 | Let's run the container with **read-only** layerfs now (notice `ro` arg there): 152 | `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 153 | 154 | ```bash 155 | Singularity> conda activate 156 | (base) Singularity> python 157 | Python 3.8.5 (default, Sep 4 2020, 07:30:14) 158 | [GCC 7.3.0] :: Anaconda, Inc. on linux 159 | Type "help", "copyright", "credits" or "license" for more information. 
160 | >>> import torch 161 | >>> torch.__version__ 162 | '1.7.0' 163 | >>> 164 | (base) Singularity> touch /ext3/test.txt 165 | touch: cannot touch '/ext3/test.txt': Read-only file system 166 | ``` 167 | 168 | As shown above, the conda env works fine while `/ext3` is not writable. 169 | 170 | **Caution:** by default conda and python keep cache in home: `~/.conda`, `~/.cache`. Since home fs has small quota, move your cache folder to scratch: 171 | 1. `mkdir $SCRATCH/python_cache` 172 | 2. `cd` 173 | 3. `ln -s $SCRATCH/python_cache/.cache` 174 | 175 | ``` 176 | .cache -> /scratch/ik1147/cache/ 177 | ``` 178 | 179 | ***From now on you may run your interactive coding/debugging sessions, congratulations!*** 180 | 181 | ## Running a batch job with Singularity 182 | 183 | files to use: 184 | * `gpu_job.slurm` 185 | * `test_gpu.py` 186 | 187 | There is one major detail with a batch job submission: sbatch script starts the singularity container with a `/bin/bash -c ""` tail, where `""` is whatever your job is running. 188 | 189 | In addition, conda has to be sourced and activated. 190 | 191 | Lets create helper script for conda activation: copy the code below in `/ext3/env.sh` 192 | 193 | ```bash= 194 | #!/bin/bash 195 | 196 | source /ext3/miniconda3/etc/profile.d/conda.sh 197 | export PATH=/ext3/miniconda3/bin:$PATH 198 | ``` 199 | 200 | This is an example batch job submission script (also as a file `gpu_job.slurm` in this repo folder): 201 | 202 | ```bash= 203 | #!/bin/bash 204 | #SBATCH --job-name=job_wgpu 205 | #SBATCH --open-mode=append 206 | #SBATCH --output=./%j_%x.out 207 | #SBATCH --error=./%j_%x.err 208 | #SBATCH --export=ALL 209 | #SBATCH --time=00:10:00 210 | #SBATCH --gres=gpu:1 211 | #SBATCH --mem=64G 212 | #SBATCH -c 4 213 | 214 | singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 215 | 216 | source /ext3/env.sh 217 | conda activate 218 | 219 | python ./test_gpu.py 220 | " 221 | ``` 222 | 223 | the output: 224 | ``` 225 | Torch cuda available: True 226 | GPU name: Quadro RTX 8000 227 | 228 | 229 | CPU matmul elapsed: 0.438138484954834 sec. 230 | GPU matmul elapsed: 0.2669832706451416 sec. 231 | ``` 232 | 233 | ***From now on you may run your experiments as a batch job, congratulations!*** 234 | 235 | 236 | ## Setting up a simple sweep over a hyper-parameter using Slurm array job 237 | 238 | files to use: 239 | * `sweep_job.slurm` 240 | * `test_sweep.py` 241 | 242 | Usually we may want to do multiple runs of the same job using different hyper params or learning rates etc. There are many cool frameworks to do that like Pytorch Lightning etc. Now we will look on the simplest (imo) version of such sweep construction. 243 | 244 | **High-level idea:** define a sweep over needed arguments and create a product as a list of all possible combinations (check `test_sweep.py` for details). In the end a specific config is mapped to its position in the list. The SLURM array id is used as a map to specific config in the sweep (check `sweep_job.slurm`). 245 | 246 | Notes: 247 | * notice no `--nv` in singularity call because we allocate CPU-only resources. 248 | * notice `$SLURM_ARRAY_TASK_ID`. For each specific step job (in range `#SBATCH --array=1-20`) this env var will have the corresponding value assigned. 
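
`test_sweep.py` is not reproduced in this README; the sketch below only illustrates the mapping idea with hypothetical hyper-parameter names (`lr`, `model_size`): the Slurm array index selects one entry from the product of all swept values.

```python
import argparse
import itertools


def get_config(sweep_step):
    grid = {
        "lr": [0.1, 0.01],          # hypothetical sweep values
        "model_size": [512, 1024],
    }
    # all combinations, in a fixed and reproducible order
    configs = [dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values())]
    # SLURM_ARRAY_TASK_ID starts at 1 (see `#SBATCH --array=1-20`), hence the -1;
    # with more array steps than configs, several steps map to the same config
    return configs[(sweep_step - 1) % len(configs)]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--sweep_step", type=int, required=True)  # pass $SLURM_ARRAY_TASK_ID here
    args = parser.parse_args()
    print({"sweep_step": args.sweep_step, **get_config(args.sweep_step)})
```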
249 | 250 | Submitting a sweep job: 251 | `sbatch sweep_job.slurm` 252 | 253 | Outputs from all completed jobs: 254 | ``` 255 | cat *out 256 | {'sweep_step': 20, 'seed': 682, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 257 | {'sweep_step': 1, 'seed': 936, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 258 | {'sweep_step': 2, 'seed': 492, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 259 | {'sweep_step': 3, 'seed': 52, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 260 | {'sweep_step': 4, 'seed': 691, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 261 | {'sweep_step': 5, 'seed': 826, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 262 | {'sweep_step': 6, 'seed': 152, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 263 | {'sweep_step': 7, 'seed': 626, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 264 | {'sweep_step': 8, 'seed': 495, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 265 | {'sweep_step': 9, 'seed': 703, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 266 | {'sweep_step': 10, 'seed': 911, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 267 | {'sweep_step': 11, 'seed': 362, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 268 | {'sweep_step': 12, 'seed': 481, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 269 | {'sweep_step': 13, 'seed': 618, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 270 | {'sweep_step': 14, 'seed': 449, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 271 | {'sweep_step': 15, 'seed': 840, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 272 | {'sweep_step': 16, 'seed': 304, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 273 | {'sweep_step': 17, 'seed': 499, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 274 | {'sweep_step': 18, 'seed': 429, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 275 | {'sweep_step': 19, 'seed': 932, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 276 | ``` 277 | 278 | **Caution:** be careful with checkpointing when running a sweep. Make sure that your checkpoints/logs have corresponding sweep step in the filename or so (to avoid file clashes). 279 | 280 | ***From now on you may run your own sweeps, congratulations!*** 281 | 282 | ## Port forwarding to Jupyter Lab 283 | 284 | ### Part 1: launching JupyterLab on Greene. 285 | 286 | 1. Launch interactive job with a gpu: `srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --gres=gpu:1 --pty /bin/bash` 287 | 2. Execute singularity container: `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 288 | 3. Activate conda (base env in this case): `conda activate` 289 | 4. 
Start jupyter lab: `jupyter lab --ip 0.0.0.0 --port 8965 --no-browser` 290 | 291 | Expected output: 292 | 293 | ``` 294 | [I 15:51:47.644 LabApp] JupyterLab extension loaded from /ext3/miniconda3/lib/python3.8/site-packages/jupyterlab 295 | [I 15:51:47.644 LabApp] JupyterLab application directory is /ext3/miniconda3/share/jupyter/lab 296 | [I 15:51:47.646 LabApp] Serving notebooks from local directory: /home/ik1147 297 | [I 15:51:47.646 LabApp] Jupyter Notebook 6.1.4 is running at: 298 | [I 15:51:47.646 LabApp] http://gr031.nyu.cluster:8965/ 299 | [I 15:51:47.646 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 300 | ``` 301 | 302 | Note the hostname of the node we run from here: `http://gr031.nyu.cluster:8965/` 303 | 304 | ### Part 2: Forwarding the connection from you local client to Greene. 305 | 306 | [Check here the ssh config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) with the greene host to make the port forwarding easier to setup. 307 | 308 | To start the tunnel, run: `ssh -L 8965:gr031.nyu.cluster:8965 greene -N` 309 | 310 | **Note the hostname should be equal to the node where jupyter is running. Same with the port.** 311 | 312 | Press Ctrl-C if you want to stop the port forwarding SSH connection. 313 | 314 | ***From now on you may run your own Jupyter Lab with RTX8000/V100, congratulations!*** 315 | 316 | ## SquashFS for your read-only files 317 | 318 | Here we will convert read-only file such as data to a SquashFS file which reduces the inode load. It also compresses the data so you save on disk space as well. 319 | 320 | Imp note 1: Please first check if the dataset you need is present in `/scratch/work/public/datasets/'. If you are using datasets that you think are useful to many others, please send a mail to HPC so that they can make a common SquashFS in /work/public/. This tutorial was assuming you have some specific folder which you frequently use that has many files. 321 | 322 | 1. Check your current quota usage using `myquota` 323 | 2. Go to your folder that contains your datasets and you can check the number of files using 324 | ` for d in $(find $(pwd) -maxdepth 1 -mindepth 1 -type d); do n_files=$(find $d | wc -l); echo $d " " $n_files; done` 325 | 3. I will assume the folder that we want to convert is called DatasetX which contains many files (also handles subfolders with more files). Let's first make an env variable which points to your dataset as follows: 326 | `export DATASET_PATH=/path/to/DatasetX` , `export DATASET_NAME=DatasetX` and `export SINGULARITY_DATASET_PATH=/${DATASET_NAME}` 327 | Next run: 328 | `singularity exec --bind ${DATASET_PATH}:${SINGULARITY_DATASET_PATH}:ro /scratch/work/public/singularity/centos-8.2.2004.sif find ${SINGULARITY_DATASET_PATH} | wc -l` 329 | Here you should be able to see the number of files in DatasetX. 330 | 4. Make a folder in your scratch where you want to store your SquashFS files. Move to this folder. 331 | 5. Now we will run `mksquashfs`. 332 | `singularity exec --bind ${DATASET_PATH}:${SINGULARITY_DATASET_PATH}:ro /scratch/work/public/singularity/centos-8.2.2004.sif mksquashfs ${SINGULARITY_DATASET_PATH} ${DATASET_NAME}.sqf -keep-as-directory` 333 | 6. Now we have SquashFS file DatasetX.sqf, it will have the same number of files inside, but the disk usage will be lower. 
We can check this as follows: 334 | `ls -lh ${DATASET_NAME}.sqf` 335 | and 336 | `du -sh ${DATASET_PATH}` 337 | 338 | Now let's make sure we can reach the contents of this SquashFS file : 339 | 1. Start the singularity container with the SquashFS as an overlay. 340 | `singularity exec --overlay ${DATASET_NAME}.sqf:ro /scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif /bin/bash` 341 | 2. Go to the following location and you should see your dataset within the container. 342 | `cd /DatasetX` 343 | 344 | So just as we have done before, this is just an additional overlay that you will add to your Singularity command when running your job :) 345 | 346 | Imp note 2: After you confirm the SquashFS file is good, you can delete the folder {insert path to folder enclosing DatasetX}/DatasetX to save inode! :D 347 | 348 | 349 | 350 | ## Submitit on greene (Advanced) 351 | 352 | Submitit is a python wrapper that makes it super easy to submit jobs, sweeps and handles timeouts and restarts for your job. 353 | For examples and docs please see [Submitit](https://github.com/facebookincubator/submitit). 354 | 355 | Here I will assume you have already used submitit as part of your workflow and will only mention what needs to be added so you can use it with Singularity and SquashFS. 356 | 357 | We will use the following as our running example: [detr](https://github.com/facebookresearch/detr) 358 | 359 | 1. Clone the repo `git clone https://github.com/facebookresearch/detr.git` 360 | 2. Make a folder called experiments in your scratch. 361 | 3. The file changes you need are available in the folder submitit_example. 362 | - Make sure you have an overlay ready which has a conda installation as described earlier in the tutorial. Your /ext3/ folder should contain an env.sh file such as the example provided in the submitit_example folder. 363 | - You will need to copy the slurm.py and python-greene files into your desired location. 364 | - `slurm.py`: This file has the necessary changes that allow the singularity to know about the network details such as port address etc. 365 | - `python-greene` : this is an executable file that provides you with the ability to use singularity with submitit. 366 | Lines 57 and 58 in the python-greene are currently set to use two overlays - one has my conda installation and the other is the SquashFS with the dataset. 367 | *For this example, you directly use this file. But for your experiments, you will change line 57 and 58 to point to the right location.* 368 | Line 59 binds the slurm.py file that you have copied, to the one that exists internally in the submitit package. 369 | 370 | 4. Replace the `run_with_submitit.py` file in the detr repo you cloned with the provided one in the submitit_example folder. This file has cluster specific arguments. In lines 31 and 32, add your username so that all the jobs can reach your shared experiment folder. 371 | You are now ready to run your job! 
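
A rough sketch of the submission itself, assuming the edited `run_with_submitit.py`, `python-greene` and `slurm.py` sit inside the cloned `detr` folder and that the job is launched through the `python-greene` wrapper (which exports `MYPYTHON` for the patched `slurm.py`); flags other than `--ngpus`/`--nodes` keep their defaults:

```bash
cd detr
chmod +x python-greene   # make the wrapper executable

# submit a 1-node, 2-GPU run; the executor in run_with_submitit.py requests rtx8000 GPUs
./python-greene run_with_submitit.py --ngpus 2 --nodes 1
```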
372 | 373 | (I have also done all the above steps and provided the folder `/scratch/ask762/tutorial` where the only thing you need to change is line 31 & 32 in `detr/run_with_submitit.py` to have your username and you should be able to submit a job :) -------------------------------------------------------------------------------- /greene/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --mem=64G 10 | #SBATCH -c 4 11 | 12 | singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 13 | 14 | source /ext3/env.sh 15 | conda activate 16 | 17 | python ./test_gpu.py 18 | " 19 | -------------------------------------------------------------------------------- /greene/submitit_example/env.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /ext3/miniconda3/etc/profile.d/conda.sh 4 | export PATH=/ext3/miniconda3/bin:$PATH -------------------------------------------------------------------------------- /greene/submitit_example/python-greene: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # https://stackoverflow.com/questions/1668649/how-to-keep-quotes-in-bash-arguments 4 | args='' 5 | for i in "$@"; do 6 | i="${i//\\/\\\\}" 7 | args="$args \"${i//\"/\\\"}\"" 8 | done 9 | 10 | module purge 11 | 12 | export PATH=/share/apps/singularity/bin:$PATH 13 | 14 | # file systems 15 | export SINGULARITY_BINDPATH=/mnt,/scratch,/share/apps 16 | if [ -d /state/partition1 ]; then 17 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/state/partition1 18 | fi 19 | 20 | # IB related drivers and libraries 21 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/etc/libibverbs.d,/usr/lib64/libibverbs,/usr/lib64/ucx,/usr/include/infiniband,/lib64/libibverbs,/usr/include/rdma 22 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,$(echo /usr/bin/ib*_* /usr/sbin/ib* \ 23 | /usr/lib64/libpmi* \ 24 | /lib64/libmlx*.so* \ 25 | /lib64/libib*.so* \ 26 | /lib64/libnl* \ 27 | /usr/include/uc[m,p,s,t] \ 28 | /usr/lib64/libuc[m,p,s,t].so* \ 29 | /lib64/libosmcomp.so* \ 30 | /lib64/librdmacm*.so* | sed -e 's/ /,/g') 31 | 32 | if [ -d /opt/slurm/lib64 ]; then 33 | export SINGULARITY_CONTAINLIBS=$(echo /opt/slurm/lib64/libpmi* | xargs | sed -e 's/ /,/g') 34 | fi 35 | 36 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd 37 | 38 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/usr/bin/uuidgen 39 | 40 | export SINGULARITYENV_PREPEND_PATH=/opt/slurm/bin 41 | 42 | export SINGULARITYENV_LD_LIBRARY_PATH=/lib64:/lib64/libibverbs 43 | 44 | export SINGULARITY_CONTAINLIBS=/lib64/libmnl.so.0 45 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/usr/sbin/ip,/usr/bin/strace,/usr/sbin/ifconfig 46 | 47 | nv="" 48 | if [[ "$(hostname -s)" =~ ^g ]]; then nv="--nv"; fi 49 | 50 | export MYPYTHON=$(readlink -e $0) 51 | 52 | #if [ "$SLURM_JOBID" != "" ]; then strace="strace"; fi 53 | 54 | sif=/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04.sif 55 | 56 | singularity exec $nv \ 57 | --overlay /scratch/ask762/tutorial/pytorch1.6.0-cuda10.1.ext3:ro 
\ 58 | --overlay /scratch/ask762/tutorial/multimodal_datasets.sqf:ro \ 59 | --bind /scratch/ask762/tutorial/slurm.py:/ext3/miniconda3/lib/python3.8/site-packages/submitit/slurm/slurm.py:ro \ 60 | $sif \ 61 | /bin/bash -c " 62 | source /ext3/env.sh 63 | \$(which python) $args 64 | exit 65 | " -------------------------------------------------------------------------------- /greene/submitit_example/run_with_submitit.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved 2 | """ 3 | A script to run multinode training with submitit. 4 | """ 5 | import argparse 6 | import os 7 | import uuid 8 | import copy 9 | import itertools 10 | from typing import Dict 11 | from collections.abc import Iterable 12 | from pathlib import Path 13 | 14 | import main as detection 15 | import submitit 16 | 17 | def parse_args(): 18 | detection_parser = detection.get_args_parser() 19 | parser = argparse.ArgumentParser("Submitit for detection", parents=[detection_parser]) 20 | 21 | parser.add_argument("--ngpus", default=2, type=int, help="Number of gpus to request on each node") 22 | parser.add_argument("--nodes", default=2, type=int, help="Number of nodes to request") 23 | parser.add_argument("--timeout", default=3000, type=int, help="Duration of the job") 24 | parser.add_argument("--job_dir", default="", type=str, help="Job dir. Leave empty for automatic.") 25 | parser.add_argument("--mail", default="", type=str, 26 | help="Email this user when the job finishes if specified") 27 | return parser.parse_args() 28 | 29 | 30 | def get_shared_folder(args) -> Path: 31 | if Path('/scratch/{your_username}/experiments/').is_dir(): 32 | p = Path('/scratch/{your_username}/experiments/') 33 | p.mkdir(exist_ok=True) 34 | return p 35 | 36 | raise RuntimeError("No shared folder available") 37 | 38 | def get_init_file(args): 39 | # Init file must not exist, but it's parent dir must exist. 
40 | os.makedirs(str(get_shared_folder(args)), exist_ok=True) 41 | init_file = get_shared_folder(args) / f"{uuid.uuid4().hex}_init" 42 | if init_file.exists(): 43 | os.remove(str(init_file)) 44 | return init_file 45 | 46 | 47 | def grid_parameters(grid: Dict): 48 | """ 49 | Yield all combinations of parameters in the grid (as a dict) 50 | """ 51 | grid_copy = dict(grid) 52 | # Turn single value in an Iterable 53 | for k in grid_copy: 54 | if not isinstance(grid_copy[k], Iterable): 55 | grid_copy[k] = [grid_copy[k]] 56 | for p in itertools.product(*grid_copy.values()): 57 | yield dict(zip(grid.keys(), p)) 58 | 59 | 60 | def sweep(executor: submitit.Executor, args: argparse.ArgumentParser, hyper_parameters: Iterable): 61 | jobs = [] 62 | with executor.batch(): 63 | for grid_data in hyper_parameters: 64 | tmp_args = copy.deepcopy(args) 65 | tmp_args.dist_url = get_init_file(args).as_uri() 66 | tmp_args.output_dir = args.job_dir 67 | for k, v in grid_data.items(): 68 | assert hasattr(tmp_args, k) 69 | setattr(tmp_args, k, v) 70 | trainer = Trainer(tmp_args) 71 | job = executor.submit(trainer) 72 | jobs.append(job) 73 | print('Sweep job ids:', [job.job_id for job in jobs]) 74 | 75 | 76 | class Trainer(object): 77 | def __init__(self, args): 78 | self.args = args 79 | 80 | def __call__(self): 81 | import os 82 | import sys 83 | import detection as detection 84 | 85 | self._setup_gpu_args() 86 | detection.main(self.args) 87 | 88 | def checkpoint(self): 89 | import os 90 | import submitit 91 | from pathlib import Path 92 | 93 | self.args.dist_url = get_init_file(self.args).as_uri() 94 | checkpoint_file = os.path.join(self.args.output_dir, "checkpoint.pth") 95 | if os.path.exists(checkpoint_file): 96 | self.args.resume = checkpoint_file 97 | print("Requeuing ", self.args) 98 | empty_trainer = type(self)(self.args) 99 | return submitit.helpers.DelayedSubmission(empty_trainer) 100 | 101 | def _setup_gpu_args(self): 102 | import submitit 103 | from pathlib import Path 104 | 105 | job_env = submitit.JobEnvironment() 106 | self.args.output_dir = Path(str(self.args.output_dir).replace("%j", str(job_env.job_id))) 107 | self.args.gpu = job_env.local_rank 108 | self.args.rank = job_env.global_rank 109 | self.args.world_size = job_env.num_tasks 110 | print(f"Process group: {job_env.num_tasks} tasks, rank: {job_env.global_rank}") 111 | 112 | 113 | def main(): 114 | args = parse_args() 115 | if args.job_dir == "": 116 | args.job_dir = get_shared_folder(args) / "%j" 117 | 118 | # Note that the folder will depend on the job_id, to easily track experimen`ts 119 | executor = submitit.AutoExecutor(folder=args.job_dir, slurm_max_num_timeout=30) 120 | 121 | # cluster setup is defined by environment variables 122 | num_gpus_per_node = args.ngpus 123 | nodes = args.nodes 124 | timeout_min = args.timeout 125 | kwargs = {} 126 | if args.use_volta32: 127 | kwargs['constraint'] = 'volta32gb' 128 | if args.comment: 129 | kwargs['comment'] = args.comment 130 | 131 | executor.update_parameters( 132 | mem_gb=40 * num_gpus_per_node, 133 | tasks_per_node=num_gpus_per_node, # one task per GPU 134 | cpus_per_task=10, 135 | nodes=nodes, 136 | timeout_min=10080, # max is 60 * 72 137 | # Below are cluster dependent parameters 138 | slurm_gres=f"gpu:rtx8000:{num_gpus_per_node}", #you can choose to comment this, or change it to v100 as per your need 139 | slurm_signal_delay_s=120, 140 | **kwargs 141 | ) 142 | 143 | executor.update_parameters(name="detectransformer") 144 | if args.mail: 145 | executor.update_parameters( 146 | 
additional_parameters={'mail-user': args.mail, 'mail-type': 'END'}) 147 | 148 | executor.update_parameters(slurm_additional_parameters={ 149 | 'gres-flags': 'enforce-binding' 150 | }) 151 | 152 | args.dist_url = get_init_file(args).as_uri() 153 | args.output_dir = args.job_dir 154 | 155 | trainer = Trainer(args) 156 | job = executor.submit(trainer) 157 | 158 | print("Submitted job_id:", job.job_id) 159 | 160 | 161 | if __name__ == "__main__": 162 | main() 163 | -------------------------------------------------------------------------------- /greene/submitit_example/slurm.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | # 6 | 7 | import functools 8 | import inspect 9 | import os 10 | import re 11 | import shutil 12 | import subprocess 13 | import sys 14 | import typing as tp 15 | import uuid 16 | import warnings 17 | from pathlib import Path 18 | from typing import Any, Dict, List, Optional, Set, Tuple, Union 19 | 20 | from ..core import core, job_environment, logger, utils 21 | 22 | 23 | def read_job_id(job_id: str) -> tp.List[Tuple[str, ...]]: 24 | """Reads formated job id and returns a tuple with format: 25 | (main_id, [array_index, [final_array_index]) 26 | """ 27 | pattern = r"(?P\d+)_\[(?P(\d+(-\d+)?(,)?)+)(\%\d+)?\]" 28 | match = re.search(pattern, job_id) 29 | if match is not None: 30 | main = match.group("main_id") 31 | array_ranges = match.group("arrays").split(",") 32 | return [tuple([main] + array_range.split("-")) for array_range in array_ranges] 33 | else: 34 | main_id, *array_id = job_id.split("_", 1) 35 | if not array_id: 36 | return [(main_id,)] 37 | # there is an array 38 | array_num = str(int(array_id[0])) # trying to cast to int to make sure we understand 39 | return [(main_id, array_num)] 40 | 41 | 42 | class SlurmInfoWatcher(core.InfoWatcher): 43 | def _make_command(self) -> Optional[List[str]]: 44 | # asking for array id will return all status 45 | # on the other end, asking for each and every one of them individually takes a huge amount of time 46 | to_check = {x.split("_")[0] for x in self._registered - self._finished} 47 | if not to_check: 48 | return None 49 | command = ["sacct", "-o", "JobID,State", "--parsable2"] 50 | for jid in to_check: 51 | command.extend(["-j", str(jid)]) 52 | return command 53 | 54 | def get_state(self, job_id: str, mode: str = "standard") -> str: 55 | """Returns the state of the job 56 | State of finished jobs are cached (use watcher.clear() to remove all cache) 57 | 58 | Parameters 59 | ---------- 60 | job_id: int 61 | id of the job on the cluster 62 | mode: str 63 | one of "force" (forces a call), "standard" (calls regularly) or "cache" (does not call) 64 | """ 65 | info = self.get_info(job_id, mode=mode) 66 | return info.get("State") or "UNKNOWN" 67 | 68 | def read_info(self, string: Union[bytes, str]) -> Dict[str, Dict[str, str]]: 69 | """Reads the output of sacct and returns a dictionary containing main information 70 | """ 71 | if not isinstance(string, str): 72 | string = string.decode() 73 | lines = string.splitlines() 74 | if len(lines) < 2: 75 | return {} # one job id does not exist (yet) 76 | names = lines[0].split("|") 77 | # read all lines 78 | all_stats: Dict[str, Dict[str, str]] = {} 79 | for line in lines[1:]: 80 | stats = {x: y.strip() for x, y in zip(names, line.split("|"))} 81 | job_id = 
stats["JobID"] 82 | if not job_id or "." in job_id: 83 | continue 84 | try: 85 | multi_split_job_id = read_job_id(job_id) 86 | except Exception as e: 87 | # Array id are sometimes displayed with weird chars 88 | warnings.warn( 89 | f"Could not interpret {job_id} correctly (please open an issue):\n{e}", DeprecationWarning 90 | ) 91 | continue 92 | for split_job_id in multi_split_job_id: 93 | all_stats[ 94 | "_".join(split_job_id[:2]) 95 | ] = stats # this works for simple jobs, or job array unique instance 96 | # then, deal with ranges: 97 | if len(split_job_id) >= 3: 98 | for index in range(int(split_job_id[1]), int(split_job_id[2]) + 1): 99 | all_stats[f"{split_job_id[0]}_{index}"] = stats 100 | return all_stats 101 | 102 | 103 | class SlurmJob(core.Job[core.R]): 104 | 105 | _cancel_command = "scancel" 106 | watcher = SlurmInfoWatcher(delay_s=600) 107 | 108 | def _interrupt(self, timeout: bool = False) -> None: 109 | """Sends preemption or timeout signal to the job (for testing purpose) 110 | 111 | Parameter 112 | --------- 113 | timeout: bool 114 | Whether to trigger a job time-out (if False, it triggers preemption) 115 | """ 116 | cmd = ["scancel", self.job_id, "--signal"] 117 | # in case of preemption, SIGTERM is sent first 118 | if not timeout: 119 | subprocess.check_call(cmd + ["SIGTERM"]) 120 | subprocess.check_call(cmd + ["USR1"]) 121 | 122 | 123 | class SlurmParseException(Exception): 124 | pass 125 | 126 | 127 | def _expand_id_suffix(suffix_parts: str) -> List[str]: 128 | """Parse the a suffix formatted like "1-3,5,8" into 129 | the list of numeric values 1,2,3,5,8. 130 | """ 131 | suffixes = [] 132 | for suffix_part in suffix_parts.split(","): 133 | if "-" in suffix_part: 134 | low, high = suffix_part.split("-") 135 | int_length = len(low) 136 | for num in range(int(low), int(high) + 1): 137 | suffixes.append(f"{num:0{int_length}}") 138 | else: 139 | suffixes.append(suffix_part) 140 | return suffixes 141 | 142 | 143 | def _parse_node_group(node_list: str, pos: int, parsed: List[str]) -> int: 144 | """Parse a node group of the form PREFIX[1-3,5,8] and return 145 | the position in the string at which the parsing stopped 146 | """ 147 | prefixes = [""] 148 | while pos < len(node_list): 149 | c = node_list[pos] 150 | if c == ",": 151 | parsed.extend(prefixes) 152 | return pos + 1 153 | if c == "[": 154 | last_pos = node_list.index("]", pos) 155 | suffixes = _expand_id_suffix(node_list[pos + 1 : last_pos]) 156 | prefixes = [prefix + suffix for prefix in prefixes for suffix in suffixes] 157 | pos = last_pos + 1 158 | else: 159 | for i, prefix in enumerate(prefixes): 160 | prefixes[i] = prefix + c 161 | pos += 1 162 | parsed.extend(prefixes) 163 | return pos 164 | 165 | 166 | def _parse_node_list(node_list: str): 167 | try: 168 | pos = 0 169 | parsed: List[str] = [] 170 | while pos < len(node_list): 171 | pos = _parse_node_group(node_list, pos, parsed) 172 | return parsed 173 | except ValueError as e: 174 | raise SlurmParseException(f"Unrecognized format for SLURM_JOB_NODELIST: '{node_list}'", e) from e 175 | 176 | 177 | class SlurmJobEnvironment(job_environment.JobEnvironment): 178 | 179 | _env = { 180 | "job_id": "SLURM_JOB_ID", 181 | "num_tasks": "SLURM_NTASKS", 182 | "num_nodes": "SLURM_JOB_NUM_NODES", 183 | "node": "SLURM_NODEID", 184 | "nodes": "SLURM_JOB_NODELIST", 185 | "global_rank": "SLURM_PROCID", 186 | "local_rank": "SLURM_LOCALID", 187 | "array_job_id": "SLURM_ARRAY_JOB_ID", 188 | "array_task_id": "SLURM_ARRAY_TASK_ID", 189 | } 190 | 191 | def _requeue(self, countdown: 
int) -> None: 192 | jid = self.job_id 193 | subprocess.check_call(["scontrol", "requeue", jid]) 194 | logger.get_logger().info(f"Requeued job {jid} ({countdown} remaining timeouts)") 195 | 196 | @property 197 | def hostnames(self) -> List[str]: 198 | """Parse the content of the "SLURM_JOB_NODELIST" environment variable, 199 | which gives access to the list of hostnames that are part of the current job. 200 | 201 | In SLURM, the node list is formatted NODE_GROUP_1,NODE_GROUP_2,...,NODE_GROUP_N 202 | where each node group is formatted as: PREFIX[1-3,5,8] to define the hosts: 203 | [PREFIX1, PREFIX2, PREFIX3, PREFIX5, PREFIX8]. 204 | 205 | Link: https://hpcc.umd.edu/hpcc/help/slurmenv.html 206 | """ 207 | 208 | node_list = os.environ.get(self._env["nodes"], "") 209 | if not node_list: 210 | return [self.hostname] 211 | return _parse_node_list(node_list) 212 | 213 | 214 | class SlurmExecutor(core.PicklingExecutor): 215 | """Slurm job executor 216 | This class is used to hold the parameters to run a job on slurm. 217 | In practice, it will create a batch file in the specified directory for each job, 218 | and pickle the task function and parameters. At completion, the job will also pickle 219 | the output. Logs are also dumped in the same directory. 220 | 221 | Parameters 222 | ---------- 223 | folder: Path/str 224 | folder for storing job submission/output and logs. 225 | max_num_timeout: int 226 | Maximum number of time the job can be requeued after timeout (if 227 | the instance is derived from helpers.Checkpointable) 228 | 229 | Note 230 | ---- 231 | - be aware that the log/output folder will be full of logs and pickled objects very fast, 232 | it may need cleaning. 233 | - the folder needs to point to a directory shared through the cluster. This is typically 234 | not the case for your tmp! If you try to use it, slurm will fail silently (since it 235 | will not even be able to log stderr. 236 | - use update_parameters to specify custom parameters (n_gpus etc...). If you 237 | input erroneous parameters, an error will print all parameters available for you. 238 | """ 239 | 240 | job_class = SlurmJob 241 | 242 | def __init__(self, folder: Union[Path, str], max_num_timeout: int = 3) -> None: 243 | super().__init__(folder, max_num_timeout) 244 | if not self.affinity() > 0: 245 | raise RuntimeError('Could not detect "srun", are you indeed on a slurm cluster?') 246 | 247 | @classmethod 248 | def _equivalence_dict(cls) -> core.EquivalenceDict: 249 | return { 250 | "name": "job_name", 251 | "timeout_min": "time", 252 | "mem_gb": "mem", 253 | "nodes": "nodes", 254 | "cpus_per_task": "cpus_per_task", 255 | "gpus_per_node": "gpus_per_node", 256 | "tasks_per_node": "ntasks_per_node", 257 | } 258 | 259 | @classmethod 260 | def _valid_parameters(cls) -> Set[str]: 261 | """Parameters that can be set through update_parameters 262 | """ 263 | return set(_get_default_parameters()) 264 | 265 | def _convert_parameters(self, params: Dict[str, Any]) -> Dict[str, Any]: 266 | params = super()._convert_parameters(params) 267 | # replace type in some cases 268 | if "mem" in params: 269 | params["mem"] = f"{params['mem']}GB" 270 | return params 271 | 272 | def _internal_update_parameters(self, **kwargs: Any) -> None: 273 | """Updates sbatch submission file parameters 274 | 275 | Parameters 276 | ---------- 277 | See slurm documentation for most parameters. 
278 | Most useful parameters are: time, mem, gpus_per_node, cpus_per_task, partition 279 | Below are the parameters that differ from slurm documentation: 280 | 281 | signal_delay_s: int 282 | delay between the kill signal and the actual kill of the slurm job. 283 | setup: list 284 | a list of command to run in sbatch befure running srun 285 | array_parallelism: int 286 | number of map tasks that will be executed in parallel 287 | 288 | Raises 289 | ------ 290 | ValueError 291 | In case an erroneous keyword argument is added, a list of all eligible parameters 292 | is printed, with their default values 293 | 294 | Note 295 | ---- 296 | Best practice (as far as Quip is concerned): cpus_per_task=2x (number of data workers + gpus_per_task) 297 | You can use cpus_per_gpu=2 (requires using gpus_per_task and not gpus_per_node) 298 | """ 299 | defaults = _get_default_parameters() 300 | in_valid_parameters = sorted(set(kwargs) - set(defaults)) 301 | if in_valid_parameters: 302 | string = "\n - ".join(f"{x} (default: {repr(y)})" for x, y in sorted(defaults.items())) 303 | raise ValueError( 304 | f"Unavailable parameter(s): {in_valid_parameters}\nValid parameters are:\n - {string}" 305 | ) 306 | # check that new parameters are correct 307 | _make_sbatch_string(command="nothing to do", folder=self.folder, **kwargs) 308 | super()._internal_update_parameters(**kwargs) 309 | 310 | def _internal_process_submissions( 311 | self, delayed_submissions: tp.List[utils.DelayedSubmission] 312 | ) -> tp.List[core.Job[tp.Any]]: 313 | if len(delayed_submissions) == 1: 314 | return super()._internal_process_submissions(delayed_submissions) 315 | # array 316 | folder = utils.JobPaths.get_first_id_independent_folder(self.folder) 317 | folder.mkdir(parents=True, exist_ok=True) 318 | pickle_paths = [] 319 | for d in delayed_submissions: 320 | pickle_path = folder / f"{uuid.uuid4().hex}.pkl" 321 | d.timeout_countdown = self.max_num_timeout 322 | d.dump(pickle_path) 323 | pickle_paths.append(pickle_path) 324 | n = len(delayed_submissions) 325 | # Make a copy of the executor, since we don't want other jobs to be 326 | # scheduled as arrays. 
327 | array_ex = SlurmExecutor(self.folder, self.max_num_timeout) 328 | array_ex.update_parameters(**self.parameters) 329 | array_ex.parameters["map_count"] = n 330 | self._throttle() 331 | 332 | first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str) 333 | tasks_ids = list(range(first_job.num_tasks)) 334 | jobs: List[core.Job[tp.Any]] = [ 335 | SlurmJob(folder=self.folder, job_id=f"{first_job.job_id}_{a}", tasks=tasks_ids) for a in range(n) 336 | ] 337 | for job, pickle_path in zip(jobs, pickle_paths): 338 | job.paths.move_temporary_file(pickle_path, "submitted_pickle") 339 | return jobs 340 | 341 | @property 342 | def _submitit_command_str(self) -> str: 343 | # make sure to use the current executable (required in pycharm) 344 | #return f"{sys.executable} -u -m submitit.core._submit '{self.folder}'" 345 | python = os.getenv("MYPYTHON") 346 | if not python : python = f"{sys.executable}" 347 | return python + f" -u -m submitit.core._submit '{self.folder}'" 348 | 349 | def _make_submission_file_text(self, command: str, uid: str) -> str: 350 | return _make_sbatch_string(command=command, folder=self.folder, **self.parameters) 351 | 352 | def _num_tasks(self) -> int: 353 | nodes: int = self.parameters.get("nodes", 1) 354 | tasks_per_node: int = self.parameters.get("ntasks_per_node", 1) 355 | return nodes * tasks_per_node 356 | 357 | def _make_submission_command(self, submission_file_path: Path) -> List[str]: 358 | return ["sbatch", str(submission_file_path)] 359 | 360 | @staticmethod 361 | def _get_job_id_from_submission_command(string: Union[bytes, str]) -> str: 362 | """Returns the job ID from the output of sbatch string 363 | """ 364 | if not isinstance(string, str): 365 | string = string.decode() 366 | output = re.search(r"job (?P[0-9]+)", string) 367 | if output is None: 368 | raise utils.FailedSubmissionError( 369 | 'Could not make sense of sbatch output "{}"\n'.format(string) 370 | + "Job instance will not be able to fetch status\n" 371 | "(you may however set the job job_id manually if needed)" 372 | ) 373 | return output.group("id") 374 | 375 | @classmethod 376 | def affinity(cls) -> int: 377 | return -1 if shutil.which("srun") is None else 2 378 | 379 | 380 | @functools.lru_cache() 381 | def _get_default_parameters() -> Dict[str, Any]: 382 | """Parameters that can be set through update_parameters 383 | """ 384 | specs = inspect.getfullargspec(_make_sbatch_string) 385 | zipped = zip(specs.args[-len(specs.defaults) :], specs.defaults) # type: ignore 386 | return {key: val for key, val in zipped if key not in {"command", "folder", "map_count"}} 387 | 388 | 389 | # pylint: disable=too-many-arguments,unused-argument, too-many-locals 390 | def _make_sbatch_string( 391 | command: str, 392 | folder: tp.Union[str, Path], 393 | job_name: str = "submitit", 394 | partition: str = None, 395 | time: int = 5, 396 | nodes: int = 1, 397 | ntasks_per_node: int = 1, 398 | cpus_per_task: tp.Optional[int] = None, 399 | cpus_per_gpu: tp.Optional[int] = None, 400 | num_gpus: tp.Optional[int] = None, # legacy 401 | gpus_per_node: tp.Optional[int] = None, 402 | gpus_per_task: tp.Optional[int] = None, 403 | setup: tp.Optional[tp.List[str]] = None, 404 | mem: tp.Optional[str] = None, 405 | mem_per_gpu: tp.Optional[str] = None, 406 | mem_per_cpu: tp.Optional[str] = None, 407 | signal_delay_s: int = 90, 408 | comment: str = "", 409 | constraint: str = "", 410 | exclude: str = "", 411 | gres: str = "", 412 | exclusive: tp.Union[bool, str] = False, 413 | array_parallelism: int = 256, 414 
| wckey: str = "submitit", 415 | map_count: tp.Optional[int] = None, # used internally 416 | additional_parameters: tp.Optional[tp.Dict[str, tp.Any]] = None, 417 | ) -> str: 418 | """Creates the content of an sbatch file with provided parameters 419 | 420 | Parameters 421 | ---------- 422 | See slurm sbatch documentation for most parameters: 423 | https://slurm.schedmd.com/sbatch.html 424 | 425 | Below are the parameters that differ from slurm documentation: 426 | 427 | folder: str/Path 428 | folder where print logs and error logs will be written 429 | signal_delay_s: int 430 | delay between the kill signal and the actual kill of the slurm job. 431 | setup: list 432 | a list of command to run in sbatch befure running srun 433 | map_size: int 434 | number of simultaneous map/array jobs allowed 435 | additional_parameters: dict 436 | Forces any parameter to a given value in sbatch. This can be useful 437 | to add parameters which are not currently available in submitit. 438 | Eg: {"mail-user": "blublu@fb.com", "mail-type": "BEGIN"} 439 | 440 | Raises 441 | ------ 442 | ValueError 443 | In case an erroneous keyword argument is added, a list of all eligible parameters 444 | is printed, with their default values 445 | """ 446 | nonslurm = [ 447 | "nonslurm", 448 | "folder", 449 | "command", 450 | "map_count", 451 | "array_parallelism", 452 | "additional_parameters", 453 | "setup", 454 | ] 455 | parameters = {x: y for x, y in locals().items() if y and y is not None and x not in nonslurm} 456 | # rename and reformat parameters 457 | parameters["signal"] = signal_delay_s 458 | del parameters["signal_delay_s"] 459 | if job_name: 460 | parameters["job_name"] = utils.sanitize(job_name) 461 | if comment: 462 | parameters["comment"] = utils.sanitize(comment, only_alphanum=False) 463 | if num_gpus is not None: 464 | warnings.warn( 465 | '"num_gpus" is deprecated, please use "gpus_per_node" instead (overwritting with num_gpus)' 466 | ) 467 | parameters["gpus_per_node"] = parameters.pop("num_gpus", 0) 468 | if "cpus_per_gpu" in parameters and "gpus_per_task" not in parameters: 469 | warnings.warn('"cpus_per_gpu" requires to set "gpus_per_task" to work (and not "gpus_per_node")') 470 | # add necessary parameters 471 | paths = utils.JobPaths(folder=folder) 472 | stdout = str(paths.stdout) 473 | stderr = str(paths.stderr) 474 | # Job arrays will write files in the form __ 475 | if map_count is not None: 476 | assert isinstance(map_count, int) and map_count 477 | parameters["array"] = f"0-{map_count - 1}%{min(map_count, array_parallelism)}" 478 | stdout = stdout.replace("%j", "%A_%a") 479 | stderr = stderr.replace("%j", "%A_%a") 480 | parameters["output"] = stdout.replace("%t", "0") 481 | parameters["error"] = stderr.replace("%t", "0") 482 | parameters.update({"signal": f"USR1@{signal_delay_s}", "open-mode": "append"}) 483 | if additional_parameters is not None: 484 | parameters.update(additional_parameters) 485 | # now create 486 | lines = ["#!/bin/bash", "", "# Parameters"] 487 | lines += [ 488 | "#SBATCH --{}{}".format(x.replace("_", "-"), "" if parameters[x] is True else f"={parameters[x]}") 489 | for x in sorted(parameters) 490 | ] 491 | # environment setup: 492 | if setup is not None: 493 | lines += ["", "# setup"] + setup 494 | # commandline (this will run the function and args specified in the file provided as argument) 495 | # We pass --output and --error here, because the SBATCH command doesn't work as expected with a filename pattern 496 | lines += [ 497 | "", 498 | "# command", 499 | "export 
SUBMITIT_EXECUTOR=slurm", 500 | "export MASTER_ADDR=$(hostname -s)-ib0", 501 | "export MASTER_PORT=$(shuf -i 10000-65500 -n 1)", 502 | f"srun --export=ALL --kill-on-bad-exit=1 --output '{stdout}' --error '{stderr}' --unbuffered {command}\n\n", 503 | ] 504 | return "\n".join(lines) -------------------------------------------------------------------------------- /greene/sweep_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --mem=1G 9 | #SBATCH -c 4 10 | 11 | #SBATCH --array=1-20 12 | 13 | singularity exec --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 14 | 15 | source /ext3/env.sh 16 | conda activate 17 | 18 | python ./test_sweep.py $SLURM_ARRAY_TASK_ID 19 | " -------------------------------------------------------------------------------- /greene/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == "__main__": 5 | 6 | print(f"Torch cuda available: {torch.cuda.is_available()}") 7 | print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 | t1 = torch.randn(100, 1000) 10 | t2 = torch.randn(1000, 10000) 11 | 12 | cpu_start = time.time() 13 | 14 | for i in range(100): 15 | t = t1 @ t2 16 | 17 | cpu_end = time.time() 18 | 19 | print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 | t1 = t1.to("cuda") 22 | t2 = t2.to("cuda") 23 | 24 | gpu_start = time.time() 25 | 26 | for i in range(100): 27 | t = t1 @ t2 28 | 29 | gpu_end = time.time() 30 | 31 | print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | -------------------------------------------------------------------------------- /greene/test_sweep.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import fire 3 | import itertools 4 | import numpy 5 | 6 | 7 | def retreve_config(sweep_step): 8 | grid = { 9 | "lr": [0.1, 0.01], 10 | "model_size": [512, 1024], 11 | "seed": numpy.random.randint(0, 1000, 5), 12 | } 13 | 14 | grid_setups = list( 15 | dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values()) 16 | ) 17 | step_grid = grid_setups[sweep_step - 1] # slurm var will start from 1 18 | 19 | # automatically choose the device based on the given node 20 | if torch.cuda.device_count() > 0: 21 | expr_device = "cuda" 22 | else: 23 | expr_device = "cpu" 24 | 25 | config = { 26 | "sweep_step": sweep_step, 27 | "seed": step_grid["seed"], 28 | "device": expr_device, 29 | "lr": step_grid["lr"], 30 | "model_size": step_grid["model_size"], 31 | "some_fixed_arg": 0, 32 | } 33 | 34 | return config 35 | 36 | 37 | def run_experiment(config): 38 | print(config) 39 | 40 | 41 | def main(sweep_step): 42 | config = retreve_config(sweep_step) 43 | run_experiment(config) 44 | 45 | 46 | if __name__ == "__main__": 47 | fire.Fire(main) -------------------------------------------------------------------------------- /prince/README.md: -------------------------------------------------------------------------------- 1 | # NYU HPC Prince cluster 2 | 3 | Troubleshooting: hpc@nyu.edu 4 | 5 | NYU HPC website: [https://sites.google.com/a/nyu.edu/nyu-hpc/](https://sites.google.com/a/nyu.edu/nyu-hpc/) 6 | 7 | One can find many useful tips under 
*DOCUMENTATION / GUIDES*. 8 | 9 | In this part we will touch on the following aspects of this cluster: 10 | 11 | * Connecting to Prince via CIMS Access 12 | 13 | * Prince computing nodes 14 | 15 | * Prince filesystems 16 | 17 | * Software management 18 | 19 | * Slurm Workload Manager 20 | 21 | ## Connecting to Prince via CIMS Access (without bastion HPC nodes) 22 | 23 | HPC login nodes are reachable from the CIMS access node, i.e. there is no need to log in over another firewalled bastion HPC node. One can find the corresponding `prince` block in the ssh config file (check the client folder in this repo). 24 | 25 | ## Prince computing nodes 26 | 27 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince](https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince) 28 | 29 | ## Prince filesystems 30 | 31 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data) 32 | 33 | ### Show quota usage for all filesystems 34 | 35 | In order to check how much filesystem space you use, run `myquota`. 36 | 37 | ## Software management 38 | 39 | There is an environment modules package similar to the one on Cassio for general libs/software. 40 | 41 | There is no difference between managing **conda** on Cassio and on Prince except for the installation path. 42 | 43 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments) 44 | 45 | ## Slurm Workload Manager 46 | 47 | https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/batch/slurm-best-practices 48 | 49 | In general, Slurm behaves similarly on Cassio and Prince; however, there are differences in quotas and in how GPU-equipped nodes are distributed: 50 | 51 | 1. **There is no interactive QoS on Prince.** In other words, run `srun --pty bash` with additional args to get an interactive job allocated. 52 | 53 | 2. **There is no `--constraint` arg to specify the GPU you want.** 54 | 55 | Instead, GPU nodes are separated into different **partitions** w.r.t. GPU type. Right now the following partitions are available: `--partition=k80_4,k80_8,p40_4,p100_4,v100_sxm2_4,v100_pci_2,dgx1,p1080_4`. Slurm will try to allocate nodes in the order given in this list. 56 | 57 | |GPUs (count and model)|Memory| 58 | |---|------| 59 | |96 P40|24G| 60 | |32 P100|16G| 61 | |50 K80|24G bridged over 2 GPUs or 12G each| 62 | |26 V100|16G| 63 | |16 P1080|8G| 64 | |DGX1|16G ?| 65 | 66 | ### Port forwarding to an interactive job 67 | 68 | If one follows exactly the same steps as for Cassio and runs: 69 | 70 | `ssh -L :localhost: -J prince :` 71 | 72 | then the following error may be returned: 73 | 74 | ``` 75 | channel 0: open failed: administratively prohibited: open failed 76 | stdio forwarding failed 77 | kex_exchange_identification: Connection closed by remote host 78 | ``` 79 | 80 | which means that the jump to your instance was not successful. 
In order to avoid this jump, we make a tunnel that forwards the connection to the machine itself rather than to localhost: 81 | 82 | `ssh -L :: prince -N` 83 | 84 | **Important: you must run your JupyterLab or any other software so that it accepts requests from all IP addresses rather than from localhost only (which is usually the default).** To make this change in jupyter, add the `--ip 0.0.0.0` arg: 85 | 86 | `jupyter lab --no-browser --port --ip 0.0.0.0` 87 | 88 | Now you should be able to open the JupyterLab tab in your browser. 89 | 90 | ### Submitting a batch job 91 | 92 | As noted before, one particular difference from Cassio is GPU allocation (note `--partition` below): 93 | 94 | ```bash 95 | #!/bin/bash 96 | #SBATCH --job-name= 97 | #SBATCH --open-mode=append 98 | #SBATCH --output= 99 | #SBATCH --error= 100 | #SBATCH --export=ALL 101 | #SBATCH --time=24:00:00 102 | #SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1 103 | #SBATCH --gres=gpu:1 104 | #SBATCH --mem=64G 105 | #SBATCH -c 4 106 | ``` 107 | 108 | **Important: do not forget to activate your conda env before submitting a job, or make sure you do so in the script.** 109 | 110 | Similar to the arguments we passed to `srun` for an interactive job request, here we specify the requirements for the batch job. 111 | 112 | After the `#SBATCH` block one may execute any shell commands or run any script of your choice. 113 | 114 | **You cannot mix `#SBATCH` lines with other commands: Slurm will not register any `#SBATCH` directive after the first regular (non-comment) command in the script.** 115 | 116 | To submit `job_wgpu` located in `gpu_job.slurm`, go to a Prince login node and run: 117 | 118 | `sbatch gpu_job.slurm` 119 | 120 | sample output: 121 | 122 | ``` 123 | Torch cuda available: True 124 | GPU name: Tesla V100-SXM2-32GB-LS 125 | 126 | 127 | CPU matmul elapsed: 1.1984939575195312 sec. 128 | GPU matmul elapsed: 0.01778721809387207 sec. 129 | ``` 130 | 131 | ## MuJoCo Installation on Prince 132 | 133 | This tutorial assumes that you have already installed conda in your scratch folder. Please note that your conda environment must be stored in `$SCRATCH`, not in your `$HOME` directory; you will get weird build-lock errors otherwise! 134 | 135 | We also assume that you have put the mujoco200 folder and the license key inside the `~/.mujoco` folder. 136 | 137 | ```bash 138 | > conda create -n test_mujoco python=3.6 139 | 140 | > singularity exec /beegfs/work/public/singularity/mujoco-200.sif /bin/bash 141 | 142 | > conda activate test_mujoco 143 | 144 | > pip install mujoco-py 145 | 146 | ``` 147 | 148 | 149 | 150 | Now, to run a Python file, one has to run it inside the Singularity environment. 
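A quick interactive sanity check may help before wiring this into batch scripts. The following is a minimal sketch that mirrors the install steps above, assuming the `test_mujoco` conda env and the default `~/.mujoco/mujoco200` location; note that the first `import mujoco_py` may trigger a one-time build and can take a few minutes:

```bash
# open a shell inside the MuJoCo container
> singularity exec /beegfs/work/public/singularity/mujoco-200.sif /bin/bash

# inside the container: expose the MuJoCo shared libraries, activate the env, try the import
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200/bin

> conda activate test_mujoco

> python -c "import mujoco_py; print('mujoco-py OK')"
```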
151 | 152 | You can now run any Python file that uses MuJoCo via a bash wrapper script like the one given here: 153 | 154 | ```bash 155 | #!/bin/bash 156 | 157 | singularity exec \ 158 | /beegfs/work/public/singularity/mujoco-200.sif \ 159 | bash -c " 160 | export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200/bin 161 | source $SCRATCH/pyenv/mujoco-py/bin/activate 162 | python $* 163 | " 164 | ``` 165 | 166 | 167 | 168 | You can then run a Python file by running 169 | 170 | ```bash 171 | bash 172 | ``` 173 | 174 | -------------------------------------------------------------------------------- /prince/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1 10 | #SBATCH --mem=64G 11 | #SBATCH -c 4 12 | 13 | python ./test_gpu.py 14 | -------------------------------------------------------------------------------- /prince/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == '__main__': 5 | 6 |     print(f"Torch cuda available: {torch.cuda.is_available()}") 7 |     print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 |     t1 = torch.randn(100,1000) 10 |     t2 = torch.randn(1000,10000) 11 | 12 |     cpu_start = time.time() 13 | 14 |     for i in range(100): 15 |         t = t1 @ t2 16 | 17 |     cpu_end = time.time() 18 | 19 |     print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 |     t1 = t1.to('cuda') 22 |     t2 = t2.to('cuda') 23 | 24 |     gpu_start = time.time() 25 | 26 |     for i in range(100): 27 |         t = t1 @ t2 28 | 29 |     gpu_end = time.time() 30 | 31 |     print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | 33 | 34 | --------------------------------------------------------------------------------