├── LICENSE ├── README.md ├── cassio ├── DATASETS.md ├── README.md ├── gpu_job.slurm └── test_gpu.py ├── client ├── README.md └── config ├── greene ├── README.md ├── gpu_job.slurm ├── submitit_example │ ├── env.sh │ ├── python-greene │ ├── run_with_submitit.py │ └── slurm.py ├── sweep_job.slurm ├── test_gpu.py └── test_sweep.py └── prince ├── README.md ├── gpu_job.slurm └── test_gpu.py /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2020, nyu-dl 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Doing computations at CILVR Lab 2 | 3 | This repo is created to help new CILVR members to perform initial setup in order 4 | to use computational resources at NYU. 5 | 6 | ## Repo structure 7 | 8 | ### greene 9 | 10 | README there helps you to go over main information about NYU HPC Greene cluster (**NEW ONE**). 11 | 12 | ### client 13 | 14 | README there helps you to configure your client such as PC or laptop. 15 | 16 | ### cassio 17 | 18 | README there helps you to go over main information about CILVR cluster. 19 | 20 | ### prince 21 | 22 | README there helps you to go over main information about NYU HPC Prince cluster. 23 | -------------------------------------------------------------------------------- /cassio/DATASETS.md: -------------------------------------------------------------------------------- 1 | **Last update: 01/18/2020** 2 | 3 | Here we store up-to-date information about the datasets stored in CILVR/CDS filesystems. 4 | 5 | Please send a PR with edited file if you want to contribute here! 
6 | 7 | # NLP data 8 | 9 | `/misc/vlgscratch4/BowmanGroup/datasets` : 10 | 11 | * bert_trees 12 | 13 | * crosslingual_wikipedia 14 | 15 | * latent_tree_data 16 | 17 | * nli_data 18 | 19 | * ptb_data 20 | 21 | * snli_1.0 22 | 23 | * SQuAD2 24 | 25 | * task_data 26 | 27 | * wikipedia_corpus_small 28 | 29 | * wikipedia_sop_small 30 | 31 | * crosslingual_combine_wiki 32 | 33 | * eg_word 34 | 35 | * multinli_1.0.zip 36 | 37 | * probing-dep 38 | 39 | * ptb_trees 40 | 41 | * snli_1.0.zip 42 | 43 | * SST 44 | 45 | * wikipedia_corpus 46 | 47 | * wikipedia_corpus_tiny 48 | 49 | * WikiText103 50 | -------------------------------------------------------------------------------- /cassio/README.md: -------------------------------------------------------------------------------- 1 | # Cassio CILVR cluster 2 | 3 | Troubleshooting: helpdesk@courant.nyu.edu 4 | 5 | In this part we will touch the following aspects of our own cluster: 6 | 7 | * CIMS computing nodes 8 | 9 | * Courant / CILVR filesystems 10 | 11 | * Software management 12 | 13 | * Slurm Workload Manager 14 | 15 | ## CIMS computing nodes 16 | 17 | Courant has their own park of machines which everyone is welcome to use, learn more about it [here](https://cims.nyu.edu/webapps/content/systems/resources/computeservers). 18 | These machines do not have any workload manager. 19 | 20 | ### Connecting between nodes once you are logged on Access node 21 | 22 | Add your public key to `~/.ssh/authorized_keys` in order to jump between nodes without password typing each time. **Set restricted permissions on authorized_keys file**: `chmod 644 ~/.ssh/authorized_keys`. 23 | 24 | ## Filesystems 25 | 26 | ### Home filesystem 27 | 28 | Everyone has their own home fs mounted in `/home/${USER}`. 29 | 30 | Home fs has a limitation of 8GB (yes, eight) and one can check current usage with a command `showusage`: 31 | 32 | ```text 33 | [kulikov@cassio ~]$ showusage 34 | You are using 56% of your 8.0G quota for /home/kulikov. 35 | You are using 0% of your .4G quota for /web/kulikov. 36 | ``` 37 | 38 | There is a large benefit of using home fs: reliable backups. There are always 2 previous days backups of your fs located in `~/.zfs/`. In case if you can not locate the folder, fill a request [here](https://cims.nyu.edu/webapps/content/systems/forms/restore_file). 39 | 40 | Home fs is a good place to keep important files such as project notes, code, experiment configurations. 41 | 42 | ### Big RAIDs 43 | 44 | We have multiple big RAID filesystems, where one can keep checkpoints, datasets etc. 45 | 46 | * `/misc/kcgscratch1` 47 | 48 | * `/misc/vlgscratch4` 49 | 50 | * `/misc/vlgscratch5` 51 | 52 | We follow some simple folder structure as you will notice there. 53 | 54 | Afaik there are no backups there, use `rm` very carefully. 55 | 56 | ### Transfer files 57 | 58 | One can transfer some files to/from any location avaialable to your CIMS account via: 59 | 60 | #### SSH (scp, rsync etc.) 61 | 62 | Example command (assuming you have a working ssh config): 63 | 64 | to cims: 65 | `rsync -chaPvz cims:` 66 | 67 | from cims: 68 | `rsync -chaPvz cims: ` 69 | 70 | #### Using GLOBUS transfer service: [learn more](https://www.globus.org/) 71 | 72 | **NYU HPC has disjoint filesystems from Courant/CILVR** 73 | 74 | ## Software management 75 | 76 | **There is no sudo.** If you need to install some software (which is not possible to build locally), email helpdesk to assistance. 
77 | 
78 | All computing machines, including the cassio node, support dynamic modification of the user's environment variables (`env`) via *environment modules* (learn more about it [here](http://modules.sourceforge.net/)).
79 | 
80 | Main command: `module`.
81 | 
82 | Why: imagine that we need several different versions of `gcc` installed on the machines for various reasons. Every user can control which version of gcc to use via environment variables holding the path to the needed version.
83 | 
84 | Useful commands:
85 | 
86 | `module avail`: show all available modules to load
87 | 
88 | `module load <module_name>`
89 | 
90 | `module unload <module_name>`
91 | 
92 | `module purge`: unload all modules
93 | 
94 | Typical use case: loading a specific CUDA toolkit (with nvcc) or gcc.
95 | 
96 | *Machine learning packages and libraries there are typically outdated.* Afaik it is a common practice to make your own python environment with either conda or the built-in venv.
97 | 
98 | ### Conda environment
99 | 
100 | Everyone has their own taste on how to build their DL/ML environment; here I will briefly discuss how to get Miniconda running. Miniconda ships just a python with a small set of packages + conda, while Anaconda is a monster.
101 | 
102 | Miniconda homepage: [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html)
103 | 
104 | 1. Copy the link to the install script to your clipboard and run `wget <link_to_install_script>` in the CIMS terminal.
105 | 
106 | There is no easy way to keep your conda env in the home fs (8G quota), so choose some location in your big RAID folder as the installation path.
107 | 
108 | After installation type Y to agree to append the necessary lines to your `.bashrc`. After reconnection or re-login you should be able to run the `conda` command.
109 | 
110 | 2. Create a conda env: `conda create -n tutorial python=3`. Activate the tutorial env:
111 | 
112 | `conda activate tutorial`
113 | 
114 | 3. Install several packages:
115 | 
116 | `conda install pytorch torchvision cudatoolkit=10.2 -c pytorch`
117 | 
118 | `conda install jupyterlab`
119 | 
120 | ## Slurm Workload Manager
121 | 
122 | ### Terminal multiplexer / screen / tmux
123 | 
124 | Tmux or any other multiplexer helps you run a shell session on the Cassio (or any other) node which you can detach from and attach to anytime later *without losing the state.*
125 | 
126 | Learn more about it here: [https://github.com/tmux/tmux/wiki](https://github.com/tmux/tmux/wiki)
127 | 
128 | ### Slurm quotas
129 | 
130 | 1. Hot season (conference deadlines, dense queue)
131 | 
132 | 2. Cold season (summer, sparse queue)
133 | 
134 | ### Cassio node
135 | 
136 | `cassio.cs.nyu.edu` is the head node from where one can submit a job or request some resources.
137 | 
138 | **Do not run any intensive jobs on the cassio node.**
139 | 
140 | Popular job management commands:
141 | 
142 | `sinfo --long --Node --format="%.8N %.8T %.4c %.10m %.20f %.30G"` - shows all available nodes with the corresponding GPUs installed. **Note the features – they allow you to specify the desired GPU.**
143 | 
144 | `squeue -u ${USER}` - shows the state of your jobs in the queue.
145 | 
146 | `scancel <job_id>` - cancel the job with the specified id. You can only cancel your own jobs.
147 | 
148 | `scancel -u ${USER}` - cancel *all* your current jobs, use this one very carefully.
149 | 
150 | `scancel --name myJobName` - cancel a job given the job name.
151 | 
152 | `scontrol hold <job_id>` - hold a pending job from being scheduled. This may be helpful if you notice that some data/code/files are not ready yet for the particular job.
153 | 
154 | `scontrol release <job_id>` - release the job from hold.
155 | 
156 | `scontrol requeue <job_id>` - cancel and submit the job again.
157 | 
158 | ### Running an interactive job
159 | 
160 | By an interactive job we mean a shell on a machine (possibly with a GPU) where you can interactively run/debug code or run software such as JupyterLab or TensorBoard.
161 | 
162 | In order to request a machine and instantly (after Slurm assignment) connect to the assigned machine, run:
163 | 
164 | `srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`
165 | 
166 | Explanation:
167 | 
168 | * `--qos=interactive` means your job will have a special QoS labeled 'interactive'. In our case this means your time limit will be longer than for a usual job (7 days?), but there is a maximum of 2 jobs per user with this QoS.
169 | 
170 | * `--mem 16G` is the upper limit of RAM you expect your job to use. The machine will show all its RAM; however, Slurm kills the job if it exceeds the requested RAM. **Do not set the max possible RAM here, this may decrease your priority over time.** Instead, try to estimate a reasonable amount.
171 | 
172 | * `--gres=gpu:1` is the number of GPUs you will see in the requested instance. No GPUs if you do not use this arg.
173 | 
174 | * `--constraint=gpu_12gb` each node has features assigned based on what kind of GPU it has. Check the `sinfo` command above to output all nodes with all possible features. Features may be combined using the logical OR operator, e.g. `gpu_12gb|gpu_6gb`.
175 | 
176 | * `--pty bash` means that after connecting to the instance you will be given a bash shell.
177 | 
178 | You may remove the `--qos` arg and run as many interactive jobs as you wish, if you need that.
179 | 
180 | #### Port forwarding from the client to a Cassio node
181 | 
182 | As an example of port forwarding we will launch JupyterLab from an interactive GPU job shell and connect to it from the client browser.
183 | 
184 | 1. Start an interactive job (you may exclude the GPU to get it faster if your priority is low at the moment):
185 | 
186 | `srun --qos=interactive --mem 16G --gres=gpu:1 --constraint=gpu_12gb --pty bash`
187 | 
188 | Note the hostname of the machine you got, e.g. lion4 (it will be needed for port forwarding).
189 | 
190 | 2. Activate the conda environment with JupyterLab installed:
191 | 
192 | `conda activate tutorial`
193 | 
194 | 3. Start JupyterLab:
195 | 
196 | `jupyter lab --no-browser --port <port>`
197 | 
198 | Explanation:
199 | 
200 | * `--no-browser` means it will not invoke the default OS browser (you don't want a CLI browser).
201 | 
202 | * `--port <port>` is the port JupyterLab will listen on for requests. Usually we choose some 4-digit number to make sure we do not select a reserved port like 80 or 443.
203 | 
204 | 4. Open another tab in your terminal client and run:
205 | 
206 | `ssh -L <port>:localhost:<port> -J cims <job_hostname> -N` (the job hostname may be short, e.g. lion4)
207 | 
208 | Explanation:
209 | 
210 | * `-L <port>:localhost:<port>` specifies that the given port on the local (client) host is to be forwarded to the given host and port on the remote side.
211 | 
212 | * `-J cims <job_hostname>` means jump over cims to the other host. This uses your ssh config to resolve what cims means.
213 | 
214 | * `-N` means no shell will be given upon connection, only the tunnel will be started.
215 | 
216 | 5. Go to your browser and open `localhost:<port>`. You should be able to open the JupyterLab page. It may ask you for a security token: get it from the stdout of the interactive job instance.
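
For concreteness, here is the same sequence with hypothetical values filled in (port 8888, job landed on lion4); adjust both to your case:

```bash
# on the job node, inside the interactive job shell with the tutorial env active
jupyter lab --no-browser --port 8888

# on your client, in a second terminal (uses the cims entry from your ssh config)
ssh -L 8888:localhost:8888 -J cims lion4 -N

# then open http://localhost:8888 in the client browser
```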
217 | 218 | **Disclaimer:** there are many other ways to get set this up: one may use ssh SOCKS proxy, initialize tunnel from the interactive job itself etc. And all the methods are OK if you can run it. 219 | 220 | ### Submitting a batch job 221 | 222 | Batch jobs can be used for any computations where you do not expect your code to crash. In other words, there is no easy way to interrupt or debug running batch job. 223 | 224 | Main command to submit a batch job: `sbatch `. 225 | 226 | The first part of the script consist of slurm preprocessing directives such as: 227 | 228 | ```bash 229 | #SBATCH --job-name=job_wgpu 230 | #SBATCH --open-mode=append 231 | #SBATCH --output=./%j_%x.out 232 | #SBATCH --error=./%j_%x.err 233 | #SBATCH --export=ALL 234 | #SBATCH --time=00:10:00 235 | #SBATCH --gres=gpu:1 236 | #SBATCH --constraint=gpu_12gb 237 | #SBATCH --mem=64G 238 | #SBATCH -c 4 239 | ``` 240 | 241 | **Important: do not forget to activate conda env before submitting a job, or make sure you do so in the script.** 242 | 243 | Similar to arguments we passed to `srun` during interactive job request, here we specify requirements for the batch job. 244 | 245 | After `#SBATCH` block one may execute any shell commands or run any script of your choice. 246 | 247 | **You can not mix `#SBATCH` lines with other commands, Slurm will not register any `#SBATCH` after the first regular (non-comment) command in the script.** 248 | 249 | To submit `job_wgpu` located in `gpu_job.slurm`, go to Cassio node and run: 250 | 251 | `sbatch gpu_job.slurm` 252 | 253 | #### What happens when you hit enter 254 | 255 | Slurm registers your job in the database with corresponding job id. The allocation may not happen instantly and the job will be positioned in the queue. 256 | 257 | One can get all available information about the job using: 258 | 259 | `scontrol show jobid -dd ` 260 | 261 | **One can only get information about pending or running job with the command above.** 262 | 263 | While a job is in the queue, one can hold it from allocation (and later release) using corresponding commands we checked above. 264 | 265 | ### Managing experiments, running grid search etc 266 | 267 | This is out of scope for today tutorial. 
268 | -------------------------------------------------------------------------------- /cassio/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --constraint=gpu_12gb 10 | #SBATCH --mem=64G 11 | #SBATCH -c 4 12 | 13 | python ./test_gpu.py 14 | -------------------------------------------------------------------------------- /cassio/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == '__main__': 5 | 6 | print(f"Torch cuda available: {torch.cuda.is_available()}") 7 | print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 | t1 = torch.randn(100,1000) 10 | t2 = torch.randn(1000,10000) 11 | 12 | cpu_start = time.time() 13 | 14 | for i in range(100): 15 | t = t1 @ t2 16 | 17 | cpu_end = time.time() 18 | 19 | print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 | t1 = t1.to('cuda') 22 | t2 = t2.to('cuda') 23 | 24 | gpu_start = time.time() 25 | 26 | for i in range(100): 27 | t = t1 @ t2 28 | 29 | gpu_end = time.time() 30 | 31 | print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | 33 | 34 | -------------------------------------------------------------------------------- /client/README.md: -------------------------------------------------------------------------------- 1 | # Client setup 2 | 3 | We define client as your own machine (PC, laptop etc.) which has a terminal with CLI and 4 | SSH client. 5 | 6 | You should be fine using any operating system as long as you are comfortable with it. 7 | 8 | ### MacOS, Linux 9 | SSH client should be ready to be used by default. 10 | 11 | ### Windows 10 12 | If you are using Windows 10, there are many options which you can google about, we highlight some of them here: 13 | 14 | 1. OpenSSH installed natively on Windows 10: [learn more](https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview) 15 | 16 | 2. Windows Subsystem for Linux (this way you get linux shell with some limitations): [learn more](https://docs.microsoft.com/en-us/windows/wsl/install-win10) 17 | 18 | 3. Third-party software with SSH functionality (e.g. PuTTY): [learn more](https://www.putty.org/) 19 | 20 | ## Pre-requisites for tutorial: 21 | 22 | * Make sure you have OpenSSH installed: commands `ssh`, `ssh-keygen` should work! 23 | 24 | * Create SSH keypair: [learn more](https://www.ssh.com/ssh/keygen/) 25 | 26 | * Make sure your `~/.ssh` folder exists and it has correct system permissions: [learn more](https://wiki.ruanbekker.com/index.php/Linux_SSH_Directory_Permissions) 27 | 28 | * Make sure you can connect to both CIMS access nodes and Prince login nodes: [CIMS](https://cims.nyu.edu/webapps/content/systems/userservices/netaccess/secure) [Prince](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/hpc-access) 29 | 30 | *We will not spend time on these pre-requisites during the tutorial* 31 | 32 | ## NYU MFA 33 | 34 | You should be aware of NYU multi-factor authentication already (which is used when you login to NYU services). 35 | 36 | CIMS uses MFA authorization when you connect to access node via SSH. Follow necessary steps to login. 
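
If you have not created a keypair yet (one of the pre-requisites above), a minimal sketch (the key type, path and comment are up to you):

```bash
# generate a keypair on the client; accept the default path or choose your own
ssh-keygen -t ed25519 -C "my-laptop"

# print the public key so you can paste it where the following sections ask for it
cat ~/.ssh/id_ed25519.pub
```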
37 | 38 | ### Using ssh key instead of MFA 39 | 40 | In order to login to access node without using MFA, append your public ssh key in `~/.ssh/authorized_keys_access` **ON ACCESS NODE** (create this file if needed). Set 600 permission mask to that file: `chmod 600 ~/.ssh/authorized_keys_access`. 41 | 42 | ## Creating SSH config file 43 | 44 | SSH config is used to simplify ssh connection's configuration (such as usernames, ssh-key paths etc). 45 | 46 | Make necessary edits to `config` file in current folder and copy it to `~/.ssh/config`. 47 | 48 | After this step you should be able to directly connect to cassio host by typing `ssh cassio` or to access node via `ssh cims`. 49 | -------------------------------------------------------------------------------- /client/config: -------------------------------------------------------------------------------- 1 | Host cims 2 | HostName access.cims.nyu.edu 3 | Port 22 4 | Compression yes 5 | User 6 | IdentityFile 7 | ServerAliveInterval 60 8 | 9 | Host cassio 10 | HostName cassio.cs.nyu.edu 11 | Port 22 12 | Compression yes 13 | User 14 | IdentityFile 15 | ForwardAgent yes 16 | ServerAliveInterval 60 17 | ProxyCommand ssh cims -W %h:%p 18 | 19 | Host prince 20 | HostName prince.hpc.nyu.edu 21 | Port 22 22 | Compression yes 23 | User # DIFFERENT USERNAME HERE GIVEN BY HPC NOT CIMS 24 | IdentityFile 25 | ServerAliveInterval 60 26 | ProxyCommand ssh cims -W %h:%p 27 | 28 | Host greene 29 | HostName log-2.hpc.nyu.edu 30 | Port 22 31 | Compression yes 32 | User # DIFFERENT USERNAME HERE GIVEN BY HPC NOT CIMS 33 | IdentityFile 34 | ForwardAgent yes 35 | ServerAliveInterval 60 36 | ProxyCommand ssh cims -W %h:%p 37 | -------------------------------------------------------------------------------- /greene/README.md: -------------------------------------------------------------------------------- 1 | # Greene tutorial 2 | 3 | This tutorial aims to help with your migration from the Prince cluster (retires early 2021). It assumes some knowledge of Slurm, ssh and terminal. It assumes that one can connect to Greene ([same way](https://github.com/nyu-dl/cluster-support/tree/master/client) as to Prince). 4 | 5 | Official Greene docs & overview: 6 | https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/greene 7 | https://sites.google.com/a/nyu.edu/nyu-hpc/systems/greene-cluster 8 | 9 | **Content:** 10 | 1. Greene login nodes 11 | 2. Cluster overview 12 | 3. Data migration from Prince 13 | 4. Singularity - running jobs within a container 14 | 5. Running a batch job with singularity 15 | 6. Setting up a simple sweep over a hyper-parameter using Slurm array job. 16 | 7. Port forwarding to Jupyter Lab 17 | 8. Making squashfs out of your dataset and accessing it while running your scripts. 18 | 9. Using a python-based [Submitit](https://github.com/facebookincubator/submitit) framework on Greene 19 | 20 | 21 | ## Greene login nodes 22 | 23 | * greene.hpc.nyu.edu (balancer) 24 | * log-1.hpc.nyu.edu 25 | * log-2.hpc.nyu.edu 26 | 27 | You have to use NYU HPC account to login, username is the same as your NetID. 28 | 29 | Login nodes are accessible from within NYU network, i.e. one can connect from CIMS `access1` node, check [ssh config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) for details. 30 | 31 | Login node should not be used to run anything related to your computations, use it for file management (`git`, `rsync`), jobs management (`srun`, `salloc`, `sbatch`). 
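
Assuming the `greene` entry from the [client config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) is in your `~/.ssh/config`, a typical session from your client looks like this (the destination path is only an example):

```bash
# log in (jumps through the CIMS access node via ProxyCommand)
ssh greene

# copy a local project folder to your scratch
rsync -chaPvz ./my_project greene:/scratch/<netid>/my_project/
```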
32 | 
33 | ## Cluster overview
34 | 
35 | |Number of nodes|CPU cores|GPU|RAM|
36 | |---------------|---------|---|---|
37 | |524|48|-|180G|
38 | |40|48|-|369G|
39 | |4|96|-|3014G **(3T)**|
40 | |10|48|4 x V100 32G (PCIe)|369G|
41 | |65|48|4 x **RTX8000 48G**|384G|
42 | 
43 | **quotas**:
44 | 
45 | |filesystem|env var|what for|flushed|quota|
46 | |----------|-------|--------|-------|-----|
47 | |/archive|$ARCHIVE|long term storage|NO|2TB/20K inodes|
48 | |/home|$HOME|probably nothing|NO|50GB/30K inodes|
49 | |/scratch|$SCRATCH|experiments/stuff|YES (60 days)|5TB/1M inodes|
50 | 
51 | ```
52 | echo $SCRATCH
53 | /scratch/ik1147
54 | ```
55 | 
56 | ### Differences with Prince
57 | 
58 | * Strong preference **for Singularity** / not modules.
59 | * GPU specification through `--gres=gpu:{rtx8000|v100}:1` (Prince uses/used partitions)
60 | * Low inode quota on home fs (~30k files)
61 | * No beegfs
62 | 
63 | ## Data migration from Prince
64 | 
65 | **All filesystems on Greene are brand new, i.e. nothing will be migrated from Prince automatically.**
66 | 
67 | ### Copying a folder from Prince to Greene
68 | 
69 | 1. Identify the absolute path of the dir on Prince you need to transfer: `realpath <prince_dir>`
70 | 2. Create the destination folder on Greene where you wish all the content from `<prince_dir>` to be copied to: `mkdir <greene_dir>`
71 | 
72 | You may start the transfer from either Prince or Greene (as you wish):
73 | * from Prince: `rsync -chaPvz <prince_dir>/ log-2.hpc.nyu.edu:<greene_dir>/` (notice the trailing slashes)
74 | * from Greene: `rsync -chaPvz prince.hpc.nyu.edu:<prince_dir>/ <greene_dir>/`
75 | 
76 | Start such a transfer within tmux/screen to avoid connectivity issues.
77 | 
78 | ## Singularity
79 | 
80 | One can read the white paper here: https://arxiv.org/pdf/1709.10140.pdf
81 | 
82 | The main idea of using a container is to provide an isolated user space on a compute node and to simplify the node management (security and updates).
83 | 
84 | ### Getting a container image
85 | 
86 | Each singularity container has a definition file (`.def`) and an image file (`.sif`). Let's consider the following container:
87 | 
88 | `/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.def`
89 | `/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif`
90 | 
91 | The definition file contains all the commands which are executed along the way when the image OS is created, take a look!
92 | 
93 | This particular image has CUDA 10.1 with cudnn 7 libs within Ubuntu 18.04.
94 | 
95 | Let's execute this container with singularity:
96 | 
97 | ```bash
98 | [ik1147@log-2 ~]$ singularity exec /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash
99 | Singularity> uname -a
100 | Linux log-2.nyu.cluster 4.18.0-193.28.1.el8_2.x86_64 #1 SMP Fri Oct 16 13:38:49 EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
101 | Singularity> lsb_release -a
102 | LSB Version: core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
103 | Distributor ID: Ubuntu
104 | Description: Ubuntu 18.04.5 LTS
105 | Release: 18.04
106 | Codename: bionic
107 | ```
108 | 
109 | Notice that we are still on the login node but within the Ubuntu container!
110 | 
111 | One can find more containers here:
112 | 
113 | `/scratch/work/public/singularity/`
114 | 
115 | Container files are read-only images, i.e. there is no need to copy the container over. Instead one may create a symlink for convenience. Please contact HPC if you need a container image with some specific software.
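
For example, a symlink with a shorter name (the link name and location are just a suggestion):

```bash
# keep a short alias to the public image in your scratch; nothing is copied
ln -s /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif $SCRATCH/cuda10.1.sif

singularity exec $SCRATCH/cuda10.1.sif /bin/bash
```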
116 | 117 | ### Setting up an overlay filesystem with your computing environment 118 | 119 | **Why?** The whole reason of using Singularity with overlay filesystem is *to reduce the impact on scratch filesystem* (remember how slow is it on Prince sometimes?). 120 | 121 | **High-level idea**: make a read-only filesystem which will be used exclusively to host your conda environment and other static files which you constantly re-use for each job. 122 | 123 | **How is it different from scratch?** overlayfs is a separate fs mounted on each compute node when you start the container while scratch is a shared GPFS accessed via network. 124 | 125 | There are two different modes when mounting the overlayfs: 126 | * read-write: use this one when setting up env (installing conda, libs, other static files) 127 | * read-only: use this one when running your jobs. It has to be read-only since multiple processes will access the same image. It will crash if any job has already mounted it as read-write. 128 | 129 | Setting up your fs image: 130 | 1. Copy the empty fs gzip to your scratch path (e.g. `/scratch//` or `$SCRATCH` for your root scratch): `cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz $SCRATCH/` 131 | 2. Unzip the archive: `gunzip -v $SCRATCH/overlay-50G-10M.ext3.gz` (can take a while to unzip...) 132 | 3. Execute container with overlayfs (check comment below about `rw` arg): `singularity exec --overlay $SCRATCH/overlay-50G-10M.ext3:rw /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 133 | 4. Check file systems: `df -h`. There will be a record: `overlay 53G 52M 50G 1% /`. The size equals to the filesystem image you chose. **The actual content of the image is mounted in `/ext3`.** 134 | 5. Create a file in overlayfs: `touch /ext3/testfile` 135 | 6. Exit from Singularity 136 | 137 | One has permission for file creation since the fs was mounted with `rw` arg. In contrast `ro` will mount it as read-only. 138 | 139 | Setting up conda environment: 140 | 141 | 1. Start a CPU (GPU if you want/need) job: `srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --gres=gpu:1 --pty /bin/bash` 142 | 2. Start singularity (notice `--nv` for GPU propagation): `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:rw /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 143 | 3. Install your conda env in `/ext3`: https://github.com/nyu-dl/cluster-support/tree/master/cassio#conda-environment. 144 | 3.1. `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh` 145 | 3.2. `bash ./Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3` 146 | 3.3. Install packages you are using (torch, jupyter, tensorflow etc.) 147 | 4. Exit singularity container (*not the CPU/GPU job*) 148 | 149 | By now you have a working conda environment located in a ext3 image filesystem here: `$SCRATCH/overlay-50G-10M.ext3`. 150 | 151 | Let's run the container with **read-only** layerfs now (notice `ro` arg there): 152 | `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 153 | 154 | ```bash 155 | Singularity> conda activate 156 | (base) Singularity> python 157 | Python 3.8.5 (default, Sep 4 2020, 07:30:14) 158 | [GCC 7.3.0] :: Anaconda, Inc. on linux 159 | Type "help", "copyright", "credits" or "license" for more information. 
160 | >>> import torch 161 | >>> torch.__version__ 162 | '1.7.0' 163 | >>> 164 | (base) Singularity> touch /ext3/test.txt 165 | touch: cannot touch '/ext3/test.txt': Read-only file system 166 | ``` 167 | 168 | As shown above, the conda env works fine while `/ext3` is not writable. 169 | 170 | **Caution:** by default conda and python keep cache in home: `~/.conda`, `~/.cache`. Since home fs has small quota, move your cache folder to scratch: 171 | 1. `mkdir $SCRATCH/python_cache` 172 | 2. `cd` 173 | 3. `ln -s $SCRATCH/python_cache/.cache` 174 | 175 | ``` 176 | .cache -> /scratch/ik1147/cache/ 177 | ``` 178 | 179 | ***From now on you may run your interactive coding/debugging sessions, congratulations!*** 180 | 181 | ## Running a batch job with Singularity 182 | 183 | files to use: 184 | * `gpu_job.slurm` 185 | * `test_gpu.py` 186 | 187 | There is one major detail with a batch job submission: sbatch script starts the singularity container with a `/bin/bash -c ""` tail, where `""` is whatever your job is running. 188 | 189 | In addition, conda has to be sourced and activated. 190 | 191 | Lets create helper script for conda activation: copy the code below in `/ext3/env.sh` 192 | 193 | ```bash= 194 | #!/bin/bash 195 | 196 | source /ext3/miniconda3/etc/profile.d/conda.sh 197 | export PATH=/ext3/miniconda3/bin:$PATH 198 | ``` 199 | 200 | This is an example batch job submission script (also as a file `gpu_job.slurm` in this repo folder): 201 | 202 | ```bash= 203 | #!/bin/bash 204 | #SBATCH --job-name=job_wgpu 205 | #SBATCH --open-mode=append 206 | #SBATCH --output=./%j_%x.out 207 | #SBATCH --error=./%j_%x.err 208 | #SBATCH --export=ALL 209 | #SBATCH --time=00:10:00 210 | #SBATCH --gres=gpu:1 211 | #SBATCH --mem=64G 212 | #SBATCH -c 4 213 | 214 | singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 215 | 216 | source /ext3/env.sh 217 | conda activate 218 | 219 | python ./test_gpu.py 220 | " 221 | ``` 222 | 223 | the output: 224 | ``` 225 | Torch cuda available: True 226 | GPU name: Quadro RTX 8000 227 | 228 | 229 | CPU matmul elapsed: 0.438138484954834 sec. 230 | GPU matmul elapsed: 0.2669832706451416 sec. 231 | ``` 232 | 233 | ***From now on you may run your experiments as a batch job, congratulations!*** 234 | 235 | 236 | ## Setting up a simple sweep over a hyper-parameter using Slurm array job 237 | 238 | files to use: 239 | * `sweep_job.slurm` 240 | * `test_sweep.py` 241 | 242 | Usually we may want to do multiple runs of the same job using different hyper params or learning rates etc. There are many cool frameworks to do that like Pytorch Lightning etc. Now we will look on the simplest (imo) version of such sweep construction. 243 | 244 | **High-level idea:** define a sweep over needed arguments and create a product as a list of all possible combinations (check `test_sweep.py` for details). In the end a specific config is mapped to its position in the list. The SLURM array id is used as a map to specific config in the sweep (check `sweep_job.slurm`). 245 | 246 | Notes: 247 | * notice no `--nv` in singularity call because we allocate CPU-only resources. 248 | * notice `$SLURM_ARRAY_TASK_ID`. For each specific step job (in range `#SBATCH --array=1-20`) this env var will have the corresponding value assigned. 
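
`test_sweep.py` is not reproduced in this README; the sketch below only illustrates the mapping idea with hypothetical hyper-parameter names (`lr`, `model_size`): the Slurm array index selects one entry from the product of all swept values.

```python
import argparse
import itertools


def get_config(sweep_step):
    grid = {
        "lr": [0.1, 0.01],          # hypothetical sweep values
        "model_size": [512, 1024],
    }
    # all combinations, in a fixed and reproducible order
    configs = [dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values())]
    # SLURM_ARRAY_TASK_ID starts at 1 (see `#SBATCH --array=1-20`), hence the -1;
    # with more array steps than configs, several steps map to the same config
    return configs[(sweep_step - 1) % len(configs)]


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--sweep_step", type=int, required=True)  # pass $SLURM_ARRAY_TASK_ID here
    args = parser.parse_args()
    print({"sweep_step": args.sweep_step, **get_config(args.sweep_step)})
```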
249 | 250 | Submitting a sweep job: 251 | `sbatch sweep_job.slurm` 252 | 253 | Outputs from all completed jobs: 254 | ``` 255 | cat *out 256 | {'sweep_step': 20, 'seed': 682, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 257 | {'sweep_step': 1, 'seed': 936, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 258 | {'sweep_step': 2, 'seed': 492, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 259 | {'sweep_step': 3, 'seed': 52, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 260 | {'sweep_step': 4, 'seed': 691, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 261 | {'sweep_step': 5, 'seed': 826, 'device': 'cpu', 'lr': 0.1, 'model_size': 512, 'some_fixed_arg': 0} 262 | {'sweep_step': 6, 'seed': 152, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 263 | {'sweep_step': 7, 'seed': 626, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 264 | {'sweep_step': 8, 'seed': 495, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 265 | {'sweep_step': 9, 'seed': 703, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 266 | {'sweep_step': 10, 'seed': 911, 'device': 'cpu', 'lr': 0.1, 'model_size': 1024, 'some_fixed_arg': 0} 267 | {'sweep_step': 11, 'seed': 362, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 268 | {'sweep_step': 12, 'seed': 481, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 269 | {'sweep_step': 13, 'seed': 618, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 270 | {'sweep_step': 14, 'seed': 449, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 271 | {'sweep_step': 15, 'seed': 840, 'device': 'cpu', 'lr': 0.01, 'model_size': 512, 'some_fixed_arg': 0} 272 | {'sweep_step': 16, 'seed': 304, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 273 | {'sweep_step': 17, 'seed': 499, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 274 | {'sweep_step': 18, 'seed': 429, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 275 | {'sweep_step': 19, 'seed': 932, 'device': 'cpu', 'lr': 0.01, 'model_size': 1024, 'some_fixed_arg': 0} 276 | ``` 277 | 278 | **Caution:** be careful with checkpointing when running a sweep. Make sure that your checkpoints/logs have corresponding sweep step in the filename or so (to avoid file clashes). 279 | 280 | ***From now on you may run your own sweeps, congratulations!*** 281 | 282 | ## Port forwarding to Jupyter Lab 283 | 284 | ### Part 1: launching JupyterLab on Greene. 285 | 286 | 1. Launch interactive job with a gpu: `srun --nodes=1 --tasks-per-node=1 --cpus-per-task=1 --mem=32GB --time=1:00:00 --gres=gpu:1 --pty /bin/bash` 287 | 2. Execute singularity container: `singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash` 288 | 3. Activate conda (base env in this case): `conda activate` 289 | 4. 
Start jupyter lab: `jupyter lab --ip 0.0.0.0 --port 8965 --no-browser` 290 | 291 | Expected output: 292 | 293 | ``` 294 | [I 15:51:47.644 LabApp] JupyterLab extension loaded from /ext3/miniconda3/lib/python3.8/site-packages/jupyterlab 295 | [I 15:51:47.644 LabApp] JupyterLab application directory is /ext3/miniconda3/share/jupyter/lab 296 | [I 15:51:47.646 LabApp] Serving notebooks from local directory: /home/ik1147 297 | [I 15:51:47.646 LabApp] Jupyter Notebook 6.1.4 is running at: 298 | [I 15:51:47.646 LabApp] http://gr031.nyu.cluster:8965/ 299 | [I 15:51:47.646 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). 300 | ``` 301 | 302 | Note the hostname of the node we run from here: `http://gr031.nyu.cluster:8965/` 303 | 304 | ### Part 2: Forwarding the connection from you local client to Greene. 305 | 306 | [Check here the ssh config](https://github.com/nyu-dl/cluster-support/blob/master/client/config) with the greene host to make the port forwarding easier to setup. 307 | 308 | To start the tunnel, run: `ssh -L 8965:gr031.nyu.cluster:8965 greene -N` 309 | 310 | **Note the hostname should be equal to the node where jupyter is running. Same with the port.** 311 | 312 | Press Ctrl-C if you want to stop the port forwarding SSH connection. 313 | 314 | ***From now on you may run your own Jupyter Lab with RTX8000/V100, congratulations!*** 315 | 316 | ## SquashFS for your read-only files 317 | 318 | Here we will convert read-only file such as data to a SquashFS file which reduces the inode load. It also compresses the data so you save on disk space as well. 319 | 320 | Imp note 1: Please first check if the dataset you need is present in `/scratch/work/public/datasets/'. If you are using datasets that you think are useful to many others, please send a mail to HPC so that they can make a common SquashFS in /work/public/. This tutorial was assuming you have some specific folder which you frequently use that has many files. 321 | 322 | 1. Check your current quota usage using `myquota` 323 | 2. Go to your folder that contains your datasets and you can check the number of files using 324 | ` for d in $(find $(pwd) -maxdepth 1 -mindepth 1 -type d); do n_files=$(find $d | wc -l); echo $d " " $n_files; done` 325 | 3. I will assume the folder that we want to convert is called DatasetX which contains many files (also handles subfolders with more files). Let's first make an env variable which points to your dataset as follows: 326 | `export DATASET_PATH=/path/to/DatasetX` , `export DATASET_NAME=DatasetX` and `export SINGULARITY_DATASET_PATH=/${DATASET_NAME}` 327 | Next run: 328 | `singularity exec --bind ${DATASET_PATH}:${SINGULARITY_DATASET_PATH}:ro /scratch/work/public/singularity/centos-8.2.2004.sif find ${SINGULARITY_DATASET_PATH} | wc -l` 329 | Here you should be able to see the number of files in DatasetX. 330 | 4. Make a folder in your scratch where you want to store your SquashFS files. Move to this folder. 331 | 5. Now we will run `mksquashfs`. 332 | `singularity exec --bind ${DATASET_PATH}:${SINGULARITY_DATASET_PATH}:ro /scratch/work/public/singularity/centos-8.2.2004.sif mksquashfs ${SINGULARITY_DATASET_PATH} ${DATASET_NAME}.sqf -keep-as-directory` 333 | 6. Now we have SquashFS file DatasetX.sqf, it will have the same number of files inside, but the disk usage will be lower. 
We can check this as follows: 334 | `ls -lh ${DATASET_NAME}.sqf` 335 | and 336 | `du -sh ${DATASET_PATH}` 337 | 338 | Now let's make sure we can reach the contents of this SquashFS file : 339 | 1. Start the singularity container with the SquashFS as an overlay. 340 | `singularity exec --overlay ${DATASET_NAME}.sqf:ro /scratch/work/public/singularity/cuda11.0-cudnn8-devel-ubuntu18.04.sif /bin/bash` 341 | 2. Go to the following location and you should see your dataset within the container. 342 | `cd /DatasetX` 343 | 344 | So just as we have done before, this is just an additional overlay that you will add to your Singularity command when running your job :) 345 | 346 | Imp note 2: After you confirm the SquashFS file is good, you can delete the folder {insert path to folder enclosing DatasetX}/DatasetX to save inode! :D 347 | 348 | 349 | 350 | ## Submitit on greene (Advanced) 351 | 352 | Submitit is a python wrapper that makes it super easy to submit jobs, sweeps and handles timeouts and restarts for your job. 353 | For examples and docs please see [Submitit](https://github.com/facebookincubator/submitit). 354 | 355 | Here I will assume you have already used submitit as part of your workflow and will only mention what needs to be added so you can use it with Singularity and SquashFS. 356 | 357 | We will use the following as our running example: [detr](https://github.com/facebookresearch/detr) 358 | 359 | 1. Clone the repo `git clone https://github.com/facebookresearch/detr.git` 360 | 2. Make a folder called experiments in your scratch. 361 | 3. The file changes you need are available in the folder submitit_example. 362 | - Make sure you have an overlay ready which has a conda installation as described earlier in the tutorial. Your /ext3/ folder should contain an env.sh file such as the example provided in the submitit_example folder. 363 | - You will need to copy the slurm.py and python-greene files into your desired location. 364 | - `slurm.py`: This file has the necessary changes that allow the singularity to know about the network details such as port address etc. 365 | - `python-greene` : this is an executable file that provides you with the ability to use singularity with submitit. 366 | Lines 57 and 58 in the python-greene are currently set to use two overlays - one has my conda installation and the other is the SquashFS with the dataset. 367 | *For this example, you directly use this file. But for your experiments, you will change line 57 and 58 to point to the right location.* 368 | Line 59 binds the slurm.py file that you have copied, to the one that exists internally in the submitit package. 369 | 370 | 4. Replace the `run_with_submitit.py` file in the detr repo you cloned with the provided one in the submitit_example folder. This file has cluster specific arguments. In lines 31 and 32, add your username so that all the jobs can reach your shared experiment folder. 371 | You are now ready to run your job! 
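
A rough sketch of the submission itself, assuming the edited `run_with_submitit.py`, `python-greene` and `slurm.py` sit inside the cloned `detr` folder and that the job is launched through the `python-greene` wrapper (which exports `MYPYTHON` for the patched `slurm.py`); flags other than `--ngpus`/`--nodes` keep their defaults:

```bash
cd detr
chmod +x python-greene   # make the wrapper executable

# submit a 1-node, 2-GPU run; the executor in run_with_submitit.py requests rtx8000 GPUs
./python-greene run_with_submitit.py --ngpus 2 --nodes 1
```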
372 | 373 | (I have also done all the above steps and provided the folder `/scratch/ask762/tutorial` where the only thing you need to change is line 31 & 32 in `detr/run_with_submitit.py` to have your username and you should be able to submit a job :) -------------------------------------------------------------------------------- /greene/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --mem=64G 10 | #SBATCH -c 4 11 | 12 | singularity exec --nv --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 13 | 14 | source /ext3/env.sh 15 | conda activate 16 | 17 | python ./test_gpu.py 18 | " 19 | -------------------------------------------------------------------------------- /greene/submitit_example/env.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | source /ext3/miniconda3/etc/profile.d/conda.sh 4 | export PATH=/ext3/miniconda3/bin:$PATH -------------------------------------------------------------------------------- /greene/submitit_example/python-greene: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | # https://stackoverflow.com/questions/1668649/how-to-keep-quotes-in-bash-arguments 4 | args='' 5 | for i in "$@"; do 6 | i="${i//\\/\\\\}" 7 | args="$args \"${i//\"/\\\"}\"" 8 | done 9 | 10 | module purge 11 | 12 | export PATH=/share/apps/singularity/bin:$PATH 13 | 14 | # file systems 15 | export SINGULARITY_BINDPATH=/mnt,/scratch,/share/apps 16 | if [ -d /state/partition1 ]; then 17 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/state/partition1 18 | fi 19 | 20 | # IB related drivers and libraries 21 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/etc/libibverbs.d,/usr/lib64/libibverbs,/usr/lib64/ucx,/usr/include/infiniband,/lib64/libibverbs,/usr/include/rdma 22 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,$(echo /usr/bin/ib*_* /usr/sbin/ib* \ 23 | /usr/lib64/libpmi* \ 24 | /lib64/libmlx*.so* \ 25 | /lib64/libib*.so* \ 26 | /lib64/libnl* \ 27 | /usr/include/uc[m,p,s,t] \ 28 | /usr/lib64/libuc[m,p,s,t].so* \ 29 | /lib64/libosmcomp.so* \ 30 | /lib64/librdmacm*.so* | sed -e 's/ /,/g') 31 | 32 | if [ -d /opt/slurm/lib64 ]; then 33 | export SINGULARITY_CONTAINLIBS=$(echo /opt/slurm/lib64/libpmi* | xargs | sed -e 's/ /,/g') 34 | fi 35 | 36 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/opt/slurm,/usr/lib64/libmunge.so.2.0.0,/usr/lib64/libmunge.so.2,/var/run/munge,/etc/passwd 37 | 38 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/usr/bin/uuidgen 39 | 40 | export SINGULARITYENV_PREPEND_PATH=/opt/slurm/bin 41 | 42 | export SINGULARITYENV_LD_LIBRARY_PATH=/lib64:/lib64/libibverbs 43 | 44 | export SINGULARITY_CONTAINLIBS=/lib64/libmnl.so.0 45 | export SINGULARITY_BINDPATH=$SINGULARITY_BINDPATH,/usr/sbin/ip,/usr/bin/strace,/usr/sbin/ifconfig 46 | 47 | nv="" 48 | if [[ "$(hostname -s)" =~ ^g ]]; then nv="--nv"; fi 49 | 50 | export MYPYTHON=$(readlink -e $0) 51 | 52 | #if [ "$SLURM_JOBID" != "" ]; then strace="strace"; fi 53 | 54 | sif=/scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04.sif 55 | 56 | singularity exec $nv \ 57 | --overlay /scratch/ask762/tutorial/pytorch1.6.0-cuda10.1.ext3:ro 
\ 58 | --overlay /scratch/ask762/tutorial/multimodal_datasets.sqf:ro \ 59 | --bind /scratch/ask762/tutorial/slurm.py:/ext3/miniconda3/lib/python3.8/site-packages/submitit/slurm/slurm.py:ro \ 60 | $sif \ 61 | /bin/bash -c " 62 | source /ext3/env.sh 63 | \$(which python) $args 64 | exit 65 | " -------------------------------------------------------------------------------- /greene/submitit_example/run_with_submitit.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved 2 | """ 3 | A script to run multinode training with submitit. 4 | """ 5 | import argparse 6 | import os 7 | import uuid 8 | import copy 9 | import itertools 10 | from typing import Dict 11 | from collections.abc import Iterable 12 | from pathlib import Path 13 | 14 | import main as detection 15 | import submitit 16 | 17 | def parse_args(): 18 | detection_parser = detection.get_args_parser() 19 | parser = argparse.ArgumentParser("Submitit for detection", parents=[detection_parser]) 20 | 21 | parser.add_argument("--ngpus", default=2, type=int, help="Number of gpus to request on each node") 22 | parser.add_argument("--nodes", default=2, type=int, help="Number of nodes to request") 23 | parser.add_argument("--timeout", default=3000, type=int, help="Duration of the job") 24 | parser.add_argument("--job_dir", default="", type=str, help="Job dir. Leave empty for automatic.") 25 | parser.add_argument("--mail", default="", type=str, 26 | help="Email this user when the job finishes if specified") 27 | return parser.parse_args() 28 | 29 | 30 | def get_shared_folder(args) -> Path: 31 | if Path('/scratch/{your_username}/experiments/').is_dir(): 32 | p = Path('/scratch/{your_username}/experiments/') 33 | p.mkdir(exist_ok=True) 34 | return p 35 | 36 | raise RuntimeError("No shared folder available") 37 | 38 | def get_init_file(args): 39 | # Init file must not exist, but it's parent dir must exist. 
40 | os.makedirs(str(get_shared_folder(args)), exist_ok=True) 41 | init_file = get_shared_folder(args) / f"{uuid.uuid4().hex}_init" 42 | if init_file.exists(): 43 | os.remove(str(init_file)) 44 | return init_file 45 | 46 | 47 | def grid_parameters(grid: Dict): 48 | """ 49 | Yield all combinations of parameters in the grid (as a dict) 50 | """ 51 | grid_copy = dict(grid) 52 | # Turn single value in an Iterable 53 | for k in grid_copy: 54 | if not isinstance(grid_copy[k], Iterable): 55 | grid_copy[k] = [grid_copy[k]] 56 | for p in itertools.product(*grid_copy.values()): 57 | yield dict(zip(grid.keys(), p)) 58 | 59 | 60 | def sweep(executor: submitit.Executor, args: argparse.ArgumentParser, hyper_parameters: Iterable): 61 | jobs = [] 62 | with executor.batch(): 63 | for grid_data in hyper_parameters: 64 | tmp_args = copy.deepcopy(args) 65 | tmp_args.dist_url = get_init_file(args).as_uri() 66 | tmp_args.output_dir = args.job_dir 67 | for k, v in grid_data.items(): 68 | assert hasattr(tmp_args, k) 69 | setattr(tmp_args, k, v) 70 | trainer = Trainer(tmp_args) 71 | job = executor.submit(trainer) 72 | jobs.append(job) 73 | print('Sweep job ids:', [job.job_id for job in jobs]) 74 | 75 | 76 | class Trainer(object): 77 | def __init__(self, args): 78 | self.args = args 79 | 80 | def __call__(self): 81 | import os 82 | import sys 83 | import detection as detection 84 | 85 | self._setup_gpu_args() 86 | detection.main(self.args) 87 | 88 | def checkpoint(self): 89 | import os 90 | import submitit 91 | from pathlib import Path 92 | 93 | self.args.dist_url = get_init_file(self.args).as_uri() 94 | checkpoint_file = os.path.join(self.args.output_dir, "checkpoint.pth") 95 | if os.path.exists(checkpoint_file): 96 | self.args.resume = checkpoint_file 97 | print("Requeuing ", self.args) 98 | empty_trainer = type(self)(self.args) 99 | return submitit.helpers.DelayedSubmission(empty_trainer) 100 | 101 | def _setup_gpu_args(self): 102 | import submitit 103 | from pathlib import Path 104 | 105 | job_env = submitit.JobEnvironment() 106 | self.args.output_dir = Path(str(self.args.output_dir).replace("%j", str(job_env.job_id))) 107 | self.args.gpu = job_env.local_rank 108 | self.args.rank = job_env.global_rank 109 | self.args.world_size = job_env.num_tasks 110 | print(f"Process group: {job_env.num_tasks} tasks, rank: {job_env.global_rank}") 111 | 112 | 113 | def main(): 114 | args = parse_args() 115 | if args.job_dir == "": 116 | args.job_dir = get_shared_folder(args) / "%j" 117 | 118 | # Note that the folder will depend on the job_id, to easily track experimen`ts 119 | executor = submitit.AutoExecutor(folder=args.job_dir, slurm_max_num_timeout=30) 120 | 121 | # cluster setup is defined by environment variables 122 | num_gpus_per_node = args.ngpus 123 | nodes = args.nodes 124 | timeout_min = args.timeout 125 | kwargs = {} 126 | if args.use_volta32: 127 | kwargs['constraint'] = 'volta32gb' 128 | if args.comment: 129 | kwargs['comment'] = args.comment 130 | 131 | executor.update_parameters( 132 | mem_gb=40 * num_gpus_per_node, 133 | tasks_per_node=num_gpus_per_node, # one task per GPU 134 | cpus_per_task=10, 135 | nodes=nodes, 136 | timeout_min=10080, # max is 60 * 72 137 | # Below are cluster dependent parameters 138 | slurm_gres=f"gpu:rtx8000:{num_gpus_per_node}", #you can choose to comment this, or change it to v100 as per your need 139 | slurm_signal_delay_s=120, 140 | **kwargs 141 | ) 142 | 143 | executor.update_parameters(name="detectransformer") 144 | if args.mail: 145 | executor.update_parameters( 146 | 
additional_parameters={'mail-user': args.mail, 'mail-type': 'END'}) 147 | 148 | executor.update_parameters(slurm_additional_parameters={ 149 | 'gres-flags': 'enforce-binding' 150 | }) 151 | 152 | args.dist_url = get_init_file(args).as_uri() 153 | args.output_dir = args.job_dir 154 | 155 | trainer = Trainer(args) 156 | job = executor.submit(trainer) 157 | 158 | print("Submitted job_id:", job.job_id) 159 | 160 | 161 | if __name__ == "__main__": 162 | main() 163 | -------------------------------------------------------------------------------- /greene/submitit_example/slurm.py: -------------------------------------------------------------------------------- 1 | # Copyright (c) Facebook, Inc. and its affiliates. 2 | # 3 | # This source code is licensed under the MIT license found in the 4 | # LICENSE file in the root directory of this source tree. 5 | # 6 | 7 | import functools 8 | import inspect 9 | import os 10 | import re 11 | import shutil 12 | import subprocess 13 | import sys 14 | import typing as tp 15 | import uuid 16 | import warnings 17 | from pathlib import Path 18 | from typing import Any, Dict, List, Optional, Set, Tuple, Union 19 | 20 | from ..core import core, job_environment, logger, utils 21 | 22 | 23 | def read_job_id(job_id: str) -> tp.List[Tuple[str, ...]]: 24 | """Reads formated job id and returns a tuple with format: 25 | (main_id, [array_index, [final_array_index]) 26 | """ 27 | pattern = r"(?P\d+)_\[(?P(\d+(-\d+)?(,)?)+)(\%\d+)?\]" 28 | match = re.search(pattern, job_id) 29 | if match is not None: 30 | main = match.group("main_id") 31 | array_ranges = match.group("arrays").split(",") 32 | return [tuple([main] + array_range.split("-")) for array_range in array_ranges] 33 | else: 34 | main_id, *array_id = job_id.split("_", 1) 35 | if not array_id: 36 | return [(main_id,)] 37 | # there is an array 38 | array_num = str(int(array_id[0])) # trying to cast to int to make sure we understand 39 | return [(main_id, array_num)] 40 | 41 | 42 | class SlurmInfoWatcher(core.InfoWatcher): 43 | def _make_command(self) -> Optional[List[str]]: 44 | # asking for array id will return all status 45 | # on the other end, asking for each and every one of them individually takes a huge amount of time 46 | to_check = {x.split("_")[0] for x in self._registered - self._finished} 47 | if not to_check: 48 | return None 49 | command = ["sacct", "-o", "JobID,State", "--parsable2"] 50 | for jid in to_check: 51 | command.extend(["-j", str(jid)]) 52 | return command 53 | 54 | def get_state(self, job_id: str, mode: str = "standard") -> str: 55 | """Returns the state of the job 56 | State of finished jobs are cached (use watcher.clear() to remove all cache) 57 | 58 | Parameters 59 | ---------- 60 | job_id: int 61 | id of the job on the cluster 62 | mode: str 63 | one of "force" (forces a call), "standard" (calls regularly) or "cache" (does not call) 64 | """ 65 | info = self.get_info(job_id, mode=mode) 66 | return info.get("State") or "UNKNOWN" 67 | 68 | def read_info(self, string: Union[bytes, str]) -> Dict[str, Dict[str, str]]: 69 | """Reads the output of sacct and returns a dictionary containing main information 70 | """ 71 | if not isinstance(string, str): 72 | string = string.decode() 73 | lines = string.splitlines() 74 | if len(lines) < 2: 75 | return {} # one job id does not exist (yet) 76 | names = lines[0].split("|") 77 | # read all lines 78 | all_stats: Dict[str, Dict[str, str]] = {} 79 | for line in lines[1:]: 80 | stats = {x: y.strip() for x, y in zip(names, line.split("|"))} 81 | job_id = 
stats["JobID"] 82 | if not job_id or "." in job_id: 83 | continue 84 | try: 85 | multi_split_job_id = read_job_id(job_id) 86 | except Exception as e: 87 | # Array id are sometimes displayed with weird chars 88 | warnings.warn( 89 | f"Could not interpret {job_id} correctly (please open an issue):\n{e}", DeprecationWarning 90 | ) 91 | continue 92 | for split_job_id in multi_split_job_id: 93 | all_stats[ 94 | "_".join(split_job_id[:2]) 95 | ] = stats # this works for simple jobs, or job array unique instance 96 | # then, deal with ranges: 97 | if len(split_job_id) >= 3: 98 | for index in range(int(split_job_id[1]), int(split_job_id[2]) + 1): 99 | all_stats[f"{split_job_id[0]}_{index}"] = stats 100 | return all_stats 101 | 102 | 103 | class SlurmJob(core.Job[core.R]): 104 | 105 | _cancel_command = "scancel" 106 | watcher = SlurmInfoWatcher(delay_s=600) 107 | 108 | def _interrupt(self, timeout: bool = False) -> None: 109 | """Sends preemption or timeout signal to the job (for testing purpose) 110 | 111 | Parameter 112 | --------- 113 | timeout: bool 114 | Whether to trigger a job time-out (if False, it triggers preemption) 115 | """ 116 | cmd = ["scancel", self.job_id, "--signal"] 117 | # in case of preemption, SIGTERM is sent first 118 | if not timeout: 119 | subprocess.check_call(cmd + ["SIGTERM"]) 120 | subprocess.check_call(cmd + ["USR1"]) 121 | 122 | 123 | class SlurmParseException(Exception): 124 | pass 125 | 126 | 127 | def _expand_id_suffix(suffix_parts: str) -> List[str]: 128 | """Parse the a suffix formatted like "1-3,5,8" into 129 | the list of numeric values 1,2,3,5,8. 130 | """ 131 | suffixes = [] 132 | for suffix_part in suffix_parts.split(","): 133 | if "-" in suffix_part: 134 | low, high = suffix_part.split("-") 135 | int_length = len(low) 136 | for num in range(int(low), int(high) + 1): 137 | suffixes.append(f"{num:0{int_length}}") 138 | else: 139 | suffixes.append(suffix_part) 140 | return suffixes 141 | 142 | 143 | def _parse_node_group(node_list: str, pos: int, parsed: List[str]) -> int: 144 | """Parse a node group of the form PREFIX[1-3,5,8] and return 145 | the position in the string at which the parsing stopped 146 | """ 147 | prefixes = [""] 148 | while pos < len(node_list): 149 | c = node_list[pos] 150 | if c == ",": 151 | parsed.extend(prefixes) 152 | return pos + 1 153 | if c == "[": 154 | last_pos = node_list.index("]", pos) 155 | suffixes = _expand_id_suffix(node_list[pos + 1 : last_pos]) 156 | prefixes = [prefix + suffix for prefix in prefixes for suffix in suffixes] 157 | pos = last_pos + 1 158 | else: 159 | for i, prefix in enumerate(prefixes): 160 | prefixes[i] = prefix + c 161 | pos += 1 162 | parsed.extend(prefixes) 163 | return pos 164 | 165 | 166 | def _parse_node_list(node_list: str): 167 | try: 168 | pos = 0 169 | parsed: List[str] = [] 170 | while pos < len(node_list): 171 | pos = _parse_node_group(node_list, pos, parsed) 172 | return parsed 173 | except ValueError as e: 174 | raise SlurmParseException(f"Unrecognized format for SLURM_JOB_NODELIST: '{node_list}'", e) from e 175 | 176 | 177 | class SlurmJobEnvironment(job_environment.JobEnvironment): 178 | 179 | _env = { 180 | "job_id": "SLURM_JOB_ID", 181 | "num_tasks": "SLURM_NTASKS", 182 | "num_nodes": "SLURM_JOB_NUM_NODES", 183 | "node": "SLURM_NODEID", 184 | "nodes": "SLURM_JOB_NODELIST", 185 | "global_rank": "SLURM_PROCID", 186 | "local_rank": "SLURM_LOCALID", 187 | "array_job_id": "SLURM_ARRAY_JOB_ID", 188 | "array_task_id": "SLURM_ARRAY_TASK_ID", 189 | } 190 | 191 | def _requeue(self, countdown: 
int) -> None: 192 | jid = self.job_id 193 | subprocess.check_call(["scontrol", "requeue", jid]) 194 | logger.get_logger().info(f"Requeued job {jid} ({countdown} remaining timeouts)") 195 | 196 | @property 197 | def hostnames(self) -> List[str]: 198 | """Parse the content of the "SLURM_JOB_NODELIST" environment variable, 199 | which gives access to the list of hostnames that are part of the current job. 200 | 201 | In SLURM, the node list is formatted NODE_GROUP_1,NODE_GROUP_2,...,NODE_GROUP_N 202 | where each node group is formatted as: PREFIX[1-3,5,8] to define the hosts: 203 | [PREFIX1, PREFIX2, PREFIX3, PREFIX5, PREFIX8]. 204 | 205 | Link: https://hpcc.umd.edu/hpcc/help/slurmenv.html 206 | """ 207 | 208 | node_list = os.environ.get(self._env["nodes"], "") 209 | if not node_list: 210 | return [self.hostname] 211 | return _parse_node_list(node_list) 212 | 213 | 214 | class SlurmExecutor(core.PicklingExecutor): 215 | """Slurm job executor 216 | This class is used to hold the parameters to run a job on slurm. 217 | In practice, it will create a batch file in the specified directory for each job, 218 | and pickle the task function and parameters. At completion, the job will also pickle 219 | the output. Logs are also dumped in the same directory. 220 | 221 | Parameters 222 | ---------- 223 | folder: Path/str 224 | folder for storing job submission/output and logs. 225 | max_num_timeout: int 226 | Maximum number of time the job can be requeued after timeout (if 227 | the instance is derived from helpers.Checkpointable) 228 | 229 | Note 230 | ---- 231 | - be aware that the log/output folder will be full of logs and pickled objects very fast, 232 | it may need cleaning. 233 | - the folder needs to point to a directory shared through the cluster. This is typically 234 | not the case for your tmp! If you try to use it, slurm will fail silently (since it 235 | will not even be able to log stderr. 236 | - use update_parameters to specify custom parameters (n_gpus etc...). If you 237 | input erroneous parameters, an error will print all parameters available for you. 238 | """ 239 | 240 | job_class = SlurmJob 241 | 242 | def __init__(self, folder: Union[Path, str], max_num_timeout: int = 3) -> None: 243 | super().__init__(folder, max_num_timeout) 244 | if not self.affinity() > 0: 245 | raise RuntimeError('Could not detect "srun", are you indeed on a slurm cluster?') 246 | 247 | @classmethod 248 | def _equivalence_dict(cls) -> core.EquivalenceDict: 249 | return { 250 | "name": "job_name", 251 | "timeout_min": "time", 252 | "mem_gb": "mem", 253 | "nodes": "nodes", 254 | "cpus_per_task": "cpus_per_task", 255 | "gpus_per_node": "gpus_per_node", 256 | "tasks_per_node": "ntasks_per_node", 257 | } 258 | 259 | @classmethod 260 | def _valid_parameters(cls) -> Set[str]: 261 | """Parameters that can be set through update_parameters 262 | """ 263 | return set(_get_default_parameters()) 264 | 265 | def _convert_parameters(self, params: Dict[str, Any]) -> Dict[str, Any]: 266 | params = super()._convert_parameters(params) 267 | # replace type in some cases 268 | if "mem" in params: 269 | params["mem"] = f"{params['mem']}GB" 270 | return params 271 | 272 | def _internal_update_parameters(self, **kwargs: Any) -> None: 273 | """Updates sbatch submission file parameters 274 | 275 | Parameters 276 | ---------- 277 | See slurm documentation for most parameters. 
278 | Most useful parameters are: time, mem, gpus_per_node, cpus_per_task, partition 279 | Below are the parameters that differ from slurm documentation: 280 | 281 | signal_delay_s: int 282 | delay between the kill signal and the actual kill of the slurm job. 283 | setup: list 284 | a list of command to run in sbatch befure running srun 285 | array_parallelism: int 286 | number of map tasks that will be executed in parallel 287 | 288 | Raises 289 | ------ 290 | ValueError 291 | In case an erroneous keyword argument is added, a list of all eligible parameters 292 | is printed, with their default values 293 | 294 | Note 295 | ---- 296 | Best practice (as far as Quip is concerned): cpus_per_task=2x (number of data workers + gpus_per_task) 297 | You can use cpus_per_gpu=2 (requires using gpus_per_task and not gpus_per_node) 298 | """ 299 | defaults = _get_default_parameters() 300 | in_valid_parameters = sorted(set(kwargs) - set(defaults)) 301 | if in_valid_parameters: 302 | string = "\n - ".join(f"{x} (default: {repr(y)})" for x, y in sorted(defaults.items())) 303 | raise ValueError( 304 | f"Unavailable parameter(s): {in_valid_parameters}\nValid parameters are:\n - {string}" 305 | ) 306 | # check that new parameters are correct 307 | _make_sbatch_string(command="nothing to do", folder=self.folder, **kwargs) 308 | super()._internal_update_parameters(**kwargs) 309 | 310 | def _internal_process_submissions( 311 | self, delayed_submissions: tp.List[utils.DelayedSubmission] 312 | ) -> tp.List[core.Job[tp.Any]]: 313 | if len(delayed_submissions) == 1: 314 | return super()._internal_process_submissions(delayed_submissions) 315 | # array 316 | folder = utils.JobPaths.get_first_id_independent_folder(self.folder) 317 | folder.mkdir(parents=True, exist_ok=True) 318 | pickle_paths = [] 319 | for d in delayed_submissions: 320 | pickle_path = folder / f"{uuid.uuid4().hex}.pkl" 321 | d.timeout_countdown = self.max_num_timeout 322 | d.dump(pickle_path) 323 | pickle_paths.append(pickle_path) 324 | n = len(delayed_submissions) 325 | # Make a copy of the executor, since we don't want other jobs to be 326 | # scheduled as arrays. 
327 | array_ex = SlurmExecutor(self.folder, self.max_num_timeout) 328 | array_ex.update_parameters(**self.parameters) 329 | array_ex.parameters["map_count"] = n 330 | self._throttle() 331 | 332 | first_job: core.Job[tp.Any] = array_ex._submit_command(self._submitit_command_str) 333 | tasks_ids = list(range(first_job.num_tasks)) 334 | jobs: List[core.Job[tp.Any]] = [ 335 | SlurmJob(folder=self.folder, job_id=f"{first_job.job_id}_{a}", tasks=tasks_ids) for a in range(n) 336 | ] 337 | for job, pickle_path in zip(jobs, pickle_paths): 338 | job.paths.move_temporary_file(pickle_path, "submitted_pickle") 339 | return jobs 340 | 341 | @property 342 | def _submitit_command_str(self) -> str: 343 | # make sure to use the current executable (required in pycharm) 344 | #return f"{sys.executable} -u -m submitit.core._submit '{self.folder}'" 345 | python = os.getenv("MYPYTHON") 346 | if not python : python = f"{sys.executable}" 347 | return python + f" -u -m submitit.core._submit '{self.folder}'" 348 | 349 | def _make_submission_file_text(self, command: str, uid: str) -> str: 350 | return _make_sbatch_string(command=command, folder=self.folder, **self.parameters) 351 | 352 | def _num_tasks(self) -> int: 353 | nodes: int = self.parameters.get("nodes", 1) 354 | tasks_per_node: int = self.parameters.get("ntasks_per_node", 1) 355 | return nodes * tasks_per_node 356 | 357 | def _make_submission_command(self, submission_file_path: Path) -> List[str]: 358 | return ["sbatch", str(submission_file_path)] 359 | 360 | @staticmethod 361 | def _get_job_id_from_submission_command(string: Union[bytes, str]) -> str: 362 | """Returns the job ID from the output of sbatch string 363 | """ 364 | if not isinstance(string, str): 365 | string = string.decode() 366 | output = re.search(r"job (?P[0-9]+)", string) 367 | if output is None: 368 | raise utils.FailedSubmissionError( 369 | 'Could not make sense of sbatch output "{}"\n'.format(string) 370 | + "Job instance will not be able to fetch status\n" 371 | "(you may however set the job job_id manually if needed)" 372 | ) 373 | return output.group("id") 374 | 375 | @classmethod 376 | def affinity(cls) -> int: 377 | return -1 if shutil.which("srun") is None else 2 378 | 379 | 380 | @functools.lru_cache() 381 | def _get_default_parameters() -> Dict[str, Any]: 382 | """Parameters that can be set through update_parameters 383 | """ 384 | specs = inspect.getfullargspec(_make_sbatch_string) 385 | zipped = zip(specs.args[-len(specs.defaults) :], specs.defaults) # type: ignore 386 | return {key: val for key, val in zipped if key not in {"command", "folder", "map_count"}} 387 | 388 | 389 | # pylint: disable=too-many-arguments,unused-argument, too-many-locals 390 | def _make_sbatch_string( 391 | command: str, 392 | folder: tp.Union[str, Path], 393 | job_name: str = "submitit", 394 | partition: str = None, 395 | time: int = 5, 396 | nodes: int = 1, 397 | ntasks_per_node: int = 1, 398 | cpus_per_task: tp.Optional[int] = None, 399 | cpus_per_gpu: tp.Optional[int] = None, 400 | num_gpus: tp.Optional[int] = None, # legacy 401 | gpus_per_node: tp.Optional[int] = None, 402 | gpus_per_task: tp.Optional[int] = None, 403 | setup: tp.Optional[tp.List[str]] = None, 404 | mem: tp.Optional[str] = None, 405 | mem_per_gpu: tp.Optional[str] = None, 406 | mem_per_cpu: tp.Optional[str] = None, 407 | signal_delay_s: int = 90, 408 | comment: str = "", 409 | constraint: str = "", 410 | exclude: str = "", 411 | gres: str = "", 412 | exclusive: tp.Union[bool, str] = False, 413 | array_parallelism: int = 256, 414 
| wckey: str = "submitit", 415 | map_count: tp.Optional[int] = None, # used internally 416 | additional_parameters: tp.Optional[tp.Dict[str, tp.Any]] = None, 417 | ) -> str: 418 | """Creates the content of an sbatch file with provided parameters 419 | 420 | Parameters 421 | ---------- 422 | See slurm sbatch documentation for most parameters: 423 | https://slurm.schedmd.com/sbatch.html 424 | 425 | Below are the parameters that differ from slurm documentation: 426 | 427 | folder: str/Path 428 | folder where print logs and error logs will be written 429 | signal_delay_s: int 430 | delay between the kill signal and the actual kill of the slurm job. 431 | setup: list 432 | a list of command to run in sbatch befure running srun 433 | map_size: int 434 | number of simultaneous map/array jobs allowed 435 | additional_parameters: dict 436 | Forces any parameter to a given value in sbatch. This can be useful 437 | to add parameters which are not currently available in submitit. 438 | Eg: {"mail-user": "blublu@fb.com", "mail-type": "BEGIN"} 439 | 440 | Raises 441 | ------ 442 | ValueError 443 | In case an erroneous keyword argument is added, a list of all eligible parameters 444 | is printed, with their default values 445 | """ 446 | nonslurm = [ 447 | "nonslurm", 448 | "folder", 449 | "command", 450 | "map_count", 451 | "array_parallelism", 452 | "additional_parameters", 453 | "setup", 454 | ] 455 | parameters = {x: y for x, y in locals().items() if y and y is not None and x not in nonslurm} 456 | # rename and reformat parameters 457 | parameters["signal"] = signal_delay_s 458 | del parameters["signal_delay_s"] 459 | if job_name: 460 | parameters["job_name"] = utils.sanitize(job_name) 461 | if comment: 462 | parameters["comment"] = utils.sanitize(comment, only_alphanum=False) 463 | if num_gpus is not None: 464 | warnings.warn( 465 | '"num_gpus" is deprecated, please use "gpus_per_node" instead (overwritting with num_gpus)' 466 | ) 467 | parameters["gpus_per_node"] = parameters.pop("num_gpus", 0) 468 | if "cpus_per_gpu" in parameters and "gpus_per_task" not in parameters: 469 | warnings.warn('"cpus_per_gpu" requires to set "gpus_per_task" to work (and not "gpus_per_node")') 470 | # add necessary parameters 471 | paths = utils.JobPaths(folder=folder) 472 | stdout = str(paths.stdout) 473 | stderr = str(paths.stderr) 474 | # Job arrays will write files in the form __ 475 | if map_count is not None: 476 | assert isinstance(map_count, int) and map_count 477 | parameters["array"] = f"0-{map_count - 1}%{min(map_count, array_parallelism)}" 478 | stdout = stdout.replace("%j", "%A_%a") 479 | stderr = stderr.replace("%j", "%A_%a") 480 | parameters["output"] = stdout.replace("%t", "0") 481 | parameters["error"] = stderr.replace("%t", "0") 482 | parameters.update({"signal": f"USR1@{signal_delay_s}", "open-mode": "append"}) 483 | if additional_parameters is not None: 484 | parameters.update(additional_parameters) 485 | # now create 486 | lines = ["#!/bin/bash", "", "# Parameters"] 487 | lines += [ 488 | "#SBATCH --{}{}".format(x.replace("_", "-"), "" if parameters[x] is True else f"={parameters[x]}") 489 | for x in sorted(parameters) 490 | ] 491 | # environment setup: 492 | if setup is not None: 493 | lines += ["", "# setup"] + setup 494 | # commandline (this will run the function and args specified in the file provided as argument) 495 | # We pass --output and --error here, because the SBATCH command doesn't work as expected with a filename pattern 496 | lines += [ 497 | "", 498 | "# command", 499 | "export 
SUBMITIT_EXECUTOR=slurm", 500 | "export MASTER_ADDR=$(hostname -s)-ib0", 501 | "export MASTER_PORT=$(shuf -i 10000-65500 -n 1)", 502 | f"srun --export=ALL --kill-on-bad-exit=1 --output '{stdout}' --error '{stderr}' --unbuffered {command}\n\n", 503 | ] 504 | return "\n".join(lines) -------------------------------------------------------------------------------- /greene/sweep_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --mem=1G 9 | #SBATCH -c 4 10 | 11 | #SBATCH --array=1-20 12 | 13 | singularity exec --overlay $SCRATCH/overlay-50G-10M.ext3:ro /scratch/work/public/singularity/cuda10.1-cudnn7-devel-ubuntu18.04-20201207.sif /bin/bash -c " 14 | 15 | source /ext3/env.sh 16 | conda activate 17 | 18 | python ./test_sweep.py $SLURM_ARRAY_TASK_ID 19 | " -------------------------------------------------------------------------------- /greene/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == "__main__": 5 | 6 | print(f"Torch cuda available: {torch.cuda.is_available()}") 7 | print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 | t1 = torch.randn(100, 1000) 10 | t2 = torch.randn(1000, 10000) 11 | 12 | cpu_start = time.time() 13 | 14 | for i in range(100): 15 | t = t1 @ t2 16 | 17 | cpu_end = time.time() 18 | 19 | print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 | t1 = t1.to("cuda") 22 | t2 = t2.to("cuda") 23 | 24 | gpu_start = time.time() 25 | 26 | for i in range(100): 27 | t = t1 @ t2 28 | 29 | gpu_end = time.time() 30 | 31 | print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | -------------------------------------------------------------------------------- /greene/test_sweep.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import fire 3 | import itertools 4 | import numpy 5 | 6 | 7 | def retreve_config(sweep_step): 8 | grid = { 9 | "lr": [0.1, 0.01], 10 | "model_size": [512, 1024], 11 | "seed": numpy.random.randint(0, 1000, 5), 12 | } 13 | 14 | grid_setups = list( 15 | dict(zip(grid.keys(), values)) for values in itertools.product(*grid.values()) 16 | ) 17 | step_grid = grid_setups[sweep_step - 1] # slurm var will start from 1 18 | 19 | # automatically choose the device based on the given node 20 | if torch.cuda.device_count() > 0: 21 | expr_device = "cuda" 22 | else: 23 | expr_device = "cpu" 24 | 25 | config = { 26 | "sweep_step": sweep_step, 27 | "seed": step_grid["seed"], 28 | "device": expr_device, 29 | "lr": step_grid["lr"], 30 | "model_size": step_grid["model_size"], 31 | "some_fixed_arg": 0, 32 | } 33 | 34 | return config 35 | 36 | 37 | def run_experiment(config): 38 | print(config) 39 | 40 | 41 | def main(sweep_step): 42 | config = retreve_config(sweep_step) 43 | run_experiment(config) 44 | 45 | 46 | if __name__ == "__main__": 47 | fire.Fire(main) -------------------------------------------------------------------------------- /prince/README.md: -------------------------------------------------------------------------------- 1 | # NYU HPC Prince cluster 2 | 3 | Troubleshooting: hpc@nyu.edu 4 | 5 | NYU HPC website: [https://sites.google.com/a/nyu.edu/nyu-hpc/](https://sites.google.com/a/nyu.edu/nyu-hpc/) 6 | 7 | One can find many useful tips under 
*DOCUMENTATION / GUIDES*. 8 | 9 | In this part we will touch on the following aspects of this cluster: 10 | 11 | * Connecting to Prince via CIMS Access 12 | 13 | * Prince computing nodes 14 | 15 | * Prince filesystems 16 | 17 | * Software management 18 | 19 | * Slurm Workload Manager 20 | 21 | ## Connecting to Prince via CIMS Access (without bastion HPC nodes) 22 | 23 | HPC login nodes are reachable from the CIMS access node, i.e. there is no need to log in over another firewalled bastion HPC node. One can find the corresponding `prince` block in the ssh config file (check the client folder in this repo). 24 | 25 | ## Prince computing nodes 26 | 27 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince](https://sites.google.com/a/nyu.edu/nyu-hpc/systems/prince) 28 | 29 | ## Prince filesystems 30 | 31 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/data-management/prince-data) 32 | 33 | ### Show quota usage for all filesystems 34 | 35 | In order to check how much filesystem space you use, run `myquota`. 36 | 37 | ## Software management 38 | 39 | There is an environment modules package similar to the one on Cassio for general libs/software. 40 | 41 | There is no difference between managing **conda** on Cassio and on Prince except for the installation path. 42 | 43 | Learn more here: [https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments](https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/packages/conda-environments) 44 | 45 | ## Slurm Workload Manager 46 | 47 | https://sites.google.com/a/nyu.edu/nyu-hpc/documentation/prince/batch/slurm-best-practices 48 | 49 | In general, Slurm behaves similarly on Cassio and Prince; however, there are differences in quotas and in how GPU-equipped nodes are distributed: 50 | 51 | 1. **There is no interactive QoS on Prince.** In other words, run `srun --pty bash` with additional args to get an interactive job allocated. 52 | 53 | 2. **There is no `--constraint` arg to specify the GPU you want.** 54 | 55 | Instead, GPU nodes are separated into different **partitions** w.r.t. GPU type. Right now the following partitions are available: `--partition=k80_4,k80_8,p40_4,p100_4,v100_sxm2_4,v100_pci_2,dgx1,p1080_4`. Slurm will try to allocate nodes in the order given in this list. 56 | 57 | |GPUs (count and model)|Memory| 58 | |---|------| 59 | |96 P40|24G| 60 | |32 P100|16G| 61 | |50 K80|24G bridged over 2 GPUs or 12G each| 62 | |26 V100|16G| 63 | |16 P1080|8G| 64 | |DGX1|16G ?| 65 | 66 | ### Port forwarding to an interactive job 67 | 68 | If one follows exactly the same steps as for Cassio and runs: 69 | 70 | `ssh -L :localhost: -J prince :` 71 | 72 | then the following error may be returned: 73 | 74 | ``` 75 | channel 0: open failed: administratively prohibited: open failed 76 | stdio forwarding failed 77 | kex_exchange_identification: Connection closed by remote host 78 | ``` 79 | 80 | which means that the jump to your instance was not successful. 
In order to avoid this jump, we make a tunnel that forwards the connection to the machine itself rather than to localhost: 81 | 82 | `ssh -L :: prince -N` 83 | 84 | **Important: you must run your JupyterLab or any other software so that it accepts requests from all IP addresses rather than from localhost only (which is usually the default).** To make this change in jupyter, add the `--ip 0.0.0.0` arg: 85 | 86 | `jupyter lab --no-browser --port --ip 0.0.0.0` 87 | 88 | Now you should be able to open the JupyterLab tab in your browser. 89 | 90 | ### Submitting a batch job 91 | 92 | As noted before, one particular difference from Cassio is GPU allocation (note `--partition` below): 93 | 94 | ```bash 95 | #!/bin/bash 96 | #SBATCH --job-name= 97 | #SBATCH --open-mode=append 98 | #SBATCH --output= 99 | #SBATCH --error= 100 | #SBATCH --export=ALL 101 | #SBATCH --time=24:00:00 102 | #SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1 103 | #SBATCH --gres=gpu:1 104 | #SBATCH --mem=64G 105 | #SBATCH -c 4 106 | ``` 107 | 108 | **Important: do not forget to activate your conda env before submitting a job, or make sure you do so in the script.** 109 | 110 | Similar to the arguments we passed to `srun` for an interactive job request, here we specify the requirements for the batch job. 111 | 112 | After the `#SBATCH` block one may execute any shell commands or run any script of your choice. 113 | 114 | **You cannot mix `#SBATCH` lines with other commands: Slurm will not register any `#SBATCH` directive after the first regular (non-comment) command in the script.** 115 | 116 | To submit `job_wgpu` located in `gpu_job.slurm`, go to a Prince login node and run: 117 | 118 | `sbatch gpu_job.slurm` 119 | 120 | sample output: 121 | 122 | ``` 123 | Torch cuda available: True 124 | GPU name: Tesla V100-SXM2-32GB-LS 125 | 126 | 127 | CPU matmul elapsed: 1.1984939575195312 sec. 128 | GPU matmul elapsed: 0.01778721809387207 sec. 129 | ``` 130 | 131 | ## MuJoCo Installation on Prince 132 | 133 | This tutorial assumes that you have already installed conda in your scratch folder. Please note that your conda environment must be stored in `$SCRATCH`, not in your `$HOME` directory; you will get weird build-lock errors otherwise! 134 | 135 | We also assume that you have put the mujoco200 folder and the license key inside the `~/.mujoco` folder. 136 | 137 | ```bash 138 | > conda create -n test_mujoco python=3.6 139 | 140 | > singularity exec /beegfs/work/public/singularity/mujoco-200.sif /bin/bash 141 | 142 | > conda activate test_mujoco 143 | 144 | > pip install mujoco-py 145 | 146 | ``` 147 | 148 | 149 | 150 | Now, to run a Python file, one has to run it inside the Singularity environment. 
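A quick interactive sanity check may help before wiring this into batch scripts. The following is a minimal sketch that mirrors the install steps above, assuming the `test_mujoco` conda env and the default `~/.mujoco/mujoco200` location; note that the first `import mujoco_py` may trigger a one-time build and can take a few minutes:

```bash
# open a shell inside the MuJoCo container
> singularity exec /beegfs/work/public/singularity/mujoco-200.sif /bin/bash

# inside the container: expose the MuJoCo shared libraries, activate the env, try the import
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200/bin

> conda activate test_mujoco

> python -c "import mujoco_py; print('mujoco-py OK')"
```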
151 | 152 | You can now run any Python file that uses MuJoCo via a bash wrapper script like the one given here: 153 | 154 | ```bash 155 | #!/bin/bash 156 | 157 | singularity exec \ 158 | /beegfs/work/public/singularity/mujoco-200.sif \ 159 | bash -c " 160 | export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco200/bin 161 | source $SCRATCH/pyenv/mujoco-py/bin/activate 162 | python $* 163 | " 164 | ``` 165 | 166 | 167 | 168 | You can then run a Python file by running 169 | 170 | ```bash 171 | bash 172 | ``` 173 | 174 | -------------------------------------------------------------------------------- /prince/gpu_job.slurm: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | #SBATCH --job-name=job_wgpu 3 | #SBATCH --open-mode=append 4 | #SBATCH --output=./%j_%x.out 5 | #SBATCH --error=./%j_%x.err 6 | #SBATCH --export=ALL 7 | #SBATCH --time=00:10:00 8 | #SBATCH --gres=gpu:1 9 | #SBATCH --partition=p40_4,p100_4,v100_sxm2_4,dgx1 10 | #SBATCH --mem=64G 11 | #SBATCH -c 4 12 | 13 | python ./test_gpu.py 14 | -------------------------------------------------------------------------------- /prince/test_gpu.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import time 3 | 4 | if __name__ == '__main__': 5 | 6 |     print(f"Torch cuda available: {torch.cuda.is_available()}") 7 |     print(f"GPU name: {torch.cuda.get_device_name()}\n\n") 8 | 9 |     t1 = torch.randn(100,1000) 10 |     t2 = torch.randn(1000,10000) 11 | 12 |     cpu_start = time.time() 13 | 14 |     for i in range(100): 15 |         t = t1 @ t2 16 | 17 |     cpu_end = time.time() 18 | 19 |     print(f"CPU matmul elapsed: {cpu_end-cpu_start} sec.") 20 | 21 |     t1 = t1.to('cuda') 22 |     t2 = t2.to('cuda') 23 | 24 |     gpu_start = time.time() 25 | 26 |     for i in range(100): 27 |         t = t1 @ t2 28 | 29 |     gpu_end = time.time() 30 | 31 |     print(f"GPU matmul elapsed: {gpu_end-gpu_start} sec.") 32 | 33 | 34 | --------------------------------------------------------------------------------