├── .gitignore
├── README-ND.md
├── README.md
├── examples
│   ├── language-model
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── data.py
│   │   ├── data
│   │   │   └── wikitext-2
│   │   │       ├── README
│   │   │       ├── test.txt
│   │   │       ├── train.txt
│   │   │       └── valid.txt
│   │   ├── generate.py
│   │   ├── main.py
│   │   ├── model.py
│   │   ├── run-gpu-job-version-2.bash
│   │   ├── run-gpu-job-version-3.bash
│   │   ├── submit-gpu-job-version-2.bash
│   │   ├── submit-gpu-job-version-3.bash
│   │   ├── version-2.def
│   │   └── version-3.def
│   └── xor
│       ├── train_xor.py
│       └── version-1.def
├── images
│   └── plot.png
└── slides.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | *.sif
2 | *.pt
3 | *.swp
4 | __pycache__/
5 | .venv/
--------------------------------------------------------------------------------
/README-ND.md:
--------------------------------------------------------------------------------
1 | # How to Install Literally Anything Using Containers
2 |
3 | Brian DuSell<br>
4 | Apr 9, 2019
5 |
6 | Grad Tutorial Talk<br>
7 | Dept. of Computer Science and Engineering<br>
8 | University of Notre Dame 9 | 10 | ## Abstract 11 | 12 | Have you ever spent an inordinate amount of time trying to install 13 | something on the CRC without root privileges? Have you ever wrecked your 14 | computer trying to update CUDA? Have you ever wished you could install two 15 | versions of the same package at once? If so, containers may be what's missing 16 | in your life. In this talk, I will show you how to install software using 17 | Singularity, a container system that allows you to install software in a fully 18 | portable, self-contained Linux environment where you have full administrative 19 | rights. Singularity can be installed on any Linux machine (with techniques 20 | available for running it on Windows and Mac) and is available on the CRC, thus 21 | ensuring that your code runs in a consistent environment no matter which 22 | machine you run it on. Singularity is compatible with Docker images and lets 23 | you effortlessly install any CUDA version of your choosing provided that your 24 | Nvidia drivers have been set up properly. My tutorial will consist of walking 25 | you through using Singularity to run a GPU-accelerated PyTorch program for deep 26 | learning on the CRC. Note: If you want to follow along, please ensure that you 27 | have a directory under the `/scratch365` directory on the CRC's filesystem. 28 | 29 | ## Introduction 30 | 31 | This tutorial will introduce you to [Singularity](https://www.sylabs.io/singularity/), 32 | a containerization system for scientific computing environments that is 33 | available on Notre Dame's CRC computing cluster. Containers allow you to 34 | package the environment that your code depends on inside of a portable unit. 35 | This is extremely useful for ensuring that your code can be run portably 36 | on other machines. It is also useful for installing software, packages, 37 | libraries, etc. in environments where you do not have root privileges, like the 38 | CRC. I will show you how to install PyTorch with GPU support inside of a 39 | container and run a simple PyTorch program to train a neural net. 40 | 41 | ## The Portability Problem 42 | 43 | The programs we write depend on external environments, whether that environment 44 | is explicitly documented or not. A Python program assumes that a Python 45 | interpreter is available on the system it is run on. A Python program that uses 46 | set comprehension syntax, e.g. 47 | 48 | ```python 49 | { x * 2 for x in range(10) } 50 | ``` 51 | 52 | assumes that you're using Python 3. A Python program that uses the function 53 | `subprocess.run()` assumes that you're using at least version 3.5. A Python 54 | program that calls `subprocess.run(['grep', '-r', 'foo', './my/directory'])` 55 | assumes that you're running on a \*nix system where the program `grep` is 56 | available. 57 | 58 | When these dependencies are undocumented, it can become painful to run a 59 | program in an environment that is different from the one it was developed in. 60 | It would be nice to have a way to package a program together with its 61 | environment, and then run that program on any machine. 62 | 63 | ## The Installation Problem 64 | 65 | The CRC is a shared scientific computing environment with a shared file system. 66 | This means that users do not have root privileges and cannot use a package 67 | manager like `yum` or `apt-get` to install new libraries. 
If you want to 68 | install something on the CRC that is not already there, you have a few options: 69 | 70 | * If it is a major library, ask the staff to install/update it for you 71 | * Install it in your home directory (e.g. `pip install --user` for Python 72 | modules) or other non-standard directory 73 | * Compile it yourself in your home directory 74 | 75 | While it is almost always possible to re-compile a library yourself without 76 | root privileges, it can be very time-consuming. This is especially true when 77 | the library depends on other libraries that also need to be re-compiled, 78 | leading to a tedious search for just the right configuration to stitch them all 79 | together. CUDA also complicates the situation, as certain deep learning 80 | libraries need to be built on a node that has a GPU (even though the GPU is 81 | never used during compilation!). 82 | 83 | Finally, sometimes you deliberately want to install an older version of a 84 | package. But unless you set up two isolated installations, this could conflict 85 | with projects that still require the newer versions. 86 | 87 | To take an extreme (but completely real!) example, older versions of the deep 88 | learning library [DyNet](https://dynet.readthedocs.io/en/latest/) could only be 89 | built with an old version of GCC, and moreover needed to be compiled on a GPU 90 | node with the CRC's CUDA module loaded in order to work properly. In May 2018, 91 | the CRC removed the required version of GCC. This meant that if you wanted to 92 | install or update DyNet, you needed to re-compile that version of GCC yourself 93 | *and* figure out how to configure DyNet to build itself with a compiler in a 94 | non-standard location. 95 | 96 | ## The Solution: Containers 97 | 98 | Containers are a software isolation technique that has exploded in popularity 99 | in recent years, particularly thanks to [Docker](https://www.docker.com/). 100 | A container, like a virtual machine, is an operating system within an operating 101 | system. Unlike a virtual machine, however, it shares the kernel with the host 102 | operating system, so it incurs no performance penalty for translating machine 103 | instructions. Instead, containers rely on special system calls that allow the 104 | host to spoof the filesystem and network that the container has access to, 105 | making it appear from inside the container that it exists in a separate 106 | environment. 107 | 108 | Today we will be talking about an alternative to Docker called Singularity, 109 | which is more suitable for scientific computing environments (Docker is better 110 | suited for things like cloud applications, and there are reasons why it would 111 | not be ideal for a shared environment like the CRC). The CRC currently offers 112 | [Singularity 3.0](https://www.sylabs.io/guides/3.0/user-guide/), which is 113 | available via the `singularity` command. 114 | 115 | Singularity containers are instantiated from **images**, which are files that 116 | define the container's environment. The container's "root" file system is 117 | distinct from that of the host operating system, so you can install whatever 118 | software you like as if you were the root user. Installing software via the 119 | built-in package manager is now an option again. Not only this, but you can 120 | also choose a pre-made image to base your container on. 
Singularity is
121 | compatible with Docker images (a very deliberate design decision), so it can
122 | take advantage of the extremely rich selection of production-grade Docker
123 | images that are available. For example, there are pre-made images for fresh
124 | installations of Ubuntu, Python, TensorFlow, PyTorch, and even CUDA. For
125 | virtually all major libraries, getting a pre-made image for X is as simple as
126 | Googling "X docker" and taking note of the name of the image.
127 |
128 | Also, because your program's environment is self-contained, it is not affected
129 | by changes to the CRC's software and is no longer susceptible to "software
130 | rot." There is also no longer a need to rely on the CRC's modules via `module
131 | load`. Because the container is portable, it will also run just as well on your
132 | local machine as on the CRC. In the age of containers, "it runs on my machine"
133 | is no longer an excuse.
134 |
135 | ## Basic Workflow
136 |
137 | Singularity instantiates containers from images that define their environment.
138 | Singularity images are stored in `.sif` files. You build a `.sif` file by
139 | defining your environment in a text file and providing that definition to the
140 | command `singularity build`.
141 |
142 | Building an image file does require root privileges, so it is most convenient
143 | to build the image on your local machine or workstation and then copy it to
144 | your `/scratch365` directory on the CRC. The reason it requires root is that
145 | the kernel is shared, and user permissions are implemented in the kernel. So if
146 | you want to do something in the container as root, you actually need to *be*
147 | root on the host when you do it.
148 |
149 | There is also an option to build an image without root privileges. This works by
150 | sending your definition to a remote server and building the image there, but I
151 | have had difficulty getting this to work.
152 |
153 | Once you've uploaded your image to the CRC, you can submit a batch job that
154 | runs `singularity exec` with the image file you created and the command you
155 | want to run. That's it!
156 |
157 | ## A Simple PyTorch Program
158 |
159 | I have included a PyTorch program,
160 | [`train_xor.py`](examples/xor/train_xor.py),
161 | that trains a neural network to compute the XOR function and then plots the
162 | loss as a function of training time. It can also save the model to a file. It
163 | depends on the Python modules `torch`, `numpy`, and `matplotlib`.
164 |
165 | ## Installing Singularity
166 |
167 | [Singularity 3.0](https://www.sylabs.io/guides/3.0/user-guide/index.html)
168 | is already available on the CRC via the `singularity` command.
169 |
170 | As for installing Singularity locally, the Singularity docs include detailed
171 | instructions for installing Singularity on major operating systems
172 | [here](https://www.sylabs.io/guides/3.0/user-guide/installation.html).
173 | Installing Singularity is not necessary for following the tutorial in real
174 | time, as I will provide you with pre-built images.
175 |
176 | ## Defining an Image
177 |
178 | The first step in defining an image is picking which base image to use. This
179 | can be a Linux distribution, such as Ubuntu, or an image with a library
180 | pre-installed, like one of PyTorch's
181 | [official Docker images](https://hub.docker.com/r/pytorch/pytorch/tags). Since
182 | our program depends on more than just PyTorch, let's start with a plain Ubuntu
183 | image and build up from there.
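By the way, if you want to poke around inside a candidate base image before committing to it in a definition file, you can pull it and open a shell in it directly. Here is a minimal sketch (the `.sif` filename is generated automatically by `singularity pull` from the image name and tag):

```bash
# Pull the official Ubuntu 18.04 image from Docker Hub and convert it to a
# local image file, named ubuntu_18.04.sif by default.
singularity pull docker://ubuntu:18.04
# Open an interactive shell in a container based on that image.
singularity shell ubuntu_18.04.sif
```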
184 |
185 | Let's start with the basic syntax for definition files, which is documented
186 | [here](https://www.sylabs.io/guides/3.0/user-guide/definition_files.html).
187 | The first part of the file is the header, where we define the base image and
188 | other meta-information. The only required keyword in the header is `Bootstrap`,
189 | which defines the type of image being imported. Using `Bootstrap: library`
190 | means that we are importing an image from the official
191 | [Singularity Library](https://cloud.sylabs.io/library).
192 | Using `Bootstrap: docker` means that we are importing a Docker image from a
193 | Docker registry such as
194 | [Docker Hub](https://hub.docker.com/).
195 | Let's import the official
196 | [Ubuntu 18.04](https://cloud.sylabs.io/library/_container/5baba99394feb900016ea433)
197 | image.
198 |
199 | ```
200 | Bootstrap: library
201 | From: ubuntu:18.04
202 | ```
203 |
204 | The rest of the definition file is split up into several **sections** which
205 | serve special roles. The `%post` section defines a series of commands to be run
206 | while the image is being built, inside of a container as the root user. This
207 | is typically where you install packages. The `%environment` section defines
208 | environment variables that are set when the image is instantiated as a
209 | container. The `%files` section lets you copy files into the image. There are
210 | [many other types of section](https://www.sylabs.io/guides/3.0/user-guide/definition_files.html#sections).
211 |
212 | Let's use the `%post` section to install all of our requirements using
213 | `apt-get` and `pip3`.
214 |
215 | ```
216 | %post
217 | # Downloads the latest package lists (important).
218 | apt-get update -y
219 | # Runs apt-get while ensuring that there are no user prompts that would
220 | # cause the build process to hang.
221 | # python3-tk is required by matplotlib.
222 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
223 | python3 \
224 | python3-tk \
225 | python3-pip
226 | # Reduce the size of the image by deleting the package lists we downloaded,
227 | # which are no longer needed.
228 | rm -rf /var/lib/apt/lists/*
229 | # Install Python modules.
230 | pip3 install torch numpy matplotlib
231 | ```
232 |
233 | Each line defines a separate command (lines can be continued with a `\`).
234 | Unlike normal shell scripts, the build will be aborted as soon as one of the
235 | commands fails. You do not need to connect the commands with `&&`.
236 |
237 | The final build definition is in the file
238 | [version-1.def](examples/xor/version-1.def).
239 |
240 | ## Building an Image
241 |
242 | Supposing we are on our own Ubuntu machine, we can build this definition into
243 | a `.sif` image file using the following command:
244 |
245 | ```bash
246 | cd examples/xor
247 | sudo singularity build version-1.sif version-1.def
248 | ```
249 |
250 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-1.html)
251 |
252 | This ran the commands we defined in the `%post` section inside a container and
253 | afterwards saved the state of the container in the image `version-1.sif`.
254 |
255 | ## Running an Image
256 |
257 | Let's run our PyTorch program in a container based on the image we just built.
258 |
259 | ```bash
260 | singularity exec version-1.sif python3 train_xor.py --output model.pt
261 | ```
262 |
263 | This program does not take long to run.
Once it finishes, it should open a
264 | window with a plot of the model's loss and accuracy over time.
265 |
266 | [![asciicast](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5.svg)](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5)
267 |
268 | ![Plot](images/plot.png)
269 |
270 | The trained model should also be saved in the file `model.pt`. Note that even
271 | though the program ran in a container, it was able to write a file to the host
272 | file system that remained after the program exited and the container was shut
273 | down. If you are familiar with Docker, you probably know that you cannot write
274 | files to the host in this way unless you explicitly **bind mount** two
275 | directories in the host and container file system. Bind mounting makes a file
276 | or directory on the host system synonymous with one in the container.
277 |
278 | For convenience, Singularity
279 | [binds a few important directories by
280 | default](https://www.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html):
281 |
282 | * Your home directory
283 | * The current working directory
284 | * `/tmp`
285 | * `/proc`
286 | * `/sys`
287 | * `/dev`
288 |
289 | You can add to or override these settings if you wish using the
290 | [`--bind` flag](https://www.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html#specifying-bind-paths)
291 | to `singularity exec`. This is important to remember if you want to access a
292 | file that is outside of your home directory on the CRC -- otherwise you may end
293 | up with cryptic permission errors.
294 |
295 | It is also important to know that, unlike Docker, environment variables are
296 | inherited inside the container for convenience.
297 |
298 | ## Running an Interactive Shell
299 |
300 | You can also open up a shell inside the container and run commands there. You
301 | can `exit` when you're done. Note that since your home directory is
302 | bind-mounted, the shell inside the container will run your shell's startup file
303 | (e.g. `.bashrc`).
304 |
305 | ```
306 | $ singularity shell version-1.sif
307 | Singularity version-1.sif:~/singularity-tutorial/examples/xor> python3 train_xor.py
308 | ```
309 |
310 | ## Running an Image on the CRC
311 |
312 | Let's try running the same image on the CRC. Log in to one of the frontends
313 | using `ssh -X`. The `-X` is necessary to get the plot to appear.
314 |
315 | ```bash
316 | ssh -X yournetid@crcfe01.crc.nd.edu
317 | ```
318 |
319 | Then download the image to your `/scratch365` directory. One gotcha is that the
320 | `.sif` file *must* be stored on the scratch365 device for Singularity to work.
321 |
322 | ```bash
323 | singularity pull /scratch365/$USER/version-1.sif library://brian/default/singularity-tutorial:version-1
324 | ```
325 |
326 | Next, download the code to your home directory.
327 |
328 | ```bash
329 | git clone https://github.com/bdusell/singularity-tutorial.git ~/singularity-tutorial
330 | ```
331 |
332 | Run the program.
333 |
334 | ```bash
335 | cd ~/singularity-tutorial/examples/xor
336 | singularity exec /scratch365/$USER/version-1.sif python3 train_xor.py
337 | ```
338 |
339 | The same plot as before should show up. Note that it is not
340 | possible to do this using the Python installations provided by the CRC, since
341 | they do not include Tk, which is required by matplotlib. I have found this
342 | extremely useful for making plots from data I have stored on the CRC without
343 | needing to download the data to another machine.
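As a concrete illustration of the `--bind` flag mentioned above, here is a sketch of how you could give the container explicit access to your scratch365 directory, in case it is not bound by default on your system (the paths are illustrative):

```bash
# Bind-mount the scratch365 directory into the container so that programs
# running inside it can read and write files there.
# The syntax is --bind <host path>:<container path>.
singularity exec --bind /scratch365/$USER:/scratch365/$USER \
    /scratch365/$USER/version-1.sif python3 train_xor.py
```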
344 | 345 | ## A Beefier PyTorch Program 346 | 347 | As an example of a program that benefits from GPU acceleration, we will be 348 | running the official 349 | [`word_language_model`](https://github.com/pytorch/examples/tree/master/word_language_model) 350 | example PyTorch program, which I have included at 351 | [`examples/language-model`](examples/language-model). 352 | This program trains an 353 | [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory) 354 | [language model](https://en.wikipedia.org/wiki/Language_model) 355 | on a corpus of Wikipedia text. 356 | 357 | ## Adding GPU Support 358 | 359 | In order to add GPU support, we need to include CUDA in our image. In 360 | Singularity, this is delightfully simple. We just need to pick one of 361 | [Nvidia's official Docker images](https://hub.docker.com/r/nvidia/cuda) 362 | to base our image on. Again, the easiest way to install library X is often to 363 | Google "X docker" and pick an image from the README or tags page on Docker Hub. 364 | 365 | The README lists several tags. They tend to indicate variants of the image that 366 | have different components and different versions of things installed. Let's 367 | pick the one that is based on CUDA 10.1, uses Ubuntu 18.04, and includes cuDNN 368 | (which PyTorch can leverage for highly optimized neural network operations). 369 | Let's also pick the `devel` version, since PyTorch needs to compile itself in 370 | the container. This is the image tagged `10.1-cudnn7-devel-ubuntu18.04`. Since 371 | the image comes from the nvidia/cuda repository, the full image name is 372 | `nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04`. 373 | 374 | Our definition file now looks 375 | [like this](examples/language-model/version-2.def). Although we don't need it, 376 | I kept matplotlib for good measure. 377 | 378 | ``` 379 | Bootstrap: docker 380 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 381 | 382 | %post 383 | # Downloads the latest package lists (important). 384 | apt-get update -y 385 | # Runs apt-get while ensuring that there are no user prompts that would 386 | # cause the build process to hang. 387 | # python3-tk is required by matplotlib. 388 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 389 | python3 \ 390 | python3-tk \ 391 | python3-pip 392 | # Reduce the size of the image by deleting the package lists we downloaded, 393 | # which are useless now. 394 | rm -rf /var/lib/apt/lists/* 395 | # Install Python modules. 396 | pip3 install torch numpy matplotlib 397 | ``` 398 | 399 | We build the image as usual. 400 | 401 | ```bash 402 | cd examples/language-model 403 | sudo singularity build version-2.sif version-2.def 404 | ``` 405 | 406 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-2.html) 407 | 408 | We run the image like before, except that we have to add the `--nv` flag to 409 | allow the container to access the Nvidia drivers on the host in order to use 410 | the GPU. That's all we need to get GPU support working. Not bad! 411 | 412 | This program takes a while to run. Do not run it on the CRC frontend. 
When I
413 | run one epoch on my workstation (which has a GPU), the output looks like this:
414 |
415 | ```
416 | $ singularity exec --nv version-2.sif python3 main.py --cuda --epochs 1
417 | | epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 45.68 | loss 7.63 | ppl 2050.44
418 | | epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 6.85 | ppl 945.93
419 | | epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 45.03 | loss 6.48 | ppl 653.61
420 | | epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 46.43 | loss 6.29 | ppl 541.05
421 | | epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 45.50 | loss 6.14 | ppl 464.91
422 | | epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 44.99 | loss 6.06 | ppl 429.36
423 | | epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 45.27 | loss 5.95 | ppl 382.01
424 | | epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 45.09 | loss 5.95 | ppl 382.31
425 | | epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 45.25 | loss 5.80 | ppl 330.43
426 | | epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 45.08 | loss 5.78 | ppl 324.42
427 | | epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 5.66 | ppl 288.16
428 | | epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 45.14 | loss 5.67 | ppl 291.00
429 | | epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 45.21 | loss 5.66 | ppl 287.51
430 | | epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 45.02 | loss 5.54 | ppl 255.68
431 | -----------------------------------------------------------------------------------------
432 | | end of epoch 1 | time: 140.54s | valid loss 5.54 | valid ppl 254.69
433 | -----------------------------------------------------------------------------------------
434 | =========================================================================================
435 | | End of training | test loss 5.46 | test ppl 235.49
436 | =========================================================================================
437 | ```
438 |
439 | ## Running a GPU Program on the CRC
440 |
441 | Finally, I will show you how to run this program on the CRC's GPU queue.
442 | This image is too big to be hosted on the Singularity Library, so you need to
443 | copy it from my home directory. We will address this size issue later on.
444 |
445 | ```bash
446 | cp /afs/crc.nd.edu/user/b/bdusell1/Public/singularity-tutorial/version-2.sif /scratch365/$USER/version-2.sif
447 | ```
448 |
449 | Then, submit a job to run this program on the GPU queue. For convenience, I've
450 | included a script to run the submission command.
451 |
452 | ```bash
453 | cd ~/singularity-tutorial/examples/language-model
454 | bash submit-gpu-job-version-2.bash
455 | ```
456 |
457 | Check back in a while to verify that the job completed successfully. The output
458 | will be written to `output-version-2.txt`.
459 |
460 | Something you should keep in mind is that, by default, if there are multiple
461 | GPUs available on the system, PyTorch grabs the first one it sees (some
462 | toolkits grab all of them). However, the CRC assigns each job its own GPU,
463 | which is not necessarily the one that PyTorch would pick. If PyTorch does not
464 | respect this assignment, there can be contention among different jobs. You can
465 | control which GPUs PyTorch has access to using the environment variable
466 | `CUDA_VISIBLE_DEVICES`, which can be set to a comma-separated list of device indices.
467 | The CRC now sets this environment variable automatically, and since Singularity
468 | inherits environment variables, you actually don't need to do anything. It's
469 | just something you should know about, since there is potential for abuse.
470 |
471 | ## Separating Python modules from the image
472 |
473 | Now that you know the basics of how to run a GPU job on the CRC, here's a tip
474 | for managing Python modules. There's a problem with our current workflow.
475 | Every time we want to install a new Python library, we have to re-build the
476 | image. We should only need to re-build the image when we install a package with
477 | `apt-get` or inherit from a different base image -- in other words, actions
478 | that require root privileges. It would be nice if we could store our Python
479 | libraries in the current working directory using a **package manager**, and
480 | rely on the image only for the basic Ubuntu/CUDA/Python environment.
481 |
482 | [Pipenv](https://github.com/pypa/pipenv) is a package manager for Python. It's
483 | like the Python equivalent of npm (Node.js package manager) or gem (Ruby package
484 | manager). It keeps track of the libraries your project depends on in text files
485 | named `Pipfile` and `Pipfile.lock`, which you can commit to version control in
486 | lieu of the massive libraries themselves. Every time you run `pipenv install
487 | <package>`, Pipenv will update the `Pipfile` and download the library locally.
488 | The important thing is that, rather than putting the library in a system-wide
489 | location, Pipenv installs the library in a *local* directory called `.venv`.
490 | The benefit of this is that the libraries are stored *with* your project, but
491 | they are not part of the image. The image is merely the vehicle for running
492 | them.
493 |
494 | Here is the
495 | [new version](examples/language-model/version-3.def)
496 | of our definition file:
497 |
498 | ```
499 | Bootstrap: docker
500 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
501 |
502 | %post
503 | # Downloads the latest package lists (important).
504 | apt-get update -y
505 | # Runs apt-get while ensuring that there are no user prompts that would
506 | # cause the build process to hang.
507 | # python3-tk is required by matplotlib.
508 | # python3-dev is needed to install some packages.
509 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
510 | python3 \
511 | python3-tk \
512 | python3-pip \
513 | python3-dev
514 | # Reduce the size of the image by deleting the package lists we downloaded,
515 | # which are useless now.
516 | rm -rf /var/lib/apt/lists/*
517 | # Install Pipenv.
518 | pip3 install pipenv
519 |
520 | %environment
521 | # Pipenv requires a certain terminal encoding.
522 | export LANG=C.UTF-8
523 | export LC_ALL=C.UTF-8
524 | # This configures Pipenv to store the packages in the current working
525 | # directory.
526 | export PIPENV_VENV_IN_PROJECT=1
527 | ```
528 |
529 | On the CRC, download the new image.
530 |
531 | ```bash
532 | singularity pull /scratch365/$USER/version-3.sif library://brian/default/singularity-tutorial:version-3
533 | ```
534 |
535 | Now we can use the container to install our Python libraries into the current
536 | working directory. We do this by running `pipenv install`.
537 | 538 | ```bash 539 | singularity exec /scratch365/$USER/version-3.sif pipenv install torch numpy matplotlib 540 | ``` 541 | 542 | [![asciicast](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo.svg)](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo) 543 | 544 | This may take a while. When it is finished, it will have installed the 545 | libraries in a directory named `.venv`. The benefit of installing packages like 546 | this is that you can install new ones without re-building the image, and you 547 | can re-use the image for multiple projects. The `.sif` file is smaller too. 548 | 549 | When you're done, you can test it out by submitting a GPU job. If you look at 550 | the script, you will see that we replace the `python3` command with `pipenv run 551 | python`, which runs the program inside the environment that Pipenv manages. 552 | 553 | ```bash 554 | bash submit-gpu-job-version-3.bash 555 | ``` 556 | 557 | ## Docker 558 | 559 | If this container stuff interests you, you might be interested in 560 | [Docker](https://www.docker.com/) 561 | too. Docker is not available on the CRC, but it may prove useful elsewhere. 562 | For example, I've used it to compile PyTorch from source before. Docker has 563 | its own set of idiosyncrasies, but a good place to start is the 564 | [Docker documentation](https://docs.docker.com/). 565 | 566 | This would be a good time to plug my 567 | [dockerdev](https://github.com/bdusell/dockerdev) 568 | project, which is a bash library that sets up a streamlined workflow for using 569 | Docker containers as development environments. 570 | 571 | ## Conclusion 572 | 573 | By now I think I have shown you that the sky is the limit when it comes to 574 | containers. Hopefully this will prove useful to your research. If you like, you 575 | can show your appreciation by leaving a star on GitHub. :) 576 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # How to Install Literally Anything: A Practical Guide to Singularity 2 | 3 | Brian DuSell 4 | 5 | Note: For a version of this tutorial specially tailored for the CRC computing 6 | cluster at the University of Notre Dame, please see 7 | [README-ND.md](README-ND.md). 8 | 9 | ## Abstract 10 | 11 | Have you ever spent an inordinate amount of time trying to install something on 12 | your HPC cluster without root privileges? Have you ever wrecked your computer 13 | trying to update CUDA? Have you ever wished you could install two versions of 14 | the same package at once? If so, containers may be what's missing in your life. 15 | In this talk, I will show you how to install software using Singularity, an 16 | HPC-centric container system that, like Docker, allows you to install software 17 | in a portable, self-contained Linux environment where you have full 18 | administrative rights. Singularity can be installed on any Linux machine (with 19 | techniques available for running it on Windows and Mac) and is becoming 20 | increasingly available on HPC clusters, thus ensuring that your code can run 21 | in a consistent environment no matter which machine you run it on. Singularity 22 | is compatible with Docker images and can make installing tricky libraries, 23 | such as CUDA, as simple as pulling a pre-built image. 
My tutorial will include
24 | a walkthrough of using Singularity to run a GPU-accelerated PyTorch program on
25 | an HPC cluster, as well as general tips for setting up an efficient workflow.
26 |
27 | Watch the screencast [here](https://www.youtube.com/watch?v=D5pe4ewtDe8).
28 |
29 | Slides available [here](slides.pdf).
30 |
31 | ## Introduction
32 |
33 | This tutorial will introduce you to [Singularity](https://www.sylabs.io/singularity/),
34 | a containerization system for scientific computing environments that is
35 | available on many scientific computing clusters. Containers allow you to
36 | package the environment that your code depends on inside of a portable unit.
37 | This is extremely useful for ensuring that your code can be run portably
38 | on other machines. It is also useful for installing software, packages,
39 | libraries, etc. in environments where you do not have root privileges, like an
40 | HPC account. I will show you how to install PyTorch with GPU support inside of
41 | a container and run a simple PyTorch program to train a neural net.
42 |
43 | ## The Portability Problem
44 |
45 | The programs we write depend on external environments, whether that environment
46 | is explicitly documented or not. For example, a Python program assumes that a
47 | Python interpreter is available on the system it is run on.
48 |
49 | ```python
50 | def f(n):
51 |     return 2 * n
52 | ```
53 |
54 | However, some Python code requires certain versions of Python. For example,
55 | a Python program that uses set comprehension syntax requires Python 3.
56 |
57 | ```python
58 | def f(n):
59 |     return { 2 * x for x in range(n) }
60 | ```
61 |
62 | A Python program that uses the function `subprocess.run()` assumes that you're
63 | using at least version 3.5.
64 |
65 | ```python
66 | import subprocess
67 |
68 | def f():
69 |     return subprocess.run(...)
70 | ```
71 |
72 | This Python program additionally assumes that ImageMagick is available on the
73 | system.
74 |
75 | ```python
76 | import subprocess
77 |
78 | def f():
79 |     return subprocess.run([
80 |         'convert', 'photo.jpg', '-resize', '50%', 'photo.png'])
81 | ```
82 |
83 | We can go ever deeper down the rabbit hole.
84 |
85 | When these sorts of dependencies are undocumented, it can become painful to run
86 | a program in an environment that is different from the one it was developed in.
87 | It would be nice to have a way to package a program together with its
88 | environment, and then run that program on any machine.
89 |
90 | ## The Installation Problem
91 |
92 | A scientific computing environment typically provides users with an account and
93 | a home directory in a shared file system. This means that users do not have
94 | root privileges and cannot use a package manager like `yum` or `apt-get` to
95 | install new libraries. If you want to install something that is not already
96 | there, you have a few options:
97 |
98 | * If it is a major library, ask the cluster's staff to install/update it for you
99 | * Install it in your home directory (e.g. `pip install --user` for Python
100 |   modules) or other non-standard directory
101 | * Compile it yourself in your home directory
102 |
103 | While it is almost always possible to re-compile a library yourself without
104 | root privileges, it can be very time-consuming. This is especially true when
105 | the library depends on other libraries that also need to be re-compiled,
106 | leading to a tedious search for just the right configuration to stitch them all
107 | together.
CUDA also complicates the situation, as certain deep learning 108 | libraries need to be built on a node that has a GPU (even though the GPU is 109 | never used during compilation!). 110 | 111 | Finally, sometimes you deliberately want to install an older version of a 112 | package. But unless you set up two isolated installations, this could conflict 113 | with projects that still require the newer versions. 114 | 115 | To take an extreme (but completely real!) example, older versions of the deep 116 | learning library [DyNet](https://dynet.readthedocs.io/en/latest/) could only be 117 | built with an old version of GCC, and moreover needed to be compiled on a GPU 118 | node with my HPC cluster's CUDA module loaded in order to work properly. In May 119 | 2018, the staff removed the required version of GCC. This meant that if you 120 | wanted to install or update DyNet, you needed to re-compile that version of GCC 121 | yourself *and* figure out how to configure DyNet to build itself with a 122 | compiler in a non-standard location. 123 | 124 | ## The Solution: Containers 125 | 126 | Containers are a software isolation technique that has exploded in popularity 127 | in recent years, particularly thanks to [Docker](https://www.docker.com/). 128 | A container, like a virtual machine, is an operating system within an operating 129 | system. Unlike a virtual machine, however, it shares the kernel with the host 130 | operating system, so it incurs no performance penalty for translating machine 131 | instructions. Instead, containers rely on special system calls that allow the 132 | host to spoof the filesystem and network that the container has access to, 133 | making it appear from inside the container that it exists in a separate 134 | environment. 135 | 136 | Today we will be talking about an alternative to Docker called Singularity, 137 | which is more suitable for scientific computing environments (Docker is better 138 | suited for things like cloud applications, and there are reasons why it would 139 | not be ideal for a shared scientific computing environment). Singularity is 140 | customarily available via the `singularity` command. 141 | 142 | Singularity containers are instantiated from **images**, which are files that 143 | define the container's environment. The container's "root" file system is 144 | distinct from that of the host operating system, so you can install whatever 145 | software you like as if you were the root user. Installing software via the 146 | built-in package manager is now an option again. Not only this, but you can 147 | also choose a pre-made image to base your container on. Singularity is 148 | compatible with Docker images (a very deliberate design decision), so it can 149 | take advantage of the extremely rich selection of production-grade Docker 150 | images that are available. For example, there are pre-made images for fresh 151 | installations of Ubuntu, Python, TensorFlow, PyTorch, and even CUDA. For 152 | virtually all major libraries, getting a pre-made image for X is as simple as 153 | Googling "X docker" and taking note of the name of the image. 154 | 155 | Also, because your program's environment is self-contained, it is not affected 156 | by changes to the HPC cluster's software and is no longer susceptible to 157 | "software rot." Because the container is portable, it will also run just as 158 | well on your local machine as on the HPC cluster. In the age of containers, 159 | "it runs on my machine" is no longer an excuse. 
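Incidentally, the "special system calls" mentioned above are Linux namespaces. If you are curious, you can get a taste of these kernel primitives without any container runtime at all, using the standard `unshare` tool from util-linux; this is purely an illustration of the mechanism, not part of the Singularity workflow:

```bash
# Start a shell in new PID and mount namespaces. Inside it, `ps` shows an
# isolated process tree whose first process is this shell, even though no
# virtual machine (and no instruction translation) is involved.
sudo unshare --fork --pid --mount-proc /bin/bash
```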
160 |
161 | ## Basic Workflow
162 |
163 | Singularity instantiates containers from images that define their environment.
164 | Singularity images are stored in `.sif` files. You build a `.sif` file by
165 | defining your environment in a text file and providing that definition to the
166 | command `singularity build`.
167 |
168 | Building an image file does require root privileges, so it is most convenient
169 | to build the image on your local machine or workstation and then copy it to
170 | your HPC cluster via `scp`. The reason it requires root is that the kernel
171 | is shared, and user permissions are implemented in the kernel. So if you want
172 | to do something in the container as root, you actually need to *be* root on
173 | the host when you do it.
174 |
175 | There is also an option to build an image without root privileges. This works by
176 | sending your definition to a remote server and building the image there, but I
177 | have had difficulty getting this to work.
178 |
179 | Once you've uploaded your image to your HPC cluster, you can submit a batch
180 | job that runs `singularity exec` with the image file you created and the
181 | command you want to run. That's it!
182 |
183 | ## A Simple PyTorch Program
184 |
185 | I have included a PyTorch program,
186 | [`train_xor.py`](examples/xor/train_xor.py),
187 | that trains a neural network to compute the XOR function and then plots the
188 | loss as a function of training time. It can also save the model to a file. It
189 | depends on the Python modules `torch`, `numpy`, and `matplotlib`.
190 |
191 | ## Installing Singularity
192 |
193 | Consult your HPC cluster's documentation or staff to see if it supports
194 | Singularity. It is normally available via the `singularity` command. The
195 | documentation for the latest version, 3.2, can be found
196 | [here](https://www.sylabs.io/guides/3.2/user-guide/).
197 |
198 | If Singularity is not installed, consider
199 | [requesting
200 | it](https://www.sylabs.io/guides/3.2/user-guide/installation.html#singularity-on-a-shared-resource).
201 |
202 | As for installing Singularity locally, the Singularity docs include detailed
203 | instructions for installing Singularity on major operating systems
204 | [here](https://www.sylabs.io/guides/3.2/user-guide/installation.html).
205 |
206 | ## Defining an Image
207 |
208 | The first step in defining an image is picking which base image to use. This
209 | can be a Linux distribution, such as Ubuntu, or an image with a library
210 | pre-installed, like one of PyTorch's
211 | [official Docker images](https://hub.docker.com/r/pytorch/pytorch/tags). Since
212 | our program depends on more than just PyTorch, let's start with a plain Ubuntu
213 | image and build up from there.
214 |
215 | Let's start with the basic syntax for definition files, which is documented
216 | [here](https://www.sylabs.io/guides/3.2/user-guide/definition_files.html).
217 | The first part of the file is the header, where we define the base image and
218 | other meta-information. The only required keyword in the header is `Bootstrap`,
219 | which defines the type of image being imported. Using `Bootstrap: library`
220 | means that we are importing an image from the official
221 | [Singularity Library](https://cloud.sylabs.io/library).
222 | Using `Bootstrap: docker` means that we are importing a Docker image from a
223 | Docker registry such as
224 | [Docker Hub](https://hub.docker.com/).
225 | Let's import the official 226 | [Ubuntu 18.04](https://cloud.sylabs.io/library/_container/5baba99394feb900016ea433) 227 | image. 228 | 229 | ``` 230 | Bootstrap: library 231 | From: ubuntu:18.04 232 | ``` 233 | 234 | The rest of the definition file is split up into several **sections** which 235 | serve special roles. The `%post` section defines a series of commands to be run 236 | while the image is being built, inside of a container as the root user. This 237 | is typically where you install packages. The `%environment` section defines 238 | environment variables that are set when the image is instantiated as a 239 | container. The `%files` section lets you copy files into the image. There are 240 | [many other types of section](https://www.sylabs.io/guides/3.2/user-guide/definition_files.html#sections). 241 | 242 | Let's use the `%post` section to install all of our requirements using 243 | `apt-get` and `pip3`. 244 | 245 | ``` 246 | %post 247 | # These first few commands allow us to find the python3-pip package later 248 | # on. 249 | apt-get update -y 250 | # Using "noninteractive" mode runs apt-get while ensuring that there are 251 | # no user prompts that would cause the `singularity build` command to hang. 252 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 253 | software-properties-common 254 | add-apt-repository universe 255 | # Downloads the latest package lists (important). 256 | apt-get update -y 257 | # python3-tk is required by matplotlib. 258 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 259 | python3 \ 260 | python3-tk \ 261 | python3-pip \ 262 | python3-distutils \ 263 | python3-setuptools 264 | # Reduce the size of the image by deleting the package lists we downloaded, 265 | # which are useless now. 266 | rm -rf /var/lib/apt/lists/* 267 | # Install Python modules. 268 | pip3 install torch numpy matplotlib 269 | ``` 270 | 271 | Each line defines a separate command (lines can be continued with a `\`). 272 | Unlike normal shell scripts, the build will be aborted as soon as one of the 273 | commands fails. You do not need to connect the commands with `&&`. 274 | 275 | The final build definition is in the file 276 | [version-1.def](examples/xor/version-1.def). 277 | 278 | ## Building an Image 279 | 280 | Supposing we are on our own Ubuntu machine, we can build this definition into 281 | a `.sif` image file using the following command: 282 | 283 | ```bash 284 | cd examples/xor 285 | sudo singularity build version-1.sif version-1.def 286 | ``` 287 | 288 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-1.html) 289 | 290 | This ran the commands we defined in the `%post` section inside a container and 291 | afterwards saved the state of the container in the image `version-1.sif`. 292 | 293 | ## Running an Image 294 | 295 | Let's run our PyTorch program in a container based on the image we just built. 296 | 297 | ```bash 298 | singularity exec version-1.sif python3 train_xor.py --output model.pt 299 | ``` 300 | 301 | This program does not take long to run. Once it finishes, it should open a 302 | window with a plot of the model's loss and accuracy over time. 303 | 304 | [![asciicast](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5.svg)](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5) 305 | 306 | ![Plot](images/plot.png) 307 | 308 | The trained model should also be saved in the file `model.pt`. 
Note that even
309 | though the program ran in a container, it was able to write a file to the host
310 | file system that remained after the program exited and the container was shut
311 | down. If you are familiar with Docker, you probably know that you cannot write
312 | files to the host in this way unless you explicitly **bind mount** two
313 | directories in the host and container file system. Bind mounting makes a file
314 | or directory on the host system synonymous with one in the container.
315 |
316 | For convenience, Singularity
317 | [binds a few important directories by
318 | default](https://www.sylabs.io/guides/3.2/user-guide/bind_paths_and_mounts.html):
319 |
320 | * Your home directory
321 | * The current working directory
322 | * `/sys`
323 | * `/proc`
324 | * others (depending on the version of Singularity)
325 |
326 | You can add to or override these settings if you wish using the
327 | [`--bind` flag](https://www.sylabs.io/guides/3.2/user-guide/bind_paths_and_mounts.html#specifying-bind-paths)
328 | to `singularity exec`. This is important to remember if you want to access a
329 | file that is outside of your home directory -- otherwise you may end up with
330 | inexplicable "file or directory does not exist" errors. If you encounter
331 | cryptic errors when running Singularity, make sure that you have bound all of
332 | the directories you intend your program to have access to.
333 |
334 | It is also important to know that, unlike Docker, environment variables are
335 | inherited inside the container for convenience. This is a double-edged sword:
336 | it is often convenient, but it also means that the containerized program may
337 | behave differently on different hosts for no apparent reason if the hosts
338 | export different environment variables. This behavior
339 | can be disabled by running Singularity with `--cleanenv`.
340 |
341 | Here is an example of when you might want to inherit environment variables.
342 | By default, if there are multiple GPUs available on a system, PyTorch will
343 | grab the first GPU it sees (some toolkits grab all of them). However, your
344 | cluster may allocate a specific GPU for your batch job that is not
345 | necessarily the one that PyTorch would pick. If PyTorch does not respect
346 | this assignment, there can be contention among different jobs. You can
347 | control which GPUs PyTorch has access to using the environment variable
348 | `CUDA_VISIBLE_DEVICES`. As long as your cluster defines this environment
349 | variable for you, you do not need to explicitly forward it to the Singularity
350 | container.
351 |
352 | ## Running an Interactive Shell
353 |
354 | You can also open up a shell inside the container and run commands there. You
355 | can `exit` when you're done. Note that since your home directory is
356 | bind-mounted, the shell inside the container may run your shell's startup file
357 | (e.g. `.bashrc`).
358 |
359 | ```
360 | $ singularity shell version-1.sif
361 | Singularity version-1.sif:~/singularity-tutorial/examples/xor> python3 train_xor.py
362 | ```
363 |
364 | Again, this is an instance of the host environment leaking into the container in
365 | a potentially unexpected way that you should be mindful of.
366 |
367 | ## Following Along
368 |
369 | At this point, you may wish to follow along with the tutorial on a system where
370 | Singularity is installed, either on a personal workstation or on an HPC
371 | account.
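A quick way to confirm that the `singularity` command is available on whatever machine you are using is to ask for its version (the exact output will vary; this tutorial assumes a 3.x release):

```bash
singularity --version
```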
372 |
373 | You can pull the first tutorial image like so:
374 |
375 | ```bash
376 | singularity pull version-1.sif library://brian/default/singularity-tutorial:version-1
377 | ```
378 |
379 | Next, clone this repository.
380 |
381 | ```bash
382 | git clone https://github.com/bdusell/singularity-tutorial.git
383 | ```
384 |
385 | Run the program like this:
386 |
387 | ```bash
388 | cd singularity-tutorial/examples/xor
389 | singularity exec ../../../version-1.sif python3 train_xor.py
390 | ```
391 |
392 | This program is running the `python3` executable that exists inside the image
393 | `version-1.sif`. It is *not* running the `python3` executable on the host.
394 | Crucially, the host does not even need to have `python3` installed.
395 |
396 | The same plot as before should show up. This would not have been
397 | possible using the software provided on my own HPC cluster, since its Python
398 | installation does not include Tk, which is required by matplotlib. I have
399 | found this extremely useful for making plots from data I have stored on the
400 | cluster without needing to download the data to another machine.
401 |
402 | ## A Beefier PyTorch Program
403 |
404 | As an example of a program that benefits from GPU acceleration, we will be
405 | running the official
406 | [`word_language_model`](https://github.com/pytorch/examples/tree/master/word_language_model)
407 | example PyTorch program, which I have included at
408 | [`examples/language-model`](examples/language-model).
409 | This program trains an
410 | [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)
411 | [language model](https://en.wikipedia.org/wiki/Language_model)
412 | on a corpus of Wikipedia text.
413 |
414 | ## Adding GPU Support
415 |
416 | In order to add GPU support, we need to include CUDA in our image. In
417 | Singularity, this is delightfully simple. We just need to pick one of
418 | [Nvidia's official Docker images](https://hub.docker.com/r/nvidia/cuda)
419 | to base our image on. Again, the easiest way to install library X is often to
420 | Google "X docker" and pick an image from the README or tags page on Docker Hub.
421 |
422 | The README lists several tags. They tend to indicate variants of the image that
423 | have different components and different versions of things installed. Let's
424 | pick the one that is based on CUDA 10.1, uses Ubuntu 18.04, and includes cuDNN
425 | (which PyTorch can leverage for highly optimized neural network operations).
426 | Let's also pick the `devel` version, since PyTorch needs to compile itself in
427 | the container. This is the image tagged `10.1-cudnn7-devel-ubuntu18.04`. Since
428 | the image comes from the nvidia/cuda repository, the full image name is
429 | `nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04`.
430 |
431 | Our definition file now looks
432 | [like this](examples/language-model/version-2.def). Although we don't need it,
433 | I kept matplotlib for good measure.
434 |
435 | ```
436 | Bootstrap: docker
437 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
438 |
439 | %post
440 | # Downloads the latest package lists (important).
441 | apt-get update -y
442 | # Runs apt-get while ensuring that there are no user prompts that would
443 | # cause the build process to hang.
444 | # python3-tk is required by matplotlib.
445 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 446 | python3 \ 447 | python3-tk \ 448 | python3-pip \ 449 | python3-setuptools 450 | # Reduce the size of the image by deleting the package lists we downloaded, 451 | # which are useless now. 452 | rm -rf /var/lib/apt/lists/* 453 | # Install Python modules. 454 | pip3 install torch numpy matplotlib 455 | ``` 456 | 457 | We build the image as usual. 458 | 459 | ```bash 460 | cd examples/language-model 461 | sudo singularity build version-2.sif version-2.def 462 | ``` 463 | 464 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-2.html) 465 | 466 | We run the image like before, except that we have to add the `--nv` flag to 467 | allow the container to access the Nvidia drivers on the host in order to use 468 | the GPU. That's all we need to get GPU support working. Not bad! 469 | 470 | This program takes a while to run. When I run one epoch on my workstation 471 | (which has a GPU), the output looks like this: 472 | 473 | ``` 474 | $ singularity exec --nv version-2.sif python3 main.py --cuda --epochs 1 475 | | epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 45.68 | loss 7.63 | ppl 2050.44 476 | | epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 6.85 | ppl 945.93 477 | | epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 45.03 | loss 6.48 | ppl 653.61 478 | | epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 46.43 | loss 6.29 | ppl 541.05 479 | | epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 45.50 | loss 6.14 | ppl 464.91 480 | | epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 44.99 | loss 6.06 | ppl 429.36 481 | | epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 45.27 | loss 5.95 | ppl 382.01 482 | | epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 45.09 | loss 5.95 | ppl 382.31 483 | | epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 45.25 | loss 5.80 | ppl 330.43 484 | | epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 45.08 | loss 5.78 | ppl 324.42 485 | | epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 5.66 | ppl 288.16 486 | | epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 45.14 | loss 5.67 | ppl 291.00 487 | | epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 45.21 | loss 5.66 | ppl 287.51 488 | | epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 45.02 | loss 5.54 | ppl 255.68 489 | ----------------------------------------------------------------------------------------- 490 | | end of epoch 1 | time: 140.54s | valid loss 5.54 | valid ppl 254.69 491 | ----------------------------------------------------------------------------------------- 492 | ========================================================================================= 493 | | End of training | test loss 5.46 | test ppl 235.49 494 | ========================================================================================= 495 | ``` 496 | 497 | ## Separating Python modules from the image 498 | 499 | Now that you know the basics of how to run a GPU-accelerated program, here's a 500 | tip for managing Python modules. There's a problem with our current workflow. 501 | Every time we want to install a new Python library, we have to re-build the 502 | image. We should only need to re-build the image when we install a package with 503 | `apt-get` or inherit from a different base image -- in other words, actions 504 | that require root privileges. 
It would be nice if we could store our Python
505 | libraries in the current working directory using a **package manager**, and
506 | rely on the image only for the basic Ubuntu/CUDA/Python environment.
507 |
508 | [Pipenv](https://github.com/pypa/pipenv) is a package manager for Python. It's
509 | like the Python equivalent of npm (Node.js package manager) or Bundler (Ruby
510 | package manager). It keeps track of the libraries your project depends on in
511 | text files named `Pipfile` and `Pipfile.lock`, which you can commit to version
512 | control in lieu of the massive libraries themselves. Every time you run
513 | `pipenv install <package>`, Pipenv will update the `Pipfile` and download the
514 | library locally. The important thing is that, rather than putting the library
515 | in a system-wide location, Pipenv installs the library in a *local* directory
516 | called `.venv`. The benefit of this is that the libraries are stored *with*
517 | your project, but they are not part of the image. The image is merely the
518 | vehicle for running them.
519 |
520 | Here is the
521 | [new version](examples/language-model/version-3.def)
522 | of our definition file:
523 |
524 | ```
525 | Bootstrap: docker
526 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
527 |
528 | %post
529 | # Downloads the latest package lists (important).
530 | apt-get update -y
531 | # Runs apt-get while ensuring that there are no user prompts that would
532 | # cause the build process to hang.
533 | # python3-tk is required by matplotlib.
534 | # python3-dev is needed to install some packages.
535 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
536 | python3 \
537 | python3-tk \
538 | python3-pip \
539 | python3-dev
540 | # Reduce the size of the image by deleting the package lists we downloaded,
541 | # which are useless now.
542 | rm -rf /var/lib/apt/lists/*
543 | # Install Pipenv.
544 | pip3 install pipenv
545 |
546 | %environment
547 | # Pipenv requires a certain terminal encoding.
548 | export LANG=C.UTF-8
549 | export LC_ALL=C.UTF-8
550 | # This configures Pipenv to store the packages in the current working
551 | # directory.
552 | export PIPENV_VENV_IN_PROJECT=1
553 | ```
554 |
555 | Download the new image.
556 |
557 | ```bash
558 | singularity pull version-3.sif library://brian/default/singularity-tutorial:version-3
559 | ```
560 |
561 | Now we can use the container to install our Python libraries into the current
562 | working directory. We do this by running `pipenv install`.
563 |
564 | ```bash
565 | singularity exec version-3.sif pipenv install torch numpy matplotlib
566 | ```
567 |
568 | [![asciicast](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo.svg)](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo)
569 |
570 | This may take a while. When it is finished, it will have installed the
571 | libraries in a directory named `.venv`. The benefit of installing packages like
572 | this is that you can install new ones without re-building the image, and you
573 | can re-use the image for multiple projects. The `.sif` file is smaller too.
574 |
575 | When you're done, you can test it out using the following command:
576 |
577 | ```bash
578 | singularity exec --nv version-3.sif pipenv run python main.py --cuda --epochs 6
579 | ```
580 |
581 | Notice that we have replaced the command `python3` with `pipenv run python`.
582 | This command uses the Python executable managed by Pipenv, which in turn exists
583 | inside of the container.
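A side benefit of this arrangement is reproducibility: because `Pipfile` and `Pipfile.lock` live with your project, anyone with the same image can recreate the same environment. A minimal sketch, assuming the lock file has been committed and you are in the project directory:

```bash
# Install exactly the package versions pinned in Pipfile.lock into .venv,
# using the Python interpreter inside the container.
singularity exec version-3.sif pipenv sync
```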
584 | 
585 | ## Docker
586 | 
587 | If this container stuff interests you, you might want to check out
588 | [Docker](https://www.docker.com/)
589 | too. Docker has its own set of idiosyncrasies, but a good place to start is the
590 | [Docker documentation](https://docs.docker.com/).
591 | 
592 | This would be a good time to plug my
593 | [dockerdev](https://github.com/bdusell/dockerdev)
594 | project, which is a bash library that sets up a streamlined workflow for using
595 | Docker containers as development environments.
596 | 
597 | ## Conclusion
598 | 
599 | By now I think I have shown you that the sky is the limit when it comes to
600 | containers. Hopefully this will prove useful to your research. If you like, you
601 | can show your appreciation by leaving a star on GitHub. :)
602 | 
--------------------------------------------------------------------------------
/examples/language-model/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 | 
3 | Copyright (c) 2017,
4 | All rights reserved.
5 | 
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 | 
9 | * Redistributions of source code must retain the above copyright notice, this
10 |   list of conditions and the following disclaimer.
11 | 
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 |   this list of conditions and the following disclaimer in the documentation
14 |   and/or other materials provided with the distribution.
15 | 
16 | * Neither the name of the copyright holder nor the names of its
17 |   contributors may be used to endorse or promote products derived from
18 |   this software without specific prior written permission.
19 | 
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 | 
--------------------------------------------------------------------------------
/examples/language-model/README.md:
--------------------------------------------------------------------------------
1 | # NOTICE
2 | 
3 | This code is largely taken from the
4 | [official PyTorch examples repository](https://github.com/pytorch/examples/tree/master/word_language_model)
5 | in conformance with its [LICENSE](LICENSE).
6 | 
7 | # Word-level language modeling RNN
8 | 
9 | This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
10 | By default, the training script uses the Wikitext-2 dataset, which is provided.
11 | The trained model can then be used by the generate script to generate new text.
12 | 
13 | ```bash
14 | python main.py --cuda --epochs 6        # Train an LSTM on Wikitext-2 with CUDA
15 | python main.py --cuda --epochs 6 --tied # Train a tied LSTM on Wikitext-2 with CUDA
16 | python main.py --cuda --tied            # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs
17 | python generate.py                      # Generate samples from the trained LSTM model.
18 | ```
19 | 
20 | The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
21 | which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.
22 | 
23 | During training, if a keyboard interrupt (Ctrl-C) is received,
24 | training is stopped and the current model is evaluated against the test dataset.
25 | 
26 | The `main.py` script accepts the following arguments:
27 | 
28 | ```bash
29 | optional arguments:
30 |   -h, --help         show this help message and exit
31 |   --data DATA        location of the data corpus
32 |   --model MODEL      type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
33 |   --emsize EMSIZE    size of word embeddings
34 |   --nhid NHID        number of hidden units per layer
35 |   --nlayers NLAYERS  number of layers
36 |   --lr LR            initial learning rate
37 |   --clip CLIP        gradient clipping
38 |   --epochs EPOCHS    upper epoch limit
39 |   --batch_size N     batch size
40 |   --bptt BPTT        sequence length
41 |   --dropout DROPOUT  dropout applied to layers (0 = no dropout)
42 |   --decay DECAY      learning rate decay per epoch
43 |   --tied             tie the word embedding and softmax weights
44 |   --seed SEED        random seed
45 |   --cuda             use CUDA
46 |   --log-interval N   report interval
47 |   --save SAVE        path to save the final model
48 | ```
49 | 
50 | With these arguments, a variety of models can be tested.
51 | As an example, the following arguments produce slower but better models:
52 | 
53 | ```bash
54 | python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
55 | python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
56 | python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
57 | python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
58 | ```
--------------------------------------------------------------------------------
/examples/language-model/data.py:
--------------------------------------------------------------------------------
1 | import os
2 | from io import open
3 | import torch
4 | 
5 | class Dictionary(object):
6 |     def __init__(self):
7 |         self.word2idx = {}
8 |         self.idx2word = []
9 | 
10 |     def add_word(self, word):
11 |         if word not in self.word2idx:
12 |             self.idx2word.append(word)
13 |             self.word2idx[word] = len(self.idx2word) - 1
14 |         return self.word2idx[word]
15 | 
16 |     def __len__(self):
17 |         return len(self.idx2word)
18 | 
19 | 
20 | class Corpus(object):
21 |     def __init__(self, path):
22 |         self.dictionary = Dictionary()
23 |         self.train = self.tokenize(os.path.join(path, 'train.txt'))
24 |         self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
25 |         self.test = self.tokenize(os.path.join(path, 'test.txt'))
26 | 
27 |     def tokenize(self, path):
28 |         """Tokenizes a text file."""
29 |         assert os.path.exists(path)
30 |         # Add words to the dictionary
31 |         with open(path, 'r', encoding="utf8") as f:
32 |             tokens = 0
33 |             for line in f:
34 |                 words = line.split() + ['<eos>']
35 |                 tokens += len(words)
36 |                 for word in words:
37 |                     self.dictionary.add_word(word)
38 | 
39 |         # Tokenize file content
40 |         with open(path, 'r', encoding="utf8") as f:
41 |             ids = torch.LongTensor(tokens)
42 |             token = 0
43 |             for line in f:
44 |                 words = line.split() + ['<eos>']
45 |                 for word in words:
46 |
ids[token] = self.dictionary.word2idx[word] 47 | token += 1 48 | 49 | return ids 50 | -------------------------------------------------------------------------------- /examples/language-model/data/wikitext-2/README: -------------------------------------------------------------------------------- 1 | This is raw data from the wikitext-2 dataset. 2 | 3 | See https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/ 4 | -------------------------------------------------------------------------------- /examples/language-model/generate.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # Language Modeling on Wikitext-2 3 | # 4 | # This file generates new sentences sampled from the language model 5 | # 6 | ############################################################################### 7 | 8 | import argparse 9 | 10 | import torch 11 | 12 | import data 13 | 14 | parser = argparse.ArgumentParser(description='PyTorch Wikitext-2 Language Model') 15 | 16 | # Model parameters. 17 | parser.add_argument('--data', type=str, default='./data/wikitext-2', 18 | help='location of the data corpus') 19 | parser.add_argument('--checkpoint', type=str, default='./model.pt', 20 | help='model checkpoint to use') 21 | parser.add_argument('--outf', type=str, default='generated.txt', 22 | help='output file for generated text') 23 | parser.add_argument('--words', type=int, default='1000', 24 | help='number of words to generate') 25 | parser.add_argument('--seed', type=int, default=1111, 26 | help='random seed') 27 | parser.add_argument('--cuda', action='store_true', 28 | help='use CUDA') 29 | parser.add_argument('--temperature', type=float, default=1.0, 30 | help='temperature - higher will increase diversity') 31 | parser.add_argument('--log-interval', type=int, default=100, 32 | help='reporting interval') 33 | args = parser.parse_args() 34 | 35 | # Set the random seed manually for reproducibility. 
36 | torch.manual_seed(args.seed) 37 | if torch.cuda.is_available(): 38 | if not args.cuda: 39 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 40 | 41 | device = torch.device("cuda" if args.cuda else "cpu") 42 | 43 | if args.temperature < 1e-3: 44 | parser.error("--temperature has to be greater or equal 1e-3") 45 | 46 | with open(args.checkpoint, 'rb') as f: 47 | model = torch.load(f).to(device) 48 | model.eval() 49 | 50 | corpus = data.Corpus(args.data) 51 | ntokens = len(corpus.dictionary) 52 | hidden = model.init_hidden(1) 53 | input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device) 54 | 55 | with open(args.outf, 'w') as outf: 56 | with torch.no_grad(): # no tracking history 57 | for i in range(args.words): 58 | output, hidden = model(input, hidden) 59 | word_weights = output.squeeze().div(args.temperature).exp().cpu() 60 | word_idx = torch.multinomial(word_weights, 1)[0] 61 | input.fill_(word_idx) 62 | word = corpus.dictionary.idx2word[word_idx] 63 | 64 | outf.write(word + ('\n' if i % 20 == 19 else ' ')) 65 | 66 | if i % args.log_interval == 0: 67 | print('| Generated {}/{} words'.format(i, args.words)) 68 | -------------------------------------------------------------------------------- /examples/language-model/main.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | import argparse 3 | import time 4 | import math 5 | import os 6 | import torch 7 | import torch.nn as nn 8 | import torch.onnx 9 | 10 | import data 11 | import model 12 | 13 | parser = argparse.ArgumentParser(description='PyTorch Wikitext-2 RNN/LSTM Language Model') 14 | parser.add_argument('--data', type=str, default='./data/wikitext-2', 15 | help='location of the data corpus') 16 | parser.add_argument('--model', type=str, default='LSTM', 17 | help='type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)') 18 | parser.add_argument('--emsize', type=int, default=200, 19 | help='size of word embeddings') 20 | parser.add_argument('--nhid', type=int, default=200, 21 | help='number of hidden units per layer') 22 | parser.add_argument('--nlayers', type=int, default=2, 23 | help='number of layers') 24 | parser.add_argument('--lr', type=float, default=20, 25 | help='initial learning rate') 26 | parser.add_argument('--clip', type=float, default=0.25, 27 | help='gradient clipping') 28 | parser.add_argument('--epochs', type=int, default=40, 29 | help='upper epoch limit') 30 | parser.add_argument('--batch_size', type=int, default=20, metavar='N', 31 | help='batch size') 32 | parser.add_argument('--bptt', type=int, default=35, 33 | help='sequence length') 34 | parser.add_argument('--dropout', type=float, default=0.2, 35 | help='dropout applied to layers (0 = no dropout)') 36 | parser.add_argument('--tied', action='store_true', 37 | help='tie the word embedding and softmax weights') 38 | parser.add_argument('--seed', type=int, default=1111, 39 | help='random seed') 40 | parser.add_argument('--cuda', action='store_true', 41 | help='use CUDA') 42 | parser.add_argument('--log-interval', type=int, default=200, metavar='N', 43 | help='report interval') 44 | parser.add_argument('--save', type=str, default='model.pt', 45 | help='path to save the final model') 46 | parser.add_argument('--onnx-export', type=str, default='', 47 | help='path to export the final model in onnx format') 48 | args = parser.parse_args() 49 | 50 | # Set the random seed manually for reproducibility. 
51 | torch.manual_seed(args.seed) 52 | if torch.cuda.is_available(): 53 | if not args.cuda: 54 | print("WARNING: You have a CUDA device, so you should probably run with --cuda") 55 | 56 | device = torch.device("cuda" if args.cuda else "cpu") 57 | 58 | ############################################################################### 59 | # Load data 60 | ############################################################################### 61 | 62 | corpus = data.Corpus(args.data) 63 | 64 | # Starting from sequential data, batchify arranges the dataset into columns. 65 | # For instance, with the alphabet as the sequence and batch size 4, we'd get 66 | # ┌ a g m s ┐ 67 | # │ b h n t │ 68 | # │ c i o u │ 69 | # │ d j p v │ 70 | # │ e k q w │ 71 | # └ f l r x ┘. 72 | # These columns are treated as independent by the model, which means that the 73 | # dependence of e. g. 'g' on 'f' can not be learned, but allows more efficient 74 | # batch processing. 75 | 76 | def batchify(data, bsz): 77 | # Work out how cleanly we can divide the dataset into bsz parts. 78 | nbatch = data.size(0) // bsz 79 | # Trim off any extra elements that wouldn't cleanly fit (remainders). 80 | data = data.narrow(0, 0, nbatch * bsz) 81 | # Evenly divide the data across the bsz batches. 82 | data = data.view(bsz, -1).t().contiguous() 83 | return data.to(device) 84 | 85 | eval_batch_size = 10 86 | train_data = batchify(corpus.train, args.batch_size) 87 | val_data = batchify(corpus.valid, eval_batch_size) 88 | test_data = batchify(corpus.test, eval_batch_size) 89 | 90 | ############################################################################### 91 | # Build the model 92 | ############################################################################### 93 | 94 | ntokens = len(corpus.dictionary) 95 | model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device) 96 | 97 | criterion = nn.CrossEntropyLoss() 98 | 99 | ############################################################################### 100 | # Training code 101 | ############################################################################### 102 | 103 | def repackage_hidden(h): 104 | """Wraps hidden states in new Tensors, to detach them from their history.""" 105 | if isinstance(h, torch.Tensor): 106 | return h.detach() 107 | else: 108 | return tuple(repackage_hidden(v) for v in h) 109 | 110 | 111 | # get_batch subdivides the source data into chunks of length args.bptt. 112 | # If source is equal to the example output of the batchify function, with 113 | # a bptt-limit of 2, we'd get the following two Variables for i = 0: 114 | # ┌ a g m s ┐ ┌ b h n t ┐ 115 | # └ b h n t ┘ └ c i o u ┘ 116 | # Note that despite the name of the function, the subdivison of data is not 117 | # done along the batch dimension (i.e. dimension 1), since that was handled 118 | # by the batchify function. The chunks are along dimension 0, corresponding 119 | # to the seq_len dimension in the LSTM. 120 | 121 | def get_batch(source, i): 122 | seq_len = min(args.bptt, len(source) - 1 - i) 123 | data = source[i:i+seq_len] 124 | target = source[i+1:i+1+seq_len].view(-1) 125 | return data, target 126 | 127 | 128 | def evaluate(data_source): 129 | # Turn on evaluation mode which disables dropout. 130 | model.eval() 131 | total_loss = 0. 
132 | ntokens = len(corpus.dictionary) 133 | hidden = model.init_hidden(eval_batch_size) 134 | with torch.no_grad(): 135 | for i in range(0, data_source.size(0) - 1, args.bptt): 136 | data, targets = get_batch(data_source, i) 137 | output, hidden = model(data, hidden) 138 | output_flat = output.view(-1, ntokens) 139 | total_loss += len(data) * criterion(output_flat, targets).item() 140 | hidden = repackage_hidden(hidden) 141 | return total_loss / (len(data_source) - 1) 142 | 143 | 144 | def train(): 145 | # Turn on training mode which enables dropout. 146 | model.train() 147 | total_loss = 0. 148 | start_time = time.time() 149 | ntokens = len(corpus.dictionary) 150 | hidden = model.init_hidden(args.batch_size) 151 | for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)): 152 | data, targets = get_batch(train_data, i) 153 | # Starting each batch, we detach the hidden state from how it was previously produced. 154 | # If we didn't, the model would try backpropagating all the way to start of the dataset. 155 | hidden = repackage_hidden(hidden) 156 | model.zero_grad() 157 | output, hidden = model(data, hidden) 158 | loss = criterion(output.view(-1, ntokens), targets) 159 | loss.backward() 160 | 161 | # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs. 162 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) 163 | for p in model.parameters(): 164 | p.data.add_(-lr, p.grad.data) 165 | 166 | total_loss += loss.item() 167 | 168 | if batch % args.log_interval == 0 and batch > 0: 169 | cur_loss = total_loss / args.log_interval 170 | elapsed = time.time() - start_time 171 | print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | ' 172 | 'loss {:5.2f} | ppl {:8.2f}'.format( 173 | epoch, batch, len(train_data) // args.bptt, lr, 174 | elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))) 175 | total_loss = 0 176 | start_time = time.time() 177 | 178 | 179 | def export_onnx(path, batch_size, seq_len): 180 | print('The model is also exported in ONNX format at {}'. 181 | format(os.path.realpath(args.onnx_export))) 182 | model.eval() 183 | dummy_input = torch.LongTensor(seq_len * batch_size).zero_().view(-1, batch_size).to(device) 184 | hidden = model.init_hidden(batch_size) 185 | torch.onnx.export(model, (dummy_input, hidden), path) 186 | 187 | 188 | # Loop over epochs. 189 | lr = args.lr 190 | best_val_loss = None 191 | 192 | # At any point you can hit Ctrl + C to break out of training early. 193 | try: 194 | for epoch in range(1, args.epochs+1): 195 | epoch_start_time = time.time() 196 | train() 197 | val_loss = evaluate(val_data) 198 | print('-' * 89) 199 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | ' 200 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time), 201 | val_loss, math.exp(val_loss))) 202 | print('-' * 89) 203 | # Save the model if the validation loss is the best we've seen so far. 204 | if not best_val_loss or val_loss < best_val_loss: 205 | with open(args.save, 'wb') as f: 206 | torch.save(model, f) 207 | best_val_loss = val_loss 208 | else: 209 | # Anneal the learning rate if no improvement has been seen in the validation dataset. 210 | lr /= 4.0 211 | except KeyboardInterrupt: 212 | print('-' * 89) 213 | print('Exiting from training early') 214 | 215 | # Load the best saved model. 
216 | with open(args.save, 'rb') as f:
217 |     model = torch.load(f)
218 |     # After loading, the RNN parameters are not stored in a contiguous chunk of
219 |     # memory; this call makes them contiguous, which speeds up the forward pass.
220 |     model.rnn.flatten_parameters()
221 | 
222 | # Run on test data.
223 | test_loss = evaluate(test_data)
224 | print('=' * 89)
225 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
226 |     test_loss, math.exp(test_loss)))
227 | print('=' * 89)
228 | 
229 | if len(args.onnx_export) > 0:
230 |     # Export the model in ONNX format.
231 |     export_onnx(args.onnx_export, batch_size=1, seq_len=args.bptt)
232 | 
--------------------------------------------------------------------------------
/examples/language-model/model.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 | 
3 | class RNNModel(nn.Module):
4 |     """Container module with an encoder, a recurrent module, and a decoder."""
5 | 
6 |     def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
7 |         super(RNNModel, self).__init__()
8 |         self.drop = nn.Dropout(dropout)
9 |         self.encoder = nn.Embedding(ntoken, ninp)
10 |         if rnn_type in ['LSTM', 'GRU']:
11 |             self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
12 |         else:
13 |             try:
14 |                 nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
15 |             except KeyError:
16 |                 raise ValueError("""An invalid option for `--model` was supplied,
17 |                                  options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
18 |             self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
19 |         self.decoder = nn.Linear(nhid, ntoken)
20 | 
21 |         # Optionally tie weights as in:
22 |         # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
23 |         # https://arxiv.org/abs/1608.05859
24 |         # and
25 |         # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al.
2016) 26 | # https://arxiv.org/abs/1611.01462 27 | if tie_weights: 28 | if nhid != ninp: 29 | raise ValueError('When using the tied flag, nhid must be equal to emsize') 30 | self.decoder.weight = self.encoder.weight 31 | 32 | self.init_weights() 33 | 34 | self.rnn_type = rnn_type 35 | self.nhid = nhid 36 | self.nlayers = nlayers 37 | 38 | def init_weights(self): 39 | initrange = 0.1 40 | self.encoder.weight.data.uniform_(-initrange, initrange) 41 | self.decoder.bias.data.zero_() 42 | self.decoder.weight.data.uniform_(-initrange, initrange) 43 | 44 | def forward(self, input, hidden): 45 | emb = self.drop(self.encoder(input)) 46 | output, hidden = self.rnn(emb, hidden) 47 | output = self.drop(output) 48 | decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2))) 49 | return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden 50 | 51 | def init_hidden(self, bsz): 52 | weight = next(self.parameters()) 53 | if self.rnn_type == 'LSTM': 54 | return (weight.new_zeros(self.nlayers, bsz, self.nhid), 55 | weight.new_zeros(self.nlayers, bsz, self.nhid)) 56 | else: 57 | return weight.new_zeros(self.nlayers, bsz, self.nhid) 58 | -------------------------------------------------------------------------------- /examples/language-model/run-gpu-job-version-2.bash: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | echo "Running on host $HOSTNAME" 4 | echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" 5 | echo "nvidia-smi output:" 6 | nvidia-smi 7 | 8 | singularity exec --nv /scratch365/$USER/version-2.sif "$@" 9 | -------------------------------------------------------------------------------- /examples/language-model/run-gpu-job-version-3.bash: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | echo "Running on host $HOSTNAME" 4 | echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" 5 | echo "nvidia-smi output:" 6 | nvidia-smi 7 | 8 | singularity exec --nv /scratch365/$USER/version-3.sif pipenv run "$@" 9 | -------------------------------------------------------------------------------- /examples/language-model/submit-gpu-job-version-2.bash: -------------------------------------------------------------------------------- 1 | # The flag `-l gpu_card=1` is necessary when using the GPU queue. 2 | qsub \ 3 | -N singularity-tutorial-version-2 \ 4 | -o output-version-2.txt \ 5 | -q gpu \ 6 | -l gpu_card=1 \ 7 | run-gpu-job-version-2.bash \ 8 | python3 main.py --cuda --epochs 6 9 | -------------------------------------------------------------------------------- /examples/language-model/submit-gpu-job-version-3.bash: -------------------------------------------------------------------------------- 1 | # The flag `-l gpu_card=1` is necessary when using the GPU queue. 2 | qsub \ 3 | -N singularity-tutorial-version-3 \ 4 | -o output-version-3.txt \ 5 | -q gpu \ 6 | -l gpu_card=1 \ 7 | run-gpu-job-version-3.bash \ 8 | python main.py --cuda --epochs 6 9 | -------------------------------------------------------------------------------- /examples/language-model/version-2.def: -------------------------------------------------------------------------------- 1 | Bootstrap: docker 2 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 3 | 4 | %post 5 | # Downloads the latest package lists (important). 6 | apt-get update -y 7 | # Runs apt-get while ensuring that there are no user prompts that would 8 | # cause the build process to hang. 9 | # python3-tk is required by matplotlib. 
10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
11 |     python3 \
12 |     python3-tk \
13 |     python3-pip \
14 |     python3-setuptools
15 | # Reduce the size of the image by deleting the package lists we downloaded,
16 | # which are useless now.
17 | rm -rf /var/lib/apt/lists/*
18 | # Install Python modules.
19 | pip3 install torch numpy matplotlib
--------------------------------------------------------------------------------
/examples/language-model/version-3.def:
--------------------------------------------------------------------------------
1 | BootStrap: docker
2 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
3 | 
4 | %post
5 |     # Downloads the latest package lists (important).
6 |     apt-get update -y
7 |     # Runs apt-get while ensuring that there are no user prompts that would
8 |     # cause the build process to hang.
9 |     # python3-tk is required by matplotlib.
10 |     # python3-dev is needed to install some packages.
11 |     DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
12 |         python3 \
13 |         python3-tk \
14 |         python3-pip \
15 |         python3-dev
16 |     # Reduce the size of the image by deleting the package lists we downloaded,
17 |     # which are useless now.
18 |     rm -rf /var/lib/apt/lists/*
19 |     # Install Pipenv.
20 |     pip3 install pipenv
21 | 
22 | %environment
23 |     # Pipenv requires a certain terminal encoding.
24 |     export LANG=C.UTF-8
25 |     export LC_ALL=C.UTF-8
26 |     # This configures Pipenv to store the packages in the current working
27 |     # directory.
28 |     export PIPENV_VENV_IN_PROJECT=1
29 | 
--------------------------------------------------------------------------------
/examples/xor/train_xor.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | 
3 | import matplotlib.pyplot as plt
4 | from matplotlib.ticker import MaxNLocator
5 | import numpy
6 | import torch
7 | 
8 | def data_to_tensor_pair(data, device):
9 |     x = torch.tensor([x for x, y in data], device=device)
10 |     y = torch.tensor([y for x, y in data], device=device)
11 |     return x, y
12 | 
13 | def evaluate_model(model, x, y):
14 |     model.eval()
15 |     with torch.no_grad():
16 |         y_pred = model(x)
17 |         return compute_accuracy(y_pred, y)
18 | 
19 | def compute_accuracy(predictions, expected):
20 |     correct = 0
21 |     total = 0
22 |     for y_pred, y in zip(predictions, expected):
23 |         correct += round(y_pred.item()) == round(y.item())
24 |         total += 1
25 |     return correct / total
26 | 
27 | def construct_model(hidden_units, num_layers):
28 |     layers = []
29 |     prev_layer_size = 2
30 |     for layer_no in range(num_layers):
31 |         layers.extend([
32 |             torch.nn.Linear(prev_layer_size, hidden_units),
33 |             torch.nn.Tanh()
34 |         ])
35 |         prev_layer_size = hidden_units
36 |     layers.extend([
37 |         torch.nn.Linear(prev_layer_size, 1),
38 |         torch.nn.Sigmoid()
39 |     ])
40 |     return torch.nn.Sequential(*layers)
41 | 
42 | def main():
43 | 
44 |     parser = argparse.ArgumentParser()
45 |     parser.add_argument('--iterations', type=int, default=100)
46 |     parser.add_argument('--learning-rate', type=float, default=1.0)
47 |     parser.add_argument('--output')
48 |     args = parser.parse_args()
49 | 
50 |     if torch.cuda.is_available():
51 |         print('CUDA is available -- using GPU')
52 |         device = torch.device('cuda')
53 |     else:
54 |         print('CUDA is NOT available -- using CPU')
55 |         device = torch.device('cpu')
56 | 
57 |     # Define our toy training set for the XOR function.
58 | training_data = data_to_tensor_pair([ 59 | ([0.0, 0.0], [0.0]), 60 | ([0.0, 1.0], [1.0]), 61 | ([1.0, 0.0], [1.0]), 62 | ([1.0, 1.0], [0.0]) 63 | ], device) 64 | 65 | # Define our model. Use default initialization. 66 | model = construct_model(hidden_units=10, num_layers=2) 67 | model.to(device) 68 | 69 | loss_values = [] 70 | accuracy_values = [] 71 | optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate) 72 | criterion = torch.nn.MSELoss() 73 | for iter_no in range(args.iterations): 74 | print('iteration #{}'.format(iter_no + 1)) 75 | # Perform a parameter update. 76 | model.train() 77 | optimizer.zero_grad() 78 | x, y = training_data 79 | y_pred = model(x) 80 | loss = criterion(y_pred, y) 81 | loss_value = loss.item() 82 | print(' loss: {}'.format(loss_value)) 83 | loss_values.append(loss_value) 84 | loss.backward() 85 | optimizer.step() 86 | # Evaluate the model. 87 | accuracy = evaluate_model(model, x, y) 88 | print(' accuracy: {:.2%}'.format(accuracy)) 89 | accuracy_values.append(accuracy) 90 | 91 | if args.output is not None: 92 | print('saving model to {}'.format(args.output)) 93 | torch.save(model.state_dict(), args.output) 94 | 95 | # Plot loss and accuracy. 96 | fig, ax = plt.subplots() 97 | ax.set_title('Loss and Accuracy vs. Iterations') 98 | ax.set_ylabel('Loss') 99 | ax.set_xlabel('Iteration') 100 | ax.set_xlim(left=1, right=len(loss_values)) 101 | ax.set_ylim(bottom=0.0, auto=None) 102 | ax.xaxis.set_major_locator(MaxNLocator(integer=True)) 103 | x_array = numpy.arange(1, len(loss_values) + 1) 104 | loss_y_array = numpy.array(loss_values) 105 | left_plot = ax.plot(x_array, loss_y_array, '-', label='Loss') 106 | right_ax = ax.twinx() 107 | right_ax.set_ylabel('Accuracy') 108 | right_ax.set_ylim(bottom=0.0, top=1.0) 109 | accuracy_y_array = numpy.array(accuracy_values) 110 | right_plot = right_ax.plot(x_array, accuracy_y_array, '--', label='Accuracy') 111 | lines = left_plot + right_plot 112 | ax.legend(lines, [line.get_label() for line in lines]) 113 | plt.show() 114 | 115 | if __name__ == '__main__': 116 | main() 117 | -------------------------------------------------------------------------------- /examples/xor/version-1.def: -------------------------------------------------------------------------------- 1 | Bootstrap: library 2 | From: ubuntu:18.04 3 | 4 | %post 5 | # These first few commands allow us to find the python3-pip package later 6 | # on. 7 | apt-get update -y 8 | # Using "noninteractive" mode runs apt-get while ensuring that there are 9 | # no user prompts that would cause the `singularity build` command to hang. 10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 11 | software-properties-common 12 | add-apt-repository universe 13 | # Downloads the latest package lists (important). 14 | apt-get update -y 15 | # python3-tk is required by matplotlib. 16 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \ 17 | python3 \ 18 | python3-tk \ 19 | python3-pip \ 20 | python3-distutils \ 21 | python3-setuptools 22 | # Reduce the size of the image by deleting the package lists we downloaded, 23 | # which are useless now. 24 | rm -rf /var/lib/apt/lists/* 25 | # Install Python modules. 
26 | pip3 install torch numpy matplotlib 27 | -------------------------------------------------------------------------------- /images/plot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bdusell/singularity-tutorial/f8634a2c4eb5d4ba4a86dd6a9de880cc6ca620a9/images/plot.png -------------------------------------------------------------------------------- /slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bdusell/singularity-tutorial/f8634a2c4eb5d4ba4a86dd6a9de880cc6ca620a9/slides.pdf --------------------------------------------------------------------------------