├── .gitignore
├── README-ND.md
├── README.md
├── examples
│   ├── language-model
│   │   ├── LICENSE
│   │   ├── README.md
│   │   ├── data.py
│   │   ├── data
│   │   │   └── wikitext-2
│   │   │       ├── README
│   │   │       ├── test.txt
│   │   │       ├── train.txt
│   │   │       └── valid.txt
│   │   ├── generate.py
│   │   ├── main.py
│   │   ├── model.py
│   │   ├── run-gpu-job-version-2.bash
│   │   ├── run-gpu-job-version-3.bash
│   │   ├── submit-gpu-job-version-2.bash
│   │   ├── submit-gpu-job-version-3.bash
│   │   ├── version-2.def
│   │   └── version-3.def
│   └── xor
│       ├── train_xor.py
│       └── version-1.def
├── images
│   └── plot.png
└── slides.pdf
/.gitignore:
--------------------------------------------------------------------------------
1 | *.sif
2 | *.pt
3 | *.swp
4 | __pycache__/
5 | .venv/
6 |
--------------------------------------------------------------------------------
/README-ND.md:
--------------------------------------------------------------------------------
1 | # How to Install Literally Anything Using Containers
2 |
3 | Brian DuSell
4 | Apr 9, 2019
5 |
6 | Grad Tutorial Talk
7 | Dept. of Computer Science and Engineering
8 | University of Notre Dame
9 |
10 | ## Abstract
11 |
12 | Have you ever spent an inordinate amount of time trying to install
13 | something on the CRC without root privileges? Have you ever wrecked your
14 | computer trying to update CUDA? Have you ever wished you could install two
15 | versions of the same package at once? If so, containers may be what's missing
16 | in your life. In this talk, I will show you how to install software using
17 | Singularity, a container system that allows you to install software in a fully
18 | portable, self-contained Linux environment where you have full administrative
19 | rights. Singularity can be installed on any Linux machine (with techniques
20 | available for running it on Windows and Mac) and is available on the CRC, thus
21 | ensuring that your code runs in a consistent environment no matter which
22 | machine you run it on. Singularity is compatible with Docker images and lets
23 | you effortlessly install any CUDA version of your choosing provided that your
24 | Nvidia drivers have been set up properly. My tutorial will consist of walking
25 | you through using Singularity to run a GPU-accelerated PyTorch program for deep
26 | learning on the CRC. Note: If you want to follow along, please ensure that you
27 | have a directory under the `/scratch365` directory on the CRC's filesystem.
28 |
29 | ## Introduction
30 |
31 | This tutorial will introduce you to [Singularity](https://www.sylabs.io/singularity/),
32 | a containerization system for scientific computing environments that is
33 | available on Notre Dame's CRC computing cluster. Containers allow you to
34 | package the environment that your code depends on inside of a portable unit.
35 | This is extremely useful for ensuring that your code can be run portably
36 | on other machines. It is also useful for installing software, packages,
37 | libraries, etc. in environments where you do not have root privileges, like the
38 | CRC. I will show you how to install PyTorch with GPU support inside of a
39 | container and run a simple PyTorch program to train a neural net.
40 |
41 | ## The Portability Problem
42 |
43 | The programs we write depend on external environments, whether that environment
44 | is explicitly documented or not. A Python program assumes that a Python
45 | interpreter is available on the system it is run on. A Python program that uses
46 | set comprehension syntax, e.g.
47 |
48 | ```python
49 | { x * 2 for x in range(10) }
50 | ```
51 |
52 | assumes that you're using Python 3. A Python program that uses the function
53 | `subprocess.run()` assumes that you're using at least version 3.5. A Python
54 | program that calls `subprocess.run(['grep', '-r', 'foo', './my/directory'])`
55 | assumes that you're running on a \*nix system where the program `grep` is
56 | available.
57 |
58 | When these dependencies are undocumented, it can become painful to run a
59 | program in an environment that is different from the one it was developed in.
60 | It would be nice to have a way to package a program together with its
61 | environment, and then run that program on any machine.
62 |
63 | ## The Installation Problem
64 |
65 | The CRC is a shared scientific computing environment with a shared file system.
66 | This means that users do not have root privileges and cannot use a package
67 | manager like `yum` or `apt-get` to install new libraries. If you want to
68 | install something on the CRC that is not already there, you have a few options:
69 |
70 | * If it is a major library, ask the staff to install/update it for you
71 | * Install it in your home directory (e.g. `pip install --user` for Python
72 | modules) or other non-standard directory
73 | * Compile it yourself in your home directory
74 |
75 | While it is almost always possible to re-compile a library yourself without
76 | root privileges, it can be very time-consuming. This is especially true when
77 | the library depends on other libraries that also need to be re-compiled,
78 | leading to a tedious search for just the right configuration to stitch them all
79 | together. CUDA also complicates the situation, as certain deep learning
80 | libraries need to be built on a node that has a GPU (even though the GPU is
81 | never used during compilation!).
82 |
83 | Finally, sometimes you deliberately want to install an older version of a
84 | package. But unless you set up two isolated installations, this could conflict
85 | with projects that still require the newer versions.
86 |
87 | To take an extreme (but completely real!) example, older versions of the deep
88 | learning library [DyNet](https://dynet.readthedocs.io/en/latest/) could only be
89 | built with an old version of GCC, and moreover needed to be compiled on a GPU
90 | node with the CRC's CUDA module loaded in order to work properly. In May 2018,
91 | the CRC removed the required version of GCC. This meant that if you wanted to
92 | install or update DyNet, you needed to re-compile that version of GCC yourself
93 | *and* figure out how to configure DyNet to build itself with a compiler in a
94 | non-standard location.
95 |
96 | ## The Solution: Containers
97 |
98 | Containers are a software isolation technique that has exploded in popularity
99 | in recent years, particularly thanks to [Docker](https://www.docker.com/).
100 | A container, like a virtual machine, is an operating system within an operating
101 | system. Unlike a virtual machine, however, it shares the kernel with the host
102 | operating system, so it incurs no performance penalty for translating machine
103 | instructions. Instead, containers rely on special system calls that allow the
104 | host to spoof the filesystem and network that the container has access to,
105 | making it appear from inside the container that it exists in a separate
106 | environment.
107 |
108 | Today we will be talking about an alternative to Docker called Singularity,
109 | which is more suitable for scientific computing environments (Docker is better
110 | suited for things like cloud applications, and there are reasons why it would
111 | not be ideal for a shared environment like the CRC). The CRC currently offers
112 | [Singularity 3.0](https://www.sylabs.io/guides/3.0/user-guide/), which is
113 | available via the `singularity` command.
114 |
115 | Singularity containers are instantiated from **images**, which are files that
116 | define the container's environment. The container's "root" file system is
117 | distinct from that of the host operating system, so you can install whatever
118 | software you like as if you were the root user. Installing software via the
119 | built-in package manager is now an option again. Not only this, but you can
120 | also choose a pre-made image to base your container on. Singularity is
121 | compatible with Docker images (a very deliberate design decision), so it can
122 | take advantage of the extremely rich selection of production-grade Docker
123 | images that are available. For example, there are pre-made images for fresh
124 | installations of Ubuntu, Python, TensorFlow, PyTorch, and even CUDA. For
125 | virtually all major libraries, getting a pre-made image for X is as simple as
126 | Googling "X docker" and taking note of the name of the image.
127 |
128 | Also, because your program's environment is self-contained, it is not affected
129 | by changes to the CRC's software and is no longer susceptible to "software
130 | rot." There is also no longer a need to rely on the CRC's modules via `module
131 | load`. Because the container is portable, it will also run just as well on your
132 | local machine as on the CRC. In the age of containers, "it runs on my machine"
133 | is no longer an excuse.
134 |
135 | ## Basic Workflow
136 |
137 | Singularity instantiates containers from images that define their environment.
138 | Singularity images are stored in `.sif` files. You build a `.sif` file by
139 | defining your environment in a text file and providing that definition to the
140 | command `singularity build`.
141 |
142 | Building an image file does require root privileges, so it is most convenient
143 | to build the image on your local machine or workstation and then copy it to
144 | your `/scratch365` directory in the CRC. The reason it requires root is because
145 | the kernel is shared, and user permissions are implemented in the kernel. So if
146 | you want to do something in the container as root, you actually need to *be*
147 | root on the host when you do it.
148 |
149 | There is also an option to build an image without root privileges. This works
150 | by sending your definition to a remote server and building the image there,
151 | though I have had difficulty getting this to work.
152 |
153 | Once you've uploaded your image to the CRC, you can submit a batch job that
154 | runs `singularity exec` with the image file you created and the command you
155 | want to run. That's it!
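
As a concrete sketch of the whole workflow (the definition file, image name, and
script names here are hypothetical placeholders):

```bash
# On your local machine, where you have root privileges:
sudo singularity build my-image.sif my-image.def
# Copy the image to your scratch365 directory on the CRC.
scp my-image.sif yournetid@crcfe01.crc.nd.edu:/scratch365/yournetid/
# On the CRC (typically inside a batch job), run your command in a container.
singularity exec /scratch365/yournetid/my-image.sif python3 my_script.py
```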
156 |
157 | ## A Simple PyTorch Program
158 |
159 | I have included a PyTorch program,
160 | [`train_xor.py`](examples/xor/train_xor.py),
161 | that trains a neural network to compute the XOR function and then plots the
162 | loss as a function of training time. It can also save the model to a file. It
163 | depends on the Python modules `torch`, `numpy`, and `matplotlib`.
164 |
165 | ## Installing Singularity
166 |
167 | [Singularity 3.0](https://www.sylabs.io/guides/3.0/user-guide/index.html)
168 | is already available on the CRC via the `singularity` command.
169 |
170 | As for installing Singularity locally, the Singularity docs include detailed
171 | instructions for installing Singularity on major operating systems
172 | [here](https://www.sylabs.io/guides/3.0/user-guide/installation.html).
173 | Installing Singularity is not necessary for following the tutorial in real
174 | time, as I will provide you with pre-built images.
175 |
176 | ## Defining an Image
177 |
178 | The first step in defining an image is picking which base image to use. This
179 | can be a Linux distribution, such as Ubuntu, or an image with a library
180 | pre-installed, like one of PyTorch's
181 | [official Docker images](https://hub.docker.com/r/pytorch/pytorch/tags). Since
182 | our program depends on more than just PyTorch, let's start with a plain Ubuntu
183 | image and build up from there.
184 |
185 | Let's start with the basic syntax for definition files, which is documented
186 | [here](https://www.sylabs.io/guides/3.0/user-guide/definition_files.html).
187 | The first part of the file is the header, where we define the base image and
188 | other meta-information. The only required keyword in the header is `Bootstrap`,
189 | which defines the type of image being imported. Using `Bootstrap: library`
190 | means that we are importing a library from the official
191 | [Singularity Library](https://cloud.sylabs.io/library).
192 | Using `Bootstrap: docker` means that we are importing a Docker image from a
193 | Docker registry such as
194 | [Docker Hub](https://hub.docker.com/).
195 | Let's import the official
196 | [Ubuntu 18.04](https://cloud.sylabs.io/library/_container/5baba99394feb900016ea433)
197 | image.
198 |
199 | ```
200 | Bootstrap: library
201 | From: ubuntu:18.04
202 | ```
203 |
204 | The rest of the definition file is split up into several **sections** which
205 | serve special roles. The `%post` section defines a series of commands to be run
206 | while the image is being built, inside of a container as the root user. This
207 | is typically where you install packages. The `%environment` section defines
208 | environment variables that are set when the image is instantiated as a
209 | container. The `%files` section lets you copy files into the image. There are
210 | [many other types of section](https://www.sylabs.io/guides/3.0/user-guide/definition_files.html#sections).
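
For illustration, here is a minimal skeleton (the file names and values are
placeholders, not part of our actual image) showing how these sections fit
together:

```
Bootstrap: library
From: ubuntu:18.04

%files
    # Copy a file from the host into the image at build time.
    requirements.txt /opt/requirements.txt

%post
    # Commands run as root inside the container while the image is built.
    apt-get update -y

%environment
    # Variables set whenever the image is instantiated as a container.
    export LC_ALL=C.UTF-8
```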
211 |
212 | Let's use the `%post` section to install all of our requirements using
213 | `apt-get` and `pip3`.
214 |
215 | ```
216 | %post
217 | # Downloads the latest package lists (important).
218 | apt-get update -y
219 | # Runs apt-get while ensuring that there are no user prompts that would
220 | # cause the build process to hang.
221 | # python3-tk is required by matplotlib.
222 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
223 | python3 \
224 | python3-tk \
225 | python3-pip
226 | # Reduce the size of the image by deleting the package lists we downloaded,
227 | # which are no longer needed.
228 | rm -rf /var/lib/apt/lists/*
229 | # Install Python modules.
230 | pip3 install torch numpy matplotlib
231 | ```
232 |
233 | Each line defines a separate command (lines can be continued with a `\`).
234 | Unlike normal shell scripts, the build will be aborted as soon as one of the
235 | commands fails. You do not need to connect the commands with `&&`.
236 |
237 | The final build definition is in the file
238 | [version-1.def](examples/xor/version-1.def).
239 |
240 | ## Building an Image
241 |
242 | Supposing we are on our own Ubuntu machine, we can build this definition into
243 | a `.sif` image file using the following command:
244 |
245 | ```bash
246 | cd examples/xor
247 | sudo singularity build version-1.sif version-1.def
248 | ```
249 |
250 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-1.html)
251 |
252 | This ran the commands we defined in the `%post` section inside a container and
253 | afterwards saved the state of the container in the image `version-1.sif`.
254 |
255 | ## Running an Image
256 |
257 | Let's run our PyTorch program in a container based on the image we just built.
258 |
259 | ```bash
260 | singularity exec version-1.sif python3 train_xor.py --output model.pt
261 | ```
262 |
263 | This program does not take long to run. Once it finishes, it should open a
264 | window with a plot of the model's loss and accuracy over time.
265 |
266 | [Watch the asciicast](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5)
267 |
268 | 
269 |
270 | The trained model should also be saved in the file `model.pt`. Note that even
271 | though the program ran in a container, it was able to write a file to the host
272 | file system that remained after the program exited and the container was shut
273 | down. If you are familiar with Docker, you probably know that you cannot write
274 | files to the host in this way unless you explicitly **bind mount** two
275 | directories in the host and container file system. Bind mounting makes a file
276 | or directory on the host system synonymous with one in the container.
277 |
278 | For convenience, Singularity
279 | [binds a few important directories by
280 | default](https://www.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html):
281 |
282 | * Your home directory
283 | * The current working directory
284 | * `/tmp`
285 | * `/proc`
286 | * `/sys`
287 | * `/dev`
288 |
289 | You can add to or override these settings if you wish using the
290 | [`--bind` flag](https://www.sylabs.io/guides/3.0/user-guide/bind_paths_and_mounts.html#specifying-bind-paths)
291 | to `singularity exec`. This is important to remember if you want to access a
292 | file that is outside of your home directory on the CRC -- otherwise you may end
293 | up with cryptic permission errors.
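
For example, to make a directory that lives outside your home directory (the
path here is a hypothetical placeholder) visible inside the container:

```bash
# Mount the host directory /path/to/host/data at /data inside the container.
singularity exec --bind /path/to/host/data:/data version-1.sif ls /data
```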
294 |
295 | It is also important to know that, unlike Docker, environment variables are
296 | inherited inside the container for convenience.
297 |
298 | ## Running an Interactive Shell
299 |
300 | You can also open up a shell inside the container and run commands there. You
301 | can `exit` when you're done. Note that since your home directory is
302 | bind-mounted, the shell inside the container will run your shell's startup file
303 | (e.g. `.bashrc`).
304 |
305 | ```
306 | $ singularity shell version-1.sif
307 | Singularity version-1.sif:~/singularity-tutorial/examples/xor> python3 train_xor.py
308 | ```
309 |
310 | ## Running an Image on the CRC
311 |
312 | Let's try running the same image on the CRC. Log in to one of the frontends
313 | using `ssh -X`. The `-X` is necessary to get the plot to appear.
314 |
315 | ```bash
316 | ssh -X yournetid@crcfe01.crc.nd.edu
317 | ```
318 |
319 | Then download the image to your `/scratch365` directory. One gotcha is that the
320 | `.sif` file *must* be stored on the scratch365 device for Singularity to work.
321 |
322 | ```bash
323 | singularity pull /scratch365/$USER/version-1.sif library://brian/default/singularity-tutorial:version-1
324 | ```
325 |
326 | Next, download the code to your home directory.
327 |
328 | ```bash
329 | git clone https://github.com/bdusell/singularity-tutorial.git ~/singularity-tutorial
330 | ```
331 |
332 | Run the program.
333 |
334 | ```bash
335 | cd ~/singularity-tutorial/examples/xor
336 | singularity exec /scratch365/$USER/version-1.sif python3 train_xor.py
337 | ```
338 |
339 | The same plot as before should show up. Note that it is not
340 | possible to do this using the Python installations provided by the CRC, since
341 | they do not include Tk, which is required by matplotlib. I have found this
342 | extremely useful for making plots from data I have stored on the CRC without
343 | needing to download the data to another machine.
344 |
345 | ## A Beefier PyTorch Program
346 |
347 | As an example of a program that benefits from GPU acceleration, we will be
348 | running the official
349 | [`word_language_model`](https://github.com/pytorch/examples/tree/master/word_language_model)
350 | example PyTorch program, which I have included at
351 | [`examples/language-model`](examples/language-model).
352 | This program trains an
353 | [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)
354 | [language model](https://en.wikipedia.org/wiki/Language_model)
355 | on a corpus of Wikipedia text.
356 |
357 | ## Adding GPU Support
358 |
359 | In order to add GPU support, we need to include CUDA in our image. In
360 | Singularity, this is delightfully simple. We just need to pick one of
361 | [Nvidia's official Docker images](https://hub.docker.com/r/nvidia/cuda)
362 | to base our image on. Again, the easiest way to install library X is often to
363 | Google "X docker" and pick an image from the README or tags page on Docker Hub.
364 |
365 | The README lists several tags. They tend to indicate variants of the image that
366 | have different components and different versions of things installed. Let's
367 | pick the one that is based on CUDA 10.1, uses Ubuntu 18.04, and includes cuDNN
368 | (which PyTorch can leverage for highly optimized neural network operations).
369 | Let's also pick the `devel` version, since PyTorch needs to compile itself in
370 | the container. This is the image tagged `10.1-cudnn7-devel-ubuntu18.04`. Since
371 | the image comes from the nvidia/cuda repository, the full image name is
372 | `nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04`.
373 |
374 | Our definition file now looks
375 | [like this](examples/language-model/version-2.def). Although we don't need it,
376 | I kept matplotlib for good measure.
377 |
378 | ```
379 | Bootstrap: docker
380 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
381 |
382 | %post
383 | # Downloads the latest package lists (important).
384 | apt-get update -y
385 | # Runs apt-get while ensuring that there are no user prompts that would
386 | # cause the build process to hang.
387 | # python3-tk is required by matplotlib.
388 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
389 | python3 \
390 | python3-tk \
391 | python3-pip
392 | # Reduce the size of the image by deleting the package lists we downloaded,
393 | # which are useless now.
394 | rm -rf /var/lib/apt/lists/*
395 | # Install Python modules.
396 | pip3 install torch numpy matplotlib
397 | ```
398 |
399 | We build the image as usual.
400 |
401 | ```bash
402 | cd examples/language-model
403 | sudo singularity build version-2.sif version-2.def
404 | ```
405 |
406 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-2.html)
407 |
408 | We run the image like before, except that we have to add the `--nv` flag to
409 | allow the container to access the Nvidia drivers on the host in order to use
410 | the GPU. That's all we need to get GPU support working. Not bad!
411 |
412 | This program takes a while to run. Do not run it on the CRC frontend. When I
413 | run one epoch on my workstation (which has a GPU), the output looks like this:
414 |
415 | ```
416 | $ singularity exec --nv version-2.sif python3 main.py --cuda --epochs 1
417 | | epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 45.68 | loss 7.63 | ppl 2050.44
418 | | epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 6.85 | ppl 945.93
419 | | epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 45.03 | loss 6.48 | ppl 653.61
420 | | epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 46.43 | loss 6.29 | ppl 541.05
421 | | epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 45.50 | loss 6.14 | ppl 464.91
422 | | epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 44.99 | loss 6.06 | ppl 429.36
423 | | epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 45.27 | loss 5.95 | ppl 382.01
424 | | epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 45.09 | loss 5.95 | ppl 382.31
425 | | epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 45.25 | loss 5.80 | ppl 330.43
426 | | epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 45.08 | loss 5.78 | ppl 324.42
427 | | epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 5.66 | ppl 288.16
428 | | epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 45.14 | loss 5.67 | ppl 291.00
429 | | epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 45.21 | loss 5.66 | ppl 287.51
430 | | epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 45.02 | loss 5.54 | ppl 255.68
431 | -----------------------------------------------------------------------------------------
432 | | end of epoch 1 | time: 140.54s | valid loss 5.54 | valid ppl 254.69
433 | -----------------------------------------------------------------------------------------
434 | =========================================================================================
435 | | End of training | test loss 5.46 | test ppl 235.49
436 | =========================================================================================
437 | ```
438 |
439 | ## Running a GPU Program on the CRC
440 |
441 | Finally, I will show you how to run this program on the CRC's GPU queue.
442 | This image is too big to be hosted on the Singularity Library, so you need to
443 | copy it from my home directory. We will address this size issue later on.
444 |
445 | ```bash
446 | cp /afs/crc.nd.edu/user/b/bdusell1/Public/singularity-tutorial/version-2.sif /scratch365/$USER/version-2.sif
447 | ```
448 |
449 | Then, submit a job to run this program on the GPU queue. For convenience, I've
450 | included a script to run the submission command.
451 |
452 | ```bash
453 | cd ~/singularity-tutorial/examples/language-model
454 | bash submit-gpu-job-version-2.bash
455 | ```
456 |
457 | Check back in a while to verify that the job completed successfully. The output
458 | will be written to `output-version-2.txt`.
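
For reference, the two scripts boil down to submitting a job script that runs
`singularity exec` with the `--nv` flag. A rough sketch of the idea is below;
the scheduler directives are assumptions based on the CRC's SGE-style queue and
may differ from the actual scripts.

```bash
#!/bin/bash
#$ -q gpu        # Submit to the GPU queue (directive syntax is an assumption).
#$ -l gpu_card=1 # Request a single GPU card.

singularity exec --nv /scratch365/$USER/version-2.sif \
    python3 main.py --cuda --epochs 6
```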
459 |
460 | Something you should keep in mind is that, by default, if there are multiple
461 | GPUs available on the system, PyTorch grabs the first one it sees (some
462 | toolkits grab all of them). However, the CRC assigns each job its own GPU,
463 | which is not necessarily the one that PyTorch would pick. If PyTorch does not
464 | respect this assignment, there can be contention among different jobs. You can
465 | control which GPUs PyTorch has access to using the environment variable
466 | `CUDA_VISIBLE_DEVICES`, which can be set to a comma-separated list of numbers.
467 | The CRC now sets this environment variable automatically, and since Singularity
468 | inherits environment variables, you actually don't need to do anything. It's
469 | just something you should know about, since there is potential for abuse.
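
If you ever do need to set it yourself, it looks like this:

```bash
# Make only GPU 0 visible to PyTorch (use commas to list multiple devices).
CUDA_VISIBLE_DEVICES=0 singularity exec --nv version-2.sif \
    python3 main.py --cuda --epochs 1
```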
470 |
471 | ## Separating Python modules from the image
472 |
473 | Now that you know the basics of how to run a GPU job on the CRC, here's a tip
474 | for managing Python modules. There's a problem with our current workflow.
475 | Every time we want to install a new Python library, we have to re-build the
476 | image. We should only need to re-build the image when we install a package with
477 | `apt-get` or inherit from a different base image -- in other words, actions
478 | that require root privileges. It would be nice if we could store our Python
479 | libraries in the current working directory using a **package manager**, and
480 | rely on the image only for the basic Ubuntu/CUDA/Python environment.
481 |
482 | [Pipenv](https://github.com/pypa/pipenv) is a package manager for Python. It's
483 | like the Python equivalent of npm (Node.js package manager) or Bundler (Ruby
484 | package manager). It keeps track of the libraries your project depends on in text files
485 | named `Pipfile` and `Pipfile.lock`, which you can commit to version control in
486 | lieu of the massive libraries themselves. Every time you run `pipenv install
487 | <package>`, Pipenv will update the `Pipfile` and download the library locally.
488 | The important thing is that, rather than putting the library in a system-wide
489 | location, Pipenv installs the library in a *local* directory called `.venv`.
490 | The benefit of this is that the libraries are stored *with* your project, but
491 | they are not part of the image. The image is merely the vehicle for running
492 | them.
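
For instance, after installing `torch`, `numpy`, and `matplotlib`, the
generated `Pipfile` looks roughly like this (abbreviated):

```
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
torch = "*"
numpy = "*"
matplotlib = "*"
```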
493 |
494 | Here is the
495 | [new version](examples/language-model/version-3.def)
496 | of our definition file:
497 |
498 | ```
499 | Bootstrap: docker
500 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
501 |
502 | %post
503 | # Downloads the latest package lists (important).
504 | apt-get update -y
505 | # Runs apt-get while ensuring that there are no user prompts that would
506 | # cause the build process to hang.
507 | # python3-tk is required by matplotlib.
508 | # python3-dev is needed to install some packages.
509 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
510 | python3 \
511 | python3-tk \
512 | python3-pip \
513 | python3-dev
514 | # Reduce the size of the image by deleting the package lists we downloaded,
515 | # which are useless now.
516 | rm -rf /var/lib/apt/lists/*
517 | # Install Pipenv.
518 | pip3 install pipenv
519 |
520 | %environment
521 | # Pipenv requires a certain terminal encoding.
522 | export LANG=C.UTF-8
523 | export LC_ALL=C.UTF-8
524 | # This configures Pipenv to store the packages in the current working
525 | # directory.
526 | export PIPENV_VENV_IN_PROJECT=1
527 | ```
528 |
529 | On the CRC, download the new image.
530 |
531 | ```bash
532 | singularity pull /scratch365/$USER/version-3.sif library://brian/default/singularity-tutorial:version-3
533 | ```
534 |
535 | Now we can use the container to install our Python libraries into the current
536 | working directory. We do this by running `pipenv install`.
537 |
538 | ```bash
539 | singularity exec /scratch365/$USER/version-3.sif pipenv install torch numpy matplotlib
540 | ```
541 |
542 | [Watch the asciicast](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo)
543 |
544 | This may take a while. When it is finished, it will have installed the
545 | libraries in a directory named `.venv`. The benefit of installing packages like
546 | this is that you can install new ones without re-building the image, and you
547 | can re-use the image for multiple projects. The `.sif` file is smaller too.
548 |
549 | When you're done, you can test it out by submitting a GPU job. If you look at
550 | the script, you will see that we replace the `python3` command with `pipenv run
551 | python`, which runs the program inside the environment that Pipenv manages.
552 |
553 | ```bash
554 | bash submit-gpu-job-version-3.bash
555 | ```
556 |
557 | ## Docker
558 |
559 | If this container stuff interests you, you might be interested in
560 | [Docker](https://www.docker.com/)
561 | too. Docker is not available on the CRC, but it may prove useful elsewhere.
562 | For example, I've used it to compile PyTorch from source before. Docker has
563 | its own set of idiosyncrasies, but a good place to start is the
564 | [Docker documentation](https://docs.docker.com/).
565 |
566 | This would be a good time to plug my
567 | [dockerdev](https://github.com/bdusell/dockerdev)
568 | project, which is a bash library that sets up a streamlined workflow for using
569 | Docker containers as development environments.
570 |
571 | ## Conclusion
572 |
573 | By now I think I have shown you that the sky is the limit when it comes to
574 | containers. Hopefully this will prove useful to your research. If you like, you
575 | can show your appreciation by leaving a star on GitHub. :)
576 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # How to Install Literally Anything: A Practical Guide to Singularity
2 |
3 | Brian DuSell
4 |
5 | Note: For a version of this tutorial specially tailored for the CRC computing
6 | cluster at the University of Notre Dame, please see
7 | [README-ND.md](README-ND.md).
8 |
9 | ## Abstract
10 |
11 | Have you ever spent an inordinate amount of time trying to install something on
12 | your HPC cluster without root privileges? Have you ever wrecked your computer
13 | trying to update CUDA? Have you ever wished you could install two versions of
14 | the same package at once? If so, containers may be what's missing in your life.
15 | In this talk, I will show you how to install software using Singularity, an
16 | HPC-centric container system that, like Docker, allows you to install software
17 | in a portable, self-contained Linux environment where you have full
18 | administrative rights. Singularity can be installed on any Linux machine (with
19 | techniques available for running it on Windows and Mac) and is becoming
20 | increasingly available on HPC clusters, thus ensuring that your code can run
21 | in a consistent environment no matter which machine you run it on. Singularity
22 | is compatible with Docker images and can make installing tricky libraries,
23 | such as CUDA, as simple as pulling a pre-built image. My tutorial will include
24 | a walkthrough of using Singularity to run a GPU-accelerated PyTorch program on
25 | an HPC cluster, as well as general tips for setting up an efficient workflow.
26 |
27 | Watch the screencast [here](https://www.youtube.com/watch?v=D5pe4ewtDe8).
28 |
29 | Slides available [here](slides.pdf).
30 |
31 | ## Introduction
32 |
33 | This tutorial will introduce you to [Singularity](https://www.sylabs.io/singularity/),
34 | a containerization system for scientific computing environments that is
35 | available on many scientific computing clusters. Containers allow you to
36 | package the environment that your code depends on inside of a portable unit.
37 | This is extremely useful for ensuring that your code can be run portably
38 | on other machines. It is also useful for installing software, packages,
39 | libraries, etc. in environments where you do not have root privileges, like an
40 | HPC account. I will show you how to install PyTorch with GPU support inside of
41 | a container and run a simple PyTorch program to train a neural net.
42 |
43 | ## The Portability Problem
44 |
45 | The programs we write depend on external environments, whether that environment
46 | is explicitly documented or not. For example, a Python program assumes that a
47 | Python interpreter is available on the system it is run on.
48 |
49 | ```python
50 | def f(n):
51 | return 2 * n
52 | ```
53 |
54 | However, some Python code requires certain versions of Python. For example,
55 | a Python program that uses set comprehension syntax requires Python 3.
56 |
57 | ```python
58 | def f(n):
59 | return { 2 * x for x in range(n) }
60 | ```
61 |
62 | A Python program that uses the function `subprocess.run()` assumes that you're
63 | using at least version 3.5.
64 |
65 | ```python
66 | import subprocess
67 |
68 | def f():
69 | return subprocess.run(...)
70 | ```
71 |
72 | This Python program additionally assumes that ImageMagick is available on the
73 | system.
74 |
75 | ```python
76 | import subprocess
77 |
78 | def f():
79 | return subprocess.run([
80 | 'convert', 'photo.jpg', '-resize', '50%', 'photo.png'])
81 | ```
82 |
83 | We can go ever deeper down the rabbit hole.
84 |
85 | When these sorts of dependencies are undocumented, it can become painful to run
86 | a program in an environment that is different from the one it was developed in.
87 | It would be nice to have a way to package a program together with its
88 | environment, and then run that program on any machine.
89 |
90 | ## The Installation Problem
91 |
92 | A scientific computing environment typically provides users with an account and
93 | a home directory in a shared file system. This means that users do not have
94 | root privileges and cannot use a package manager like `yum` or `apt-get` to
95 | install new libraries. If you want to install something that is not already
96 | there, you have a few options:
97 |
98 | * If it is a major library, ask the cluster's staff to install/update it for you
99 | * Install it in your home directory (e.g. `pip install --user` for Python
100 | modules) or other non-standard directory
101 | * Compile it yourself in your home directory
102 |
103 | While it is almost always possible to re-compile a library yourself without
104 | root privileges, it can be very time-consuming. This is especially true when
105 | the library depends on other libraries that also need to be re-compiled,
106 | leading to a tedious search for just the right configuration to stitch them all
107 | together. CUDA also complicates the situation, as certain deep learning
108 | libraries need to be built on a node that has a GPU (even though the GPU is
109 | never used during compilation!).
110 |
111 | Finally, sometimes you deliberately want to install an older version of a
112 | package. But unless you set up two isolated installations, this could conflict
113 | with projects that still require the newer versions.
114 |
115 | To take an extreme (but completely real!) example, older versions of the deep
116 | learning library [DyNet](https://dynet.readthedocs.io/en/latest/) could only be
117 | built with an old version of GCC, and moreover needed to be compiled on a GPU
118 | node with my HPC cluster's CUDA module loaded in order to work properly. In May
119 | 2018, the staff removed the required version of GCC. This meant that if you
120 | wanted to install or update DyNet, you needed to re-compile that version of GCC
121 | yourself *and* figure out how to configure DyNet to build itself with a
122 | compiler in a non-standard location.
123 |
124 | ## The Solution: Containers
125 |
126 | Containers are a software isolation technique that has exploded in popularity
127 | in recent years, particularly thanks to [Docker](https://www.docker.com/).
128 | A container, like a virtual machine, is an operating system within an operating
129 | system. Unlike a virtual machine, however, it shares the kernel with the host
130 | operating system, so it incurs no performance penalty for translating machine
131 | instructions. Instead, containers rely on special system calls that allow the
132 | host to spoof the filesystem and network that the container has access to,
133 | making it appear from inside the container that it exists in a separate
134 | environment.
135 |
136 | Today we will be talking about an alternative to Docker called Singularity,
137 | which is more suitable for scientific computing environments (Docker is better
138 | suited for things like cloud applications, and there are reasons why it would
139 | not be ideal for a shared scientific computing environment). Singularity is
140 | customarily available via the `singularity` command.
141 |
142 | Singularity containers are instantiated from **images**, which are files that
143 | define the container's environment. The container's "root" file system is
144 | distinct from that of the host operating system, so you can install whatever
145 | software you like as if you were the root user. Installing software via the
146 | built-in package manager is now an option again. Not only this, but you can
147 | also choose a pre-made image to base your container on. Singularity is
148 | compatible with Docker images (a very deliberate design decision), so it can
149 | take advantage of the extremely rich selection of production-grade Docker
150 | images that are available. For example, there are pre-made images for fresh
151 | installations of Ubuntu, Python, TensorFlow, PyTorch, and even CUDA. For
152 | virtually all major libraries, getting a pre-made image for X is as simple as
153 | Googling "X docker" and taking note of the name of the image.
154 |
155 | Also, because your program's environment is self-contained, it is not affected
156 | by changes to the HPC cluster's software and is no longer susceptible to
157 | "software rot." Because the container is portable, it will also run just as
158 | well on your local machine as on the HPC cluster. In the age of containers,
159 | "it runs on my machine" is no longer an excuse.
160 |
161 | ## Basic Workflow
162 |
163 | Singularity instantiates containers from images that define their environment.
164 | Singularity images are stored in `.sif` files. You build a `.sif` file by
165 | defining your environment in a text file and providing that definition to the
166 | command `singularity build`.
167 |
168 | Building an image file does require root privileges, so it is most convenient
169 | to build the image on your local machine or workstation and then copy it to
170 | your HPC cluster via `scp`. The reason it requires root is because the kernel
171 | is shared, and user permissions are implemented in the kernel. So if you want
172 | to do something in the container as root, you actually need to *be* root on
173 | the host when you do it.
174 |
175 | There is also an option to build an image without root privileges. This works
176 | by sending your definition to a remote server and building the image there,
177 | though I have had difficulty getting this to work.
178 |
179 | Once you've uploaded your image to your HPC cluster, you can submit a batch
180 | job that runs `singularity exec` with the image file you created and the
181 | command you want to run. That's it!
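
As a concrete sketch of the whole workflow (the definition file, image name,
cluster host, and script names here are hypothetical placeholders):

```bash
# On your local machine, where you have root privileges:
sudo singularity build my-image.sif my-image.def
# Copy the image to your account on the cluster.
scp my-image.sif you@hpc.example.edu:images/
# On the cluster (typically inside a batch job), run your command in a container.
singularity exec ~/images/my-image.sif python3 my_script.py
```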
182 |
183 | ## A Simple PyTorch Program
184 |
185 | I have included a PyTorch program,
186 | [`train_xor.py`](examples/xor/train_xor.py),
187 | that trains a neural network to compute the XOR function and then plots the
188 | loss as a function of training time. It can also save the model to a file. It
189 | depends on the Python modules `torch`, `numpy`, and `matplotlib`.
190 |
191 | ## Installing Singularity
192 |
193 | Consult your HPC cluster's documentation or staff to see if it supports
194 | Singularity. It is normally available via the `singularity` command. The
195 | documentation for the latest version, 3.2, can be found
196 | [here](https://www.sylabs.io/guides/3.2/user-guide/).
197 |
198 | If Singularity is not installed, consider
199 | [requesting
200 | it](https://www.sylabs.io/guides/3.2/user-guide/installation.html#singularity-on-a-shared-resource).
201 |
202 | As for installing Singularity locally, the Singularity docs include detailed
203 | instructions for installing Singularity on major operating systems
204 | [here](https://www.sylabs.io/guides/3.2/user-guide/installation.html).
205 |
206 | ## Defining an Image
207 |
208 | The first step in defining an image is picking which base image to use. This
209 | can be a Linux distribution, such as Ubuntu, or an image with a library
210 | pre-installed, like one of PyTorch's
211 | [official Docker images](https://hub.docker.com/r/pytorch/pytorch/tags). Since
212 | our program depends on more than just PyTorch, let's start with a plain Ubuntu
213 | image and build up from there.
214 |
215 | Let's start with the basic syntax for definition files, which is documented
216 | [here](https://www.sylabs.io/guides/3.2/user-guide/definition_files.html).
217 | The first part of the file is the header, where we define the base image and
218 | other meta-information. The only required keyword in the header is `Bootstrap`,
219 | which defines the type of image being imported. Using `Bootstrap: library`
220 | means that we are importing a library from the official
221 | [Singularity Library](https://cloud.sylabs.io/library).
222 | Using `Bootstrap: docker` means that we are importing a Docker image from a
223 | Docker registry such as
224 | [Docker Hub](https://hub.docker.com/).
225 | Let's import the official
226 | [Ubuntu 18.04](https://cloud.sylabs.io/library/_container/5baba99394feb900016ea433)
227 | image.
228 |
229 | ```
230 | Bootstrap: library
231 | From: ubuntu:18.04
232 | ```
233 |
234 | The rest of the definition file is split up into several **sections** which
235 | serve special roles. The `%post` section defines a series of commands to be run
236 | while the image is being built, inside of a container as the root user. This
237 | is typically where you install packages. The `%environment` section defines
238 | environment variables that are set when the image is instantiated as a
239 | container. The `%files` section lets you copy files into the image. There are
240 | [many other types of section](https://www.sylabs.io/guides/3.2/user-guide/definition_files.html#sections).
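
For illustration, here is a minimal skeleton (the file names and values are
placeholders, not part of our actual image) showing how these sections fit
together:

```
Bootstrap: library
From: ubuntu:18.04

%files
    # Copy a file from the host into the image at build time.
    requirements.txt /opt/requirements.txt

%post
    # Commands run as root inside the container while the image is built.
    apt-get update -y

%environment
    # Variables set whenever the image is instantiated as a container.
    export LC_ALL=C.UTF-8
```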
241 |
242 | Let's use the `%post` section to install all of our requirements using
243 | `apt-get` and `pip3`.
244 |
245 | ```
246 | %post
247 | # These first few commands allow us to find the python3-pip package later
248 | # on.
249 | apt-get update -y
250 | # Using "noninteractive" mode runs apt-get while ensuring that there are
251 | # no user prompts that would cause the `singularity build` command to hang.
252 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
253 | software-properties-common
254 | add-apt-repository universe
255 | # Downloads the latest package lists (important).
256 | apt-get update -y
257 | # python3-tk is required by matplotlib.
258 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
259 | python3 \
260 | python3-tk \
261 | python3-pip \
262 | python3-distutils \
263 | python3-setuptools
264 | # Reduce the size of the image by deleting the package lists we downloaded,
265 | # which are useless now.
266 | rm -rf /var/lib/apt/lists/*
267 | # Install Python modules.
268 | pip3 install torch numpy matplotlib
269 | ```
270 |
271 | Each line defines a separate command (lines can be continued with a `\`).
272 | Unlike normal shell scripts, the build will be aborted as soon as one of the
273 | commands fails. You do not need to connect the commands with `&&`.
274 |
275 | The final build definition is in the file
276 | [version-1.def](examples/xor/version-1.def).
277 |
278 | ## Building an Image
279 |
280 | Supposing we are on our own Ubuntu machine, we can build this definition into
281 | a `.sif` image file using the following command:
282 |
283 | ```bash
284 | cd examples/xor
285 | sudo singularity build version-1.sif version-1.def
286 | ```
287 |
288 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-1.html)
289 |
290 | This ran the commands we defined in the `%post` section inside a container and
291 | afterwards saved the state of the container in the image `version-1.sif`.
292 |
293 | ## Running an Image
294 |
295 | Let's run our PyTorch program in a container based on the image we just built.
296 |
297 | ```bash
298 | singularity exec version-1.sif python3 train_xor.py --output model.pt
299 | ```
300 |
301 | This program does not take long to run. Once it finishes, it should open a
302 | window with a plot of the model's loss and accuracy over time.
303 |
304 | [Watch the asciicast](https://asciinema.org/a/Lqq0AsJSwVgFoo1Hr8S7euMe5)
305 |
306 | 
307 |
308 | The trained model should also be saved in the file `model.pt`. Note that even
309 | though the program ran in a container, it was able to write a file to the host
310 | file system that remained after the program exited and the container was shut
311 | down. If you are familiar with Docker, you probably know that you cannot write
312 | files to the host in this way unless you explicitly **bind mount** two
313 | directories in the host and container file system. Bind mounting makes a file
314 | or directory on the host system synonymous with one in the container.
315 |
316 | For convenience, Singularity
317 | [binds a few important directories by
318 | default](https://www.sylabs.io/guides/3.2/user-guide/bind_paths_and_mounts.html):
319 |
320 | * Your home directory
321 | * The current working directory
322 | * `/sys`
323 | * `/proc`
324 | * others (depending on the version of Singularity)
325 |
326 | You can add to or override these settings if you wish using the
327 | [`--bind` flag](https://www.sylabs.io/guides/3.2/user-guide/bind_paths_and_mounts.html#specifying-bind-paths)
328 | to `singularity exec`. This is important to remember if you want to access a
329 | file that is outside of your home directory -- otherwise you may end up with
330 | inexplicable "file or directory does not exist" errors. If you encounter
331 | cryptic errors when running Singularity, make sure that you have bound all of
332 | the directories you intend your program to have access to.
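
For example, to make a directory that lives outside your home directory (the
path here is a hypothetical placeholder) visible inside the container:

```bash
# Mount the host directory /path/to/host/data at /data inside the container.
singularity exec --bind /path/to/host/data:/data version-1.sif ls /data
```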
333 |
334 | It is also important to know that, unlike Docker, environment variables are
335 | inherited inside the container for convenience. This can be a good and a bad
336 | thing. It is good in that it is often convenient, but it is bad in that the
337 | containerized program may behave differently on different hosts for apparently
338 | no reason if the hosts export different environment variables. This behavior
339 | can be disabled by running Singularity with `--cleanenv`.
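
For example, you can compare the environment the container sees with and
without the host's exported variables:

```bash
# Inherit the host's environment variables (the default behavior).
singularity exec version-1.sif env
# Start from a clean environment instead.
singularity exec --cleanenv version-1.sif env
```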
340 |
341 | Here is an example of when you might want to inherit environment variables.
342 | By default, if there are multiple GPUs available on a system, PyTorch will
343 | grab the first GPU it sees (some toolkits grab all of them). However, your
344 | cluster may allocate a specific GPU for your batch job that is not
345 | necessarily the one that PyTorch would pick. If PyTorch does not respect
346 | this assignment, there can be contention among different jobs. You can
347 | control which GPUs PyTorch has access to using the environment variable
348 | `CUDA_VISIBLE_DEVICES`. As long as your cluster defines this environment
349 | variable for you, you do not need to explicitly forward it to the Singularity
350 | container.
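
If your cluster does not set it for you, you can set it yourself when launching
the container (the image and script names are hypothetical placeholders):

```bash
# Make only GPU 0 visible to PyTorch (use commas to list multiple devices).
CUDA_VISIBLE_DEVICES=0 singularity exec --nv my-image.sif python3 my_script.py
```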
351 |
352 | ## Running an Interactive Shell
353 |
354 | You can also open up a shell inside the container and run commands there. You
355 | can `exit` when you're done. Note that since your home directory is
356 | bind-mounted, the shell inside the container may run your shell's startup file
357 | (e.g. `.bashrc`).
358 |
359 | ```
360 | $ singularity shell version-1.sif
361 | Singularity version-1.sif:~/singularity-tutorial/examples/xor> python3 train_xor.py
362 | ```
363 |
364 | Again, this is an instance of the host environment leaking into the container in
365 | a potentially unexpected way that you should be mindful of.
366 |
367 | ## Pulling and Running an Image
368 |
369 | At this point, you may wish to follow along with the tutorial on a system where
370 | Singularity is installed, either on a personal workstation or on an HPC
371 | account.
372 |
373 | You can pull the first tutorial image like so:
374 |
375 | ```bash
376 | singularity pull version-1.sif library://brian/default/singularity-tutorial:version-1
377 | ```
378 |
379 | Next, clone this repository.
380 |
381 | ```bash
382 | git clone https://github.com/bdusell/singularity-tutorial.git
383 | ```
384 |
385 | Run the program like this:
386 |
387 | ```bash
388 | cd singularity-tutorial/examples/xor
389 | singularity exec ../../../version-1.sif python3 train_xor.py
390 | ```
391 |
392 | This program is running the `python3` executable that exists inside the image
393 | `version-1.sif`. It is *not* running the `python3` executable on the host.
394 | Crucially, the host does not even need to have `python3` installed.
395 |
396 | The same plot as before should show up. This would not have been
397 | possible using the software provided on my own HPC cluster, since its Python
398 | installation does not include Tk, which is required by matplotlib. I have
399 | found this extremely useful for making plots from data I have stored on the
400 | cluster without needing to download the data to another machine.
401 |
402 | ## A Beefier PyTorch Program
403 |
404 | As an example of a program that benefits from GPU acceleration, we will be
405 | running the official
406 | [`word_language_model`](https://github.com/pytorch/examples/tree/master/word_language_model)
407 | example PyTorch program, which I have included at
408 | [`examples/language-model`](examples/language-model).
409 | This program trains an
410 | [LSTM](https://en.wikipedia.org/wiki/Long_short-term_memory)
411 | [language model](https://en.wikipedia.org/wiki/Language_model)
412 | on a corpus of Wikipedia text.
413 |
414 | ## Adding GPU Support
415 |
416 | In order to add GPU support, we need to include CUDA in our image. In
417 | Singularity, this is delightfully simple. We just need to pick one of
418 | [Nvidia's official Docker images](https://hub.docker.com/r/nvidia/cuda)
419 | to base our image on. Again, the easiest way to install library X is often to
420 | Google "X docker" and pick an image from the README or tags page on Docker Hub.
421 |
422 | The README lists several tags. They tend to indicate variants of the image that
423 | have different components and different versions of things installed. Let's
424 | pick the one that is based on CUDA 10.1, uses Ubuntu 18.04, and includes cuDNN
425 | (which PyTorch can leverage for highly optimized neural network operations).
426 | Let's also pick the `devel` version, since PyTorch needs to compile itself in
427 | the container. This is the image tagged `10.1-cudnn7-devel-ubuntu18.04`. Since
428 | the image comes from the nvidia/cuda repository, the full image name is
429 | `nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04`.
430 |
431 | Our definition file now looks
432 | [like this](examples/language-model/version-2.def). Although we don't need it,
433 | I kept matplotlib for good measure.
434 |
435 | ```
436 | Bootstrap: docker
437 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
438 |
439 | %post
440 | # Downloads the latest package lists (important).
441 | apt-get update -y
442 | # Runs apt-get while ensuring that there are no user prompts that would
443 | # cause the build process to hang.
444 | # python3-tk is required by matplotlib.
445 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
446 | python3 \
447 | python3-tk \
448 | python3-pip \
449 | python3-setuptools
450 | # Reduce the size of the image by deleting the package lists we downloaded,
451 | # which are useless now.
452 | rm -rf /var/lib/apt/lists/*
453 | # Install Python modules.
454 | pip3 install torch numpy matplotlib
455 | ```
456 |
457 | We build the image as usual.
458 |
459 | ```bash
460 | cd examples/language-model
461 | sudo singularity build version-2.sif version-2.def
462 | ```
463 |
464 | [View the screencast](https://bdusell.github.io/singularity-tutorial/casts/version-2.html)
465 |
466 | We run the image like before, except that we have to add the `--nv` flag to
467 | allow the container to access the Nvidia drivers on the host in order to use
468 | the GPU. That's all we need to get GPU support working. Not bad!
469 |
470 | This program takes a while to run. When I run one epoch on my workstation
471 | (which has a GPU), the output looks like this:
472 |
473 | ```
474 | $ singularity exec --nv version-2.sif python3 main.py --cuda --epochs 1
475 | | epoch 1 | 200/ 2983 batches | lr 20.00 | ms/batch 45.68 | loss 7.63 | ppl 2050.44
476 | | epoch 1 | 400/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 6.85 | ppl 945.93
477 | | epoch 1 | 600/ 2983 batches | lr 20.00 | ms/batch 45.03 | loss 6.48 | ppl 653.61
478 | | epoch 1 | 800/ 2983 batches | lr 20.00 | ms/batch 46.43 | loss 6.29 | ppl 541.05
479 | | epoch 1 | 1000/ 2983 batches | lr 20.00 | ms/batch 45.50 | loss 6.14 | ppl 464.91
480 | | epoch 1 | 1200/ 2983 batches | lr 20.00 | ms/batch 44.99 | loss 6.06 | ppl 429.36
481 | | epoch 1 | 1400/ 2983 batches | lr 20.00 | ms/batch 45.27 | loss 5.95 | ppl 382.01
482 | | epoch 1 | 1600/ 2983 batches | lr 20.00 | ms/batch 45.09 | loss 5.95 | ppl 382.31
483 | | epoch 1 | 1800/ 2983 batches | lr 20.00 | ms/batch 45.25 | loss 5.80 | ppl 330.43
484 | | epoch 1 | 2000/ 2983 batches | lr 20.00 | ms/batch 45.08 | loss 5.78 | ppl 324.42
485 | | epoch 1 | 2200/ 2983 batches | lr 20.00 | ms/batch 45.11 | loss 5.66 | ppl 288.16
486 | | epoch 1 | 2400/ 2983 batches | lr 20.00 | ms/batch 45.14 | loss 5.67 | ppl 291.00
487 | | epoch 1 | 2600/ 2983 batches | lr 20.00 | ms/batch 45.21 | loss 5.66 | ppl 287.51
488 | | epoch 1 | 2800/ 2983 batches | lr 20.00 | ms/batch 45.02 | loss 5.54 | ppl 255.68
489 | -----------------------------------------------------------------------------------------
490 | | end of epoch 1 | time: 140.54s | valid loss 5.54 | valid ppl 254.69
491 | -----------------------------------------------------------------------------------------
492 | =========================================================================================
493 | | End of training | test loss 5.46 | test ppl 235.49
494 | =========================================================================================
495 | ```
496 |
497 | ## Separating Python modules from the image
498 |
499 | Now that you know the basics of how to run a GPU-accelerated program, here's a
500 | tip for managing Python modules. There's a problem with our current workflow.
501 | Every time we want to install a new Python library, we have to re-build the
502 | image. We should only need to re-build the image when we install a package with
503 | `apt-get` or inherit from a different base image -- in other words, actions
504 | that require root privileges. It would be nice if we could store our Python
505 | libraries in the current working directory using a **package manager**, and
506 | rely on the image only for the basic Ubuntu/CUDA/Python environment.
507 |
508 | [Pipenv](https://github.com/pypa/pipenv) is a package manager for Python. It's
509 | like the Python equivalent of npm (Node.js package manager) or Bundler (Ruby
510 | package manager). It keeps track of the libraries your project depends on in
511 | text files named `Pipfile` and `Pipfile.lock`, which you can commit to version
512 | control in lieu of the massive libraries themselves. Every time you run
513 | `pipenv install <package>`, Pipenv will update the `Pipfile` and download the
514 | library locally. The important thing is that, rather than putting the library
515 | in a system-wide location, Pipenv installs the library in a *local* directory
516 | called `.venv`. The benefit of this is that the libraries are stored *with*
517 | your project, but they are not part of the image. The image is merely the
518 | vehicle for running them.
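
For instance, after installing `torch`, `numpy`, and `matplotlib`, the
generated `Pipfile` looks roughly like this (abbreviated):

```
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
torch = "*"
numpy = "*"
matplotlib = "*"
```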
519 |
520 | Here is the
521 | [new version](examples/language-model/version-3.def)
522 | of our definition file:
523 |
524 | ```
525 | Bootstrap: docker
526 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
527 |
528 | %post
529 | # Downloads the latest package lists (important).
530 | apt-get update -y
531 | # Runs apt-get while ensuring that there are no user prompts that would
532 | # cause the build process to hang.
533 | # python3-tk is required by matplotlib.
534 | # python3-dev is needed to install some packages.
535 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
536 | python3 \
537 | python3-tk \
538 | python3-pip \
539 | python3-dev
540 | # Reduce the size of the image by deleting the package lists we downloaded,
541 | # which are useless now.
542 | rm -rf /var/lib/apt/lists/*
543 | # Install Pipenv.
544 | pip3 install pipenv
545 |
546 | %environment
547 | # Pipenv requires a certain terminal encoding.
548 | export LANG=C.UTF-8
549 | export LC_ALL=C.UTF-8
550 | # This configures Pipenv to store the packages in the current working
551 | # directory.
552 | export PIPENV_VENV_IN_PROJECT=1
553 | ```
554 |
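Alternatively, if you would rather build the image yourself than pull it --
assuming you have root privileges on a Linux machine with Singularity
installed -- the standard build command works:

```bash
sudo singularity build version-3.sif version-3.def
```
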
555 | Download the new image.
556 |
557 | ```bash
558 | singularity pull version-3.sif library://brian/default/singularity-tutorial:version-3
559 | ```
560 |
561 | Now we can use the container to install our Python libraries into the current
562 | working directory. We do this by running `pipenv install`.
563 |
564 | ```bash
565 | singularity exec version-3.sif pipenv install torch numpy matplotlib
566 | ```
567 |
568 | [Screencast of this step on asciinema](https://asciinema.org/a/cywx1Ta3XpO89DvwaE0MaogDo)
569 |
570 | This may take a while. When it is finished, it will have installed the
571 | libraries in a directory named `.venv`. The benefit of installing packages like
572 | this is that you can install new ones without re-building the image, and you
573 | can re-use the image for multiple projects. The `.sif` file is smaller too.
574 |
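For example, if you later decide you need another library, you can add it
without touching the definition file or the image (the package name here is
just an example):

```bash
singularity exec version-3.sif pipenv install pandas
```
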
575 | When you're done, you can test it out using the following command:
576 |
577 | ```bash
578 | singularity exec --nv version-3.sif pipenv run python main.py --cuda --epochs 6
579 | ```
580 |
581 | Notice that we have replaced the command `python3` with `pipenv run python`.
582 | This runs the Python executable managed by Pipenv, which in turn lives
583 | inside the container.
584 |
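As a quick sanity check -- just a sketch, not part of the training run -- you
can confirm that PyTorch inside the Pipenv environment sees the GPU:

```bash
singularity exec --nv version-3.sif pipenv run python -c \
    'import torch; print(torch.cuda.is_available())'
```
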
585 | ## Docker
586 |
587 | If this container stuff interests you, you might be interested in
588 | [Docker](https://www.docker.com/)
589 | too. Docker has its own set of idiosyncrasies, but a good place to start is the
590 | [Docker documentation](https://docs.docker.com/).
591 |
592 | This would be a good time to plug my
593 | [dockerdev](https://github.com/bdusell/dockerdev)
594 | project, which is a bash library that sets up a streamlined workflow for using
595 | Docker containers as development environments.
596 |
597 | ## Conclusion
598 |
599 | By now I think I have shown you that the sky is the limit when it comes to
600 | containers. Hopefully this will prove useful to your research. If you like, you
601 | can show your appreciation by leaving a star on GitHub. :)
602 |
--------------------------------------------------------------------------------
/examples/language-model/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2017,
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/examples/language-model/README.md:
--------------------------------------------------------------------------------
1 | # NOTICE
2 |
3 | This code is largely taken from the
4 | [official PyTorch examples repository](https://github.com/pytorch/examples/tree/master/word_language_model)
5 | in conformance with its [LICENSE](LICENSE).
6 |
7 | # Word-level language modeling RNN
8 |
9 | This example trains a multi-layer RNN (Elman, GRU, or LSTM) on a language modeling task.
10 | By default, the training script uses the Wikitext-2 dataset, which is provided.
11 | The trained model can then be used by the generate script to generate new text.
12 |
13 | ```bash
14 | python main.py --cuda --epochs 6 # Train a LSTM on Wikitext-2 with CUDA
15 | python main.py --cuda --epochs 6 --tied # Train a tied LSTM on Wikitext-2 with CUDA
16 | python main.py --cuda --tied # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs
17 | python generate.py # Generate samples from the trained LSTM model.
18 | ```
19 |
20 | The model uses the `nn.RNN` module (and its sister modules `nn.GRU` and `nn.LSTM`)
21 | which will automatically use the cuDNN backend if run on CUDA with cuDNN installed.
22 |
23 | During training, if a keyboard interrupt (Ctrl-C) is received,
24 | training is stopped and the current model is evaluated against the test dataset.
25 |
26 | The `main.py` script accepts the following arguments:
27 |
28 | ```bash
29 | optional arguments:
30 | -h, --help show this help message and exit
31 | --data DATA location of the data corpus
32 | --model MODEL type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)
33 | --emsize EMSIZE size of word embeddings
34 | --nhid NHID number of hidden units per layer
35 | --nlayers NLAYERS number of layers
36 | --lr LR initial learning rate
37 | --clip CLIP gradient clipping
38 | --epochs EPOCHS upper epoch limit
39 | --batch_size N batch size
40 | --bptt BPTT sequence length
41 | --dropout DROPOUT dropout applied to layers (0 = no dropout)
42 | --decay DECAY learning rate decay per epoch
43 | --tied tie the word embedding and softmax weights
44 | --seed SEED random seed
45 | --cuda use CUDA
46 | --log-interval N report interval
47 | --save SAVE path to save the final model
48 | ```
49 |
50 | With these arguments, a variety of models can be tested.
51 | As an example, the following arguments produce slower but better models:
52 |
53 | ```bash
54 | python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
55 | python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
56 | python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
57 | python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
58 | ```
59 |
--------------------------------------------------------------------------------
/examples/language-model/data.py:
--------------------------------------------------------------------------------
1 | import os
2 | from io import open
3 | import torch
4 |
5 | class Dictionary(object):
6 | def __init__(self):
7 | self.word2idx = {}
8 | self.idx2word = []
9 |
10 | def add_word(self, word):
11 | if word not in self.word2idx:
12 | self.idx2word.append(word)
13 | self.word2idx[word] = len(self.idx2word) - 1
14 | return self.word2idx[word]
15 |
16 | def __len__(self):
17 | return len(self.idx2word)
18 |
19 |
20 | class Corpus(object):
21 | def __init__(self, path):
22 | self.dictionary = Dictionary()
23 | self.train = self.tokenize(os.path.join(path, 'train.txt'))
24 | self.valid = self.tokenize(os.path.join(path, 'valid.txt'))
25 | self.test = self.tokenize(os.path.join(path, 'test.txt'))
26 |
27 | def tokenize(self, path):
28 | """Tokenizes a text file."""
29 | assert os.path.exists(path)
30 | # Add words to the dictionary
31 | with open(path, 'r', encoding="utf8") as f:
32 | tokens = 0
33 | for line in f:
34 |                 words = line.split() + ['<eos>']
35 | tokens += len(words)
36 | for word in words:
37 | self.dictionary.add_word(word)
38 |
39 | # Tokenize file content
40 | with open(path, 'r', encoding="utf8") as f:
41 | ids = torch.LongTensor(tokens)
42 | token = 0
43 | for line in f:
44 |                 words = line.split() + ['<eos>']
45 | for word in words:
46 | ids[token] = self.dictionary.word2idx[word]
47 | token += 1
48 |
49 | return ids
50 |
--------------------------------------------------------------------------------
/examples/language-model/data/wikitext-2/README:
--------------------------------------------------------------------------------
1 | This is raw data from the wikitext-2 dataset.
2 |
3 | See https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
4 |
--------------------------------------------------------------------------------
/examples/language-model/generate.py:
--------------------------------------------------------------------------------
1 | ###############################################################################
2 | # Language Modeling on Wikitext-2
3 | #
4 | # This file generates new sentences sampled from the language model
5 | #
6 | ###############################################################################
7 |
8 | import argparse
9 |
10 | import torch
11 |
12 | import data
13 |
14 | parser = argparse.ArgumentParser(description='PyTorch Wikitext-2 Language Model')
15 |
16 | # Model parameters.
17 | parser.add_argument('--data', type=str, default='./data/wikitext-2',
18 | help='location of the data corpus')
19 | parser.add_argument('--checkpoint', type=str, default='./model.pt',
20 | help='model checkpoint to use')
21 | parser.add_argument('--outf', type=str, default='generated.txt',
22 | help='output file for generated text')
23 | parser.add_argument('--words', type=int, default='1000',
24 | help='number of words to generate')
25 | parser.add_argument('--seed', type=int, default=1111,
26 | help='random seed')
27 | parser.add_argument('--cuda', action='store_true',
28 | help='use CUDA')
29 | parser.add_argument('--temperature', type=float, default=1.0,
30 | help='temperature - higher will increase diversity')
31 | parser.add_argument('--log-interval', type=int, default=100,
32 | help='reporting interval')
33 | args = parser.parse_args()
34 |
35 | # Set the random seed manually for reproducibility.
36 | torch.manual_seed(args.seed)
37 | if torch.cuda.is_available():
38 | if not args.cuda:
39 | print("WARNING: You have a CUDA device, so you should probably run with --cuda")
40 |
41 | device = torch.device("cuda" if args.cuda else "cpu")
42 |
43 | if args.temperature < 1e-3:
44 |     parser.error("--temperature has to be greater than or equal to 1e-3")
45 |
46 | with open(args.checkpoint, 'rb') as f:
47 | model = torch.load(f).to(device)
48 | model.eval()
49 |
50 | corpus = data.Corpus(args.data)
51 | ntokens = len(corpus.dictionary)
52 | hidden = model.init_hidden(1)
53 | input = torch.randint(ntokens, (1, 1), dtype=torch.long).to(device)
54 |
55 | with open(args.outf, 'w') as outf:
56 | with torch.no_grad(): # no tracking history
57 | for i in range(args.words):
58 | output, hidden = model(input, hidden)
59 | word_weights = output.squeeze().div(args.temperature).exp().cpu()
60 | word_idx = torch.multinomial(word_weights, 1)[0]
61 | input.fill_(word_idx)
62 | word = corpus.dictionary.idx2word[word_idx]
63 |
64 | outf.write(word + ('\n' if i % 20 == 19 else ' '))
65 |
66 | if i % args.log_interval == 0:
67 | print('| Generated {}/{} words'.format(i, args.words))
68 |
--------------------------------------------------------------------------------
/examples/language-model/main.py:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | import argparse
3 | import time
4 | import math
5 | import os
6 | import torch
7 | import torch.nn as nn
8 | import torch.onnx
9 |
10 | import data
11 | import model
12 |
13 | parser = argparse.ArgumentParser(description='PyTorch Wikitext-2 RNN/LSTM Language Model')
14 | parser.add_argument('--data', type=str, default='./data/wikitext-2',
15 | help='location of the data corpus')
16 | parser.add_argument('--model', type=str, default='LSTM',
17 | help='type of recurrent net (RNN_TANH, RNN_RELU, LSTM, GRU)')
18 | parser.add_argument('--emsize', type=int, default=200,
19 | help='size of word embeddings')
20 | parser.add_argument('--nhid', type=int, default=200,
21 | help='number of hidden units per layer')
22 | parser.add_argument('--nlayers', type=int, default=2,
23 | help='number of layers')
24 | parser.add_argument('--lr', type=float, default=20,
25 | help='initial learning rate')
26 | parser.add_argument('--clip', type=float, default=0.25,
27 | help='gradient clipping')
28 | parser.add_argument('--epochs', type=int, default=40,
29 | help='upper epoch limit')
30 | parser.add_argument('--batch_size', type=int, default=20, metavar='N',
31 | help='batch size')
32 | parser.add_argument('--bptt', type=int, default=35,
33 | help='sequence length')
34 | parser.add_argument('--dropout', type=float, default=0.2,
35 | help='dropout applied to layers (0 = no dropout)')
36 | parser.add_argument('--tied', action='store_true',
37 | help='tie the word embedding and softmax weights')
38 | parser.add_argument('--seed', type=int, default=1111,
39 | help='random seed')
40 | parser.add_argument('--cuda', action='store_true',
41 | help='use CUDA')
42 | parser.add_argument('--log-interval', type=int, default=200, metavar='N',
43 | help='report interval')
44 | parser.add_argument('--save', type=str, default='model.pt',
45 | help='path to save the final model')
46 | parser.add_argument('--onnx-export', type=str, default='',
47 | help='path to export the final model in onnx format')
48 | args = parser.parse_args()
49 |
50 | # Set the random seed manually for reproducibility.
51 | torch.manual_seed(args.seed)
52 | if torch.cuda.is_available():
53 | if not args.cuda:
54 | print("WARNING: You have a CUDA device, so you should probably run with --cuda")
55 |
56 | device = torch.device("cuda" if args.cuda else "cpu")
57 |
58 | ###############################################################################
59 | # Load data
60 | ###############################################################################
61 |
62 | corpus = data.Corpus(args.data)
63 |
64 | # Starting from sequential data, batchify arranges the dataset into columns.
65 | # For instance, with the alphabet as the sequence and batch size 4, we'd get
66 | # ┌ a g m s ┐
67 | # │ b h n t │
68 | # │ c i o u │
69 | # │ d j p v │
70 | # │ e k q w │
71 | # └ f l r x ┘.
72 | # These columns are treated as independent by the model, which means that the
73 | # dependence of e.g. 'g' on 'f' cannot be learned, but allows more efficient
74 | # batch processing.
75 |
76 | def batchify(data, bsz):
77 | # Work out how cleanly we can divide the dataset into bsz parts.
78 | nbatch = data.size(0) // bsz
79 | # Trim off any extra elements that wouldn't cleanly fit (remainders).
80 | data = data.narrow(0, 0, nbatch * bsz)
81 | # Evenly divide the data across the bsz batches.
82 | data = data.view(bsz, -1).t().contiguous()
83 | return data.to(device)
84 |
85 | eval_batch_size = 10
86 | train_data = batchify(corpus.train, args.batch_size)
87 | val_data = batchify(corpus.valid, eval_batch_size)
88 | test_data = batchify(corpus.test, eval_batch_size)
89 |
90 | ###############################################################################
91 | # Build the model
92 | ###############################################################################
93 |
94 | ntokens = len(corpus.dictionary)
95 | model = model.RNNModel(args.model, ntokens, args.emsize, args.nhid, args.nlayers, args.dropout, args.tied).to(device)
96 |
97 | criterion = nn.CrossEntropyLoss()
98 |
99 | ###############################################################################
100 | # Training code
101 | ###############################################################################
102 |
103 | def repackage_hidden(h):
104 | """Wraps hidden states in new Tensors, to detach them from their history."""
105 | if isinstance(h, torch.Tensor):
106 | return h.detach()
107 | else:
108 | return tuple(repackage_hidden(v) for v in h)
109 |
110 |
111 | # get_batch subdivides the source data into chunks of length args.bptt.
112 | # If source is equal to the example output of the batchify function, with
113 | # a bptt-limit of 2, we'd get the following two Variables for i = 0:
114 | # ┌ a g m s ┐ ┌ b h n t ┐
115 | # └ b h n t ┘ └ c i o u ┘
116 | # Note that despite the name of the function, the subdivision of data is not
117 | # done along the batch dimension (i.e. dimension 1), since that was handled
118 | # by the batchify function. The chunks are along dimension 0, corresponding
119 | # to the seq_len dimension in the LSTM.
120 |
121 | def get_batch(source, i):
122 | seq_len = min(args.bptt, len(source) - 1 - i)
123 | data = source[i:i+seq_len]
124 | target = source[i+1:i+1+seq_len].view(-1)
125 | return data, target
126 |
127 |
128 | def evaluate(data_source):
129 | # Turn on evaluation mode which disables dropout.
130 | model.eval()
131 | total_loss = 0.
132 | ntokens = len(corpus.dictionary)
133 | hidden = model.init_hidden(eval_batch_size)
134 | with torch.no_grad():
135 | for i in range(0, data_source.size(0) - 1, args.bptt):
136 | data, targets = get_batch(data_source, i)
137 | output, hidden = model(data, hidden)
138 | output_flat = output.view(-1, ntokens)
139 | total_loss += len(data) * criterion(output_flat, targets).item()
140 | hidden = repackage_hidden(hidden)
141 | return total_loss / (len(data_source) - 1)
142 |
143 |
144 | def train():
145 | # Turn on training mode which enables dropout.
146 | model.train()
147 | total_loss = 0.
148 | start_time = time.time()
149 | ntokens = len(corpus.dictionary)
150 | hidden = model.init_hidden(args.batch_size)
151 | for batch, i in enumerate(range(0, train_data.size(0) - 1, args.bptt)):
152 | data, targets = get_batch(train_data, i)
153 | # Starting each batch, we detach the hidden state from how it was previously produced.
154 | # If we didn't, the model would try backpropagating all the way to start of the dataset.
155 | hidden = repackage_hidden(hidden)
156 | model.zero_grad()
157 | output, hidden = model(data, hidden)
158 | loss = criterion(output.view(-1, ntokens), targets)
159 | loss.backward()
160 |
161 |         # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
162 | torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
163 | for p in model.parameters():
164 | p.data.add_(-lr, p.grad.data)
165 |
166 | total_loss += loss.item()
167 |
168 | if batch % args.log_interval == 0 and batch > 0:
169 | cur_loss = total_loss / args.log_interval
170 | elapsed = time.time() - start_time
171 | print('| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.2f} | ms/batch {:5.2f} | '
172 | 'loss {:5.2f} | ppl {:8.2f}'.format(
173 | epoch, batch, len(train_data) // args.bptt, lr,
174 | elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss)))
175 | total_loss = 0
176 | start_time = time.time()
177 |
178 |
179 | def export_onnx(path, batch_size, seq_len):
180 | print('The model is also exported in ONNX format at {}'.
181 | format(os.path.realpath(args.onnx_export)))
182 | model.eval()
183 | dummy_input = torch.LongTensor(seq_len * batch_size).zero_().view(-1, batch_size).to(device)
184 | hidden = model.init_hidden(batch_size)
185 | torch.onnx.export(model, (dummy_input, hidden), path)
186 |
187 |
188 | # Loop over epochs.
189 | lr = args.lr
190 | best_val_loss = None
191 |
192 | # At any point you can hit Ctrl + C to break out of training early.
193 | try:
194 | for epoch in range(1, args.epochs+1):
195 | epoch_start_time = time.time()
196 | train()
197 | val_loss = evaluate(val_data)
198 | print('-' * 89)
199 | print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
200 | 'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
201 | val_loss, math.exp(val_loss)))
202 | print('-' * 89)
203 | # Save the model if the validation loss is the best we've seen so far.
204 | if not best_val_loss or val_loss < best_val_loss:
205 | with open(args.save, 'wb') as f:
206 | torch.save(model, f)
207 | best_val_loss = val_loss
208 | else:
209 | # Anneal the learning rate if no improvement has been seen in the validation dataset.
210 | lr /= 4.0
211 | except KeyboardInterrupt:
212 | print('-' * 89)
213 | print('Exiting from training early')
214 |
215 | # Load the best saved model.
216 | with open(args.save, 'rb') as f:
217 | model = torch.load(f)
218 | # After loading, the RNN params are not a contiguous chunk of memory;
219 | # this makes them a contiguous chunk, which will speed up the forward pass.
220 | model.rnn.flatten_parameters()
221 |
222 | # Run on test data.
223 | test_loss = evaluate(test_data)
224 | print('=' * 89)
225 | print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
226 | test_loss, math.exp(test_loss)))
227 | print('=' * 89)
228 |
229 | if len(args.onnx_export) > 0:
230 | # Export the model in ONNX format.
231 | export_onnx(args.onnx_export, batch_size=1, seq_len=args.bptt)
232 |
--------------------------------------------------------------------------------
/examples/language-model/model.py:
--------------------------------------------------------------------------------
1 | import torch.nn as nn
2 |
3 | class RNNModel(nn.Module):
4 | """Container module with an encoder, a recurrent module, and a decoder."""
5 |
6 | def __init__(self, rnn_type, ntoken, ninp, nhid, nlayers, dropout=0.5, tie_weights=False):
7 | super(RNNModel, self).__init__()
8 | self.drop = nn.Dropout(dropout)
9 | self.encoder = nn.Embedding(ntoken, ninp)
10 | if rnn_type in ['LSTM', 'GRU']:
11 | self.rnn = getattr(nn, rnn_type)(ninp, nhid, nlayers, dropout=dropout)
12 | else:
13 | try:
14 | nonlinearity = {'RNN_TANH': 'tanh', 'RNN_RELU': 'relu'}[rnn_type]
15 | except KeyError:
16 | raise ValueError( """An invalid option for `--model` was supplied,
17 | options are ['LSTM', 'GRU', 'RNN_TANH' or 'RNN_RELU']""")
18 | self.rnn = nn.RNN(ninp, nhid, nlayers, nonlinearity=nonlinearity, dropout=dropout)
19 | self.decoder = nn.Linear(nhid, ntoken)
20 |
21 | # Optionally tie weights as in:
22 | # "Using the Output Embedding to Improve Language Models" (Press & Wolf 2016)
23 | # https://arxiv.org/abs/1608.05859
24 | # and
25 | # "Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling" (Inan et al. 2016)
26 | # https://arxiv.org/abs/1611.01462
27 | if tie_weights:
28 | if nhid != ninp:
29 | raise ValueError('When using the tied flag, nhid must be equal to emsize')
30 | self.decoder.weight = self.encoder.weight
31 |
32 | self.init_weights()
33 |
34 | self.rnn_type = rnn_type
35 | self.nhid = nhid
36 | self.nlayers = nlayers
37 |
38 | def init_weights(self):
39 | initrange = 0.1
40 | self.encoder.weight.data.uniform_(-initrange, initrange)
41 | self.decoder.bias.data.zero_()
42 | self.decoder.weight.data.uniform_(-initrange, initrange)
43 |
44 | def forward(self, input, hidden):
45 | emb = self.drop(self.encoder(input))
46 | output, hidden = self.rnn(emb, hidden)
47 | output = self.drop(output)
48 | decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
49 | return decoded.view(output.size(0), output.size(1), decoded.size(1)), hidden
50 |
51 | def init_hidden(self, bsz):
52 | weight = next(self.parameters())
53 | if self.rnn_type == 'LSTM':
54 | return (weight.new_zeros(self.nlayers, bsz, self.nhid),
55 | weight.new_zeros(self.nlayers, bsz, self.nhid))
56 | else:
57 | return weight.new_zeros(self.nlayers, bsz, self.nhid)
58 |
--------------------------------------------------------------------------------
/examples/language-model/run-gpu-job-version-2.bash:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | echo "Running on host $HOSTNAME"
4 | echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
5 | echo "nvidia-smi output:"
6 | nvidia-smi
7 |
8 | singularity exec --nv /scratch365/$USER/version-2.sif "$@"
9 |
--------------------------------------------------------------------------------
/examples/language-model/run-gpu-job-version-3.bash:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | echo "Running on host $HOSTNAME"
4 | echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
5 | echo "nvidia-smi output:"
6 | nvidia-smi
7 |
8 | singularity exec --nv /scratch365/$USER/version-3.sif pipenv run "$@"
9 |
--------------------------------------------------------------------------------
/examples/language-model/submit-gpu-job-version-2.bash:
--------------------------------------------------------------------------------
1 | # The flag `-l gpu_card=1` is necessary when using the GPU queue.
2 | qsub \
3 | -N singularity-tutorial-version-2 \
4 | -o output-version-2.txt \
5 | -q gpu \
6 | -l gpu_card=1 \
7 | run-gpu-job-version-2.bash \
8 | python3 main.py --cuda --epochs 6
9 |
--------------------------------------------------------------------------------
/examples/language-model/submit-gpu-job-version-3.bash:
--------------------------------------------------------------------------------
1 | # The flag `-l gpu_card=1` is necessary when using the GPU queue.
2 | qsub \
3 | -N singularity-tutorial-version-3 \
4 | -o output-version-3.txt \
5 | -q gpu \
6 | -l gpu_card=1 \
7 | run-gpu-job-version-3.bash \
8 | python main.py --cuda --epochs 6
9 |
--------------------------------------------------------------------------------
/examples/language-model/version-2.def:
--------------------------------------------------------------------------------
1 | Bootstrap: docker
2 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
3 |
4 | %post
5 | # Downloads the latest package lists (important).
6 | apt-get update -y
7 | # Runs apt-get while ensuring that there are no user prompts that would
8 | # cause the build process to hang.
9 | # python3-tk is required by matplotlib.
10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
11 | python3 \
12 | python3-tk \
13 | python3-pip \
14 | python3-setuptools
15 | # Reduce the size of the image by deleting the package lists we downloaded,
16 | # which are useless now.
17 | rm -rf /var/lib/apt/lists/*
18 | # Install Python modules.
19 | pip3 install torch numpy matplotlib
20 |
--------------------------------------------------------------------------------
/examples/language-model/version-3.def:
--------------------------------------------------------------------------------
1 | BootStrap: docker
2 | From: nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
3 |
4 | %post
5 | # Downloads the latest package lists (important).
6 | apt-get update -y
7 | # Runs apt-get while ensuring that there are no user prompts that would
8 | # cause the build process to hang.
9 | # python3-tk is required by matplotlib.
10 |     # python3-dev is needed to install some packages.
11 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
12 | python3 \
13 | python3-tk \
14 | python3-pip \
15 | python3-dev
16 | # Reduce the size of the image by deleting the package lists we downloaded,
17 | # which are useless now.
18 | rm -rf /var/lib/apt/lists/*
19 | # Install Pipenv.
20 | pip3 install pipenv
21 |
22 | %environment
23 | # Pipenv requires a certain terminal encoding.
24 | export LANG=C.UTF-8
25 | export LC_ALL=C.UTF-8
26 | # This configures Pipenv to store the packages in the current working
27 | # directory.
28 | export PIPENV_VENV_IN_PROJECT=1
29 |
--------------------------------------------------------------------------------
/examples/xor/train_xor.py:
--------------------------------------------------------------------------------
1 | import argparse
2 |
3 | import matplotlib.pyplot as plt
4 | from matplotlib.ticker import MaxNLocator
5 | import numpy
6 | import torch
7 |
8 | def data_to_tensor_pair(data, device):
9 | x = torch.tensor([x for x, y in data], device=device)
10 | y = torch.tensor([y for x, y in data], device=device)
11 | return x, y
12 |
13 | def evaluate_model(model, x, y):
14 | model.eval()
15 | with torch.no_grad():
16 | y_pred = model(x)
17 | return compute_accuracy(y_pred, y)
18 |
19 | def compute_accuracy(predictions, expected):
20 | correct = 0
21 | total = 0
22 | for y_pred, y in zip(predictions, expected):
23 | correct += round(y_pred.item()) == round(y.item())
24 | total += 1
25 | return correct / total
26 |
27 | def construct_model(hidden_units, num_layers):
28 | layers = []
29 | prev_layer_size = 2
30 | for layer_no in range(num_layers):
31 | layers.extend([
32 | torch.nn.Linear(prev_layer_size, hidden_units),
33 | torch.nn.Tanh()
34 | ])
35 | prev_layer_size = hidden_units
36 | layers.extend([
37 | torch.nn.Linear(prev_layer_size, 1),
38 | torch.nn.Sigmoid()
39 | ])
40 | return torch.nn.Sequential(*layers)
41 |
42 | def main():
43 |
44 | parser = argparse.ArgumentParser()
45 | parser.add_argument('--iterations', type=int, default=100)
46 | parser.add_argument('--learning-rate', type=float, default=1.0)
47 | parser.add_argument('--output')
48 | args = parser.parse_args()
49 |
50 | if torch.cuda.is_available():
51 | print('CUDA is available -- using GPU')
52 | device = torch.device('cuda')
53 | else:
54 | print('CUDA is NOT available -- using CPU')
55 | device = torch.device('cpu')
56 |
57 | # Define our toy training set for the XOR function.
58 | training_data = data_to_tensor_pair([
59 | ([0.0, 0.0], [0.0]),
60 | ([0.0, 1.0], [1.0]),
61 | ([1.0, 0.0], [1.0]),
62 | ([1.0, 1.0], [0.0])
63 | ], device)
64 |
65 | # Define our model. Use default initialization.
66 | model = construct_model(hidden_units=10, num_layers=2)
67 | model.to(device)
68 |
69 | loss_values = []
70 | accuracy_values = []
71 | optimizer = torch.optim.SGD(model.parameters(), lr=args.learning_rate)
72 | criterion = torch.nn.MSELoss()
73 | for iter_no in range(args.iterations):
74 | print('iteration #{}'.format(iter_no + 1))
75 | # Perform a parameter update.
76 | model.train()
77 | optimizer.zero_grad()
78 | x, y = training_data
79 | y_pred = model(x)
80 | loss = criterion(y_pred, y)
81 | loss_value = loss.item()
82 | print(' loss: {}'.format(loss_value))
83 | loss_values.append(loss_value)
84 | loss.backward()
85 | optimizer.step()
86 | # Evaluate the model.
87 | accuracy = evaluate_model(model, x, y)
88 | print(' accuracy: {:.2%}'.format(accuracy))
89 | accuracy_values.append(accuracy)
90 |
91 | if args.output is not None:
92 | print('saving model to {}'.format(args.output))
93 | torch.save(model.state_dict(), args.output)
94 |
95 | # Plot loss and accuracy.
96 | fig, ax = plt.subplots()
97 | ax.set_title('Loss and Accuracy vs. Iterations')
98 | ax.set_ylabel('Loss')
99 | ax.set_xlabel('Iteration')
100 | ax.set_xlim(left=1, right=len(loss_values))
101 | ax.set_ylim(bottom=0.0, auto=None)
102 | ax.xaxis.set_major_locator(MaxNLocator(integer=True))
103 | x_array = numpy.arange(1, len(loss_values) + 1)
104 | loss_y_array = numpy.array(loss_values)
105 | left_plot = ax.plot(x_array, loss_y_array, '-', label='Loss')
106 | right_ax = ax.twinx()
107 | right_ax.set_ylabel('Accuracy')
108 | right_ax.set_ylim(bottom=0.0, top=1.0)
109 | accuracy_y_array = numpy.array(accuracy_values)
110 | right_plot = right_ax.plot(x_array, accuracy_y_array, '--', label='Accuracy')
111 | lines = left_plot + right_plot
112 | ax.legend(lines, [line.get_label() for line in lines])
113 | plt.show()
114 |
115 | if __name__ == '__main__':
116 | main()
117 |
--------------------------------------------------------------------------------
/examples/xor/version-1.def:
--------------------------------------------------------------------------------
1 | Bootstrap: library
2 | From: ubuntu:18.04
3 |
4 | %post
5 | # These first few commands allow us to find the python3-pip package later
6 | # on.
7 | apt-get update -y
8 | # Using "noninteractive" mode runs apt-get while ensuring that there are
9 | # no user prompts that would cause the `singularity build` command to hang.
10 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
11 | software-properties-common
12 | add-apt-repository universe
13 | # Downloads the latest package lists (important).
14 | apt-get update -y
15 | # python3-tk is required by matplotlib.
16 | DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
17 | python3 \
18 | python3-tk \
19 | python3-pip \
20 | python3-distutils \
21 | python3-setuptools
22 | # Reduce the size of the image by deleting the package lists we downloaded,
23 | # which are useless now.
24 | rm -rf /var/lib/apt/lists/*
25 | # Install Python modules.
26 | pip3 install torch numpy matplotlib
27 |
--------------------------------------------------------------------------------
/images/plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bdusell/singularity-tutorial/f8634a2c4eb5d4ba4a86dd6a9de880cc6ca620a9/images/plot.png
--------------------------------------------------------------------------------
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bdusell/singularity-tutorial/f8634a2c4eb5d4ba4a86dd6a9de880cc6ca620a9/slides.pdf
--------------------------------------------------------------------------------