├── .gitignore
├── 01-getting-started.ipynb
├── 02-dask-basics.ipynb
├── LICENSE
├── README.md
├── img
│   ├── batch-reformatting.png
│   ├── project_files.png
│   ├── project_setup1.png
│   ├── project_setup2.png
│   ├── project_setup3.png
│   └── saturn_logo.png
├── inference_demo
│   ├── 03-single-inference.ipynb
│   └── 04-parallel-inference.ipynb
├── tools
│   ├── setup1.py
│   ├── setup2.py
│   └── stats_cache2.tar.gz
└── transfer_learning_demo
    ├── 05-transfer-prepro.ipynb
    ├── 06a-transfer-training-s3.ipynb
    ├── 06b-transfer-training-local.ipynb
    └── 07-learning-results.ipynb
/.gitignore: --------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
-------------------------------------------------------------------------------- /01-getting-started.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "\n",
9 | "# Introduction to PyTorch with Dask\n",
10 | "\n",
11 | "## Welcome!\n",
12 | "\n",
13 | "This workshop is meant to help PyTorch users get familiar with some useful concepts in Dask that can make your deep learning work faster and easier. We will specifically be looking at Computer Vision tasks for our examples, but PyTorch and Dask can be used for many other kinds of deep learning cases.\n",
14 | "\n",
15 | "After this workshop, you will know:\n",
16 | "* Basics of how Dask works\n",
17 | "* How to run inference with a pretrained model on a Dask cluster\n",
18 | "* How to run transfer learning on a Dask cluster\n",
19 | "\n",
20 | "***\n",
21 | "\n",
22 | "## Saturn Cloud concepts\n",
23 | "\n",
24 | "### Projects\n",
25 | "\n",
26 | "A \"Project\" is where all the work done in Saturn Cloud resides. Each user can have multiple projects, and these projects can be shared between users. The services associated with each project are called \"Resources\" and they are organized in the following manner:\n",
27 | "\n",
28 | "```\n",
29 | "└── Project\n",
30 | "    ├── Jupyter Server (*)\n",
31 | "    │   └── Dask Cluster\n",
32 | "    ├── Deployment\n",
33 | "    │   └── Dask Cluster\n",
34 | "```\n",
35 | "\n",
36 | "(*) Every Project has a Jupyter Server, while Dask Clusters and Deployments are optional.\n",
37 | "\n",
38 | "### Images\n",
39 | "\n",
40 | "An \"Image\" is a Docker image that contains a Python environment to be attached to various Resources. A Project is set to use one Image, and all Resources in that Project will utilize the same Image.\n",
41 | "\n",
42 | "Saturn Cloud includes pre-built images for users to get up and running quickly. Users can create custom images by navigating to the \"Images\" tab from the Saturn Cloud UI.\n",
43 | "\n",
44 | "### Jupyter Server\n",
45 | "\n",
46 | "This resource runs the Jupyter Notebook and Jupyter Lab interfaces. Most time will likely be spent \"inside\" one of these Jupyter interfaces. \n",
47 | "\n",
48 | "### Dask Cluster\n",
49 | "\n",
50 | "A Dask Cluster can be attached to a Jupyter Server to scale out work. Clusters are composed of a scheduler instance and any number of worker instances. Clusters can be created and started/stopped from the Saturn Cloud UI. The [dask-saturn](https://github.com/saturncloud/dask-saturn) package is the interface for working with Dask Clusters in a notebook or script within a Jupyter Server, and can also be used to start, stop, or resize the cluster.\n",
51 | "\n",
52 | "### Deployment\n",
53 | "\n",
54 | "A \"Deployment\" is a resource created to run an always-on or scheduled workload, such as serving a machine learning model, hosting a dashboard via a web app, or running an ETL job. It utilizes the same project Image and code from the Jupyter Server, and can optionally have its own Dask cluster assigned to it.\n",
55 | "\n",
56 | "Deployments will not be covered in this workshop.\n",
57 | "\n",
58 | "### Code and data files\n",
59 | "\n",
60 | "The filesystem of a Jupyter Server is maintained on persistent volumes, so any code or files created/uploaded will remain there after shutting down the server. \n",
61 | "\n",
62 | "However, these files are not sent to the associated Dask cluster workers or Deployments, because those are different machines with their own filesystems. \n",
63 | "\n",
64 | "**Code**: Code maintained in the `/home/jovyan/project` folder or through the Repositories feature will be sent to the resources when they are turned on. \n",
65 | "\n",
66 | "**Data files**: Data files should be managed outside of Saturn Cloud in systems such as S3 or a database. This ensures each worker in a Dask cluster has access to the data.\n",
67 | "\n",
68 | "### Advanced settings\n",
69 | "\n",
70 | "Advanced settings for Projects include Environment Variables and Start Scripts. These will not be covered in the workshop, but more information can be found in the [Saturn Cloud docs](https://www.saturncloud.io/docs/getting-started/spinning/jupyter/#advanced-settings)."
71 | ]
72 | },
73 | {
74 | "cell_type": "markdown",
75 | "metadata": {},
76 | "source": [
77 | "***\n",
78 | "\n",
79 | "## How to Connect a Cluster"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": null,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "from dask_saturn import SaturnCluster\n",
89 | "from dask.distributed import Client\n",
90 | "\n",
91 | "cluster = SaturnCluster()\n",
92 | "client = Client(cluster)\n",
93 | "client.wait_for_workers(3)\n",
94 | "\n",
95 | "print('Hello, world!')"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "Since we are working on GPU machines for this tutorial, we should check and make sure all our workers and this Jupyter instance have GPU resources."
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
111 | "import torch\n",
112 | "\n",
113 | "torch.cuda.is_available() "
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": null,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "client.run(lambda: torch.cuda.is_available())"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "metadata": {},
128 | "source": [
129 | "## Access data\n",
130 | "\n",
131 | "This workshop will be using the [Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/)."
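If you'd like to peek at a single file before globbing the whole bucket, the same anonymous-access pattern works for one image. A minimal sketch (the `2-dog.jpg` sample path is the one Notebook 3 uses later):

```python
import s3fs
from PIL import Image

# anon=True gives read-only access to this public bucket; no AWS credentials needed
s3 = s3fs.S3FileSystem(anon=True)
with s3.open("s3://saturn-public-data/dogs/2-dog.jpg", "rb") as f:
    img = Image.open(f).convert("RGB")
print(img.size)  # prints the image's (width, height)
```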
132 | ]
133 | },
134 | {
135 | "cell_type": "code",
136 | "execution_count": null,
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "import s3fs\n",
141 | "\n",
142 | "s3 = s3fs.S3FileSystem(anon=True)\n",
143 | "s3.glob('s3://saturn-public-data/dogs/Images/*/*.jpg')[-10:]"
144 | ]
145 | },
146 | {
147 | "cell_type": "markdown",
148 | "metadata": {},
149 | "source": [
150 | "If you feel comfortable with all that, then we can begin with [Notebook 2](02-dask-basics.ipynb)!"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "metadata": {},
156 | "source": [
157 | "\"go\""
158 | ]
159 | }
160 | ],
161 | "metadata": {
162 | "kernelspec": {
163 | "display_name": "saturn (Python 3)",
164 | "language": "python",
165 | "name": "python3"
166 | },
167 | "language_info": {
168 | "codemirror_mode": {
169 | "name": "ipython",
170 | "version": 3
171 | },
172 | "file_extension": ".py",
173 | "mimetype": "text/x-python",
174 | "name": "python",
175 | "nbconvert_exporter": "python",
176 | "pygments_lexer": "ipython3",
177 | "version": "3.7.7"
178 | }
179 | },
180 | "nbformat": 4,
181 | "nbformat_minor": 4
182 | }
183 |
-------------------------------------------------------------------------------- /02-dask-basics.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "\n",
9 | "# Introduction to Dask\n",
10 | "\n",
11 | "Before we get into too much complexity, let's talk about the essentials of Dask.\n",
12 | "\n",
13 | "## What is Dask?\n",
14 | "\n",
15 | "Dask is an open-source framework that enables parallelization of Python code. This can be applied to all kinds of Python use cases, not just machine learning. Dask is designed to work well on single-machine setups and on multi-machine clusters. You can use Dask with pandas, NumPy, scikit-learn, and other Python libraries - for our purposes, we'll focus on how you might use it with PyTorch. If you want to learn more about the other areas where Dask can be useful, there's a [great website explaining all of that](https://dask.org/).\n",
16 | "\n",
17 | "## Why Parallelize?\n",
18 | "\n",
19 | "For our use case, there are a couple of areas where Dask parallelization might be useful for making our work faster and better.\n",
20 | "* Loading and handling large datasets (especially if they are too large to hold in memory)\n",
21 | "* Running time- or computation-heavy tasks simultaneously\n",
22 | "\n",
23 | "\n",
24 | "## Delaying Tasks\n",
25 | "\n",
26 | "Delaying a task with Dask queues up a set of transformations or calculations so that they're ready to run later, in parallel. This is what's known as \"lazy\" evaluation - it won't evaluate the requested computations until explicitly told to. This differs from other kinds of functions, which compute instantly upon being called. Many common and handy functions have native Dask implementations, which means they will be lazy (delayed computation) without you ever having to ask. \n",
27 | "\n",
28 | "However, sometimes you will have complicated custom code written in pandas, scikit-learn, or even base Python that isn't natively available in Dask. Other times, you may simply not have the time or energy to refactor your code to take advantage of native Dask elements.\n",
29 | "If this is the case, you can decorate your functions with `@dask.delayed`, which marks the function as lazy: it will not evaluate until you tell it to, using the methods `.compute()` or `.persist()`, described in the next section. We'll use `@dask.delayed` several times in this workshop to make PyTorch tasks easily parallelized.\n",
30 | "\n",
31 | "### Example 1"
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": null,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "def exponent(x, y):\n",
41 | "    '''Define a basic function.'''\n",
42 | "    return x ** y\n",
43 | "\n",
44 | "# Function returns result immediately when called\n",
45 | "exponent(4, 5)"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "import dask\n",
55 | "\n",
56 | "@dask.delayed\n",
57 | "def lazy_exponent(x, y):\n",
58 | "    '''Define a lazily evaluating function'''\n",
59 | "    return x ** y\n",
60 | "\n",
61 | "# Function returns a delayed object, not computation\n",
62 | "lazy_exponent(4, 5)"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {},
69 | "outputs": [],
70 | "source": [
71 | "# This will now return the computation\n",
72 | "lazy_exponent(4,5).compute()"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "### Example 2\n",
80 | "\n",
81 | "We can take this knowledge and expand it - because our lazy function returns an object, we can assign it and then chain it together in different ways later.\n",
82 | "\n",
83 | "Here we return a delayed value from the first function, and call it x. Then we pass x to the function a second time, and call it y. Finally, we multiply x and y to produce z."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "x = lazy_exponent(4, 5)\n",
93 | "y = lazy_exponent(x, 2)\n",
94 | "z = x * y\n",
95 | "z"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": null,
101 | "metadata": {},
102 | "outputs": [],
103 | "source": [
104 | "z.visualize(rankdir=\"LR\")"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": null,
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "z.compute()"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "metadata": {},
119 | "source": [
120 | "***\n",
121 | "\n",
122 | "## Persist vs Compute\n",
123 | "\n",
124 | "How should we instruct our computer to run the computations we have queued up lazily? We have two choices: `.persist()` and `.compute()`.\n",
125 | "\n",
126 | "First, remember that we have several machines working for us right now: our Jupyter instance runs on one, and our cluster of worker machines on others.\n",
127 | "\n",
128 | "### Compute\n",
129 | "If we use `.compute()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up, and run them, and bring it all to the surface here, in Jupyter.\n",
130 | "\n",
131 | "That means if the object was distributed, we want to convert it into a local object here and now. If it's a Dask DataFrame, when we call `.compute()`, we're saying \"Run the transformations we've queued, and convert this into a pandas DataFrame immediately.\"\n",
132 | "\n",
133 | "### Persist\n",
134 | "If we use `.persist()`, we are asking Dask to take all the computations and adjustments to the data that we have queued up, and run them, but then the object is going to remain distributed and will live on the cluster, not on the Jupyter instance.\n",
135 | "\n",
136 | "So when we do this with a Dask DataFrame, we are telling our cluster \"Run the transformations we've queued, and leave this as a distributed Dask DataFrame.\"\n",
137 | "\n",
138 | "So, if you want to process all the delayed tasks you've applied to a Dask object, either of these methods will do it. The difference is where your object will live at the end.\n",
139 | "\n",
140 | "***\n",
141 | "\n",
142 | "### Example: Distributed Data Objects\n",
143 | "\n",
144 | "When we use a Dask DataFrame object, we can see the effect of `.persist()` and `.compute()` in practice."
145 | ]
146 | },
147 | {
148 | "cell_type": "code",
149 | "execution_count": null,
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "import dask\n",
154 | "import dask.dataframe as dd\n",
155 | "df = dask.datasets.timeseries()\n",
156 | "df.npartitions"
157 | ]
158 | },
159 | {
160 | "cell_type": "markdown",
161 | "metadata": {},
162 | "source": [
163 | "Our Dask DataFrame has 30 partitions. If we run some computations on this DataFrame, the result is still an object with an `npartitions` attribute, and we can check it. We'll filter it, then do some summary statistics with a groupby."
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "df2 = df[df.y > 0]\n",
173 | "df3 = df2.groupby('name').x.std()\n",
174 | "print(type(df3))\n",
175 | "df3.npartitions"
176 | ]
177 | },
178 | {
179 | "cell_type": "markdown",
180 | "metadata": {},
181 | "source": [
182 | "Now we have reduced the object down to a Series, rather than a DataFrame, which changes the number of partitions.\n",
183 | "\n",
184 | "We can `repartition` the Series, if we want to!"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": null,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "df4 = df3.repartition(npartitions=3)\n",
194 | "df4.npartitions"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "What will happen if we use `.persist()` or `.compute()` on these objects?\n",
202 | "\n",
203 | "As we can see below, `df4` is a Dask Series with 161 queued tasks and 3 partitions. We can run our two different computation commands on the same object and see the different results."
204 | ]
205 | },
206 | {
207 | "cell_type": "code",
208 | "execution_count": null,
209 | "metadata": {},
210 | "outputs": [],
211 | "source": [
212 | "df4"
213 | ]
214 | },
215 | {
216 | "cell_type": "code",
217 | "execution_count": null,
218 | "metadata": {},
219 | "outputs": [],
220 | "source": [
221 | "%%time\n",
222 | "\n",
223 | "df4.persist()"
224 | ]
225 | },
226 | {
227 | "cell_type": "markdown",
228 | "metadata": {},
229 | "source": [
230 | "So, what changed when we ran `.persist()`? Notice that we went from 161 tasks at the bottom of the screen, to just 3. That indicates that there's one task for each partition.\n",
231 | "\n",
232 | "Now, let's try `.compute()`."
233 | ]
234 | },
235 | {
236 | "cell_type": "code",
237 | "execution_count": null,
238 | "metadata": {},
239 | "outputs": [],
240 | "source": [
241 | "%%time\n",
242 | "df4.compute().head()"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "We get back a pandas Series, not a Dask object at all.\n",
250 | "\n",
251 | "***\n",
252 | "\n",
253 | "## Submit to Cluster\n",
254 | "\n",
255 | "To make this all work in a distributed fashion, we need to understand how we send instructions to our cluster. When we use the `@dask.delayed` decorator, we queue up some work and put it in a list, ready to be run. So how do we send it to the workers and explain what we want them to do?\n",
256 | "\n",
257 | "We use the `distributed` module from Dask to make this work. We connect to our cluster (as you saw in [Notebook 1](01-getting-started.ipynb)), and then we'll use some commands to send instructions."
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "from dask_saturn import SaturnCluster\n",
267 | "from dask.distributed import Client\n",
268 | "\n",
269 | "cluster = SaturnCluster()\n",
270 | "client = Client(cluster)"
271 | ]
272 | },
273 | {
274 | "cell_type": "code",
275 | "execution_count": null,
276 | "metadata": {},
277 | "outputs": [],
278 | "source": [
279 | "from dask_saturn.core import describe_sizes\n",
280 | "describe_sizes()"
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "## Sending Tasks\n",
288 | "\n",
289 | "Now we have created the object `client`. This is the handle we'll use to connect with our cluster, for whatever commands we want to send! We will use a few methods to do this communication: `.submit()` and `.map()`.\n",
290 | "\n",
291 | "* `.submit()` lets us send one task to the cluster, to be run once on whatever worker is free.\n",
292 | "* `.map()` lets us send lots of tasks, which will be disseminated to workers in the most efficient way.\n",
293 | "\n",
294 | "There's also `.run()` which you can use to send one task to EVERY worker on the cluster simultaneously. This is only used for small utility tasks, however - like installing a library or collecting diagnostics.\n",
295 | "\n",
296 | "### map Example\n",
297 | "\n",
298 | "For example, you can use `.map()` in this way:\n",
299 | "\n",
300 | "`futures = client.map(function, list_of_inputs)`\n",
301 | "\n",
302 | "This takes our function, maps it over all the inputs, and then these tasks are distributed to the cluster workers. Note: because our function is delayed, they still won't actually compute yet!\n",
303 | "\n",
304 | "Let's try an example. Recall our `lazy_exponent` function from earlier. We are going to alter it so that it accepts its inputs as a single list, then we can use it with `.map()`."
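(As an aside: `client.map` can also zip over several iterables directly, so packing the arguments into one list is a simplification rather than a requirement. A quick hypothetical sketch of the alternative spelling:

```python
# Two iterables are zipped pairwise, like the builtin map()
futures = client.map(lambda x, y: x ** y, [1, 3, 5], [2, 4, 6])
```

We'll use the packed-list version below.)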
305 | ]
306 | },
307 | {
308 | "cell_type": "code",
309 | "execution_count": null,
310 | "metadata": {},
311 | "outputs": [],
312 | "source": [
313 | "@dask.delayed\n",
314 | "def lazy_exponent(args):\n",
315 | "    '''Define a lazily evaluating function'''\n",
316 | "    x, y = args\n",
317 | "    return x ** y"
318 | ]
319 | },
320 | {
321 | "cell_type": "code",
322 | "execution_count": null,
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "inputs = [[1,2], [3,4], [5,6]]\n",
327 | "\n",
328 | "example_future = client.map(lazy_exponent, inputs)"
329 | ]
330 | },
331 | {
332 | "cell_type": "markdown",
333 | "metadata": {},
334 | "source": [
335 | "***\n",
336 | "\n",
337 | "## Processing Results\n",
338 | "We have one more step before we use `.compute()`, which is `.gather()`. This creates one more instruction to be included in this big delayed job we're establishing: retrieving the results from all of our jobs. It's going to sit tight as well until we finally say `.compute()`.\n",
339 | "\n",
340 | "### gather Example"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "futures_gathered = client.gather(example_future)"
350 | ]
351 | },
352 | {
353 | "cell_type": "markdown",
354 | "metadata": {},
355 | "source": [
356 | "It may help to think of all the work as instructions in a list. We have so far told our cluster: \"map our delayed function over this list of inputs, and pass the resulting tasks to the workers\", \"Gather up the results of those tasks, and bring them back\". But the one thing we haven't said is \"Ok, now begin to process all these instructions\"! That's what `.compute()` will do. For us this looks like:"
357 | ]
358 | },
359 | {
360 | "cell_type": "code",
361 | "execution_count": null,
362 | "metadata": {},
363 | "outputs": [],
364 | "source": [
365 | "futures_computed = client.compute(futures_gathered, sync=False)"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "metadata": {},
371 | "source": [
372 | "We can investigate the results, and use a small list comprehension to return them for later use."
373 | ]
374 | },
375 | {
376 | "cell_type": "code",
377 | "execution_count": null,
378 | "metadata": {},
379 | "outputs": [],
380 | "source": [
381 | "futures_computed"
382 | ]
383 | },
384 | {
385 | "cell_type": "code",
386 | "execution_count": null,
387 | "metadata": {},
388 | "outputs": [],
389 | "source": [
390 | "futures_computed[0].result()"
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": null,
396 | "metadata": {},
397 | "outputs": [],
398 | "source": [
399 | "results = [x.result() for x in futures_computed]\n",
400 | "results"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Now we have the background knowledge we need to move on to running PyTorch jobs! \n",
408 | "* If you want to do inference, go to [Notebook 3](03-single-inference.ipynb).
\n", 409 | "* If you want to do training/transfer learning, go to [Notebook 5](05-transfer-prepro.ipynb).\n", 410 | "\n", 411 | "### Helpful reference links: \n", 412 | "* https://distributed.dask.org/en/latest/client.html\n", 413 | "* https://distributed.dask.org/en/latest/manage-computation.html\n" 414 | ] 415 | }, 416 | { 417 | "cell_type": "code", 418 | "execution_count": null, 419 | "metadata": {}, 420 | "outputs": [], 421 | "source": [] 422 | } 423 | ], 424 | "metadata": { 425 | "kernelspec": { 426 | "display_name": "Python 3", 427 | "language": "python", 428 | "name": "python3" 429 | }, 430 | "language_info": { 431 | "codemirror_mode": { 432 | "name": "ipython", 433 | "version": 3 434 | }, 435 | "file_extension": ".py", 436 | "mimetype": "text/x-python", 437 | "name": "python", 438 | "nbconvert_exporter": "python", 439 | "pygments_lexer": "ipython3", 440 | "version": "3.7.7" 441 | } 442 | }, 443 | "nbformat": 4, 444 | "nbformat_minor": 4 445 | } 446 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | BSD 3-Clause License 2 | 3 | Copyright (c) 2020, Saturn Cloud 4 | All rights reserved. 5 | 6 | Redistribution and use in source and binary forms, with or without 7 | modification, are permitted provided that the following conditions are met: 8 | 9 | 1. Redistributions of source code must retain the above copyright notice, this 10 | list of conditions and the following disclaimer. 11 | 12 | 2. Redistributions in binary form must reproduce the above copyright notice, 13 | this list of conditions and the following disclaimer in the documentation 14 | and/or other materials provided with the distribution. 15 | 16 | 3. Neither the name of the copyright holder nor the names of its 17 | contributors may be used to endorse or promote products derived from 18 | this software without specific prior written permission. 19 | 20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE 24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL 25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR 26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, 28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 30 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | # Introduction to PyTorch with Dask 4 | 5 | ## Workshop: learn to apply Dask to improve PyTorch performance 6 | 7 | In this workshop, attendees will have the opportunity to see how common deep learning tasks in PyTorch can be easily parallelized using Dask clusters on Saturn Cloud. 
8 |
9 | After this workshop you will know:
10 | - Basics of how Dask works
11 | - How to run inference with a pretrained model on a Dask cluster
12 | - How to run transfer learning on a Dask cluster
13 |
14 | To get the full learning value from this workshop, attendees should have prior experience with PyTorch. Experience with parallel computing is not needed.
15 |
16 | ## Getting Started
17 | If you are going to work through all the exercises, please use the steps below. If you'd like to just read along and not run the code, you can use the [notebook_output](notebook_output) folder above to see all the notebooks with the code already run.
18 |
19 | ### Setup Steps
20 |
21 | 1. Create an account on [Saturn Cloud Hosted](https://accounts.community.saturnenterprise.io/register) or use your organization's existing Saturn Cloud Enterprise installation.
22 | 1. Create a new project (keep defaults unless specified here)
23 |     - Name: "workshop-dask-pytorch"
24 |     - Image: `saturncloud/saturn-gpu:2020.11.30` (Or most recent date suffix available)
25 |     - Under Advanced Settings, Start Script (Bash) add the following:
26 |     ` /srv/conda/envs/saturn/bin/pip install graphviz dask-pytorch-ddp plotnine tensorboardX`
27 |     - Under Environment Variables, add the following:
28 |     `DASK_DISTRIBUTED__WORKER__DAEMON=False`
29 |     - Workspace Settings
30 |         - Size: `V100-2XLarge - 8 cores - 61 GB RAM - 1 GPU`
31 |     - Click "Create"
32 | 1. Attach a Dask Cluster to the project
33 |     - Scheduler Size: `Medium`
34 |     - Worker Size: `V100-2XLarge - 8 cores - 61 GB RAM - 1 GPU`
35 |     - Number of workers (n_workers): 3
36 |     - Number of worker threads (nthreads): 8
37 |     - Click "Create"
38 | 1. Start both the Jupyter Server and Dask Cluster
39 | 1. Open Jupyter Lab
40 | 1. From Jupyter Lab, open a new Terminal window and clone the workshop-dask-pytorch repository:
41 |     ```bash
42 |     git clone https://github.com/saturncloud/workshop-dask-pytorch.git /tmp/workshop-dask-pytorch
43 |     cp -r /tmp/workshop-dask-pytorch /home/jovyan/project
44 |     ```
45 | 1. Navigate to the "workshop-dask-pytorch" folder in the File browser and start from the [01-getting-started.ipynb](01-getting-started.ipynb) notebook.
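Once both resources are running, a quick way to confirm that the notebook can reach the cluster is to run the same connection code the first notebook uses (shown here as a sketch; `wait_for_workers(3)` matches the three workers configured above):

```python
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster()   # connects to the Dask cluster attached to this project
client = Client(cluster)
client.wait_for_workers(3)  # blocks until all three workers are up
print(client)
```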
46 |
47 |
48 | ### Screenshots
49 |
50 | The project from the Saturn UI should look similar to this:
51 |
52 |
53 |
54 |
55 |
56 |
57 | Your JupyterLab environment should look like this:
58 |
59 |
-------------------------------------------------------------------------------- /img/batch-reformatting.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/batch-reformatting.png -------------------------------------------------------------------------------- /img/project_files.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/project_files.png -------------------------------------------------------------------------------- /img/project_setup1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/project_setup1.png -------------------------------------------------------------------------------- /img/project_setup2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/project_setup2.png -------------------------------------------------------------------------------- /img/project_setup3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/project_setup3.png -------------------------------------------------------------------------------- /img/saturn_logo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/img/saturn_logo.png -------------------------------------------------------------------------------- /inference_demo/03-single-inference.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "\n",
9 | "# Baseline Inference\n",
10 | "\n",
11 | "This project will do inference: classify an image with the most accurate label our model can give it. We're using the [Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/), so we're asking Resnet50 to give us the correct breed label. \n",
12 | "\n",
13 | "Before we go into parallelization of this task, let's do a quick single-thread version. Then, in [Notebook 4](04-parallel-inference.ipynb), we'll convert this to a parallelized task.\n",
14 | "\n",
15 | "### Set up file store\n",
16 | "\n",
17 | "Connect to our S3 bucket where the images are held."
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": null,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "import s3fs\n",
27 | "s3 = s3fs.S3FileSystem(anon=True)"
28 | ]
29 | },
30 | {
31 | "cell_type": "markdown",
32 | "metadata": {},
33 | "source": [
34 | "### Download model and labels for ResNet\n",
35 | "\n",
36 | "First, we connect to the S3 data store, where we will get one sample image, as well as the 1000-item ImageNet label dataset. Those labels will allow us to turn the predictions from our model into human-interpretable strings.\n",
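For context, that file is the familiar ImageNet `clsidx_to_labels` mapping, with one `index: 'label'` entry per class. A hypothetical peek at its shape, once the next code cell has built the `classes` list (the entries are raw bytes because `s3.open` defaults to binary mode):

```python
# Each entry is a raw stripped line of the label file; for example,
# classes[258] should look roughly like b"258: 'Samoyed, Samoyede',"
print(classes[0])
print(classes[258])
```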
37 | "\n",
38 | "PyTorch has a companion library, torchvision, which gives us access to a number of handy tools, including copies of popular models like Resnet. You can learn more about the available models in [the torchvision documentation](https://pytorch.org/docs/stable/torchvision/models.html)."
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": null,
44 | "metadata": {},
45 | "outputs": [],
46 | "source": [
47 | "from torchvision import datasets, transforms, models\n",
48 | "\n",
49 | "resnet = models.resnet50(pretrained=True)\n",
50 | "\n",
51 | "with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:\n",
52 | "    classes = [line.strip() for line in f.readlines()]"
53 | ]
54 | },
55 | {
56 | "cell_type": "markdown",
57 | "metadata": {},
58 | "source": [
59 | "### Load image and design transform steps"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {},
66 | "outputs": [],
67 | "source": [
68 | "from PIL import Image\n",
69 | "\n",
70 | "with s3.open(\"s3://saturn-public-data/dogs/2-dog.jpg\", 'rb') as f:\n",
71 | "    img = Image.open(f).convert(\"RGB\")\n",
72 | "    \n",
73 | "transform = transforms.Compose([\n",
74 | "    transforms.Resize(256), \n",
75 | "    transforms.CenterCrop(250), \n",
76 | "    transforms.ToTensor()])"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "### Set up inference function"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "import torch\n",
93 | "to_pil = transforms.ToPILImage()\n",
94 | "\n",
95 | "def classify_img(transform, img, model):\n",
96 | "    img_t = transform(img)\n",
97 | "    batch_t = torch.unsqueeze(img_t, 0)\n",
98 | "\n",
99 | "    model.eval()\n",
100 | "    out = model(batch_t)\n",
101 | "    \n",
102 | "    _, indices = torch.sort(out, descending=True)\n",
103 | "    percentage = torch.nn.functional.softmax(out, dim=1)[0] * 100\n",
104 | "    labelset = [(classes[idx], percentage[idx].item()) for idx in indices[0][:5]]\n",
105 | "    return to_pil(img_t), labelset"
106 | ]
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "\n",
113 | "Key aspects of the function to pay attention to include:\n",
114 | "\n",
115 | "* `img_t = transform(img)` : we must run the transformation we defined above on every image before we try to classify it.\n",
116 | "* `batch_t = torch.unsqueeze(img_t, 0)` : this step reshapes our image tensor so that the model can accept it.\n",
117 | "* `model.eval()` : When we download the model, it can either be in training or in evaluation mode. We need it in evaluation mode here, so that it can return the predicted labels to us without changing itself.\n",
118 | "* `out = model(batch_t)` : This step actually evaluates the images. We are using batches of images here, so many can be classified at once.\n",
119 | "\n",
120 | "### Results Processing\n",
121 | "* `_, indices = torch.sort(out, descending=True)` : Sorts the results, high score to low (gives us the most likely labels at the top).\n",
122 | "* `percentage = torch.nn.functional.softmax(out, dim=1)[0] * 100` : Rescales the scores from the model to probabilities (returns the probability of each label).\n",
123 | "* `labelset = [(classes[idx], percentage[idx].item()) for idx in indices[0][:5]]` : Interprets the top five labels in human readable form."
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": null,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "%%time\n",
133 | "\n",
134 | "dogpic, labels = classify_img(transform, img, resnet)"
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": null,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "dogpic"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "labels"
153 | ]
154 | },
155 | {
156 | "cell_type": "markdown",
157 | "metadata": {},
158 | "source": [
159 | "Great job, we have proved the basic task works!"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "\"success\"\n"
167 | ]
168 | },
169 | {
170 | "cell_type": "markdown",
171 | "metadata": {},
172 | "source": [
173 | "***\n",
174 | "\n",
175 | "## Moving to Parallel\n",
176 | "\n",
177 | "Our job with one image runs quite fast! However, if we want to classify all 20,000+ images in the [Stanford Dogs Dataset](http://vision.stanford.edu/aditya86/ImageNetDogs/), that's going to add up to real time. So, let's take a look at how we can do this so that images are not classified one at a time, but in a highly parallel way, in [Notebook 4](04-parallel-inference.ipynb)."
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": null,
183 | "metadata": {},
184 | "outputs": [],
185 | "source": []
186 | }
187 | ],
188 | "metadata": {
189 | "kernelspec": {
190 | "display_name": "saturn (Python 3)",
191 | "language": "python",
192 | "name": "python3"
193 | },
194 | "language_info": {
195 | "codemirror_mode": {
196 | "name": "ipython",
197 | "version": 3
198 | },
199 | "file_extension": ".py",
200 | "mimetype": "text/x-python",
201 | "name": "python",
202 | "nbconvert_exporter": "python",
203 | "pygments_lexer": "ipython3",
204 | "version": "3.7.7"
205 | }
206 | },
207 | "nbformat": 4,
208 | "nbformat_minor": 4
209 | }
210 |
-------------------------------------------------------------------------------- /inference_demo/04-parallel-inference.ipynb: --------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "\n",
9 | "# Parallel Inference\n",
10 | "\n",
11 | "We are ready to scale up our inference task!\n",
12 | "\n",
13 | "\"scaleup\"\n",
14 | "\n",
15 | "\n",
16 | "**Dataset:** [Stanford Dogs](http://vision.stanford.edu/aditya86/ImageNetDogs/main.html) \n",
17 | "**Model:** [Resnet50](https://arxiv.org/abs/1512.03385)\n",
18 | "\n",
19 | "\n",
20 | "We've done this before, but to refresh your memory, get connected to the cluster using the following code."
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": null,
26 | "metadata": {},
27 | "outputs": [],
28 | "source": [
29 | "from dask_saturn import SaturnCluster\n",
30 | "from dask.distributed import Client\n",
31 | "from torchvision import datasets, transforms, models\n",
32 | "import re\n",
33 | "\n",
34 | "cluster = SaturnCluster()\n",
35 | "client = Client(cluster)\n",
36 | "client.wait_for_workers(3)\n",
37 | "client"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": null,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "%run -i ../tools/setup1.py\n",
47 | "\n",
48 | "display(HTML(gpu_links))"
49 | ]
50 | },
51 | {
52 | "cell_type": "markdown",
53 | "metadata": {},
54 | "source": [
55 | "We'll use the command above to get ourselves back to the state we need from Notebook 3.\n",
56 | "\n",
57 | "***\n",
58 | "\n",
59 | "## Assigning Objects to GPU Resources\n",
60 | "\n",
61 | "If you are going to run any processes on GPU resources in a cluster, all your objects need to be explicitly assigned to the GPU; otherwise, PyTorch won't seek out GPU resources, and the work will stay on CPU. However, if you use a functional setup (as we are going to do later) you'll need to do this INSIDE your function. Our architecture below will have all that written in. But before we go too complex, we should learn how that works in isolation.\n",
62 | "\n",
63 | "This command is all you need to assign an object (a model, an image, etc) to a GPU-type resource. [The PyTorch docs can tell us more.](https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.device) So here's how we do it with the model:"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "import torch\n",
73 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
74 | "resnet = models.resnet50(pretrained=True)\n",
75 | "\n",
76 | "resnet = resnet.to(device)"
77 | ]
78 | },
79 | {
80 | "cell_type": "markdown",
81 | "metadata": {},
82 | "source": [
83 | "What would you write to assign a transformed image (call it `img_t`) to a GPU resource? \n",
84 | "We'll do this a few more times in the upcoming examples."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {},
91 | "outputs": [],
92 | "source": [
93 | "type(img_t)"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": null,
99 | "metadata": {},
100 | "outputs": [],
101 | "source": [
102 | "img_t = img_t.to(device)"
103 | ]
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "metadata": {},
108 | "source": [
109 | "## Preprocessing Images\n",
110 | "\n",
111 | "Our goal here is to create a nicely streamlined workflow, including loading, transforming, batching, and labeling images, which we can then run in parallel."
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": null,
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "import dask\n",
121 | "\n",
122 | "@dask.delayed\n",
123 | "def preprocess(path, fs=__builtins__):\n",
124 | "    '''Ingest images directly from S3, apply transformations,\n",
125 | "    and extract the ground truth and image identifier. Accepts a\n",
126 | "    filepath; fs defaults to builtins, so fs.open is the plain local\n",
127 | "    open() unless an S3 filesystem like fs=s3 is passed in. '''\n",
128 | "    \n",
129 | "    transform = transforms.Compose([\n",
130 | "        transforms.Resize(256), \n",
131 | "        transforms.CenterCrop(250), \n",
132 | "        transforms.ToTensor(),\n",
133 | "    ])\n",
134 | "\n",
135 | "    with fs.open(path, 'rb') as f:\n",
136 | "        img = Image.open(f).convert(\"RGB\")\n",
137 | "        nvis = transform(img)\n",
138 | "\n",
139 | "    truth = re.search('dogs/Images/n[0-9]+-([^/]+)/n[0-9]+_[0-9]+.jpg', path).group(1)\n",
140 | "    name = re.search('dogs/Images/n[0-9]+-[a-zA-Z-_]+/(n[0-9]+_[0-9]+).jpg', path).group(1)\n",
141 | "    \n",
142 | "    return [name, nvis, truth]"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "This function does a number of things for us.\n",
150 | "* Open an image file from S3\n",
151 | "* Apply transformations to the image\n",
152 | "* Retrieve a unique identifier for the image\n",
153 | "* Retrieve the ground truth label for the image\n",
154 | "\n",
155 | "But you'll notice that this has a `@dask.delayed` decorator, so we can queue it without it running immediately when called. Because of this, we can use some list comprehension strategies to create our batches and get them ready for our inference.\n",
156 | "\n",
157 | "First, we break the list of images we have from our S3 filepath into chunks that will define the batches. (The setup script we ran above re-created `s3`, the same S3 connection we used in [Notebook 3](03-single-inference.ipynb).)\n",
158 | "\n",
159 | "***\n",
160 | "\n",
161 | "### List Image Files"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "import toolz\n",
171 | "\n",
172 | "s3fpath = 's3://saturn-public-data/dogs/Images/*/*.jpg'\n",
173 | "batch_breaks = [list(batch) for batch in toolz.partition_all(80, s3.glob(s3fpath))]"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": null,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "len(batch_breaks)"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": null,
188 | "metadata": {},
189 | "outputs": [],
190 | "source": [
191 | "batch_breaks[0][:5]"
192 | ]
193 | },
194 | {
195 | "cell_type": "markdown",
196 | "metadata": {},
197 | "source": [
198 | "What does one of our batches look like?
It's a list of image paths!\n", 198 | "\n", 199 | "***\n", 200 | "\n", 201 | "## Process" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "image_batches = [[preprocess(x, fs=s3) for x in y] for y in batch_breaks]" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [ 219 | "image_batches[0][0].compute()" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "### Reformat\n", 227 | "\n", 228 | "" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": {}, 235 | "outputs": [], 236 | "source": [ 237 | "@dask.delayed\n", 238 | "def reformat(batch):\n", 239 | " flat_list = [item for item in batch]\n", 240 | " tensors = [x[1] for x in flat_list]\n", 241 | " names = [x[0] for x in flat_list]\n", 242 | " labels = [x[2] for x in flat_list]\n", 243 | " \n", 244 | " tensors = torch.stack(tensors).to(device)\n", 245 | " \n", 246 | " return [names, tensors, labels]\n", 247 | "\n", 248 | "image_batches = [reformat(result) for result in image_batches] " 249 | ] 250 | }, 251 | { 252 | "cell_type": "markdown", 253 | "metadata": {}, 254 | "source": [ 255 | "If we want to get a nice visual representation of the tasks we have queued up, we can use the `.visualize()` method on a delayed object, like this. We've set up a lot of tasks in this one batch!" 256 | ] 257 | }, 258 | { 259 | "cell_type": "code", 260 | "execution_count": null, 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "image_batches[0].visualize()" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "Now we have our images ready! 
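(If you want a quick sanity check at this point — a hypothetical aside, not part of the workshop flow — computing one reformatted batch should return a `[names, tensors, labels]` triple whose tensor stack reflects the batch size and crop dimensions chosen above:

```python
# Pulls one computed batch back to this machine
names, tensors, labels = image_batches[0].compute()
print(len(names), tensors.shape)  # expect 80 and torch.Size([80, 3, 250, 250])
```
)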
But as you know, we really just have a list of tasks queued up that we're going to ask our cluster to complete later.\n",
272 | "\n",
273 | "***\n",
274 | "\n",
275 | "## Check Images\n",
276 | "\n",
277 | "### Image Identifiers"
278 | ]
279 | },
280 | {
281 | "cell_type": "code",
282 | "execution_count": null,
283 | "metadata": {},
284 | "outputs": [],
285 | "source": [
286 | "test_set = image_batches[25][0][:5].compute()\n",
287 | "test_set"
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": null,
293 | "metadata": {},
294 | "outputs": [],
295 | "source": [
296 | "image_batches[25][2][:5].compute()"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": null,
302 | "metadata": {},
303 | "outputs": [],
304 | "source": [
305 | "import matplotlib.pyplot as plt\n",
306 | "cpudevice = torch.device(\"cpu\")\n",
307 | "tensorset = image_batches[25].compute()\n",
308 | "to_pil = transforms.ToPILImage()\n",
309 | "\n",
310 | "imglist = [to_pil(tensorset[1][0].to(cpudevice)), \n",
311 | "           to_pil(tensorset[1][1].to(cpudevice)),\n",
312 | "           to_pil(tensorset[1][2].to(cpudevice)),\n",
313 | "           to_pil(tensorset[1][3].to(cpudevice)),\n",
314 | "           to_pil(tensorset[1][4].to(cpudevice))]"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": null,
320 | "metadata": {},
321 | "outputs": [],
322 | "source": [
323 | "f, ax = plt.subplots(1, 5, figsize=(16,6))\n",
324 | "\n",
325 | "for i in range(0,5):\n",
326 | "    img1 = imglist[i]\n",
327 | "    ax[i].imshow(img1).axes.xaxis.set_visible(False)\n",
328 | "    ax[i].axes.yaxis.set_visible(False)\n",
329 | "\n",
330 | "title = 'Sample Images'\n",
331 | "f.suptitle(title, fontsize=16)\n",
332 | "plt.tight_layout()\n",
333 | "plt.show()"
334 | ]
335 | },
336 | {
337 | "cell_type": "markdown",
338 | "metadata": {},
339 | "source": [
340 | "\n",
341 | "***\n",
342 | "\n",
343 | "## Run the Model\n",
344 | "We are ready to do the inference task! This is going to have a few steps, all of which are contained in functions described below, but we’ll talk through them so everything is clear, using just one batch as an example.\n",
345 | "\n",
346 | "Our unit of work at this point is batches of 80 images at a time, which we created in the section above. They are all neatly arranged in lists so that we can work with them effectively."
347 | ]
348 | },
349 | {
350 | "cell_type": "markdown",
351 | "metadata": {},
352 | "source": [
353 | "\n",
354 | "***\n",
355 | "\n",
356 | "## Result Evaluation\n",
357 | "\n",
358 | "The predictions and truth we have so far, however, are not really human readable or comparable, so we’ll use the functions that follow to fix them up and get us interpretable results."
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": null,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "def evaluate_pred_batch(batch, gtruth, classes):\n",
368 | "    ''' Accepts a batch of model outputs plus the ground truth, returns human readable predictions. '''\n",
369 | "    \n",
370 | "    _, indices = torch.sort(batch, descending=True)\n",
371 | "    percentage = torch.nn.functional.softmax(batch, dim=1) * 100\n",
372 | "    \n",
373 | "    preds = []\n",
374 | "    labslist = []\n",
375 | "    for i in range(len(batch)):\n",
376 | "        pred = [(classes[idx], percentage[i][idx].item()) for idx in indices[i][:1]]\n",
377 | "        preds.append(pred)\n",
378 | "\n",
379 | "        labs = gtruth[i]\n",
380 | "        labslist.append(labs)\n",
381 | "    \n",
382 | "    return(preds, labslist)\n",
383 | "\n",
384 | "def is_match(label, pred):\n",
385 | "    ''' Evaluates human readable prediction against ground truth.'''\n",
386 | "    if re.search(label.replace('_', ' '), str(pred).replace('_', ' ')):\n",
387 | "        match = True\n",
388 | "    else:\n",
389 | "        match = False\n",
390 | "    return(match)"
391 | ]
392 | },
393 | {
394 | "cell_type": "markdown",
395 | "metadata": {},
396 | "source": [
397 | "This takes our results from the model, plus a few other elements, and returns nice readable predictions along with the probabilities the model assigned. From here, we’re nearly done! We want to organize our results in a tidy, human readable way, and the rest of the workflow handles that. It iterates over each image because these steps are not batch-aware. `is_match` is one of our custom functions, which you can see above.\n"
398 | ]
399 | },
400 | {
401 | "cell_type": "markdown",
402 | "metadata": {},
403 | "source": [
404 | "\n",
405 | "## Put It All Together\n",
406 | "\n",
407 | "Now, we aren’t going to patch together all these computations by hand; instead, we have assembled them in one single delayed function that will do the work for us. Importantly, we can then map this over all our batches of images across the cluster! Can you spot all the tasks we have described above? "
408 | ]
409 | },
410 | {
411 | "cell_type": "code",
412 | "execution_count": null,
413 | "metadata": {},
414 | "outputs": [],
415 | "source": [
416 | "@dask.delayed\n",
417 | "def run_batch_to_s3(iteritem):\n",
418 | "    ''' Accepts iterable result of preprocessing, generates\n",
419 | "    inferences and evaluates. '''\n",
420 | "    \n",
421 | "    names, images, truelabels = iteritem\n",
422 | "    \n",
423 | "    with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:\n",
424 | "        classes = [line.strip() for line in f.readlines()]\n",
425 | "\n",
426 | "    # Retrieve, set up model\n",
427 | "    resnet = models.resnet50(pretrained=True)\n",
428 | "    resnet = resnet.to(device)\n",
429 | "\n",
430 | "    with torch.no_grad():\n",
431 | "        resnet.eval()\n",
432 | "        pred_batch = resnet(images)\n",
433 | "    \n",
434 | "    #Evaluate batch\n",
435 | "    preds, labslist = evaluate_pred_batch(pred_batch, truelabels, classes)\n",
436 | "\n",
437 | "    #Organize prediction results\n",
438 | "    outcomes = []\n",
439 | "    for j in range(0, len(images)):\n",
440 | "        match = is_match(labslist[j], preds[j]) \n",
441 | "        outcome = {'name': names[j], 'ground_truth': labslist[j], \n",
442 | "                   'prediction': preds[j], 'evaluation': match}\n",
443 | "        outcomes.append(outcome)\n",
444 | "    \n",
445 | "    return(outcomes)"
446 | ]
447 | },
448 | {
449 | "cell_type": "markdown",
450 | "metadata": {},
451 | "source": [
452 | "## Run the Job\n",
453 | "\n",
454 | "If you think you've filled in everything correctly, now you can try running the tasks in parallel.
If you get errors, check the hidden chunk for answers.\n", 455 | "\n", 456 | "Notice that we're going to use client methods below to ensure that our tasks are distributed across the cluster, run, and then retrieved." 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "display(HTML(gpu_links))" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": {}, 472 | "outputs": [], 473 | "source": [ 474 | "%%time\n", 475 | "\n", 476 | "futures = client.map(run_batch_to_s3, image_batches) \n", 477 | "futures_gathered = client.gather(futures)\n", 478 | "futures_computed = client.compute(futures_gathered, sync=False)\n", 479 | "\n", 480 | "import logging\n", 481 | "\n", 482 | "results = []\n", 483 | "errors = []\n", 484 | "for fut in futures_computed:\n", 485 | " try:\n", 486 | " result = fut.result()\n", 487 | " except Exception as e:\n", 488 | " errors.append(e)\n", 489 | " logging.error(e)\n", 490 | " else:\n", 491 | " results.extend(result)" 492 | ] 493 | }, 494 | { 495 | "cell_type": "markdown", 496 | "metadata": {}, 497 | "source": [ 498 | "When we run this block, we might want to go visit the Dask dashboard, to see our work as it runs.\n" 499 | ] 500 | }, 501 | { 502 | "cell_type": "markdown", 503 | "metadata": {}, 504 | "source": [ 505 | "***\n", 506 | "\n", 507 | "## Review Results\n", 508 | "\n", 509 | "Look at the graph for one batch, and spot check output." 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": {}, 516 | "outputs": [], 517 | "source": [ 518 | "test_sample = run_batch_to_s3(image_batches[0])\n", 519 | "test_sample.visualize(rankdir=\"LR\")" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": null, 525 | "metadata": {}, 526 | "outputs": [], 527 | "source": [ 528 | "futures_computed[0].result()[0]" 529 | ] 530 | }, 531 | { 532 | "cell_type": "code", 533 | "execution_count": null, 534 | "metadata": {}, 535 | "outputs": [], 536 | "source": [ 537 | "results[0]" 538 | ] 539 | }, 540 | { 541 | "cell_type": "markdown", 542 | "metadata": {}, 543 | "source": [ 544 | "### Check Original Sample" 545 | ] 546 | }, 547 | { 548 | "cell_type": "code", 549 | "execution_count": null, 550 | "metadata": {}, 551 | "outputs": [], 552 | "source": [ 553 | "test_dogs = [d for d in results if d['name'] in test_set]\n", 554 | "test_dogs" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": null, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "f, ax = plt.subplots(1, 5, figsize=(16,6))\n", 564 | "\n", 565 | "for i in range(0,5):\n", 566 | " img1 = imglist[i]\n", 567 | " ax[i].imshow(img1).axes.xaxis.set_visible(False)\n", 568 | " ax[i].axes.yaxis.set_visible(False)\n", 569 | "\n", 570 | "title = 'Sample Images'\n", 571 | "f.suptitle(title, fontsize=16)\n", 572 | "plt.tight_layout()\n", 573 | "plt.show()" 574 | ] 575 | }, 576 | { 577 | "cell_type": "markdown", 578 | "metadata": {}, 579 | "source": [ 580 | "***\n", 581 | "\n", 582 | "## Overall Accuracy" 583 | ] 584 | }, 585 | { 586 | "cell_type": "code", 587 | "execution_count": null, 588 | "metadata": {}, 589 | "outputs": [], 590 | "source": [ 591 | "len(results)" 592 | ] 593 | }, 594 | { 595 | "cell_type": "code", 596 | "execution_count": null, 597 | "metadata": {}, 598 | "outputs": [], 599 | "source": [ 600 | "true_preds = [x['evaluation'] for x in results if x['evaluation'] == True]\n", 601 | 
"false_preds = [x['evaluation'] for x in results if x['evaluation'] == False]" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "len(true_preds)/len(results)*100" 611 | ] 612 | }, 613 | { 614 | "cell_type": "markdown", 615 | "metadata": {}, 616 | "source": [ 617 | "***\n", 618 | "\n", 619 | "## Lessons Learned\n", 620 | "\n", 621 | "* You can apply `@dask.delayed` to your custom code to allow parallelization with nearly zero refactoring\n", 622 | "* Objects that are needed for a parallel task on GPU need to be assigned to a GPU resource\n", 623 | "* Passing tasks to the workers uses mapping across the cluster for peak efficiency\n", 624 | "\n", 625 | "And, of course, having multiple workers makes the job a lot faster!\n", 626 | "\n", 627 | "\"parallel\"\n", 628 | "\n" 629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": null, 634 | "metadata": {}, 635 | "outputs": [], 636 | "source": [] 637 | } 638 | ], 639 | "metadata": { 640 | "kernelspec": { 641 | "display_name": "saturn (Python 3)", 642 | "language": "python", 643 | "name": "python3" 644 | }, 645 | "language_info": { 646 | "codemirror_mode": { 647 | "name": "ipython", 648 | "version": 3 649 | }, 650 | "file_extension": ".py", 651 | "mimetype": "text/x-python", 652 | "name": "python", 653 | "nbconvert_exporter": "python", 654 | "pygments_lexer": "ipython3", 655 | "version": "3.7.7" 656 | } 657 | }, 658 | "nbformat": 4, 659 | "nbformat_minor": 4 660 | } 661 | -------------------------------------------------------------------------------- /tools/setup1.py: -------------------------------------------------------------------------------- 1 | import s3fs 2 | s3 = s3fs.S3FileSystem(anon=True) 3 | 4 | from torchvision import datasets, transforms, models 5 | 6 | resnet = models.resnet50(pretrained=True) 7 | 8 | with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f: 9 | classes = [line.strip() for line in f.readlines()] 10 | 11 | 12 | from PIL import Image 13 | to_pil = transforms.ToPILImage() 14 | 15 | with s3.open("s3://saturn-public-data/dogs/2-dog.jpg", 'rb') as f: 16 | img = Image.open(f).convert("RGB") 17 | 18 | transform = transforms.Compose([ 19 | transforms.Resize(256), 20 | transforms.CenterCrop(250), 21 | transforms.ToTensor()]) 22 | 23 | img_t = transform(img) 24 | 25 | from IPython.display import display, HTML 26 | gpu_links = f''' 27 | Cluster Dashboard links 28 | 33 | ''' -------------------------------------------------------------------------------- /tools/setup2.py: -------------------------------------------------------------------------------- 1 | from dask_saturn import SaturnCluster 2 | from dask.distributed import Client 3 | import matplotlib.pyplot as plt 4 | import numpy as np 5 | import os 6 | 7 | import math 8 | import datetime 9 | import json 10 | import pickle 11 | import tensorboard 12 | #tensorboard.__version__ 13 | from dask_pytorch_ddp import data, dispatch 14 | from torch.utils.data.sampler import SubsetRandomSampler 15 | 16 | import s3fs 17 | import re 18 | 19 | cluster = SaturnCluster() 20 | client = Client(cluster) 21 | client.wait_for_workers(3) 22 | 23 | import torch 24 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 25 | 26 | from dask_pytorch_ddp import results, data, dispatch 27 | from torch.utils.data.sampler import SubsetRandomSampler 28 | 29 | 30 | def prepro_batches(bucket, prefix): 31 | '''Initialize the custom Dataset 
class defined above, apply transformations.'''
 32 |     transform = transforms.Compose([
 33 |         transforms.Resize(256),
 34 |         transforms.CenterCrop(250),
 35 |         transforms.ToTensor()])
 36 |     whole_dataset = data.S3ImageFolder(bucket, prefix, transform=transform, anon=True)
 37 |     return whole_dataset
 38 | 
 39 | 
 40 | def get_splits_parallel(train_pct, data, batch_size, subset=False, workers=1, num_workers=64):
 41 |     '''Select two samples of data for training and evaluation.
 42 |     `workers` is the Dask worker count (used to shrink subsets);
 43 |     `num_workers` is the DataLoader subprocess count per machine.'''
 44 |     import multiprocessing as mp  # local import: provides the 'fork' context for the DataLoaders below
 45 |     train_size = math.floor(len(data) * train_pct)
 46 |     indices = list(range(len(data)))
 47 |     np.random.shuffle(indices)
 48 |     train_idx = indices[:train_size]
 49 |     test_idx = indices[train_size:len(data)]
 50 | 
 51 |     if subset:
 52 |         train_idx = np.random.choice(train_idx, size=int(np.floor(len(train_idx) * (1 / workers))), replace=False)
 53 |         test_idx = np.random.choice(test_idx, size=int(np.floor(len(test_idx) * (1 / workers))), replace=False)
 54 | 
 55 |     train_sampler = SubsetRandomSampler(train_idx)
 56 |     test_sampler = SubsetRandomSampler(test_idx)
 57 | 
 58 |     train_loader = torch.utils.data.DataLoader(data, sampler=train_sampler, batch_size=batch_size, num_workers=num_workers, multiprocessing_context=mp.get_context('fork'))
 59 |     test_loader = torch.utils.data.DataLoader(data, sampler=test_sampler, batch_size=batch_size, num_workers=num_workers, multiprocessing_context=mp.get_context('fork'))
 60 | 
 61 |     return train_loader, test_loader
 62 | 
 63 | 
 64 | def replace_label(dataset_label, model_labels):
 65 |     label_string = re.search('n[0-9]+-([^/]+)', dataset_label).group(1)
 66 | 
 67 |     for i in model_labels:
 68 |         i = str(i).replace('{', '').replace('}', '')
 69 |         model_label_str = re.search('''b["'][0-9]+: ["']([^\/]+)["'],["']''', str(i))
 70 |         model_label_idx = re.search('''b["']([0-9]+):''', str(i)).group(1)
 71 | 
 72 |         if re.search(str(label_string).replace('_', ' '), str(model_label_str).replace('_', ' ')):
 73 |             return i, model_label_idx
 74 |     # falls through and returns None if no Imagenet label matches
 75 | 
 76 | #####
 77 | 
 78 | 
 79 | def matplotlib_imshow(img, one_channel=False):
 80 |     if one_channel:
 81 |         img = img.mean(dim=0)
 82 |     img = img / 2 + 0.5  # unnormalize
 83 |     npimg = img.cpu().numpy()
 84 |     if one_channel:
 85 |         plt.imshow(npimg, cmap="Greys")
 86 |     else:
 87 |         plt.imshow(np.transpose(npimg, (1, 2, 0)))
 88 | 
 89 | 
 90 | ## Text parsing
 91 | 
 92 | def format_labels(label, pred):
 93 |     pred = str(pred).replace('{', '').replace('}', '')
 94 | 
 95 |     if re.search('n[0-9]+-([^/]+)', str(label)) is None:
 96 |         label = re.search('''b["'][0-9]+: ["']([^\/]+)["'],["']''', str(label)).group(1)
 97 |     else:
 98 |         label = re.search('n[0-9]+-([^/]+)', str(label)).group(1)
 99 | 
100 |     if re.search('''b["'][0-9]+: ["']([^\/]+)["'],["']''', str(pred)) is None:
101 |         pred = re.search('n[0-9]+-([^/]+)', str(pred)).group(1)
102 |     else:
103 |         pred = re.search('''b["'][0-9]+: ["']([^\/]+)["'],["']''', str(pred)).group(1)
104 |     return label, pred
105 | 
106 | def is_match(label, pred):
107 |     ''' Evaluates human readable prediction against ground truth.'''
108 |     if re.search(str(label).replace('_', ' '), str(pred).replace('_', ' ')):
109 |         match = True
110 |     else:
111 |         match = False
112 |     return match
113 | 
114 | ## Pred Parsing
115 | 
116 | def images_to_probs(net, images):
117 |     '''
118 |     Generates predictions and corresponding probabilities from a trained
119 |     network and a list of images
120 |     '''
121 |     batch = net(images)
122 |     _, preds_tensor = torch.max(batch, 1)
123 |     preds = preds_tensor.cpu().numpy()
124 |     perct = [torch.nn.functional.softmax(el, dim=0)[i].item() for i, el in zip(preds,
batch)] 124 | 125 | return preds, perct 126 | 127 | def plot_classes_preds(net, images, labels, preds_tensors, perct, trainclasses): 128 | ''' 129 | Generates matplotlib Figure using a trained network, along with images 130 | and labels from a batch, that shows the network's top prediction along 131 | with its probability, alongside the actual label, coloring this 132 | information based on whether the prediction was correct or not. 133 | Uses the "images_to_probs" function. 134 | ''' 135 | preds = preds_tensors.cpu().numpy() 136 | pred_class_set = [trainclasses[i] for i in preds] 137 | lab_class_set = [trainclasses[i] for i in labels] 138 | 139 | # plot the images in the batch, along with predicted and true labels 140 | fig = plt.figure(figsize=(12, 24)) 141 | plt.subplots_adjust(wspace = 0.6) 142 | 143 | for idx in np.arange(4): 144 | raw_label = lab_class_set[idx] 145 | raw_pred = pred_class_set[idx] 146 | 147 | label, pred = format_labels(raw_label,raw_pred) 148 | 149 | ax = fig.add_subplot(2, 2, idx+1, xticks=[], yticks=[]) 150 | matplotlib_imshow(images[idx], one_channel=False) 151 | ax.set_title("{0}, {1:.1f}%\n(label: {2})".format( 152 | pred, perct[idx]*100, label), color=("green" if is_match(label, pred) else "red")) 153 | 154 | return fig 155 | 156 | 157 | from IPython.display import display, HTML 158 | gpu_links = f''' 159 | Cluster Dashboard links 160 | 165 | ''' -------------------------------------------------------------------------------- /tools/stats_cache2.tar.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/saturncloud/workshop-dask-pytorch/d867ecd0e0a49c601ca829c4fb8f1f605458581b/tools/stats_cache2.tar.gz -------------------------------------------------------------------------------- /transfer_learning_demo/05-transfer-prepro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "\n", 10 | "# Transfer Learning\n", 11 | "\n", 12 | "In this project, we will use the Stanford Dogs dataset, and starting with Resnet50, and we will use transfer learning to make it perform better at dog image identification.\n", 13 | "\n", 14 | "In order to make this work, we have a few steps to carry out:\n", 15 | "* Preprocessing our data appropriately\n", 16 | "* Applying infrastructure for parallelizing the learning process\n", 17 | "* Running the transfer learning workflow and generating evaluation data\n", 18 | "\n", 19 | "\n", 20 | "### Start and Check Cluster" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "from dask_saturn import SaturnCluster\n", 30 | "from dask.distributed import Client\n", 31 | "import s3fs\n", 32 | "import re\n", 33 | "from torchvision import transforms\n", 34 | "\n", 35 | "cluster = SaturnCluster()\n", 36 | "client = Client(cluster)\n", 37 | "client.wait_for_workers(3)\n", 38 | "client" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": null, 44 | "metadata": {}, 45 | "outputs": [], 46 | "source": [ 47 | "import torch\n", 48 | "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "***\n", 56 | "\n", 57 | "## Preprocessing Data\n", 58 | "\n", 59 | "We are using `dask-pytorch-ddp` to handle a lot of the work involved in 
training across the entire cluster. This will abstract away lots of worker management tasks, and also sets up a tidy infrastructure for managing model output, but if you're interested to learn more about this, we maintain the [codebase and documentation on Github](https://github.com/saturncloud/dask-pytorch).\n", 60 | "\n", 61 | "Because we want to load our images directly from S3, without saving them to memory (and wasting space/time!) we are going to use the `dask-pytorch-ddp` custom class inheriting from the Dataset class called `S3ImageFolder`.\n", 62 | "\n", 63 | "The preprocessing steps are quite short- we want to load images using the class we discussed above, and apply the transformation of our choosing. If you like, you can make the transformations an argument to this function and pass it in.\n" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": null, 69 | "metadata": {}, 70 | "outputs": [], 71 | "source": [ 72 | "from dask_pytorch_ddp import results, data, dispatch\n", 73 | "from torch.utils.data.sampler import SubsetRandomSampler" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": null, 79 | "metadata": {}, 80 | "outputs": [], 81 | "source": [ 82 | "def prepro_batches(bucket, prefix):\n", 83 | " '''Initialize the custom Dataset class defined above, apply transformations.'''\n", 84 | " transform = transforms.Compose([\n", 85 | " transforms.Resize(256), \n", 86 | " transforms.CenterCrop(250), \n", 87 | " transforms.ToTensor()])\n", 88 | " whole_dataset = data.S3ImageFolder(\n", 89 | " bucket, \n", 90 | " prefix, \n", 91 | " transform=transform, \n", 92 | " anon = True\n", 93 | " )\n", 94 | " return whole_dataset" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "### Optional: Checking Data Labels\n", 102 | "\n", 103 | "Because our task is transfer learning, we're going to be starting with the pretrained Resnet50 model. In order to take full advantage of the training that the model already has, we need to make sure that the label indices on our Stanford Dogs dataset match their equivalents in the Resnet50 label data. (Hint: they aren't going to match, but we'll fix it!)" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "metadata": {}, 110 | "outputs": [], 111 | "source": [ 112 | "s3 = s3fs.S3FileSystem()\n", 113 | "\n", 114 | "with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:\n", 115 | " imagenetclasses = [line.strip() for line in f.readlines()]\n", 116 | "\n", 117 | "whole_dataset = prepro_batches(bucket = \"saturn-public-data\", prefix = \"dogs/Images\")" 118 | ] 119 | }, 120 | { 121 | "cell_type": "markdown", 122 | "metadata": {}, 123 | "source": [ 124 | "Any dataset loaded in a PyTorch image folder object will have a few attributes, including `class_to_idx` which returns a dictionary of the class names and their assigned indices. Let's look at the one for our dog images." 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": null, 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "list(whole_dataset.class_to_idx.items())[0:5]" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "metadata": {}, 139 | "source": [ 140 | "So let's look at the Imagenet classes - do they match?" 
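Before we print them, here is one quick way to line the two mappings up side by side. This is just a convenience sketch, assuming the `whole_dataset` and `imagenetclasses` objects created above are still in scope:

```
for (dog_class, dog_idx), imagenet_line in zip(
        list(whole_dataset.class_to_idx.items())[0:5], imagenetclasses[0:5]):
    print(dog_idx, dog_class, '|', imagenet_line)
```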
141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "imagenetclasses[0:5]" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "Well, that's not going to work! Our model thinks 1 = goldfish while our dataset thinks 1 = Japanese Spaniel. Fortunately, this is a pretty easy fix. \n", 157 | "\n", 158 | "I've created a function called `replace_label()` that checks the labels by text with regex, so that we can be assured that we match them up correctly. This is important, because we can't assume all our dog labels are in exactly the same consecutive order in the imagenet labels." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "def replace_label(dataset_label, model_labels):\n", 168 | " label_string = re.search('n[0-9]+-([^/]+)', dataset_label).group(1)\n", 169 | " \n", 170 | " for i in model_labels:\n", 171 | " i = str(i).replace('{', '').replace('}', '')\n", 172 | " model_label_str = re.search('''b[\"'][0-9]+: [\"']([^\\/]+)[\"'],[\"']''', str(i))\n", 173 | " model_label_idx = re.search('''b[\"']([0-9]+):''', str(i)).group(1)\n", 174 | " \n", 175 | " if re.search(str(label_string).replace('_', ' '), str(model_label_str).replace('_', ' ')):\n", 176 | " return i, model_label_idx\n", 177 | " break" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "We can use this function in a couple of lines of list comprehension to create our new `class_to_idx` object. Now we have the indices assigned to match our imagenet dataset!" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "new_class_to_idx = {x: int(replace_label(x, imagenetclasses)[1]) for x in whole_dataset.classes}" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "list(new_class_to_idx.items())[0:5]" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": null, 208 | "metadata": {}, 209 | "outputs": [], 210 | "source": [ 211 | "imagenetclasses[151:156]" 212 | ] 213 | }, 214 | { 215 | "cell_type": "markdown", 216 | "metadata": {}, 217 | "source": [ 218 | "Let's also make sure our old and new datasets have the same length, so that nothing got missed." 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": {}, 225 | "outputs": [], 226 | "source": [ 227 | "len(new_class_to_idx.items()) == len(whole_dataset.class_to_idx.items())" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "***\n", 235 | "\n", 236 | "### Select Training and Evaluation Samples\n", 237 | "\n", 238 | "In order to run our training, we'll create training and evaluation sample sets to use later. These generate DataLoader objects which we can iterate over. We'll use both later to run and monitor our model's learning.\n", 239 | "\n", 240 | "Note the `multiprocessing_context` argument that we are using in the DataLoader objects - this will allow our large batch jobs to efficiently load more than one image simultaneously, and save us a lot of time." 
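One subtlety about the function below: it shuffles the indices with NumPy on whichever machine runs it, so when several Dask workers call it at once, each worker draws its own random split. If you'd rather every worker see the identical split, a minimal sketch (the wrapper name and seed value are arbitrary, chosen purely for illustration) is to pin the seed first:

```
import numpy as np

def get_splits_parallel_seeded(train_pct, data, batch_size, seed=1234, **kwargs):
    '''Sketch: fix NumPy's seed so every worker shuffles indices identically,
    then defer to get_splits_parallel as defined below.'''
    np.random.seed(seed)
    return get_splits_parallel(train_pct, data, batch_size, **kwargs)
```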
241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": {}, 247 | "outputs": [], 248 | "source": [ 249 | "def get_splits_parallel(train_pct, data, batch_size, num_workers=64):\n", 250 | " '''Select two samples of data for training and evaluation'''\n", 251 | " classes = data.classes\n", 252 | " train_size = math.floor(len(data) * train_pct)\n", 253 | " indices = list(range(len(data)))\n", 254 | " np.random.shuffle(indices)\n", 255 | " train_idx = indices[:train_size]\n", 256 | " test_idx = indices[train_size:len(data)]\n", 257 | "\n", 258 | " train_sampler = SubsetRandomSampler(train_idx)\n", 259 | " test_sampler = SubsetRandomSampler(test_idx)\n", 260 | " \n", 261 | " train_loader = torch.utils.data.DataLoader(\n", 262 | " data, \n", 263 | " sampler=train_sampler,\n", 264 | " batch_size=batch_size,\n", 265 | " num_workers=num_workers,\n", 266 | " multiprocessing_context=mp.get_context('fork'))\n", 267 | " \n", 268 | " test_loader = torch.utils.data.DataLoader(\n", 269 | " data, \n", 270 | " sampler=train_sampler, \n", 271 | " batch_size=batch_size, \n", 272 | " num_workers=num_workers, \n", 273 | " multiprocessing_context=mp.get_context('fork'))\n", 274 | " \n", 275 | " return train_loader, test_loader" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "Aside from using our custom data object, this should be very similar to other PyTorch workflows. While I am using the `S3ImageFolder` class here, you definitely don't have to in your own work. Any standard PyTorch data object type should be compatible with the Dask work we're doing next.\n", 283 | "\n", 284 | "Now, it's time for learning, in [Notebook 6a](06a-transfer-training-s3.ipynb)!\n", 285 | "\n", 286 | "\"learn\"" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [] 295 | } 296 | ], 297 | "metadata": { 298 | "kernelspec": { 299 | "display_name": "saturn (Python 3)", 300 | "language": "python", 301 | "name": "python3" 302 | }, 303 | "language_info": { 304 | "codemirror_mode": { 305 | "name": "ipython", 306 | "version": 3 307 | }, 308 | "file_extension": ".py", 309 | "mimetype": "text/x-python", 310 | "name": "python", 311 | "nbconvert_exporter": "python", 312 | "pygments_lexer": "ipython3", 313 | "version": "3.7.7" 314 | } 315 | }, 316 | "nbformat": 4, 317 | "nbformat_minor": 4 318 | } 319 | -------------------------------------------------------------------------------- /transfer_learning_demo/06a-transfer-training-s3.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "We don't need to run all of Notebook 5 again, we'll just call `setup2.py` in the next chunk to get ourselves back to the right state. 
This also includes the reindexing work from Notebook 5, and a couple of visualization functions that we'll talk about later.\n", 10 | "\n", 11 | "***\n", 12 | "**Note: This notebook assumes you have an S3 bucket where you can store your model performance statistics.** \n", 13 | "If you don't have access to an S3 bucket, but would still like to train your model and review results, please visit [Notebook 6b](06b-transfer-training-local.ipynb) and [Notebook 7](07-learning-results.ipynb) to see detailed examples of how you can do that.\n", 14 | "***\n", 15 | "\n", 16 | "## Connect to Cluster" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%run -i ../tools/setup2.py\n", 26 | "\n", 27 | "display(HTML(gpu_links))" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "import torch\n", 37 | "from tensorboardX import SummaryWriter\n", 38 | "\n", 39 | "from torch import nn, optim\n", 40 | "from torch.nn.parallel import DistributedDataParallel as DDP\n", 41 | "\n", 42 | "from torchvision import datasets, transforms, models\n", 43 | "from torch.utils.data import DataLoader\n", 44 | "from torch.utils.data.sampler import SubsetRandomSampler\n", 45 | "\n", 46 | "import torch.distributed as dist" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "client" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "We're ready to do some learning! \n", 63 | "\n", 64 | "## Model Parameters\n", 65 | "\n", 66 | "Aside from the Special Elements noted below, we can write this section essentially the same way we write any other PyTorch training loop. \n", 67 | "* Cross Entropy Loss for our loss function\n", 68 | "* SGD (Stochastic Gradient Descent) for our optimizer\n", 69 | "\n", 70 | "We have two stages in this process, as well - training and evaluation. We run the training set completely using batches of 100 before we move to the evaluation step, where we run the eval set completely also using batches of 100." 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Most of the training workflow function shown will be very familiar for users of PyTorch. However, there are a couple of elements that are different.\n", 78 | "\n", 79 | "### 1. Tensorboard Writer\n", 80 | "\n", 81 | "We're using Tensorboard to monitor the model's performance, so we'll create a SummaryWriter object in our training function, and use that to write out statistics and sample image classifications. " 82 | ] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "### 2. Model to GPU Resources\n", 89 | "\n", 90 | "```\n", 91 | "device = torch.device(0)\n", 92 | "net = models.resnet50(pretrained=True)\n", 93 | "model = net.to(device)\n", 94 | "```\n", 95 | "\n", 96 | "We need to make sure our model is assigned to a GPU resource- here we do it one time before the training loops begin. We will also assign each image and its label to a GPU resource within the training and evaluation loops.\n", 97 | "\n", 98 | "\n", 99 | "### 3. DDP Wrapper\n", 100 | "```\n", 101 | "model = DDP(model)\n", 102 | "```\n", 103 | "\n", 104 | "And finally, we need to enable the DistributedDataParallel framework. 
To do this, we are using the `DDP()` wrapper around the model, which is short for the PyTorch function `torch.nn.parallel.DistributedDataParallel`. There is a lot to know about this, but for our purposes the important thing is to understand that this allows the model training to run in parallel on our cluster. https://pytorch.org/docs/stable/notes/ddp.html\n", 105 | "\n", 106 | "\n", 107 | "\n", 108 | "> **Discussing DDP** \n", 109 | "It may be interesting for you to know what DDP is really doing under the hood: for a detailed discussion and more tips about this same workflow, you can visit our blog to read more! [https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/](https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/)\n" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "***\n", 117 | "\n", 118 | "\n", 119 | "# Training time!\n", 120 | "Our whole training process is going to be contained in one function, here named `run_transfer_learning`." 121 | ] 122 | }, 123 | { 124 | "cell_type": "markdown", 125 | "metadata": {}, 126 | "source": [ 127 | "\n", 128 | "\n", 129 | "## Modeling Functions\n", 130 | "\n", 131 | "Setting these pretty basic steps into a function just helps us ensure perfect parity between our train and evaluation steps." 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": null, 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "def iterate_model(inputs, labels, model, device):\n", 141 | " # Pass items to GPU\n", 142 | " inputs = inputs.to(device)\n", 143 | " labels = labels.to(device)\n", 144 | "\n", 145 | " # Run model iteration\n", 146 | " outputs = model(inputs)\n", 147 | "\n", 148 | " # Format results\n", 149 | " _, preds = torch.max(outputs, 1)\n", 150 | " perct = [torch.nn.functional.softmax(el, dim=0)[i].item() for i, el in zip(preds, outputs)]\n", 151 | " \n", 152 | " return inputs, labels, outputs, preds, perct" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": null, 158 | "metadata": {}, 159 | "outputs": [], 160 | "source": [ 161 | "def run_transfer_learning(bucket, prefix, train_pct, batch_size, \n", 162 | " n_epochs, base_lr, imagenetclasses, \n", 163 | " n_workers = 1, subset = False):\n", 164 | " '''Load basic Resnet50, run transfer learning over given epochs.\n", 165 | " Uses dataset from the path given as the pool from which to take the \n", 166 | " training and evaluation samples.'''\n", 167 | " \n", 168 | " worker_rank = int(dist.get_rank())\n", 169 | " \n", 170 | " # Set results writer\n", 171 | " writer = SummaryWriter(f's3://pytorchtraining/pytorch_bigbatch/learning_worker{worker_rank}')\n", 172 | " executor = ThreadPoolExecutor(max_workers=64)\n", 173 | " \n", 174 | " # --------- Format model and params --------- #\n", 175 | " device = torch.device(\"cuda\")\n", 176 | " net = models.resnet50(pretrained=True) # True means we start with the imagenet version\n", 177 | " model = net.to(device)\n", 178 | " model = DDP(model)\n", 179 | " \n", 180 | " criterion = nn.CrossEntropyLoss().cuda() \n", 181 | " optimizer = optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)\n", 182 | "\n", 183 | " # --------- Retrieve data for training and eval --------- #\n", 184 | " whole_dataset = prepro_batches(bucket, prefix)\n", 185 | " new_class_to_idx = {x: int(replace_label(x, imagenetclasses)[1]) for x in whole_dataset.classes}\n", 186 | " whole_dataset.class_to_idx = 
new_class_to_idx\n", 187 | " \n", 188 | " train, val = get_splits_parallel(train_pct, whole_dataset, batch_size=batch_size, subset = subset, workers = n_workers)\n", 189 | " dataloaders = {'train' : train, 'val': val}\n", 190 | "\n", 191 | " # --------- Start iterations --------- #\n", 192 | " count = 0\n", 193 | " t_count = 0\n", 194 | " \n", 195 | " for epoch in range(n_epochs):\n", 196 | " agg_loss = []\n", 197 | " agg_loss_t = []\n", 198 | " \n", 199 | " agg_cor = []\n", 200 | " agg_cor_t = []\n", 201 | " # --------- Training section --------- # \n", 202 | " model.train() # Set model to training mode\n", 203 | " for inputs, labels in dataloaders[\"train\"]:\n", 204 | " dt = datetime.datetime.now().isoformat()\n", 205 | "\n", 206 | " inputs, labels, outputs, preds, perct = iterate_model(inputs, labels, model, device)\n", 207 | " \n", 208 | " loss = criterion(outputs, labels)\n", 209 | " correct = (preds == labels).sum().item()\n", 210 | " \n", 211 | " # zero the parameter gradients\n", 212 | " optimizer.zero_grad()\n", 213 | " loss.backward()\n", 214 | " optimizer.step()\n", 215 | " count += 1\n", 216 | " \n", 217 | " # Track statistics\n", 218 | " for param_group in optimizer.param_groups:\n", 219 | " current_lr = param_group['lr']\n", 220 | " \n", 221 | " agg_loss.append(loss.item())\n", 222 | " agg_cor.append(correct)\n", 223 | "\n", 224 | " if ((count % 25) == 0): \n", 225 | " future = executor.submit(\n", 226 | " writer.add_hparams(\n", 227 | " hparam_dict = {'lr': current_lr, 'bsize': batch_size, 'worker':worker_rank},\n", 228 | " metric_dict = {'correct': correct,'loss': loss.item()},\n", 229 | " name = 'train-iter',\n", 230 | " global_step=count)\n", 231 | " )\n", 232 | "\n", 233 | " # Save a matplotlib figure showing a small sample of actual preds for spot check\n", 234 | " # Functions used here are in setup2.py\n", 235 | " if ((count % 50) == 0):\n", 236 | " future = executor.submit(\n", 237 | " writer.add_figure(\n", 238 | " 'predictions vs. 
actuals, training',\n", 239 | " plot_classes_preds(model, inputs, labels, preds, perct, imagenetclasses),\n", 240 | " global_step=count\n", 241 | " )\n", 242 | " )\n", 243 | " \n", 244 | " # --------- Evaluation section --------- # \n", 245 | " with torch.no_grad():\n", 246 | " model.eval() # Set model to evaluation mode\n", 247 | " for inputs_t, labels_t in dataloaders[\"val\"]:\n", 248 | " dt = datetime.datetime.now().isoformat()\n", 249 | " \n", 250 | " inputs_t, labels_t, outputs_t, pred_t, perct_t = iterate_model(inputs_t, labels_t, model, device)\n", 251 | "\n", 252 | " loss_t = criterion(outputs_t, labels_t)\n", 253 | " correct_t = (pred_t == labels_t).sum().item()\n", 254 | " \n", 255 | " t_count += 1\n", 256 | "\n", 257 | " # Track statistics\n", 258 | " for param_group in optimizer.param_groups:\n", 259 | " current_lr = param_group['lr']\n", 260 | " \n", 261 | " agg_loss_t.append(loss_t.item())\n", 262 | " agg_cor_t.append(correct_t)\n", 263 | "\n", 264 | " if ((t_count % 25) == 0):\n", 265 | " future = executor.submit(\n", 266 | " writer.add_hparams(\n", 267 | " hparam_dict = {'lr': current_lr, 'bsize': batch_size, 'worker':worker_rank},\n", 268 | " metric_dict = {'correct': correct_t,'loss': loss_t.item()},\n", 269 | " name = 'eval-iter',\n", 270 | " global_step=t_count)\n", 271 | " )\n", 272 | " \n", 273 | " future = executor.submit(\n", 274 | " writer.add_hparams(\n", 275 | " hparam_dict = {'lr': current_lr, 'bsize': batch_size, 'worker':worker_rank},\n", 276 | " metric_dict = {'correct': np.mean(agg_cor),'loss': np.mean(agg_loss), \n", 277 | " 'last_correct': correct,'last_loss': loss.item()},\n", 278 | " name = 'train',\n", 279 | " global_step=epoch)\n", 280 | " )\n", 281 | "\n", 282 | " future = executor.submit(\n", 283 | " writer.add_hparams(\n", 284 | " hparam_dict = {'lr': current_lr, 'bsize': batch_size, 'worker':worker_rank},\n", 285 | " metric_dict = {'correct': np.mean(agg_cor_t),'loss': np.mean(agg_loss_t), \n", 286 | " 'last_correct': correct_t,'last_loss': loss_t.item()},\n", 287 | " name = 'eval',\n", 288 | " global_step=epoch)\n", 289 | " )\n", 290 | " \n", 291 | " pickle.dump(model.state_dict(), s3.open(f\"pytorchtraining/pytorch_bigbatch/model_epoch{epoch}_iter{count}_{dt}.pkl\",'wb'))" 292 | ] 293 | }, 294 | { 295 | "cell_type": "markdown", 296 | "metadata": {}, 297 | "source": [ 298 | "###### \n", 299 | "Now we've done all the hard work, and just need to run our function! Using `dispatch.run` from `dask-pytorch-ddp`, we pass in the transfer learning function so that it gets distributed correctly across our cluster. This creates futures and starts computing them.\n" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": null, 305 | "metadata": {}, 306 | "outputs": [], 307 | "source": [ 308 | "import math\n", 309 | "import numpy as np\n", 310 | "import multiprocessing as mp\n", 311 | "import datetime\n", 312 | "import json \n", 313 | "import pickle\n", 314 | "from concurrent.futures import ThreadPoolExecutor\n", 315 | "\n", 316 | "num_workers = 64\n", 317 | "\n", 318 | "s3 = s3fs.S3FileSystem()\n", 319 | "with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:\n", 320 | " imagenetclasses = [line.strip() for line in f.readlines()]\n", 321 | " \n", 322 | "client.restart() # Clears memory on cluster- optional but recommended." 
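# Two notes on this setup:
# - `num_workers` here is the DataLoader subprocess count used on each GPU
#   machine, not the number of Dask workers (setup2.py waits for 3 of those).
# - If you want the TensorBoard writes in `run_transfer_learning` to actually
#   run on the thread pool, note that `executor.submit` expects the callable
#   and its arguments separately, e.g.
#   executor.submit(writer.add_hparams, hparam_dict=..., metric_dict=...);
#   writing executor.submit(writer.add_hparams(...)) calls add_hparams
#   synchronously and submits its return value instead.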
323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": null, 328 | "metadata": {}, 329 | "outputs": [], 330 | "source": [ 331 | "startparams = {'n_epochs': 6, \n", 332 | " 'batch_size': 100,\n", 333 | " 'train_pct': .8,\n", 334 | " 'base_lr': 0.01,\n", 335 | " 'imagenetclasses':imagenetclasses,\n", 336 | " 'subset': True,\n", 337 | " 'n_workers': 3} #only necessary if you select subset" 338 | ] 339 | }, 340 | { 341 | "cell_type": "markdown", 342 | "metadata": {}, 343 | "source": [ 344 | "## Kick Off Job" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "### Send Tasks to Workers\n", 352 | " \n", 353 | "We talked in Notebook 2 about how we distribute tasks to the workers in our cluster, and now you get to see it firsthand. Inside the `dispatch.run()` function in `dask-pytorch-ddp`, we are actually using the `client.submit()` method to pass tasks to our workers, and collecting these as futures in a list. We can prove this by looking at the results, here named \"futures\", where we can see they are in fact all pending futures, one for each of the workers in our cluster.\n", 354 | "\n", 355 | "> *Why don't we use `.map()` in this function?* \n", 356 | "> Recall that `.map` allows the Cluster to decide where the tasks are completed - it has the ability to choose which worker is assigned any task. That means that we don't have the control we need to ensure that we have one and only one job per GPU. This could be a problem for our methodology because of the use of DDP. \n", 357 | "> Instead we use `.submit` and manually assign it to the workers by number. This way, each worker is attacking the same problem - our transfer learning problem - and pursuing a solution simultaneously. We'll have one and only one job per worker." 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "%%time \n", 367 | "futures = dispatch.run(client, run_transfer_learning, bucket = \"saturn-public-data\", prefix = \"dogs/Images\", **startparams)\n", 368 | "futures" 369 | ] 370 | }, 371 | { 372 | "cell_type": "code", 373 | "execution_count": null, 374 | "metadata": {}, 375 | "outputs": [], 376 | "source": [ 377 | "futures" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | "execution_count": null, 383 | "metadata": {}, 384 | "outputs": [], 385 | "source": [ 386 | "futures[0].result()" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "\"parallel\"\n", 394 | "\n", 395 | "Now we let our workers run for awhile. This step will take time, so you may not be able to see the full results during our workshop. See the dashboards to view the GPUs efforts as the job runs.\n", 396 | "\n", 397 | "***\n", 398 | "\n", 399 | "If you don't have access to an S3 bucket, but would still like to do model performance review, please visit [Notebook 6b](06b-transfer-training-local.ipynb) and [Notebook 7](07-learning-results.ipynb) to see detailed examples of how you can do that." 400 | ] 401 | }, 402 | { 403 | "cell_type": "markdown", 404 | "metadata": {}, 405 | "source": [ 406 | "***\n", 407 | "\n", 408 | "## Optional: Launch Tensorboard\n", 409 | "\n", 410 | "### If you save files to S3\n", 411 | "Open a terminal on your local machine, run `tensorboard --logdir=s3://[NAMEOFBUCKET]/runs`. 
Ensure that your AWS creds are in your bash profile/environment.\n", 412 | "\n", 413 | "#### Example of creds you should have\n", 414 | "export AWS_SECRET_ACCESS_KEY=`your secret key` \n", 415 | "export AWS_ACCESS_KEY_ID=`your access key id` \n", 416 | "export S3_REGION=us-east-2 `substitute your region` \n", 417 | "export S3_ENDPOINT=https://s3.us-east-2.amazonaws.com `match to your region` \n", 418 | "\n", 419 | "### If you save files locally\n", 420 | "\n", 421 | "When you are ready to start viewing the board, run this at the terminal inside Jupyter Labs:\n", 422 | "\n", 423 | "`tensorboard --logdir=runs`\n", 424 | "\n", 425 | "Then, in a terminal on your local machine, run: \n", 426 | "\n", 427 | "`ssh -L 6006:localhost:6006 -i ~/.ssh/PATHTOPRIVATEKEY SSHURLFORJUPYTER`\n", 428 | "\n", 429 | "You'll find the private key path on your local machine, and the SSH URL on the project page for this project. You can change the local port (the first 6006) if you like.\n", 430 | "\n", 431 | "At this stage, you'll likely not have any data, but the board will update itself every thirty seconds." 432 | ] 433 | }, 434 | { 435 | "cell_type": "code", 436 | "execution_count": null, 437 | "metadata": {}, 438 | "outputs": [], 439 | "source": [] 440 | } 441 | ], 442 | "metadata": { 443 | "kernelspec": { 444 | "display_name": "saturn (Python 3)", 445 | "language": "python", 446 | "name": "python3" 447 | }, 448 | "language_info": { 449 | "codemirror_mode": { 450 | "name": "ipython", 451 | "version": 3 452 | }, 453 | "file_extension": ".py", 454 | "mimetype": "text/x-python", 455 | "name": "python", 456 | "nbconvert_exporter": "python", 457 | "pygments_lexer": "ipython3", 458 | "version": "3.7.7" 459 | } 460 | }, 461 | "nbformat": 4, 462 | "nbformat_minor": 4 463 | } 464 | -------------------------------------------------------------------------------- /transfer_learning_demo/06b-transfer-training-local.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "We don't need to run all of Notebook 5 again, we'll just call `setup2.py` in the next chunk to get ourselves back to the right state. 
This also includes the reindexing work from Notebook 5, and a couple of visualization functions that we'll talk about later.\n", 10 | "\n", 11 | "***\n", 12 | "**Note: This notebook assumes you have an S3 bucket where you can store your model performance statistics.** \n", 13 | "If you don't have access to an S3 bucket, but would still like to train your model and review results, please visit [Notebook 6b](06b-transfer-training-local.ipynb) and [Notebook 7](07-learning-results.ipynb) to see detailed examples of how you can do that.\n", 14 | "***\n", 15 | "\n", 16 | "## Connect to Cluster" 17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "%run -i ../tools/setup2.py\n", 26 | "\n", 27 | "display(HTML(gpu_links))" 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "import torch\n", 37 | "from tensorboardX import SummaryWriter\n", 38 | "\n", 39 | "from torch import nn, optim\n", 40 | "from torch.nn.parallel import DistributedDataParallel as DDP\n", 41 | "\n", 42 | "from torchvision import datasets, transforms, models\n", 43 | "from torch.utils.data import DataLoader\n", 44 | "from torch.utils.data.sampler import SubsetRandomSampler\n", 45 | "\n", 46 | "import torch.distributed as dist" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "client" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "We're ready to do some learning! \n", 63 | "\n", 64 | "## Model Parameters\n", 65 | "\n", 66 | "Aside from the Special Elements noted below, we can write this section essentially the same way we write any other PyTorch training loop. \n", 67 | "* Cross Entropy Loss for our loss function\n", 68 | "* SGD (Stochastic Gradient Descent) for our optimizer\n", 69 | "\n", 70 | "We have two stages in this process, as well - training and evaluation. We run the training set completely using batches of 100 before we move to the evaluation step, where we run the eval set completely also using batches of 100." 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "Most of the training workflow function shown will be very familiar for users of PyTorch. However, there are a couple of elements that are different.\n", 78 | "\n", 79 | "### 1. DaskResultsHandler\n", 80 | "In order to use the model output handler, we need to initialize the `DaskResultsHandler` class for our experiment, from `dask-pytorch-ddp`.\n", 81 | "This object has a few important methods, including letting our model performance at each iteration be automatically documented. " 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "import uuid\n", 91 | "key = uuid.uuid4().hex\n", 92 | "\n", 93 | "rh = results.DaskResultsHandler(key)" 94 | ] 95 | }, 96 | { 97 | "cell_type": "markdown", 98 | "metadata": {}, 99 | "source": [ 100 | "### 2. Model to GPU Resources\n", 101 | "\n", 102 | "```\n", 103 | "device = torch.device(0)\n", 104 | "net = models.resnet50(pretrained=True)\n", 105 | "model = net.to(device)\n", 106 | "```\n", 107 | "\n", 108 | "We need to make sure our model is assigned to a GPU resource- here we do it one time before the training loops begin. 
We will also assign each image and its label to a GPU resource within the training and evaluation loops.\n", 109 | "\n", 110 | "\n", 111 | "### 3. DDP Wrapper\n", 112 | "```\n", 113 | "model = DDP(model)\n", 114 | "```\n", 115 | "\n", 116 | "And finally, we need to enable the DistributedDataParallel framework. To do this, we are using the `DDP()` wrapper around the model, which is short for the PyTorch function `torch.nn.parallel.DistributedDataParallel`. There is a lot to know about this, but for our purposes the important thing is to understand that this allows the model training to run in parallel on our cluster. https://pytorch.org/docs/stable/notes/ddp.html\n", 117 | "\n", 118 | "\n", 119 | "\n", 120 | "> **Discussing DDP** \n", 121 | "It may be interesting for you to know what DDP is really doing under the hood: for a detailed discussion and more tips about this same workflow, you can visit our blog to read more! [https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/](https://www.saturncloud.io/s/combining-dask-and-pytorch-for-better-faster-transfer-learning/)\n" 122 | ] 123 | }, 124 | { 125 | "cell_type": "markdown", 126 | "metadata": {}, 127 | "source": [ 128 | "***\n", 129 | "\n", 130 | "\n", 131 | "# Training time!\n", 132 | "Our whole training process is going to be contained in one function, here named `run_transfer_learning`." 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "\n", 140 | "\n", 141 | "## Modeling Functions\n", 142 | "\n", 143 | "Setting these pretty basic steps into a function just helps us ensure perfect parity between our train and evaluation steps." 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "def iterate_model(inputs, labels, model, device):\n", 153 | " # Pass items to GPU\n", 154 | " inputs = inputs.to(device)\n", 155 | " labels = labels.to(device)\n", 156 | "\n", 157 | " # Run model iteration\n", 158 | " outputs = model(inputs)\n", 159 | "\n", 160 | " # Format results\n", 161 | " _, preds = torch.max(outputs, 1)\n", 162 | " perct = [torch.nn.functional.softmax(el, dim=0)[i].item() for i, el in zip(preds, outputs)]\n", 163 | " \n", 164 | " return inputs, labels, outputs, preds, perct" 165 | ] 166 | }, 167 | { 168 | "cell_type": "code", 169 | "execution_count": null, 170 | "metadata": {}, 171 | "outputs": [], 172 | "source": [ 173 | "def run_transfer_learning(bucket, prefix, train_pct, batch_size, \n", 174 | " n_epochs, base_lr, imagenetclasses, \n", 175 | " n_workers = 1, subset = False):\n", 176 | " '''Load basic Resnet50, run transfer learning over given epochs.\n", 177 | " Uses dataset from the path given as the pool from which to take the \n", 178 | " training and evaluation samples.'''\n", 179 | " \n", 180 | " worker_rank = int(dist.get_rank())\n", 181 | " \n", 182 | " # Set results writer\n", 183 | " writer = SummaryWriter(f's3://pytorchtraining/pytorch_bigbatch/learning_worker{worker_rank}')\n", 184 | " executor = ThreadPoolExecutor(max_workers=64)\n", 185 | " \n", 186 | " # --------- Format model and params --------- #\n", 187 | " device = torch.device(\"cuda\")\n", 188 | " net = models.resnet50(pretrained=True) # True means we start with the imagenet version\n", 189 | " model = net.to(device)\n", 190 | " model = DDP(model)\n", 191 | " \n", 192 | " criterion = nn.CrossEntropyLoss().cuda() \n", 193 | " optimizer = optim.SGD(model.parameters(), lr=base_lr, 
momentum=0.9)\n", 194 | "\n", 195 | " # --------- Retrieve data for training and eval --------- #\n", 196 | " whole_dataset = prepro_batches(bucket, prefix)\n", 197 | " new_class_to_idx = {x: int(replace_label(x, imagenetclasses)[1]) for x in whole_dataset.classes}\n", 198 | " whole_dataset.class_to_idx = new_class_to_idx\n", 199 | " \n", 200 | " train, val = get_splits_parallel(train_pct, whole_dataset, batch_size=batch_size, subset = subset, workers = n_workers)\n", 201 | " dataloaders = {'train' : train, 'val': val}\n", 202 | "\n", 203 | " # --------- Start iterations --------- #\n", 204 | " count = 0\n", 205 | " t_count = 0\n", 206 | " \n", 207 | " for epoch in range(n_epochs):\n", 208 | " agg_loss = []\n", 209 | " agg_loss_t = []\n", 210 | " \n", 211 | " agg_cor = []\n", 212 | " agg_cor_t = []\n", 213 | " # --------- Training section --------- # \n", 214 | " model.train() # Set model to training mode\n", 215 | " for inputs, labels in dataloaders[\"train\"]:\n", 216 | " dt = datetime.datetime.now().isoformat()\n", 217 | "\n", 218 | " inputs, labels, outputs, preds, perct = iterate_model(inputs, labels, model, device)\n", 219 | " \n", 220 | " loss = criterion(outputs, labels)\n", 221 | " correct = (preds == labels).sum().item()\n", 222 | " \n", 223 | " # zero the parameter gradients\n", 224 | " optimizer.zero_grad()\n", 225 | " loss.backward()\n", 226 | " optimizer.step()\n", 227 | " count += 1\n", 228 | " \n", 229 | " # Track statistics\n", 230 | " for param_group in optimizer.param_groups:\n", 231 | " current_lr = param_group['lr']\n", 232 | " \n", 233 | " # Record the results of this model iteration (training sample) for later review.\n", 234 | " rh.submit_result(\n", 235 | " f\"worker/{worker_rank}/data-{dt}.json\", \n", 236 | " json.dumps({\n", 237 | " 'loss': loss.item(),\n", 238 | " 'learning_rate':current_lr, \n", 239 | " 'correct':correct, \n", 240 | " 'epoch': epoch, \n", 241 | " 'count': count, \n", 242 | " 'worker': worker_rank, \n", 243 | " 'sample': 'train'\n", 244 | " })\n", 245 | " )\n", 246 | " \n", 247 | " # --------- Evaluation section --------- # \n", 248 | " with torch.no_grad():\n", 249 | " model.eval() # Set model to evaluation mode\n", 250 | " for inputs_t, labels_t in dataloaders[\"val\"]:\n", 251 | " dt = datetime.datetime.now().isoformat()\n", 252 | " \n", 253 | " inputs_t, labels_t, outputs_t, pred_t, perct_t = iterate_model(inputs_t, labels_t, model, device)\n", 254 | "\n", 255 | " loss_t = criterion(outputs_t, labels_t)\n", 256 | " correct_t = (pred_t == labels_t).sum().item()\n", 257 | " \n", 258 | " t_count += 1\n", 259 | "\n", 260 | " # Track statistics\n", 261 | " for param_group in optimizer.param_groups:\n", 262 | " current_lr = param_group['lr']\n", 263 | " \n", 264 | " # Record the results of this model iteration (evaluation sample) for later review.\n", 265 | " rh.submit_result(\n", 266 | " f\"worker/{worker_rank}/data-{dt}.json\", \n", 267 | " json.dumps({\n", 268 | " 'loss': loss_t.item(),\n", 269 | " 'learning_rate':current_lr, \n", 270 | " 'correct':correct_t, \n", 271 | " 'epoch': epoch, \n", 272 | " 'count': t_count, \n", 273 | " 'worker': worker_rank, \n", 274 | " 'sample': 'eval'\n", 275 | " })\n", 276 | " )\n", 277 | " if worker_rank == 0:\n", 278 | " rh.submit_result(f\"checkpoint-{dt}.pkl\", pickle.dumps(model.state_dict()))" 279 | ] 280 | }, 281 | { 282 | "cell_type": "markdown", 283 | "metadata": {}, 284 | "source": [ 285 | "###### \n", 286 | "Now we've done all the hard work, and just need to run our function! 
Using `dispatch.run` from `dask-pytorch-ddp`, we pass in the transfer learning function so that it gets distributed correctly across our cluster. This creates futures and starts computing them.\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": {}, 293 | "outputs": [], 294 | "source": [ 295 | "import math\n", 296 | "import numpy as np\n", 297 | "import multiprocessing as mp\n", 298 | "import datetime\n", 299 | "import json \n", 300 | "import pickle\n", 301 | "from concurrent.futures import ThreadPoolExecutor\n", 302 | "\n", 303 | "num_workers = 64\n", 304 | "\n", 305 | "s3 = s3fs.S3FileSystem()\n", 306 | "with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:\n", 307 | " imagenetclasses = [line.strip() for line in f.readlines()]\n", 308 | " \n", 309 | "client.restart() # Clears memory on cluster- optional but recommended." 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": {}, 316 | "outputs": [], 317 | "source": [ 318 | "startparams = {'n_epochs': 6, \n", 319 | " 'batch_size': 100,\n", 320 | " 'train_pct': .8,\n", 321 | " 'base_lr': 0.01,\n", 322 | " 'imagenetclasses':imagenetclasses,\n", 323 | " 'subset': True,\n", 324 | " 'n_workers': 3} #only necessary if you select subset" 325 | ] 326 | }, 327 | { 328 | "cell_type": "markdown", 329 | "metadata": {}, 330 | "source": [ 331 | "## Kick Off Job" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "metadata": {}, 337 | "source": [ 338 | "### Send Tasks to Workers\n", 339 | " \n", 340 | "We talked in Notebook 2 about how we distribute tasks to the workers in our cluster, and now you get to see it firsthand. Inside the `dispatch.run()` function in `dask-pytorch-ddp`, we are actually using the `client.submit()` method to pass tasks to our workers, and collecting these as futures in a list. We can prove this by looking at the results, here named \"futures\", where we can see they are in fact all pending futures, one for each of the workers in our cluster.\n", 341 | "\n", 342 | "> *Why don't we use `.map()` in this function?* \n", 343 | "> Recall that `.map` allows the Cluster to decide where the tasks are completed - it has the ability to choose which worker is assigned any task. That means that we don't have the control we need to ensure that we have one and only one job per GPU. This could be a problem for our methodology because of the use of DDP. \n", 344 | "> Instead we use `.submit` and manually assign it to the workers by number. This way, each worker is attacking the same problem - our transfer learning problem - and pursuing a solution simultaneously. We'll have one and only one job per worker." 
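To make that concrete, here is a rough sketch of the pattern; this is illustrative only, not the actual `dask-pytorch-ddp` source, and `train_fn` is a hypothetical stand-in for a function like our `run_transfer_learning`:

```
# Pin exactly one task to each Dask worker, addressed explicitly.
worker_addresses = list(client.scheduler_info()['workers'])

futures = [
    client.submit(train_fn, rank, workers=[addr], allow_other_workers=False)
    for rank, addr in enumerate(worker_addresses)
]
```

Because each call carries a distinct `rank` argument, Dask treats the submissions as separate tasks and schedules one per worker instead of deduplicating them.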
345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": {}, 351 | "outputs": [], 352 | "source": [ 353 | "%%time \n", 354 | "futures = dispatch.run(client, run_transfer_learning, bucket = \"saturn-public-data\", prefix = \"dogs/Images\", **startparams)\n", 355 | "futures" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": null, 361 | "metadata": {}, 362 | "outputs": [], 363 | "source": [ 364 | "futures" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": {}, 371 | "outputs": [], 372 | "source": [ 373 | "#futures[0].result()" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "\"parallel\"\n", 381 | "\n", 382 | "Now we let our workers run for awhile. This step will take time, so you may not be able to see the full results during our workshop. See the dashboards to view the GPUs efforts as the job runs.\n" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "### Retrieve Results\n", 390 | "\n", 391 | "This step is where we gather up and save the results. While the cluster is working away at the computation, we can run the `process_results()` method on the DaskResultsHandler. This will be us requesting the results of each future as they run. To see partial results coming in, you should have the `workshop_results` folder in the folder menu a few moments after you run the next two chunks. Look in this folder to see the results each worker is returning to us." 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": null, 397 | "metadata": {}, 398 | "outputs": [], 399 | "source": [ 400 | "!rm -rf /home/jovyan/project/workshop-dask-pytorch/workshop_results" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": {}, 407 | "outputs": [], 408 | "source": [ 409 | "%%time\n", 410 | "\n", 411 | "rh.process_results(\"/home/jovyan/project/workshop-dask-pytorch/workshop_results\", futures, raise_errors=False)" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "This task will continue to hold up your Jupyter instance until it has been able to collect all the results." 419 | ] 420 | }, 421 | { 422 | "cell_type": "markdown", 423 | "metadata": {}, 424 | "source": [ 425 | "## Proof of Results\n", 426 | "\n", 427 | "We don't have the time today to run an assortment of different cluster sizes to see what works best, but I happen to have the results of those runs saved and visualized, to demonstrate how well it works! 
[Follow me to Notebook 7!](07-learning-results.ipynb)" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [] 436 | } 437 | ], 438 | "metadata": { 439 | "kernelspec": { 440 | "display_name": "saturn (Python 3)", 441 | "language": "python", 442 | "name": "python3" 443 | }, 444 | "language_info": { 445 | "codemirror_mode": { 446 | "name": "ipython", 447 | "version": 3 448 | }, 449 | "file_extension": ".py", 450 | "mimetype": "text/x-python", 451 | "name": "python", 452 | "nbconvert_exporter": "python", 453 | "pygments_lexer": "ipython3", 454 | "version": "3.7.7" 455 | } 456 | }, 457 | "nbformat": 4, 458 | "nbformat_minor": 4 459 | } 460 | -------------------------------------------------------------------------------- /transfer_learning_demo/07-learning-results.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "\n", 9 | "# Monitoring Model Learning Performance\n", 10 | "\n", 11 | "Let's take a look at the results we get from running this exact workflow on a few different cluster sizes. You've been given the statistics results from real job runs in the repo. \n", 12 | "\n", 13 | "Run the next chunk, which is a bash command. This will decompress the statistics into your directory." 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "%%sh\n", 23 | "cd ~/project/workshop-dask-pytorch/\n", 24 | "gzip -d < ~/project/workshop-dask-pytorch/tools/stats_cache2.tar.gz | tar xvf - > /dev/null" 25 | ] 26 | }, 27 | { 28 | "cell_type": "code", 29 | "execution_count": null, 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from os.path import dirname, join\n", 34 | "import pandas as pd\n", 35 | "import os\n", 36 | "import typing\n", 37 | "import json\n", 38 | "from plotnine import *\n", 39 | "import plotnine\n", 40 | "import dateutil.parser\n", 41 | "import pandas as pd\n", 42 | "\n", 43 | "def parse_results(root):\n", 44 | " workers_dir = join(root, 'worker')\n", 45 | " workers = [int(x) for x in os.listdir(workers_dir)]\n", 46 | " data = []\n", 47 | " for w in workers:\n", 48 | " worker_dir = join(root, 'worker', str(w))\n", 49 | " worker_files = sorted(os.listdir(worker_dir))\n", 50 | " for idx, file in enumerate(worker_files):\n", 51 | " date_str = file.split('data-')[-1].split('.')[0]\n", 52 | " fpath = join(worker_dir, file)\n", 53 | " d = dict(\n", 54 | " count=idx,\n", 55 | " )\n", 56 | " with open(fpath) as f:\n", 57 | " d.update(json.load(f))\n", 58 | " data.append(d)\n", 59 | " return data" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": {}, 66 | "outputs": [], 67 | "source": [ 68 | "def process_run(dictinput):\n", 69 | " path, rtype, compute, size, lr, workers = dictinput\n", 70 | " df = pd.DataFrame(parse_results(path))\n", 71 | " cleaned = df[['count', 'loss', 'correct', 'sample']].groupby(['count', 'sample']).agg({'loss': ['mean', 'min', 'max'],'correct': ['mean', 'min', 'max']}).reset_index()\n", 72 | " cleaned['type'], cleaned['compute'], cleaned['size'], cleaned['lr'], cleaned['workers'] = [rtype,compute, size, lr, workers]\n", 73 | " return cleaned\n", 74 | "\n", 75 | "def process_run_epochs(dictinput):\n", 76 | " path, rtype, compute, size, lr, workers = dictinput\n", 77 | " df = 
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "def process_run(dictinput):\n",
     "    # Aggregate per-iteration stats across workers for one run.\n",
     "    path, rtype, compute, size, lr, workers = dictinput\n",
     "    df = pd.DataFrame(parse_results(path))\n",
     "    cleaned = df[['count', 'loss', 'correct', 'sample']].groupby(['count', 'sample']).agg({'loss': ['mean', 'min', 'max'], 'correct': ['mean', 'min', 'max']}).reset_index()\n",
     "    cleaned['type'], cleaned['compute'], cleaned['size'], cleaned['lr'], cleaned['workers'] = [rtype, compute, size, lr, workers]\n",
     "    return cleaned\n",
     "\n",
     "def process_run_epochs(dictinput):\n",
     "    # Same aggregation, but grouped by epoch instead of iteration.\n",
     "    path, rtype, compute, size, lr, workers = dictinput\n",
     "    df = pd.DataFrame(parse_results(path))\n",
     "    cleaned = df[['epoch', 'loss', 'correct', 'sample']].groupby(['epoch', 'sample']).agg({'loss': ['mean', 'min', 'max'], 'correct': ['mean', 'min', 'max']}).reset_index()\n",
     "    cleaned['type'], cleaned['compute'], cleaned['size'], cleaned['lr'], cleaned['workers'] = [rtype, compute, size, lr, workers]\n",
     "    return cleaned\n",
     "\n",
     "# Each entry: path, run label, compute style, sample size, learning-rate label, worker count\n",
     "looplist = [[\"../stats/parallel/pt8_4wk\", \"parallel-4worker\", \"parallel\", 100, 'adaptive_01', 4],\n",
     "            [\"../stats/parallel/pt8_10wk\", \"parallel-10worker\", \"parallel\", 100, 'adaptive_01', 10],\n",
     "            [\"../stats/parallel/pt8_7wk\", \"parallel-7worker\", \"parallel\", 100, 'adaptive_01', 7],\n",
     "            [\"../stats/singlenode/pt8\", \"single\", \"single\", 100, 'adaptive_01', 1],\n",
     "           ]\n",
     "\n",
     "results = list(map(process_run, looplist))\n",
     "e_results = list(map(process_run_epochs, looplist))\n",
     "\n",
     "test4 = pd.concat(results, axis=0)\n",
     "etest4 = pd.concat(e_results, axis=0)\n",
     "\n",
     "# Flatten the MultiIndex columns: ('loss', 'mean') becomes 'lossmean', etc.\n",
     "etest4.columns = [''.join(col).strip() for col in etest4.columns.values]\n",
     "test4.columns = [''.join(col).strip() for col in test4.columns.values]\n",
     "\n",
     "plotnine.options.figure_size = (11, 4)\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "test6 = test4.query('sample == \"train\"')\n",
     "(ggplot(test6, aes(x='count', y='correctmean', color='factor(workers)', group='type'))\n",
     " + facet_wrap(facets='size', ncol=3, labeller='label_both')\n",
     " + theme_bw()\n",
     " + geom_line()\n",
     " + xlim(0, 825)\n",
     " + labs(title='Correct Predictions: Training', x='Iterations', y='Mean Correct Preds/Batch (Max 100)', color='Workers'))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "(ggplot(test6, aes(x='count', y='lossmean', color='factor(workers)', group='type'))\n",
     " + facet_wrap(facets='size', ncol=3, labeller='label_both')\n",
     " + theme_bw()\n",
     " + geom_line()\n",
     " + xlim(0, 825)\n",
     " + labs(title='Loss Reduction: Training', x='Iterations', y='Loss', color='Workers'))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "test6 = test4.query('sample == \"eval\"')\n",
     "(ggplot(test6, aes(x='count', y='correctmean', color='factor(workers)', group='type'))\n",
     " + facet_wrap(facets='size', ncol=3, labeller='label_both')\n",
     " + theme_bw()\n",
     " + geom_line()\n",
     " + xlim(0, 825)\n",
     " + labs(title='Correct Predictions: Evaluation', x='Iterations', y='Mean Correct Preds/Batch (Max 100)', color='Workers'))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "(ggplot(test6, aes(x='count', y='lossmean', color='factor(workers)', group='type'))\n",
     " + facet_wrap(facets='size', ncol=3, labeller='label_both')\n",
     " + theme_bw()\n",
     " + geom_line()\n",
     " + xlim(0, 825)\n",
     " + labs(title='Loss Reduction: Evaluation', x='Iterations', y='Loss', color='Workers'))"
    ]
   },
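   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Plots show the trends, but a small table can make the comparison concrete. The next chunk is an optional sketch built only on `test4` as constructed above (column names come from the flattening step); the iteration cutoff of 700 is an arbitrary stand-in for \"late in training\", not a value from the original runs."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
     "# Optional: mean training accuracy late in training (count > 700),\n",
     "# by number of workers. The cutoff is an arbitrary illustration.\n",
     "(test4.query('sample == \"train\" and count > 700')\n",
     "      .groupby('workers')['correctmean']\n",
     "      .mean()\n",
     "      .round(1))"
    ]
   },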
"metadata": {}, 162 | "outputs": [], 163 | "source": [ 164 | "(ggplot(test4, aes(x='count', y='lossmean', color = \"factor(type)\", group = 'type'))\n", 165 | " + facet_grid('workers~lr+sample')\n", 166 | " + theme_bw()\n", 167 | " + geom_line()\n", 168 | " + xlim(0, 825)\n", 169 | " + ylim(0, 13)\n", 170 | " + labs(title = 'Loss Reduction', x=\"Iterations\", y=\"Loss\", color = \"Run Type\"))" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "(ggplot(test4, aes(x='count', y='correctmean', color = \"factor(type)\", group = 'type'))\n", 180 | " + facet_grid('workers~lr+sample')\n", 181 | " + theme_bw()\n", 182 | " + geom_line()\n", 183 | " + xlim(0, 825)\n", 184 | " + labs(title = 'Correct Predictions', x=\"Iterations\", y=\"Correct (Max 100)\", color = \"Run Type\"))" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "test4" 194 | ] 195 | } 196 | ], 197 | "metadata": { 198 | "kernelspec": { 199 | "display_name": "saturn (Python 3)", 200 | "language": "python", 201 | "name": "python3" 202 | }, 203 | "language_info": { 204 | "codemirror_mode": { 205 | "name": "ipython", 206 | "version": 3 207 | }, 208 | "file_extension": ".py", 209 | "mimetype": "text/x-python", 210 | "name": "python", 211 | "nbconvert_exporter": "python", 212 | "pygments_lexer": "ipython3", 213 | "version": "3.7.7" 214 | } 215 | }, 216 | "nbformat": 4, 217 | "nbformat_minor": 4 218 | } 219 | --------------------------------------------------------------------------------