├── .gitignore
├── README.md
├── docker_setup.sh
├── imgs
│   ├── JupyterLabInterface.png
│   ├── JupyterLabURL.png
│   ├── chapter5notebook.png
│   └── docker_shell_run.png
└── notebooks
    ├── appendixA
    │   ├── Appendix_A.ipynb
    │   └── Appendix_A_intro.ipynb
    ├── ch05
    │   └── Chapter5.ipynb
    ├── ch06
    │   └── Chapter6.ipynb
    ├── ch07
    │   ├── Chapter7_1.ipynb
    │   └── Chapter7_2.ipynb
    ├── ch08
    │   ├── Chapter8_1.dbc
    │   ├── Chapter8_1.html
    │   └── Chapter8_1.ipynb
    ├── ch09
    │   ├── CleanCode.py
    │   ├── UnitTestExample.py
    │   ├── WoT.py
    │   └── __init__.py
    ├── ch10
    │   └── Chapter10_1.ipynb
    ├── ch11
    │   └── Chapter11.ipynb
    ├── ch12
    │   └── Chapter12.ipynb
    ├── ch13
    │   ├── Chapter_13.dbc
    │   ├── Chapter_13.html
    │   └── Chapter_13.scala
    ├── ch14
    │   └── Chapter14.ipynb
    ├── ch15
    │   ├── Chapter15_1.ipynb
    │   └── Chapter15_2.ipynb
    └── ch16
        ├── Chapter16_1.dbc
        ├── Chapter16_1.html
        ├── Chapter16_1.ipynb
        ├── Chapter16_2.dbc
        ├── Chapter16_2.html
        └── Chapter16_2.ipynb
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | pip-wheel-metadata/
24 | share/python-wheels/
25 | *.egg-info/
26 | .installed.cfg
27 | *.egg
28 | MANIFEST
29 |
30 | # PyInstaller
31 | # Usually these files are written by a python script from a template
32 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
33 | *.manifest
34 | *.spec
35 |
36 | # Installer logs
37 | pip-log.txt
38 | pip-delete-this-directory.txt
39 |
40 | # Unit test / coverage reports
41 | htmlcov/
42 | .tox/
43 | .nox/
44 | .coverage
45 | .coverage.*
46 | .cache
47 | nosetests.xml
48 | coverage.xml
49 | *.cover
50 | *.py,cover
51 | .hypothesis/
52 | .pytest_cache/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | target/
76 |
77 | # Jupyter Notebook
78 | .ipynb_checkpoints
79 |
80 | # IPython
81 | profile_default/
82 | ipython_config.py
83 |
84 | # pyenv
85 | .python-version
86 |
87 | # pipenv
88 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
89 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
90 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
91 | # install all needed dependencies.
92 | #Pipfile.lock
93 |
94 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
95 | __pypackages__/
96 |
97 | # Celery stuff
98 | celerybeat-schedule
99 | celerybeat.pid
100 |
101 | # SageMath parsed files
102 | *.sage.py
103 |
104 | # Environments
105 | .env
106 | .venv
107 | env/
108 | venv/
109 | ENV/
110 | env.bak/
111 | venv.bak/
112 |
113 | # Spyder project settings
114 | .spyderproject
115 | .spyproject
116 |
117 | # Rope project settings
118 | .ropeproject
119 |
120 | # mkdocs documentation
121 | /site
122 |
123 | # mypy
124 | .mypy_cache/
125 | .dmypy.json
126 | dmypy.json
127 |
128 | # Pyre type checker
129 | .pyre/
130 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # ML-Engineering
2 | Reference code base for ML Engineering in Action, Manning Publications
3 | Author: Ben Wilson
4 |
5 |
6 | ### About this repo
7 | This is a companion to the Manning book Machine Learning Engineering in Action.
8 | This repo contains several types of notebooks, each linked to the examples shown in the chapters of the book.
9 | The notebook format depends on the type of example being covered:
10 | - Jupyter notebooks for 'standalone Python'
11 | - PySpark Databricks archive notebooks (these can be imported into Databricks or Databricks Community Edition (free of charge))
12 | - PySpark html notebook representations (these can be loaded into any web browser for visualization)
13 | - Scala Spark Databricks notebooks (in .dbc, .html, and pure .scala formats)
14 |
15 | For the Jupyter notebooks, a pre-configured bash script at the root level of this repo pulls a Docker image and automatically starts a container from it, so that you can get started with these notebooks quickly.
16 |
17 |
18 | ### Getting Started with Docker
19 | To utilize the pre-built environment and follow along with the examples in the book with additional notes and code that wasn't included in the book, we first need Docker.
20 |
21 | There are a number of different ways to acquire Docker. Please visit their [website](https://docs.docker.com/get-docker/) for instructions on installing the desktop GUI and the engine.
22 |
23 | #### Creating the image
24 | The file [here](/docker_setup.sh) will, when executed with bash in your Linux terminal, create the container that runs the Jupyter notebooks in this repo.
25 | The script mounts this repo's notebooks directory into the Docker container as a volume (so that changes stay synchronized with your local machine), downloads the data required to execute the code in the notebooks, and
26 | installs the necessary dependencies to get Jupyter working (as well as some required libraries that are not part of the Anaconda runtime).
27 | ```text
28 | NOTE: Within the bash script is a variable named 'port' that will allow you to customize
29 | the access port that Jupyter and Docker will use to allow you to utilize a Jupyter notebook
30 | from your local web browser. If you currently have Jupyter running on your machine with the
31 | general 'default port' of 8888, this configuration utilizes 8887. Feel free to change it
32 | if there is a conflict.
33 | ```
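If you are not sure whether a port is already taken, a quick check before running the script can help. The following is a minimal sketch rather than part of this repo: the helper name `port_is_free` is hypothetical, and it assumes that a successful TCP connection to `127.0.0.1` means something is already listening on that port.

```python
import socket


def port_is_free(port, host="127.0.0.1", timeout=0.5):
    """Return True when nothing accepts TCP connections on host:port.

    Hypothetical helper for illustration; it is not used by docker_setup.sh.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 when the connection succeeds, meaning the port is taken
        return sock.connect_ex((host, port)) != 0


if __name__ == "__main__":
    # 8887 is the value of 'port' in the script; 8888 is the usual Jupyter default
    for candidate in (8887, 8888):
        state = "free" if port_is_free(candidate) else "in use"
        print(f"port {candidate}: {state}")
```

If 8887 is reported as in use, pick another value for the `port` variable in the script before running it.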
34 | Once the shell script is executed, as shown below, the container will be constructed for your use.
35 |
36 | 
37 |
38 | #### Getting Jupyter to start up
39 | At the end of the container creation process, your terminal will show a URL that you can paste
40 | into your web browser of choice, as shown below.
41 |
42 | 
43 |
44 | After copying one of these URLs (I typically stick to the localhost 127.0.0.1 one), paste it
45 | into a browser. All of the notebooks that accompany the chapters of
46 | ML Engineering in Action will be available.
47 |
48 | 
49 |
50 | Navigating into each of the chapters (the ones that have supported standalone Python JupyterLab
51 | examples; the Spark ones will not load here!) will show the notebook links. Click one
52 | to open the notebook in a new tab for reading, running, modification, and anything else
53 | you'd like to do.
54 |
55 | 
--------------------------------------------------------------------------------
/docker_setup.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | # A shell script to initialize a new docker container that will host jupyter notebooks with the conda environment.
3 | # Many thanks to Jas Bali for making this so much better than what I originally had here. -Ben
4 | set -x
5 | port=8887
6 | dataset_repo_folder="$(mktemp -d)/tmp-datasets-folder"
7 | dataset_folder="$dataset_repo_folder/datasets"
8 | final_dataset_folder="$PWD/notebooks/TCPD/datasets"
9 | mkdir -p "$dataset_repo_folder"
10 |
11 | echo "Cloning datasets into folder: $dataset_repo_folder"
12 | git clone https://github.com/alan-turing-institute/TCPD "$dataset_repo_folder"
13 |
14 | rm -rf "$final_dataset_folder"  # -f: no error on the first run, when the folder does not exist yet
15 | echo "Copying datasets from $dataset_folder into $final_dataset_folder"
16 | mkdir -p "$final_dataset_folder"
17 | cp -r "$dataset_folder/." "$final_dataset_folder/"  # copy the folder's contents (same result with GNU and BSD cp)
18 |
19 | echo "Starting Jupyter notebooks"
20 |
21 | docker run -i --name=MLEngineeringInAction \
22 | -v "$PWD/notebooks":/opt/notebooks -t \
23 | -p $port:$port continuumio/anaconda3 /bin/bash \
24 | -c "/opt/conda/bin/conda install jupyter -y --quiet && \
25 | /opt/conda/bin/conda install -c conda-forge hyperopt=0.2.5 -y --quiet && \
26 | mkdir -p /opt/notebooks && \
27 | /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks \
28 | --ip='*' --port=$port --no-browser --allow-root"
29 |
--------------------------------------------------------------------------------
/imgs/JupyterLabInterface.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BenWilson2/ML-Engineering/0fc05f4b876b26bbacc85bcb11c7c2aef517cd20/imgs/JupyterLabInterface.png
--------------------------------------------------------------------------------
/imgs/JupyterLabURL.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BenWilson2/ML-Engineering/0fc05f4b876b26bbacc85bcb11c7c2aef517cd20/imgs/JupyterLabURL.png
--------------------------------------------------------------------------------
/imgs/chapter5notebook.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BenWilson2/ML-Engineering/0fc05f4b876b26bbacc85bcb11c7c2aef517cd20/imgs/chapter5notebook.png
--------------------------------------------------------------------------------
/imgs/docker_shell_run.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BenWilson2/ML-Engineering/0fc05f4b876b26bbacc85bcb11c7c2aef517cd20/imgs/docker_shell_run.png
--------------------------------------------------------------------------------
/notebooks/ch07/Chapter7_1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "## Chapter 7, ML Engineering\n",
8 | "##### Author: Ben Wilson\n",
9 | "\n",
10 | "In this notebook, we'll be following along with the code listings shown in Chapter 7 for the stand-alone (local VM) portion. This covers section 7.1, listings 7.1 through 7.3. \n",
11 | "For the remaining listings in section 7.1, refer to the companion notebook, \"Chapter7_Local_Hyperopt_Forecasting_Notebook\", which implements hyperopt tuning of the forecasting problem on a local VM. That notebook serves as a full-implementation guide to the rest of the code listings in section 7.1."
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "from matplotlib import pyplot as plt\n",
21 | "import random\n",
22 | "from hyperopt import hp, tpe, Trials, fmin"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "### Listing 7.1 Hyperopt fundamentals: the objective function\n",
30 | "In these first 3 listings, we're going to take a look at what hyperopt is actually doing and how it's a bit different from other implementations of hyperparameter tuning. We'll be comparing how the other algorithms (Random Search and Grid Search) fare against hyperopt from an accuracy standpoint and see what comes of our results. \n",
31 | "To start off, we need a function to optimize. Listing 7.1 below is building a very simple 4th order polynomial that will, based on a value of x that is passed in, provide a loss metric in the form of a reduction factor to the 'y' value based on the submitted x. "
32 | ]
33 | },
34 | {
35 | "cell_type": "code",
36 | "execution_count": 2,
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "import numpy as np\n",
41 | "def objective_function(x):\n",
42 | " func = np.poly1d([1, -3, -88, 112, -5])\n",
43 | " return func(x) * 0.01\n"
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "For fun, let's see what this equation yields if we plot it in the space of [-100:100]"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 3,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "# Get a sorted numpy array of x values between -100 and 100 at every 0.1 increment\n",
60 | "x_values_big = np.arange(-100, 100, 0.1)\n",
61 | "\n",
62 | "# Get the y value for the x values defined above using list comprehension shorthand\n",
63 | "y_values_big = [objective_function(x) for x in x_values_big]\n",
64 | "\n",
65 | "# For those of you who prefer lambda calculus... (and more efficient execution)\n",
66 | "y_values_lambda_big = (lambda x: objective_function(x))(x_values_big)\n",
67 | "\n",
68 | "# Just to placate anyone who wonders if they do the same thing.\n",
69 | "np.testing.assert_array_equal(np.array(y_values_big), y_values_lambda_big)"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "Ok, hold on... what's with the lambda stuff? Who wants functional programming concepts and partial functions in their code base for ML? \n",
77 | "Professional ML Engineers do. \n",
78 | "That is, when it's called for. \n",
79 | "... and here's why."
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 4,
85 | "metadata": {},
86 | "outputs": [],
87 | "source": [
88 | "# let's make a few more x values here...\n",
89 | "big_x_test = np.arange(-100, 100, 0.0001)"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 5,
95 | "metadata": {},
96 | "outputs": [
97 | {
98 | "name": "stdout",
99 | "output_type": "stream",
100 | "text": [
101 | "Let's make this many: 2000000\n"
102 | ]
103 | }
104 | ],
105 | "source": [
106 | "print(\"Let's make this many: {}\".format(len(big_x_test)))"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 6,
112 | "metadata": {},
113 | "outputs": [
114 | {
115 | "name": "stdout",
116 | "output_type": "stream",
117 | "text": [
118 | "CPU times: user 1min 18s, sys: 298 ms, total: 1min 18s\n",
119 | "Wall time: 1min 18s\n"
120 | ]
121 | }
122 | ],
123 | "source": [
124 | "%time list_comp_test = [objective_function(x) for x in big_x_test]"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 7,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "name": "stdout",
134 | "output_type": "stream",
135 | "text": [
136 | "CPU times: user 38.9 ms, sys: 33.8 ms, total: 72.7 ms\n",
137 | "Wall time: 72.6 ms\n"
138 | ]
139 | }
140 | ],
141 | "source": [
142 | "%time lambda_test = (lambda x: objective_function(x))(big_x_test)"
143 | ]
144 | },
145 | {
146 | "cell_type": "markdown",
147 | "metadata": {},
148 | "source": [
149 | "That's... quite the difference. \n",
150 | "Feel free to play around with the value specified in the definition for big_x_test and see if you can melt your CPU while watching how the performance gap between the list comprehension and the lambda call changes. \n",
151 | "Note: There are many times that a list comprehension will actually out-perform lambda tasks. But in this particular case the lambda itself is not what makes the difference: it hands the entire numpy array to objective_function in a single call, so the polynomial is evaluated by numpy's vectorized, compiled code, while the list comprehension calls objective_function once per element from the Python interpreter. That is why operating on the whole numpy array is MUCH faster than traversing the values one at a time with a comprehension. These differences only really come into play when you're doing large-scale operations such as this."
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 8,
157 | "metadata": {},
158 | "outputs": [
159 | {
160 | "name": "stdout",
161 | "output_type": "stream",
162 | "text": [
163 | "CPU times: user 434 ms, sys: 70.7 ms, total: 505 ms\n",
164 | "Wall time: 502 ms\n"
165 | ]
166 | },
167 | {
168 | "data": {
169 | "text/plain": [
170 | "[]"
171 | ]
172 | },
173 | "execution_count": 8,
174 | "metadata": {},
175 | "output_type": "execute_result"
176 | },
177 | {
178 | "data": {
179 | "image/png": "\n",
180 | "text/plain": [
181 | ""
182 | ]
183 | },
184 | "metadata": {
185 | "needs_background": "light"
186 | },
187 | "output_type": "display_data"
188 | }
189 | ],
190 | "source": [
191 | "# Evaluate the wall-clock speed of plotting the list comprehension\n",
192 | "%time plt.plot(big_x_test, list_comp_test)"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 9,
198 | "metadata": {},
199 | "outputs": [
200 | {
201 | "name": "stdout",
202 | "output_type": "stream",
203 | "text": [
204 | "CPU times: user 71 ms, sys: 44 ms, total: 115 ms\n",
205 | "Wall time: 116 ms\n"
206 | ]
207 | },
208 | {
209 | "data": {
210 | "text/plain": [
211 | "[]"
212 | ]
213 | },
214 | "execution_count": 9,
215 | "metadata": {},
216 | "output_type": "execute_result"
217 | },
218 | {
219 | "data": {
220 | "image/png": "\n",
221 | "text/plain": [
222 | ""
223 | ]
224 | },
225 | "metadata": {
226 | "needs_background": "light"
227 | },
228 | "output_type": "display_data"
229 | }
230 | ],
231 | "source": [
232 | "# And evaluate the wall-clock of plotting the lambda implementation\n",
233 | "%time plt.plot(big_x_test, lambda_test)"
234 | ]
235 | },
236 | {
237 | "cell_type": "markdown",
238 | "metadata": {},
239 | "source": [
240 | ".... and for those who will state, \"But lambda and list comprehensions are lazily evaluated in Py3.x....\", let's ensure that they're materialized through a forced execution by plotting their values in pyplot. The difference in plotting time comes down to the fact that one is a numpy array and one is a list (the initial calculations that generated the collections in the timed cells above weren't lazy in the first place)."
241 | ]
242 | },
243 | {
244 | "cell_type": "code",
245 | "execution_count": 10,
246 | "metadata": {},
247 | "outputs": [
248 | {
249 | "data": {
250 | "text/plain": [
251 | "[]"
252 | ]
253 | },
254 | "execution_count": 10,
255 | "metadata": {},
256 | "output_type": "execute_result"
257 | },
258 | {
259 | "data": {
260 | "image/png": "\n",
261 | "text/plain": [
262 | ""
263 | ]
264 | },
265 | "metadata": {
266 | "needs_background": "light"
267 | },
268 | "output_type": "display_data"
269 | }
270 | ],
271 | "source": [
272 | "y_lambda = (lambda x: objective_function(x))(x_values_big)\n",
273 | "plt.plot(x_values_big, y_lambda)"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "So... that's not super helpful for seeing the nuance to this equation. The resolution of the plot doesn't allow us to see where the actual minimum value is."
281 | ]
282 | },
283 | {
284 | "cell_type": "markdown",
285 | "metadata": {},
286 | "source": [
287 | "Let's see what this equation looks like if we generate values from the function in the x range before it becomes very large in the y space, so that we can see the challenge that these different methods of searching for a global minimum will have. This will help to inform our search range for the different approaches as well."
288 | ]
289 | },
290 | {
291 | "cell_type": "code",
292 | "execution_count": 11,
293 | "metadata": {},
294 | "outputs": [
295 | {
296 | "data": {
297 | "text/plain": [
298 | "[]"
299 | ]
300 | },
301 | "execution_count": 11,
302 | "metadata": {},
303 | "output_type": "execute_result"
304 | },
305 | {
306 | "data": {
307 | "image/png": "\n",
308 | "text/plain": [
309 | ""
310 | ]
311 | },
312 | "metadata": {
313 | "needs_background": "light"
314 | },
315 | "output_type": "display_data"
316 | }
317 | ],
318 | "source": [
319 | "# Now, let's see it zoomed in to the range where we can see what's going on with minimum range of values for y...\n",
320 | "x_axis_values = np.arange(-12.0, 12.0, 0.01).tolist()\n",
321 | "y_values = (lambda x: objective_function(x))(x_axis_values)\n",
322 | "plt.plot(x_axis_values, y_values)"
323 | ]
324 | },
325 | {
326 | "cell_type": "markdown",
327 | "metadata": {},
328 | "source": [
329 | "Now, I know what you might be thinking... (it's what I'd be thinking if someone showed me this as well, likely...) \n",
330 | "\"But Ben, dude, you can just get the minimum value directly from the data.\" \n",
331 | "To which I would reply... \n",
332 | "\"Shhhh... I'm just trying to make a point here. Let's play pretend and think of this as a supervised learning problem where our vector has 17 dimensions and there's no way that our human minds can figure out how to minimize the function.\" \n",
333 | "But, to humor us both, here's the x value at the actual minimum."
334 | ]
335 | },
336 | {
337 | "cell_type": "code",
338 | "execution_count": 12,
339 | "metadata": {},
340 | "outputs": [
341 | {
342 | "data": {
343 | "text/plain": [
344 | "7.569999999999581"
345 | ]
346 | },
347 | "execution_count": 12,
348 | "metadata": {},
349 | "output_type": "execute_result"
350 | }
351 | ],
352 | "source": [
353 | "x_axis_values[np.argmin(y_values)]"
354 | ]
355 | },
356 | {
357 | "cell_type": "markdown",
358 | "metadata": {},
359 | "source": [
360 | "So, how would this look as a grid search problem?