├── QCon_AI_Docker.png
├── README.md
├── getting_started
│   └── README.md
├── inference
│   ├── Dockerfile
│   ├── README.md
│   └── api.py
├── managing_and_scaling
│   └── README.md
└── model_training
    ├── Dockerfile
    ├── README.md
    ├── data
    │   └── iris.csv
    └── train.py

/QCon_AI_Docker.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dwhitena/qcon-ai-docker-workshop/9e83684780301978df32f802258ce601476065ae/QCon_AI_Docker.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![Alt text](QCon_AI_Docker.png)

# Docker-izing your Data Science Apps

This repo includes all of the materials and documentation that you will need for the "Docker-izing your Data Science Apps" CodeLab at [QCon AI 2018](https://qcon.ai/). In this CodeLab, you will learn how to put your data science applications into Docker images and run those images as containers on any infrastructure. These skills will help you maintain reproducibility and increase efficiency as you deploy your applications, and they will help you standardize your code to better fit into modern infrastructures, CI/CD tools, and DevOps practices.

The CodeLab is designed such that it can be completed independently. To start, read the [Prerequisites](#prerequisites) section and then follow the first link below (`1. Why Docker...`). If you have any difficulties or questions, the instructor [Daniel Whitenack](https://twitter.com/dwhitena) will be holding office hours at QCon AI on [Tuesday, April 10th](https://qcon.ai/schedule/qconai2018/tabular).

## Agenda

1. [Why Docker, getting started with Docker](getting_started)
2. [Docker-izing model training](model_training)
3. [Docker-izing inference, services](inference)
4. [Managing and scaling Docker-ized data science apps](managing_and_scaling)

## Prerequisites

- To complete this CodeLab, you will need to have:
  - A laptop/desktop capable of running Docker (see requirements [here](https://docs.docker.com/install/))
  - A Unix-like terminal (ideally the [WSL on Windows](https://docs.microsoft.com/en-us/windows/wsl/install-win10), although similar operations could be accomplished using the Windows command prompt)
  - A connection to the Internet

- If you are new to the command line or need a refresher, look through [this quick tutorial](https://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything).
--------------------------------------------------------------------------------
/getting_started/README.md:
--------------------------------------------------------------------------------
# Getting started with Docker

Ideally, we should be creating ML applications that produce predictable behavior, regardless of where they are deployed. [Docker](https://www.docker.com/) can be utilized to accomplish this goal.

The sections below help you understand why Docker is useful in this context and define some of the jargon associated with Docker. They will also walk you through installing and interacting with Docker:

1. [Why Docker?](README.md#1-why-docker)
2. [Docker Jargon](README.md#2-docker-jargon)
3. [Installing Docker](README.md#3-installing-docker)
4. [Interacting with Docker](README.md#4-interacting-with-docker)
Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Why Docker?

Ok, let's say that we have some code for model training, inference, pre-processing, post-processing, etc., and we need to:

- scale this code up to larger data sets,
- run it automatically at certain times or based on certain events,
- share it with teammates so they can generate their own results, or
- connect it to other code running in our company's infrastructure.

You aren't going to be able to do these things if your code only lives on your laptop and if you have to run it manually in your own environment. You need to *deploy* this code to some other computing resources and/or share it such that others can run it just like you. This other environment could be one or more cloud instances or an on-premise cluster of compute nodes.

How can we do this with a high degree of reproducibility and operational/computational efficiency? And how can we ensure that our engineering team doesn't hate the data science team because they always have to deploy data science things in a "special" way with "special" data science tools?

Well, some of you might be thinking that Virtual Machines (or VMs) are the answer to this problem. To some degree you are correct. VMs were developed to solve some of these issues. However, many have moved on from VMs because they create quite a few pain points:

- They generally consume a fixed set of resources. This makes it hard to take advantage of computational resources in an optimized way. Most of the time VMs aren't using all of the resources allocated to them, but we have partitioned those resources off from other processes. This is wasteful.
- Most of the time they are pretty big. Porting around a 10GB VM image isn't exactly fun, and I wouldn't consider it incredibly "portable."
- If you are running applications in the cloud, you can run into all sorts of weirdness if you try to run VMs inside of VMs (which is what cloud instances actually are).

Docker solves many of these issues and even has additional benefits because it leverages *software containers* as its primary way of encapsulating applications. Containers existed before Docker, but Docker has made containers extremely easy to use and accessible. Thus, many people simply associate software containers with Docker containers. When working with Docker containers, you might see some similarities to VMs, but they are quite different:

![Alt text](https://blog.netapp.com/wp-content/uploads/2016/03/Screen_Shot_2016-03-11_at_9.14.20_PM1.png)

As you can see, Docker containers have the following unique properties, which make them extremely useful:

- They don't include an entire guest OS. They just include your application and the associated libraries, file system, etc. This makes them much smaller than VMs (some of my Docker containers are just a few MB). This also makes spinning up and tearing down containers extremely quick.
- They share an underlying host kernel and resources. You can spin up 10s or 100s of Docker containers on a single machine. They will all share the underlying resources, such that you can efficiently utilize all of the resources on a node (rather than statically carving out resources per process).

This is why Docker containers have become so dominant in the infrastructure world. Data scientists and AI researchers are also latching on to containers because they can:

- Docker-ize an application quickly, hand it off to an engineering organization, and have them run it in a manner similar to any other application.
- Experiment with a huge number of tools (TensorFlow, PyTorch, Spark, etc.) without having to install anything other than Docker (see the sketch after this list).
- Manage a diverse set of data pipeline stages in a unified way.
- Leverage the huge number of excellent infrastructure projects for containers (e.g., those powering Google-scale work) to create applications that auto-scale, self-heal, are fault tolerant, etc.
- Easily define and reproduce environments for experimentation.
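To give you a quick taste of that second point, here is a sketch (assuming Docker is installed, which we cover below) of dropping into a Python shell with TensorFlow already available, without installing TensorFlow itself:

```sh
# start an interactive Python shell inside the official TensorFlow image;
# --rm cleans up the container when you exit
$ docker run -it --rm tensorflow/tensorflow python
```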
## 2. Docker Jargon

Docker jargon can sometimes be confusing, so let's go ahead and define some key terms. Refer back to this list later in the CodeLab if you need to:

- Docker *Image* - the bundle that includes your app & dependencies
- Docker *Container* - a running instance of a Docker image
- *Docker engine* - the application that builds and runs images
- Docker *registry* - where you store, tag, and get pre-built Docker images
- *Dockerfile* - a file that tells the engine how to build a Docker image

Thus, a common workflow when building a Docker-ized application is as follows (a command-line sketch of these steps appears after the list):

1. Develop the application (as you normally would)
2. Build a Docker image for the app with Docker engine
3. Upload the image to a registry
4. Deploy a Docker container, based on the image from the registry, to a cloud instance or on-premise node
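In terms of commands, that workflow looks roughly like this (`<user>/<app>:<tag>` is a placeholder for your own image name; we will walk through each of these commands later in the CodeLab):

```sh
# 2. build a Docker image from a Dockerfile in the current directory
$ docker build -t <user>/<app>:<tag> .

# 3. upload the image to a registry (Docker Hub here)
$ docker push <user>/<app>:<tag>

# 4. run a container based on the image (on any machine with Docker installed)
$ docker run <user>/<app>:<tag>
```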
## 3. Installing Docker

Docker can be installed on Linux, Mac, or Windows. If you are using Windows, note that I will be showing Unix-style commands throughout the CodeLab. As such, you may want to run commands from the WSL or look up the Windows command prompt equivalents.

To install Docker (the community edition), follow the appropriate guide [here](https://www.docker.com/community-edition#/download). Once installed, you should be able to run the following in a terminal to get the Docker version:

```sh
$ docker version
Client:
 Version:      17.12.0-ce
 API version:  1.35
 Go version:   go1.9.2
 Git commit:   c97c6d6
 Built:        Wed Dec 27 20:03:51 2017
 OS/Arch:      darwin/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:29 2017
  OS/Arch:      linux/amd64
  Experimental: true
```

**Note** - In some cases, you may need to run `docker ...` commands with `sudo`. This should be fine for this CodeLab, but, if you want to be able to run Docker without `sudo`, you could follow [this guide](https://docs.docker.com/install/linux/linux-postinstall/).

## 4. Interacting with Docker

Once you have Docker installed, you can manage, build, and run Docker images from the command line. To see what images you have locally, you can run:

```sh
$ docker images
REPOSITORY          TAG       IMAGE ID       CREATED       SIZE
dockerized-test     latest    6fa518ef72d2   2 hours ago   607MB
ubuntu              latest    f975c5035748   9 days ago    112MB
```

You might not have any images yet, because you haven't pulled any. Let's go ahead and pull a Docker image from Docker's public registry, [Docker Hub](https://hub.docker.com/) (think of it like GitHub, but for Docker images):

```sh
$ docker pull dwhitena/minimal-jupyter
Using default tag: latest
latest: Pulling from dwhitena/minimal-jupyter
550fe1bea624: Already exists
b313ba46199e: Already exists
de349a63b77a: Already exists
3cd0781adeaa: Already exists
0cf242809b69: Pull complete
4a2fb11c3300: Pull complete
Digest: sha256:893107a7f4e27e772460aeddea0626bd1196aba9b0cc6468d3f52c47ff369e03
Status: Downloaded newer image for dwhitena/minimal-jupyter:latest
```

Now, when you list your Docker images with `docker images`, this image will show up in your local registry of images (because we pulled it from the remote registry). This `dwhitena/minimal-jupyter` image (as you might have guessed from the name) is a Docker image that includes Jupyter. Even if you don't have Jupyter, IPython, etc. installed locally, you can use Jupyter via this Docker image by running it:

```sh
$ docker run -p 8888:8888 dwhitena/minimal-jupyter
[I 20:47:17.087 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 20:47:18.403 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 20:47:18.403 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 20:47:18.421 NotebookApp] Serving notebooks from local directory: /home/jovyan/notebooks
[I 20:47:18.421 NotebookApp] 0 active kernels
[I 20:47:18.421 NotebookApp] The Jupyter Notebook is running at:
[I 20:47:18.421 NotebookApp] http://[all ip addresses on your system]:8888/
[I 20:47:18.422 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

As you can see (if you are familiar with Jupyter), this command started a Jupyter notebook server. However, the server isn't running directly on your localhost. It is running inside of a Docker container based on the `dwhitena/minimal-jupyter` Docker image. To see this, open a new terminal window and run `docker ps` to see what containers are running:

```sh
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS         PORTS                    NAMES
55803aae3021   dwhitena/minimal-jupyter   "/bin/sh -c 'jupyter…"   2 minutes ago   Up 2 minutes   0.0.0.0:8888->8888/tcp   cranky_sinoussi
```

Try visiting `localhost:8888` in your browser to see the Jupyter notebook server running in the Docker container!

There are quite a variety of options that you can specify when running your Docker containers. We specified `-p 8888:8888` above, which mapped port 8888 inside the container (where Jupyter is running) to port 8888 outside of the container. However, we could have also specified a name for the container, run the container as a daemon, changed the container's networking, and much more. For a full list and explanation of these options, see the [docker run reference docs](https://docs.docker.com/engine/reference/run/).

To stop this running container, you can run `docker rm -f <CONTAINER ID>` from the terminal where you ran `docker ps`.
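For example, using the container ID from the `docker ps` output above:

```sh
# force-remove the running container; Docker echoes the ID on success
$ docker rm -f 55803aae3021
55803aae3021
```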
Alternatively, you should be able to stop it via `CTRL+C` in the terminal where you ran the `docker run` command.

**Note** - You could recreate this process on any machine in the cloud or on-premise, as long as that machine has Docker installed. You wouldn't have to install the right version of Jupyter, IPython, etc. You just need to `docker run`. In fact, you can `docker run` TensorFlow, PyTorch, R, ggplot, Postgres, MongoDB, Spark, or whatever you want, without messing with any dependencies. That's awesome! Hopefully, you are beginning to see the power and flexibility of containers.

**Note** - If you create a new notebook, you'll notice that anything you try to import that requires a dependency (numpy or pandas, for example) won't be found. Those dependencies haven't been added to the image yet. We'll go through how to install dependencies into your Docker image in the next part of the CodeLab. Alternatively, you could use the Jupyter terminal to install dependencies.

## Resources

- [Getting started with Docker](https://docs.docker.com/get-started/)
- [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
--------------------------------------------------------------------------------
/inference/Dockerfile:
--------------------------------------------------------------------------------
FROM python

# install dependencies
RUN pip install --upgrade pip && \
    pip install numpy scipy scikit-learn flask-restful

# add our project
ADD . /

# expose the port for the API
EXPOSE 5000

# run the API
CMD [ "python", "/api.py" ]
--------------------------------------------------------------------------------
/inference/README.md:
--------------------------------------------------------------------------------
# Docker-izing Inference, Services

Now that we have our model training Docker-ized, we could run that model training on any cloud or on-premise machine to produce a trained model. In addition to that, we might need to create a service (e.g., a REST API) that would allow us (or our users) to make requests for predictions, where our service would utilize the trained model to perform and serve the predictions. This guide will walk you through how you might develop, Docker-ize, and deploy this type of service:

1. [Developing the application](README.md#1-developing-the-application)
2. [Creating a Dockerfile](README.md#2-creating-a-dockerfile)
3. [Building a Docker image](README.md#3-building-a-docker-image)
4. [Pushing the image to a registry (optional)](README.md#4-pushing-the-image-to-a-registry-optional)
5. [Running model inference as a service in a container](README.md#5-running-model-inference-as-a-service-in-a-container)

Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Developing the application

Similar to model training, I have already developed the Python code we need for our desired functionality. Specifically, the [api.py script](api.py) will spin up an API (via Flask) that will serve model predictions.

## 2. Creating a Dockerfile

A Dockerfile for `api.py` is included [here](Dockerfile):

```
FROM python

# install dependencies
RUN pip install --upgrade pip && \
    pip install numpy scipy scikit-learn flask-restful

# add our project
ADD . /
# expose the port for the API
EXPOSE 5000

# run the API
CMD [ "python", "/api.py" ]
```

This Dockerfile is a little more complicated than the one we used for model training, but some things should look familiar. Everything in the arguments of the `RUN` instruction is something that you might run locally to prepare an environment for our application. Note how I have put a couple of operations under a single `RUN` instruction. Why did I do this?

Well, remember how a Docker image is built up from "layers" that are versioned in a repository? If I put each of the `apt-get` or `pip` commands into separate `RUN` instructions, then those would build up more and more layers that would always be versioned with the Docker image. By combining them, I can perform some cleanup at the end of all the installation to get rid of cached info and other things I don't need. If such a cleanup were in a separate `RUN` instruction, it would basically have no effect on the overall size of the image (the earlier layers would still carry the cached files).

Also, you will notice two new instructions in this Dockerfile that weren't in the model training Dockerfile:

- The `EXPOSE` instruction will expose a port in the container to be accessed from other running containers. Remember, we are going to make our predictions available via a call to an API, and it so happens that this API will be running on port 5000 in our container.
- The `CMD` instruction tells Docker that we want to run the provided command whenever a container based on this image starts. The specific command here starts our prediction service using the `api.py` script. This is an alternative to manually specifying the command as we did with model training (see the sketch after this list).
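Note that `CMD` only provides a *default* command. Per standard Docker behavior, any arguments you pass after the image name at run time override the `CMD` (a sketch, with `<yourimagename>` as a placeholder):

```sh
# run the default CMD (starts the API)
$ docker run <yourimagename>

# override the CMD, e.g., to poke around inside the image instead
$ docker run -it <yourimagename> /bin/bash
```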
## 3. Building a Docker image

We can now build our Docker image the same way we did for model training:

```sh
$ docker build -t <yourimagename> .
```

For example, to build the image such that I can push it up to my Docker Hub registry:

```sh
$ docker build -t dwhitena/model-inference:v1.0.0 .
```

## 4. Pushing the image to a registry (optional)

If desired, you can also push the image to your Docker Hub registry (to version it and make it available on other systems):

```sh
$ docker push dwhitena/model-inference:v1.0.0
```

(where you would replace `dwhitena` with your Docker Hub username)

## 5. Running model inference as a service in a container

Now, before we run this Python service in a container, let's discuss briefly how we would run the code locally. This Python script is configured to be run as follows:

```sh
$ export MODEL_FILE=/path/to/this/GH/repo/model_training/data/model.pkl
$ python api.py
```

The `api.py` script looks for an environment variable `MODEL_FILE`, which should be set to the location of the serialized model that was the output of our model training container. Once that code is running, it will serve predictions on port 5000. For example, you could visit the following address in a browser (or via curl, Postman, etc.) to get a prediction response in the form of JSON (assuming you are running the code locally):

```
http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3
```

When we run the code in the container, we will need to map port 5000 inside of the container to a port outside of the container (such that we can use the service), map a volume with the `model.pkl` file into the container, and set the environment variable `MODEL_FILE` in the container to specify the model file. Thus, to start the prediction service, run:

```sh
$ docker run -v /path/to/this/GH/repo/model_training/data:/data -e MODEL_FILE='/data/model.pkl' -p 5000:5000 <yourimagename>
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
```

where you would replace `<yourimagename>` with the name of the Docker image you built above. In this command:

- `-v /path/to/this/GH/repo/model_training/data:/data` maps the absolute path to the [data directory](../model_training/data) on your local machine to `/data` inside the container (remember, we output our `model.pkl` file there).
- `-e MODEL_FILE='/data/model.pkl'` sets the `MODEL_FILE` environment variable in the container to the location of `model.pkl` in the container.
- `-p 5000:5000` maps port 5000 inside the container to port 5000 on our localhost.

With this container running, you should be able to obtain a prediction from the service by visiting `http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3` in a browser (or via curl, Postman, etc.). Try changing the `slength`, `swidth`, etc. parameters in the URL to get different predictions. Once you are done, you can remove the service via `CTRL+C`.
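For example, a sketch of querying the service with curl (the exact `species` value in the response will depend on your trained model):

```sh
$ curl 'http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3'
{"slength": 1.5, "swidth": 0.7, "plength": 1.3, "pwidth": 0.3, "species": "Iris-setosa"}
```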
As exercises, look at the `docker run` reference docs to try and figure out how to:

- run the container in the background (i.e., non-interactively or as a daemon)
- after running the container in the background, open a bash shell in the running container from another terminal
- get the logs from the running container

Now that you are a pro at running your applications in Docker, see the next section of the CodeLab for some suggestions on how to automate ingress/egress of data, how to scale containerized workloads, and more.

## Resources

- [Docker run reference docs](https://docs.docker.com/engine/reference/run/)
- [Dockerfile reference docs](https://docs.docker.com/engine/reference/builder/)
--------------------------------------------------------------------------------
/inference/api.py:
--------------------------------------------------------------------------------
import os
from flask import Flask
from flask_restful import Resource, Api
from flask_restful import reqparse
from sklearn.externals import joblib

app = Flask(__name__)
api = Api(app)

class Prediction(Resource):
    def get(self):

        # parse the four flower measurements from the query string
        parser = reqparse.RequestParser()
        parser.add_argument('slength', type=float, help='slength cannot be converted')
        parser.add_argument('swidth', type=float, help='swidth cannot be converted')
        parser.add_argument('plength', type=float, help='plength cannot be converted')
        parser.add_argument('pwidth', type=float, help='pwidth cannot be converted')
        args = parser.parse_args()

        # predict the species from the parsed features
        prediction = predict([[
            args['slength'],
            args['swidth'],
            args['plength'],
            args['pwidth']
        ]])

        # echo the inputs back along with the predicted species
        return {
            'slength': args['slength'],
            'swidth': args['swidth'],
            'plength': args['plength'],
            'pwidth': args['pwidth'],
            'species': prediction
        }

def predict(inputFeatures):
    # load the persisted model (pointed to by the MODEL_FILE environment
    # variable at startup) and predict the species for the given features
    mymodel = joblib.load(app.config.get('model_file'))
    prediction = mymodel.predict(inputFeatures)
    return prediction[0]

api.add_resource(Prediction, '/prediction')

if __name__ == '__main__':
    app.config['model_file'] = os.environ['MODEL_FILE']
    app.run(host='0.0.0.0', debug=False)
--------------------------------------------------------------------------------
/managing_and_scaling/README.md:
--------------------------------------------------------------------------------
# Managing and scaling Docker-ized data science apps

Although this CodeLab is a good chance to get your hands dirty with Docker, there are whole books and courses on containers and related systems. The following sections attempt to expose you to some related topics that are relevant to data scientists using Docker.

## Container Orchestration

Let's say that we love using Docker, and we containerize all of our things. Now we have a secondary problem of figuring out how we are going to manage all of those containers running across our infrastructure. It doesn't make much sense for data scientists and engineers to be ssh'ing into a bunch of machines and executing `docker run ...` commands all day.

Container orchestrators solve this problem. [Kubernetes](https://kubernetes.io/) is, by far, the leading container orchestration engine. With Kubernetes, you can declaratively specify that you want this many of container A, this many of container B, etc. running. Then Kubernetes will ensure that these containers are started and remain running on the underlying compute nodes (which could be cloud VMs or on-premise nodes).
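To make "declaratively specify" concrete, here is a minimal sketch of a Kubernetes manifest (the names here are hypothetical; the image is the inference image built earlier in the CodeLab) asking for three replicas of our prediction service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 3                  # Kubernetes keeps 3 copies of this container running
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
    spec:
      containers:
      - name: model-inference
        image: dwhitena/model-inference:v1.0.0
        ports:
        - containerPort: 5000  # the port our Flask API listens on
```

If a container (or even a whole node) dies, Kubernetes notices and starts a replacement to get back to the declared state.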
## Managing Data ingress/egress, pipelines, and scaling

You may have noticed, as we went through our examples, that we still had to manually get data from one container to the next. This could turn into a giant, error-prone pain. Data and code in data science workflows are always changing, and we can't rely on ourselves to manually get the right version of the right data to the right code at the right time on the right infrastructure.

[Pachyderm](http://pachyderm.io/) is a solution built on Kubernetes that solves these problems, while also allowing data scientists and AI researchers to scale their workloads (across both CPUs and GPUs). Pachyderm allows you to build data pipelines, where each stage of the data pipeline (e.g., training or inference) runs in a Docker container. Then, it will inject the data that you specify into the respective containers at the right time and in the right order. That way, you can focus on data sets and processing, and avoid spending all your time copying data and running containers.

Pachyderm also versions all of your data and processing. This is super important for both compliance and maintenance of workflows. With Pachyderm, you can always determine what data and processing contributed to a particular result, even if you have since changed the data sources and intermediate processing.

To get started with Pachyderm, take a look at their [getting started docs](http://pachyderm.readthedocs.io/en/latest/getting_started/getting_started.html) and join their [public Slack channel](http://slack.pachyderm.io/) to ask questions and get help.

## CI/CD

Often, Docker images that are used in production aren't built manually. Rather, the building (and sometimes running) of Docker images is integrated into a continuous integration/deployment pipeline (aka a CI/CD pipeline).

CI/CD pipelines and systems automate the testing, building, and deployment process. These systems are often listening to your GitHub repos. When you push new code to certain branches (e.g., dev, staging, and prod), they will automatically pull your latest code, test that code, build a Docker image for your code, and deploy that Docker image (i.e., run it) on some cloud or on-premise environment.

Common tools used for CI/CD, and which integrate with Docker, are [Jenkins](https://jenkins.io/), [Travis](https://travis-ci.org/), [CircleCI](https://circleci.com/), and [Ansible](https://www.ansible.com/), although there are many, many other choices.
--------------------------------------------------------------------------------
/model_training/Dockerfile:
--------------------------------------------------------------------------------
FROM python

# Install dependencies
RUN pip install -U numpy scipy scikit-learn pandas

# Add our code
ADD train.py /code/train.py
--------------------------------------------------------------------------------
/model_training/README.md:
--------------------------------------------------------------------------------
# Docker-izing Model Training

To get some practice building Docker images for our data science apps (aka "Docker-izing" our data science apps), we are going to take the "Hello World" of machine learning as an example: the [Iris flower classification problem](https://en.wikipedia.org/wiki/Iris_flower_data_set).

Let's say that we have two Python applications that do the following, respectively:

- Train and save a model based on the Iris data set, and
- Utilize the trained model to perform inferences on new input attributes of flowers.

First, we will Docker-ize the model training code, such that we can run it on any infrastructure with reproducible behavior. This guide will walk you through that process, which includes:

1. [Developing the application](README.md#1-developing-the-application)
2. [Creating a Dockerfile](README.md#2-creating-a-dockerfile)
3. [Building a Docker image](README.md#3-building-a-docker-image)
4. [Pushing the image to a registry (optional)](README.md#4-pushing-the-image-to-a-registry-optional)
5. [Running model training in a container](README.md#5-running-model-training-in-a-container)

Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Developing the application

Because this isn't a Python CodeLab per se, I have already developed [a Python script (train.py)](train.py) that will perform this model training for you on the Iris dataset. The model will be capable of taking in a set of 4 measurements of a flower, and it will return a predicted species of that flower.

The easiest way to continue the CodeLab with the prepared code is to clone this repo (or download the repo contents from [here](https://github.com/dwhitena/qcon-ai-docker-workshop)):

```sh
$ git clone https://github.com/dwhitena/qcon-ai-docker-workshop.git
```

Then you will be able to run the commands as presented below and modify the respective files. Of course, you are welcome to modify any of the included Python scripts to your liking (e.g., changing the modeling method).

## 2. Creating a Dockerfile

To build a Docker image (that will allow us to run `train.py` in a container), we will need to create a Dockerfile. A Dockerfile tells Docker engine how a Docker image should be built. Think about this as a kind of recipe that Docker engine uses to build the image.

A basic Dockerfile for `train.py` is included [here](Dockerfile):

```
FROM python

# Install dependencies
RUN pip install -U numpy scipy scikit-learn pandas

# Add our code
ADD train.py /code/train.py
```

The Dockerfile format includes a series of *instructions* (in all caps) paired with corresponding *arguments*. Each of the *instructions* will result in a *layer* in the Docker image. The Docker image is generated and versioned in layers, such that you can easily change your code without having to rebuild the image from scratch. This layering also allows us to build on others' work. For example, we don't have to build up an Ubuntu-like file and packaging system or install Python ourselves (although we could). We can just say `FROM python`.

If you are looking for public images (like `python` above) that you can start from, try [Docker Hub](https://hub.docker.com/). For example, if you search for "TensorFlow" in Docker Hub, you will find a `tensorflow/tensorflow` image that is maintained by the TensorFlow team (along with many other TF images maintained by other people). There are scikit-learn, caret, ggplot, PyTorch, Spark, and many other public images that will allow you to experiment and create Docker images quickly.

**Warning** - Although searching Docker Hub is a good place to start when you need a base image for a Dockerfile, not all images published to Docker Hub are operational, secure, or ideal. Just like pulling code and packages from GitHub, you should investigate who is publishing the image, when it was last updated, and if it is created in a sane manner. For example, there are a bunch of base "data science" images on Docker Hub that include a whole ecosystem of data science tooling (Jupyter, scikit-learn, TF, PyTorch, etc.). I highly recommend that you avoid these types of images (unless you just want to experiment with them locally). For the most part, they are super bloated (> 1GB in size) and hard to work with, download, port, etc. At that point, you might as well use a full VM.
**Continued Warning** - Even images published by known teams are sometimes non-ideal. For example, the "minimal" Jupyter notebook image from the Jupyter team is over 1GB in size, which is hardly minimal (the one we pulled is 72MB). Generally, a large image size, lack of documentation, a non-public Dockerfile, and a lack of recent updates are bad signs when looking for a base image.

Ok, with that soapbox out of the way, let's look at the other two layers of our Dockerfile. The second instruction (`RUN ...`) tells Docker to install numpy, scipy, scikit-learn, and pandas on top of Python. We will need these to run our training. Note, there are public images with scikit-learn, etc. already installed, but these include a bunch of other things that we don't need. As such, it makes sense for us to just start from `python` and add in the few things we need.

Finally, we need to add our code to the image. The `ADD ...` instruction tells Docker to add `train.py` at the `/code/train.py` location in the image.

Can you think of ways to make the image built from this Dockerfile smaller or more reproducible? A smaller image will download to the target machine and initiate our work faster. It will also be easier to remove and update. Also, what happens if the `python` base image is updated? How can we keep this constant at a specific version? Try to create a modified version of the Dockerfile that:

- Utilizes a specific "tagged" version of the `python` base image,
- Utilizes a smaller version of the `python` base image,
- Utilizes a different base image, and/or
- Cleans up other things in the image that aren't used.

(Hint - [here](https://github.com/pachyderm/pachyderm/blob/master/doc/examples/ml/iris/python/iris-train-python-svm/Dockerfile) is a Dockerfile that will also run our model training code but that is much smaller. One possible direction is also sketched below.)
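If you get stuck, here is a sketch of one possible direction (not the only valid answer; the `python:3.6-slim` tag and the pinned package versions are just example choices):

```
# pin the base image to a specific, smaller variant of python
FROM python:3.6-slim

# pin dependency versions for reproducibility; --no-cache-dir keeps
# pip's download cache out of the image layer
RUN pip install --no-cache-dir numpy==1.14.2 scipy==1.0.0 scikit-learn==0.19.1 pandas==0.22.0

# Add our code
ADD train.py /code/train.py
```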
## 3. Building a Docker image

Now that we have our Dockerfile, we can build our Docker image for model training. First, we need to choose a *tag* for our Docker image. For now, think of this tag as the name of the Docker image (although we will see that it has more utility later).

To build our image, run the following from [this directory](.) in the cloned version of this repo:

```sh
$ docker build -t <yourimagename> .
```

The `-t <yourimagename>` argument tells Docker to tag your image as `<yourimagename>`, and the `.` at the end tells Docker to look for the Dockerfile in this directory. Note, you can also specify, via other flags, a Dockerfile in a different directory and/or a Dockerfile named something other than Dockerfile.

For example, I can create a `model-training` image by running:

```sh
$ docker build -t model-training .
Sending build context to Docker daemon  32.77kB
Step 1/3 : FROM python
latest: Pulling from library/python
f2b6b4884fc8: Pull complete
4fb899b4df21: Pull complete
74eaa8be7221: Pull complete
2d6e98fe4040: Pull complete
414666f7554d: Pull complete
135a494fed80: Pull complete
6ca3f38fdd4d: Pull complete
d67ff15d2a78: Pull complete
Digest: sha256:c021d6c587ea435509775c3a4da58d42287f630cb4ae6e0bc97ec839d9e0da3a
Status: Downloaded newer image for python:latest
 ---> d21927554614
Step 2/3 : RUN pip install -U numpy scipy scikit-learn pandas
 ---> Running in a77a8ec01d94
Collecting numpy
  Downloading numpy-1.14.2-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
Collecting scipy
  Downloading scipy-1.0.0-cp36-cp36m-manylinux1_x86_64.whl (50.0MB)
Collecting scikit-learn
  Downloading scikit_learn-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (12.4MB)
Collecting pandas
  Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
Collecting pytz>=2011k (from pandas)
  Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
Collecting python-dateutil>=2 (from pandas)
  Downloading python_dateutil-2.7.0-py2.py3-none-any.whl (207kB)
Collecting six>=1.5 (from python-dateutil>=2->pandas)
  Downloading six-1.11.0-py2.py3-none-any.whl
Installing collected packages: numpy, scipy, scikit-learn, pytz, six, python-dateutil, pandas
Successfully installed numpy-1.14.2 pandas-0.22.0 python-dateutil-2.7.0 pytz-2018.3 scikit-learn-0.19.1 scipy-1.0.0 six-1.11.0
You are using pip version 9.0.1, however version 9.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Removing intermediate container a77a8ec01d94
 ---> 646a15f6e4df
Step 3/3 : ADD train.py /code/train.py
 ---> 3da96f640402
Successfully built 3da96f640402
Successfully tagged model-training:latest
```

You will notice in the output that Docker pulls your base image, runs your commands to install dependencies, and then adds your code. The Docker image will then be shown when you list the images in your local registry:

```sh
$ docker images
REPOSITORY                 TAG      IMAGE ID       CREATED         SIZE
model-training             latest   3da96f640402   3 minutes ago   1.21GB
dwhitena/minimal-jupyter   latest   1770383288b4   25 hours ago    203MB
python                     latest   d21927554614   3 days ago      688MB
tensorflow/tensorflow      latest   414b6e39764a   2 weeks ago     1.27GB
```

## 4. Pushing the image to a registry (optional)

One of the goals of Docker-izing our apps is to easily port them to other environments and make them available for other people to run. Thus, we can't be dependent on our laptop as the registry of our Docker images. We need to store and version our Docker images somewhere else (just like we would store and version our code somewhere like GitHub).

There are many options to choose from when thinking about where you want to store/version your images. You can also make your images public or keep them private (similar to having public/private repos on GitHub). Common choices for registries are [Docker Hub](https://hub.docker.com/), [AWS ECR](https://aws.amazon.com/ecr/), and [Google GCR](https://cloud.google.com/container-registry/).

To get some practice, create a free user account on Docker Hub. Once you have done that, use the `docker login` command to log into your Docker Hub account locally.
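A sketch of what that looks like (Docker prompts for the credentials you just created):

```sh
$ docker login
Username: <your Docker Hub username>
Password:
Login Succeeded
```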
To push our `model-training` Docker image up to the Docker Hub registry as a public Docker image (that could be pulled down and run elsewhere), we first need to re-tag our image. Remember how we "named" our image `model-training`? Well, there's actually more utility in that tag than just a human-readable name. We can tag our image with the format `<user>/<repo>:<tag>`, where:

- `<user>` specifies the registry and/or user associated with the image,
- `<repo>` specifies the common name of the image, and
- `<tag>` specifies a version of the image.

For example, because I'm `dwhitena` on Docker Hub, I could tag my image as follows:

```sh
$ docker tag model-training dwhitena/model-training:v1.0.0
```

and then push it to Docker Hub:

```sh
$ docker push dwhitena/model-training:v1.0.0
The push refers to repository [docker.io/dwhitena/model-training]
dbd827af12d1: Pushed
eddb03e225ec: Pushed
aec4f1507d85: Mounted from library/python
a4a7a3673769: Mounted from library/python
325a22db58ea: Mounted from library/python
6e1b48dc2ccc: Mounted from library/python
ff57bdb79ac8: Mounted from library/python
6e5e20cbf4a7: Mounted from library/python
86985c679800: Mounted from library/python
8fad67424c4e: Mounted from library/python
v1.0.0: digest: sha256:ca5032522813f696c76e763becec0352f4765015536c6b1ff3f64a0e02898d30 size: 2427
```

Now v1.0.0 of my Docker image is available on Docker Hub [here](https://hub.docker.com/r/dwhitena/model-training/). Try this with your username and check that the image is pushed to Docker Hub.

**Note** - If you don't utilize a `:<tag>` for your image (e.g., if I just used `dwhitena/model-training`), your image will be tagged as `latest`. This can be convenient while testing, but you should never use images tagged `latest` in production, because as you update the image you would lose the ability to revert to previous versions, run specific versions, etc.

## 5. Running model training in a container

Now, before we run this model training code in a container, let's discuss briefly how we would run the code locally. This Python script is configured to be run as follows:

```sh
$ python train.py <input directory> <output directory>
```

where `<input directory>` is the directory containing the iris training data set (included [here](data/iris.csv) in this repo) and `<output directory>` is where the Python script will save a serialized version of the trained model.

But how do we get the training data into our container? And how do we get the data out? Moreover, if our container finishes and we remove it, will we lose our data?

Well, Docker provides the ability to mount a "volume" into a container. By mounting a local volume into the container, our container will be able to read data from, and write data out to, the local filesystem. Then, once the container is deleted, we will still have the input/output data. We will use the `-v` flag with `docker run ...` to do this volume mapping.
From the root of this repo, you can run the model training in the container as follows:

```sh
$ docker run -v /path/to/this/GH/repo/model_training/data:/data <yourimagename> python /code/train.py /data /data
```

where you would replace `<yourimagename>` with the name of the Docker image you built in step 3 (`dwhitena/model-training:v1.0.0` in my case). `-v /path/to/this/GH/repo/model_training/data:/data` maps the absolute path to the [data directory](data) on your local machine to `/data` inside the container (you can use the `pwd` command to find the absolute path to that directory on your local machine). `python /code/train.py /data /data` is the command that we are running in the container to perform the training.

This should only take a second to run. Once it finishes, you should see the model output in your `data` directory:

```sh
$ ls data
iris.csv	model.pkl	model.txt
```

Yay! We successfully trained an ML model inside of a Docker container.

## Resources

- [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
- [Docker run reference](https://docs.docker.com/engine/reference/run/)
- [Docker volumes](https://docs.docker.com/storage/volumes/)
--------------------------------------------------------------------------------
/model_training/data/iris.csv:
--------------------------------------------------------------------------------
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
--------------------------------------------------------------------------------
/model_training/train.py:
--------------------------------------------------------------------------------
import pandas as pd
from sklearn import svm
from sklearn.externals import joblib
import argparse
import os

# command line arguments
parser = argparse.ArgumentParser(description='Train a model for iris classification.')
parser.add_argument('indir', type=str, help='Input directory containing the training set')
parser.add_argument('outdir', type=str, help='Output directory for the trained model')
args = parser.parse_args()

# training set column names
cols = [
    "Sepal_Length",
    "Sepal_Width",
    "Petal_Length",
    "Petal_Width",
    "Species"
]

features = [
    "Sepal_Length",
    "Sepal_Width",
    "Petal_Length",
    "Petal_Width"
]

# import the iris training set
irisDF = pd.read_csv(os.path.join(args.indir, "iris.csv"), names=cols)

# fit the model
svc = svm.SVC(kernel='linear', C=1.0).fit(irisDF[features], irisDF["Species"])

# output a text description of the model
f = open(os.path.join(args.outdir, 'model.txt'), 'w')
f.write(str(svc))
f.close()

# persist the model
joblib.dump(svc, os.path.join(args.outdir, 'model.pkl'))
--------------------------------------------------------------------------------