├── QCon_AI_Docker.png
├── README.md
├── getting_started
│   └── README.md
├── inference
│   ├── Dockerfile
│   ├── README.md
│   └── api.py
├── managing_and_scaling
│   └── README.md
└── model_training
    ├── Dockerfile
    ├── README.md
    ├── data
    │   └── iris.csv
    └── train.py

/QCon_AI_Docker.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dwhitena/qcon-ai-docker-workshop/9e83684780301978df32f802258ce601476065ae/QCon_AI_Docker.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
![Alt text](QCon_AI_Docker.png)

# Docker-izing your Data Science Apps

This repo includes all of the materials and documentation that you will need for the "Docker-izing your Data Science Apps" CodeLab at [QCon AI 2018](https://qcon.ai/). In this CodeLab, you will learn how to put your data science applications into Docker images and run those images as containers on any infrastructure. These skills will help you maintain reproducibility and increase efficiency as you deploy your applications, and they will help you standardize your code to better fit into modern infrastructures, CI/CD tools, and DevOps practices.

The CodeLab is designed such that it can be completed independently. To start, read the [Prerequisites](#prerequisites) section and then follow the first link below (`1. Why Docker...`). If you have any difficulties or questions, the instructor [Daniel Whitenack](https://twitter.com/dwhitena) will be holding office hours at QCon AI on [Tuesday, April 10th](https://qcon.ai/schedule/qconai2018/tabular).

## Agenda

1. [Why Docker, getting started with Docker](getting_started)
2. [Docker-izing model training](model_training)
3. [Docker-izing inference, services](inference)
4. [Managing and scaling Docker-ized data science apps](managing_and_scaling)

## Prerequisites

- To complete this CodeLab, you will need to have:
  - A laptop/desktop capable of running Docker (see requirements [here](https://docs.docker.com/install/))
  - A Unix-like terminal (ideally the [WSL on Windows](https://docs.microsoft.com/en-us/windows/wsl/install-win10), although similar operations could be accomplished using the Windows command prompt)
  - A connection to the Internet

- If you are new to the command line or need a refresher, look through [this quick tutorial](https://lifehacker.com/5633909/who-needs-a-mouse-learn-to-use-the-command-line-for-almost-anything).
--------------------------------------------------------------------------------
/getting_started/README.md:
--------------------------------------------------------------------------------
# Getting started with Docker

Ideally, we should be creating ML applications that produce predictable behavior, regardless of where they are deployed. [Docker](https://www.docker.com/) can be utilized to accomplish this goal.

The sections below help you understand why Docker is useful in this context and define some of the jargon associated with Docker. They will also walk you through installing and interacting with Docker:

1. [Why Docker?](README.md#1-why-docker)
2. [Docker Jargon](README.md#2-docker-jargon)
3. [Installing Docker](README.md#3-installing-docker)
4. [Interacting with Docker](README.md#4-interacting-with-docker)
Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Why Docker?

Ok, let's say that we have some code for model training, inference, pre-processing, post-processing, etc., and we need to:

- scale this code up to larger data sets,
- run it automatically at certain times or based on certain events,
- share it with teammates so they can generate their own results, or
- connect it to other code running in our company's infrastructure.

You aren't going to be able to do these things if your code only lives on your laptop and if you have to run it manually in your own environment. You need to *deploy* this code to some other computing resources and/or share it such that others can run it just like you. This other environment could be one or more cloud instances or an on-premise cluster of compute nodes.

How can we do this with a high degree of reproducibility and operational/computational efficiency? And how can we ensure that our engineering team doesn't hate the data science team because they always have to deploy data science things in a "special" way with "special" data science tools?

Well, some of you might be thinking that Virtual Machines (or VMs) are the answer to this problem. To some degree you are correct. VMs were developed to solve some of these issues. However, many have moved on from VMs because they create quite a few pain points:

- They generally consume a fixed set of resources. This makes it hard to take advantage of computational resources in an optimized way. Most of the time VMs aren't using all of the resources allocated to them, but we have partitioned those resources off from other processes. This is wasteful.
- Most of the time they are pretty big. Porting around a 10GB VM image isn't exactly fun, and I wouldn't consider it incredibly "portable."
- If you are running applications in the cloud, you can run into all sorts of weirdness if you try to run VMs inside of VMs (which is what cloud instances actually are).

Docker solves many of these issues and even has additional benefits because it leverages *software containers* as its primary way of encapsulating applications. Containers existed before Docker, but Docker has made containers extremely easy to use and accessible. Thus, many people simply associate software containers with Docker containers. When working with Docker containers, you might see some similarities to VMs, but they are quite different:

![Alt text](https://blog.netapp.com/wp-content/uploads/2016/03/Screen_Shot_2016-03-11_at_9.14.20_PM1.png)

As you can see, Docker containers have the following unique properties, which make them extremely useful:

- They don't include an entire guest OS. They just include your application and the associated libraries, file system, etc. This makes them much smaller than VMs (some of my Docker containers are just a few MB). This also makes spinning up and tearing down containers extremely quick.
- They share an underlying host kernel and resources. You can spin up 10s or 100s of Docker containers on a single machine. They will all share the underlying resources, such that you can efficiently utilize all of the resources on a node (rather than statically carving out resources per process).

This is why Docker containers have become so dominant in the infrastructure world. Data scientists and AI researchers are also latching on to containers because they can:

- Docker-ize an application quickly, hand it off to an engineering organization, and have them run it in a manner similar to any other application.
- Experiment with a huge number of tools (TensorFlow, PyTorch, Spark, etc.) without having to install anything other than Docker (see the sketch after this list).
- Manage a diverse set of data pipeline stages in a unified way.
- Leverage the huge number of excellent infrastructure projects for containers (e.g., those powering Google-scale work) to create applications that auto-scale, self-heal, are fault tolerant, etc.
- Easily define and reproduce environments for experimentation.
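To give you a quick taste of that second point, here is a sketch (assuming Docker is installed, which we cover below) of dropping into a Python shell with TensorFlow already available, without installing TensorFlow itself:

```sh
# start an interactive Python shell inside the official TensorFlow image;
# --rm cleans up the container when you exit
$ docker run -it --rm tensorflow/tensorflow python
```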
## 2. Docker Jargon

Docker jargon can sometimes be confusing, so let's go ahead and define some key terms. Refer back to this list later in the CodeLab if you need to:

- Docker *Image* - the bundle that includes your app & dependencies
- Docker *Container* - a running instance of a Docker image
- *Docker engine* - the application that builds and runs images
- Docker *registry* - where you store, tag, and get pre-built Docker images
- *Dockerfile* - a file that tells the engine how to build a Docker image

Thus, a common workflow when building a Docker-ized application is as follows (a command-line sketch of these steps appears after the list):

1. Develop the application (as you normally would)
2. Build a Docker image for the app with Docker engine
3. Upload the image to a registry
4. Deploy a Docker container, based on the image from the registry, to a cloud instance or on-premise node
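In terms of commands, that workflow looks roughly like this (`<user>/<app>:<tag>` is a placeholder for your own image name; we will walk through each of these commands later in the CodeLab):

```sh
# 2. build a Docker image from a Dockerfile in the current directory
$ docker build -t <user>/<app>:<tag> .

# 3. upload the image to a registry (Docker Hub here)
$ docker push <user>/<app>:<tag>

# 4. run a container based on the image (on any machine with Docker installed)
$ docker run <user>/<app>:<tag>
```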
## 3. Installing Docker

Docker can be installed on Linux, Mac, or Windows. If you are using Windows, note that I will be showing Unix-style commands throughout the CodeLab. As such, you may want to run commands from the WSL or look up the Windows command prompt equivalents.

To install Docker (the community edition), follow the appropriate guide [here](https://www.docker.com/community-edition#/download). Once installed, you should be able to run the following in a terminal to get the Docker version:

```sh
$ docker version
Client:
 Version:      17.12.0-ce
 API version:  1.35
 Go version:   go1.9.2
 Git commit:   c97c6d6
 Built:        Wed Dec 27 20:03:51 2017
 OS/Arch:      darwin/amd64

Server:
 Engine:
  Version:      17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:        Wed Dec 27 20:12:29 2017
  OS/Arch:      linux/amd64
  Experimental: true
```

**Note** - In some cases, you may need to run `docker ...` commands with `sudo`. This should be fine for this CodeLab, but, if you want to be able to run Docker without `sudo`, you could follow [this guide](https://docs.docker.com/install/linux/linux-postinstall/).

## 4. Interacting with Docker

Once you have Docker installed, you can manage, build, and run Docker images from the command line. To see what images you have locally, you can run:

```sh
$ docker images
REPOSITORY          TAG       IMAGE ID       CREATED       SIZE
dockerized-test     latest    6fa518ef72d2   2 hours ago   607MB
ubuntu              latest    f975c5035748   9 days ago    112MB
```

You might not have any images yet, because you haven't pulled any. Let's go ahead and pull a Docker image from Docker's public registry, [Docker Hub](https://hub.docker.com/) (think of it like GitHub, but for Docker images):

```sh
$ docker pull dwhitena/minimal-jupyter
Using default tag: latest
latest: Pulling from dwhitena/minimal-jupyter
550fe1bea624: Already exists
b313ba46199e: Already exists
de349a63b77a: Already exists
3cd0781adeaa: Already exists
0cf242809b69: Pull complete
4a2fb11c3300: Pull complete
Digest: sha256:893107a7f4e27e772460aeddea0626bd1196aba9b0cc6468d3f52c47ff369e03
Status: Downloaded newer image for dwhitena/minimal-jupyter:latest
```

Now, when you list your Docker images with `docker images`, this image will show up in your local registry of images (because we pulled it from the remote registry). This `dwhitena/minimal-jupyter` image (as you might have guessed from the name) is a Docker image that includes Jupyter. Even if you don't have Jupyter, IPython, etc. installed locally, you can use Jupyter via this Docker image by running it:

```sh
$ docker run -p 8888:8888 dwhitena/minimal-jupyter
[I 20:47:17.087 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
[W 20:47:18.403 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using encryption. This is not recommended.
[W 20:47:18.403 NotebookApp] WARNING: The notebook server is listening on all IP addresses and not using authentication. This is highly insecure and not recommended.
[I 20:47:18.421 NotebookApp] Serving notebooks from local directory: /home/jovyan/notebooks
[I 20:47:18.421 NotebookApp] 0 active kernels
[I 20:47:18.421 NotebookApp] The Jupyter Notebook is running at:
[I 20:47:18.421 NotebookApp] http://[all ip addresses on your system]:8888/
[I 20:47:18.422 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```

As you can see (if you are familiar with Jupyter), this command started a Jupyter notebook server. However, the server isn't running directly on your localhost. It is running inside of a Docker container based on the `dwhitena/minimal-jupyter` Docker image. To see this, open a new terminal window and run `docker ps` to see what containers are running:

```sh
$ docker ps
CONTAINER ID   IMAGE                      COMMAND                  CREATED         STATUS         PORTS                    NAMES
55803aae3021   dwhitena/minimal-jupyter   "/bin/sh -c 'jupyter…"   2 minutes ago   Up 2 minutes   0.0.0.0:8888->8888/tcp   cranky_sinoussi
```

Try visiting `localhost:8888` in your browser to see the Jupyter notebook server running in the Docker container!

There are quite a variety of options that you can specify when running your Docker containers. We specified `-p 8888:8888` above, which mapped port 8888 inside the container (where Jupyter is running) to port 8888 outside of the container. However, we could have also specified a name for the container, run the container as a daemon, changed the container's networking, and much more. For a full list and explanation of these options, see the [docker run reference docs](https://docs.docker.com/engine/reference/run/).

To stop this running container, you can run `docker rm -f <CONTAINER ID>` from the terminal where you ran `docker ps`.
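For example, using the container ID from the `docker ps` output above:

```sh
# force-remove the running container; Docker echoes the ID on success
$ docker rm -f 55803aae3021
55803aae3021
```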
Alternatively, you should be able to stop it via `CTRL+C` in the terminal where you ran the `docker run` command.

**Note** - You could recreate this process on any machine in the cloud or on-premise, as long as that machine has Docker installed. You wouldn't have to install the right version of Jupyter, IPython, etc. You just need to `docker run`. In fact, you can `docker run` TensorFlow, PyTorch, R, ggplot, Postgres, MongoDB, Spark, or whatever you want, without messing with any dependencies. That's awesome! Hopefully, you are beginning to see the power and flexibility of containers.

**Note** - If you create a new notebook, you'll notice that anything you try to import that requires a dependency (numpy or pandas, for example) won't be found. Those dependencies haven't been added to the image yet. We'll go through how to install dependencies into your Docker image in the next part of the CodeLab. Alternatively, you could use the Jupyter terminal to install dependencies.

## Resources

- [Getting started with Docker](https://docs.docker.com/get-started/)
- [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
--------------------------------------------------------------------------------
/inference/Dockerfile:
--------------------------------------------------------------------------------
FROM python

# install dependencies
RUN pip install --upgrade pip && \
    pip install numpy scipy scikit-learn flask-restful

# add our project
ADD . /

# expose the port for the API
EXPOSE 5000

# run the API
CMD [ "python", "/api.py" ]
--------------------------------------------------------------------------------
/inference/README.md:
--------------------------------------------------------------------------------
# Docker-izing Inference, Services

Now that we have our model training Docker-ized, we could run that model training on any cloud or on-premise machine to produce a trained model. In addition to that, we might need to create a service (e.g., a REST API) that would allow us (or our users) to make requests for predictions, where our service would utilize the trained model to perform and serve the predictions. This guide will walk you through how you might develop, Docker-ize, and deploy this type of service:

1. [Developing the application](README.md#1-developing-the-application)
2. [Creating a Dockerfile](README.md#2-creating-a-dockerfile)
3. [Building a Docker image](README.md#3-building-a-docker-image)
4. [Pushing the image to a registry (optional)](README.md#4-pushing-the-image-to-a-registry-optional)
5. [Running model inference as a service in a container](README.md#5-running-model-inference-as-a-service-in-a-container)

Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Developing the application

Similar to model training, I have already developed the Python code we need for our desired functionality. Specifically, the [api.py script](api.py) will spin up an API (via Flask) that will serve model predictions.

## 2. Creating a Dockerfile

A Dockerfile for `api.py` is included [here](Dockerfile):

```
FROM python

# install dependencies
RUN pip install --upgrade pip && \
    pip install numpy scipy scikit-learn flask-restful

# add our project
ADD . /
# expose the port for the API
EXPOSE 5000

# run the API
CMD [ "python", "/api.py" ]
```

This Dockerfile is a little more complicated than the one we used for model training, but some things should look familiar. Everything in the arguments of the `RUN` instruction is something that you might run locally to prepare an environment for our application. Note how I have put a couple of operations under a single `RUN` instruction. Why did I do this?

Well, remember how a Docker image is built up from "layers" that are versioned in a repository? If I put each of the `apt-get` or `pip` commands into separate `RUN` instructions, then those would build up more and more layers that would always be versioned with the Docker image. By combining them, I can perform some cleanup at the end of all the installation to get rid of cached info and other things I don't need. If such a cleanup were in a separate `RUN` instruction, it would basically have no effect on the overall size of the image (the earlier layers would still carry the cached files).

Also, you will notice two new instructions in this Dockerfile that weren't in the model training Dockerfile:

- The `EXPOSE` instruction will expose a port in the container to be accessed from other running containers. Remember, we are going to make our predictions available via a call to an API, and it so happens that this API will be running on port 5000 in our container.
- The `CMD` instruction tells Docker that we want to run the provided command whenever a container based on this image starts. The specific command here starts our prediction service using the `api.py` script. This is an alternative to manually specifying the command as we did with model training (see the sketch after this list).
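Note that `CMD` only provides a *default* command. Per standard Docker behavior, any arguments you pass after the image name at run time override the `CMD` (a sketch, with `<yourimagename>` as a placeholder):

```sh
# run the default CMD (starts the API)
$ docker run <yourimagename>

# override the CMD, e.g., to poke around inside the image instead
$ docker run -it <yourimagename> /bin/bash
```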
## 3. Building a Docker image

We can now build our Docker image the same way we did for model training:

```sh
$ docker build -t <yourimagename> .
```

For example, to build the image such that I can push it up to my Docker Hub registry:

```sh
$ docker build -t dwhitena/model-inference:v1.0.0 .
```

## 4. Pushing the image to a registry (optional)

If desired, you can also push the image to your Docker Hub registry (to version it and make it available on other systems):

```sh
$ docker push dwhitena/model-inference:v1.0.0
```

(where you would replace `dwhitena` with your Docker Hub username)

## 5. Running model inference as a service in a container

Now, before we run this Python service in a container, let's discuss briefly how we would run the code locally. This Python script is configured to be run as follows:

```sh
$ export MODEL_FILE=/path/to/this/GH/repo/model_training/data/model.pkl
$ python api.py
```

The `api.py` script looks for an environment variable `MODEL_FILE`, which should be set to the location of the serialized model that was the output of our model training container. Once that code is running, it will serve predictions on port 5000. For example, you could visit the following address in a browser (or via curl, Postman, etc.) to get a prediction response in the form of JSON (assuming you are running the code locally):

```
http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3
```

When we run the code in the container, we will need to map port 5000 inside of the container to a port outside of the container (such that we can use the service), map a volume with the `model.pkl` file into the container, and set the environment variable `MODEL_FILE` in the container to specify the model file. Thus, to start the prediction service, run:

```sh
$ docker run -v /path/to/this/GH/repo/model_training/data:/data -e MODEL_FILE='/data/model.pkl' -p 5000:5000 <yourimagename>
 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
```

where you would replace `<yourimagename>` with the name of the Docker image you built above. In this command:

- `-v /path/to/this/GH/repo/model_training/data:/data` maps the absolute path to the [data directory](../model_training/data) on your local machine to `/data` inside the container (remember, we output our `model.pkl` file there).
- `-e MODEL_FILE='/data/model.pkl'` sets the `MODEL_FILE` environment variable in the container to the location of `model.pkl` in the container.
- `-p 5000:5000` maps port 5000 inside the container to port 5000 on our localhost.

With this container running, you should be able to obtain a prediction from the service by visiting `http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3` in a browser (or via curl, Postman, etc.). Try changing the `slength`, `swidth`, etc. parameters in the URL to get different predictions. Once you are done, you can remove the service via `CTRL+C`.
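For example, a sketch of querying the service with curl (the exact `species` value in the response will depend on your trained model):

```sh
$ curl 'http://localhost:5000/prediction?slength=1.5&swidth=0.7&plength=1.3&pwidth=0.3'
{"slength": 1.5, "swidth": 0.7, "plength": 1.3, "pwidth": 0.3, "species": "Iris-setosa"}
```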
As exercises, look at the `docker run` reference docs to try and figure out how to:

- run the container in the background (i.e., non-interactively or as a daemon)
- after running the container in the background, open a bash shell in the running container from another terminal
- get the logs from the running container

Now that you are a pro at running your applications in Docker, see the next section of the CodeLab for some suggestions on how to automate ingress/egress of data, how to scale containerized workloads, and more.

## Resources

- [Docker run reference docs](https://docs.docker.com/engine/reference/run/)
- [Dockerfile reference docs](https://docs.docker.com/engine/reference/builder/)
--------------------------------------------------------------------------------
/inference/api.py:
--------------------------------------------------------------------------------
import os
from flask import Flask
from flask_restful import Resource, Api
from flask_restful import reqparse
from sklearn.externals import joblib

app = Flask(__name__)
api = Api(app)

class Prediction(Resource):
    def get(self):

        # parse the four flower measurements from the query string
        parser = reqparse.RequestParser()
        parser.add_argument('slength', type=float, help='slength cannot be converted')
        parser.add_argument('swidth', type=float, help='swidth cannot be converted')
        parser.add_argument('plength', type=float, help='plength cannot be converted')
        parser.add_argument('pwidth', type=float, help='pwidth cannot be converted')
        args = parser.parse_args()

        # predict the species from the parsed features
        prediction = predict([[
            args['slength'],
            args['swidth'],
            args['plength'],
            args['pwidth']
        ]])

        # echo the inputs back along with the predicted species
        return {
            'slength': args['slength'],
            'swidth': args['swidth'],
            'plength': args['plength'],
            'pwidth': args['pwidth'],
            'species': prediction
        }

def predict(inputFeatures):
    # load the persisted model (pointed to by the MODEL_FILE environment
    # variable at startup) and predict the species for the given features
    mymodel = joblib.load(app.config.get('model_file'))
    prediction = mymodel.predict(inputFeatures)
    return prediction[0]

api.add_resource(Prediction, '/prediction')

if __name__ == '__main__':
    app.config['model_file'] = os.environ['MODEL_FILE']
    app.run(host='0.0.0.0', debug=False)
--------------------------------------------------------------------------------
/managing_and_scaling/README.md:
--------------------------------------------------------------------------------
# Managing and scaling Docker-ized data science apps

Although this CodeLab is a good chance to get your hands dirty with Docker, there are whole books and courses on containers and related systems. The following sections attempt to expose you to some related topics that are relevant to data scientists using Docker.

## Container Orchestration

Let's say that we love using Docker, and we containerize all of our things. Now we have a secondary problem of figuring out how we are going to manage all of those containers running across our infrastructure. It doesn't make much sense for data scientists and engineers to be ssh'ing into a bunch of machines and executing `docker run ...` commands all day.

Container orchestrators solve this problem. [Kubernetes](https://kubernetes.io/) is, by far, the leading container orchestration engine. With Kubernetes, you can declaratively specify that you want this many of container A, this many of container B, etc. running. Then Kubernetes will ensure that these containers are started and remain running on the underlying compute nodes (which could be cloud VMs or on-premise nodes).
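To make "declaratively specify" concrete, here is a minimal sketch of a Kubernetes manifest (the names here are hypothetical; the image is the inference image built earlier in the CodeLab) asking for three replicas of our prediction service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 3                  # Kubernetes keeps 3 copies of this container running
  selector:
    matchLabels:
      app: model-inference
  template:
    metadata:
      labels:
        app: model-inference
    spec:
      containers:
      - name: model-inference
        image: dwhitena/model-inference:v1.0.0
        ports:
        - containerPort: 5000  # the port our Flask API listens on
```

If a container (or even a whole node) dies, Kubernetes notices and starts a replacement to get back to the declared state.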
## Managing Data ingress/egress, pipelines, and scaling

You may have noticed, as we went through our examples, that we still had to manually get data from one container to the next. This could turn into a giant, error-prone pain. Data and code in data science workflows are always changing, and we can't rely on ourselves to manually get the right version of the right data to the right code at the right time on the right infrastructure.

[Pachyderm](http://pachyderm.io/) is a solution built on Kubernetes that solves these problems, while also allowing data scientists and AI researchers to scale their workloads (across both CPUs and GPUs). Pachyderm allows you to build data pipelines, where each stage of the data pipeline (e.g., training or inference) runs in a Docker container. Then, it will inject the data that you specify into the respective containers at the right time and in the right order. That way, you can focus on data sets and processing, and avoid spending all your time copying data and running containers.

Pachyderm also versions all of your data and processing. This is super important for both compliance and maintenance of workflows. With Pachyderm, you can always determine what data and processing contributed to a particular result, even if you have since changed the data sources and intermediate processing.

To get started with Pachyderm, take a look at their [getting started docs](http://pachyderm.readthedocs.io/en/latest/getting_started/getting_started.html) and join their [public Slack channel](http://slack.pachyderm.io/) to ask questions and get help.

## CI/CD

Often, Docker images that are used in production aren't built manually. Rather, the building (and sometimes running) of Docker images is integrated into a continuous integration/deployment pipeline (aka a CI/CD pipeline).

CI/CD pipelines and systems automate the testing, building, and deployment process. These systems are often listening to your GitHub repos. When you push new code to certain branches (e.g., dev, staging, and prod), they will automatically pull your latest code, test that code, build a Docker image for your code, and deploy that Docker image (i.e., run it) on some cloud or on-premise environment.

Common tools used for CI/CD, and which integrate with Docker, are [Jenkins](https://jenkins.io/), [Travis](https://travis-ci.org/), [CircleCI](https://circleci.com/), and [Ansible](https://www.ansible.com/), although there are many, many other choices.
--------------------------------------------------------------------------------
/model_training/Dockerfile:
--------------------------------------------------------------------------------
FROM python

# Install dependencies
RUN pip install -U numpy scipy scikit-learn pandas

# Add our code
ADD train.py /code/train.py
--------------------------------------------------------------------------------
/model_training/README.md:
--------------------------------------------------------------------------------
# Docker-izing Model Training

To get some practice building Docker images for our data science apps (aka "Docker-izing" our data science apps), we are going to take the "Hello World" of machine learning as an example: the [Iris flower classification problem](https://en.wikipedia.org/wiki/Iris_flower_data_set).

Let's say that we have two Python applications that do the following, respectively:

- Train and save a model based on the Iris data set, and
- Utilize the trained model to perform inferences on new input attributes of flowers.

First, we will Docker-ize the model training code, such that we can run it on any infrastructure with reproducible behavior. This guide will walk you through that process, which includes:

1. [Developing the application](README.md#1-developing-the-application)
2. [Creating a Dockerfile](README.md#2-creating-a-dockerfile)
3. [Building a Docker image](README.md#3-building-a-docker-image)
4. [Pushing the image to a registry (optional)](README.md#4-pushing-the-image-to-a-registry-optional)
5. [Running model training in a container](README.md#5-running-model-training-in-a-container)

Finally, I provide some [Resources](README.md#resources) for further exploration.

## 1. Developing the application

Because this isn't a Python CodeLab per se, I have already developed [a Python script (train.py)](train.py) that will perform this model training for you on the Iris dataset. The model will be capable of taking in a set of 4 measurements of a flower, and it will return a predicted species of that flower.

The easiest way to continue the CodeLab with the prepared code is to clone this repo (or download the repo contents from [here](https://github.com/dwhitena/qcon-ai-docker-workshop)):

```sh
$ git clone https://github.com/dwhitena/qcon-ai-docker-workshop.git
```

Then you will be able to run the commands as presented below and modify the respective files. Of course, you are welcome to modify any of the included Python scripts to your liking (e.g., changing the modeling method).

## 2. Creating a Dockerfile

To build a Docker image (that will allow us to run `train.py` in a container), we will need to create a Dockerfile. A Dockerfile tells Docker engine how a Docker image should be built. Think about this as a kind of recipe that Docker engine uses to build the image.

A basic Dockerfile for `train.py` is included [here](Dockerfile):

```
FROM python

# Install dependencies
RUN pip install -U numpy scipy scikit-learn pandas

# Add our code
ADD train.py /code/train.py
```

The Dockerfile format includes a series of *instructions* (in all caps) paired with corresponding *arguments*. Each of the *instructions* will result in a *layer* in the Docker image. The Docker image is generated and versioned in layers, such that you can easily change your code without having to rebuild the image from scratch. This layering also allows us to build on others' work. For example, we don't have to build up an Ubuntu-like file and packaging system or install Python ourselves (although we could). We can just say `FROM python`.

If you are looking for public images (like `python` above) that you can start from, try [Docker Hub](https://hub.docker.com/). For example, if you search for "TensorFlow" in Docker Hub, you will find a `tensorflow/tensorflow` image that is maintained by the TensorFlow team (along with many other TF images maintained by other people). There are scikit-learn, caret, ggplot, PyTorch, Spark, and many other public images that will allow you to experiment and create Docker images quickly.

**Warning** - Although searching Docker Hub is a good place to start when you need a base image for a Dockerfile, not all images published to Docker Hub are operational, secure, or ideal. Just like pulling code and packages from GitHub, you should investigate who is publishing the image, when it was last updated, and if it is created in a sane manner. For example, there are a bunch of base "data science" images on Docker Hub that include a whole ecosystem of data science tooling (Jupyter, scikit-learn, TF, PyTorch, etc.). I highly recommend that you avoid these types of images (unless you just want to experiment with them locally). For the most part, they are super bloated (> 1GB in size) and hard to work with, download, port, etc. At that point, you might as well use a full VM.
**Continued Warning** - Even images published by known teams are sometimes non-ideal. For example, the "minimal" Jupyter notebook image from the Jupyter team is over 1GB in size, which is hardly minimal (the one we pulled is 72MB). Generally, a large image size, lack of documentation, a non-public Dockerfile, and a lack of recent updates are bad signs when looking for a base image.

Ok, with that soapbox out of the way, let's look at the other two layers of our Dockerfile. The second instruction (`RUN ...`) tells Docker to install numpy, scipy, scikit-learn, and pandas on top of Python. We will need these to run our training. Note, there are public images with scikit-learn, etc. already installed, but these include a bunch of other things that we don't need. As such, it makes sense for us to just start from `python` and add in the few things we need.

Finally, we need to add our code to the image. The `ADD ...` instruction tells Docker to add `train.py` at the `/code/train.py` location in the image.

Can you think of ways to make the image built from this Dockerfile smaller or more reproducible? A smaller image will download to the target machine and initiate our work faster. It will also be easier to remove and update. Also, what happens if the `python` base image is updated? How can we keep this constant at a specific version? Try to create a modified version of the Dockerfile that:

- Utilizes a specific "tagged" version of the `python` base image,
- Utilizes a smaller version of the `python` base image,
- Utilizes a different base image, and/or
- Cleans up other things in the image that aren't used.

(Hint - [here](https://github.com/pachyderm/pachyderm/blob/master/doc/examples/ml/iris/python/iris-train-python-svm/Dockerfile) is a Dockerfile that will also run our model training code but that is much smaller. One possible direction is also sketched below.)
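If you get stuck, here is a sketch of one possible direction (not the only valid answer; the `python:3.6-slim` tag and the pinned package versions are just example choices):

```
# pin the base image to a specific, smaller variant of python
FROM python:3.6-slim

# pin dependency versions for reproducibility; --no-cache-dir keeps
# pip's download cache out of the image layer
RUN pip install --no-cache-dir numpy==1.14.2 scipy==1.0.0 scikit-learn==0.19.1 pandas==0.22.0

# Add our code
ADD train.py /code/train.py
```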
## 3. Building a Docker image

Now that we have our Dockerfile, we can build our Docker image for model training. First, we need to choose a *tag* for our Docker image. For now, think of this tag as the name of the Docker image (although we will see that it has more utility later).

To build our image, run the following from [this directory](.) in the cloned version of this repo:

```sh
$ docker build -t <yourimagename> .
```

The `-t <yourimagename>` argument tells Docker to tag your image as `<yourimagename>`, and the `.` at the end tells Docker to look for the Dockerfile in this directory. Note, you can also specify, via other flags, a Dockerfile in a different directory and/or a Dockerfile named something other than Dockerfile.

For example, I can create a `model-training` image by running:

```sh
$ docker build -t model-training .
Sending build context to Docker daemon  32.77kB
Step 1/3 : FROM python
latest: Pulling from library/python
f2b6b4884fc8: Pull complete
4fb899b4df21: Pull complete
74eaa8be7221: Pull complete
2d6e98fe4040: Pull complete
414666f7554d: Pull complete
135a494fed80: Pull complete
6ca3f38fdd4d: Pull complete
d67ff15d2a78: Pull complete
Digest: sha256:c021d6c587ea435509775c3a4da58d42287f630cb4ae6e0bc97ec839d9e0da3a
Status: Downloaded newer image for python:latest
 ---> d21927554614
Step 2/3 : RUN pip install -U numpy scipy scikit-learn pandas
 ---> Running in a77a8ec01d94
Collecting numpy
  Downloading numpy-1.14.2-cp36-cp36m-manylinux1_x86_64.whl (12.2MB)
Collecting scipy
  Downloading scipy-1.0.0-cp36-cp36m-manylinux1_x86_64.whl (50.0MB)
Collecting scikit-learn
  Downloading scikit_learn-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (12.4MB)
Collecting pandas
  Downloading pandas-0.22.0-cp36-cp36m-manylinux1_x86_64.whl (26.2MB)
Collecting pytz>=2011k (from pandas)
  Downloading pytz-2018.3-py2.py3-none-any.whl (509kB)
Collecting python-dateutil>=2 (from pandas)
  Downloading python_dateutil-2.7.0-py2.py3-none-any.whl (207kB)
Collecting six>=1.5 (from python-dateutil>=2->pandas)
  Downloading six-1.11.0-py2.py3-none-any.whl
Installing collected packages: numpy, scipy, scikit-learn, pytz, six, python-dateutil, pandas
Successfully installed numpy-1.14.2 pandas-0.22.0 python-dateutil-2.7.0 pytz-2018.3 scikit-learn-0.19.1 scipy-1.0.0 six-1.11.0
You are using pip version 9.0.1, however version 9.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Removing intermediate container a77a8ec01d94
 ---> 646a15f6e4df
Step 3/3 : ADD train.py /code/train.py
 ---> 3da96f640402
Successfully built 3da96f640402
Successfully tagged model-training:latest
```

You will notice in the output that Docker pulls your base image, runs your commands to install dependencies, and then adds your code. The Docker image will then be shown when you list the images in your local registry:

```sh
$ docker images
REPOSITORY                 TAG      IMAGE ID       CREATED         SIZE
model-training             latest   3da96f640402   3 minutes ago   1.21GB
dwhitena/minimal-jupyter   latest   1770383288b4   25 hours ago    203MB
python                     latest   d21927554614   3 days ago      688MB
tensorflow/tensorflow      latest   414b6e39764a   2 weeks ago     1.27GB
```

## 4. Pushing the image to a registry (optional)

One of the goals of Docker-izing our apps is to easily port them to other environments and make them available for other people to run. Thus, we can't be dependent on our laptop as the registry of our Docker images. We need to store and version our Docker images somewhere else (just like we would store and version our code somewhere like GitHub).

There are many options to choose from when thinking about where you want to store/version your images. You can also make your images public or keep them private (similar to having public/private repos on GitHub). Common choices for registries are [Docker Hub](https://hub.docker.com/), [AWS ECR](https://aws.amazon.com/ecr/), and [Google GCR](https://cloud.google.com/container-registry/).

To get some practice, create a free user account on Docker Hub. Once you have done that, use the `docker login` command to log into your Docker Hub account locally.
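A sketch of what that looks like (Docker prompts for the credentials you just created):

```sh
$ docker login
Username: <your Docker Hub username>
Password:
Login Succeeded
```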
To push our `model-training` Docker image up to the Docker Hub registry as a public Docker image (that could be pulled down and run elsewhere), we first need to re-tag our image. Remember how we "named" our image `model-training`? Well, there's actually more utility in that tag than just a human-readable name. We can tag our image with the format `<user>/<repo>:<tag>`, where:

- `<user>` specifies the registry and/or user associated with the image,
- `<repo>` specifies the common name of the image, and
- `<tag>` specifies a version of the image.

For example, because I'm `dwhitena` on Docker Hub, I could tag my image as follows:

```sh
$ docker tag model-training dwhitena/model-training:v1.0.0
```

and then push it to Docker Hub:

```sh
$ docker push dwhitena/model-training:v1.0.0
The push refers to repository [docker.io/dwhitena/model-training]
dbd827af12d1: Pushed
eddb03e225ec: Pushed
aec4f1507d85: Mounted from library/python
a4a7a3673769: Mounted from library/python
325a22db58ea: Mounted from library/python
6e1b48dc2ccc: Mounted from library/python
ff57bdb79ac8: Mounted from library/python
6e5e20cbf4a7: Mounted from library/python
86985c679800: Mounted from library/python
8fad67424c4e: Mounted from library/python
v1.0.0: digest: sha256:ca5032522813f696c76e763becec0352f4765015536c6b1ff3f64a0e02898d30 size: 2427
```

Now v1.0.0 of my Docker image is available on Docker Hub [here](https://hub.docker.com/r/dwhitena/model-training/). Try this with your username and check that the image is pushed to Docker Hub.

**Note** - If you don't utilize a `:<tag>` for your image (e.g., if I just used `dwhitena/model-training`), your image will be tagged as `latest`. This can be convenient while testing, but you should never use images tagged `latest` in production, because as you update the image you would lose the ability to revert to previous versions, run specific versions, etc.

## 5. Running model training in a container

Now, before we run this model training code in a container, let's discuss briefly how we would run the code locally. This Python script is configured to be run as follows:

```sh
$ python train.py <input directory> <output directory>
```

where `<input directory>` is the directory containing the iris training data set (included [here](data/iris.csv) in this repo) and `<output directory>` is where the Python script will save a serialized version of the trained model.

But how do we get the training data into our container? And how do we get the data out? Moreover, if our container finishes and we remove it, will we lose our data?

Well, Docker provides the ability to mount a "volume" into a container. By mounting a local volume into the container, our container will be able to read data from, and write data out to, the local filesystem. Then, once the container is deleted, we will still have the input/output data. We will use the `-v` flag with `docker run ...` to do this volume mapping.
From the root of this repo, you can run the model training in the container as follows:

```sh
$ docker run -v /path/to/this/GH/repo/model_training/data:/data <yourimagename> python /code/train.py /data /data
```

where you would replace `<yourimagename>` with the name of the Docker image you built in step 3 (`dwhitena/model-training:v1.0.0` in my case). `-v /path/to/this/GH/repo/model_training/data:/data` maps the absolute path to the [data directory](data) on your local machine to `/data` inside the container (you can use the `pwd` command to find the absolute path to that directory on your local machine). `python /code/train.py /data /data` is the command that we are running in the container to perform the training.

This should only take a second to run. Once it finishes, you should see the model output in your `data` directory:

```sh
$ ls data
iris.csv	model.pkl	model.txt
```

Yay! We successfully trained an ML model inside of a Docker container.

## Resources

- [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)
- [Docker run reference](https://docs.docker.com/engine/reference/run/)
- [Docker volumes](https://docs.docker.com/storage/volumes/)
--------------------------------------------------------------------------------
/model_training/data/iris.csv:
--------------------------------------------------------------------------------
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
--------------------------------------------------------------------------------
/model_training/train.py:
--------------------------------------------------------------------------------
import pandas as pd
from sklearn import svm
from sklearn.externals import joblib
import argparse
import os

# command line arguments
parser = argparse.ArgumentParser(description='Train a model for iris classification.')
parser.add_argument('indir', type=str, help='Input directory containing the training set')
parser.add_argument('outdir', type=str, help='Output directory for the trained model')
args = parser.parse_args()

# training set column names
cols = [
    "Sepal_Length",
    "Sepal_Width",
    "Petal_Length",
    "Petal_Width",
    "Species"
]

features = [
    "Sepal_Length",
    "Sepal_Width",
    "Petal_Length",
    "Petal_Width"
]

# import the iris training set
irisDF = pd.read_csv(os.path.join(args.indir, "iris.csv"), names=cols)

# fit the model
svc = svm.SVC(kernel='linear', C=1.0).fit(irisDF[features], irisDF["Species"])

# output a text description of the model
f = open(os.path.join(args.outdir, 'model.txt'), 'w')
f.write(str(svc))
f.close()

# persist the model
joblib.dump(svc, os.path.join(args.outdir, 'model.pkl'))
--------------------------------------------------------------------------------