├── README.org
├── dev-env
│   ├── 00-devenv.org
│   ├── Dockerfile
│   ├── Untitled.ipynb
│   └── requirements.txt
├── docker-start
│   ├── 00-dockerclient.org
│   ├── 01-dockerfile.org
│   └── Dockerfile
└── model-deploy
    ├── Dockerfile
    ├── __init__.py
    ├── main.py
    └── requirements.txt

/README.org:
--------------------------------------------------------------------------------
 1 | #+TITLE: Docker for Data Science
 2 | #+AUTHOR: Hareem Naveed
 3 | #+EMAIL: hnaveed@munichre.ca
 4 | #+STARTUP: showeverything
 5 | #+STARTUP: nohideblocks
 6 | #+STARTUP: indent
 7 | 
 8 | * Background
 9 | 
10 | This tutorial will show you how to integrate =docker= into your data science workflow. =docker= is an open-source tool that makes it easy to build, deploy, and run applications using a container framework. If you do any of the following, you can use =docker= to make your life easier:
11 | 
12 | - share and reproduce your analysis
13 | - run large-scale data cleaning tasks
14 | - build dashboards and publish models
15 | 
16 | * Getting Started
17 | 
18 | Clone the repo to your machine:
19 | 
20 | #+BEGIN_EXAMPLE
21 | git clone https://github.com/harnav/pydata-docker-tutorial.git
22 | #+END_EXAMPLE
23 | 
24 | In this tutorial, we will cover three topics:
25 | 
26 | 1. Running a container
27 | 2. Reproducible environments
28 | 3. Deploying models
29 | 
30 | ** References
31 | 
32 | For more detailed instructions, check out:
33 | 
34 | - [[https://towardsdatascience.com/how-docker-can-help-you-become-a-more-effective-data-scientist-7fc048ef91d5][How Docker Can Help You Become a More Effective Data Scientist]]
35 | - [[https://www.analyticsvidhya.com/blog/2017/11/reproducible-data-science-docker-for-data-science/][Reproducible Data Science: Docker for Data Science]]
36 | - [[https://github.com/docker/labs][Docker Labs]]
--------------------------------------------------------------------------------
/dev-env/00-devenv.org:
--------------------------------------------------------------------------------
 1 | * Setting up a Dev Environment
 2 | 
 3 | When working on a Python project, a common way to manage dependencies is to run =pip freeze > requirements.txt= and couple that with =virtualenv= to manage project-level dependencies.
 4 | 
 5 | Often, however, when reproducing somebody else's analysis, it is not enough to run =pip install -r requirements.txt= in the repository.
 6 | 
 7 | That is because system-level dependencies are usually not captured in a =requirements.txt=. As I develop my code, I install system-level dependencies as the packages require them, and a fresh machine will not have them.
 8 | 
 9 | Docker makes it easy to capture and replicate these system-level dependencies as well.
10 | 
11 | The next stage is figuring out how to layer instructions and put together a Dockerfile that builds a container to your exact specifications, so your results can be replicated.
12 | 
13 | In the past few weeks, we have observed the following issues:
14 | 1. Somebody builds a tool in a different flavour of Python with several package changes
15 | 2. Having pandas 0.20 on one machine and pandas 0.21 on another can subtly change how metrics are calculated
16 | 3. Struggling to manage =virtualenvs= for different projects, and broken =virtualenvs=
17 | 
18 | The use case we will explore here is setting up a Python environment with all the tools required for a project, and then running a Jupyter notebook server inside it to access all those resources.
19 | 
20 | In the terminal, run:
21 | #+BEGIN_EXAMPLE
22 | docker build -t devenv .
23 | #+END_EXAMPLE
24 | 
25 | This builds an image with everything in the =requirements.txt= file installed. Running it starts a Jupyter notebook server:
26 | 
27 | #+BEGIN_EXAMPLE
28 | docker run -p 8888:8888 devenv
29 | #+END_EXAMPLE
30 | 
31 | Now access it from your machine by visiting =localhost:8888=. You will be asked to copy and paste the token printed in the terminal.
--------------------------------------------------------------------------------
/dev-env/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM python:3
 2 | 
 3 | RUN apt-get update && apt-get install -y python3-pip
 4 | 
 5 | COPY requirements.txt .
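# Copying requirements.txt by itself, before any project code, lets Docker
# cache the dependency-install layer that follows: rebuilds skip the
# pip install step whenever requirements.txt is unchanged.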
 6 | 
 7 | RUN pip install -r requirements.txt
 8 | 
 9 | # Install jupyter
10 | RUN pip3 install jupyter
11 | 
12 | # Create a new system user
13 | RUN useradd -ms /bin/bash demo
14 | 
15 | # Change to this new user
16 | USER demo
17 | 
18 | # Set the container working directory to the user home folder
19 | WORKDIR /home/demo
20 | 
21 | # Start the jupyter notebook
22 | ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0"]
--------------------------------------------------------------------------------
/dev-env/Untitled.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [],
3 |  "metadata": {},
4 |  "nbformat": 4,
5 |  "nbformat_minor": 2
6 | }
--------------------------------------------------------------------------------
/dev-env/requirements.txt:
--------------------------------------------------------------------------------
1 | agate==1.6.1
2 | asn1crypto==0.24.0
3 | autopep8==1.3.5
4 | Babel==2.5.3
5 | backcall==0.1.0
6 | bleach==2.1.3
7 | census==0.8.7
--------------------------------------------------------------------------------
/docker-start/00-dockerclient.org:
--------------------------------------------------------------------------------
 1 | * Getting Familiar with the Docker Client
 2 | 
 3 | To check that everything is set up, run the following:
 4 | 
 5 | #+BEGIN_EXAMPLE
 6 | docker run hello-world
 7 | #+END_EXAMPLE
 8 | 
 9 | #+BEGIN_SRC sh
10 | 
11 | Unable to find image 'hello-world:latest' locally
12 | latest: Pulling from library/hello-world
13 | d1725b59e92d: Pull complete
14 | Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971e499788
15 | Status: Downloaded newer image for hello-world:latest
16 | ...
17 | #+END_SRC
18 | 
19 | 
20 | ** Pulling an Image
21 | 
22 | Now that everything is set up, let's walk through how to run your first container. We will run an Ubuntu container and get familiar with some of the =docker= commands.
23 | 
24 | In your terminal, run the following:
25 | 
26 | #+BEGIN_EXAMPLE
27 | docker pull ubuntu
28 | #+END_EXAMPLE
29 | 
30 | If you get a permission-denied error, you may need to run =sudo docker pull=. To avoid this in the future, try:
31 | 
32 | #+BEGIN_EXAMPLE
33 | sudo usermod -aG docker $USER
34 | #+END_EXAMPLE
35 | 
36 | Then exit and restart your terminal.
37 | 
38 | The pull command fetches the latest =ubuntu= image from *Docker Hub*, a public container registry. To see which images are downloaded to your machine, run the following:
39 | 
40 | #+BEGIN_EXAMPLE
41 | docker images
42 | #+END_EXAMPLE
43 | 
44 | Now that we have pulled our first image, it is time to run the container.
45 | 
46 | ** Running a Container
47 | In your terminal, run the following:
48 | #+BEGIN_EXAMPLE
49 | docker run ubuntu echo "hello!"
50 | #+END_EXAMPLE
51 | 
52 | What just happened?
53 | 
54 | 1. When you call =run=, the Docker client calls the Docker daemon
55 | 2. The Docker daemon checks locally to see if the image is available; if it is not, it downloads it from Docker Hub
56 | 3. If the image is present, the daemon creates the container and runs the command you specified in the container
57 | 4. The output of the command is streamed back to the client for you to observe
58 | 
59 | In the example above, the Docker client ran the command in the container and then exited, all in a matter of seconds! The speed with which containers can be created and commands run makes them useful in many situations.
60 | 
61 | Note that the container exits as soon as the command you pass to it finishes. To keep it alive, run the container in *interactive* mode:
62 | #+BEGIN_EXAMPLE
63 | docker run -it ubuntu
64 | #+END_EXAMPLE
65 | 
66 | This drops you into the container. Try out your favourite commands (=ls -la=). You can exit the container by typing =exit=.
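The commands above can also be scripted. If you ever want to drive the Docker client from Python (say, to launch a batch of data-cleaning containers), the standard library's =subprocess= module is enough. The helper below is a minimal sketch; the function name and its options are our own invention, not part of Docker:

```python
import subprocess

def docker_run_cmd(image, command=None, interactive=False, ports=None):
    """Assemble a `docker run` invocation as an argument list."""
    cmd = ["docker", "run"]
    if interactive:
        cmd.append("-it")
    # ports maps host port -> container port, like `-p 8888:8888`
    for host_port, container_port in (ports or {}).items():
        cmd += ["-p", f"{host_port}:{container_port}"]
    cmd.append(image)
    if command:
        cmd += command
    return cmd

# The example from this section:
print(docker_run_cmd("ubuntu", command=["echo", "hello!"]))
# → ['docker', 'run', 'ubuntu', 'echo', 'hello!']

# To actually launch one, hand the list to subprocess, e.g.:
# subprocess.run(docker_run_cmd("devenv", ports={8888: 8888}))
```

Building the argument list separately from running it keeps the docker invocation easy to log and test without a daemon present.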
67 | 
68 | If you want to see what containers you have running, type:
69 | #+BEGIN_EXAMPLE
70 | docker ps
71 | #+END_EXAMPLE
72 | 
73 | Since you have exited all of your containers, you will see nothing here. To see the containers that you have run, try:
74 | 
75 | #+BEGIN_EXAMPLE
76 | docker ps -a
77 | #+END_EXAMPLE
78 | 
79 | This shows a list of all the containers you have run, along with their status. To get just the container IDs, use =docker ps -a -q=. The point to note here is that the image persists, but a container only exists for as long as you run it. You can essentially have many machines with various configurations on your machine or server, spun up only as you need them.
80 | 
81 | If at any time you want to clean up containers and images, you can use:
82 | #+BEGIN_EXAMPLE
83 | docker rm $(docker ps -a -q)
84 | #+END_EXAMPLE
85 | 
86 | This removes all stopped containers on your machine. Similarly, to remove all images, use =docker rmi $(docker images -a -q)=.
--------------------------------------------------------------------------------
/docker-start/01-dockerfile.org:
--------------------------------------------------------------------------------
 1 | 
 2 | * Getting Familiar with a Dockerfile
 3 | 
 4 | ** Terminology Review
 5 | - Image: Blueprint for the container you want to build. Often based on other images.
 6 | - Layer: A modification to a base image. Layers are applied in sequence to create the final image.
 7 | - Container: Built from an image. You can have many copies of the same image running as containers.
 8 | - Dockerfile: Instructions for how to build an image, written in a special syntax whose commands are routed to the command line during the build.
 9 | - Commit: One of the benefits of Docker is that it offers version control of a computing environment. This is handled similarly to git.
10 | - Docker Hub/Container Registry: A repository for Docker images. Registries can be public or private: Docker Hub (public) vs. the Azure/AWS container registry offerings (private).
11 | 
12 | ** Dockerfile Example
13 | A Dockerfile has INSTRUCTIONS and arguments. It is not necessary that instructions be capitalized, but it is the convention.
14 | 
15 | *** FROM
16 | 
17 | The *FROM* statement specifies the base image. In our example, we take the =postgres= base image from [[https://hub.docker.com/_/postgres/][Docker Hub]].
18 | 
19 | *** LABEL
20 | 
21 | The *LABEL* statement adds metadata to the image. It is optional, but helpful if you are pushing your images to a shared registry, so people know who to contact in case of an issue.
22 | 
23 | *** RUN
24 | 
25 | The *RUN* statement is the workhorse of the Dockerfile. In our case, we use it to run shell commands. These commands have nothing to do with Docker; they are basic Linux commands.
26 | 
27 | *** WORKDIR
28 | 
29 | The *WORKDIR* statement is used to specify the working directory. Any subsequent commands will assume that is the working directory.
30 | 
31 | *** ADD
32 | 
33 | The *ADD* statement lets you copy files from the host machine into the image (=COPY= is similar, and preferred for straightforward copies).
34 | 
35 | *** CMD
36 | 
37 | The *CMD* statement is used to provide defaults when executing a container. Only one *CMD* statement takes effect per Dockerfile; if you provide several, only the last one will be used by the container.
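Because a Dockerfile is essentially a sequence of INSTRUCTION/argument pairs, it is easy to inspect programmatically. As a toy illustration (this parser is our own sketch, not a Docker tool), the following splits Dockerfile text into those pairs, joining =\= line continuations:

```python
def parse_dockerfile(text):
    """Split Dockerfile text into (INSTRUCTION, arguments) pairs.

    A toy parser for illustration only: it joins backslash line
    continuations and skips comments and blank lines.
    """
    logical, buf = [], ""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.endswith("\\"):
            buf += stripped.rstrip("\\").rstrip() + " "
            continue  # instruction continues on the next line
        logical.append((buf + stripped).strip())
        buf = ""
    return [tuple(entry.split(None, 1)) for entry in logical]

# The PostGIS Dockerfile from this repo, as a string:
example = """FROM postgres:9.5.10
## PostGIS activation
RUN apt-get -y update && \\
    apt-get -y install postgis \\
    postgresql-9.5-pgrouting
"""
print(parse_dockerfile(example))
```

Running this shows the three-line =RUN= collapsing into a single instruction, which is also how Docker sees it: one instruction, one layer.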
38 | 
39 | More information on Dockerfile instructions can be found here: https://docs.docker.com/engine/reference/builder/
--------------------------------------------------------------------------------
/docker-start/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM postgres:9.5.10
2 | 
3 | ## PostGIS activation
4 | RUN apt-get -y update && \
5 |     apt-get -y install postgis \
6 |     postgresql-9.5-pgrouting
--------------------------------------------------------------------------------
/model-deploy/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM tiangolo/uwsgi-nginx-flask:python3.6
2 | 
3 | WORKDIR /app/
4 | 
5 | COPY requirements.txt /app/
6 | RUN pip install -r ./requirements.txt
7 | 
8 | ENV ENVIRONMENT production
9 | 
10 | COPY main.py __init__.py /app/
--------------------------------------------------------------------------------
/model-deploy/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harnav/pydata-docker-tutorial/f956aff0fcc36947633ec59308c14bb5309fd26b/model-deploy/__init__.py
--------------------------------------------------------------------------------
/model-deploy/main.py:
--------------------------------------------------------------------------------
 1 | #!flask/bin/python
 2 | 
 3 | import os
 4 | from flask import Flask
 5 | from flask import request
 6 | from sklearn import linear_model
 7 | from sklearn import datasets
 8 | import pickle
 9 | import numpy as np
10 | 
11 | diabetes = datasets.load_diabetes()
12 | 
13 | # Pick just one feature
14 | X = diabetes.data[:, np.newaxis, 2]
15 | 
16 | # Create and save a simple model at import time
17 | regr = linear_model.LinearRegression()
18 | regr.fit(X, diabetes.target)
19 | pickle.dump(regr, open('diabetes.pkl', 'wb'))
20 | 
21 | app = Flask(__name__)
22 | 
23 | @app.route('/isAlive')
24 | def index():
25 |     return "true"
26 | 
27 | @app.route('/prediction/', methods=['GET'])
28 | def get_prediction():
29 |     feature = float(request.args.get('f'))
30 |     model = pickle.load(open('diabetes.pkl', 'rb'))
31 |     pred = model.predict([[feature]])
32 |     return str(pred)
33 | 
34 | if __name__ == '__main__':
35 |     if os.environ.get('ENVIRONMENT') == 'production':
36 |         app.run(port=80, host='0.0.0.0')
--------------------------------------------------------------------------------
/model-deploy/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.13
2 | scipy==0.19.1
3 | Flask==0.12.2
4 | pandas==0.20.2
5 | scikit_learn==0.18.2
--------------------------------------------------------------------------------
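The save/load pattern in =main.py= (fit once, =pickle.dump= the model at startup, =pickle.load= per request) can be tried without Flask or scikit-learn. Below is a minimal stand-in using only the standard library; =SlopeModel= and its coefficients are invented for this sketch and are not part of the tutorial code:

```python
import pickle

class SlopeModel:
    """Toy stand-in for the pickled scikit-learn regressor in main.py.

    The class and its coefficients are made up for this sketch.
    """
    def __init__(self, coef, intercept):
        self.coef, self.intercept = coef, intercept

    def predict(self, rows):
        # mirrors the model.predict([[feature]]) call shape used in main.py
        return [self.coef * x + self.intercept for (x,) in rows]

# "Train" once and persist, as main.py does at import time
with open("diabetes.pkl", "wb") as f:
    pickle.dump(SlopeModel(coef=900.0, intercept=150.0), f)

# Load and predict per request, as in get_prediction()
with open("diabetes.pkl", "rb") as f:
    model = pickle.load(f)

print(model.predict([[0.05]]))
```

The running container follows the same pattern with the real regressor; after publishing port 80 (e.g. =docker run -p 80:80 ...=), you can exercise it with =curl 'http://localhost/prediction/?f=0.05'=.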