├── README.org
├── dev-env
│   ├── 00-devenv.org
│   ├── Dockerfile
│   ├── Untitled.ipynb
│   └── requirements.txt
├── docker-start
│   ├── 00-dockerclient.org
│   ├── 01-dockerfile.org
│   └── Dockerfile
└── model-deploy
    ├── Dockerfile
    ├── __init__.py
    ├── main.py
    └── requirements.txt

/README.org:
--------------------------------------------------------------------------------
 1 | #+TITLE: Docker for Data Science
 2 | #+AUTHOR: Hareem Naveed
 3 | #+EMAIL: hnaveed@munichre.ca
 4 | #+STARTUP: showeverything
 5 | #+STARTUP: nohideblocks
 6 | #+STARTUP: indent
 7 | 
 8 | * Background
 9 | 
10 | This tutorial will show you how to integrate =docker= into your data science workflow. =docker= is an open-source tool that makes it easy to build, deploy, and run applications using a container framework. If you do any of the following, you can use =docker= to make your life easier:
11 | 
12 | - share and reproduce your analysis
13 | - run large-scale data cleaning tasks
14 | - build dashboards and publish models
15 | 
16 | * Getting Started
17 | 
18 | Clone the repo to your machine:
19 | 
20 | #+BEGIN_EXAMPLE
21 | git clone https://github.com/harnav/pydata-docker-tutorial.git
22 | #+END_EXAMPLE
23 | 
24 | In this tutorial, we will cover three topics:
25 | 
26 | 1. Running a container
27 | 2. Reproducible environments
28 | 3. Deploying models
29 | 
30 | ** References
31 | 
32 | For more detailed instructions, check out:
33 | 
34 | - [[https://towardsdatascience.com/how-docker-can-help-you-become-a-more-effective-data-scientist-7fc048ef91d5][How Docker Can Help You Become a More Effective Data Scientist]]
35 | - [[https://www.analyticsvidhya.com/blog/2017/11/reproducible-data-science-docker-for-data-science/][Reproducible Data Science: Docker for Data Science]]
36 | - [[https://github.com/docker/labs][Docker Labs]]
--------------------------------------------------------------------------------
/dev-env/00-devenv.org:
--------------------------------------------------------------------------------
 1 | * Setting up a Dev Environment
 2 | 
 3 | When working on a Python project, a common way to manage dependencies is to run =pip freeze > requirements.txt= and couple that with =virtualenv= to manage project-level dependencies.
 4 | 
 5 | Often, however, when reproducing somebody else's analysis, it is not enough to run =pip install -r requirements.txt= in the repository.
 6 | 
 7 | That is because system-level dependencies are usually not captured in a =requirements.txt=. As I develop my code, I install system-level dependencies as the packages require them, and a fresh machine will not have them.
 8 | 
 9 | Docker makes it easy to capture and replicate these system-level dependencies as well.
10 | 
11 | The next stage is figuring out how to layer instructions and put together a Dockerfile that builds a container to your exact specifications, so your results can be replicated.
12 | 
13 | In the past few weeks, we have observed the following issues:
14 | 1. Somebody builds a tool in a different flavour of Python with several package changes
15 | 2. Having pandas 0.20 on one machine and pandas 0.21 on another can subtly change how metrics are calculated
16 | 3. Struggling to manage =virtualenvs= for different projects, and broken =virtualenvs=
17 | 
18 | The use case we will explore here is setting up a Python environment with all the tools required for a project, and then running a Jupyter notebook server inside it to access all those resources.
19 | 
20 | In the terminal, run:
21 | #+BEGIN_EXAMPLE
22 | docker build -t devenv .
23 | #+END_EXAMPLE
24 | 
25 | This builds an image with everything in the =requirements.txt= file installed. Running it starts a Jupyter notebook server:
26 | 
27 | #+BEGIN_EXAMPLE
28 | docker run -p 8888:8888 devenv
29 | #+END_EXAMPLE
30 | 
31 | Now access it from your machine by visiting =localhost:8888=. You will be asked to copy and paste the token printed in the terminal.
--------------------------------------------------------------------------------
/dev-env/Dockerfile:
--------------------------------------------------------------------------------
 1 | FROM python:3
 2 | 
 3 | RUN apt-get update && apt-get install -y python3-pip
 4 | 
 5 | COPY requirements.txt .
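# Copying requirements.txt by itself, before any project code, lets Docker
# cache the dependency-install layer that follows: rebuilds skip the
# pip install step whenever requirements.txt is unchanged.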
 6 | 
 7 | RUN pip install -r requirements.txt
 8 | 
 9 | # Install jupyter
10 | RUN pip3 install jupyter
11 | 
12 | # Create a new system user
13 | RUN useradd -ms /bin/bash demo
14 | 
15 | # Change to this new user
16 | USER demo
17 | 
18 | # Set the container working directory to the user home folder
19 | WORKDIR /home/demo
20 | 
21 | # Start the jupyter notebook
22 | ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0"]
--------------------------------------------------------------------------------
/dev-env/Untitled.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [],
3 |  "metadata": {},
4 |  "nbformat": 4,
5 |  "nbformat_minor": 2
6 | }
--------------------------------------------------------------------------------
/dev-env/requirements.txt:
--------------------------------------------------------------------------------
1 | agate==1.6.1
2 | asn1crypto==0.24.0
3 | autopep8==1.3.5
4 | Babel==2.5.3
5 | backcall==0.1.0
6 | bleach==2.1.3
7 | census==0.8.7
--------------------------------------------------------------------------------
/docker-start/00-dockerclient.org:
--------------------------------------------------------------------------------
 1 | * Getting Familiar with the Docker Client
 2 | 
 3 | To check that everything is set up, run the following:
 4 | 
 5 | #+BEGIN_EXAMPLE
 6 | docker run hello-world
 7 | #+END_EXAMPLE
 8 | 
 9 | #+BEGIN_SRC sh
10 | 
11 | Unable to find image 'hello-world:latest' locally
12 | latest: Pulling from library/hello-world
13 | d1725b59e92d: Pull complete
14 | Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971e499788
15 | Status: Downloaded newer image for hello-world:latest
16 | ...
17 | #+END_SRC
18 | 
19 | 
20 | ** Pulling an Image
21 | 
22 | Now that everything is set up, let's walk through how to run your first container. We will run an Ubuntu container and get familiar with some of the =docker= commands.
23 | 
24 | In your terminal, run the following:
25 | 
26 | #+BEGIN_EXAMPLE
27 | docker pull ubuntu
28 | #+END_EXAMPLE
29 | 
30 | If you get a permission-denied error, you may need to run =sudo docker pull=. To avoid this in the future, try:
31 | 
32 | #+BEGIN_EXAMPLE
33 | sudo usermod -aG docker $USER
34 | #+END_EXAMPLE
35 | 
36 | Then exit and restart your terminal.
37 | 
38 | The pull command fetches the latest =ubuntu= image from *Docker Hub*, a public container registry. To see which images are downloaded to your machine, run the following:
39 | 
40 | #+BEGIN_EXAMPLE
41 | docker images
42 | #+END_EXAMPLE
43 | 
44 | Now that we have pulled our first image, it is time to run the container.
45 | 
46 | ** Running a Container
47 | In your terminal, run the following:
48 | #+BEGIN_EXAMPLE
49 | docker run ubuntu echo "hello!"
50 | #+END_EXAMPLE
51 | 
52 | What just happened?
53 | 
54 | 1. When you call =run=, the Docker client calls the Docker daemon
55 | 2. The Docker daemon checks locally to see if the image is available; if it is not, it downloads it from Docker Hub
56 | 3. If the image is present, the daemon creates the container and runs the command you specified in the container
57 | 4. The output of the command is streamed back to the client for you to observe
58 | 
59 | In the example above, the Docker client ran the command in the container and then exited, all in a matter of seconds! The speed with which containers can be created and commands run makes them useful in many situations.
60 | 
61 | Note that the container exits as soon as the command you pass to it finishes. To keep it alive, run the container in *interactive* mode:
62 | #+BEGIN_EXAMPLE
63 | docker run -it ubuntu
64 | #+END_EXAMPLE
65 | 
66 | This drops you into the container. Try out your favourite commands (=ls -la=). You can exit the container by typing =exit=.
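The commands above can also be scripted. If you ever want to drive the Docker client from Python (say, to launch a batch of data-cleaning containers), the standard library's =subprocess= module is enough. The helper below is a minimal sketch; the function name and its options are our own invention, not part of Docker:

```python
import subprocess

def docker_run_cmd(image, command=None, interactive=False, ports=None):
    """Assemble a `docker run` invocation as an argument list."""
    cmd = ["docker", "run"]
    if interactive:
        cmd.append("-it")
    # ports maps host port -> container port, like `-p 8888:8888`
    for host_port, container_port in (ports or {}).items():
        cmd += ["-p", f"{host_port}:{container_port}"]
    cmd.append(image)
    if command:
        cmd += command
    return cmd

# The example from this section:
print(docker_run_cmd("ubuntu", command=["echo", "hello!"]))
# → ['docker', 'run', 'ubuntu', 'echo', 'hello!']

# To actually launch one, hand the list to subprocess, e.g.:
# subprocess.run(docker_run_cmd("devenv", ports={8888: 8888}))
```

Building the argument list separately from running it keeps the docker invocation easy to log and test without a daemon present.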
67 | 
68 | If you want to see what containers you have running, type:
69 | #+BEGIN_EXAMPLE
70 | docker ps
71 | #+END_EXAMPLE
72 | 
73 | Since you have exited all of your containers, you will see nothing here. To see the containers that you have run, try:
74 | 
75 | #+BEGIN_EXAMPLE
76 | docker ps -a
77 | #+END_EXAMPLE
78 | 
79 | This shows a list of all the containers you have run, along with their status. To get just the container IDs, use =docker ps -a -q=. The point to note here is that the image persists, but a container only exists for as long as you run it. You can essentially have many machines with various configurations on your machine or server, spun up only as you need them.
80 | 
81 | If at any time you want to clean up containers and images, you can use:
82 | #+BEGIN_EXAMPLE
83 | docker rm $(docker ps -a -q)
84 | #+END_EXAMPLE
85 | 
86 | This removes all stopped containers on your machine. Similarly, to remove all images, use =docker rmi $(docker images -a -q)=.
--------------------------------------------------------------------------------
/docker-start/01-dockerfile.org:
--------------------------------------------------------------------------------
 1 | 
 2 | * Getting Familiar with a Dockerfile
 3 | 
 4 | ** Terminology Review
 5 | - Image: Blueprint for the container you want to build. Often based on other images.
 6 | - Layer: A modification to a base image. Layers are applied in sequence to create the final image.
 7 | - Container: Built from an image. You can have many copies of the same image running as containers.
 8 | - Dockerfile: Instructions for how to build an image, written in a special syntax whose commands are routed to the command line during the build.
 9 | - Commit: One of the benefits of Docker is that it offers version control of a computing environment. This is handled similarly to git.
10 | - Docker Hub/Container Registry: A repository for Docker images. Registries can be public or private: Docker Hub (public) vs. the Azure/AWS container registry offerings (private).
11 | 
12 | ** Dockerfile Example
13 | A Dockerfile has INSTRUCTIONS and arguments. It is not necessary that instructions be capitalized, but it is the convention.
14 | 
15 | *** FROM
16 | 
17 | The *FROM* statement specifies the base image. In our example, we take the =postgres= base image from [[https://hub.docker.com/_/postgres/][Docker Hub]].
18 | 
19 | *** LABEL
20 | 
21 | The *LABEL* statement adds metadata to the image. It is optional, but helpful if you are pushing your images to a shared registry, so people know who to contact in case of an issue.
22 | 
23 | *** RUN
24 | 
25 | The *RUN* statement is the workhorse of the Dockerfile. In our case, we use it to run shell commands. These commands have nothing to do with Docker; they are basic Linux commands.
26 | 
27 | *** WORKDIR
28 | 
29 | The *WORKDIR* statement is used to specify the working directory. Any subsequent commands will assume that is the working directory.
30 | 
31 | *** ADD
32 | 
33 | The *ADD* statement lets you copy files from the host machine into the image (=COPY= is similar, and preferred for straightforward copies).
34 | 
35 | *** CMD
36 | 
37 | The *CMD* statement is used to provide defaults when executing a container. Only one *CMD* statement takes effect per Dockerfile; if you provide several, only the last one will be used by the container.
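Because a Dockerfile is essentially a sequence of INSTRUCTION/argument pairs, it is easy to inspect programmatically. As a toy illustration (this parser is our own sketch, not a Docker tool), the following splits Dockerfile text into those pairs, joining =\= line continuations:

```python
def parse_dockerfile(text):
    """Split Dockerfile text into (INSTRUCTION, arguments) pairs.

    A toy parser for illustration only: it joins backslash line
    continuations and skips comments and blank lines.
    """
    logical, buf = [], ""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if stripped.endswith("\\"):
            buf += stripped.rstrip("\\").rstrip() + " "
            continue  # instruction continues on the next line
        logical.append((buf + stripped).strip())
        buf = ""
    return [tuple(entry.split(None, 1)) for entry in logical]

# The PostGIS Dockerfile from this repo, as a string:
example = """FROM postgres:9.5.10
## PostGIS activation
RUN apt-get -y update && \\
    apt-get -y install postgis \\
    postgresql-9.5-pgrouting
"""
print(parse_dockerfile(example))
```

Running this shows the three-line =RUN= collapsing into a single instruction, which is also how Docker sees it: one instruction, one layer.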
38 | 
39 | More information on Dockerfile instructions can be found here: https://docs.docker.com/engine/reference/builder/
--------------------------------------------------------------------------------
/docker-start/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM postgres:9.5.10
2 | 
3 | ## PostGIS activation
4 | RUN apt-get -y update && \
5 |     apt-get -y install postgis \
6 |     postgresql-9.5-pgrouting
--------------------------------------------------------------------------------
/model-deploy/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM tiangolo/uwsgi-nginx-flask:python3.6
2 | 
3 | WORKDIR /app/
4 | 
5 | COPY requirements.txt /app/
6 | RUN pip install -r ./requirements.txt
7 | 
8 | ENV ENVIRONMENT production
9 | 
10 | COPY main.py __init__.py /app/
--------------------------------------------------------------------------------
/model-deploy/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/harnav/pydata-docker-tutorial/f956aff0fcc36947633ec59308c14bb5309fd26b/model-deploy/__init__.py
--------------------------------------------------------------------------------
/model-deploy/main.py:
--------------------------------------------------------------------------------
 1 | #!flask/bin/python
 2 | 
 3 | import os
 4 | from flask import Flask
 5 | from flask import request
 6 | from sklearn import linear_model
 7 | from sklearn import datasets
 8 | import pickle
 9 | import numpy as np
10 | 
11 | diabetes = datasets.load_diabetes()
12 | 
13 | # Pick just one feature
14 | X = diabetes.data[:, np.newaxis, 2]
15 | 
16 | # Create and save a simple model at import time
17 | regr = linear_model.LinearRegression()
18 | regr.fit(X, diabetes.target)
19 | pickle.dump(regr, open('diabetes.pkl', 'wb'))
20 | 
21 | app = Flask(__name__)
22 | 
23 | @app.route('/isAlive')
24 | def index():
25 |     return "true"
26 | 
27 | @app.route('/prediction/', methods=['GET'])
28 | def get_prediction():
29 |     feature = float(request.args.get('f'))
30 |     model = pickle.load(open('diabetes.pkl', 'rb'))
31 |     pred = model.predict([[feature]])
32 |     return str(pred)
33 | 
34 | if __name__ == '__main__':
35 |     if os.environ.get('ENVIRONMENT') == 'production':
36 |         app.run(port=80, host='0.0.0.0')
--------------------------------------------------------------------------------
/model-deploy/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.13
2 | scipy==0.19.1
3 | Flask==0.12.2
4 | pandas==0.20.2
5 | scikit_learn==0.18.2
--------------------------------------------------------------------------------
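The save/load pattern in =main.py= (fit once, =pickle.dump= the model at startup, =pickle.load= per request) can be tried without Flask or scikit-learn. Below is a minimal stand-in using only the standard library; =SlopeModel= and its coefficients are invented for this sketch and are not part of the tutorial code:

```python
import pickle

class SlopeModel:
    """Toy stand-in for the pickled scikit-learn regressor in main.py.

    The class and its coefficients are made up for this sketch.
    """
    def __init__(self, coef, intercept):
        self.coef, self.intercept = coef, intercept

    def predict(self, rows):
        # mirrors the model.predict([[feature]]) call shape used in main.py
        return [self.coef * x + self.intercept for (x,) in rows]

# "Train" once and persist, as main.py does at import time
with open("diabetes.pkl", "wb") as f:
    pickle.dump(SlopeModel(coef=900.0, intercept=150.0), f)

# Load and predict per request, as in get_prediction()
with open("diabetes.pkl", "rb") as f:
    model = pickle.load(f)

print(model.predict([[0.05]]))
```

The running container follows the same pattern with the real regressor; after publishing port 80 (e.g. =docker run -p 80:80 ...=), you can exercise it with =curl 'http://localhost/prediction/?f=0.05'=.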