├── README.md
├── data
│   ├── iris.csv
│   └── test
│       ├── 1.csv
│       └── 2.csv
├── infer.json
├── infer
│   ├── Dockerfile
│   ├── infer.jl
│   └── package_installs.jl
├── pipeline.png
├── train-forest
│   ├── Dockerfile
│   ├── package_installs.jl
│   └── train.jl
├── train-tree
│   ├── Dockerfile
│   ├── package_installs.jl
│   └── train.jl
└── train.json
/README.md: -------------------------------------------------------------------------------- 1 | # Workshop - Integrating Julia in Real-World, Distributed Pipelines 2 | 3 | ![alt tag](pipeline.png) 4 | 5 | This workshop focuses on building a production-scale machine learning pipeline with [Julia](https://julialang.org/), [Docker](https://www.docker.com/), [Kubernetes](https://kubernetes.io/), and [Pachyderm](http://pachyderm.io/). In particular, this pipeline trains and utilizes a model that predicts the species of iris flowers based on measurements of those flowers. 6 | 7 | The documentation below walks you through experimentation with Docker and the deployment of the pipelines. It also emphasizes a few key features related to reproducibility, pipeline triggering, and provenance: 8 | 9 | 1. [Prepare a Julia program and Docker image for training](README.md#1-prepare-a-julia-program-and-docker-image-for-model-training) 10 | 2. [Prepare a Julia program and Docker image for inference](README.md#2-prepare-a-julia-program-and-docker-image-for-inference) 11 | 3. [Connect to your Kubernetes cluster](README.md#3-connect-to-your-kubernetes-cluster) 12 | 4. [Check that Pachyderm is running on Kubernetes](README.md#4-check-that-pachyderm-is-running-on-kubernetes) 13 | 5. [Create the input "data repositories"](README.md#5-create-the-input-data-repositories) 14 | 6. [Commit the training data set into Pachyderm](README.md#6-commit-the-training-data-set-into-pachyderm) 15 | 7. [Create the training pipeline](README.md#7-create-the-training-pipeline) 16 | 8. [Commit input attributes](README.md#8-commit-input-attributes) 17 | 9. [Create the inference pipeline](README.md#9-create-the-inference-pipeline) 18 | 10. 
[Examine the results](README.md#10-examine-the-results) 19 | 20 | Bonus: 21 | 22 | 11. [Parallelize the inference](README.md#11-parallelize-the-inference) 23 | 12. [Update the model training](README.md#12-update-the-model-training) 24 | 13. [Update the training data set](README.md#13-update-the-training-data-set) 25 | 14. [Examine pipeline provenance](README.md#14-examine-pipeline-provenance) 26 | 27 | Finally, we provide some [Resources](README.md#resources) for further exploration. 28 | 29 | ## Prerequisites 30 | 31 | - Ability to `ssh` into a remote machine. 32 | - An IP for a remote machine (this should have been given to you at the beginning of the workshop). 33 | - Access to this repository on GitHub (we will clone it later on the remote machine). 34 | 35 | ## 1. Prepare a Julia program and Docker image for model training 36 | 37 | First, let's `ssh` into our development/workshop machine. As will be discussed later, this machine is connected to a running Kubernetes instance and a Pachyderm cluster, and it has everything you need to complete the workshop. You should have been given an IP for your dev machine at the beginning of the workshop. Use that IP where indicated (`<remote-IP>`) below to connect to the machine: 38 | 39 | ``` 40 | $ ssh pachrat@<remote-IP> 41 | ``` 42 | 43 | You will be prompted for a password that will be given out at the workshop. 44 | 45 | Now, let's learn how to prepare a containerized Julia program to train our iris species prediction model. Clone this GitHub repo on the dev machine as follows: 46 | 47 | ``` 48 | $ git clone https://github.com/dwhitena/julia-workshop.git 49 | ``` 50 | 51 | You should now see a folder containing the contents of this repo: 52 | 53 | ``` 54 | $ ls 55 | admin.conf julia-workshop 56 | ``` 57 | 58 | Navigate to the `train-tree` folder in this workshop repo. 
Here you will find a Julia program, [train.jl](train-tree/train.jl), that uses the `DecisionTree` package to train and export a model for predicting iris flower species: 59 | 60 | ``` 61 | $ cd julia-workshop/train-tree/ 62 | $ ls 63 | Dockerfile package_installs.jl train.jl 64 | ``` 65 | 66 | You can also see that we have a Julia program, `package_installs.jl`, that installs the necessary packages for our model training, and we have a "Dockerfile." This [Dockerfile](train-tree/Dockerfile) tells Docker how to build a Docker "image" for our model training. As you can see, in our Docker image, we are installing a couple of dependencies, running `package_installs.jl`, and adding our `train.jl` program. 67 | 68 | We have already pre-built this Docker image and uploaded it to Docker Hub [here](https://hub.docker.com/r/dwhitena/julia-train/) for use in this workshop. However, if you want to experiment with this image locally and/or build another Julia Docker image, you just need to [install Docker](https://docs.docker.com/engine/installation/) and "build" the image. To build the Docker image, you can run something similar to: 69 | 70 | (Note: the following won't work on the dev/workshop instance because of certain permissions, but you could use it to build the Docker image locally, assuming you have Docker installed. Please wait to install and run this until after the workshop so we don't take down the WiFi. haha) 71 | 72 | ``` 73 | $ docker build -t dwhitena/julia-train:tree . 
74 | Sending build context to Docker daemon 4.096kB 75 | Step 1/4 : FROM julia 76 | ---> f988666c0ef7 77 | Step 2/4 : ADD package_installs.jl /tmp/package_installs.jl 78 | ---> 649aff2f1f78 79 | Removing intermediate container fbdf08c34c45 80 | Step 3/4 : RUN apt-get update && apt-get install -y build-essential hdf5-tools && julia /tmp/package_installs.jl && rm -rf /var/lib/apt/lists/* 81 | ---> Running in 74fd7254bb87 82 | Get:1 http://security.debian.org jessie/updates InRelease [63.1 kB] 83 | Ign http://deb.debian.org jessie InRelease 84 | Get:2 http://deb.debian.org jessie-updates InRelease [145 kB] 85 | Get:3 http://deb.debian.org jessie Release.gpg [2373 B] 86 | Get:4 http://deb.debian.org jessie Release [148 kB] 87 | Get:5 http://security.debian.org jessie/updates/main amd64 Packages [523 kB] 88 | Get:6 http://deb.debian.org jessie-updates/main amd64 Packages [17.8 kB] 89 | Get:7 http://deb.debian.org jessie/main amd64 Packages [9065 kB] 90 | Fetched 9965 kB in 5s (1674 kB/s) 91 | Reading package lists... 92 | Reading package lists... 93 | Building dependency tree... 94 | Reading state information... 95 | The following extra packages will be installed: 96 | binutils bzip2 cpp cpp-4.9 dpkg-dev fakeroot g++ g++-4.9 gcc gcc-4.9 97 | libalgorithm-c3-perl libalgorithm-diff-perl libalgorithm-diff-xs-perl 98 | 99 | etc... 100 | ``` 101 | 102 | Your Docker image will then be listed under `docker images`: 103 | 104 | ``` 105 | $ docker images 106 | REPOSITORY TAG IMAGE ID CREATED SIZE 107 | 649aff2f1f78 7 hours ago 371MB 108 | c530a73337d8 7 hours ago 371MB 109 | dwhitena/iris-infer julia 5716b96aff25 21 hours ago 557MB 110 | dwhitena/julia-infer 5716b96aff25 21 hours ago 557MB 111 | dwhitena/iris-train julia-tree 28606eba05de 21 hours ago 557MB 112 | 113 | etc... 114 | ``` 115 | 116 | The Docker image can then be run manually (and interactively) as follows. 
We can see that Julia runs in the Docker container and that our `train.jl` program is included: 117 | 118 | ``` 119 | $ docker run -it dwhitena/julia-train:tree /bin/bash 120 | root@2862b6f9ea24:/# julia 121 | _ 122 | _ _ _(_)_ | A fresh approach to technical computing 123 | (_) | (_) (_) | Documentation: https://docs.julialang.org 124 | _ _ _| |_ __ _ | Type "?help" for help. 125 | | | | | | | |/ _` | | 126 | | | |_| | | | (_| | | Version 0.5.2 (2017-05-06 16:34 UTC) 127 | _/ |\__'_|_|_|\__'_| | Official http://julialang.org/ release 128 | |__/ | x86_64-pc-linux-gnu 129 | 130 | julia> 1+1 131 | 2 132 | 133 | julia> exit() 134 | root@2862b6f9ea24:/# cat /train.jl 135 | using DataFrames 136 | using DecisionTree 137 | using JLD 138 | 139 | # Read the iris data set. 140 | df = readtable(ARGS[1], header = false) 141 | 142 | # Get the features and labels. 143 | features = convert(Array, df[:, 1:4]) 144 | labels = convert(Array, df[:, 5]) 145 | 146 | # Train decision tree classifier. 147 | model = DecisionTreeClassifier(pruning_purity_threshold=0.9, maxdepth=6) 148 | DecisionTree.fit!(model, features, labels) 149 | 150 | # Save the model. 151 | save(ARGS[2], "model", model) 152 | 153 | root@2862b6f9ea24:/# 154 | ``` 155 | 156 | As mentioned, we have already uploaded this image to Docker Hub for use in this workshop. This was done by building the image as above and using `docker push` to push the image to the public Docker Hub registry. When we use the image later in the workshop, we will be "pulling" that image from Docker Hub, tagged as `dwhitena/julia-train:tree`. 157 | 158 | ## 2. Prepare a Julia program and Docker image for inference 159 | 160 | Similar to the process in section 1, we have created a Julia program, [infer.jl](infer/infer.jl), and a corresponding Docker image to be used for inference in our ML pipeline. This Docker image is uploaded to Docker Hub as [dwhitena/julia-infer](https://hub.docker.com/r/dwhitena/julia-infer/). 
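As with the training image, you could build and push this inference image yourself after the workshop. A minimal sketch (here `<your-user>` is a placeholder for your own Docker Hub username, and we assume Docker is installed and you are logged in via `docker login`):

```
$ cd julia-workshop/infer/
$ docker build -t <your-user>/julia-infer .
$ docker push <your-user>/julia-infer
```
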
161 | 162 | `infer.jl` does a few things: 163 | 164 | - takes a trained, persisted `model.jld` as input (the output of `train.jl`) 165 | - takes a directory as input 166 | - walks over files in that directory, where the files are sets of new iris attributes 167 | - infers the species for each set of attributes 168 | - outputs the inferred species to a specified output directory 169 | 170 | ## 3. Connect to your Kubernetes cluster 171 | 172 | On your dev/workshop machine, you should be connected to a running Kubernetes cluster. You can interact with this cluster via the Kubernetes CLI, `kubectl`. As a sanity check, you can make sure that Kubernetes is up and running as follows: 173 | 174 | ``` 175 | $ kubectl get all 176 | NAME READY STATUS RESTARTS AGE 177 | po/etcd-4197107720-906b7 1/1 Running 0 31m 178 | po/pachd-3548222380-cm1ts 1/1 Running 0 31m 179 | 180 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 181 | svc/etcd 10.97.253.64 2379:32379/TCP 31m 182 | svc/kubernetes 10.96.0.1 443/TCP 32m 183 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 31m 184 | 185 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 186 | deploy/etcd 1 1 1 1 31m 187 | deploy/pachd 1 1 1 1 31m 188 | 189 | NAME DESIRED CURRENT READY AGE 190 | rs/etcd-4197107720 1 1 1 31m 191 | rs/pachd-3548222380 1 1 1 31m 192 | ``` 193 | 194 | ## 4. Check that Pachyderm is running on Kubernetes 195 | 196 | Pachyderm should also be running on your Kubernetes cluster. To verify that everything is running correctly on the machine, you should be able to run the following and see the corresponding response: 197 | 198 | ``` 199 | $ pachctl version 200 | No config detected. 201 | Default config created at /home/pachrat/.pachyderm/config.json 202 | COMPONENT VERSION 203 | pachctl 1.4.7-RC1 204 | pachd 1.4.7-RC1 205 | ``` 206 | 207 | (Note: this was the first time `pachctl` was run on the machine, so a `pachctl` config was automatically created.) 208 | 209 | ## 5. 
Create the input data repositories 210 | 211 | On the Pachyderm cluster running on your remote machine, we will need to create the two input data repositories (for our training data and input iris attributes). To do this, run: 212 | 213 | ``` 214 | $ pachctl create-repo training 215 | $ pachctl create-repo attributes 216 | ``` 217 | 218 | As a sanity check, we can list out the current repos, and you should see the two repos you just created: 219 | 220 | ``` 221 | $ pachctl list-repo 222 | NAME CREATED SIZE 223 | attributes 5 seconds ago 0 B 224 | training 8 seconds ago 0 B 225 | ``` 226 | 227 | ## 6. Commit the training data set into Pachyderm 228 | 229 | We have our training data repository, but we haven't put our training data set into this repository yet. The training data set, `iris.csv`, is included here in the [data](data) directory. 230 | 231 | To get this data into Pachyderm, navigate to this directory and run: 232 | 233 | ``` 234 | $ cd /home/pachrat/julia-workshop/data 235 | $ pachctl put-file training master -c -f iris.csv 236 | ``` 237 | 238 | Then, you should be able to see the following: 239 | 240 | ``` 241 | $ pachctl list-repo 242 | NAME CREATED SIZE 243 | training 3 minutes ago 4.444 KiB 244 | attributes 3 minutes ago 0 B 245 | $ pachctl list-file training master 246 | NAME TYPE SIZE 247 | iris.csv file 4.444 KiB 248 | ``` 249 | 250 | ## 7. Create the training pipeline 251 | 252 | Next, we can create the `model` pipeline stage to process the data in the training repository. To do this, we just need to provide Pachyderm with [a JSON pipeline specification](train.json) that tells Pachyderm how to process the data. Once you have that pipeline spec, creating the training pipeline is as easy as: 253 | 254 | ``` 255 | $ cd .. 
256 | $ pachctl create-pipeline -f train.json 257 | ``` 258 | 259 | Immediately, you will notice that Pachyderm has kicked off a job to perform the model training: 260 | 261 | ``` 262 | $ pachctl list-job 263 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 264 | a0d78926-ce2a-491a-b926-90043bce7371 model/- 12 seconds ago - 0 0 / 1 running 265 | ``` 266 | 267 | This job should run for about 1-2 minutes (subsequent runs are faster; the first run has to pull the Docker image). After your model has successfully been trained, you should see: 268 | 269 | ``` 270 | $ pachctl list-job 271 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 272 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 2 minutes ago About a minute 0 1 / 1 success 273 | $ pachctl list-repo 274 | NAME CREATED SIZE 275 | model 2 minutes ago 43.67 KiB 276 | training 8 minutes ago 4.444 KiB 277 | attributes 7 minutes ago 0 B 278 | $ pachctl list-file model master 279 | NAME TYPE SIZE 280 | model.jld file 43.67 KiB 281 | ``` 282 | 283 | ## 8. Commit input attributes 284 | 285 | Great! We now have a trained model that will infer the species of iris flowers. Let's commit some iris attributes into Pachyderm that we would like to run through the inference. We have a couple of examples under [data/test](data/test). Feel free to use these or create your own. To commit these samples (assuming you have cloned this repo on the remote machine), you can run: 286 | 287 | ``` 288 | $ cd /home/pachrat/julia-workshop/data/test/ 289 | $ pachctl put-file attributes master -c -r -f . 290 | ``` 291 | 292 | You should then see: 293 | 294 | ``` 295 | $ pachctl list-file attributes master 296 | NAME TYPE SIZE 297 | 1.csv file 16 B 298 | 2.csv file 96 B 299 | ``` 300 | 301 | ## 9. Create the inference pipeline 302 | 303 | We have another JSON specification, [infer.json](infer.json), that tells Pachyderm how to perform the processing for the inference stage. 
This is similar to our last JSON specification except, in this case, we have two input repositories (the `attributes` and the `model`) and we are using a different Docker image that contains `infer.jl`. To create the inference stage, we simply run: 304 | 305 | ``` 306 | $ cd ../../ 307 | $ pachctl create-pipeline -f infer.json 308 | ``` 309 | 310 | This will immediately kick off an inference job, because we have committed unprocessed attributes into the `attributes` repo. The results will then be versioned in a corresponding `inference` data repository: 311 | 312 | ``` 313 | $ pachctl list-job 314 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 315 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/- 33 seconds ago - 0 0 / 2 running 316 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 7 minutes ago About a minute 0 1 / 1 success 317 | $ pachctl list-job 318 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 319 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb About a minute ago About a minute 0 2 / 2 success 320 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 8 minutes ago About a minute 0 1 / 1 success 321 | $ pachctl list-repo 322 | NAME CREATED SIZE 323 | inference About a minute ago 100 B 324 | attributes 13 minutes ago 112 B 325 | model 8 minutes ago 43.67 KiB 326 | training 13 minutes ago 4.444 KiB 327 | ``` 328 | 329 | ## 10. Examine the results 330 | 331 | We have created results from the inference, but how do we examine those results? 
There are multiple ways, but an easy way is to just "get" the specific files out of Pachyderm's data versioning: 332 | 333 | ``` 334 | $ pachctl list-file inference master 335 | NAME TYPE SIZE 336 | 1.csv file 15 B 337 | 2.csv file 85 B 338 | $ pachctl get-file inference master 1.csv 339 | Iris-virginica 340 | $ pachctl get-file inference master 2.csv 341 | Iris-versicolor 342 | Iris-virginica 343 | Iris-virginica 344 | Iris-virginica 345 | Iris-setosa 346 | Iris-setosa 347 | ``` 348 | 349 | Here we can see that each result file contains a predicted iris flower species corresponding to each set of input attributes. 350 | 351 | ## Bonus exercises 352 | 353 | You may not get to all of these bonus exercises during the workshop time, but you can perform these and all of the above steps any time you like with a [simple local Pachyderm install](http://docs.pachyderm.io/en/latest/getting_started/local_installation.html). You can spin up this local version of Pachyderm in just a few commands and experiment with this, [other Pachyderm examples](http://docs.pachyderm.io/en/latest/examples/readme.html), and/or your own pipelines. 354 | 355 | ### 11. Parallelize the inference 356 | 357 | You may have noticed that our pipeline specs included a `parallelism_spec` field. This tells Pachyderm how to parallelize a particular pipeline stage. Let's say that in production we start receiving a huge number of attribute files, and we need to keep up with our inference. In particular, let's say we want to spin up 10 inference workers to perform inference in parallel. 358 | 359 | This actually doesn't require any change to our code. We can simply change our `parallelism_spec` in `infer.json` to: 360 | 361 | ``` 362 | "parallelism_spec": { 363 | "strategy": "CONSTANT", 364 | "constant": "10" 365 | }, 366 | ``` 367 | 368 | Pachyderm will then spin up 10 inference workers, each running our same `infer.jl` script, to perform inference in parallel. 
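The reason this parallelizes cleanly is Pachyderm's notion of "datums": the glob patterns in the pipeline spec define how the input is split up. In the infer.json used here, each file in `attributes` is its own datum (glob `/*`), while the whole `model` repo is a single datum (glob `/`), so each worker receives the model plus some subset of the attribute files:

```
"input": {
  "cross": [
    { "atom": { "repo": "attributes", "glob": "/*" } },
    { "atom": { "repo": "model", "glob": "/" } }
  ]
}
```

With `"constant": "10"`, Pachyderm simply spreads those datums across ten workers.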
This can be confirmed by updating our pipeline and then examining the cluster: 369 | 370 | ``` 371 | $ vim infer.json 372 | $ pachctl update-pipeline -f infer.json 373 | $ kubectl get all 374 | NAME READY STATUS RESTARTS AGE 375 | po/etcd-4197107720-906b7 1/1 Running 0 52m 376 | po/pachd-3548222380-cm1ts 1/1 Running 0 52m 377 | po/pipeline-inference-v1-vsq8x 2/2 Terminating 0 6m 378 | po/pipeline-inference-v2-0w438 0/2 Init:0/1 0 5s 379 | po/pipeline-inference-v2-1tdm7 0/2 Pending 0 5s 380 | po/pipeline-inference-v2-2tqtl 0/2 Init:0/1 0 5s 381 | po/pipeline-inference-v2-6x917 0/2 Init:0/1 0 5s 382 | po/pipeline-inference-v2-cc5jz 0/2 Init:0/1 0 5s 383 | po/pipeline-inference-v2-cphcd 0/2 Init:0/1 0 5s 384 | po/pipeline-inference-v2-d5rc0 0/2 Init:0/1 0 5s 385 | po/pipeline-inference-v2-lhpcv 0/2 Init:0/1 0 5s 386 | po/pipeline-inference-v2-mpzwf 0/2 Pending 0 5s 387 | po/pipeline-inference-v2-p753f 0/2 Init:0/1 0 5s 388 | po/pipeline-model-v1-1gqv2 2/2 Running 0 13m 389 | 390 | NAME DESIRED CURRENT READY AGE 391 | rc/pipeline-inference-v2 10 10 0 5s 392 | rc/pipeline-model-v1 1 1 1 13m 393 | 394 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 395 | svc/etcd 10.97.253.64 2379:32379/TCP 52m 396 | svc/kubernetes 10.96.0.1 443/TCP 53m 397 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 52m 398 | svc/pipeline-inference-v2 10.99.47.41 80/TCP 5s 399 | svc/pipeline-model-v1 10.109.198.229 80/TCP 13m 400 | 401 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 402 | deploy/etcd 1 1 1 1 52m 403 | deploy/pachd 1 1 1 1 52m 404 | 405 | NAME DESIRED CURRENT READY AGE 406 | rs/etcd-4197107720 1 1 1 52m 407 | rs/pachd-3548222380 1 1 1 52m 408 | $ kubectl get all 409 | NAME READY STATUS RESTARTS AGE 410 | po/etcd-4197107720-906b7 1/1 Running 0 53m 411 | po/pachd-3548222380-cm1ts 1/1 Running 0 53m 412 | po/pipeline-inference-v2-0w438 2/2 Running 0 40s 413 | po/pipeline-inference-v2-1tdm7 2/2 Running 0 40s 414 | po/pipeline-inference-v2-2tqtl 2/2 Running 0 40s 415 | 
po/pipeline-inference-v2-6x917 2/2 Running 0 40s 416 | po/pipeline-inference-v2-cc5jz 2/2 Running 0 40s 417 | po/pipeline-inference-v2-cphcd 2/2 Running 0 40s 418 | po/pipeline-inference-v2-d5rc0 2/2 Running 0 40s 419 | po/pipeline-inference-v2-lhpcv 2/2 Running 0 40s 420 | po/pipeline-inference-v2-mpzwf 2/2 Running 0 40s 421 | po/pipeline-inference-v2-p753f 2/2 Running 0 40s 422 | po/pipeline-model-v1-1gqv2 2/2 Running 0 14m 423 | 424 | NAME DESIRED CURRENT READY AGE 425 | rc/pipeline-inference-v2 10 10 10 40s 426 | rc/pipeline-model-v1 1 1 1 14m 427 | 428 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 429 | svc/etcd 10.97.253.64 2379:32379/TCP 53m 430 | svc/kubernetes 10.96.0.1 443/TCP 54m 431 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 53m 432 | svc/pipeline-inference-v2 10.99.47.41 80/TCP 40s 433 | svc/pipeline-model-v1 10.109.198.229 80/TCP 14m 434 | 435 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 436 | deploy/etcd 1 1 1 1 53m 437 | deploy/pachd 1 1 1 1 53m 438 | 439 | NAME DESIRED CURRENT READY AGE 440 | rs/etcd-4197107720 1 1 1 53m 441 | rs/pachd-3548222380 1 1 1 53m 442 | ``` 443 | 444 | ### 12. Update the model training 445 | 446 | You might have noticed that this repo includes two versions of the training program `train.jl`. There is [train-tree/train.jl](train-tree/train.jl), which we used before in our pipeline, and then there is [train-forest/train.jl](train-forest/train.jl), which does similar training with a Random Forest model. 447 | 448 | Let's now imagine that we want to update our model to this random forest version. 
To do this, modify the image tag in `train.json`: 449 | 450 | ``` 451 | "image": "dwhitena/julia-train:forest", 452 | ``` 453 | 454 | Once you modify the spec, you can update the pipeline by running: 455 | 456 | ``` 457 | $ pachctl update-pipeline -f train.json 458 | ``` 459 | 460 | Pachyderm will then automatically kick off a new job to retrain our model with the random forest algorithm: 461 | 462 | ``` 463 | $ pachctl list-job 464 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 465 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/- 3 seconds ago - 0 0 / 1 running 466 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 11 minutes ago About a minute 0 2 / 2 success 467 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 19 minutes ago About a minute 0 1 / 1 success 468 | ``` 469 | 470 | Not only that: once the model is retrained, Pachyderm sees the new model and updates our inferences with the latest version of the model: 471 | 472 | ``` 473 | $ pachctl list-job 474 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 475 | 0477e755-79b4-4b14-ac04-5416d9a80cf3 inference/5dec44a330d24a1cb3822610c886489b 53 seconds ago 44 seconds 0 2 / 2 success 476 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/444b5950bcb642cfba5b087286640898 About a minute ago 56 seconds 0 1 / 1 success 477 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 13 minutes ago About a minute 0 2 / 2 success 478 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 20 minutes ago About a minute 0 1 / 1 success 479 | ``` 480 | 481 | ### 13. Update the training data set 482 | 483 | Let's say that one or more observations in our training data set were corrupt or unwanted, so we want to update our training data set. To simulate this, go ahead and open up `iris.csv` (e.g., with `vim`) and remove a couple of rows (the file has no header row, so any rows will do). 
Then, let's replace our training set: 484 | 485 | ``` 486 | $ pachctl start-commit training master 487 | 9cc070dadc344150ac4ceef2f0758509 488 | $ pachctl delete-file training 9cc070dadc344150ac4ceef2f0758509 iris.csv 489 | $ pachctl put-file training 9cc070dadc344150ac4ceef2f0758509 -f iris.csv 490 | $ pachctl finish-commit training 9cc070dadc344150ac4ceef2f0758509 491 | ``` 492 | 493 | Immediately, Pachyderm "knows" that the data has been updated, and it starts new jobs to update the model and the inferences. 494 | 495 | ### 14. Examine pipeline provenance 496 | 497 | Let's say that we have updated our model or training set in one of the above scenarios (sections 12 or 13). Now we have multiple inferences that were made with different models and/or training data sets. How can we know which results came from which specific models and/or training data sets? This is called "provenance," and Pachyderm gives it to you out of the box. 498 | 499 | Suppose we have run the following jobs: 500 | 501 | ``` 502 | $ pachctl list-job 503 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 504 | 0477e755-79b4-4b14-ac04-5416d9a80cf3 inference/5dec44a330d24a1cb3822610c886489b 6 minutes ago 44 seconds 0 2 / 2 success 505 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/444b5950bcb642cfba5b087286640898 7 minutes ago 56 seconds 0 1 / 1 success 506 | 2560f096-0515-4d68-be66-35e3b4f3e730 inference/db9e675de0274ce9a73d3fc9dd50fd51 13 minutes ago About a minute 1 2 / 2 success 507 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 19 minutes ago About a minute 0 2 / 2 success 508 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 26 minutes ago About a minute 0 1 / 1 success 509 | ``` 510 | 511 | If we want to know which model and training data set were used for the latest inference, commit id `5dec44a330d24a1cb3822610c886489b`, we just need to inspect the particular commit: 512 | 513 | ``` 514 | $ pachctl inspect-commit inference 
5dec44a330d24a1cb3822610c886489b 516 | Commit: inference/5dec44a330d24a1cb3822610c886489b 517 | Parent: db9e675de0274ce9a73d3fc9dd50fd51 518 | Started: 6 minutes ago 519 | Finished: 6 minutes ago 520 | Size: 100 B 521 | Provenance: training/9881a078c14c47e9b71fcc626a86499f attributes/f62907bda09d48cfa817476fd3e4f07f model/444b5950bcb642cfba5b087286640898 522 | ``` 523 | 524 | The `Provenance` field tells us exactly which model and training set were used (along with which commit to `attributes` triggered the inference). For example, if we wanted to see the exact model used, we would just need to reference commit `444b5950bcb642cfba5b087286640898` in the `model` repo: 525 | 526 | ``` 527 | $ pachctl list-file model 444b5950bcb642cfba5b087286640898 528 | NAME TYPE SIZE 529 | model.jld file 70.34 KiB 530 | ``` 531 | 532 | We could then get this model to examine it, rerun it, or revert to a different model. 533 | 534 | ## Resources 535 | 536 | Docker: 537 | 538 | - [Install Docker](https://docs.docker.com/engine/installation/) locally. 539 | - Start with the official [Julia base image](https://hub.docker.com/_/julia/). 540 | - Check out the [Docker docs](https://docs.docker.com/). 541 | 542 | Kubernetes: 543 | 544 | - Start by playing with [minikube](https://kubernetes.io/docs/tutorials/stateless-application/hello-minikube/). 545 | - Check out the [Kubernetes docs](https://kubernetes.io/docs/home/). 546 | 547 | Pachyderm: 548 | 549 | - Join the [Pachyderm Slack team](http://slack.pachyderm.io/) to ask questions, get help, and talk about production deploys. 
550 | - Follow [Pachyderm on Twitter](https://twitter.com/pachydermIO), 551 | - Find [Pachyderm on GitHub](https://github.com/pachyderm/pachyderm), and 552 | - [Spin up Pachyderm](http://docs.pachyderm.io/en/latest/getting_started/getting_started.html) in just a few commands to try this and [other examples](http://docs.pachyderm.io/en/latest/examples/readme.html) locally. 553 | 554 | -------------------------------------------------------------------------------- /data/iris.csv: -------------------------------------------------------------------------------- 1 | 5.1,3.5,1.4,0.2,Iris-setosa 2 | 4.9,3.0,1.4,0.2,Iris-setosa 3 | 4.7,3.2,1.3,0.2,Iris-setosa 4 | 4.6,3.1,1.5,0.2,Iris-setosa 5 | 5.0,3.6,1.4,0.2,Iris-setosa 6 | 5.4,3.9,1.7,0.4,Iris-setosa 7 | 4.6,3.4,1.4,0.3,Iris-setosa 8 | 5.0,3.4,1.5,0.2,Iris-setosa 9 | 4.4,2.9,1.4,0.2,Iris-setosa 10 | 4.9,3.1,1.5,0.1,Iris-setosa 11 | 5.4,3.7,1.5,0.2,Iris-setosa 12 | 4.8,3.4,1.6,0.2,Iris-setosa 13 | 4.8,3.0,1.4,0.1,Iris-setosa 14 | 4.3,3.0,1.1,0.1,Iris-setosa 15 | 5.8,4.0,1.2,0.2,Iris-setosa 16 | 5.7,4.4,1.5,0.4,Iris-setosa 17 | 5.4,3.9,1.3,0.4,Iris-setosa 18 | 5.1,3.5,1.4,0.3,Iris-setosa 19 | 5.7,3.8,1.7,0.3,Iris-setosa 20 | 5.1,3.8,1.5,0.3,Iris-setosa 21 | 5.4,3.4,1.7,0.2,Iris-setosa 22 | 5.1,3.7,1.5,0.4,Iris-setosa 23 | 4.6,3.6,1.0,0.2,Iris-setosa 24 | 5.1,3.3,1.7,0.5,Iris-setosa 25 | 4.8,3.4,1.9,0.2,Iris-setosa 26 | 5.0,3.0,1.6,0.2,Iris-setosa 27 | 5.0,3.4,1.6,0.4,Iris-setosa 28 | 5.2,3.5,1.5,0.2,Iris-setosa 29 | 5.2,3.4,1.4,0.2,Iris-setosa 30 | 4.7,3.2,1.6,0.2,Iris-setosa 31 | 4.8,3.1,1.6,0.2,Iris-setosa 32 | 5.4,3.4,1.5,0.4,Iris-setosa 33 | 5.2,4.1,1.5,0.1,Iris-setosa 34 | 5.5,4.2,1.4,0.2,Iris-setosa 35 | 4.9,3.1,1.5,0.1,Iris-setosa 36 | 5.0,3.2,1.2,0.2,Iris-setosa 37 | 5.5,3.5,1.3,0.2,Iris-setosa 38 | 4.9,3.1,1.5,0.1,Iris-setosa 39 | 4.4,3.0,1.3,0.2,Iris-setosa 40 | 5.1,3.4,1.5,0.2,Iris-setosa 41 | 5.0,3.5,1.3,0.3,Iris-setosa 42 | 4.5,2.3,1.3,0.3,Iris-setosa 43 | 4.4,3.2,1.3,0.2,Iris-setosa 44 | 
5.0,3.5,1.6,0.6,Iris-setosa 45 | 5.1,3.8,1.9,0.4,Iris-setosa 46 | 4.8,3.0,1.4,0.3,Iris-setosa 47 | 5.1,3.8,1.6,0.2,Iris-setosa 48 | 4.6,3.2,1.4,0.2,Iris-setosa 49 | 5.3,3.7,1.5,0.2,Iris-setosa 50 | 5.0,3.3,1.4,0.2,Iris-setosa 51 | 7.0,3.2,4.7,1.4,Iris-versicolor 52 | 6.4,3.2,4.5,1.5,Iris-versicolor 53 | 6.9,3.1,4.9,1.5,Iris-versicolor 54 | 5.5,2.3,4.0,1.3,Iris-versicolor 55 | 6.5,2.8,4.6,1.5,Iris-versicolor 56 | 5.7,2.8,4.5,1.3,Iris-versicolor 57 | 6.3,3.3,4.7,1.6,Iris-versicolor 58 | 4.9,2.4,3.3,1.0,Iris-versicolor 59 | 6.6,2.9,4.6,1.3,Iris-versicolor 60 | 5.2,2.7,3.9,1.4,Iris-versicolor 61 | 5.0,2.0,3.5,1.0,Iris-versicolor 62 | 5.9,3.0,4.2,1.5,Iris-versicolor 63 | 6.0,2.2,4.0,1.0,Iris-versicolor 64 | 6.1,2.9,4.7,1.4,Iris-versicolor 65 | 5.6,2.9,3.6,1.3,Iris-versicolor 66 | 6.7,3.1,4.4,1.4,Iris-versicolor 67 | 5.6,3.0,4.5,1.5,Iris-versicolor 68 | 5.8,2.7,4.1,1.0,Iris-versicolor 69 | 6.2,2.2,4.5,1.5,Iris-versicolor 70 | 5.6,2.5,3.9,1.1,Iris-versicolor 71 | 5.9,3.2,4.8,1.8,Iris-versicolor 72 | 6.1,2.8,4.0,1.3,Iris-versicolor 73 | 6.3,2.5,4.9,1.5,Iris-versicolor 74 | 6.1,2.8,4.7,1.2,Iris-versicolor 75 | 6.4,2.9,4.3,1.3,Iris-versicolor 76 | 6.6,3.0,4.4,1.4,Iris-versicolor 77 | 6.8,2.8,4.8,1.4,Iris-versicolor 78 | 6.7,3.0,5.0,1.7,Iris-versicolor 79 | 6.0,2.9,4.5,1.5,Iris-versicolor 80 | 5.7,2.6,3.5,1.0,Iris-versicolor 81 | 5.5,2.4,3.8,1.1,Iris-versicolor 82 | 5.5,2.4,3.7,1.0,Iris-versicolor 83 | 5.8,2.7,3.9,1.2,Iris-versicolor 84 | 6.0,2.7,5.1,1.6,Iris-versicolor 85 | 5.4,3.0,4.5,1.5,Iris-versicolor 86 | 6.0,3.4,4.5,1.6,Iris-versicolor 87 | 6.7,3.1,4.7,1.5,Iris-versicolor 88 | 6.3,2.3,4.4,1.3,Iris-versicolor 89 | 5.6,3.0,4.1,1.3,Iris-versicolor 90 | 5.5,2.5,4.0,1.3,Iris-versicolor 91 | 5.5,2.6,4.4,1.2,Iris-versicolor 92 | 6.1,3.0,4.6,1.4,Iris-versicolor 93 | 5.8,2.6,4.0,1.2,Iris-versicolor 94 | 5.0,2.3,3.3,1.0,Iris-versicolor 95 | 5.6,2.7,4.2,1.3,Iris-versicolor 96 | 5.7,3.0,4.2,1.2,Iris-versicolor 97 | 5.7,2.9,4.2,1.3,Iris-versicolor 98 | 
6.2,2.9,4.3,1.3,Iris-versicolor 99 | 5.1,2.5,3.0,1.1,Iris-versicolor 100 | 5.7,2.8,4.1,1.3,Iris-versicolor 101 | 6.3,3.3,6.0,2.5,Iris-virginica 102 | 5.8,2.7,5.1,1.9,Iris-virginica 103 | 7.1,3.0,5.9,2.1,Iris-virginica 104 | 6.3,2.9,5.6,1.8,Iris-virginica 105 | 6.5,3.0,5.8,2.2,Iris-virginica 106 | 7.6,3.0,6.6,2.1,Iris-virginica 107 | 4.9,2.5,4.5,1.7,Iris-virginica 108 | 7.3,2.9,6.3,1.8,Iris-virginica 109 | 6.7,2.5,5.8,1.8,Iris-virginica 110 | 7.2,3.6,6.1,2.5,Iris-virginica 111 | 6.5,3.2,5.1,2.0,Iris-virginica 112 | 6.4,2.7,5.3,1.9,Iris-virginica 113 | 6.8,3.0,5.5,2.1,Iris-virginica 114 | 5.7,2.5,5.0,2.0,Iris-virginica 115 | 5.8,2.8,5.1,2.4,Iris-virginica 116 | 6.4,3.2,5.3,2.3,Iris-virginica 117 | 6.5,3.0,5.5,1.8,Iris-virginica 118 | 7.7,3.8,6.7,2.2,Iris-virginica 119 | 7.7,2.6,6.9,2.3,Iris-virginica 120 | 6.0,2.2,5.0,1.5,Iris-virginica 121 | 6.9,3.2,5.7,2.3,Iris-virginica 122 | 5.6,2.8,4.9,2.0,Iris-virginica 123 | 7.7,2.8,6.7,2.0,Iris-virginica 124 | 6.3,2.7,4.9,1.8,Iris-virginica 125 | 6.7,3.3,5.7,2.1,Iris-virginica 126 | 7.2,3.2,6.0,1.8,Iris-virginica 127 | 6.2,2.8,4.8,1.8,Iris-virginica 128 | 6.1,3.0,4.9,1.8,Iris-virginica 129 | 6.4,2.8,5.6,2.1,Iris-virginica 130 | 7.2,3.0,5.8,1.6,Iris-virginica 131 | 7.4,2.8,6.1,1.9,Iris-virginica 132 | 7.9,3.8,6.4,2.0,Iris-virginica 133 | 6.4,2.8,5.6,2.2,Iris-virginica 134 | 6.3,2.8,5.1,1.5,Iris-virginica 135 | 6.1,2.6,5.6,1.4,Iris-virginica 136 | 7.7,3.0,6.1,2.3,Iris-virginica 137 | 6.3,3.4,5.6,2.4,Iris-virginica 138 | 6.4,3.1,5.5,1.8,Iris-virginica 139 | 6.0,3.0,4.8,1.8,Iris-virginica 140 | 6.9,3.1,5.4,2.1,Iris-virginica 141 | 6.7,3.1,5.6,2.4,Iris-virginica 142 | 6.9,3.1,5.1,2.3,Iris-virginica 143 | 5.8,2.7,5.1,1.9,Iris-virginica 144 | 6.8,3.2,5.9,2.3,Iris-virginica 145 | 6.7,3.3,5.7,2.5,Iris-virginica 146 | 6.7,3.0,5.2,2.3,Iris-virginica 147 | 6.3,2.5,5.0,1.9,Iris-virginica 148 | 6.5,3.0,5.2,2.0,Iris-virginica 149 | 6.2,3.4,5.4,2.3,Iris-virginica 150 | 5.9,3.0,5.1,1.8,Iris-virginica 151 | 152 | 
--------------------------------------------------------------------------------
/data/test/1.csv:
--------------------------------------------------------------------------------
5.9,3.0,5.1,1.8
--------------------------------------------------------------------------------
/data/test/2.csv:
--------------------------------------------------------------------------------
5.7,2.8,4.1,1.3
6.3,3.3,6.0,2.5
5.8,2.7,5.1,1.9
7.1,3.0,5.9,2.1
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
--------------------------------------------------------------------------------
/infer.json:
--------------------------------------------------------------------------------
{
  "pipeline": {
    "name": "inference"
  },
  "transform": {
    "image": "dwhitena/julia-infer",
    "cmd": [
      "julia",
      "/infer.jl",
      "/pfs/model/model.jld",
      "/pfs/attributes/",
      "/pfs/out/"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": "1"
  },
  "input": {
    "cross": [
      {
        "atom": {
          "repo": "attributes",
          "glob": "/*"
        }
      },
      {
        "atom": {
          "repo": "model",
          "glob": "/"
        }
      }
    ]
  }
}
--------------------------------------------------------------------------------
/infer/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD infer.jl /infer.jl
--------------------------------------------------------------------------------
/infer/infer.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Load our model.
model = load(ARGS[1], "model")

# Walk over the directory with input attribute files.
attributes = readdir(ARGS[2])
for file in attributes
    p = joinpath(ARGS[2], file)
    if isdir(p)
        continue
    elseif isfile(p)
        df = readtable(p, header = false)
        open(joinpath(ARGS[3], file), "a") do x
            for r in eachrow(df)
                prediction = DecisionTree.predict(model, convert(Array, r))
                write(x, string(prediction[1], "\n"))
            end
        end
    end
end
--------------------------------------------------------------------------------
/infer/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dwhitena/julia-workshop/afc5d6889707e549ceeedb3f30e235b6dd87ef11/pipeline.png
--------------------------------------------------------------------------------
/train-forest/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD train.jl /train.jl
--------------------------------------------------------------------------------
/train-forest/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/train-forest/train.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Read the iris data set.
df = readtable(ARGS[1], header = false)

# Get the features and labels.
features = convert(Array, df[:, 1:4])
labels = convert(Array, df[:, 5])

# Train a random forest classifier.
model = RandomForestClassifier(ntrees=3, partialsampling=0.7)
DecisionTree.fit!(model, features, labels)

# Save the model.
save(ARGS[2], "model", model)
--------------------------------------------------------------------------------
/train-tree/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD train.jl /train.jl
--------------------------------------------------------------------------------
/train-tree/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/train-tree/train.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Read the iris data set.
df = readtable(ARGS[1], header = false)

# Get the features and labels.
features = convert(Array, df[:, 1:4])
labels = convert(Array, df[:, 5])

# Train a decision tree classifier.
model = DecisionTreeClassifier(pruning_purity_threshold=0.9, maxdepth=6)
DecisionTree.fit!(model, features, labels)

# Save the model.
save(ARGS[2], "model", model)
--------------------------------------------------------------------------------
/train.json:
--------------------------------------------------------------------------------
{
  "pipeline": {
    "name": "model"
  },
  "transform": {
    "image": "dwhitena/julia-train:tree",
    "cmd": [
      "julia",
      "/train.jl",
      "/pfs/training/iris.csv",
      "/pfs/out/model.jld"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": "1"
  },
  "input": {
    "atom": {
      "repo": "training",
      "glob": "/"
    }
  }
}
--------------------------------------------------------------------------------