├── README.md
├── data
│   ├── iris.csv
│   └── test
│       ├── 1.csv
│       └── 2.csv
├── infer.json
├── infer
│   ├── Dockerfile
│   ├── infer.jl
│   └── package_installs.jl
├── pipeline.png
├── train-forest
│   ├── Dockerfile
│   ├── package_installs.jl
│   └── train.jl
├── train-tree
│   ├── Dockerfile
│   ├── package_installs.jl
│   └── train.jl
└── train.json
/README.md: -------------------------------------------------------------------------------- 1 | # Workshop - Integrating Julia in Real-World, Distributed Pipelines 2 | 3 | ![alt tag](pipeline.png) 4 | 5 | This workshop focuses on building a production-scale machine learning pipeline with [Julia](https://julialang.org/), [Docker](https://www.docker.com/), [Kubernetes](https://kubernetes.io/), and [Pachyderm](http://pachyderm.io/). In particular, this pipeline trains and utilizes a model that predicts the species of iris flowers based on measurements of those flowers. 6 | 7 | The documentation below walks you through experimentation with Docker and the deployment of the pipelines. It also emphasizes a few key features related to reproducibility, pipeline triggering, and provenance: 8 | 9 | 1. [Prepare a Julia program and Docker image for training](README.md#1-prepare-a-julia-program-and-docker-image-for-model-training) 10 | 2. [Prepare a Julia program and Docker image for inference](README.md#2-prepare-a-julia-program-and-docker-image-for-inference) 11 | 3. [Connect to your Kubernetes cluster](README.md#3-connect-to-your-kubernetes-cluster) 12 | 4. [Check that Pachyderm is running on Kubernetes](README.md#4-check-that-pachyderm-is-running-on-kubernetes) 13 | 5. [Create the input "data repositories"](README.md#5-create-the-input-data-repositories) 14 | 6. [Commit the training data set into Pachyderm](README.md#6-commit-the-training-data-set-into-pachyderm) 15 | 7. [Create the training pipeline](README.md#7-create-the-training-pipeline) 16 | 8. [Commit input attributes](README.md#8-commit-input-attributes) 17 | 9. [Create the inference pipeline](README.md#9-create-the-inference-pipeline) 18 | 10. 
[Examine the results](README.md#10-examine-the-results) 19 | 20 | Bonus: 21 | 22 | 11. [Parallelize the inference](README.md#11-parallelize-the-inference) 23 | 12. [Update the model training](README.md#12-update-the-model-training) 24 | 13. [Update the training data set](README.md#13-update-the-training-data-set) 25 | 14. [Examine pipeline provenance](README.md#14-examine-pipeline-provenance) 26 | 27 | Finally, we provide some [Resources](README.md#resources) for further exploration. 28 | 29 | ## Prerequisites 30 | 31 | - Ability to `ssh` into a remote machine. 32 | - An IP for a remote machine (this should have been given to you at the beginning of the workshop). 33 | - Access to this repository on GitHub (we will clone it later on the remote machine). 34 | 35 | ## 1. Prepare a Julia program and Docker image for model training 36 | 37 | First, let's `ssh` into our development/workshop machine. As will be discussed later, this machine is connected to a running Kubernetes instance and a Pachyderm cluster, and it has everything you need to complete the workshop. You should have been given an IP for your dev machine at the beginning of the workshop. Use that IP where indicated (`<remote-IP>`) below to connect to the machine: 38 | 39 | ``` 40 | $ ssh pachrat@<remote-IP> 41 | ``` 42 | 43 | You will be prompted for a password that will be given out at the workshop. 44 | 45 | Now, let's learn how to prepare a containerized Julia program to train our iris species prediction model. Clone this GitHub repo on the dev machine as follows: 46 | 47 | ``` 48 | $ git clone https://github.com/dwhitena/julia-workshop.git 49 | ``` 50 | 51 | You should now see a folder containing the contents of this repo: 52 | 53 | ``` 54 | $ ls 55 | admin.conf julia-workshop 56 | ``` 57 | 58 | Navigate to the `train-tree` folder in this workshop repo. 
Here you will find a Julia program, [train.jl](train-tree/train.jl), that uses the `DecisionTree` package to train and export a model for predicting iris flower species: 59 | 60 | ``` 61 | $ cd julia-workshop/train-tree/ 62 | $ ls 63 | Dockerfile package_installs.jl train.jl 64 | ``` 65 | 66 | You can also see that we have a Julia program, `package_installs.jl`, that installs the necessary packages for our model training, and we have a "Dockerfile." This [Dockerfile](train-tree/Dockerfile) tells Docker how to build a Docker "image" for our model training. As you can see, in our Docker image, we are installing a couple of dependencies, running `package_installs.jl`, and adding our `train.jl` program. 67 | 68 | We have already pre-built this Docker image and uploaded it to Docker Hub [here](https://hub.docker.com/r/dwhitena/julia-train/) for use in this workshop. However, if you want to experiment with this image locally and/or build another Julia Docker image, you just need to [install Docker](https://docs.docker.com/engine/installation/) and "build" the image. To build the Docker image, you can run something similar to: 69 | 70 | (Note: the following won't work on the dev/workshop instance because of certain permissions, but you could use it to build the Docker image locally, assuming you have Docker installed. Please wait to install and run this until after the workshop so we don't take down the WiFi. haha) 71 | 72 | ``` 73 | $ docker build -t dwhitena/julia-train:tree . 
74 | Sending build context to Docker daemon 4.096kB 75 | Step 1/4 : FROM julia 76 | ---> f988666c0ef7 77 | Step 2/4 : ADD package_installs.jl /tmp/package_installs.jl 78 | ---> 649aff2f1f78 79 | Removing intermediate container fbdf08c34c45 80 | Step 3/4 : RUN apt-get update && apt-get install -y build-essential hdf5-tools && julia /tmp/package_installs.jl && rm -rf /var/lib/apt/lists/* 81 | ---> Running in 74fd7254bb87 82 | Get:1 http://security.debian.org jessie/updates InRelease [63.1 kB] 83 | Ign http://deb.debian.org jessie InRelease 84 | Get:2 http://deb.debian.org jessie-updates InRelease [145 kB] 85 | Get:3 http://deb.debian.org jessie Release.gpg [2373 B] 86 | Get:4 http://deb.debian.org jessie Release [148 kB] 87 | Get:5 http://security.debian.org jessie/updates/main amd64 Packages [523 kB] 88 | Get:6 http://deb.debian.org jessie-updates/main amd64 Packages [17.8 kB] 89 | Get:7 http://deb.debian.org jessie/main amd64 Packages [9065 kB] 90 | Fetched 9965 kB in 5s (1674 kB/s) 91 | Reading package lists... 92 | Reading package lists... 93 | Building dependency tree... 94 | Reading state information... 95 | The following extra packages will be installed: 96 | binutils bzip2 cpp cpp-4.9 dpkg-dev fakeroot g++ g++-4.9 gcc gcc-4.9 97 | libalgorithm-c3-perl libalgorithm-diff-perl libalgorithm-diff-xs-perl 98 | 99 | etc... 100 | ``` 101 | 102 | Your Docker image will then be listed under `docker images`: 103 | 104 | ``` 105 | $ docker images 106 | REPOSITORY TAG IMAGE ID CREATED SIZE 107 | 649aff2f1f78 7 hours ago 371MB 108 | c530a73337d8 7 hours ago 371MB 109 | dwhitena/iris-infer julia 5716b96aff25 21 hours ago 557MB 110 | dwhitena/julia-infer 5716b96aff25 21 hours ago 557MB 111 | dwhitena/iris-train julia-tree 28606eba05de 21 hours ago 557MB 112 | 113 | etc... 114 | ``` 115 | 116 | The Docker image can then be run manually (and interactively) as follows. 
We can see that Julia runs in the Docker container and that our `train.jl` program is included: 117 | 118 | ``` 119 | $ docker run -it dwhitena/julia-train:tree /bin/bash 120 | root@2862b6f9ea24:/# julia 121 | _ 122 | _ _ _(_)_ | A fresh approach to technical computing 123 | (_) | (_) (_) | Documentation: https://docs.julialang.org 124 | _ _ _| |_ __ _ | Type "?help" for help. 125 | | | | | | | |/ _` | | 126 | | | |_| | | | (_| | | Version 0.5.2 (2017-05-06 16:34 UTC) 127 | _/ |\__'_|_|_|\__'_| | Official http://julialang.org/ release 128 | |__/ | x86_64-pc-linux-gnu 129 | 130 | julia> 1+1 131 | 2 132 | 133 | julia> exit() 134 | root@2862b6f9ea24:/# cat /train.jl 135 | using DataFrames 136 | using DecisionTree 137 | using JLD 138 | 139 | # Read the iris data set. 140 | df = readtable(ARGS[1], header = false) 141 | 142 | # Get the features and labels. 143 | features = convert(Array, df[:, 1:4]) 144 | labels = convert(Array, df[:, 5]) 145 | 146 | # Train decision tree classifier. 147 | model = DecisionTreeClassifier(pruning_purity_threshold=0.9, maxdepth=6) 148 | DecisionTree.fit!(model, features, labels) 149 | 150 | # Save the model. 151 | save(ARGS[2], "model", model) 152 | 153 | root@2862b6f9ea24:/# 154 | ``` 155 | 156 | As mentioned, we have already uploaded this image to Docker Hub for use in this workshop. This was done by building the image as above and using `docker push` to push the image to the public Docker Hub registry. When we use the image later in the workshop, we will be "pulling" that image from Docker Hub, tagged as `dwhitena/julia-train:tree`. 157 | 158 | ## 2. Prepare a Julia program and Docker image for inference 159 | 160 | Similar to the process in section 1, we have created a Julia program, [infer.jl](infer/infer.jl), and a corresponding Docker image to be used for inference in our ML pipeline. This Docker image is uploaded to Docker Hub as [dwhitena/julia-infer](https://hub.docker.com/r/dwhitena/julia-infer/). 
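As with the training image, you could build and push this inference image yourself after the workshop. A minimal sketch (here `<your-user>` is a placeholder for your own Docker Hub username, and we assume Docker is installed and you are logged in via `docker login`):

```
$ cd julia-workshop/infer/
$ docker build -t <your-user>/julia-infer .
$ docker push <your-user>/julia-infer
```
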
161 | 162 | `infer.jl` does a few things: 163 | 164 | - takes a trained, persisted `model.jld` as input (the output of `train.jl`) 165 | - takes a directory as input 166 | - walks over files in that directory, where the files are sets of new iris attributes 167 | - infers the species for each set of attributes 168 | - outputs the inferred species to a specified output directory 169 | 170 | ## 3. Connect to your Kubernetes cluster 171 | 172 | On your dev/workshop machine, you should be connected to a running Kubernetes cluster. You can interact with this cluster via the Kubernetes CLI, `kubectl`. As a sanity check, you can make sure that Kubernetes is up and running as follows: 173 | 174 | ``` 175 | $ kubectl get all 176 | NAME READY STATUS RESTARTS AGE 177 | po/etcd-4197107720-906b7 1/1 Running 0 31m 178 | po/pachd-3548222380-cm1ts 1/1 Running 0 31m 179 | 180 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 181 | svc/etcd 10.97.253.64 2379:32379/TCP 31m 182 | svc/kubernetes 10.96.0.1 443/TCP 32m 183 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 31m 184 | 185 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 186 | deploy/etcd 1 1 1 1 31m 187 | deploy/pachd 1 1 1 1 31m 188 | 189 | NAME DESIRED CURRENT READY AGE 190 | rs/etcd-4197107720 1 1 1 31m 191 | rs/pachd-3548222380 1 1 1 31m 192 | ``` 193 | 194 | ## 4. Check that Pachyderm is running on Kubernetes 195 | 196 | Pachyderm should also be running on your Kubernetes cluster. To verify that everything is running correctly on the machine, you should be able to run the following and see the corresponding response: 197 | 198 | ``` 199 | $ pachctl version 200 | No config detected. 201 | Default config created at /home/pachrat/.pachyderm/config.json 202 | COMPONENT VERSION 203 | pachctl 1.4.7-RC1 204 | pachd 1.4.7-RC1 205 | ``` 206 | 207 | (Note: this was the first time `pachctl` was run on the machine, so a `pachctl` config was automatically created.) 208 | 209 | ## 5. 
Create the input data repositories 210 | 211 | On the Pachyderm cluster running on your remote machine, we will need to create the two input data repositories (for our training data and input iris attributes). To do this, run: 212 | 213 | ``` 214 | $ pachctl create-repo training 215 | $ pachctl create-repo attributes 216 | ``` 217 | 218 | As a sanity check, we can list out the current repos, and you should see the two repos you just created: 219 | 220 | ``` 221 | $ pachctl list-repo 222 | NAME CREATED SIZE 223 | attributes 5 seconds ago 0 B 224 | training 8 seconds ago 0 B 225 | ``` 226 | 227 | ## 6. Commit the training data set into Pachyderm 228 | 229 | We have our training data repository, but we haven't put our training data set into this repository yet. The training data set, `iris.csv`, is included here in the [data](data) directory. 230 | 231 | To get this data into Pachyderm, navigate to this directory and run: 232 | 233 | ``` 234 | $ cd /home/pachrat/julia-workshop/data 235 | $ pachctl put-file training master -c -f iris.csv 236 | ``` 237 | 238 | Then, you should be able to see the following: 239 | 240 | ``` 241 | $ pachctl list-repo 242 | NAME CREATED SIZE 243 | training 3 minutes ago 4.444 KiB 244 | attributes 3 minutes ago 0 B 245 | $ pachctl list-file training master 246 | NAME TYPE SIZE 247 | iris.csv file 4.444 KiB 248 | ``` 249 | 250 | ## 7. Create the training pipeline 251 | 252 | Next, we can create the `model` pipeline stage to process the data in the training repository. To do this, we just need to provide Pachyderm with [a JSON pipeline specification](train.json) that tells Pachyderm how to process the data. Once you have that pipeline spec, creating the training pipeline is as easy as: 253 | 254 | ``` 255 | $ cd .. 
256 | $ pachctl create-pipeline -f train.json 257 | ``` 258 | 259 | Immediately, you will notice that Pachyderm has kicked off a job to perform the model training: 260 | 261 | ``` 262 | $ pachctl list-job 263 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 264 | a0d78926-ce2a-491a-b926-90043bce7371 model/- 12 seconds ago - 0 0 / 1 running 265 | ``` 266 | 267 | This job should run for about 1-2 minutes (subsequent runs are faster; the first run has to pull the Docker image). After your model has successfully been trained, you should see: 268 | 269 | ``` 270 | $ pachctl list-job 271 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 272 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 2 minutes ago About a minute 0 1 / 1 success 273 | $ pachctl list-repo 274 | NAME CREATED SIZE 275 | model 2 minutes ago 43.67 KiB 276 | training 8 minutes ago 4.444 KiB 277 | attributes 7 minutes ago 0 B 278 | $ pachctl list-file model master 279 | NAME TYPE SIZE 280 | model.jld file 43.67 KiB 281 | ``` 282 | 283 | ## 8. Commit input attributes 284 | 285 | Great! We now have a trained model that will infer the species of iris flowers. Let's commit some iris attributes into Pachyderm that we would like to run through the inference. We have a couple of examples under [data/test](data/test). Feel free to use these or create your own. To commit these samples (assuming you have cloned this repo on the remote machine), you can run: 286 | 287 | ``` 288 | $ cd /home/pachrat/julia-workshop/data/test/ 289 | $ pachctl put-file attributes master -c -r -f . 290 | ``` 291 | 292 | You should then see: 293 | 294 | ``` 295 | $ pachctl list-file attributes master 296 | NAME TYPE SIZE 297 | 1.csv file 16 B 298 | 2.csv file 96 B 299 | ``` 300 | 301 | ## 9. Create the inference pipeline 302 | 303 | We have another JSON specification, [infer.json](infer.json), that tells Pachyderm how to perform the processing for the inference stage. 
This is similar to our last JSON specification except, in this case, we have two input repositories (the `attributes` and the `model`) and we are using a different Docker image that contains `infer.jl`. To create the inference stage, we simply run: 304 | 305 | ``` 306 | $ cd ../../ 307 | $ pachctl create-pipeline -f infer.json 308 | ``` 309 | 310 | This will immediately kick off an inference job, because we have committed unprocessed attributes into the `attributes` repo. The results will then be versioned in a corresponding `inference` data repository: 311 | 312 | ``` 313 | $ pachctl list-job 314 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 315 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/- 33 seconds ago - 0 0 / 2 running 316 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 7 minutes ago About a minute 0 1 / 1 success 317 | $ pachctl list-job 318 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 319 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb About a minute ago About a minute 0 2 / 2 success 320 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 8 minutes ago About a minute 0 1 / 1 success 321 | $ pachctl list-repo 322 | NAME CREATED SIZE 323 | inference About a minute ago 100 B 324 | attributes 13 minutes ago 112 B 325 | model 8 minutes ago 43.67 KiB 326 | training 13 minutes ago 4.444 KiB 327 | ``` 328 | 329 | ## 10. Examine the results 330 | 331 | We have created results from the inference, but how do we examine those results? 
There are multiple ways, but an easy way is to just "get" the specific files out of Pachyderm's data versioning: 332 | 333 | ``` 334 | $ pachctl list-file inference master 335 | NAME TYPE SIZE 336 | 1.csv file 15 B 337 | 2.csv file 85 B 338 | $ pachctl get-file inference master 1.csv 339 | Iris-virginica 340 | $ pachctl get-file inference master 2.csv 341 | Iris-versicolor 342 | Iris-virginica 343 | Iris-virginica 344 | Iris-virginica 345 | Iris-setosa 346 | Iris-setosa 347 | ``` 348 | 349 | Here we can see that each result file contains a predicted iris flower species corresponding to each set of input attributes. 350 | 351 | ## Bonus exercises 352 | 353 | You may not get to all of these bonus exercises during the workshop time, but you can perform these and all of the above steps any time you like with a [simple local Pachyderm install](http://docs.pachyderm.io/en/latest/getting_started/local_installation.html). You can spin up this local version of Pachyderm in just a few commands and experiment with this, [other Pachyderm examples](http://docs.pachyderm.io/en/latest/examples/readme.html), and/or your own pipelines. 354 | 355 | ### 11. Parallelize the inference 356 | 357 | You may have noticed that our pipeline specs included a `parallelism_spec` field. This tells Pachyderm how to parallelize a particular pipeline stage. Let's say that in production we start receiving a huge number of attribute files, and we need to keep up with our inference. In particular, let's say we want to spin up 10 inference workers to perform inference in parallel. 358 | 359 | This actually doesn't require any change to our code. We can simply change our `parallelism_spec` in `infer.json` to: 360 | 361 | ``` 362 | "parallelism_spec": { 363 | "strategy": "CONSTANT", 364 | "constant": "10" 365 | }, 366 | ``` 367 | 368 | Pachyderm will then spin up 10 inference workers, each running our same `infer.jl` script, to perform inference in parallel. 
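The reason this parallelizes cleanly is Pachyderm's notion of "datums": the glob patterns in the pipeline spec define how the input is split up. In the infer.json used here, each file in `attributes` is its own datum (glob `/*`), while the whole `model` repo is a single datum (glob `/`), so each worker receives the model plus some subset of the attribute files:

```
"input": {
  "cross": [
    { "atom": { "repo": "attributes", "glob": "/*" } },
    { "atom": { "repo": "model", "glob": "/" } }
  ]
}
```

With `"constant": "10"`, Pachyderm simply spreads those datums across ten workers.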
This can be confirmed by updating our pipeline and then examining the cluster: 369 | 370 | ``` 371 | $ vim infer.json 372 | $ pachctl update-pipeline -f infer.json 373 | $ kubectl get all 374 | NAME READY STATUS RESTARTS AGE 375 | po/etcd-4197107720-906b7 1/1 Running 0 52m 376 | po/pachd-3548222380-cm1ts 1/1 Running 0 52m 377 | po/pipeline-inference-v1-vsq8x 2/2 Terminating 0 6m 378 | po/pipeline-inference-v2-0w438 0/2 Init:0/1 0 5s 379 | po/pipeline-inference-v2-1tdm7 0/2 Pending 0 5s 380 | po/pipeline-inference-v2-2tqtl 0/2 Init:0/1 0 5s 381 | po/pipeline-inference-v2-6x917 0/2 Init:0/1 0 5s 382 | po/pipeline-inference-v2-cc5jz 0/2 Init:0/1 0 5s 383 | po/pipeline-inference-v2-cphcd 0/2 Init:0/1 0 5s 384 | po/pipeline-inference-v2-d5rc0 0/2 Init:0/1 0 5s 385 | po/pipeline-inference-v2-lhpcv 0/2 Init:0/1 0 5s 386 | po/pipeline-inference-v2-mpzwf 0/2 Pending 0 5s 387 | po/pipeline-inference-v2-p753f 0/2 Init:0/1 0 5s 388 | po/pipeline-model-v1-1gqv2 2/2 Running 0 13m 389 | 390 | NAME DESIRED CURRENT READY AGE 391 | rc/pipeline-inference-v2 10 10 0 5s 392 | rc/pipeline-model-v1 1 1 1 13m 393 | 394 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 395 | svc/etcd 10.97.253.64 2379:32379/TCP 52m 396 | svc/kubernetes 10.96.0.1 443/TCP 53m 397 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 52m 398 | svc/pipeline-inference-v2 10.99.47.41 80/TCP 5s 399 | svc/pipeline-model-v1 10.109.198.229 80/TCP 13m 400 | 401 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 402 | deploy/etcd 1 1 1 1 52m 403 | deploy/pachd 1 1 1 1 52m 404 | 405 | NAME DESIRED CURRENT READY AGE 406 | rs/etcd-4197107720 1 1 1 52m 407 | rs/pachd-3548222380 1 1 1 52m 408 | $ kubectl get all 409 | NAME READY STATUS RESTARTS AGE 410 | po/etcd-4197107720-906b7 1/1 Running 0 53m 411 | po/pachd-3548222380-cm1ts 1/1 Running 0 53m 412 | po/pipeline-inference-v2-0w438 2/2 Running 0 40s 413 | po/pipeline-inference-v2-1tdm7 2/2 Running 0 40s 414 | po/pipeline-inference-v2-2tqtl 2/2 Running 0 40s 415 | 
po/pipeline-inference-v2-6x917 2/2 Running 0 40s 416 | po/pipeline-inference-v2-cc5jz 2/2 Running 0 40s 417 | po/pipeline-inference-v2-cphcd 2/2 Running 0 40s 418 | po/pipeline-inference-v2-d5rc0 2/2 Running 0 40s 419 | po/pipeline-inference-v2-lhpcv 2/2 Running 0 40s 420 | po/pipeline-inference-v2-mpzwf 2/2 Running 0 40s 421 | po/pipeline-inference-v2-p753f 2/2 Running 0 40s 422 | po/pipeline-model-v1-1gqv2 2/2 Running 0 14m 423 | 424 | NAME DESIRED CURRENT READY AGE 425 | rc/pipeline-inference-v2 10 10 10 40s 426 | rc/pipeline-model-v1 1 1 1 14m 427 | 428 | NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE 429 | svc/etcd 10.97.253.64 2379:32379/TCP 53m 430 | svc/kubernetes 10.96.0.1 443/TCP 54m 431 | svc/pachd 10.108.55.75 650:30650/TCP,651:30651/TCP 53m 432 | svc/pipeline-inference-v2 10.99.47.41 80/TCP 40s 433 | svc/pipeline-model-v1 10.109.198.229 80/TCP 14m 434 | 435 | NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE 436 | deploy/etcd 1 1 1 1 53m 437 | deploy/pachd 1 1 1 1 53m 438 | 439 | NAME DESIRED CURRENT READY AGE 440 | rs/etcd-4197107720 1 1 1 53m 441 | rs/pachd-3548222380 1 1 1 53m 442 | ``` 443 | 444 | ### 12. Update the model training 445 | 446 | You might have noticed that this repo includes two versions of the training program `train.jl`. There is [train-tree/train.jl](train-tree/train.jl), which we used before in our pipeline, and then there is [train-forest/train.jl](train-forest/train.jl), which does similar training with a Random Forest model. 447 | 448 | Let's now imagine that we want to update our model to this random forest version. 
To do this, modify the image tag in `train.json`: 449 | 450 | ``` 451 | "image": "dwhitena/julia-train:forest", 452 | ``` 453 | 454 | Once you modify the spec, you can update the pipeline by running: 455 | 456 | ``` 457 | $ pachctl update-pipeline -f train.json 458 | ``` 459 | 460 | Pachyderm will then automatically kick off a new job to retrain our model with the random forest algorithm: 461 | 462 | ``` 463 | $ pachctl list-job 464 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 465 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/- 3 seconds ago - 0 0 / 1 running 466 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 11 minutes ago About a minute 0 2 / 2 success 467 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 19 minutes ago About a minute 0 1 / 1 success 468 | ``` 469 | 470 | Not only that: once the model is retrained, Pachyderm sees the new model and updates our inferences with the latest version of the model: 471 | 472 | ``` 473 | $ pachctl list-job 474 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 475 | 0477e755-79b4-4b14-ac04-5416d9a80cf3 inference/5dec44a330d24a1cb3822610c886489b 53 seconds ago 44 seconds 0 2 / 2 success 476 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/444b5950bcb642cfba5b087286640898 About a minute ago 56 seconds 0 1 / 1 success 477 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 13 minutes ago About a minute 0 2 / 2 success 478 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 20 minutes ago About a minute 0 1 / 1 success 479 | ``` 480 | 481 | ### 13. Update the training data set 482 | 483 | Let's say that one or more observations in our training data set were corrupt or unwanted, so we want to update our training data set. To simulate this, go ahead and open up `iris.csv` (e.g., with `vim`) and remove a couple of rows (the file has no header row, so any rows will do). 
Then, let's replace our training set: 484 | 485 | ``` 486 | $ pachctl start-commit training master 487 | 9cc070dadc344150ac4ceef2f0758509 488 | $ pachctl delete-file training 9cc070dadc344150ac4ceef2f0758509 iris.csv 489 | $ pachctl put-file training 9cc070dadc344150ac4ceef2f0758509 -f iris.csv 490 | $ pachctl finish-commit training 9cc070dadc344150ac4ceef2f0758509 491 | ``` 492 | 493 | Immediately, Pachyderm "knows" that the data has been updated, and it starts new jobs to update the model and the inferences. 494 | 495 | ### 14. Examine pipeline provenance 496 | 497 | Let's say that we have updated our model or training set in one of the above scenarios (sections 12 or 13). Now we have multiple inferences that were made with different models and/or training data sets. How can we know which results came from which specific models and/or training data sets? This is called "provenance," and Pachyderm gives it to you out of the box. 498 | 499 | Suppose we have run the following jobs: 500 | 501 | ``` 502 | $ pachctl list-job 503 | ID OUTPUT COMMIT STARTED DURATION RESTART PROGRESS STATE 504 | 0477e755-79b4-4b14-ac04-5416d9a80cf3 inference/5dec44a330d24a1cb3822610c886489b 6 minutes ago 44 seconds 0 2 / 2 success 505 | 7d913835-2c0a-42a3-bfa2-c8a5941ceaa5 model/444b5950bcb642cfba5b087286640898 7 minutes ago 56 seconds 0 1 / 1 success 506 | 2560f096-0515-4d68-be66-35e3b4f3e730 inference/db9e675de0274ce9a73d3fc9dd50fd51 13 minutes ago About a minute 1 2 / 2 success 507 | 21552ae0-b0a9-4089-bfa5-d74a4a9befd7 inference/c4f6b269ad0349469effee39cc9ee8fb 19 minutes ago About a minute 0 2 / 2 success 508 | a0d78926-ce2a-491a-b926-90043bce7371 model/98e55f3bccc6444a888b1adbed4bba8b 26 minutes ago About a minute 0 1 / 1 success 509 | ``` 510 | 511 | If we want to know which model and training data set were used for the latest inference, commit id `5dec44a330d24a1cb3822610c886489b`, we just need to inspect the particular commit: 512 | 513 | ``` 514 | $ pachctl inspect-commit inference 
5dec44a330d24a1cb3822610c886489b 516 | Commit: inference/5dec44a330d24a1cb3822610c886489b 517 | Parent: db9e675de0274ce9a73d3fc9dd50fd51 518 | Started: 6 minutes ago 519 | Finished: 6 minutes ago 520 | Size: 100 B 521 | Provenance: training/9881a078c14c47e9b71fcc626a86499f attributes/f62907bda09d48cfa817476fd3e4f07f model/444b5950bcb642cfba5b087286640898 522 | ``` 523 | 524 | The `Provenance` field tells us exactly which model and training set were used (along with which commit to `attributes` triggered the inference). For example, if we wanted to see the exact model used, we would just need to reference commit `444b5950bcb642cfba5b087286640898` in the `model` repo: 525 | 526 | ``` 527 | $ pachctl list-file model 444b5950bcb642cfba5b087286640898 528 | NAME TYPE SIZE 529 | model.jld file 70.34 KiB 530 | ``` 531 | 532 | We could then get this model to examine it, rerun it, or revert to a different model. 533 | 534 | ## Resources 535 | 536 | Docker: 537 | 538 | - [Install Docker](https://docs.docker.com/engine/installation/) locally. 539 | - Start with the official [Julia base image](https://hub.docker.com/_/julia/). 540 | - Check out the [Docker docs](https://docs.docker.com/). 541 | 542 | Kubernetes: 543 | 544 | - Start by playing with [minikube](https://kubernetes.io/docs/tutorials/stateless-application/hello-minikube/). 545 | - Check out the [Kubernetes docs](https://kubernetes.io/docs/home/). 546 | 547 | Pachyderm: 548 | 549 | - Join the [Pachyderm Slack team](http://slack.pachyderm.io/) to ask questions, get help, and talk about production deploys. 
550 | - Follow [Pachyderm on Twitter](https://twitter.com/pachydermIO), 551 | - Find [Pachyderm on GitHub](https://github.com/pachyderm/pachyderm), and 552 | - [Spin up Pachyderm](http://docs.pachyderm.io/en/latest/getting_started/getting_started.html) in just a few commands to try this and [other examples](http://docs.pachyderm.io/en/latest/examples/readme.html) locally. 553 | 554 | -------------------------------------------------------------------------------- /data/iris.csv: -------------------------------------------------------------------------------- 1 | 5.1,3.5,1.4,0.2,Iris-setosa 2 | 4.9,3.0,1.4,0.2,Iris-setosa 3 | 4.7,3.2,1.3,0.2,Iris-setosa 4 | 4.6,3.1,1.5,0.2,Iris-setosa 5 | 5.0,3.6,1.4,0.2,Iris-setosa 6 | 5.4,3.9,1.7,0.4,Iris-setosa 7 | 4.6,3.4,1.4,0.3,Iris-setosa 8 | 5.0,3.4,1.5,0.2,Iris-setosa 9 | 4.4,2.9,1.4,0.2,Iris-setosa 10 | 4.9,3.1,1.5,0.1,Iris-setosa 11 | 5.4,3.7,1.5,0.2,Iris-setosa 12 | 4.8,3.4,1.6,0.2,Iris-setosa 13 | 4.8,3.0,1.4,0.1,Iris-setosa 14 | 4.3,3.0,1.1,0.1,Iris-setosa 15 | 5.8,4.0,1.2,0.2,Iris-setosa 16 | 5.7,4.4,1.5,0.4,Iris-setosa 17 | 5.4,3.9,1.3,0.4,Iris-setosa 18 | 5.1,3.5,1.4,0.3,Iris-setosa 19 | 5.7,3.8,1.7,0.3,Iris-setosa 20 | 5.1,3.8,1.5,0.3,Iris-setosa 21 | 5.4,3.4,1.7,0.2,Iris-setosa 22 | 5.1,3.7,1.5,0.4,Iris-setosa 23 | 4.6,3.6,1.0,0.2,Iris-setosa 24 | 5.1,3.3,1.7,0.5,Iris-setosa 25 | 4.8,3.4,1.9,0.2,Iris-setosa 26 | 5.0,3.0,1.6,0.2,Iris-setosa 27 | 5.0,3.4,1.6,0.4,Iris-setosa 28 | 5.2,3.5,1.5,0.2,Iris-setosa 29 | 5.2,3.4,1.4,0.2,Iris-setosa 30 | 4.7,3.2,1.6,0.2,Iris-setosa 31 | 4.8,3.1,1.6,0.2,Iris-setosa 32 | 5.4,3.4,1.5,0.4,Iris-setosa 33 | 5.2,4.1,1.5,0.1,Iris-setosa 34 | 5.5,4.2,1.4,0.2,Iris-setosa 35 | 4.9,3.1,1.5,0.1,Iris-setosa 36 | 5.0,3.2,1.2,0.2,Iris-setosa 37 | 5.5,3.5,1.3,0.2,Iris-setosa 38 | 4.9,3.1,1.5,0.1,Iris-setosa 39 | 4.4,3.0,1.3,0.2,Iris-setosa 40 | 5.1,3.4,1.5,0.2,Iris-setosa 41 | 5.0,3.5,1.3,0.3,Iris-setosa 42 | 4.5,2.3,1.3,0.3,Iris-setosa 43 | 4.4,3.2,1.3,0.2,Iris-setosa 44 | 
5.0,3.5,1.6,0.6,Iris-setosa 45 | 5.1,3.8,1.9,0.4,Iris-setosa 46 | 4.8,3.0,1.4,0.3,Iris-setosa 47 | 5.1,3.8,1.6,0.2,Iris-setosa 48 | 4.6,3.2,1.4,0.2,Iris-setosa 49 | 5.3,3.7,1.5,0.2,Iris-setosa 50 | 5.0,3.3,1.4,0.2,Iris-setosa 51 | 7.0,3.2,4.7,1.4,Iris-versicolor 52 | 6.4,3.2,4.5,1.5,Iris-versicolor 53 | 6.9,3.1,4.9,1.5,Iris-versicolor 54 | 5.5,2.3,4.0,1.3,Iris-versicolor 55 | 6.5,2.8,4.6,1.5,Iris-versicolor 56 | 5.7,2.8,4.5,1.3,Iris-versicolor 57 | 6.3,3.3,4.7,1.6,Iris-versicolor 58 | 4.9,2.4,3.3,1.0,Iris-versicolor 59 | 6.6,2.9,4.6,1.3,Iris-versicolor 60 | 5.2,2.7,3.9,1.4,Iris-versicolor 61 | 5.0,2.0,3.5,1.0,Iris-versicolor 62 | 5.9,3.0,4.2,1.5,Iris-versicolor 63 | 6.0,2.2,4.0,1.0,Iris-versicolor 64 | 6.1,2.9,4.7,1.4,Iris-versicolor 65 | 5.6,2.9,3.6,1.3,Iris-versicolor 66 | 6.7,3.1,4.4,1.4,Iris-versicolor 67 | 5.6,3.0,4.5,1.5,Iris-versicolor 68 | 5.8,2.7,4.1,1.0,Iris-versicolor 69 | 6.2,2.2,4.5,1.5,Iris-versicolor 70 | 5.6,2.5,3.9,1.1,Iris-versicolor 71 | 5.9,3.2,4.8,1.8,Iris-versicolor 72 | 6.1,2.8,4.0,1.3,Iris-versicolor 73 | 6.3,2.5,4.9,1.5,Iris-versicolor 74 | 6.1,2.8,4.7,1.2,Iris-versicolor 75 | 6.4,2.9,4.3,1.3,Iris-versicolor 76 | 6.6,3.0,4.4,1.4,Iris-versicolor 77 | 6.8,2.8,4.8,1.4,Iris-versicolor 78 | 6.7,3.0,5.0,1.7,Iris-versicolor 79 | 6.0,2.9,4.5,1.5,Iris-versicolor 80 | 5.7,2.6,3.5,1.0,Iris-versicolor 81 | 5.5,2.4,3.8,1.1,Iris-versicolor 82 | 5.5,2.4,3.7,1.0,Iris-versicolor 83 | 5.8,2.7,3.9,1.2,Iris-versicolor 84 | 6.0,2.7,5.1,1.6,Iris-versicolor 85 | 5.4,3.0,4.5,1.5,Iris-versicolor 86 | 6.0,3.4,4.5,1.6,Iris-versicolor 87 | 6.7,3.1,4.7,1.5,Iris-versicolor 88 | 6.3,2.3,4.4,1.3,Iris-versicolor 89 | 5.6,3.0,4.1,1.3,Iris-versicolor 90 | 5.5,2.5,4.0,1.3,Iris-versicolor 91 | 5.5,2.6,4.4,1.2,Iris-versicolor 92 | 6.1,3.0,4.6,1.4,Iris-versicolor 93 | 5.8,2.6,4.0,1.2,Iris-versicolor 94 | 5.0,2.3,3.3,1.0,Iris-versicolor 95 | 5.6,2.7,4.2,1.3,Iris-versicolor 96 | 5.7,3.0,4.2,1.2,Iris-versicolor 97 | 5.7,2.9,4.2,1.3,Iris-versicolor 98 | 
6.2,2.9,4.3,1.3,Iris-versicolor 99 | 5.1,2.5,3.0,1.1,Iris-versicolor 100 | 5.7,2.8,4.1,1.3,Iris-versicolor 101 | 6.3,3.3,6.0,2.5,Iris-virginica 102 | 5.8,2.7,5.1,1.9,Iris-virginica 103 | 7.1,3.0,5.9,2.1,Iris-virginica 104 | 6.3,2.9,5.6,1.8,Iris-virginica 105 | 6.5,3.0,5.8,2.2,Iris-virginica 106 | 7.6,3.0,6.6,2.1,Iris-virginica 107 | 4.9,2.5,4.5,1.7,Iris-virginica 108 | 7.3,2.9,6.3,1.8,Iris-virginica 109 | 6.7,2.5,5.8,1.8,Iris-virginica 110 | 7.2,3.6,6.1,2.5,Iris-virginica 111 | 6.5,3.2,5.1,2.0,Iris-virginica 112 | 6.4,2.7,5.3,1.9,Iris-virginica 113 | 6.8,3.0,5.5,2.1,Iris-virginica 114 | 5.7,2.5,5.0,2.0,Iris-virginica 115 | 5.8,2.8,5.1,2.4,Iris-virginica 116 | 6.4,3.2,5.3,2.3,Iris-virginica 117 | 6.5,3.0,5.5,1.8,Iris-virginica 118 | 7.7,3.8,6.7,2.2,Iris-virginica 119 | 7.7,2.6,6.9,2.3,Iris-virginica 120 | 6.0,2.2,5.0,1.5,Iris-virginica 121 | 6.9,3.2,5.7,2.3,Iris-virginica 122 | 5.6,2.8,4.9,2.0,Iris-virginica 123 | 7.7,2.8,6.7,2.0,Iris-virginica 124 | 6.3,2.7,4.9,1.8,Iris-virginica 125 | 6.7,3.3,5.7,2.1,Iris-virginica 126 | 7.2,3.2,6.0,1.8,Iris-virginica 127 | 6.2,2.8,4.8,1.8,Iris-virginica 128 | 6.1,3.0,4.9,1.8,Iris-virginica 129 | 6.4,2.8,5.6,2.1,Iris-virginica 130 | 7.2,3.0,5.8,1.6,Iris-virginica 131 | 7.4,2.8,6.1,1.9,Iris-virginica 132 | 7.9,3.8,6.4,2.0,Iris-virginica 133 | 6.4,2.8,5.6,2.2,Iris-virginica 134 | 6.3,2.8,5.1,1.5,Iris-virginica 135 | 6.1,2.6,5.6,1.4,Iris-virginica 136 | 7.7,3.0,6.1,2.3,Iris-virginica 137 | 6.3,3.4,5.6,2.4,Iris-virginica 138 | 6.4,3.1,5.5,1.8,Iris-virginica 139 | 6.0,3.0,4.8,1.8,Iris-virginica 140 | 6.9,3.1,5.4,2.1,Iris-virginica 141 | 6.7,3.1,5.6,2.4,Iris-virginica 142 | 6.9,3.1,5.1,2.3,Iris-virginica 143 | 5.8,2.7,5.1,1.9,Iris-virginica 144 | 6.8,3.2,5.9,2.3,Iris-virginica 145 | 6.7,3.3,5.7,2.5,Iris-virginica 146 | 6.7,3.0,5.2,2.3,Iris-virginica 147 | 6.3,2.5,5.0,1.9,Iris-virginica 148 | 6.5,3.0,5.2,2.0,Iris-virginica 149 | 6.2,3.4,5.4,2.3,Iris-virginica 150 | 5.9,3.0,5.1,1.8,Iris-virginica 151 | 152 | 
--------------------------------------------------------------------------------
/data/test/1.csv:
--------------------------------------------------------------------------------
5.9,3.0,5.1,1.8
--------------------------------------------------------------------------------
/data/test/2.csv:
--------------------------------------------------------------------------------
5.7,2.8,4.1,1.3
6.3,3.3,6.0,2.5
5.8,2.7,5.1,1.9
7.1,3.0,5.9,2.1
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
--------------------------------------------------------------------------------
/infer.json:
--------------------------------------------------------------------------------
{
  "pipeline": {
    "name": "inference"
  },
  "transform": {
    "image": "dwhitena/julia-infer",
    "cmd": [
      "julia",
      "/infer.jl",
      "/pfs/model/model.jld",
      "/pfs/attributes/",
      "/pfs/out/"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": "1"
  },
  "input": {
    "cross": [
      {
        "atom": {
          "repo": "attributes",
          "glob": "/*"
        }
      },
      {
        "atom": {
          "repo": "model",
          "glob": "/"
        }
      }
    ]
  }
}
--------------------------------------------------------------------------------
/infer/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD infer.jl /infer.jl
--------------------------------------------------------------------------------
/infer/infer.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Load our model.
model = load(ARGS[1], "model")

# Walk over the directory with input attribute files.
attributes = readdir(ARGS[2])
for file in attributes
    p = joinpath(ARGS[2], file)
    if isdir(p)
        continue
    elseif isfile(p)
        df = readtable(p, header = false)
        open(joinpath(ARGS[3], file), "a") do x
            for r in eachrow(df)
                prediction = DecisionTree.predict(model, convert(Array, r))
                write(x, string(prediction[1], "\n"))
            end
        end
    end
end
--------------------------------------------------------------------------------
/infer/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/pipeline.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dwhitena/julia-workshop/afc5d6889707e549ceeedb3f30e235b6dd87ef11/pipeline.png
--------------------------------------------------------------------------------
/train-forest/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD train.jl /train.jl
--------------------------------------------------------------------------------
/train-forest/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/train-forest/train.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Read the iris data set.
df = readtable(ARGS[1], header = false)

# Get the features and labels.
features = convert(Array, df[:, 1:4])
labels = convert(Array, df[:, 5])

# Train a random forest classifier.
model = RandomForestClassifier(ntrees=3, partialsampling=0.7)
DecisionTree.fit!(model, features, labels)

# Save the model.
save(ARGS[2], "model", model)
--------------------------------------------------------------------------------
/train-tree/Dockerfile:
--------------------------------------------------------------------------------
FROM julia

# Install packages.
ADD package_installs.jl /tmp/package_installs.jl
RUN apt-get update && \
    apt-get install -y build-essential hdf5-tools && \
    julia /tmp/package_installs.jl && \
    rm -rf /var/lib/apt/lists/*

# Add our program.
ADD train.jl /train.jl
--------------------------------------------------------------------------------
/train-tree/package_installs.jl:
--------------------------------------------------------------------------------
metadata_packages = [
    "DataFrames",
    "DecisionTree",
    "HDF5",
    "JLD"]

Pkg.init()
Pkg.update()

for package = metadata_packages
    Pkg.add(package)
end

Pkg.resolve()
--------------------------------------------------------------------------------
/train-tree/train.jl:
--------------------------------------------------------------------------------
using DataFrames
using DecisionTree
using JLD

# Read the iris data set.
df = readtable(ARGS[1], header = false)

# Get the features and labels.
features = convert(Array, df[:, 1:4])
labels = convert(Array, df[:, 5])

# Train a decision tree classifier.
model = DecisionTreeClassifier(pruning_purity_threshold=0.9, maxdepth=6)
DecisionTree.fit!(model, features, labels)

# Save the model.
save(ARGS[2], "model", model)
--------------------------------------------------------------------------------
/train.json:
--------------------------------------------------------------------------------
{
  "pipeline": {
    "name": "model"
  },
  "transform": {
    "image": "dwhitena/julia-train:tree",
    "cmd": [
      "julia",
      "/train.jl",
      "/pfs/training/iris.csv",
      "/pfs/out/model.jld"
    ]
  },
  "parallelism_spec": {
    "strategy": "CONSTANT",
    "constant": "1"
  },
  "input": {
    "atom": {
      "repo": "training",
      "glob": "/"
    }
  }
}
--------------------------------------------------------------------------------