├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── arch.png
├── argo_workflow.png
├── argo_workflow.yaml
├── base
│   ├── Dockerfile
│   ├── Makefile
│   ├── requirements.txt
│   └── src
│       ├── data
│       │   └── hub_stackshare_combined_v2.csv.gz
│       ├── fetch_gihub_data.py
│       ├── models
│       │   ├── DockerHubClassification.py
│       │   └── requirements.txt
│       ├── process_data.py
│       └── train.py
├── d4m_menu.png
└── seldon_core.png
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contribution guidelines
2 |
3 | Since this project is intended to support a specific use case, contributions are limited to bug fixes or security issues. If you have a question, feel free to open an issue!
4 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Apache License
2 | Version 2.0, January 2004
3 | http://www.apache.org/licenses/
4 |
5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
6 |
7 | 1. Definitions.
8 |
9 | "License" shall mean the terms and conditions for use, reproduction,
10 | and distribution as defined by Sections 1 through 9 of this document.
11 |
12 | "Licensor" shall mean the copyright owner or entity authorized by
13 | the copyright owner that is granting the License.
14 |
15 | "Legal Entity" shall mean the union of the acting entity and all
16 | other entities that control, are controlled by, or are under common
17 | control with that entity. For the purposes of this definition,
18 | "control" means (i) the power, direct or indirect, to cause the
19 | direction or management of such entity, whether by contract or
20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
21 | outstanding shares, or (iii) beneficial ownership of such entity.
22 |
23 | "You" (or "Your") shall mean an individual or Legal Entity
24 | exercising permissions granted by this License.
25 |
26 | "Source" form shall mean the preferred form for making modifications,
27 | including but not limited to software source code, documentation
28 | source, and configuration files.
29 |
30 | "Object" form shall mean any form resulting from mechanical
31 | transformation or translation of a Source form, including but
32 | not limited to compiled object code, generated documentation,
33 | and conversions to other media types.
34 |
35 | "Work" shall mean the work of authorship, whether in Source or
36 | Object form, made available under the License, as indicated by a
37 | copyright notice that is included in or attached to the work
38 | (an example is provided in the Appendix below).
39 |
40 | "Derivative Works" shall mean any work, whether in Source or Object
41 | form, that is based on (or derived from) the Work and for which the
42 | editorial revisions, annotations, elaborations, or other modifications
43 | represent, as a whole, an original work of authorship. For the purposes
44 | of this License, Derivative Works shall not include works that remain
45 | separable from, or merely link (or bind by name) to the interfaces of,
46 | the Work and Derivative Works thereof.
47 |
48 | "Contribution" shall mean any work of authorship, including
49 | the original version of the Work and any modifications or additions
50 | to that Work or Derivative Works thereof, that is intentionally
51 | submitted to Licensor for inclusion in the Work by the copyright owner
52 | or by an individual or Legal Entity authorized to submit on behalf of
53 | the copyright owner. For the purposes of this definition, "submitted"
54 | means any form of electronic, verbal, or written communication sent
55 | to the Licensor or its representatives, including but not limited to
56 | communication on electronic mailing lists, source code control systems,
57 | and issue tracking systems that are managed by, or on behalf of, the
58 | Licensor for the purpose of discussing and improving the Work, but
59 | excluding communication that is conspicuously marked or otherwise
60 | designated in writing by the copyright owner as "Not a Contribution."
61 |
62 | "Contributor" shall mean Licensor and any individual or Legal Entity
63 | on behalf of whom a Contribution has been received by Licensor and
64 | subsequently incorporated within the Work.
65 |
66 | 2. Grant of Copyright License. Subject to the terms and conditions of
67 | this License, each Contributor hereby grants to You a perpetual,
68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
69 | copyright license to reproduce, prepare Derivative Works of,
70 | publicly display, publicly perform, sublicense, and distribute the
71 | Work and such Derivative Works in Source or Object form.
72 |
73 | 3. Grant of Patent License. Subject to the terms and conditions of
74 | this License, each Contributor hereby grants to You a perpetual,
75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
76 | (except as stated in this section) patent license to make, have made,
77 | use, offer to sell, sell, import, and otherwise transfer the Work,
78 | where such license applies only to those patent claims licensable
79 | by such Contributor that are necessarily infringed by their
80 | Contribution(s) alone or by combination of their Contribution(s)
81 | with the Work to which such Contribution(s) was submitted. If You
82 | institute patent litigation against any entity (including a
83 | cross-claim or counterclaim in a lawsuit) alleging that the Work
84 | or a Contribution incorporated within the Work constitutes direct
85 | or contributory patent infringement, then any patent licenses
86 | granted to You under this License for that Work shall terminate
87 | as of the date such litigation is filed.
88 |
89 | 4. Redistribution. You may reproduce and distribute copies of the
90 | Work or Derivative Works thereof in any medium, with or without
91 | modifications, and in Source or Object form, provided that You
92 | meet the following conditions:
93 |
94 | (a) You must give any other recipients of the Work or
95 | Derivative Works a copy of this License; and
96 |
97 | (b) You must cause any modified files to carry prominent notices
98 | stating that You changed the files; and
99 |
100 | (c) You must retain, in the Source form of any Derivative Works
101 | that You distribute, all copyright, patent, trademark, and
102 | attribution notices from the Source form of the Work,
103 | excluding those notices that do not pertain to any part of
104 | the Derivative Works; and
105 |
106 | (d) If the Work includes a "NOTICE" text file as part of its
107 | distribution, then any Derivative Works that You distribute must
108 | include a readable copy of the attribution notices contained
109 | within such NOTICE file, excluding those notices that do not
110 | pertain to any part of the Derivative Works, in at least one
111 | of the following places: within a NOTICE text file distributed
112 | as part of the Derivative Works; within the Source form or
113 | documentation, if provided along with the Derivative Works; or,
114 | within a display generated by the Derivative Works, if and
115 | wherever such third-party notices normally appear. The contents
116 | of the NOTICE file are for informational purposes only and
117 | do not modify the License. You may add Your own attribution
118 | notices within Derivative Works that You distribute, alongside
119 | or as an addendum to the NOTICE text from the Work, provided
120 | that such additional attribution notices cannot be construed
121 | as modifying the License.
122 |
123 | You may add Your own copyright statement to Your modifications and
124 | may provide additional or different license terms and conditions
125 | for use, reproduction, or distribution of Your modifications, or
126 | for any such Derivative Works as a whole, provided Your use,
127 | reproduction, and distribution of the Work otherwise complies with
128 | the conditions stated in this License.
129 |
130 | 5. Submission of Contributions. Unless You explicitly state otherwise,
131 | any Contribution intentionally submitted for inclusion in the Work
132 | by You to the Licensor shall be under the terms and conditions of
133 | this License, without any additional terms or conditions.
134 | Notwithstanding the above, nothing herein shall supersede or modify
135 | the terms of any separate license agreement you may have executed
136 | with Licensor regarding such Contributions.
137 |
138 | 6. Trademarks. This License does not grant permission to use the trade
139 | names, trademarks, service marks, or product names of the Licensor,
140 | except as required for reasonable and customary use in describing the
141 | origin of the Work and reproducing the content of the NOTICE file.
142 |
143 | 7. Disclaimer of Warranty. Unless required by applicable law or
144 | agreed to in writing, Licensor provides the Work (and each
145 | Contributor provides its Contributions) on an "AS IS" BASIS,
146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
147 | implied, including, without limitation, any warranties or conditions
148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
149 | PARTICULAR PURPOSE. You are solely responsible for determining the
150 | appropriateness of using or redistributing the Work and assume any
151 | risks associated with Your exercise of permissions under this License.
152 |
153 | 8. Limitation of Liability. In no event and under no legal theory,
154 | whether in tort (including negligence), contract, or otherwise,
155 | unless required by applicable law (such as deliberate and grossly
156 | negligent acts) or agreed to in writing, shall any Contributor be
157 | liable to You for damages, including any direct, indirect, special,
158 | incidental, or consequential damages of any character arising as a
159 | result of this License or out of the use or inability to use the
160 | Work (including but not limited to damages for loss of goodwill,
161 | work stoppage, computer failure or malfunction, or any and all
162 | other commercial damages or losses), even if such Contributor
163 | has been advised of the possibility of such damages.
164 |
165 | 9. Accepting Warranty or Additional Liability. While redistributing
166 | the Work or Derivative Works thereof, You may choose to offer,
167 | and charge a fee for, acceptance of support, warranty, indemnity,
168 | or other liability obligations and/or rights consistent with this
169 | License. However, in accepting such obligations, You may act only
170 | on Your own behalf and on Your sole responsibility, not on behalf
171 | of any other Contributor, and only if You agree to indemnify,
172 | defend, and hold each Contributor harmless for any liability
173 | incurred by, or claims asserted against, such Contributor by reason
174 | of your accepting any such warranty or additional liability.
175 |
176 | END OF TERMS AND CONDITIONS
177 |
178 | APPENDIX: How to apply the Apache License to your work.
179 |
180 | To apply the Apache License to your work, attach the following
181 | boilerplate notice, with the fields enclosed by brackets "[]"
182 | replaced with your own identifying information. (Don't include
183 | the brackets!) The text should be enclosed in the appropriate
184 | comment syntax for the file format. We also recommend that a
185 | file or class name and description of purpose be included on the
186 | same "printed page" as the copyright notice for easier
187 | identification within third-party archives.
188 |
189 | Copyright [yyyy] [name of copyright owner]
190 |
191 | Licensed under the Apache License, Version 2.0 (the "License");
192 | you may not use this file except in compliance with the License.
193 | You may obtain a copy of the License at
194 |
195 | http://www.apache.org/licenses/LICENSE-2.0
196 |
197 | Unless required by applicable law or agreed to in writing, software
198 | distributed under the License is distributed on an "AS IS" BASIS,
199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
200 | See the License for the specific language governing permissions and
201 | limitations under the License.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Sample for DockerCon EU 2018
2 |
3 | ---
4 |
5 | # End-to-End Machine Learning Pipeline with Docker for Desktop and Kubeflow
6 |
7 | This project is a simple example of an automated, end-to-end machine learning pipeline using Docker Desktop and Kubeflow.
8 |
9 | 
10 |
11 | ## Architecture
12 |
13 | 
14 |
15 | ### Seldon-Core Architecture
16 |
17 | 
18 |
19 | ## Getting started
20 |
21 | ### Requirements:
22 |
23 | - [Docker Desktop](https://www.docker.com/products/docker-desktop) for Mac or Windows.
24 | - Increase the memory configuration from 2GB to 8GB (Preferences->Advanced->Memory).
25 |
26 | - [ksonnet](https://ksonnet.io/#get-started) version 0.11.0 or later.
27 | - [Argo](https://github.com/argoproj/argo/blob/master/demo.md)
28 |
29 | ### Steps:
30 |
31 | #### 1. Build the base image used in the [Argo](https://argoproj.github.io/) workflow
32 |
33 | ```
34 | $ git clone https://github.com/dockersamples/docker-hub-ml-project
35 | $ cd docker-hub-ml-project
36 | $ cd base && make build; cd ..
37 | ```
38 |
39 | #### 2. Install [Kubeflow](https://www.kubeflow.org/)
40 |
41 | ```
42 | $ export BASE_PATH=$(pwd)
43 | $ export KUBEFLOW_TAG=master
44 | $ curl https://raw.githubusercontent.com/kubeflow/kubeflow/master/scripts/download.sh | bash
45 | $ ${BASE_PATH}/scripts/kfctl.sh init ks_app --platform docker-for-desktop
46 | $ cd ks_app
47 | $ ../scripts/kfctl.sh generate k8s
48 | $ ../scripts/kfctl.sh apply k8s
49 | ```
50 |
51 | Ok, now let's make sure we have everything up and running.
52 |
53 | ```
54 | # switch to the kubeflow namespace
55 | $ kubectl config set-context docker-for-desktop --namespace=kubeflow
56 | $ kubectl get pods
58 | NAME                                             READY     STATUS    RESTARTS   AGE
59 | ambassador-677dd9d8f4-hmwc7                      1/1       Running   0          4m
60 | ambassador-677dd9d8f4-jkmb6                      1/1       Running   0          4m
61 | ambassador-677dd9d8f4-lcc8m                      1/1       Running   0          4m
62 | argo-ui-7b8fff579c-kqbdl                         1/1       Running   0          2m
63 | centraldashboard-f8d7d97fb-6zk9v                 1/1       Running   0          3m
64 | jupyter-0                                        1/1       Running   0          3m
65 | metacontroller-0                                 1/1       Running   0          2m
66 | minio-84969865c4-hjl9f                           1/1       Running   0          2m
67 | ml-pipeline-5cf4db85f5-qskvt                     1/1       Running   1          2m
68 | ml-pipeline-persistenceagent-748666fdcb-vnjvr    1/1       Running   0          2m
69 | ml-pipeline-scheduledworkflow-5bf775c8c4-8rtwf   1/1       Running   0          2m
70 | ml-pipeline-ui-59f8cbbb86-vjqmv                  1/1       Running   0          2m
71 | mysql-c4c4c8f69-wfl99                            1/1       Running   0          2m
72 | spartakus-volunteer-74cb649fb9-w277v             1/1       Running   0          2m
73 | tf-job-dashboard-6b95c47f8-qkf5w                 1/1       Running   0          3m
74 | tf-job-operator-v1beta1-75587897bb-4zcwp         1/1       Running   0          3m
75 | workflow-controller-59c7967f59-wx426             1/1       Running   0          2m
76 | ```
77 |
78 | If you can see the pods above running, you should be able to access http://localhost:8080/hub to create your Jupyter Notebook instances.
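
If the page does not load, you can quickly probe the Ambassador gateway before digging into pod logs. This is a minimal Python sketch (not part of the repo), assuming the gateway is published on `localhost:8080`:

```
import requests

# Probe the Ambassador gateway that fronts the Kubeflow services.
# Any 2xx/3xx response means the gateway is reachable; adjust the port
# if you changed the default docker-for-desktop setup.
resp = requests.get("http://localhost:8080/hub", timeout=5, allow_redirects=True)
print(resp.status_code, resp.url)
```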
79 |
80 | #### 3. Deploy Seldon-Core's model serving infrastructure
81 |
82 | The custom resource definition (CRD) and its controller are installed using the seldon prototype:
83 |
84 | ```
85 | $ export NAMESPACE=kubeflow
86 | $ cd ks_app
87 | # Gives cluster-admin role to the default service account in the ${NAMESPACE}
88 | $ kubectl create clusterrolebinding seldon-admin --clusterrole=cluster-admin --serviceaccount=${NAMESPACE}:default
89 | # Install the kubeflow/seldon package
90 | $ ks pkg install kubeflow/seldon
91 | # Generate the seldon component and deploy it
92 | $ ks generate seldon seldon --name=seldon
93 | $ ks apply default -c seldon
94 | ```
95 |
96 | Seldon Core provides an example Helm analytics chart that displays the Prometheus metrics in Grafana. You can install it with:
97 |
98 | ```
99 | $ helm install seldon-core-analytics --name seldon-core-analytics --set grafana_prom_admin_password= --set persistence.enabled=false --repo https://storage.googleapis.com/seldon-charts --namespace kubeflow
100 | ```
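
Once the analytics chart is up, the Prometheus data behind the Grafana dashboards can also be queried directly. Below is a minimal Python sketch against the Prometheus HTTP API; it assumes you have port-forwarded the chart's Prometheus service to `localhost:9090`, and the exact service and metric names vary across Seldon versions:

```
import requests

# Query Seldon request-rate metrics from Prometheus.
# Port-forward first, e.g.:
#   kubectl port-forward svc/<prometheus-service> 9090:80 -n kubeflow
query = 'sum(rate(seldon_api_engine_server_requests_seconds_count[1m]))'
resp = requests.get("http://localhost:9090/api/v1/query",
                    params={"query": query}, timeout=10)
print(resp.json().get("data", {}).get("result", []))
```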
101 |
102 | #### 4. Set up the credentials for the machine learning pipeline
103 |
104 | Configure AWS S3 and Docker credentials on your Kubernetes cluster:
105 |
106 | ```
107 | # s3-credentials
108 | $ kubectl create secret generic s3-credentials --from-literal=accessKey= --from-literal=secretKey=
109 | # docker-credentials
110 | $ kubectl create secret generic docker-credentials --from-literal=username= --from-literal=password=
111 | ```
112 |
113 | You can upload our sample data located in `base/src/data/hub_stackshare_combined_v2.csv.gz` to your S3 bucket.
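
If you prefer to script the upload, here is a minimal Python sketch using `boto3` (not included in the repo); it assumes your AWS credentials are configured locally and that the bucket and key match the workflow parameters you plan to pass:

```
import boto3

# Upload the sample dataset to the S3 key the workflow will read from.
# "ml-project-2018" and "data/..." are placeholders; use your own bucket
# and pass the same key as the input-data-key parameter.
s3 = boto3.client("s3")
s3.upload_file("base/src/data/hub_stackshare_combined_v2.csv.gz",
               "ml-project-2018",
               "data/hub_stackshare_combined_v2.csv.gz")
```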
114 |
115 | #### 5. Submit the Argo workflow
116 |
117 | This process will perform the following steps:
118 |
119 | - Import data sources
120 | - Process data (clean-up & normalization)
121 | - Split data between training and test datasets
122 | - Train the model using Keras
123 | - Build and push Docker image using the [Seldon-Core](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md) wrapper
124 | - Deploy the model with 3 replicas
125 |
126 | Before submitting the Argo job, make sure you change the parameter values accordingly. You can access the `Argo UI` here: http://localhost:8080/argo/workflows.
127 |
128 | **Required fields:**
129 |
130 | - `bucket`: S3 bucket name (e.g. `ml-project-2018`)
131 | - `input-data-key`: Path to the S3 input data file (e.g. `data/hub_stackshare_combined_v2.csv.gz`)
132 |
133 | Add the Argo `artifactRepository` configuration for S3:
134 |
135 | ```
136 | $ kubectl edit configmap workflow-controller-configmap
137 | # update the `data` field with the content below
138 | data:
139 |   config: |
140 |     executorImage: argoproj/argoexec:v2.2.0
141 |     artifactRepository:
142 |       s3:
143 |         bucket: docker-metrics-backups
144 |         endpoint: s3.amazonaws.com #AWS => s3.amazonaws.com; GCS => storage.googleapis.com
145 |         accessKeySecret: #omit if accessing via AWS IAM
146 |           name: s3-credentials
147 |           key: accessKey
148 |         secretKeySecret: #omit if accessing via AWS IAM
149 |           name: s3-credentials
150 |           key: secretKey
151 | # save the new configuration and exit vim
152 | configmap "workflow-controller-configmap" edited
153 | ```
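
To double-check that the edit took effect, you can read the ConfigMap back, for example with the official Kubernetes Python client (an optional sketch, assuming the client is installed and your kubeconfig points at the docker-for-desktop cluster):

```
from kubernetes import client, config

# Load the local kubeconfig and read back the workflow-controller
# ConfigMap to confirm the artifactRepository block was saved.
config.load_kube_config()
v1 = client.CoreV1Api()
cm = v1.read_namespaced_config_map("workflow-controller-configmap", "kubeflow")
print(cm.data["config"])
```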
154 |
155 | Now let's submit the Argo workflow and monitor its execution from the browser (http://localhost:8080/argo/workflows). You can access each step's artifacts directly from the UI; they are also stored in S3.
156 |
157 | ```
158 | $ cd ${BASE_PATH}
159 | $ argo submit argo_workflow.yaml -p bucket="bucket-test1" -p input-data-key="hub_stackshare_combined_v2.csv.gz"
160 |
161 | Name: docker-hub-classificationmcwz7
162 | Namespace: kubeflow
163 | ServiceAccount: default
164 | Status: Pending
165 | Created: Fri Nov 30 10:07:53 -0800 (now)
166 | Parameters:
167 | registry:
168 | model-version: v3
169 | replicas: 3
170 | bucket:
171 | input-data-key:
172 | docker-cert-key:
173 | mount-path: /mnt/workspace/data
174 | loss: binary_crossentropy
175 | test-size: 0.2
176 | batch-size: 100
177 | epochs: 15
178 | validation-split: 0.1
179 | output-train-csv: train_data.csv
180 | output-test-csv: test_data.csv
181 | output-model: hub_classifier.h5
182 | output-vectorized-descriptions: vectorized_descriptions.pckl
183 | output-raw-csv: hub_stackshare_combined_v2.csv
184 | selected-categories: devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation
185 | ```
186 |
187 | > All the Argo workflow parameters can be overwritten via the CLI using the `-p` flag.
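
With the `deploy-model` step finished, the model can be exercised through its REST endpoint behind Ambassador. The sketch below is illustrative only: it assumes the gateway is on `localhost:8080`, that the deployment name follows the workflow's `docker-hub-classification-model-serving-<model-version>` pattern, and that your Seldon version exposes the `/seldon/<deployment>/api/v0.1/predictions` route:

```
import requests

# Send a Docker Hub-style description to the deployed classifier.
# Seldon passes the ndarray payload to DockerHubClassification.predict().
deployment = "docker-hub-classification-model-serving-v3"  # matches model-version
url = "http://localhost:8080/seldon/{}/api/v0.1/predictions".format(deployment)
payload = {"data": {"ndarray": [["lightweight web server and reverse proxy"]]}}

resp = requests.post(url, json=payload, timeout=30)
print(resp.json())
```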
188 |
189 | ## Repo Layout
190 |
191 | ```
192 | .
193 | ├── README.md
194 | ├── argo_workflow.png
195 | ├── argo_workflow.yaml
196 | └── base
197 |     ├── Dockerfile
198 |     ├── Makefile
199 |     ├── requirements.txt
200 |     └── src
201 |         ├── data
202 |         │   └── hub_stackshare_combined_v2.csv.gz
203 |         ├── fetch_gihub_data.py
204 |         ├── models
205 |         │   ├── DockerHubClassification.py
206 |         │   └── requirements.txt
207 |         ├── process_data.py
208 |         └── train.py
209 | ```
210 |
211 | ## Open Source Projects Used
212 |
213 | #### [Ambassador](https://www.getambassador.io/)
214 |
215 | API gateway based on Envoy Proxy. It enables self-service publishing and canary deployments.
216 |
217 | #### [TensorFlow](https://www.tensorflow.org/)
218 |
219 | Machine learning framework
220 |
221 | #### [Jupyter Hub](https://jupyterhub.readthedocs.io/en/stable/)
222 |
223 | Multi-user server for Jupyter notebooks
224 |
225 | #### [Seldon Core](https://www.seldon.io/)
226 |
227 | Platform for deploying ML models
228 |
229 | #### [Argo](https://argoproj.github.io/)
230 |
231 | Container-native workflow management (CI/CD)
232 |
233 | #### [Prometheus](https://prometheus.io/)
234 |
235 | Monitoring & alerting platform
236 |
237 | #### [Grafana](https://grafana.com/)
238 |
239 | Open platform for analytics and monitoring. It provides the UI for data visualization.
240 |
241 | #### [Kubernetes](https://kubernetes.io/)
242 |
243 | Open-source system for automating deployment, scaling, and management of containerized applications.
244 |
--------------------------------------------------------------------------------
/arch.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/arch.png
--------------------------------------------------------------------------------
/argo_workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/argo_workflow.png
--------------------------------------------------------------------------------
/argo_workflow.yaml:
--------------------------------------------------------------------------------
1 | apiVersion: argoproj.io/v1alpha1
2 | kind: Workflow
3 | metadata:
4 |   generateName: docker-hub-classification
5 | spec:
6 |   entrypoint: default
7 |
8 |   # Create a volume for containers to store their output data.
9 |   volumeClaimTemplates:
10 |   - metadata:
11 |       name: workdir
12 |     spec:
13 |       accessModes: ["ReadWriteOnce"]
14 |       resources:
15 |         requests:
16 |           storage: 10Gi
17 |
18 |   # Arguments of the workflow
19 |   arguments:
20 |     parameters:
21 |     # model version
22 |     - name: model-version
23 |       value: "v1"
24 |
25 |     # The name of the S3 bucket where the data is stored.
26 |     - name: bucket
27 |       value: ""
28 |
29 |     # Docker registry username
30 |     - name: username
31 |       value: "dev"
32 |
33 |     # The path to the input data in the S3 bucket, in csv.gz format
34 |     - name: input-data-key
35 |       value:
36 |
37 |     # The path to the docker cert bundle in the S3 bucket (e.g. Docker UCP bundle)
38 |     - name: docker-cert-key
39 |       value:
40 |
41 |     # mount path
42 |     - name: mount-path
43 |       value: /mnt/workspace/data
44 |
45 |     # loss function
46 |     - name: loss
47 |       value: binary_crossentropy
48 |
49 |     # Percentage of the dataset used to test the model (e.g. 0.2 == 20%)
50 |     - name: test-size
51 |       value: 0.2
52 |
53 |     # batch size
54 |     - name: batch-size
55 |       value: 100
56 |
57 |     # number of epochs
58 |     - name: epochs
59 |       value: 15
60 |
61 |     # validation split
62 |     - name: validation-split
63 |       value: 0.1
64 |
65 |     # output train data directory path
66 |     - name: output-train-csv
67 |       value: train_data.csv
68 |
69 |     # output test data directory path
70 |     - name: output-test-csv
71 |       value: test_data.csv
72 |
73 |     # output model
74 |     - name: output-model
75 |       value: hub_classifier.h5
76 |
77 |     # output vectorized descriptions
78 |     - name: output-vectorized-descriptions
79 |       value: vectorized_descriptions.pckl
80 |
81 |     # output raw data directory path
82 |     - name: output-raw-csv
83 |       value: hub_stackshare_combined_v2.csv
84 |
85 |     # selected categories
86 |     - name: selected-categories
87 |       value: "devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation"
88 |
89 |     # The container registry to use in the workflow
90 |     - name: registry
91 |       value: ""
92 |
93 |     # The container image to use in the workflow
94 |     - name: image-name
95 |       value: data-team/base:latest
96 |
97 |   templates:
98 |   ##################################
99 |   # Define the steps of the workflow
100 |   ##################################
101 |   - name: default
102 |     steps:
103 |     - - name: import-data
104 |         template: import-data
105 |     - - name: process-data
106 |         template: process-data
107 |     - - name: training
108 |         template: training
109 |     - - name: build-push-image
110 |         template: build-push-image
111 |     - - name: deploy-model
112 |         template: deploy-model
113 |
114 |   #################################################
115 |   # Import / Unzip
116 |   # Imports the input data & docker certs and unpacks them.
117 |   #################################################
118 |   - name: import-data
119 |     container:
120 |       image: alpine:latest
121 |       command: [sh, -c]
122 |       args: [
123 |         "gzip -d {{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz", #mkdir {{workflow.parameters.mount-path}}/docker-cert-bundle; unzip {{workflow.parameters.mount-path}}/docker-cert-bundle.zip -d {{workflow.parameters.mount-path}}/docker-cert-bundle",
124 |       ]
125 |       volumeMounts:
126 |       - name: workdir
127 |         mountPath: "{{workflow.parameters.mount-path}}/"
128 |     inputs:
129 |       artifacts:
130 |       - name: data
131 |         path: "{{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz"
132 |         s3:
133 |           endpoint: s3.amazonaws.com
134 |           bucket: "{{workflow.parameters.bucket}}"
135 |           key: "{{workflow.parameters.input-data-key}}"
136 |           accessKeySecret:
137 |             name: s3-credentials
138 |             key: accessKey
139 |           secretKeySecret:
140 |             name: s3-credentials
141 |             key: secretKey
142 |       # - name: docker-cert-bundle
143 |       #   path: "{{workflow.parameters.mount-path}}/docker-cert-bundle.zip"
144 |       #   s3:
145 |       #     endpoint: s3.amazonaws.com
146 |       #     bucket: "{{workflow.parameters.bucket}}"
147 |       #     key: "{{workflow.parameters.docker-cert-key}}"
148 |       #     accessKeySecret:
149 |       #       name: s3-credentials
150 |       #       key: accessKey
151 |       #     secretKeySecret:
152 |       #       name: s3-credentials
153 |       #       key: secretKey
154 |     outputs:
155 |       artifacts:
156 |       - name: raw-csv
157 |         path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-raw-csv}}"
158 |
159 |   #########################################################################
160 |   # Process Data
161 |   #########################################################################
162 |   - name: process-data
163 |     container:
164 |       image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}"
165 |       imagePullPolicy: "IfNotPresent"
166 |       command: [sh, -c]
167 |       args:
168 |         [
169 |           "python /src/process_data.py --mount_path {{workflow.parameters.mount-path}} --input_csv {{workflow.parameters.output-raw-csv}} --output_train_csv {{workflow.parameters.output-train-csv}} --output_test_csv {{workflow.parameters.output-test-csv}} --test_size {{workflow.parameters.test-size}} --selected_categories '{{workflow.parameters.selected-categories}}'",
170 |         ]
171 |       volumeMounts:
172 |       - name: workdir
173 |         mountPath: "{{workflow.parameters.mount-path}}/"
174 |     outputs:
175 |       artifacts:
176 |       - name: output-train-csv
177 |         path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-train-csv}}"
178 |       - name: output-test-csv
179 |         path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-test-csv}}"
180 |       - name: selected-categories
181 |         path: "{{workflow.parameters.mount-path}}/selected_categories.pckl"
182 |
183 |   #######################################
184 |   # Training and ML model extraction
185 |   #######################################
186 |   - name: training
187 |     container:
188 |       image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}"
189 |       imagePullPolicy: "IfNotPresent"
190 |       command: [sh, -c]
191 |       args:
192 |         [
193 |           "python /src/train.py --mount_path {{workflow.parameters.mount-path}} --input_train_csv {{workflow.parameters.output-train-csv}} --input_test_csv {{workflow.parameters.output-test-csv}} --output_model {{workflow.parameters.output-model}} --output_vectorized_descriptions {{workflow.parameters.output-vectorized-descriptions}};cp /src/models/* {{workflow.parameters.mount-path}}/",
194 |         ]
195 |       volumeMounts:
196 |       - name: workdir
197 |         mountPath: "{{workflow.parameters.mount-path}}/"
198 |     outputs:
199 |       artifacts:
200 |       - name: output-model
201 |         path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-model}}"
202 |       - name: output-vectorized-descriptions
203 |         path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-vectorized-descriptions}}"
204 |
205 |   #######################################
206 |   # Build and push a docker image using the Seldon-Core Docker wrapper
207 |   #######################################
208 |   - name: build-push-image
209 |     container:
210 |       image: docker:17.10
211 |       command: [sh, -c]
212 |       args:
213 |         [
214 |           "cd {{workflow.parameters.mount-path}};sleep 15;rm *.csv;docker run -v {{workflow.parameters.mount-path}}:/model seldonio/core-python-wrapper:0.7 /model DockerHubClassification {{workflow.parameters.model-version}} {{workflow.parameters.registry}}{{workflow.parameters.username}} --base-image=python:3.6 --image-name=dockerhubclassifier;cd build/;./build_image.sh;echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin;./push_image.sh;",
215 |         ]
216 |       volumeMounts:
217 |       - name: workdir
218 |         mountPath: "{{workflow.parameters.mount-path}}/"
219 |       env:
220 |       - name: DOCKER_HOST #the docker daemon can be accessed on the standard port on localhost
221 |         value: 127.0.0.1
222 |       - name: DOCKER_USERNAME # name of env var
223 |         valueFrom:
224 |           secretKeyRef:
225 |             name: docker-credentials # name of an existing k8s secret
226 |             key: username # 'key' subcomponent of the secret
227 |       - name: DOCKER_PASSWORD # name of env var
228 |         valueFrom:
229 |           secretKeyRef:
230 |             name: docker-credentials # name of an existing k8s secret
231 |             key: password # 'key' subcomponent of the secret
232 |     sidecars:
233 |     - name: dind
234 |       image: docker:17.10-dind #Docker already provides an image for running a Docker daemon
235 |       securityContext:
236 |         privileged: true #the Docker daemon can only run in a privileged container
237 |       # mirrorVolumeMounts will mount the same volumes specified in the main container
238 |       # to the sidecar (including artifacts), at the same mountPaths. This enables the
239 |       # dind daemon to (partially) see the same filesystem as the main container in
240 |       # order to use features such as docker volume binding.
241 |       mirrorVolumeMounts: true
242 |
243 |   #######################################
244 |   # Deploy model
245 |   #######################################
246 |   - name: deploy-model
247 |     resource: #indicates that this is a resource template
248 |       action: apply #can be any kubectl action (e.g. create, delete, apply, patch)
249 |       #successCondition: ?
250 |       manifest: |
251 |         apiVersion: "machinelearning.seldon.io/v1alpha2"
252 |         kind: "SeldonDeployment"
253 |         metadata:
254 |           labels:
255 |             app: "seldon"
256 |           name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}"
257 |           namespace: kubeflow
258 |         spec:
259 |           annotations:
260 |             deployment_version: "{{workflow.parameters.model-version}}"
261 |             project_name: "Docker Hub ML Project"
262 |           name: "docker-hub-classifier"
263 |           predictors:
264 |           - annotations:
265 |               predictor_version: "{{workflow.parameters.model-version}}"
266 |             componentSpecs:
267 |             - spec:
268 |                 containers:
269 |                 - image: "{{workflow.parameters.registry}}{{workflow.parameters.username}}/dockerhubclassifier:{{workflow.parameters.model-version}}"
270 |                   imagePullPolicy: "Always"
271 |                   name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}"
272 |             graph:
273 |               children: []
274 |               endpoint:
275 |                 type: "REST"
276 |               name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}"
277 |               type: "MODEL"
278 |             name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}"
279 |             replicas: 3
280 |
--------------------------------------------------------------------------------
/base/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.6
2 |
3 | RUN apt-get update -y
4 | RUN apt-get install -y python-pip python-dev build-essential
5 |
6 | COPY /requirements.txt /tmp/
7 | RUN cd /tmp && \
8 | pip install --no-cache-dir -r requirements.txt
9 |
10 | # copy python scripts
11 | COPY ./src /src
12 |
--------------------------------------------------------------------------------
/base/Makefile:
--------------------------------------------------------------------------------
1 | IMAGE_NAME := data-team/base
2 | IMAGE_TAG := $(VERSION)-$(CHANNEL)
3 |
4 | ifeq ($(IMAGE_TAG),-)
5 | IMAGE_TAG := latest
6 | endif
7 |
8 | default: release
9 |
10 | build: ## Build the container without caching
11 | 	docker build --no-cache -t $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG) .
12 |
13 | release: build publish ## Make a release by building and publishing tagged image to Docker Trusted Registry (DTR)
14 |
15 | publish: ## Publish image to DTR
16 | 	@echo 'publish $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG)'
17 | 	docker push $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG)
18 |
--------------------------------------------------------------------------------
/base/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow
2 | scikit-learn>=0.18
3 | pandas
4 | keras
5 | nltk
--------------------------------------------------------------------------------
/base/src/data/hub_stackshare_combined_v2.csv.gz:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/base/src/data/hub_stackshare_combined_v2.csv.gz
--------------------------------------------------------------------------------
/base/src/fetch_gihub_data.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import pandas as pd
4 | from github import Github
5 |
6 | LOGIN = os.environ['GITHUB_LOGIN']
7 | PASSWORD = os.environ['GITHUB_PASSWORD']
8 |
9 | api = Github(LOGIN, PASSWORD)
10 |
11 | # Parsing flags.
12 | parser = argparse.ArgumentParser()
13 | parser.add_argument("--topic_name")
14 | args = parser.parse_args()
15 | print(args)
16 |
17 | result = api.search_repositories(
18 |     'topic:{}'.format(args.topic_name), sort='stars')
19 |
20 | columns = ('repo_name', 'html_url',
21 |            'description', 'topics')
22 |
23 |
24 | # loop through 1,000 repositories ordered by the number of stars
25 | data = []
26 | for index in range(0, 33):
27 |     for item in result.get_page(index):
28 |         if item.description and not item.fork:
29 |             try:
30 |                 data.append([item.name, item.html_url, item.description.encode('utf-8'),
31 |                              item.get_topics()])
32 |             except Exception as e:
33 |                 print(e)
34 |                 break
35 |
36 | # create pandas dataframe
37 | df = pd.DataFrame(data, columns=columns)
38 | # save to CSV
39 | df.to_csv('{}_output.csv'.format(args.topic_name), index=False)
40 | print('file {}_output.csv created'.format(args.topic_name))
41 |
--------------------------------------------------------------------------------
/base/src/models/DockerHubClassification.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | import re
3 | import numpy as np
4 | from keras.models import load_model
5 |
6 |
7 | class DockerHubClassification(object):
8 |     def __init__(self):
9 |         self.model = load_model('hub_classifier.h5')
10 |         self.model._make_predict_function()
11 |         self.selected_categories = pickle.load(
12 |             open('selected_categories.pckl', 'rb'))
13 |         self.tokenizer = pickle.load(
14 |             open('vectorized_descriptions.pckl', 'rb'))
15 |
16 |     def _clean_string(self, text):
17 |         text = re.sub('[!@#$,.)(=*`]', '', text)
18 |         return text.lower()
19 |
20 |     def _predict_labels(self, text):
21 |         labels = []
22 |         description = self._clean_string(str(text))
23 |         description_matrix = self.tokenizer.texts_to_matrix([
24 |             description], mode='tfidf')
25 |         preds = self.model.predict(
26 |             description_matrix, batch_size=None, verbose=1, steps=None)
27 |
28 |         preds[preds > 0.4] = 1
29 |         preds[preds < 0.4] = 0
30 |
31 |         for c in range(len(self.selected_categories)):
32 |             if preds[0][c] == 1:
33 |                 labels.append(self.selected_categories[c])
34 |         return labels
35 |
36 |     def predict(self, text, features_names):
37 |         return np.array([self._predict_labels(text)])
38 |
--------------------------------------------------------------------------------
/base/src/models/requirements.txt:
--------------------------------------------------------------------------------
1 | tensorflow
2 | keras
--------------------------------------------------------------------------------
/base/src/process_data.py:
--------------------------------------------------------------------------------
1 | from nltk.corpus import stopwords
2 | import ast
3 | import argparse
4 | import pickle
5 | import pandas as pd
6 | import nltk
7 | import re
8 | from sklearn.model_selection import train_test_split
9 |
10 | # Parsing flags.
11 | parser = argparse.ArgumentParser()
12 | parser.add_argument("--input_csv")
13 | parser.add_argument("--mount_path")
14 | parser.add_argument("--output_train_csv")
15 | parser.add_argument("--output_test_csv")
16 | parser.add_argument("--selected_categories")
17 | parser.add_argument("--test_size")
18 | args = parser.parse_args()
19 | print(args)
20 |
21 |
22 | # Load data from CSV
23 | raw_data = pd.read_csv('{}/{}'.format(args.mount_path, args.input_csv))
24 |
25 | # Remove records with empty descriptions
26 | data = raw_data[raw_data.FULL_DESCRIPTION.notna()]
27 | # Lower case
28 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.lower()
29 | # Remove punctuation
30 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('[^\w\s]', '')
31 | # Remove numbers
32 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\d+', '')
33 | # Remove `\n` and `\t` characters
34 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\n', ' ')
35 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\t', ' ')
36 | # Remove long strings (len() > 24)
37 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\w{24,}', '')
38 | # Remove urls
39 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('http\w+', '')
40 | # Remove extra spaces
41 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.strip()
42 | data.FULL_DESCRIPTION.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
43 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace(' +', ' ')
44 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.replace('\s+', ' ', regex=True)
45 | # Drop duplicates
46 | data = data.drop_duplicates()
47 |
48 | # build-test-deploy,languages & frameworks,databases,web servers,application utilities,devops
49 | # operating systems,business tools,continuous integration,message queue,support-sales-and-marketing
50 | selected_categories = args.selected_categories.split(',')
51 |
52 | for col in selected_categories:
53 |     data[col] = 0
54 |
55 | for index, row in data.iterrows():
56 |     labels = row.labels
57 |     labels = ast.literal_eval(labels)
58 |     labels = [e.strip() for e in labels]
59 |     for l in labels:
60 |         if l in selected_categories:
61 |             data.loc[index, l] = 1
62 | data = data.dropna(axis=1)
63 |
64 | # Remove stopwords
65 | nltk.download('stopwords')
66 | stop_words = set(stopwords.words('english'))
67 | stop_words.update([
68 | 'default',
69 | 'image',
70 | 'docker',
71 | 'container',
72 | 'service',
73 | 'production',
74 | 'dockerfile',
75 | 'dockercompose',
76 | 'build',
77 | 'latest',
78 | 'file',
79 | 'tag',
80 | 'instance',
81 | 'run',
82 | 'running',
83 | 'use',
84 | 'will',
85 | 'work',
86 | 'please',
87 | 'install',
88 | 'tags',
89 | 'version',
90 | 'create',
91 | 'want',
92 | 'need',
93 | 'used',
94 | 'well',
95 | 'user',
96 | 'release',
97 | 'config',
98 | 'dir',
99 | 'support',
100 | 'exec',
101 | 'github',
102 | 'rm',
103 | 'mkdir',
104 | 'env',
105 | 'folder',
106 | 'http',
107 | 'repo',
108 | 'cd',
109 | 'ssh',
110 | 'root'])
111 |
112 | re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)
113 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.apply(
114 |     lambda sentence: re_stop_words.sub(" ", sentence))
115 |
116 | # Split data into test and train datasets
117 | train, test = train_test_split(
118 |     data, test_size=float(args.test_size), shuffle=True)
119 |
120 | # Print stats about the shape of the data.
121 | print('Train: {:,} rows {:,} columns'.format(train.shape[0], train.shape[1]))
122 | print('Test: {:,} rows {:,} columns'.format(test.shape[0], test.shape[1]))
123 |
124 | # save output as CSV.
125 | train.to_csv('{}/{}'.format(args.mount_path,
126 |              args.output_train_csv), index=False)
127 | test.to_csv('{}/{}'.format(args.mount_path,
128 |             args.output_test_csv), index=False)
129 | # save list of categories
130 | f2 = open('{}/selected_categories.pckl'.format(args.mount_path), 'wb')
131 | pickle.dump(selected_categories, f2)
132 |
--------------------------------------------------------------------------------
/base/src/train.py:
--------------------------------------------------------------------------------
1 | import argparse
2 | import pickle
3 | import pandas as pd
4 | from keras.layers import Dense
5 | from keras.models import Sequential
6 | from keras import layers
7 | from keras.preprocessing.text import Tokenizer
8 |
9 |
10 | # Parsing flags.
11 | parser = argparse.ArgumentParser()
12 | parser.add_argument("--mount_path")
13 | parser.add_argument("--input_train_csv")
14 | parser.add_argument("--input_test_csv")
15 | parser.add_argument("--output_model")
16 | parser.add_argument("--output_vectorized_descriptions")
17 | parser.add_argument("--loss", default="binary_crossentropy")
18 | parser.add_argument("--batch_size", default=100)
19 | parser.add_argument("--epochs", default=15)
20 | parser.add_argument("--validation_split", default=0.1)
21 | args = parser.parse_args()
22 | print(args)
23 |
24 | # Load data from CSV
25 | train_data = pd.read_csv(
26 |     '{}/{}'.format(args.mount_path, args.input_train_csv))
27 | test_data = pd.read_csv(
28 |     '{}/{}'.format(args.mount_path, args.input_test_csv))
29 |
30 | # Remove records with empty descriptions
31 | train_data = train_data[train_data.FULL_DESCRIPTION.notna()]
32 | test_data = test_data[test_data.FULL_DESCRIPTION.notna()]
33 |
34 | # Extract full description from datasets
35 | train_text = train_data.FULL_DESCRIPTION.tolist()
36 | test_text = test_data.FULL_DESCRIPTION.tolist()
37 |
38 | t = Tokenizer()
39 | t.fit_on_texts(train_text + test_text)
40 |
41 | # integer encode documents
42 | x_train = t.texts_to_matrix(train_text, mode='tfidf')
43 | x_test = t.texts_to_matrix(test_text, mode='tfidf')
44 |
45 | # Remove unnecessary columns from the datasets
46 | y_train = train_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME',
47 |                                   'DESCRIPTION', 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1)
48 | y_test = test_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME', 'DESCRIPTION',
49 |                                 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1)
50 |
51 | # KERAS MODEL
52 | n_cols = x_train.shape[1]
53 | model = Sequential()
54 |
55 | # input layer of 70 neurons:
56 | model.add(Dense(70, activation='relu', input_shape=(n_cols,)))
57 |
58 | # output layer of 82 neurons:
59 | model.add(Dense(82, activation='sigmoid'))
60 |
61 | # determining optimizer, loss, and metrics:
62 | model.compile(optimizer='adam', loss=args.loss,
63 |               metrics=['binary_accuracy'])
64 |
65 | history = model.fit(
66 |     x_train, y_train, batch_size=int(args.batch_size),
67 |     epochs=int(args.epochs), verbose=0, validation_split=float(args.validation_split))
68 | score, acc = model.evaluate(
69 |     x_test, y_test, batch_size=int(args.batch_size), verbose=0)
70 | print("Score/Loss: ", score)
71 | print("Accuracy: ", acc)
72 | # save model
73 | model.save('{}/{}'.format(args.mount_path, args.output_model))
74 | # save tokenizer
75 | f1 = open('{}/{}'.format(args.mount_path,
76 |           args.output_vectorized_descriptions), 'wb')
77 | pickle.dump(t, f1)
78 |
--------------------------------------------------------------------------------
/d4m_menu.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/d4m_menu.png
--------------------------------------------------------------------------------
/seldon_core.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/seldon_core.png
--------------------------------------------------------------------------------