├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── arch.png ├── argo_workflow.png ├── argo_workflow.yaml ├── base ├── Dockerfile ├── Makefile ├── requirements.txt └── src │ ├── data │ └── hub_stackshare_combined_v2.csv.gz │ ├── fetch_gihub_data.py │ ├── models │ ├── DockerHubClassification.py │ └── requirements.txt │ ├── process_data.py │ └── train.py ├── d4m_menu.png └── seldon_core.png /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contribution guidelines 2 | 3 | Since this project is intended to support a specific use case, contributions are limited to bug fixes or security issues. If you have a question, feel free to open an issue! 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sample for DockerCon EU 2018 2 | 3 | --- 4 | 5 | # End-to-End Machine Learning Pipeline with Docker for Desktop and Kubeflow 6 | 7 | This project is a simple example of an automated end-to-end machine learning pipeline using Docker Desktop and Kubeflow. 
8 | 9 | ![workflow](argo_workflow.png?raw=true) 10 | 11 | ## Architecture 12 | 13 | ![arch](arch.png?raw=true) 14 | 15 | ### Seldon-Core Architecture 16 | 17 | ![seldon-core](seldon_core.png?raw=true) 18 | 19 | ## Getting started 20 | 21 | ### Requirements: 22 | 23 | - [Docker Desktop](https://www.docker.com/products/docker-desktop) for Mac or Windows. 24 | - Increase the memory configuration from 2GB to 8GB (Preferences->Advanced->Memory). 25 |

26 | - [ksonnet](https://ksonnet.io/#get-started) version 0.11.0 or later. 27 | - [Argo](https://github.com/argoproj/argo/blob/master/demo.md) 28 | 29 | ### Steps: 30 | 31 | #### 1. Build the base image used in the [Argo](https://argoproj.github.io/) workflow 32 | 33 | ``` 34 | $ git clone https://github.com/dockersamples/docker-hub-ml-project 35 | $ cd docker-hub-ml-project 36 | $ cd base && make build; cd .. 37 | ``` 38 | 39 | #### 2. Install [Kubeflow](https://www.kubeflow.org/) 40 | 41 | ``` 42 | $ export BASE_PATH=$(pwd) 43 | $ export KUBEFLOW_TAG=master 44 | $ curl https://raw.githubusercontent.com/kubeflow/kubeflow/master/scripts/download.sh | bash 45 | $ ${BASE_PATH}/scripts/kfctl.sh init ks_app --platform docker-for-desktop 46 | $ cd ks_app 47 | $ ../scripts/kfctl.sh generate k8s 48 | $ ../scripts/kfctl.sh apply k8s 49 | ``` 50 | 51 | OK, now let's make sure everything is up and running. 52 | 53 | ``` 54 | # switch to the kubeflow namespace 55 | $ kubectl config set-context docker-for-desktop --namespace=kubeflow 56 | $ kubectl get pods 57 | 58 | NAME READY STATUS RESTARTS AGE 59 | ambassador-677dd9d8f4-hmwc7 1/1 Running 0 4m 60 | ambassador-677dd9d8f4-jkmb6 1/1 Running 0 4m 61 | ambassador-677dd9d8f4-lcc8m 1/1 Running 0 4m 62 | argo-ui-7b8fff579c-kqbdl 1/1 Running 0 2m 63 | centraldashboard-f8d7d97fb-6zk9v 1/1 Running 0 3m 64 | jupyter-0 1/1 Running 0 3m 65 | metacontroller-0 1/1 Running 0 2m 66 | minio-84969865c4-hjl9f 1/1 Running 0 2m 67 | ml-pipeline-5cf4db85f5-qskvt 1/1 Running 1 2m 68 | ml-pipeline-persistenceagent-748666fdcb-vnjvr 1/1 Running 0 2m 69 | ml-pipeline-scheduledworkflow-5bf775c8c4-8rtwf 1/1 Running 0 2m 70 | ml-pipeline-ui-59f8cbbb86-vjqmv 1/1 Running 0 2m 71 | mysql-c4c4c8f69-wfl99 1/1 Running 0 2m 72 | spartakus-volunteer-74cb649fb9-w277v 1/1 Running 0 2m 73 | tf-job-dashboard-6b95c47f8-qkf5w 1/1 Running 0 3m 74 | tf-job-operator-v1beta1-75587897bb-4zcwp 1/1 Running 0 3m 75 | workflow-controller-59c7967f59-wx426 1/1 Running 0 2m 76 | ``` 77 | 78 | If you can see the pods above running, you should be able to access http://localhost:8080/hub to create your Jupyter notebook instances. 79 | 80 | #### 3. Deploy Seldon-Core's model serving infrastructure 81 | 82 | The custom resource definition (CRD) and its controller are installed using the Seldon ksonnet prototype: 83 | 84 | ``` 85 | $ export NAMESPACE=kubeflow 86 | $ cd ks_app 87 | # Gives cluster-admin role to the default service account in the ${NAMESPACE} 88 | $ kubectl create clusterrolebinding seldon-admin --clusterrole=cluster-admin --serviceaccount=${NAMESPACE}:default 89 | # Install the kubeflow/seldon package 90 | $ ks pkg install kubeflow/seldon 91 | # Generate the seldon component and deploy it 92 | $ ks generate seldon seldon --name=seldon 93 | $ ks apply default -c seldon 94 | ``` 95 | 96 | Seldon Core provides an example Helm analytics chart that displays the Prometheus metrics in Grafana. You can install it with: 98 | 99 | ``` 99 | $ helm install seldon-core-analytics --name seldon-core-analytics --set grafana_prom_admin_password= --set persistence.enabled=false --repo https://storage.googleapis.com/seldon-charts --namespace kubeflow 100 | ``` 101 | 
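Before moving on, you can run a quick sanity check that the Seldon operator started (a minimal check; the exact pod name prefix may vary with the Seldon Core version):

```
$ kubectl -n kubeflow get pods | grep seldon
```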
102 | #### 4. Set up the credentials for the machine learning pipeline 103 | 104 | Configure AWS S3 and Docker credentials on your Kubernetes cluster: 105 | 106 | ``` 107 | # s3-credentials 108 | $ kubectl create secret generic s3-credentials --from-literal=accessKey= --from-literal=secretKey= 109 | # docker-credentials 110 | $ kubectl create secret generic docker-credentials --from-literal=username= --from-literal=password= 111 | ``` 112 | 113 | You can upload our sample data located in `base/src/data/hub_stackshare_combined_v2.csv.gz` to your S3 bucket. 114 | 115 | #### 5. Submit the Argo workflow 116 | 117 | This process will perform the following steps: 118 | 119 | - Import data sources 120 | - Process data (clean-up & normalization) 121 | - Split data between training and test datasets 122 | - Train the model using Keras 123 | - Build and push a Docker image using the [Seldon-Core](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md) wrapper 124 | - Deploy the model with 3 replicas 125 | 126 | Before submitting the Argo job, make sure you adjust the parameter values for your environment. You can access the `Argo UI` here: http://localhost:8080/argo/workflows. 127 | 128 | **Required fields:** 129 | 130 | - `bucket`: S3 bucket name (e.g. `ml-project-2018`) 131 | - `input-data-key`: Path to the S3 input data file (e.g. `data/hub_stackshare_combined_v2.csv.gz`) 132 | 133 | Add the Argo `artifactRepository` configuration for S3: 134 | 135 | ``` 136 | $ kubectl edit configmap workflow-controller-configmap 137 | # update the `data` field with the content below 138 | data: 139 | config: | 140 | executorImage: argoproj/argoexec:v2.2.0 141 | artifactRepository: 142 | s3: 143 | bucket: docker-metrics-backups 144 | endpoint: s3.amazonaws.com #AWS => s3.amazonaws.com; GCS => storage.googleapis.com 145 | accessKeySecret: #omit if accessing via AWS IAM 146 | name: s3-credentials 147 | key: accessKey 148 | secretKeySecret: #omit if accessing via AWS IAM 149 | name: s3-credentials 150 | key: secretKey 151 | # save the new configuration and exit vim 152 | configmap "workflow-controller-configmap" edited 153 | ``` 154 | 155 | Now let's submit the Argo workflow and monitor its execution from the browser (http://localhost:8080/argo/workflows). You can access each step's artifacts directly from the UI; they are also stored in S3. 
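Besides the browser UI, you can also follow the run with the Argo CLI once it has been submitted (a small sketch; the workflow name is a placeholder for the one printed by `argo submit` below):

```
$ argo list
$ argo get <workflow-name>
$ argo watch <workflow-name>
```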
156 | 157 | ``` 158 | $ cd ${BASE_PATH} 159 | $ argo submit argo_workflow.yaml -p bucket="bucket-test1" -p input-data-key="hub_stackshare_combined_v2.csv.gz" 160 | 161 | Name: docker-hub-classificationmcwz7 162 | Namespace: kubeflow 163 | ServiceAccount: default 164 | Status: Pending 165 | Created: Fri Nov 30 10:07:53 -0800 (now) 166 | Parameters: 167 | registry: 168 | model-version: v3 169 | replicas: 3 170 | bucket: 171 | input-data-key: 172 | docker-cert-key: 173 | mount-path: /mnt/workspace/data 174 | loss: binary_crossentropy 175 | test-size: 0.2 176 | batch-size: 100 177 | epochs: 15 178 | validation-split: 0.1 179 | output-train-csv: train_data.csv 180 | output-test-csv: test_data.csv 181 | output-model: hub_classifier.h5 182 | output-vectorized-descriptions: vectorized_descriptions.pckl 183 | output-raw-csv: hub_stackshare_combined_v2.csv 184 | selected-categories: devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation 185 | ``` 186 | 187 | > All the Argo workflow parameters can be overwritten via the CLI using the `-p` flag. 188 | 189 | ## Repo Layout 190 | 191 | ``` 192 | . 193 | ├── README.md 194 | ├── argo_workflow.png 195 | ├── argo_workflow.yaml 196 | └── base 197 | ├── Dockerfile 198 | ├── Makefile 199 | ├── requirements.txt 200 | └── src 201 | ├── data 202 | │   └── hub_stackshare_combined_v2.csv.gz 203 | ├── fetch_gihub_data.py 204 | ├── models 205 | │   ├── DockerHubClassification.py 206 | │   └── requirements.txt 207 | ├── process_data.py 208 | └── train.py 209 | ``` 210 | 211 | ## Open Source Projects Used 212 | 213 | #### [Ambassador](https://www.getambassador.io/) 214 | 215 | API Gateway based on envoy proxy. It allows you to do self-service publishing and canary deployments. 
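Ambassador is also the entry point for the model that the workflow's `deploy-model` step creates, so once the workflow has finished you can exercise the classifier through it. The request below is only a hedged sketch: the route prefix is derived from the SeldonDeployment name and can differ between Seldon Core versions and `model-version` values, and the payload follows the generic Seldon REST contract.

```
# Hypothetical prediction request through Ambassador (adjust the deployment name/route to match your setup)
$ curl -s -X POST http://localhost:8080/seldon/docker-hub-classification-model-serving-v1/api/v0.1/predictions \
    -H "Content-Type: application/json" \
    -d '{"data": {"ndarray": [["lightweight web server for serving static content"]]}}'
```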
216 | 217 | #### [Tensorflow](https://www.tensorflow.org/) 218 | 219 | Machine learning framework 220 | 221 | #### [Jupyter Hub](https://jupyterhub.readthedocs.io/en/stable/) 222 | 223 | Multi-user server for Jupyter notebooks 224 | 225 | #### [Seldon Core](https://www.seldon.io/) 226 | 227 | Platform for deploying ML models 228 | 229 | #### [Argo](https://argoproj.github.io/) 230 | 231 | Container-native workflow management (CI/CD) 232 | 233 | #### [Prometheus](https://prometheus.io/) 234 | 235 | Monitoring & Alerting platform 236 | 237 | #### [Grafana](https://grafana.com/) 238 | 239 | Open platform for analytics and monitoring. It provides the UI for data visualization. 240 | 241 | #### [Kubernetes](https://kubernetes.io/) 242 | 243 | Open-source system for automating deployment, scaling, and management of containerized applications. 244 | -------------------------------------------------------------------------------- /arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/arch.png -------------------------------------------------------------------------------- /argo_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/argo_workflow.png -------------------------------------------------------------------------------- /argo_workflow.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: argoproj.io/v1alpha1 2 | kind: Workflow 3 | metadata: 4 | generateName: docker-hub-classification 5 | spec: 6 | entrypoint: default 7 | 8 | # Create a volume for containers to store their output data. 9 | volumeClaimTemplates: 10 | - metadata: 11 | name: workdir 12 | spec: 13 | accessModes: ["ReadWriteOnce"] 14 | resources: 15 | requests: 16 | storage: 10Gi 17 | 18 | # Arguments of the workflow 19 | arguments: 20 | parameters: 21 | # model version 22 | - name: model-version 23 | value: "v1" 24 | 25 | # The name of the S3 bucket where the data is stored. 26 | - name: bucket 27 | value: "" 28 | 29 | # Docker registry username 30 | - name: username 31 | value: "dev" 32 | 33 | # The path to the input data in the S3 bucket, in csv.gz format 34 | - name: input-data-key 35 | value: 36 | 37 | # The path to the Docker cert bundle in the S3 bucket (e.g. Docker UCP bundle) 38 | - name: docker-cert-key 39 | value: 40 | 41 | # mount path 42 | - name: mount-path 43 | value: /mnt/workspace/data 44 | 45 | # loss function 46 | - name: loss 47 | value: binary_crossentropy 48 | 49 | # Percentage of the dataset used to test the model (e.g. 
0.2 == 20%) 50 | - name: test-size 51 | value: 0.2 52 | 53 | # batch size 54 | - name: batch-size 55 | value: 100 56 | 57 | # number of epochs 58 | - name: epochs 59 | value: 15 60 | 61 | # validation split 62 | - name: validation-split 63 | value: 0.1 64 | 65 | # output train data directory path 66 | - name: output-train-csv 67 | value: train_data.csv 68 | 69 | # output test data directory path 70 | - name: output-test-csv 71 | value: test_data.csv 72 | 73 | # output model 74 | - name: output-model 75 | value: hub_classifier.h5 76 | 77 | # output vectorized descriptions 78 | - name: output-vectorized-descriptions 79 | value: vectorized_descriptions.pckl 80 | 81 | # output raw data directory path 82 | - name: output-raw-csv 83 | value: hub_stackshare_combined_v2.csv 84 | 85 | # selected categories 86 | - name: selected-categories 87 | value: "devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation" 88 | 89 | #The container image to use in the workflow 90 | - name: registry 91 | value: "" 92 | 93 | #The container image to use in the workflow 94 | - name: image-name 95 | value: data-team/base:latest 96 | 97 | templates: 98 | ################################## 99 | # Define the steps of the workflow 100 | ################################## 101 | - name: default 102 | steps: 103 | - - name: import-data 104 | template: import-data 105 | - - name: process-data 106 | template: process-data 107 | - - name: training 108 | template: training 109 | - - name: build-push-image 110 | template: build-push-image 111 | - - name: deploy-model 112 | template: deploy-model 113 | 114 | ################################################# 115 | # Import / Unzip 116 | # Imports the input data & docker certs and unpack them. 
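# Note: the S3 input artifact is downloaded to {{workflow.parameters.mount-path}} and gunzipped here,
# producing the raw CSV ({{workflow.parameters.output-raw-csv}}) that the process-data step reads from
# the shared workdir volume.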
117 | ################################################# 118 | - name: import-data 119 | container: 120 | image: alpine:latest 121 | command: [sh, -c] 122 | args: [ 123 | "gzip -d {{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz", #mkdir {{workflow.parameters.mount-path}}/docker-cert-bundle; unzip {{workflow.parameters.mount-path}}/docker-cert-bundle.zip -d {{workflow.parameters.mount-path}}/docker-cert-bundle", 124 | ] 125 | volumeMounts: 126 | - name: workdir 127 | mountPath: "{{workflow.parameters.mount-path}}/" 128 | inputs: 129 | artifacts: 130 | - name: data 131 | path: "{{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz" 132 | s3: 133 | endpoint: s3.amazonaws.com 134 | bucket: "{{workflow.parameters.bucket}}" 135 | key: "{{workflow.parameters.input-data-key}}" 136 | accessKeySecret: 137 | name: s3-credentials 138 | key: accessKey 139 | secretKeySecret: 140 | name: s3-credentials 141 | key: secretKey 142 | # - name: docker-cert-bundle 143 | # path: "{{workflow.parameters.mount-path}}/docker-cert-bundle.zip" 144 | # s3: 145 | # endpoint: s3.amazonaws.com 146 | # bucket: "{{workflow.parameters.bucket}}" 147 | # key: "{{workflow.parameters.docker-cert-key}}" 148 | # accessKeySecret: 149 | # name: s3-credentials 150 | # key: accessKey 151 | # secretKeySecret: 152 | # name: s3-credentials 153 | # key: secretKey 154 | outputs: 155 | artifacts: 156 | - name: raw-csv 157 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-raw-csv}}" 158 | 159 | ######################################################################### 160 | # Process Data 161 | ######################################################################### 162 | - name: process-data 163 | container: 164 | image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}" 165 | imagePullPolicy: "IfNotPresent" 166 | command: [sh, -c] 167 | args: 168 | [ 169 | "python /src/process_data.py --mount_path {{workflow.parameters.mount-path}} --input_csv {{workflow.parameters.output-raw-csv}} --output_train_csv {{workflow.parameters.output-train-csv}} --output_test_csv {{workflow.parameters.output-test-csv}} --test_size {{workflow.parameters.test-size}} --selected_categories '{{workflow.parameters.selected-categories}}'", 170 | ] 171 | volumeMounts: 172 | - name: workdir 173 | mountPath: "{{workflow.parameters.mount-path}}/" 174 | outputs: 175 | artifacts: 176 | - name: output-train-csv 177 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-train-csv}}" 178 | - name: output-test-csv 179 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-test-csv}}" 180 | - name: selected-categories 181 | path: "{{workflow.parameters.mount-path}}/selected_categories.pckl" 182 | 183 | ####################################### 184 | # Training and ML model extraction 185 | ####################################### 186 | - name: training 187 | container: 188 | image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}" 189 | imagePullPolicy: "IfNotPresent" 190 | command: [sh, -c] 191 | args: 192 | [ 193 | "python /src/train.py --mount_path {{workflow.parameters.mount-path}} --input_train_csv {{workflow.parameters.output-train-csv}} --input_test_csv {{workflow.parameters.output-test-csv}} --output_model {{workflow.parameters.output-model}} --output_vectorized_descriptions {{workflow.parameters.output-vectorized-descriptions}};cp /src/models/* {{workflow.parameters.mount-path}}/", 194 | ] 195 | volumeMounts: 196 | - name: workdir 197 | 
mountPath: "{{workflow.parameters.mount-path}}/" 198 | outputs: 199 | artifacts: 200 | - name: output-model 201 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-model}}" 202 | - name: output-vectorized-descriptions 203 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-vectorized-descriptions}}" 204 | 205 | ####################################### 206 | # Build and push a docker image using the Seldon-Core Docker wrapper 207 | ####################################### 208 | - name: build-push-image 209 | container: 210 | image: docker:17.10 211 | command: [sh, -c] 212 | args: 213 | [ 214 | "cd {{workflow.parameters.mount-path}};sleep 15;rm *.csv;docker run -v {{workflow.parameters.mount-path}}:/model seldonio/core-python-wrapper:0.7 /model DockerHubClassification {{workflow.parameters.model-version}} {{workflow.parameters.registry}}{{workflow.parameters.username}} --base-image=python:3.6 --image-name=dockerhubclassifier;cd build/;./build_image.sh;echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin;./push_image.sh;", 215 | ] 216 | volumeMounts: 217 | - name: workdir 218 | mountPath: "{{workflow.parameters.mount-path}}/" 219 | env: 220 | - name: DOCKER_HOST #the docker daemon can be access on the standard port on localhost 221 | value: 127.0.0.1 222 | - name: DOCKER_USERNAME # name of env var 223 | valueFrom: 224 | secretKeyRef: 225 | name: docker-credentials # name of an existing k8s secret 226 | key: username # 'key' subcomponent of the secret 227 | - name: DOCKER_PASSWORD # name of env var 228 | valueFrom: 229 | secretKeyRef: 230 | name: docker-credentials # name of an existing k8s secret 231 | key: password # 'key' subcomponent of the secret 232 | sidecars: 233 | - name: dind 234 | image: docker:17.10-dind #Docker already provides an image for running a Docker daemon 235 | securityContext: 236 | privileged: true #the Docker daemon can only run in a privileged container 237 | # mirrorVolumeMounts will mount the same volumes specified in the main container 238 | # to the sidecar (including artifacts), at the same mountPaths. This enables 239 | # dind daemon to (partially) see the same filesystem as the main container in 240 | # order to use features such as docker volume binding. 241 | mirrorVolumeMounts: true 242 | 243 | ####################################### 244 | # Deploy model 245 | ####################################### 246 | - name: deploy-model 247 | resource: #indicates that this is a resource template 248 | action: apply #can be any kubectl action (e.g. create, delete, apply, patch) 249 | #successCondition: ? 
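# The manifest below is applied with kubectl; the SeldonDeployment CRD installed in step 3 of the
# README then creates the model-serving pods (3 replicas) behind Ambassador.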
250 | manifest: | 251 | apiVersion: "machinelearning.seldon.io/v1alpha2" 252 | kind: "SeldonDeployment" 253 | metadata: 254 | labels: 255 | app: "seldon" 256 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 257 | namespace: kubeflow 258 | spec: 259 | annotations: 260 | deployment_version: "{{workflow.parameters.model-version}}" 261 | project_name: "Docker Hub ML Project" 262 | name: "docker-hub-classifier" 263 | predictors: 264 | - annotations: 265 | predictor_version: "{{workflow.parameters.model-version}}" 266 | componentSpecs: 267 | - spec: 268 | containers: 269 | - image: "{{workflow.parameters.registry}}{{workflow.parameters.username}}/dockerhubclassifier:{{workflow.parameters.model-version}}" 270 | imagePullPolicy: "Always" 271 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 272 | graph: 273 | children: [] 274 | endpoint: 275 | type: "REST" 276 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 277 | type: "MODEL" 278 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 279 | replicas: 3 280 | -------------------------------------------------------------------------------- /base/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6 2 | 3 | RUN apt-get update -y 4 | RUN apt-get install -y python-pip python-dev build-essential 5 | 6 | COPY /requirements.txt /tmp/ 7 | RUN cd /tmp && \ 8 | pip install --no-cache-dir -r requirements.txt 9 | 10 | # copy python scripts 11 | COPY ./src /src 12 | -------------------------------------------------------------------------------- /base/Makefile: -------------------------------------------------------------------------------- 1 | IMAGE_NAME := data-team/base 2 | IMAGE_TAG := $(VERSION)-$(CHANNEL) 3 | 4 | ifeq ($(IMAGE_TAG),-) 5 | IMAGE_TAG := latest 6 | endif 7 | 8 | default: release 9 | 10 | build: ## Build the container without caching 11 | docker build --no-cache -t $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG) . 12 | 13 | release: build publish ## Make a release by building and publishing tagged image to Docker Trusted Registry (DTR) 14 | 15 | publish: ## Publish image to DTR 16 | @echo 'publish $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG)' 17 | docker push $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG) 18 | -------------------------------------------------------------------------------- /base/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow 2 | scikit-learn>=0.18 3 | pandas 4 | keras 5 | nltk -------------------------------------------------------------------------------- /base/src/data/hub_stackshare_combined_v2.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/base/src/data/hub_stackshare_combined_v2.csv.gz -------------------------------------------------------------------------------- /base/src/fetch_gihub_data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import pandas as pd 4 | from github import Github 5 | 6 | LOGIN = os.environ['GITHUB_LOGIN'] 7 | PASSWORD = os.environ['GITHUB_PASSWORD'] 8 | 9 | api = Github(LOGIN, PASSWORD) 10 | 11 | # Parsing flags. 
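# Hypothetical invocation (GITHUB_LOGIN and GITHUB_PASSWORD must be set in the environment; the topic is just an example):
#   python fetch_gihub_data.py --topic_name machine-learning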
12 | parser = argparse.ArgumentParser() 13 | parser.add_argument("--topic_name") 14 | args = parser.parse_args() 15 | print(args) 16 | 17 | result = api.search_repositories( 18 | 'topic:{}'.format(args.topic_name), sort='stars') 19 | 20 | columns = ('repo_name', 'html_url', 21 | 'description', 'topics') 22 | 23 | 24 | # loop through 1,000 repositories order by the number of stars 25 | data = [] 26 | for index in range(0, 33): 27 | for item in result.get_page(index): 28 | if item.description and not item.fork: 29 | try: 30 | data.append([item.name, item.html_url, item.description.encode('utf-8'), 31 | item.get_topics()]) 32 | except Exception as e: 33 | print(e) 34 | break 35 | 36 | # create pandas dataframe 37 | df = pd.DataFrame(data, columns=columns) 38 | # save to CSV 39 | df.to_csv('{}_output.csv'.format(args.topic_name), index=False) 40 | print('file {}_output.csv created'.format(args.topic_name)) 41 | -------------------------------------------------------------------------------- /base/src/models/DockerHubClassification.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import re 3 | import numpy as np 4 | from keras.models import load_model 5 | 6 | 7 | class DockerHubClassification(object): 8 | def __init__(self): 9 | self.model = load_model('hub_classifier.h5') 10 | self.model._make_predict_function() 11 | self.selected_categories = pickle.load( 12 | open('selected_categories.pckl', 'rb')) 13 | self.tokenizer = pickle.load( 14 | open('vectorized_descriptions.pckl', 'rb')) 15 | 16 | def _clean_string(self, text): 17 | text = re.sub('[!@#$,.)(=*`]', '', text) 18 | return text.lower() 19 | 20 | def _predict_labels(self, text): 21 | labels = [] 22 | description = self._clean_string(str(text)) 23 | description_matrix = self.tokenizer.texts_to_matrix([ 24 | description], mode='tfidf') 25 | preds = self.model.predict( 26 | description_matrix, batch_size=None, verbose=1, steps=None) 27 | 28 | preds[preds > 0.4] = 1 29 | preds[preds < 0.4] = 0 30 | 31 | for c in range(len(self.selected_categories)): 32 | if preds[0][c] == 1: 33 | labels.append(self.selected_categories[c]) 34 | return labels 35 | 36 | def predict(self, text, features_names): 37 | return np.array([self._predict_labels(text)]) 38 | -------------------------------------------------------------------------------- /base/src/models/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow 2 | keras -------------------------------------------------------------------------------- /base/src/process_data.py: -------------------------------------------------------------------------------- 1 | from nltk.corpus import stopwords 2 | import ast 3 | import argparse 4 | import pickle 5 | import pandas as pd 6 | import nltk 7 | import re 8 | from sklearn.model_selection import train_test_split 9 | 10 | # Parsing flags. 
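# Hypothetical invocation, mirroring the defaults in argo_workflow.yaml (the category list is abbreviated here):
#   python process_data.py --mount_path /mnt/workspace/data --input_csv hub_stackshare_combined_v2.csv \
#     --output_train_csv train_data.csv --output_test_csv test_data.csv --test_size 0.2 \
#     --selected_categories 'devops,databases,web servers'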
11 | parser = argparse.ArgumentParser() 12 | parser.add_argument("--input_csv") 13 | parser.add_argument("--mount_path") 14 | parser.add_argument("--output_train_csv") 15 | parser.add_argument("--output_test_csv") 16 | parser.add_argument("--selected_categories") 17 | parser.add_argument("--test_size") 18 | args = parser.parse_args() 19 | print(args) 20 | 21 | 22 | # Load data from CVS 23 | raw_data = pd.read_csv('{}/{}'.format(args.mount_path, args.input_csv)) 24 | 25 | # Remove records with empty descriptions 26 | data = raw_data[raw_data.FULL_DESCRIPTION.notna()] 27 | # Lower case 28 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.lower() 29 | # Remove punctuation 30 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('[^\w\s]', '') 31 | # Remove numbers 32 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\d+', '') 33 | # Remove `\n` and `\t` characters 34 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\n', ' ') 35 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\t', ' ') 36 | # Remove long strings (len() > 24) 37 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\w{24,}', '') 38 | # Remove urls 39 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('http\w+', '') 40 | # Remove extra spaces 41 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.strip() 42 | data.FULL_DESCRIPTION.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True) 43 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace(' +', ' ') 44 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.replace('\s+', ' ', regex=True) 45 | # Drop duplicates 46 | data = data.drop_duplicates() 47 | 48 | # build-test-deploy,languages & frameworks,databases,web servers,application utilities,devops 49 | # operating systems,business tools,continuous integration,message queue,support-sales-and-marketing 50 | selected_categories = args.selected_categories.split(',') 51 | 52 | for col in selected_categories: 53 | data[col] = 0 54 | 55 | for index, row in data.iterrows(): 56 | labels = row.labels 57 | labels = ast.literal_eval(labels) 58 | labels = [e.strip()for e in labels] 59 | for l in labels: 60 | if l in selected_categories: 61 | data.loc[index, l] = 1 62 | data = data.dropna(axis=1) 63 | 64 | # Remove stopwords 65 | nltk.download('stopwords') 66 | stop_words = set(stopwords.words('english')) 67 | stop_words.update([ 68 | 'default', 69 | 'image', 70 | 'docker', 71 | 'container', 72 | 'service', 73 | 'production', 74 | 'dockerfile', 75 | 'dockercompose', 76 | 'build', 77 | 'latest', 78 | 'file', 79 | 'tag', 80 | 'instance', 81 | 'run', 82 | 'running', 83 | 'use', 84 | 'will', 85 | 'work', 86 | 'please', 87 | 'install', 88 | 'tags', 89 | 'version', 90 | 'create', 91 | 'want', 92 | 'need', 93 | 'used', 94 | 'well', 95 | 'user', 96 | 'release', 97 | 'config', 98 | 'dir', 99 | 'support', 100 | 'exec', 101 | 'github', 102 | 'rm', 103 | 'mkdir', 104 | 'env', 105 | 'folder', 106 | 'http', 107 | 'repo', 108 | 'cd', 109 | 'ssh', 110 | 'root']) 111 | 112 | re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I) 113 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.apply( 114 | lambda sentence: re_stop_words.sub(" ", sentence)) 115 | 116 | # Split data into test and train datasets 117 | train, test = train_test_split( 118 | data, test_size=float(args.test_size), shuffle=True) 119 | 120 | # Print stats about the shape of the data. 
121 | print('Train: {:,} rows {:,} columns'.format(train.shape[0], train.shape[1])) 122 | print('Test: {:,} rows {:,} columns'.format(test.shape[0], test.shape[1])) 123 | 124 | # save output as CSV. 125 | train.to_csv('{}/{}'.format(args.mount_path, 126 | args.output_train_csv), index=False) 127 | test.to_csv('{}/{}'.format(args.mount_path, 128 | args.output_test_csv), index=False) 129 | # save list of categories 130 | f2 = open('{}/selected_categories.pckl'.format(args.mount_path), 'wb') 131 | pickle.dump(selected_categories, f2) 132 | -------------------------------------------------------------------------------- /base/src/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import pickle 3 | import pandas as pd 4 | from keras.layers import Dense 5 | from keras.models import Sequential 6 | from keras import layers 7 | from keras.preprocessing.text import Tokenizer 8 | 9 | 10 | # Parsing flags. 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument("--mount_path") 13 | parser.add_argument("--input_train_csv") 14 | parser.add_argument("--input_test_csv") 15 | parser.add_argument("--output_model") 16 | parser.add_argument("--output_vectorized_descriptions") 17 | parser.add_argument("--loss", default="binary_crossentropy") 18 | parser.add_argument("--batch_size", default=100) 19 | parser.add_argument("--epochs", default=15) 20 | parser.add_argument("--validation_split", default=0.1) 21 | args = parser.parse_args() 22 | print(args) 23 | 24 | # Load data from CVS 25 | train_data = pd.read_csv( 26 | '{}/{}'.format(args.mount_path, args.input_train_csv)) 27 | test_data = pd.read_csv( 28 | '{}/{}'.format(args.mount_path, args.input_test_csv)) 29 | 30 | # Remove records with empty descriptions 31 | train_data = train_data[train_data.FULL_DESCRIPTION.notna()] 32 | test_data = test_data[test_data.FULL_DESCRIPTION.notna()] 33 | 34 | # Extract full description from datasets 35 | train_text = train_data.FULL_DESCRIPTION.tolist() 36 | test_text = test_data.FULL_DESCRIPTION.tolist() 37 | 38 | t = Tokenizer() 39 | t.fit_on_texts(train_text + test_text) 40 | 41 | # integer encode documents 42 | x_train = t.texts_to_matrix(train_text, mode='tfidf') 43 | x_test = t.texts_to_matrix(test_text, mode='tfidf') 44 | 45 | # Remove unnecessary columns from the datasets 46 | y_train = train_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME', 47 | 'DESCRIPTION', 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1) 48 | y_test = test_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME', 'DESCRIPTION', 49 | 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1) 50 | 51 | # KERAS MODEL 52 | n_cols = x_train.shape[1] 53 | model = Sequential() 54 | 55 | # input layer of 70 neurons: 56 | model.add(Dense(70, activation='relu', input_shape=(n_cols,))) 57 | 58 | # output layer of 82 neurons: 59 | model.add(Dense(82, activation='sigmoid')) 60 | 61 | # determining optimizer, loss, and metrics: 62 | model.compile(optimizer='adam', loss=args.loss, 63 | metrics=['binary_accuracy']) 64 | 65 | history = model.fit( 66 | x_train, y_train, batch_size=int(args.batch_size), 67 | epochs=int(args.epochs), verbose=0, validation_split=float(args.validation_split)) 68 | score, acc = model.evaluate( 69 | x_test, y_test, batch_size=int(args.batch_size), verbose=0) 70 | print("Score/Loss: ", score) 71 | print("Accuracy: ", acc) 72 | # save model 73 | model.save('{}/{}'.format(args.mount_path, args.output_model)) 74 | # save tokenizer 75 | f1 = 
open('{}/{}'.format(args.mount_path, 76 | args.output_vectorized_descriptions), 'wb') 77 | pickle.dump(t, f1) 78 | -------------------------------------------------------------------------------- /d4m_menu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/d4m_menu.png -------------------------------------------------------------------------------- /seldon_core.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/seldon_core.png --------------------------------------------------------------------------------