├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── arch.png ├── argo_workflow.png ├── argo_workflow.yaml ├── base ├── Dockerfile ├── Makefile ├── requirements.txt └── src │ ├── data │ └── hub_stackshare_combined_v2.csv.gz │ ├── fetch_gihub_data.py │ ├── models │ ├── DockerHubClassification.py │ └── requirements.txt │ ├── process_data.py │ └── train.py ├── d4m_menu.png └── seldon_core.png /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contribution guidelines 2 | 3 | Since this project is intended to support a specific use case, contributions are limited to bug fixes or security issues. If you have a question, feel free to open an issue! 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. 
For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Sample for DockerCon EU 2018 2 | 3 | --- 4 | 5 | # End-to-End Machine Learning Pipeline with Docker for Desktop and Kubeflow 6 | 7 | This project is a simple example of an automated end-to-end machine learning pipeline using Docker Desktop and Kubeflow. 
8 | 9 | ![workflow](argo_workflow.png?raw=true) 10 | 11 | ## Architecture 12 | 13 | ![arch](arch.png?raw=true) 14 | 15 | ### Seldon-Core Architecture 16 | 17 | ![seldon-core](seldon_core.png?raw=true) 18 | 19 | ## Getting started 20 | 21 | ### Requirements: 22 | 23 | - [Docker Desktop](https://www.docker.com/products/docker-desktop) for Mac or Windows. 24 | - Increase the memory configuration from 2GB to 8GB (Preferences->Advanced->Memory). 25 |

26 | - [ksonnet](https://ksonnet.io/#get-started) version 0.11.0 or later. 27 | - [Argo](https://github.com/argoproj/argo/blob/master/demo.md) 28 | 29 | ### Steps: 30 | 31 | #### 1. Build the base image used in the [Argo](https://argoproj.github.io/) workflow 32 | 33 | ``` 34 | $ git clone https://github.com/dockersamples/docker-hub-ml-project 35 | $ cd docker-hub-ml-project 36 | $ cd base && make build; cd .. 37 | ``` 38 | 39 | #### 2. Install [Kubeflow](https://www.kubeflow.org/) 40 | 41 | ``` 42 | $ export BASE_PATH=$(pwd) 43 | $ export KUBEFLOW_TAG=master 44 | $ curl https://raw.githubusercontent.com/kubeflow/kubeflow/master/scripts/download.sh | bash 45 | $ ${BASE_PATH}/scripts/kfctl.sh init ks_app --platform docker-for-desktop 46 | $ cd ks_app 47 | $ ../scripts/kfctl.sh generate k8s 48 | $ ../scripts/kfctl.sh apply k8s 49 | ``` 50 | 51 | OK, now let's make sure everything is up and running. 52 | 53 | ``` 54 | # switch to the kubeflow namespace 55 | $ kubectl config set-context docker-for-desktop --namespace=kubeflow 56 | $ kubectl get pods 57 | 58 | NAME READY STATUS RESTARTS AGE 59 | ambassador-677dd9d8f4-hmwc7 1/1 Running 0 4m 60 | ambassador-677dd9d8f4-jkmb6 1/1 Running 0 4m 61 | ambassador-677dd9d8f4-lcc8m 1/1 Running 0 4m 62 | argo-ui-7b8fff579c-kqbdl 1/1 Running 0 2m 63 | centraldashboard-f8d7d97fb-6zk9v 1/1 Running 0 3m 64 | jupyter-0 1/1 Running 0 3m 65 | metacontroller-0 1/1 Running 0 2m 66 | minio-84969865c4-hjl9f 1/1 Running 0 2m 67 | ml-pipeline-5cf4db85f5-qskvt 1/1 Running 1 2m 68 | ml-pipeline-persistenceagent-748666fdcb-vnjvr 1/1 Running 0 2m 69 | ml-pipeline-scheduledworkflow-5bf775c8c4-8rtwf 1/1 Running 0 2m 70 | ml-pipeline-ui-59f8cbbb86-vjqmv 1/1 Running 0 2m 71 | mysql-c4c4c8f69-wfl99 1/1 Running 0 2m 72 | spartakus-volunteer-74cb649fb9-w277v 1/1 Running 0 2m 73 | tf-job-dashboard-6b95c47f8-qkf5w 1/1 Running 0 3m 74 | tf-job-operator-v1beta1-75587897bb-4zcwp 1/1 Running 0 3m 75 | workflow-controller-59c7967f59-wx426 1/1 Running 0 2m 76 | ``` 77 | 78 | If you can see the pods above running, you should be able to access http://localhost:8080/hub to create your Jupyter notebook instances. 79 | 80 | #### 3. Deploy Seldon-Core's model serving infrastructure 81 | 82 | The custom resource definition (CRD) and its controller are installed using the Seldon ksonnet prototype: 83 | 84 | ``` 85 | $ export NAMESPACE=kubeflow 86 | $ cd ks_app 87 | # Gives cluster-admin role to the default service account in the ${NAMESPACE} 88 | $ kubectl create clusterrolebinding seldon-admin --clusterrole=cluster-admin --serviceaccount=${NAMESPACE}:default 89 | # Install the kubeflow/seldon package 90 | $ ks pkg install kubeflow/seldon 91 | # Generate the seldon component and deploy it 92 | $ ks generate seldon seldon --name=seldon 93 | $ ks apply default -c seldon 94 | ``` 95 | 96 | Seldon Core provides an example Helm analytics chart that displays the Prometheus metrics in Grafana. You can install it with: 98 | 99 | ``` 99 | $ helm install seldon-core-analytics --name seldon-core-analytics --set grafana_prom_admin_password= --set persistence.enabled=false --repo https://storage.googleapis.com/seldon-charts --namespace kubeflow 100 | ``` 101 | 
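Before moving on, you can run a quick sanity check that the Seldon operator started (a minimal check; the exact pod name prefix may vary with the Seldon Core version):

```
$ kubectl -n kubeflow get pods | grep seldon
```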
102 | #### 4. Set up the credentials for the machine learning pipeline 103 | 104 | Configure AWS S3 and Docker credentials on your Kubernetes cluster: 105 | 106 | ``` 107 | # s3-credentials 108 | $ kubectl create secret generic s3-credentials --from-literal=accessKey= --from-literal=secretKey= 109 | # docker-credentials 110 | $ kubectl create secret generic docker-credentials --from-literal=username= --from-literal=password= 111 | ``` 112 | 113 | You can upload our sample data located in `base/src/data/hub_stackshare_combined_v2.csv.gz` to your S3 bucket. 114 | 115 | #### 5. Submit the Argo workflow 116 | 117 | This process will perform the following steps: 118 | 119 | - Import data sources 120 | - Process data (clean-up & normalization) 121 | - Split data between training and test datasets 122 | - Train the model using Keras 123 | - Build and push a Docker image using the [Seldon-Core](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md) wrapper 124 | - Deploy the model with 3 replicas 125 | 126 | Before submitting the Argo job, make sure you adjust the parameter values for your environment. You can access the `Argo UI` here: http://localhost:8080/argo/workflows. 127 | 128 | **Required fields:** 129 | 130 | - `bucket`: S3 bucket name (e.g. `ml-project-2018`) 131 | - `input-data-key`: Path to the S3 input data file (e.g. `data/hub_stackshare_combined_v2.csv.gz`) 132 | 133 | Add the Argo `artifactRepository` configuration for S3: 134 | 135 | ``` 136 | $ kubectl edit configmap workflow-controller-configmap 137 | # update the `data` field with the content below 138 | data: 139 | config: | 140 | executorImage: argoproj/argoexec:v2.2.0 141 | artifactRepository: 142 | s3: 143 | bucket: docker-metrics-backups 144 | endpoint: s3.amazonaws.com #AWS => s3.amazonaws.com; GCS => storage.googleapis.com 145 | accessKeySecret: #omit if accessing via AWS IAM 146 | name: s3-credentials 147 | key: accessKey 148 | secretKeySecret: #omit if accessing via AWS IAM 149 | name: s3-credentials 150 | key: secretKey 151 | # save the new configuration and exit vim 152 | configmap "workflow-controller-configmap" edited 153 | ``` 154 | 155 | Now let's submit the Argo workflow and monitor its execution from the browser (http://localhost:8080/argo/workflows). You can access each step's artifacts directly from the UI; they are also stored in S3. 
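Besides the browser UI, you can also follow the run with the Argo CLI once it has been submitted (a small sketch; the workflow name is a placeholder for the one printed by `argo submit` below):

```
$ argo list
$ argo get <workflow-name>
$ argo watch <workflow-name>
```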
156 | 157 | ``` 158 | $ cd ${BASE_PATH} 159 | $ argo submit argo_workflow.yaml -p bucket="bucket-test1" -p input-data-key="hub_stackshare_combined_v2.csv.gz" 160 | 161 | Name: docker-hub-classificationmcwz7 162 | Namespace: kubeflow 163 | ServiceAccount: default 164 | Status: Pending 165 | Created: Fri Nov 30 10:07:53 -0800 (now) 166 | Parameters: 167 | registry: 168 | model-version: v3 169 | replicas: 3 170 | bucket: 171 | input-data-key: 172 | docker-cert-key: 173 | mount-path: /mnt/workspace/data 174 | loss: binary_crossentropy 175 | test-size: 0.2 176 | batch-size: 100 177 | epochs: 15 178 | validation-split: 0.1 179 | output-train-csv: train_data.csv 180 | output-test-csv: test_data.csv 181 | output-model: hub_classifier.h5 182 | output-vectorized-descriptions: vectorized_descriptions.pckl 183 | output-raw-csv: hub_stackshare_combined_v2.csv 184 | selected-categories: devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation 185 | ``` 186 | 187 | > All the Argo workflow parameters can be overwritten via the CLI using the `-p` flag. 188 | 189 | ## Repo Layout 190 | 191 | ``` 192 | . 193 | ├── README.md 194 | ├── argo_workflow.png 195 | ├── argo_workflow.yaml 196 | └── base 197 | ├── Dockerfile 198 | ├── Makefile 199 | ├── requirements.txt 200 | └── src 201 | ├── data 202 | │   └── hub_stackshare_combined_v2.csv.gz 203 | ├── fetch_gihub_data.py 204 | ├── models 205 | │   ├── DockerHubClassification.py 206 | │   └── requirements.txt 207 | ├── process_data.py 208 | └── train.py 209 | ``` 210 | 211 | ## Open Source Projects Used 212 | 213 | #### [Ambassador](https://www.getambassador.io/) 214 | 215 | API Gateway based on envoy proxy. It allows you to do self-service publishing and canary deployments. 
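Ambassador is also the entry point for the model that the workflow's `deploy-model` step creates, so once the workflow has finished you can exercise the classifier through it. The request below is only a hedged sketch: the route prefix is derived from the SeldonDeployment name and can differ between Seldon Core versions and `model-version` values, and the payload follows the generic Seldon REST contract.

```
# Hypothetical prediction request through Ambassador (adjust the deployment name/route to match your setup)
$ curl -s -X POST http://localhost:8080/seldon/docker-hub-classification-model-serving-v1/api/v0.1/predictions \
    -H "Content-Type: application/json" \
    -d '{"data": {"ndarray": [["lightweight web server for serving static content"]]}}'
```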
216 | 217 | #### [Tensorflow](https://www.tensorflow.org/) 218 | 219 | Machine learning framework 220 | 221 | #### [Jupyter Hub](https://jupyterhub.readthedocs.io/en/stable/) 222 | 223 | Multi-user server for Jupyter notebooks 224 | 225 | #### [Seldon Core](https://www.seldon.io/) 226 | 227 | Platform for deploying ML models 228 | 229 | #### [Argo](https://argoproj.github.io/) 230 | 231 | Container-native workflow management (CI/CD) 232 | 233 | #### [Prometheus](https://prometheus.io/) 234 | 235 | Monitoring & Alerting platform 236 | 237 | #### [Grafana](https://grafana.com/) 238 | 239 | Open platform for analytics and monitoring. It provides the UI for data visualization. 240 | 241 | #### [Kubernetes](https://kubernetes.io/) 242 | 243 | Open-source system for automating deployment, scaling, and management of containerized applications. 244 | -------------------------------------------------------------------------------- /arch.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/arch.png -------------------------------------------------------------------------------- /argo_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/argo_workflow.png -------------------------------------------------------------------------------- /argo_workflow.yaml: -------------------------------------------------------------------------------- 1 | apiVersion: argoproj.io/v1alpha1 2 | kind: Workflow 3 | metadata: 4 | generateName: docker-hub-classification 5 | spec: 6 | entrypoint: default 7 | 8 | # Create a volume for containers to store their output data. 9 | volumeClaimTemplates: 10 | - metadata: 11 | name: workdir 12 | spec: 13 | accessModes: ["ReadWriteOnce"] 14 | resources: 15 | requests: 16 | storage: 10Gi 17 | 18 | # Arguments of the workflow 19 | arguments: 20 | parameters: 21 | # model version 22 | - name: model-version 23 | value: "v1" 24 | 25 | # The name of the S3 bucket where the data is stored. 26 | - name: bucket 27 | value: "" 28 | 29 | # Docker registry username 30 | - name: username 31 | value: "dev" 32 | 33 | # The path to the input data in the S3 bucket, in csv.gz format 34 | - name: input-data-key 35 | value: 36 | 37 | # The path to the Docker cert bundle in the S3 bucket (e.g. Docker UCP bundle) 38 | - name: docker-cert-key 39 | value: 40 | 41 | # mount path 42 | - name: mount-path 43 | value: /mnt/workspace/data 44 | 45 | # loss function 46 | - name: loss 47 | value: binary_crossentropy 48 | 49 | # Percentage of the dataset used to test the model (e.g. 
0.2 == 20%) 50 | - name: test-size 51 | value: 0.2 52 | 53 | # batch size 54 | - name: batch-size 55 | value: 100 56 | 57 | # number of epochs 58 | - name: epochs 59 | value: 15 60 | 61 | # validation split 62 | - name: validation-split 63 | value: 0.1 64 | 65 | # output train data directory path 66 | - name: output-train-csv 67 | value: train_data.csv 68 | 69 | # output test data directory path 70 | - name: output-test-csv 71 | value: test_data.csv 72 | 73 | # output model 74 | - name: output-model 75 | value: hub_classifier.h5 76 | 77 | # output vectorized descriptions 78 | - name: output-vectorized-descriptions 79 | value: vectorized_descriptions.pckl 80 | 81 | # output raw data directory path 82 | - name: output-raw-csv 83 | value: hub_stackshare_combined_v2.csv 84 | 85 | # selected categories 86 | - name: selected-categories 87 | value: "devops,build-test-deploy,languages & frameworks,data stores,programming languages,application hosting,databases,web servers,application utilities,support-sales-and-marketing,operating systems,monitoring tools,continuous integration,self-hosted blogging / cms,open source service discovery,message queue,frameworks (full stack),in-memory databases,crm,search as a service,log management,monitoring,collaboration,virtual machine platforms & containers,server configuration and automation,big data tools,database tools,machine learning tools,code collaboration & version_control,load balancer / reverse proxy,web cache,java build tools,search engines,container tools,package managers,project management,infrastructure build tools,static site generators,code review,microframeworks (backend),assets and media,version control system,front end package manager,headless browsers,data science notebooks,ecommerce,background processing,cross-platform mobile development,issue tracking,analytics,secrets management,text editor,graph databases,cluster management,exception monitoring,business tools,business intelligence,localhost tools,realtime backend / api,microservices tools,chatops,git tools,hosted package repository,js build tools / js task runners,libraries,platform as a service,general analytics,group chat & notifications,browser testing,serverless / task processing,css pre-processors / extensions,image processing and management,integrated development environment,stream processing,cross-platform desktop development,continuous deployment,machine learning,data science,monitoring metrics,metrics,continuous delivery,build automation" 88 | 89 | #The container image to use in the workflow 90 | - name: registry 91 | value: "" 92 | 93 | #The container image to use in the workflow 94 | - name: image-name 95 | value: data-team/base:latest 96 | 97 | templates: 98 | ################################## 99 | # Define the steps of the workflow 100 | ################################## 101 | - name: default 102 | steps: 103 | - - name: import-data 104 | template: import-data 105 | - - name: process-data 106 | template: process-data 107 | - - name: training 108 | template: training 109 | - - name: build-push-image 110 | template: build-push-image 111 | - - name: deploy-model 112 | template: deploy-model 113 | 114 | ################################################# 115 | # Import / Unzip 116 | # Imports the input data & docker certs and unpack them. 
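# Note: the S3 input artifact is downloaded to {{workflow.parameters.mount-path}} and gunzipped here,
# producing the raw CSV ({{workflow.parameters.output-raw-csv}}) that the process-data step reads from
# the shared workdir volume.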
117 | ################################################# 118 | - name: import-data 119 | container: 120 | image: alpine:latest 121 | command: [sh, -c] 122 | args: [ 123 | "gzip -d {{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz", #mkdir {{workflow.parameters.mount-path}}/docker-cert-bundle; unzip {{workflow.parameters.mount-path}}/docker-cert-bundle.zip -d {{workflow.parameters.mount-path}}/docker-cert-bundle", 124 | ] 125 | volumeMounts: 126 | - name: workdir 127 | mountPath: "{{workflow.parameters.mount-path}}/" 128 | inputs: 129 | artifacts: 130 | - name: data 131 | path: "{{workflow.parameters.mount-path}}/hub_stackshare_combined_v2.csv.gz" 132 | s3: 133 | endpoint: s3.amazonaws.com 134 | bucket: "{{workflow.parameters.bucket}}" 135 | key: "{{workflow.parameters.input-data-key}}" 136 | accessKeySecret: 137 | name: s3-credentials 138 | key: accessKey 139 | secretKeySecret: 140 | name: s3-credentials 141 | key: secretKey 142 | # - name: docker-cert-bundle 143 | # path: "{{workflow.parameters.mount-path}}/docker-cert-bundle.zip" 144 | # s3: 145 | # endpoint: s3.amazonaws.com 146 | # bucket: "{{workflow.parameters.bucket}}" 147 | # key: "{{workflow.parameters.docker-cert-key}}" 148 | # accessKeySecret: 149 | # name: s3-credentials 150 | # key: accessKey 151 | # secretKeySecret: 152 | # name: s3-credentials 153 | # key: secretKey 154 | outputs: 155 | artifacts: 156 | - name: raw-csv 157 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-raw-csv}}" 158 | 159 | ######################################################################### 160 | # Process Data 161 | ######################################################################### 162 | - name: process-data 163 | container: 164 | image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}" 165 | imagePullPolicy: "IfNotPresent" 166 | command: [sh, -c] 167 | args: 168 | [ 169 | "python /src/process_data.py --mount_path {{workflow.parameters.mount-path}} --input_csv {{workflow.parameters.output-raw-csv}} --output_train_csv {{workflow.parameters.output-train-csv}} --output_test_csv {{workflow.parameters.output-test-csv}} --test_size {{workflow.parameters.test-size}} --selected_categories '{{workflow.parameters.selected-categories}}'", 170 | ] 171 | volumeMounts: 172 | - name: workdir 173 | mountPath: "{{workflow.parameters.mount-path}}/" 174 | outputs: 175 | artifacts: 176 | - name: output-train-csv 177 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-train-csv}}" 178 | - name: output-test-csv 179 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-test-csv}}" 180 | - name: selected-categories 181 | path: "{{workflow.parameters.mount-path}}/selected_categories.pckl" 182 | 183 | ####################################### 184 | # Training and ML model extraction 185 | ####################################### 186 | - name: training 187 | container: 188 | image: "{{workflow.parameters.registry}}{{workflow.parameters.image-name}}" 189 | imagePullPolicy: "IfNotPresent" 190 | command: [sh, -c] 191 | args: 192 | [ 193 | "python /src/train.py --mount_path {{workflow.parameters.mount-path}} --input_train_csv {{workflow.parameters.output-train-csv}} --input_test_csv {{workflow.parameters.output-test-csv}} --output_model {{workflow.parameters.output-model}} --output_vectorized_descriptions {{workflow.parameters.output-vectorized-descriptions}};cp /src/models/* {{workflow.parameters.mount-path}}/", 194 | ] 195 | volumeMounts: 196 | - name: workdir 197 | 
mountPath: "{{workflow.parameters.mount-path}}/" 198 | outputs: 199 | artifacts: 200 | - name: output-model 201 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-model}}" 202 | - name: output-vectorized-descriptions 203 | path: "{{workflow.parameters.mount-path}}/{{workflow.parameters.output-vectorized-descriptions}}" 204 | 205 | ####################################### 206 | # Build and push a docker image using the Seldon-Core Docker wrapper 207 | ####################################### 208 | - name: build-push-image 209 | container: 210 | image: docker:17.10 211 | command: [sh, -c] 212 | args: 213 | [ 214 | "cd {{workflow.parameters.mount-path}};sleep 15;rm *.csv;docker run -v {{workflow.parameters.mount-path}}:/model seldonio/core-python-wrapper:0.7 /model DockerHubClassification {{workflow.parameters.model-version}} {{workflow.parameters.registry}}{{workflow.parameters.username}} --base-image=python:3.6 --image-name=dockerhubclassifier;cd build/;./build_image.sh;echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin;./push_image.sh;", 215 | ] 216 | volumeMounts: 217 | - name: workdir 218 | mountPath: "{{workflow.parameters.mount-path}}/" 219 | env: 220 | - name: DOCKER_HOST #the docker daemon can be access on the standard port on localhost 221 | value: 127.0.0.1 222 | - name: DOCKER_USERNAME # name of env var 223 | valueFrom: 224 | secretKeyRef: 225 | name: docker-credentials # name of an existing k8s secret 226 | key: username # 'key' subcomponent of the secret 227 | - name: DOCKER_PASSWORD # name of env var 228 | valueFrom: 229 | secretKeyRef: 230 | name: docker-credentials # name of an existing k8s secret 231 | key: password # 'key' subcomponent of the secret 232 | sidecars: 233 | - name: dind 234 | image: docker:17.10-dind #Docker already provides an image for running a Docker daemon 235 | securityContext: 236 | privileged: true #the Docker daemon can only run in a privileged container 237 | # mirrorVolumeMounts will mount the same volumes specified in the main container 238 | # to the sidecar (including artifacts), at the same mountPaths. This enables 239 | # dind daemon to (partially) see the same filesystem as the main container in 240 | # order to use features such as docker volume binding. 241 | mirrorVolumeMounts: true 242 | 243 | ####################################### 244 | # Deploy model 245 | ####################################### 246 | - name: deploy-model 247 | resource: #indicates that this is a resource template 248 | action: apply #can be any kubectl action (e.g. create, delete, apply, patch) 249 | #successCondition: ? 
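# The manifest below is applied with kubectl; the SeldonDeployment CRD installed in step 3 of the
# README then creates the model-serving pods (3 replicas) behind Ambassador.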
250 | manifest: | 251 | apiVersion: "machinelearning.seldon.io/v1alpha2" 252 | kind: "SeldonDeployment" 253 | metadata: 254 | labels: 255 | app: "seldon" 256 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 257 | namespace: kubeflow 258 | spec: 259 | annotations: 260 | deployment_version: "{{workflow.parameters.model-version}}" 261 | project_name: "Docker Hub ML Project" 262 | name: "docker-hub-classifier" 263 | predictors: 264 | - annotations: 265 | predictor_version: "{{workflow.parameters.model-version}}" 266 | componentSpecs: 267 | - spec: 268 | containers: 269 | - image: "{{workflow.parameters.registry}}{{workflow.parameters.username}}/dockerhubclassifier:{{workflow.parameters.model-version}}" 270 | imagePullPolicy: "Always" 271 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 272 | graph: 273 | children: [] 274 | endpoint: 275 | type: "REST" 276 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 277 | type: "MODEL" 278 | name: "docker-hub-classification-model-serving-{{workflow.parameters.model-version}}" 279 | replicas: 3 280 | -------------------------------------------------------------------------------- /base/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.6 2 | 3 | RUN apt-get update -y 4 | RUN apt-get install -y python-pip python-dev build-essential 5 | 6 | COPY /requirements.txt /tmp/ 7 | RUN cd /tmp && \ 8 | pip install --no-cache-dir -r requirements.txt 9 | 10 | # copy python scripts 11 | COPY ./src /src 12 | -------------------------------------------------------------------------------- /base/Makefile: -------------------------------------------------------------------------------- 1 | IMAGE_NAME := data-team/base 2 | IMAGE_TAG := $(VERSION)-$(CHANNEL) 3 | 4 | ifeq ($(IMAGE_TAG),-) 5 | IMAGE_TAG := latest 6 | endif 7 | 8 | default: release 9 | 10 | build: ## Build the container without caching 11 | docker build --no-cache -t $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG) . 12 | 13 | release: build publish ## Make a release by building and publishing tagged image to Docker Trusted Registry (DTR) 14 | 15 | publish: ## Publish image to DTR 16 | @echo 'publish $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG)' 17 | docker push $(REGISTRY)$(IMAGE_NAME):$(IMAGE_TAG) 18 | -------------------------------------------------------------------------------- /base/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow 2 | scikit-learn>=0.18 3 | pandas 4 | keras 5 | nltk -------------------------------------------------------------------------------- /base/src/data/hub_stackshare_combined_v2.csv.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/base/src/data/hub_stackshare_combined_v2.csv.gz -------------------------------------------------------------------------------- /base/src/fetch_gihub_data.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import pandas as pd 4 | from github import Github 5 | 6 | LOGIN = os.environ['GITHUB_LOGIN'] 7 | PASSWORD = os.environ['GITHUB_PASSWORD'] 8 | 9 | api = Github(LOGIN, PASSWORD) 10 | 11 | # Parsing flags. 
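# Hypothetical invocation (GITHUB_LOGIN and GITHUB_PASSWORD must be set in the environment; the topic is just an example):
#   python fetch_gihub_data.py --topic_name machine-learning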
12 | parser = argparse.ArgumentParser() 13 | parser.add_argument("--topic_name") 14 | args = parser.parse_args() 15 | print(args) 16 | 17 | result = api.search_repositories( 18 | 'topic:{}'.format(args.topic_name), sort='stars') 19 | 20 | columns = ('repo_name', 'html_url', 21 | 'description', 'topics') 22 | 23 | 24 | # loop through 1,000 repositories order by the number of stars 25 | data = [] 26 | for index in range(0, 33): 27 | for item in result.get_page(index): 28 | if item.description and not item.fork: 29 | try: 30 | data.append([item.name, item.html_url, item.description.encode('utf-8'), 31 | item.get_topics()]) 32 | except Exception as e: 33 | print(e) 34 | break 35 | 36 | # create pandas dataframe 37 | df = pd.DataFrame(data, columns=columns) 38 | # save to CSV 39 | df.to_csv('{}_output.csv'.format(args.topic_name), index=False) 40 | print('file {}_output.csv created'.format(args.topic_name)) 41 | -------------------------------------------------------------------------------- /base/src/models/DockerHubClassification.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import re 3 | import numpy as np 4 | from keras.models import load_model 5 | 6 | 7 | class DockerHubClassification(object): 8 | def __init__(self): 9 | self.model = load_model('hub_classifier.h5') 10 | self.model._make_predict_function() 11 | self.selected_categories = pickle.load( 12 | open('selected_categories.pckl', 'rb')) 13 | self.tokenizer = pickle.load( 14 | open('vectorized_descriptions.pckl', 'rb')) 15 | 16 | def _clean_string(self, text): 17 | text = re.sub('[!@#$,.)(=*`]', '', text) 18 | return text.lower() 19 | 20 | def _predict_labels(self, text): 21 | labels = [] 22 | description = self._clean_string(str(text)) 23 | description_matrix = self.tokenizer.texts_to_matrix([ 24 | description], mode='tfidf') 25 | preds = self.model.predict( 26 | description_matrix, batch_size=None, verbose=1, steps=None) 27 | 28 | preds[preds > 0.4] = 1 29 | preds[preds < 0.4] = 0 30 | 31 | for c in range(len(self.selected_categories)): 32 | if preds[0][c] == 1: 33 | labels.append(self.selected_categories[c]) 34 | return labels 35 | 36 | def predict(self, text, features_names): 37 | return np.array([self._predict_labels(text)]) 38 | -------------------------------------------------------------------------------- /base/src/models/requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow 2 | keras -------------------------------------------------------------------------------- /base/src/process_data.py: -------------------------------------------------------------------------------- 1 | from nltk.corpus import stopwords 2 | import ast 3 | import argparse 4 | import pickle 5 | import pandas as pd 6 | import nltk 7 | import re 8 | from sklearn.model_selection import train_test_split 9 | 10 | # Parsing flags. 
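# Hypothetical invocation, mirroring the defaults in argo_workflow.yaml (the category list is abbreviated here):
#   python process_data.py --mount_path /mnt/workspace/data --input_csv hub_stackshare_combined_v2.csv \
#     --output_train_csv train_data.csv --output_test_csv test_data.csv --test_size 0.2 \
#     --selected_categories 'devops,databases,web servers'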
11 | parser = argparse.ArgumentParser() 12 | parser.add_argument("--input_csv") 13 | parser.add_argument("--mount_path") 14 | parser.add_argument("--output_train_csv") 15 | parser.add_argument("--output_test_csv") 16 | parser.add_argument("--selected_categories") 17 | parser.add_argument("--test_size") 18 | args = parser.parse_args() 19 | print(args) 20 | 21 | 22 | # Load data from CVS 23 | raw_data = pd.read_csv('{}/{}'.format(args.mount_path, args.input_csv)) 24 | 25 | # Remove records with empty descriptions 26 | data = raw_data[raw_data.FULL_DESCRIPTION.notna()] 27 | # Lower case 28 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.lower() 29 | # Remove punctuation 30 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('[^\w\s]', '') 31 | # Remove numbers 32 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\d+', '') 33 | # Remove `\n` and `\t` characters 34 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\n', ' ') 35 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\t', ' ') 36 | # Remove long strings (len() > 24) 37 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('\w{24,}', '') 38 | # Remove urls 39 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace('http\w+', '') 40 | # Remove extra spaces 41 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.strip() 42 | data.FULL_DESCRIPTION.replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True) 43 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.str.replace(' +', ' ') 44 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.replace('\s+', ' ', regex=True) 45 | # Drop duplicates 46 | data = data.drop_duplicates() 47 | 48 | # build-test-deploy,languages & frameworks,databases,web servers,application utilities,devops 49 | # operating systems,business tools,continuous integration,message queue,support-sales-and-marketing 50 | selected_categories = args.selected_categories.split(',') 51 | 52 | for col in selected_categories: 53 | data[col] = 0 54 | 55 | for index, row in data.iterrows(): 56 | labels = row.labels 57 | labels = ast.literal_eval(labels) 58 | labels = [e.strip()for e in labels] 59 | for l in labels: 60 | if l in selected_categories: 61 | data.loc[index, l] = 1 62 | data = data.dropna(axis=1) 63 | 64 | # Remove stopwords 65 | nltk.download('stopwords') 66 | stop_words = set(stopwords.words('english')) 67 | stop_words.update([ 68 | 'default', 69 | 'image', 70 | 'docker', 71 | 'container', 72 | 'service', 73 | 'production', 74 | 'dockerfile', 75 | 'dockercompose', 76 | 'build', 77 | 'latest', 78 | 'file', 79 | 'tag', 80 | 'instance', 81 | 'run', 82 | 'running', 83 | 'use', 84 | 'will', 85 | 'work', 86 | 'please', 87 | 'install', 88 | 'tags', 89 | 'version', 90 | 'create', 91 | 'want', 92 | 'need', 93 | 'used', 94 | 'well', 95 | 'user', 96 | 'release', 97 | 'config', 98 | 'dir', 99 | 'support', 100 | 'exec', 101 | 'github', 102 | 'rm', 103 | 'mkdir', 104 | 'env', 105 | 'folder', 106 | 'http', 107 | 'repo', 108 | 'cd', 109 | 'ssh', 110 | 'root']) 111 | 112 | re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I) 113 | data.FULL_DESCRIPTION = data.FULL_DESCRIPTION.apply( 114 | lambda sentence: re_stop_words.sub(" ", sentence)) 115 | 116 | # Split data into test and train datasets 117 | train, test = train_test_split( 118 | data, test_size=float(args.test_size), shuffle=True) 119 | 120 | # Print stats about the shape of the data. 
121 | print('Train: {:,} rows {:,} columns'.format(train.shape[0], train.shape[1])) 122 | print('Test: {:,} rows {:,} columns'.format(test.shape[0], test.shape[1])) 123 | 124 | # save output as CSV. 125 | train.to_csv('{}/{}'.format(args.mount_path, 126 | args.output_train_csv), index=False) 127 | test.to_csv('{}/{}'.format(args.mount_path, 128 | args.output_test_csv), index=False) 129 | # save list of categories 130 | f2 = open('{}/selected_categories.pckl'.format(args.mount_path), 'wb') 131 | pickle.dump(selected_categories, f2) 132 | -------------------------------------------------------------------------------- /base/src/train.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import pickle 3 | import pandas as pd 4 | from keras.layers import Dense 5 | from keras.models import Sequential 6 | from keras import layers 7 | from keras.preprocessing.text import Tokenizer 8 | 9 | 10 | # Parsing flags. 11 | parser = argparse.ArgumentParser() 12 | parser.add_argument("--mount_path") 13 | parser.add_argument("--input_train_csv") 14 | parser.add_argument("--input_test_csv") 15 | parser.add_argument("--output_model") 16 | parser.add_argument("--output_vectorized_descriptions") 17 | parser.add_argument("--loss", default="binary_crossentropy") 18 | parser.add_argument("--batch_size", default=100) 19 | parser.add_argument("--epochs", default=15) 20 | parser.add_argument("--validation_split", default=0.1) 21 | args = parser.parse_args() 22 | print(args) 23 | 24 | # Load data from CVS 25 | train_data = pd.read_csv( 26 | '{}/{}'.format(args.mount_path, args.input_train_csv)) 27 | test_data = pd.read_csv( 28 | '{}/{}'.format(args.mount_path, args.input_test_csv)) 29 | 30 | # Remove records with empty descriptions 31 | train_data = train_data[train_data.FULL_DESCRIPTION.notna()] 32 | test_data = test_data[test_data.FULL_DESCRIPTION.notna()] 33 | 34 | # Extract full description from datasets 35 | train_text = train_data.FULL_DESCRIPTION.tolist() 36 | test_text = test_data.FULL_DESCRIPTION.tolist() 37 | 38 | t = Tokenizer() 39 | t.fit_on_texts(train_text + test_text) 40 | 41 | # integer encode documents 42 | x_train = t.texts_to_matrix(train_text, mode='tfidf') 43 | x_test = t.texts_to_matrix(test_text, mode='tfidf') 44 | 45 | # Remove unnecessary columns from the datasets 46 | y_train = train_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME', 47 | 'DESCRIPTION', 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1) 48 | y_test = test_data.drop(labels=['index', 'FULL_DESCRIPTION', 'NAME', 'DESCRIPTION', 49 | 'PULL_COUNT', 'CATEGORY1', 'CATEGORY2', 'labels'], axis=1) 50 | 51 | # KERAS MODEL 52 | n_cols = x_train.shape[1] 53 | model = Sequential() 54 | 55 | # input layer of 70 neurons: 56 | model.add(Dense(70, activation='relu', input_shape=(n_cols,))) 57 | 58 | # output layer of 82 neurons: 59 | model.add(Dense(82, activation='sigmoid')) 60 | 61 | # determining optimizer, loss, and metrics: 62 | model.compile(optimizer='adam', loss=args.loss, 63 | metrics=['binary_accuracy']) 64 | 65 | history = model.fit( 66 | x_train, y_train, batch_size=int(args.batch_size), 67 | epochs=int(args.epochs), verbose=0, validation_split=float(args.validation_split)) 68 | score, acc = model.evaluate( 69 | x_test, y_test, batch_size=int(args.batch_size), verbose=0) 70 | print("Score/Loss: ", score) 71 | print("Accuracy: ", acc) 72 | # save model 73 | model.save('{}/{}'.format(args.mount_path, args.output_model)) 74 | # save tokenizer 75 | f1 = 
open('{}/{}'.format(args.mount_path, 76 | args.output_vectorized_descriptions), 'wb') 77 | pickle.dump(t, f1) 78 | -------------------------------------------------------------------------------- /d4m_menu.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/d4m_menu.png -------------------------------------------------------------------------------- /seldon_core.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dockersamples/docker-hub-ml-project/91190862efbc7c9c0e117f7455f12086bf003cf1/seldon_core.png --------------------------------------------------------------------------------