├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── RESOURCE_GUIDE.md
├── cats
│   ├── DATAFLOW_TUTORIAL.md
│   ├── LICENSE
│   ├── README.md
│   ├── nn_demo_part1.ipynb
│   ├── nn_demo_part2.ipynb
│   ├── run_step_2a_query.sh
│   ├── run_step_2b_get_images.sh
│   ├── run_step_3_split_images.sh
│   ├── setup.py
│   ├── step_0_to_0.ipynb
│   ├── step_1_to_3.ipynb
│   ├── step_2a_query.sql
│   ├── step_2b_get_images.py
│   ├── step_3_split_images.py
│   ├── step_4_to_4_part1.ipynb
│   ├── step_4_to_4_part2.ipynb
│   ├── step_5_to_6_part1.ipynb
│   ├── step_5_to_8_part2.ipynb
│   └── step_8_to_9.ipynb
├── cleanup_notebooks.py
├── imgs
│   ├── step_2b.png
│   └── step_3.png
├── requirements.txt
├── setup_step_1_cloud_project.sh
└── setup_step_2_install_software.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | *.idea*
2 | *.pyc*
3 | *.egg-info*
4 | *nyc_dataprep*
5 | *~
6 | *DS_Store
7 | *.bmp
8 | *.ipynb_checkpoints*
9 | *.tar.gz
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # How to Contribute
2 |
3 | While this is meant as a basic tutorial with fixed content, if you would like
4 | to make additions or revisions for the purpose of better teaching data science
5 | on Google Cloud, or if you have fixes for Cloud environment setups, we'd love to
6 | accept your patches and contributions to this project. There are a few small
7 | guidelines you need to follow.
8 |
9 | ## Contributor License Agreement
10 |
11 | Contributions to this project must be accompanied by a Contributor License
12 | Agreement. You (or your employer) retain the copyright to your contribution;
13 | this simply gives us permission to use and redistribute your contributions as
14 | part of the project. Head over to <https://cla.developers.google.com/> to see
15 | your current agreements on file or to sign a new one.
16 |
17 | You generally only need to submit a CLA once, so if you've already submitted one
18 | (even if it was for a different project), you probably don't need to do it
19 | again.
20 |
21 | ## Code reviews
22 |
23 | All submissions, including submissions by project members, require review. We
24 | use GitHub pull requests for this purpose. Consult
25 | [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
26 | information on using pull requests.
27 |
28 | ## Notebooks
29 |
30 | For notebook contributions or edits, after making proper changes, please shut
31 | down your notebook editor (e.g. Jupyter, Colab, etc.), and run the python script
32 | in the project's base directory as the FINAL STEP to ensure that outputs and
33 | other editor-specific metadata are removed. This ensures that diffs between
34 | notebooks are clearly marked during code review.
35 |
36 | ```
37 | python cleanup_notebooks.py
38 | ```
39 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Apache License
3 | Version 2.0, January 2004
4 | http://www.apache.org/licenses/
5 |
6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7 |
8 | 1. Definitions.
9 |
10 | "License" shall mean the terms and conditions for use, reproduction,
11 | and distribution as defined by Sections 1 through 9 of this document.
12 |
13 | "Licensor" shall mean the copyright owner or entity authorized by
14 | the copyright owner that is granting the License.
15 |
16 | "Legal Entity" shall mean the union of the acting entity and all
17 | other entities that control, are controlled by, or are under common
18 | control with that entity. For the purposes of this definition,
19 | "control" means (i) the power, direct or indirect, to cause the
20 | direction or management of such entity, whether by contract or
21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
22 | outstanding shares, or (iii) beneficial ownership of such entity.
23 |
24 | "You" (or "Your") shall mean an individual or Legal Entity
25 | exercising permissions granted by this License.
26 |
27 | "Source" form shall mean the preferred form for making modifications,
28 | including but not limited to software source code, documentation
29 | source, and configuration files.
30 |
31 | "Object" form shall mean any form resulting from mechanical
32 | transformation or translation of a Source form, including but
33 | not limited to compiled object code, generated documentation,
34 | and conversions to other media types.
35 |
36 | "Work" shall mean the work of authorship, whether in Source or
37 | Object form, made available under the License, as indicated by a
38 | copyright notice that is included in or attached to the work
39 | (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object
42 | form, that is based on (or derived from) the Work and for which the
43 | editorial revisions, annotations, elaborations, or other modifications
44 | represent, as a whole, an original work of authorship. For the purposes
45 | of this License, Derivative Works shall not include works that remain
46 | separable from, or merely link (or bind by name) to the interfaces of,
47 | the Work and Derivative Works thereof.
48 |
49 | "Contribution" shall mean any work of authorship, including
50 | the original version of the Work and any modifications or additions
51 | to that Work or Derivative Works thereof, that is intentionally
52 | submitted to Licensor for inclusion in the Work by the copyright owner
53 | or by an individual or Legal Entity authorized to submit on behalf of
54 | the copyright owner. For the purposes of this definition, "submitted"
55 | means any form of electronic, verbal, or written communication sent
56 | to the Licensor or its representatives, including but not limited to
57 | communication on electronic mailing lists, source code control systems,
58 | and issue tracking systems that are managed by, or on behalf of, the
59 | Licensor for the purpose of discussing and improving the Work, but
60 | excluding communication that is conspicuously marked or otherwise
61 | designated in writing by the copyright owner as "Not a Contribution."
62 |
63 | "Contributor" shall mean Licensor and any individual or Legal Entity
64 | on behalf of whom a Contribution has been received by Licensor and
65 | subsequently incorporated within the Work.
66 |
67 | 2. Grant of Copyright License. Subject to the terms and conditions of
68 | this License, each Contributor hereby grants to You a perpetual,
69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70 | copyright license to reproduce, prepare Derivative Works of,
71 | publicly display, publicly perform, sublicense, and distribute the
72 | Work and such Derivative Works in Source or Object form.
73 |
74 | 3. Grant of Patent License. Subject to the terms and conditions of
75 | this License, each Contributor hereby grants to You a perpetual,
76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77 | (except as stated in this section) patent license to make, have made,
78 | use, offer to sell, sell, import, and otherwise transfer the Work,
79 | where such license applies only to those patent claims licensable
80 | by such Contributor that are necessarily infringed by their
81 | Contribution(s) alone or by combination of their Contribution(s)
82 | with the Work to which such Contribution(s) was submitted. If You
83 | institute patent litigation against any entity (including a
84 | cross-claim or counterclaim in a lawsuit) alleging that the Work
85 | or a Contribution incorporated within the Work constitutes direct
86 | or contributory patent infringement, then any patent licenses
87 | granted to You under this License for that Work shall terminate
88 | as of the date such litigation is filed.
89 |
90 | 4. Redistribution. You may reproduce and distribute copies of the
91 | Work or Derivative Works thereof in any medium, with or without
92 | modifications, and in Source or Object form, provided that You
93 | meet the following conditions:
94 |
95 | (a) You must give any other recipients of the Work or
96 | Derivative Works a copy of this License; and
97 |
98 | (b) You must cause any modified files to carry prominent notices
99 | stating that You changed the files; and
100 |
101 | (c) You must retain, in the Source form of any Derivative Works
102 | that You distribute, all copyright, patent, trademark, and
103 | attribution notices from the Source form of the Work,
104 | excluding those notices that do not pertain to any part of
105 | the Derivative Works; and
106 |
107 | (d) If the Work includes a "NOTICE" text file as part of its
108 | distribution, then any Derivative Works that You distribute must
109 | include a readable copy of the attribution notices contained
110 | within such NOTICE file, excluding those notices that do not
111 | pertain to any part of the Derivative Works, in at least one
112 | of the following places: within a NOTICE text file distributed
113 | as part of the Derivative Works; within the Source form or
114 | documentation, if provided along with the Derivative Works; or,
115 | within a display generated by the Derivative Works, if and
116 | wherever such third-party notices normally appear. The contents
117 | of the NOTICE file are for informational purposes only and
118 | do not modify the License. You may add Your own attribution
119 | notices within Derivative Works that You distribute, alongside
120 | or as an addendum to the NOTICE text from the Work, provided
121 | that such additional attribution notices cannot be construed
122 | as modifying the License.
123 |
124 | You may add Your own copyright statement to Your modifications and
125 | may provide additional or different license terms and conditions
126 | for use, reproduction, or distribution of Your modifications, or
127 | for any such Derivative Works as a whole, provided Your use,
128 | reproduction, and distribution of the Work otherwise complies with
129 | the conditions stated in this License.
130 |
131 | 5. Submission of Contributions. Unless You explicitly state otherwise,
132 | any Contribution intentionally submitted for inclusion in the Work
133 | by You to the Licensor shall be under the terms and conditions of
134 | this License, without any additional terms or conditions.
135 | Notwithstanding the above, nothing herein shall supersede or modify
136 | the terms of any separate license agreement you may have executed
137 | with Licensor regarding such Contributions.
138 |
139 | 6. Trademarks. This License does not grant permission to use the trade
140 | names, trademarks, service marks, or product names of the Licensor,
141 | except as required for reasonable and customary use in describing the
142 | origin of the Work and reproducing the content of the NOTICE file.
143 |
144 | 7. Disclaimer of Warranty. Unless required by applicable law or
145 | agreed to in writing, Licensor provides the Work (and each
146 | Contributor provides its Contributions) on an "AS IS" BASIS,
147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 | implied, including, without limitation, any warranties or conditions
149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 | PARTICULAR PURPOSE. You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 |
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 |
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 |
177 | END OF TERMS AND CONDITIONS
178 |
179 | APPENDIX: How to apply the Apache License to your work.
180 |
181 | To apply the Apache License to your work, attach the following
182 | boilerplate notice, with the fields enclosed by brackets "[]"
183 | replaced with your own identifying information. (Don't include
184 | the brackets!) The text should be enclosed in the appropriate
185 | comment syntax for the file format. We also recommend that a
186 | file or class name and description of purpose be included on the
187 | same "printed page" as the copyright notice for easier
188 | identification within third-party archives.
189 |
190 | Copyright [yyyy] [name of copyright owner]
191 |
192 | Licensed under the Apache License, Version 2.0 (the "License");
193 | you may not use this file except in compliance with the License.
194 | You may obtain a copy of the License at
195 |
196 | http://www.apache.org/licenses/LICENSE-2.0
197 |
198 | Unless required by applicable law or agreed to in writing, software
199 | distributed under the License is distributed on an "AS IS" BASIS,
200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201 | See the License for the specific language governing permissions and
202 | limitations under the License.
203 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Step-by-Step Deep Learning Tutorial
2 |
3 | Authors: Cassie Kozyrkov (@kozyrkov) and Brian Foo (@bkungfoo)
4 |
5 | Team: Google Applied AI
6 |
7 | ## Follow along with us!
8 |
9 | Take a look at these **walkthrough slides** with screenshots to guide you along:
10 |
11 | [github.com/kozyrkov/deep-learning-walkthrough](https://github.com/kozyrkov/deep-learning-walkthrough)
12 |
13 | Bonus: slides contain ML hints, summaries, and pitfall alerts.
14 |
15 | ## Create and Setup Cloud Project
16 |
17 | **This tutorial is meant to run fully on the Google Cloud Platform.**
18 |
19 | Starting with your web browser, do the following:
20 |
21 | * Open a browser and sign up for a [google account](https://accounts.google.com).
22 | * Sign into your account.
23 | * [Create a new GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects)
24 | and [enable billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project).
25 |
26 | ### Editing Resource Quota
27 |
28 | **Why edit resource quota?** To complete this tutorial, you will need more computing resources than your
29 | Google Cloud Platform account has access to by default. One reason that accounts start out with limits on
30 | resources is that this protects users from being billed unexpectedly for the more expensive options.
31 | For more information, see the [quotas documentation page](https://cloud.google.com/compute/quotas).
32 |
33 | For this project, we will require several types of resources:
34 |
35 | * **Compute Engine VM for Data Science:** We will create a VM that you will log
36 | into to do most of your work and run notebooks. You will have the option of
37 | creating a VM with your choice of GPU, or no GPU. A GPU is strongly recommended
38 | for deep network training because it dramatically cuts down the time to
39 | completion. However, the cost is also higher. Please refer to the
40 | [resource guide](RESOURCE_GUIDE.md) for a brief discussion and comparison of
41 | performance and costs.
42 |
43 | * **Cloud resources:** We will be using Dataflow to run distributed
44 | preprocessing jobs. Thus, we need to extend quotas on Cloud resources, such as
45 | CPUs, IP addresses, and total disk space.
46 |
47 | We will be setting quotas for these two types of resources. Note that quotas
48 | only determine the maximum amount of a resource that your project is allowed to
49 | request! They do not mean that your jobs will necessarily use this amount,
50 | only that you are permitted to use up to this amount. The general
51 | guideline is to set quotas high enough that there is no need to readjust them
52 | for compute-intensive tasks.
53 |
54 | **Quota Setup Instructions:**
55 |
56 | To set up the data science VM, we will need to extend the quota for GPUs.
57 | * Select your project from the list on the
58 | [resource quota page](https://console.cloud.google.com/iam-admin/quotas).
59 | * (If this is the first time creating the project, compute engine may still
60 | need to boot up. If the quota page does not have GPU options, click on
61 | "Compute Engine" in the dropdown menu on the top left, and click quota there.
62 | Wait for it to load, and return to the quota page above.)
63 | * If you would like to try out a GPU machine (recommended),
64 | [find a region that has gpu support](https://cloud.google.com/compute/docs/gpus/).
65 | At the time this tutorial was written, valid regions include
66 | us-east1, us-west1, europe-west1, and asia-east1.
67 | * Select your chosen region from the Region dropdown menu.
68 | Then select the following:
69 | * NVIDIA K80 GPUs
70 | * NVIDIA P100 GPUs
71 | * Click "+ edit quotas" at the top of the page. Change the selected fields to
72 | the following values:
73 | * NVIDIA K80 GPUs: 1
74 | * NVIDIA P100 GPUs: 1
75 | * Follow the process to submit a request.
76 |
77 | To set up cloud resources for preprocessing jobs, follow a similar process as
78 | above to edit quotas:
79 | * Find a region with
80 | [Dataflow support](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#supported_regional_endpoints).
81 | At the time this tutorial was written, valid regions include us-central1 and
82 | europe-west1.
83 | * Select this region in the dropdown menu on the
84 | [resource quota page](https://console.cloud.google.com/iam-admin/quotas).
85 | * Change the following quotas:
86 | * CPUs: 400
87 | * In-use IP addresses: 100
88 | * Persistent Disk Standard (GB): 65536
89 | * Select region "Global" in the dropdown menu:
90 | * Change the following quotas:
91 | * CPUs (all regions): 400
92 |
93 | After you have completed these steps, you will need to wait until you receive
94 | an email approving the quota increases. Please note that you may be asked to
95 | provide credit card details to confirm these increases.
96 |
97 | ## Setup Cloud Project Components and API
98 |
99 | *Expected setup time: 5 minutes*
100 |
101 | Click on the ">_" icon at the top right of your web console to open a cloud
102 | shell. Inside the cloud shell, execute the following commands:
103 |
104 | ```
105 | git clone https://github.com/google-aai/sc17.git
106 | cd sc17
107 | ```
108 |
109 | If you happen to have the project files locally, you can also upload them by
110 | clicking on the 3 vertically arranged dots on the top right of the shell window,
111 | and then clicking "upload file".
112 |
113 | After you have the proper scripts uploaded, set permissions on the following
114 | script:
115 |
116 | ```
117 | chmod 777 setup_step_1_cloud_project.sh
118 | ```
119 |
120 | Then run the script to create storage, a dataset,
121 | and compute VMs for your project. (Note: running it with the "sh" command will
122 | fail, as "sh" lacks some syntax the script requires in the cloud shell environment.)
123 | ```
124 | ./setup_step_1_cloud_project.sh project-name [gpu-type] [compute-region] [dataflow-region]
125 | ```
126 |
127 | where:
128 | * [project-name] is the ID of the project you created (check the Cloud Dashboard for the ID extension if needed)
129 | * [gpu-type] (optional) is either None, K80, or P100
130 | (default: None)
131 | * [compute-region] (optional) is the region where you will create your data science VM
132 | (default: us-east1)
133 | * [dataflow-region] (optional) is where you will run dataflow preprocessing jobs
134 | (default: us-central1)
135 |
136 | If this is your first time setting up the project, you will be prompted for
137 | input while the script runs, such as selecting the number corresponding to
138 | your project. Enter what is needed to allow the script to continue running.
139 |
140 | If the script stops with an error message "ERROR [timestamp]: message" (e.g.
141 | quota limits are too low), use relevant parts of the error message to fix your
142 | project setting if needed, and rerun the script.
143 |
144 | ## Setting up your VM environment
145 |
146 | ### Library installations
147 |
148 | *Expected setup time: 15 minutes*
149 |
150 | From the
151 | [VM instances page](https://console.cloud.google.com/compute/instances),
152 | click the "SSH" text under "Connect" to connect to your compute VM instance.
153 | You may have to click twice if your browser auto-blocks pop-ups.
154 |
155 | In the new window, run git clone to download the project onto your VM, and cd
156 | into it:
157 |
158 | ```
159 | git clone https://github.com/google-aai/sc17.git
160 | cd sc17
161 | ```
162 |
163 | If you happen to have the project files locally, you can also upload them by
164 | opening your Storage Bucket from the GCP Console menu and dragging your local files
165 | over. Then in your VM window, download them from your storage bucket by running:
166 |
167 | ```
168 | gsutil cp gs://[bucket-name]/* .
169 | ```
170 |
171 | Note that tab-complete will work after ```gs://``` if you don't know your bucket name.
172 |
173 | After you have the script files downloaded to your VM, run the following script:
174 |
175 | ```
176 | sh setup_step_2_install_software.sh
177 | ```
178 |
179 | The script should set up opencv dependencies, python, virtualenv, and
180 | jupyter. It will also automatically detect the presence of
181 | an NVIDIA GPU and install/link CUDA libraries and tensorflow GPU if necessary.
182 | The script will also prompt you to **provide a password** at some
183 | point. This password is for connecting to jupyter from your web browser. Please
184 | take note of it since you'll be prompted to enter it when you start working in Jupyter.
185 |
186 | To complete and test the setup, reload bashrc to load the newly created virtual
187 | environment:
188 |
189 | ```
190 | . ~/.bashrc
191 | ```
192 |
193 | ## Use Unix Screen to Start a Notebook
194 |
195 | Screen takes a little getting used to, but it will make working on cloud VMs much
196 | more pleasant, especially with a project that needs to run many tasks!
197 |
198 | For those not familiar with the unix screen command, Screen is known as a
199 | "terminal multiplexer", which allows you to run multiple terminal (shell)
200 | instances at the same time in your ssh session. Furthermore, Screen sessions are
201 | NOT tied to your ssh session, which means that if you accidentally log out or
202 | disconnect from your ssh session in the middle of a long process running
203 | on your VM, you will not lose your work!
204 |
205 | Furthermore, you might want multiple processes running simultaneously and have
206 | an easy way to switch back and forth. A simple example is that you want to leave
207 | your Jupyter notebook open while running a Cloud Dataflow job (which you do not
208 | want abruptly canceled!). Running these in separate terminals is ideal.
209 |
210 | To start screen for the first time, run:
211 |
212 | ```
213 | screen
214 | ```
215 |
216 | and press return. This opens up a screen terminal (defaults to terminal 0).
217 |
218 | Let's create one more Screen terminal (terminal 1) by pressing `Ctrl-a`, and then
219 | `c` (we will write this shorthand as `Ctrl-a c`).
220 |
221 | You can now jump between the two terminals by using `Ctrl-a n`, or access them
222 | directly using `Ctrl-a 0` or `Ctrl-a 1`.
223 |
224 | Go to terminal 0 by typing `Ctrl-a 0`, and then type:
225 |
226 | ```
227 | jupyter notebook
228 | ```
229 |
230 | to start jupyter.
231 |
232 | Finally, detach from both Screen terminals by typing `Ctrl-a d`. If you want to
233 | resume the screen terminals, simply type:
234 |
235 | ```
236 | screen -R
237 | ```
238 |
239 | Fantastic! Now let's do another cool trick: Make sure you are detached from
240 | Screen terminals (type `Ctrl-a d` if necessary), and then exit the machine
241 | by typing:
242 |
243 | ```
244 | exit
245 | ```
246 |
247 | at the command line. You just exited the machine, but the Screen terminals are
248 | still running, including Jupyter, which you started in Screen!
249 |
250 | ## Connecting to Jupyter
251 |
252 | Jupyter is now running inside a Screen terminal even though your ssh session has
253 | ended. Let's try it out through an ssh tunnel (For security reasons, we will
254 | not simply open up a firewall port and show your notebook to the entire world!)
255 |
256 | On your local computer, make sure you have [gcloud sdk installed](https://cloud.google.com/sdk/downloads).
257 | Then run:
258 |
259 | ```
260 | gcloud init
261 | ```
262 |
263 | Follow the instructions and choose your project, and then choose the region
264 | corresponding to where your VM was created. After this has been set up, run:
265 |
266 | ```
267 | gcloud compute config-ssh
268 | ```
269 |
270 | After this runs successfully, you will get this back in your shell:
271 |
272 | ```
273 | You should now be able to use ssh/scp with your instances.
274 | For example, try running:
275 |
276 | $ ssh [instance-name].[zone-name].[project-name]
277 | ```
278 |
279 | Run the suggested command to check that ssh works when connecting to your cloud
280 | VM. Then exit the ssh shell by typing `exit`.
281 |
282 | Now we are ready to connect to Jupyter! Run the same ssh command again, but this
283 | time, add some flags and ports:
284 |
285 | ```
286 | ssh -N -f -L localhost:8888:localhost:5000 [instance-name].[zone-name].[project-name]
287 | ```
288 |
289 | This command configures port forwarding, redirecting port 5000 on your
290 | cloud VM to your own computer's port 8888. Now go to your web browser, and type:
291 |
292 | ```
293 | localhost:8888
294 | ```
295 |
296 | If you see a password page for Jupyter, enter your password as prompted.
297 | Once you are in, you can see the notebook view of the directory you
298 | started Jupyter in.
299 |
300 | Before proceeding, please read the [resource guide](RESOURCE_GUIDE.md) to learn
301 | about common pitfalls (such as forgetting to stop your VM when not using it!) and
302 | other ways to save on cost.
303 |
304 | **Hooray! [Let's go detect some cats!](cats/README.md)**
305 |
--------------------------------------------------------------------------------
/RESOURCE_GUIDE.md:
--------------------------------------------------------------------------------
1 | # Resource Guide
2 |
3 | This guide is meant to provide some resource, time, and money tradeoffs such
4 | that you can choose the best configuration for your data science experience!
5 |
6 | ## STOPPING VM INSTANCES
7 |
8 | **Make sure to stop your VM instance when you are not using it!**
9 |
10 | When you start a VM, GCP will charge you an hourly rate (see below). If you are
11 | not actively running tasks on the VM or using the interface, you should stop the
12 | instance to save money. To do so, you can either do it through the
13 | [web ui](https://console.cloud.google.com/compute/instances) by clicking on the
14 | vertical dots at the right of your instance and stopping it, or use the command
15 | line from cloud shell:
16 |
17 | ```
18 | gcloud compute instances stop [vm-instance-name]
19 | ```
20 |
21 | ## GPU or No GPU
22 |
23 | The NVIDIA Tesla K80 is significantly cheaper than the Tesla P100, but not
24 | nearly as powerful. The P100 can speed up deep network training by 10-20x
25 | compared to no GPU. If you plan to spend a lot of time visualizing and debugging
26 | data and less time running massive training jobs, it may be a good idea to start
27 | without a GPU. When you reach a step in the tutorial where training and
28 | optimizing deep neural nets is required, a GPU can save you a lot of time,
29 | and potentially money. To compare costs and performance, consider the
30 | [training task](cats/step_8_to_9.ipynb) used in the `cats`
31 | project:
32 |
33 | * **No GPU:** A 2 CPU VM costs around $0.10 per hour when active. Training on
34 | the full dataset in a notebook (which only runs on 1 CPU) takes about 8 hours.
35 | * **K80**: around $0.55 per hour when active. Training takes about 31 minutes.
36 | * **P100**: around $1.50 per hour when active. Training takes about 25 minutes.
37 |
38 | (It is possible that the default configuration does not optimize for P100, and
39 | further savings can be achieved.)
40 |
41 | ## Dataflow CPUs
42 |
43 | By default, our project is tuned to run tasks that will not incur more than a
44 | few dollars to your account, while ensuring that jobs complete quickly.
45 | If you would like to save money at the cost of extra time, dataflow jobs can be
46 | tuned using flags such as `--num_workers` and `--worker_machine_type`. The
47 | example below dispatches a job on 20 machines with a worker type that contains
48 | 4 cpus per machine:
49 |
50 | ```
51 | --num_workers 20 \
52 | --worker_machine_type n1-standard-4 \
53 | ```
54 |
55 | If you are running a lightweight task, such as a small sample of cat images,
56 | consider using fewer machines and/or changing the worker machine type to a
57 | simpler worker. This can reduce the resources required to boot up and set up
58 | machines, and reduce communication overhead between machines.
--------------------------------------------------------------------------------
/cats/DATAFLOW_TUTORIAL.md:
--------------------------------------------------------------------------------
1 | # Hello Dataflow Image Preprocessing!
2 |
3 | This tutorial walks through the main steps of 2b and 3 in the cats ML example.
4 | It goes over some of the higher level concepts and discusses in some detail how
5 | the dataflow pipelines are built.
6 |
7 | ## Quick Intro to Beam Pipelines
8 |
9 | Cloud Dataflow is a service that runs Apache Beam pipelines on GCP. A beam
10 | pipeline reads entries from a source, and streams data through numerous
11 | processing elements. In python, a processing element is marked by a **unique**
12 | string name (preceded by a "|") and a corresponding transformation process
13 | (preceded by a ">>"). A single pipeline will generally follow this pattern:
14 |
15 | ```
16 | with beam.Pipeline(options=pipeline_options) as p:
17 | (p
18 | | 'step_1_component_name'
19 | >> do_step_1()
20 | | 'step_2_component_name'
21 | >> do_step_2()
22 | ...
23 |     | 'step_n_component_name'
24 | >> do_step_n()
25 | )
26 | ```
27 |
28 | where `with beam.Pipeline(...) as p` creates a new pipeline, and each `do_step`
29 | corresponds to an Apache beam processing component.
30 |
31 | When exiting from the `with beam.Pipeline...` block, the pipeline will execute.
32 | Before then, the components are merely being defined within the block as
33 | part of the pipeline architecture. Essentially, pipelines are *lazy*. They wait
34 | for you to write code for all of the processing steps, and then they execute.
35 | We will show some real examples of this behavior below in step 2b and step 3.
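
To make this laziness concrete, here is a minimal toy pipeline (not part of the
cats project; the step names and data are made up for illustration) that you can
run locally with the default DirectRunner. Nothing executes while the steps are
being chained together inside the `with` block; the whole pipeline runs only
when the block exits:

```
import apache_beam as beam

# Defining the pipeline inside the "with" block only builds the architecture;
# execution happens when the block exits.
with beam.Pipeline() as p:
    _ = (p
         | 'create_numbers' >> beam.Create([1, 2, 3, 4])
         | 'square_numbers' >> beam.Map(lambda x: x * x)
         | 'keep_even_squares' >> beam.Filter(lambda x: x % 2 == 0)
         | 'print_results' >> beam.Map(print))
```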
36 |
37 | ## Step 2b: Preprocessing images
38 |
39 | Preprocessing is an important step for both traditional and deep ML.
40 |
41 | In traditional ML, in order to do meaningful feature extraction where resulting
42 | feature vector lengths are equal, images often have to be reshaped and resized
43 | to a specific size. For example, if you wanted to create a feature out of the
44 | mean values of pixels over 8x8 grids along each image, different image sizes
45 | would lead to different feature lengths.
46 |
47 | In deep ML on images, convolutional neural networks (CNNs) are often used as
48 | they can learn model layers at both local and global scales, which can then be
49 | aggregated to make an accurate prediction of the image content. CNNs also
50 | require images of the same size, since they are architected to learn models whose
51 | inputs are of a fixed dimension.
52 |
53 | Step 2b performs multiple steps to ensure that our images
54 | are in a centralized location (Cloud Storage), and that they have been resized
55 | in a manner that is satisfactory for our model building.
56 |
57 | ### Understanding the Beam Pipeline in Step 2b
58 |
59 | In the `step_2b_get_images.py` file, you see the following pipeline:
60 |
61 | ```
62 | bq_source = bigquery.BigQuerySource(query=query)
63 |
64 | with beam.Pipeline(options=pipeline_options) as p:
65 | _ = (p
66 | | 'read_rows_from_cat_info_table'
67 | >> beam.io.Read(bq_source)
68 | | 'fetch_images_from_urls'
69 | >> beam.Map(fetch_image_from_url)
70 | | 'filter_bad_or_absent_images'
71 | >> beam.Filter(filter_bad_or_missing_image)
72 | | 'resize_and_pad_images'
73 | >> beam.Map(resize_and_pad,
74 | output_image_dim=known_args.output_image_dim)
75 | | 'write_images_to_storage'
76 | >> beam.Map(write_processed_image,
77 | output_dir=output_dir)
78 | )
79 | ```
80 |
81 | We will discuss each step below.
82 |
83 | #### 'read_rows_from_cat_info_table'
84 | The **'read_rows_from_cat_info_table'** step reads rows from a source, in this
85 | case, a bigquery source. The bigquery source is defined by a sql query. In the
86 | default setup for the tutorial, the query passed into the BigQuerySource is:
87 |
88 | ```
89 | SELECT ROW_NUMBER() OVER() as index,
90 | original_url,
91 | label,
92 | randnum
93 | FROM [dataset.catinfo]
94 | ```
95 |
96 | The output of reading this source is a collection of dictionary elements, where
97 | each row read from the query is converted to a dictionary, with each key
98 | corresponding to a column name, and each value corresponding to the value in
99 | that column. A row from the bigquery source would look something like this:
100 |
101 | ```
102 | {
103 | 'index': 1234,
104 |     'original_url': 'http://path.to.picture.com/cat.jpg',
105 | 'label': 1,
106 | 'randnum': 0.2349
107 | }
108 | ```
109 |
110 | #### 'fetch_images_from_urls'
111 |
112 | This step is known as a **transformation** step and follows the syntax:
113 |
114 | ```
115 | beam.Map(some_fn)
116 | ```
117 |
118 | where `some_fn()` is a function that takes, as input, an element from the
119 | pipeline, and returns, as output, an element to transmit downstream to the next
120 | processor. In this step, `some_fn()` is given by
121 |
122 | ```
123 | def fetch_image_from_url(row):
124 | # return a dict with the url field replaced with actual image data
125 | ```
126 |
127 | The function parses out the 'original_url' field from the input row, downloads
128 | the actual image from the url, and decodes it into a raw 3d numpy tensor (width
129 | x height x 3 colors). The function then returns all of the other entries,
130 | replacing 'original_url' with the entry {'img': [raw_img_tensor]}. If the image
131 | does not exist, it will return nothing, represented by a python "None" object.
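
The actual implementation lives in `step_2b_get_images.py`; the sketch below is
a simplified illustration of the same idea (it assumes the `requests`, `numpy`,
and `opencv-python` packages and omits the project's logging and retry logic):

```
import cv2
import numpy as np
import requests


def fetch_image_from_url(row):
    """Simplified sketch: swap 'original_url' for the decoded image data."""
    try:
        response = requests.get(row['original_url'], timeout=10)
        # Decode the downloaded bytes into a 3d numpy array (rows x cols x 3).
        img = cv2.imdecode(
            np.asarray(bytearray(response.content), dtype=np.uint8),
            cv2.IMREAD_COLOR)
    except Exception:
        return None
    if img is None:
        return None  # download failed or the bytes were not a valid image
    output = {key: value for key, value in row.items() if key != 'original_url'}
    output['img'] = img
    return output
```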
132 |
133 | #### 'filter_bad_or_absent_images'
134 |
135 | Obviously, we don't want to process any missing images ("None" objects) from the
136 | previous step. This step is known as a *filter* step and follows the syntax:
137 |
138 | ```
139 | beam.Filter(boolean_fn)
140 | ```
141 |
142 | where `boolean_fn()` takes as input an element from the pipeline, and returns
143 | a True (if the element should pass on to the next step) or False (if the element
144 | should be removed) value. In this step, the filter function is:
145 |
146 | ```
147 | def filter_bad_or_missing_image(img_and_metadata):
148 | # return False if bad/missing image, True otherwise
149 | ```
150 |
151 | `filter_bad_or_missing_image` checks for two things (a simplified sketch follows the list):
152 | * If the datum is a "None" object, remove it from the pipeline.
153 | * If the image in the datum meets certain specifications that match the "missing
154 | image" on Flickr, remove it from the pipeline.
155 | * Otherwise, pass the datum on to the next step.
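
Here is one way the filter could be written. This is a hedged sketch: it assumes
the image sits under the `'img'` key as in the previous step, and it replaces
the project's exact Flickr "missing image" test with a placeholder comparison:

```
def filter_bad_or_missing_image(img_and_metadata):
    """Simplified sketch: keep only elements that carry a usable image."""
    # Drop elements where the previous step returned None.
    if img_and_metadata is None:
        return False
    img = img_and_metadata['img']
    # Placeholder for the real check, which compares the image against the
    # known dimensions/content of Flickr's "photo unavailable" placeholder.
    if img is None or img.size == 0:
        return False
    return True
```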
156 |
157 | #### 'resize_and_pad_images'
158 |
159 | This is the most involved step in the tutorial. In order for our machine
160 | learning algorithms to run properly in subsequent steps, the output image has to
161 | be a square image with 128 pixels per side. If the input image happens to be
162 | wider than it is tall, this step will resize the image proportionally such that
163 | it is exactly 128 pixels wide (and at most 128 tall). Then, it will pad the top
164 | and bottom of the image equally with "zeros", or black pixels, until the result
165 | is a perfect 128x128 square. Likewise, if the image is taller than it is wide,
166 | it will perform the same operation but pad the left and right with zeros.
167 |
168 | This is also a transform step, but notice that beam.Map() takes in two
169 | arguments, i.e.:
170 |
171 | ```
172 | beam.Map(resize_and_pad,
173 | output_image_dim=known_args.output_image_dim)
174 | ```
175 |
176 | beam.Map() can actually take a variable number of arguments, but the first
177 | argument must always be a function. Every additional argument is known as a *side
178 | input* and gives the pipeline writer the flexibility to pass parameters from
179 | outside the pipeline into the function. This becomes clear when we inspect the
180 | `resize_and_pad()` function, as we see that the second argument in beam.Map(),
181 | i.e. `output_image_dim`, is also an argument in resize_and_pad().
182 |
183 | ```
184 | def resize_and_pad(img_and_metadata, output_image_dim=128):
185 | # returns dict with resized image
186 | # default size is 128 if no side input provided
187 | ```
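
A condensed sketch of the resize-and-pad logic described above is shown below,
assuming OpenCV (`cv2`); the real `resize_and_pad()` in `step_2b_get_images.py`
is the authoritative version, so treat this as an approximation of the idea:

```
import cv2


def resize_and_pad(img_and_metadata, output_image_dim=128):
    """Simplified sketch: scale the longer side to output_image_dim, pad the rest."""
    img = img_and_metadata['img']
    height, width = img.shape[:2]
    # Scale proportionally so the longer side becomes exactly output_image_dim.
    scale = output_image_dim / float(max(height, width))
    img = cv2.resize(img, (int(round(width * scale)), int(round(height * scale))))
    height, width = img.shape[:2]
    # Pad the shorter side equally on both ends with black ("zero") pixels.
    pad_vert = output_image_dim - height
    pad_horiz = output_image_dim - width
    img = cv2.copyMakeBorder(img, pad_vert // 2, pad_vert - pad_vert // 2,
                             pad_horiz // 2, pad_horiz - pad_horiz // 2,
                             cv2.BORDER_CONSTANT, value=[0, 0, 0])
    output = dict(img_and_metadata)
    output['img'] = img
    return output
```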
188 |
189 | ####'write_images_to_storage'
190 |
191 |
192 | The final step is also a *transform* step which writes the output image to a
193 | custom-named file on Cloud Storage. The name follows the format:
194 |
195 | ```
196 | [index]_[randnum_times_1000]_[label].png
197 | ```
198 |
199 | where index is prepended with 0s such that it is always at least 6 digits long,
200 | and randnum_times_1000 is an integer between 0 and 999 obtained by multiplying
201 | the random number by 1000 (and stripping the fraction if necessary).
202 |
203 | There are a couple interesting features to note about this function. First, the
204 | function used here also has a side input `output_dir`:
205 |
206 | ```
207 | def write_processed_image(img_and_metadata, output_dir)
208 | ```
209 |
210 | More importantly, because this is the last transformation in the pipeline, there
211 | is no need for this function to return an output, as there are no further steps
212 | to ingest the output data. Instead, the function takes the processed image and
213 | calls the beam.FileSystems API to write the image data into cloud storage.
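
For illustration, a minimal sketch of this last step is shown below, using
Beam's `FileSystems` API and OpenCV's PNG encoder. The exact filename padding
and error handling in `step_2b_get_images.py` may differ; treat this as a sketch
of the idea:

```
import cv2
from apache_beam.io.filesystems import FileSystems


def write_processed_image(img_and_metadata, output_dir):
    """Simplified sketch: encode the image as PNG and write it to storage."""
    # Filename format: [index]_[randnum_times_1000]_[label].png
    filename = '{:06d}_{}_{}.png'.format(
        int(img_and_metadata['index']),
        int(img_and_metadata['randnum'] * 1000),
        img_and_metadata['label'])
    # cv2.imencode returns (success_flag, encoded_bytes_buffer).
    _, png_buffer = cv2.imencode('.png', img_and_metadata['img'])
    # FileSystems.create works for both local paths and gs:// destinations.
    f = FileSystems.create(FileSystems.join(output_dir, filename))
    f.write(png_buffer.tobytes())
    f.close()
```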
214 |
215 | For additional reading, see
216 | [detailed documentation for writing beam pipelines here.](https://beam.apache.org/documentation/programming-guide/)
217 |
218 | ### Running Step 2b
219 |
220 | A sample dataflow run command can be found in `run_step_2b_get_images.sh`:
221 |
222 | ```
223 | python -m step_2b_get_images \
224 | --project $PROJECT \
225 | --runner DataflowRunner \
226 | --staging_location gs://$BUCKET/$IMAGE_DIR/staging \
227 | --temp_location gs://$BUCKET/$IMAGE_DIR/temp \
228 | --num_workers 50 \
229 | --worker_machine_type n1-standard-4 \
230 | --setup_file ./setup.py \
231 | --region $DATAFLOW_REGION \
232 | --dataset dataset \
233 | --table $TABLE \
234 | --storage-bucket $BUCKET \
235 | --output-dir $IMAGE_DIR/all_images \
236 | --output-image-dim 128 \
237 | --cloud
238 | ```
239 |
240 | Note that each argument is either pipeline-runner (in this case, DataflowRunner)
241 | specific, or a custom argument that we parse out in `step_2b_get_images.py`. We
242 | have organized the command such that pipeline-runner specific arguments come
243 | first, followed by project-specific arguments.
244 |
245 | **Runner-specific arguments**
246 |
247 | These arguments are implicitly parsed by the runner that is specified when
248 | running the job, i.e. DataflowRunner.
249 |
250 | * **--project:** required for read/write permissions and dataflow billing
251 | * **--runner:** for GCP, this is always DataflowRunner
252 | * **--staging_location:** location for uploading build files to transfer to Cloud
253 | Dataflow workers
254 | * **--temp_location:** location for writing temporary files during pipeline
255 | processing. Often used to store the end output of a pipeline before moving data
256 | to appropriate destination location.
257 | * **--num_workers:** number of Cloud Dataflow VMs/workers to spin up
258 | * **--worker_machine_type:**
259 | [Cloud compute engine VM types](https://cloud.google.com/compute/docs/machine-types)
260 | with different amounts of memory and cpus
261 | * **--setup_file:** a python file that is run by each worker to set up its
262 | environment and libraries. For instance, our project's `setup.py` file requires
263 | workers to install opencv2 libraries and python packages.
264 | * **--region:**
265 | [the region that dataflow will run on](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#supported_regional_endpoints)
266 |
267 | **Step 2b specific arguments**
268 |
269 | These arguments are parsed out in the `run()` function at the bottom of the
270 | `step_2b_get_images.py` file using the command line argument parser library
271 | `argparse`.
272 |
273 | * **--dataset:** the dataset where the catinfo table is located
274 | * **--table:** "catinfo"
275 | * **--storage-bucket:** cloud storage bucket name to write output images to
276 | * **--output-dir:** the directory to write output images to in the bucket
277 | * **--output-image-dim:** width/height of the output images
278 | * **--cloud:** Currently, a required argument that tells us to write everything
279 | to the cloud. The only case for removing this is for experts debugging python
280 | code locally using a small sample with "DirectRunner". (In the future, we will
281 | provide local run scripts and tests for debugging.)
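
The exact flag definitions live in `step_2b_get_images.py`, but the general
pattern looks roughly like the sketch below: `argparse`'s `parse_known_args()`
pulls out our custom flags and leaves everything else (such as `--runner` and
`--num_workers`) for the pipeline runner to interpret. The flag names mirror the
list above; the other details are an approximation:

```
import argparse

from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', required=True)
    parser.add_argument('--table', default='catinfo')
    parser.add_argument('--storage-bucket', required=True)
    parser.add_argument('--output-dir', required=True)
    parser.add_argument('--output-image-dim', type=int, default=128)
    parser.add_argument('--cloud', action='store_true')
    # Flags argparse does not recognize (e.g. --runner, --num_workers) are
    # returned separately and handed to the Beam pipeline runner.
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    # ... build and run the pipeline shown earlier using known_args ...
```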
282 |
283 | ### Visualizing Step 2b
284 |
285 | After you run step 2b, you can go to the
286 | [Dataflow Console](https://console.cloud.google.com/dataflow)
287 | and check the progress of your graph. You should see all the steps mentioned
288 | above organized as a linear pipeline like this:
289 |
290 |
291 |
![step 2b pipeline](../imgs/step_2b.png)
292 |
293 |
294 | Notice that all of the steps run simultaneously, as in an assembly line. Once an
295 | entry is read from bigquery, it is passed to the next step (fetching the image),
296 | and once that is completed, it is passed to subsequent steps. The preprocessed
297 | images will begin to show up in your storage bucket as soon as they run through
298 | the pipeline!
299 |
300 | Furthermore, you might notice images being written "out of order". This is
301 | because of massive parallelization, where multiple machines read different
302 | subsets of the big query table and start processing these elements using the
303 | pipeline architecture. Pretty cool right?
304 |
305 | Finally, you may notice that each component takes a different amount of "time".
306 | This time may look scary, but it is actually the total amount of processor time.
307 | If you have 200 processors, you would divide the number by 200 to get the
308 | average time spent per processor. Hence, the actual time to run the pipeline on
309 | the cloud may only be 20 minutes, but if you were to run it on your own
310 | computer instead, it could take over a day!
311 |
312 | ## Step 3: Splitting images
313 |
314 | In this step, we would like to split the set of images into training,
315 | validation, and testing image sets. How can we do this?
316 |
317 | Now that we've walked through a pipeline example in step 2b, we will discuss an
318 | additional beam concept called *partitioning*. A linear pipeline, like that in
319 | step 2b, will send all of its data from the first through the last step.
320 | However, suppose we want to write our image set to multiple destination
321 | locations. We can split the original pipeline (after some processing step)
322 | into multiple pipelines. These multiple pipelines can then be processed
323 | independently of each other (for example, if you want to do something with
324 | training images that you do not want to do with validation), and written into
325 | different destination locations.
326 |
327 | Here is a snippet from the step 3 architecture. Note that `split_pipelines` is
328 | the result of a partitioned pipeline, which returns a python list of pipelines.
329 | These pipelines can then be looped through and invoked.
330 |
331 | ```
332 | with beam.Pipeline(options=pipeline_options) as p:
333 | # Read files and partition pipelines
334 | split_pipelines = (
335 | p
336 | | 'read' >> beam.io.Read(LabeledImageFileReader(source_images_pattern))
337 | | 'split'
338 | >> beam.Partition(
339 | generate_split_fn(split_fractions),
340 | len(split_fractions)
341 | )
342 | )
343 |
344 | # Write each pipeline to a corresponding output directory
345 | for partition, split_name_and_dest_dir in enumerate(
346 | zip(split_names, dest_images_dirs)):
347 | _ = (split_pipelines[partition]
348 | | 'write_' + split_name_and_dest_dir[0]
349 | >> beam.Map(write_to_directory,
350 | dst_dir=split_name_and_dest_dir[1]))
351 | ```
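
`generate_split_fn` is not reproduced in this tutorial, but conceptually it
returns a function that Beam calls once per element with the element and the
number of partitions, and that returns the index of the partition the element
belongs to. A hedged sketch, assuming each element exposes the `randnum` value
from step 2b (in this project it is recoverable from the number encoded in the
filename):

```
import bisect


def generate_split_fn(split_fractions):
    """Simplified sketch: build a partition function from split fractions."""
    # Cumulative upper bounds, e.g. [0.8, 0.1, 0.1] -> [0.8, 0.9, 1.0].
    cumulative_bounds = []
    total = 0.0
    for fraction in split_fractions:
        total += fraction
        cumulative_bounds.append(total)

    def split_fn(element, num_partitions):
        # Assumes the element carries a random number in [0, 1); with the
        # bounds above, randnum below 0.8 goes to partition 0, and so on.
        index = bisect.bisect_right(cumulative_bounds, element['randnum'])
        return min(index, num_partitions - 1)

    return split_fn
```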
352 |
353 | ### Running Step 3
354 |
355 | Like step 2b, there are a number of required DataflowRunner arguments. The
356 | additional arguments for step 3 are as follows:
357 |
358 | * **--storage-bucket:** cloud storage bucket name from which to read and write
359 | images
360 | * **--source-image-dir:** source directory for the full image set
361 | * **--dest-image-dir:** parent directory for split image sets
362 | * **--split-names:** names of the split image sets. Combining
363 | `dest-image-dir/split-name` gives the full name of the output directory.
364 | * **--split-fractions:** space separated floating point numbers indicating what
365 | fractions of the dataset to include in each split set
366 | * **--cloud:** again, a required argument, removed only for expert debugging.

367 | ### Visualizing Step 3
368 |
369 | Again, going to the Dataflow Console and clicking on the job will reveal a
370 | pipeline that is split into 3 pipelines. Despite the "for" loop in the code
371 | snippet above, which indicates that the pipeline processing components are
372 | created sequentially, none of these are actually executed at this point.
373 | We are merely creating architecture (and *laziness* reigns supreme again!).
374 | After exiting the `with beam.Pipeline...` block, the architecture is complete,
375 | completed, and the pipelines and splits start running in parallel!
376 |
377 |
378 |
![step 3 pipeline](../imgs/step_3.png)
379 |
380 |
--------------------------------------------------------------------------------
/cats/LICENSE:
--------------------------------------------------------------------------------
1 |
2 | Apache License
3 | Version 2.0, January 2004
4 | http://www.apache.org/licenses/
5 |
6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7 |
8 | 1. Definitions.
9 |
10 | "License" shall mean the terms and conditions for use, reproduction,
11 | and distribution as defined by Sections 1 through 9 of this document.
12 |
13 | "Licensor" shall mean the copyright owner or entity authorized by
14 | the copyright owner that is granting the License.
15 |
16 | "Legal Entity" shall mean the union of the acting entity and all
17 | other entities that control, are controlled by, or are under common
18 | control with that entity. For the purposes of this definition,
19 | "control" means (i) the power, direct or indirect, to cause the
20 | direction or management of such entity, whether by contract or
21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the
22 | outstanding shares, or (iii) beneficial ownership of such entity.
23 |
24 | "You" (or "Your") shall mean an individual or Legal Entity
25 | exercising permissions granted by this License.
26 |
27 | "Source" form shall mean the preferred form for making modifications,
28 | including but not limited to software source code, documentation
29 | source, and configuration files.
30 |
31 | "Object" form shall mean any form resulting from mechanical
32 | transformation or translation of a Source form, including but
33 | not limited to compiled object code, generated documentation,
34 | and conversions to other media types.
35 |
36 | "Work" shall mean the work of authorship, whether in Source or
37 | Object form, made available under the License, as indicated by a
38 | copyright notice that is included in or attached to the work
39 | (an example is provided in the Appendix below).
40 |
41 | "Derivative Works" shall mean any work, whether in Source or Object
42 | form, that is based on (or derived from) the Work and for which the
43 | editorial revisions, annotations, elaborations, or other modifications
44 | represent, as a whole, an original work of authorship. For the purposes
45 | of this License, Derivative Works shall not include works that remain
46 | separable from, or merely link (or bind by name) to the interfaces of,
47 | the Work and Derivative Works thereof.
48 |
49 | "Contribution" shall mean any work of authorship, including
50 | the original version of the Work and any modifications or additions
51 | to that Work or Derivative Works thereof, that is intentionally
52 | submitted to Licensor for inclusion in the Work by the copyright owner
53 | or by an individual or Legal Entity authorized to submit on behalf of
54 | the copyright owner. For the purposes of this definition, "submitted"
55 | means any form of electronic, verbal, or written communication sent
56 | to the Licensor or its representatives, including but not limited to
57 | communication on electronic mailing lists, source code control systems,
58 | and issue tracking systems that are managed by, or on behalf of, the
59 | Licensor for the purpose of discussing and improving the Work, but
60 | excluding communication that is conspicuously marked or otherwise
61 | designated in writing by the copyright owner as "Not a Contribution."
62 |
63 | "Contributor" shall mean Licensor and any individual or Legal Entity
64 | on behalf of whom a Contribution has been received by Licensor and
65 | subsequently incorporated within the Work.
66 |
67 | 2. Grant of Copyright License. Subject to the terms and conditions of
68 | this License, each Contributor hereby grants to You a perpetual,
69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70 | copyright license to reproduce, prepare Derivative Works of,
71 | publicly display, publicly perform, sublicense, and distribute the
72 | Work and such Derivative Works in Source or Object form.
73 |
74 | 3. Grant of Patent License. Subject to the terms and conditions of
75 | this License, each Contributor hereby grants to You a perpetual,
76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77 | (except as stated in this section) patent license to make, have made,
78 | use, offer to sell, sell, import, and otherwise transfer the Work,
79 | where such license applies only to those patent claims licensable
80 | by such Contributor that are necessarily infringed by their
81 | Contribution(s) alone or by combination of their Contribution(s)
82 | with the Work to which such Contribution(s) was submitted. If You
83 | institute patent litigation against any entity (including a
84 | cross-claim or counterclaim in a lawsuit) alleging that the Work
85 | or a Contribution incorporated within the Work constitutes direct
86 | or contributory patent infringement, then any patent licenses
87 | granted to You under this License for that Work shall terminate
88 | as of the date such litigation is filed.
89 |
90 | 4. Redistribution. You may reproduce and distribute copies of the
91 | Work or Derivative Works thereof in any medium, with or without
92 | modifications, and in Source or Object form, provided that You
93 | meet the following conditions:
94 |
95 | (a) You must give any other recipients of the Work or
96 | Derivative Works a copy of this License; and
97 |
98 | (b) You must cause any modified files to carry prominent notices
99 | stating that You changed the files; and
100 |
101 | (c) You must retain, in the Source form of any Derivative Works
102 | that You distribute, all copyright, patent, trademark, and
103 | attribution notices from the Source form of the Work,
104 | excluding those notices that do not pertain to any part of
105 | the Derivative Works; and
106 |
107 | (d) If the Work includes a "NOTICE" text file as part of its
108 | distribution, then any Derivative Works that You distribute must
109 | include a readable copy of the attribution notices contained
110 | within such NOTICE file, excluding those notices that do not
111 | pertain to any part of the Derivative Works, in at least one
112 | of the following places: within a NOTICE text file distributed
113 | as part of the Derivative Works; within the Source form or
114 | documentation, if provided along with the Derivative Works; or,
115 | within a display generated by the Derivative Works, if and
116 | wherever such third-party notices normally appear. The contents
117 | of the NOTICE file are for informational purposes only and
118 | do not modify the License. You may add Your own attribution
119 | notices within Derivative Works that You distribute, alongside
120 | or as an addendum to the NOTICE text from the Work, provided
121 | that such additional attribution notices cannot be construed
122 | as modifying the License.
123 |
124 | You may add Your own copyright statement to Your modifications and
125 | may provide additional or different license terms and conditions
126 | for use, reproduction, or distribution of Your modifications, or
127 | for any such Derivative Works as a whole, provided Your use,
128 | reproduction, and distribution of the Work otherwise complies with
129 | the conditions stated in this License.
130 |
131 | 5. Submission of Contributions. Unless You explicitly state otherwise,
132 | any Contribution intentionally submitted for inclusion in the Work
133 | by You to the Licensor shall be under the terms and conditions of
134 | this License, without any additional terms or conditions.
135 | Notwithstanding the above, nothing herein shall supersede or modify
136 | the terms of any separate license agreement you may have executed
137 | with Licensor regarding such Contributions.
138 |
139 | 6. Trademarks. This License does not grant permission to use the trade
140 | names, trademarks, service marks, or product names of the Licensor,
141 | except as required for reasonable and customary use in describing the
142 | origin of the Work and reproducing the content of the NOTICE file.
143 |
144 | 7. Disclaimer of Warranty. Unless required by applicable law or
145 | agreed to in writing, Licensor provides the Work (and each
146 | Contributor provides its Contributions) on an "AS IS" BASIS,
147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148 | implied, including, without limitation, any warranties or conditions
149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150 | PARTICULAR PURPOSE. You are solely responsible for determining the
151 | appropriateness of using or redistributing the Work and assume any
152 | risks associated with Your exercise of permissions under this License.
153 |
154 | 8. Limitation of Liability. In no event and under no legal theory,
155 | whether in tort (including negligence), contract, or otherwise,
156 | unless required by applicable law (such as deliberate and grossly
157 | negligent acts) or agreed to in writing, shall any Contributor be
158 | liable to You for damages, including any direct, indirect, special,
159 | incidental, or consequential damages of any character arising as a
160 | result of this License or out of the use or inability to use the
161 | Work (including but not limited to damages for loss of goodwill,
162 | work stoppage, computer failure or malfunction, or any and all
163 | other commercial damages or losses), even if such Contributor
164 | has been advised of the possibility of such damages.
165 |
166 | 9. Accepting Warranty or Additional Liability. While redistributing
167 | the Work or Derivative Works thereof, You may choose to offer,
168 | and charge a fee for, acceptance of support, warranty, indemnity,
169 | or other liability obligations and/or rights consistent with this
170 | License. However, in accepting such obligations, You may act only
171 | on Your own behalf and on Your sole responsibility, not on behalf
172 | of any other Contributor, and only if You agree to indemnify,
173 | defend, and hold each Contributor harmless for any liability
174 | incurred by, or claims asserted against, such Contributor by reason
175 | of your accepting any such warranty or additional liability.
176 |
177 | END OF TERMS AND CONDITIONS
178 |
179 | APPENDIX: How to apply the Apache License to your work.
180 |
181 | To apply the Apache License to your work, attach the following
182 | boilerplate notice, with the fields enclosed by brackets "[]"
183 | replaced with your own identifying information. (Don't include
184 | the brackets!) The text should be enclosed in the appropriate
185 | comment syntax for the file format. We also recommend that a
186 | file or class name and description of purpose be included on the
187 | same "printed page" as the copyright notice for easier
188 | identification within third-party archives.
189 |
190 | Copyright [yyyy] [name of copyright owner]
191 |
192 | Licensed under the Apache License, Version 2.0 (the "License");
193 | you may not use this file except in compliance with the License.
194 | You may obtain a copy of the License at
195 |
196 | http://www.apache.org/licenses/LICENSE-2.0
197 |
198 | Unless required by applicable law or agreed to in writing, software
199 | distributed under the License is distributed on an "AS IS" BASIS,
200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201 | See the License for the specific language governing permissions and
202 | limitations under the License.
203 |
--------------------------------------------------------------------------------
/cats/README.md:
--------------------------------------------------------------------------------
1 | # Feline ML Tutorial
2 |
3 | ## Verify environment variables
4 |
5 | Let's [SSH back into our cloud VM](https://console.cloud.google.com/compute/instances)
6 | by clicking the SSH link. Type:
7 |
8 | ```
9 | screen -R
10 | ```
11 |
12 | to resume the screen terminals.
13 |
14 | Jupyter should already be running in terminal 0, so let's use terminal 1 to
15 | check up on some things. Type `Ctrl-a 1` to switch to terminal 1.
16 |
17 | To make our lives easier, let's confirm that the bash environment variables are
18 | set correctly. The following should display your project name, storage bucket
19 | name, and Dataflow region:
20 |
21 | ```
22 | echo $PROJECT
23 | echo $BUCKET
24 | echo $DATAFLOW_REGION
25 | ```
26 |
27 | ## Step 0: Introduction and Setup Validation
28 |
29 | If you aren't connected to Jupyter via your browser yet,
30 | [connect to it now.](../README.md#connect-to-jupyter) Click on the `cats`
31 | directory to see all of the notebooks we will be using.
32 |
33 | Click the `step_0_to_0.ipynb` notebook to get a "Hello World" view of the tools
34 | we will use, and to verify that TensorFlow is indeed connected to our GPU.
35 |
36 | Note that the first time you run a notebook, the notebook is by default
37 | *not trusted* over the internet. You will have to click the "Not Trusted" button
38 | on the top right to switch to "Trusted" mode in order to run it.
39 |
40 | The last code cell should return a list of the devices available on the
41 | machine. There should be a CPU available.
42 | If you chose a GPU for your machine, confirm that the resulting list also
43 | contains an entry like `u'/gpu:0'`.
44 |
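If you chose a GPU machine but do not see a `gpu` entry, it can help to check from
the VM shell (screen terminal 1) whether the driver sees the card at all. A minimal
check, assuming the NVIDIA driver was installed by the earlier setup scripts:

```
nvidia-smi
```

If this command fails or lists no devices, revisit the GPU portion of the setup
before continuing.
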
45 | When you have finished running the notebook, either shut down the kernel from
46 | the Kernel dropdown menu, or select the notebook in the directory view and
47 | click "Shutdown" to free up resources for the next step.
48 |
49 | ## Step 1: Performance Metrics and Requirements
50 |
51 | Work through `step_1_to_3.ipynb`. Steps 2 and 3 are left blank intentionally as
52 | we will be running them using super cool distributed dataflow jobs!
53 |
54 | ## Step 2: Getting the Data
55 |
56 | In your VM shell, switch to screen terminal 1, i.e. `Ctrl-a 1`, and change to
57 | the `cats` directory:
58 |
59 | ```
60 | cd ~/sc17/cats
61 | ```
62 |
63 | Step 2 consists of two parts listed below.
64 |
65 | ### Step 2a: Collect Metadata in Bigquery Table
66 |
67 | *Expected run time: < 1 minute*
68 |
69 | To collect our cat images, we will use BigQuery on a couple of public tables
70 | containing image URLs and labels. The query is contained in the
71 | `step_2a_query.sql` file.
72 |
73 | Run the following script to execute the query and write the results into the
74 | BigQuery table `$PROJECT.dataset.catinfo`.
75 |
76 | ```
77 | sh run_step_2a_query.sh
78 | ```
79 |
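As an optional sanity check before moving on, you can peek at the new table with
the `bq` command-line tool (a quick sketch; adjust flags to taste):

```
bq show dataset.catinfo
bq head --max_rows=5 dataset.catinfo
```

You should see the `original_url`, `label`, and `randnum` columns, plus a few
sample rows.
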
80 | ### Step 2b: Collect Images into Cloud Storage using Dataflow
81 |
82 | *Expected run time: 25 minutes*
83 |
84 | **For an in-depth discussion of Cloud Dataflow steps 2b and 3, see the
85 | [Dataflow Preprocessing Tutorial](DATAFLOW_TUTORIAL.md).**
86 |
87 | Once you have the output table `$PROJECT.dataset.catinfo` populated, run the
88 | following to collect images from their URLs and write them into storage:
89 |
90 | ```
91 | sh run_step_2b_get_images.sh $PROJECT dataset catinfo $BUCKET catimages
92 | ```
93 |
94 | To view the progress of your job, go to the
95 | [Dataflow Web UI](https://console.cloud.google.com/dataflow)
96 | and click on your job to see a graph. Clicking on each of the components of
97 | the graph will show you how many elements/images have been processed by that
98 | component. The total number of elements should be approximately 80k.
99 |
100 | The images will be written into `gs://$BUCKET/catimages/all_images`.
101 |
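Once the job finishes, a quick way to confirm the output from the VM shell is to
count the files in the output directory (expect a bit under 80k, since bad or
missing URLs are filtered out):

```
gsutil ls gs://$BUCKET/catimages/all_images | wc -l
```
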
102 | ## Step 3: Splitting the Data
103 |
104 | *Expected run time: 20 minutes*
105 |
106 | The next step is to split your dataset into training, validation, and test sets.
107 | The script we will be using is `run_step_3_split_images.sh`. If you inspect
108 | the script, you will notice that the underlying Python module takes general
109 | split fractions and output names as parameters for flexibility, but the script
110 | has been written to provide an example that splits the data into
111 | 0.5 (50%) training images, 0.3 (30%) validation images, and 0.2 (20%) test
112 | images, e.g. with arguments:
113 |
114 | ```
115 | --split-names training_images validation_images test_images \
116 | --split-fractions 0.5 0.3 0.2 \
117 | ```
118 |
119 | Run the script as is (do not change it for the time being!):
120 |
121 | ```
122 | sh run_step_3_split_images.sh $PROJECT $BUCKET catimages
123 | ```
124 |
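When the job completes, you can optionally verify that the three output folders
roughly reflect the 0.5/0.3/0.2 split, for example:

```
for split in training_images validation_images test_images; do
  echo -n "$split: "
  gsutil ls gs://$BUCKET/catimages/$split | wc -l
done
```
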
125 | ## Step 4: Exploring the Training Set
126 |
127 | *Expected image set download time: < 2 minutes*
128 |
129 | Let's download a subset of the training images to the VM from storage. We will
130 | use wildcards to choose about 2k images for training, and 1k images for
131 | debugging.
132 |
133 | In screen terminal 1 (go to the VM shell and type `Ctrl-a 1`), create folders
134 | to store your training and debugging images, and then copy a small sample of
135 | training images from cloud storage:
136 |
137 | ```
138 | mkdir -p ~/data/training_small
139 | gsutil -m cp gs://$BUCKET/catimages/training_images/000*.png ~/data/training_small/
140 | gsutil -m cp gs://$BUCKET/catimages/training_images/001*.png ~/data/training_small/
141 | mkdir -p ~/data/debugging_small
142 | gsutil -m cp gs://$BUCKET/catimages/training_images/002*.png ~/data/debugging_small
143 | echo "done!"
144 | ```
145 |
146 | Once this is done, you can begin work on notebooks `step_4_to_4_part1.ipynb` and
147 | `step_4_to_4_part2.ipynb`. We suggest running the commands below before you start
148 | the notebooks, so that the downloads happen in the background while you work in
149 | Jupyter instead of leaving you waiting idly.
150 |
151 | *Expected image set download time: 20 minutes*
152 |
153 | Download all of your image sets to the VM. Then set aside a few thousand
154 | training images for debugging.
155 |
156 | ```
157 | mkdir -p ~/data/training_images
158 | gsutil -m cp gs://$BUCKET/catimages/training_images/*.png ~/data/training_images/
159 | mkdir -p ~/data/validation_images
160 | gsutil -m cp gs://$BUCKET/catimages/validation_images/*.png ~/data/validation_images/
161 | mkdir -p ~/data/test_images
162 | gsutil -m cp gs://$BUCKET/catimages/test_images/*.png ~/data/test_images/
163 | mkdir -p ~/data/debugging_images
164 | mv ~/data/training_images/000*.png ~/data/debugging_images/
165 | mv ~/data/training_images/001*.png ~/data/debugging_images/
166 | echo "done!"
167 | ```
168 |
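After the full download finishes, it is worth a quick check that each local folder
is populated and that the `000*`/`001*` images really moved out of the training
set, for example:

```
for d in training_images validation_images test_images debugging_images; do
  echo -n "$d: "
  ls ~/data/$d | wc -l
done
```
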
169 | ## Step 5: Basics of Neural Networks
170 |
171 | Work through notebooks `nn_demo_part1.ipynb` and `nn_demo_part2.ipynb`.
172 |
173 | ## Step 5-7: Training and Debugging Models with Various Tools
174 |
175 | `step_5_to_6_part1.ipynb` runs through a logistic regression model using
176 | scikit-learn and TensorFlow, based on some basic image features.
177 |
178 | `step_5_to_8_part2.ipynb` trains a convolutional neural network using
179 | TensorFlow and runs some debugging steps to look at images that were
180 | misclassified.
181 |
182 | ## Step 8-9: Validation and Testing
183 |
184 | Before proceeding, if you have been running data science on a VM without a GPU,
185 | strongly consider creating a new one with a GPU. For this exercise, the K80
186 | offers the best tradeoff between cost and run time.
187 |
188 | Run `step_8_to_9.ipynb` to train a model, do some basic debugging, and run
189 | validation and testing on the entire dataset.
190 |
--------------------------------------------------------------------------------
/cats/nn_demo_part1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Basics of Neural Networks - Numpy Demo"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Author(s):** kozyr@google.com"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "We show how to train a very simple neural network from scratch (we're using nothing but ```numpy```!)."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "## Setup"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "### Identical to ```keras``` version"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 0,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "import numpy as np\n",
45 | "\n",
46 | "# Set up the data and network:\n",
47 | "n_outputs = 5 # We're attempting to learn XOR in this example, so our inputs and outputs will have the same dimension.\n",
48 | "n_hidden_units = 10 # We'll use a single hidden layer with this number of hidden units in it.\n",
49 | "n_obs = 500 # How many observations of the XOR input to output vector will we use for learning?\n",
50 | "\n",
51 | "# How quickly do we want to update our weights?\n",
52 | "learning_rate = 0.1\n",
53 | "\n",
54 | "# How many times will we try to use each observation to improve the weights?\n",
55 | "epochs = 10 # Think of this as iterations if you like.\n",
56 | "\n",
57 | "# Set random seed so that the exercise works out the same way for everyone:\n",
58 | "np.random.seed(42)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "## Create some data to learn from"
66 | ]
67 | },
68 | {
69 | "cell_type": "markdown",
70 | "metadata": {},
71 | "source": [
72 | "### Identical to ```keras``` version"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": 0,
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "# Create the inputs:\n",
82 | "training_vectors = np.random.binomial(1, 0.5, (n_obs, n_outputs))\n",
83 | "# Each row is a binary vector to learn from.\n",
84 | "print('One instance with ' + str(n_outputs) + ' features: ' + str(training_vectors[0]))\n",
85 | "\n",
86 | "# Create the correct XOR outputs (t is for target):\n",
87 | "xor_training_vectors = training_vectors ^ 1 # This is just XOR, everything is deterministic.\n",
88 | "print('Correct label (simply XOR): ' + str(xor_training_vectors[0]))"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "## Select activation and loss functions"
96 | ]
97 | },
98 | {
99 | "cell_type": "markdown",
100 | "metadata": {},
101 | "source": [
102 | "### Only in ```numpy``` version"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 0,
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
111 | "# Define an activation function and its derivative:\n",
112 | "def activ(x):\n",
113 | " # We'll use a sigmoid function:\n",
114 | " return 1.0 / (1.0 + np.exp(-x))\n",
115 | "\n",
116 | "def activ_prime(x):\n",
117 | " # Derivative of the sigmoid function:\n",
118 | " # d/dx 1 / (1 + exp(-x)) = -(-exp(-x)) * (1 + exp(-x)) ^ (-2)\n",
119 | " return np.exp(-x) / ((1.0 + np.exp(-x)) ** 2)\n",
120 | " \n",
121 | "# Define a loss function and its derivative wrt predictions:\n",
122 | "def loss(prediction, truth):\n",
123 | " # We'll choose cross entropy loss for this demo.\n",
124 | " return -np.mean(truth * np.log(prediction) + (1 - truth) * np.log(1 - prediction))\n",
125 | "\n",
126 | "def loss_prime(prediction, truth):\n",
127 | "    # Derivative of cross entropy loss wrt prediction, with the sigmoid output derivative folded in:\n",
128 | "    # d/dy (-t log(y) - (1-t)log(1-y)) = -t/y + (1-t)/(1-y) = (y-t)/(y(1-y)), and multiplying by dy/du = y(1-y) gives y - t\n",
129 | " return prediction - truth"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "## Initialize the weights"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "### Only in ```numpy``` version"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 0,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# Simplest way to initialize is to choose weights uniformly at random between -1 and 1:\n",
153 | "weights1 = np.random.uniform(low=-1, high=1, size=(n_outputs, n_hidden_units))\n",
154 | "weights2 = np.random.uniform(low=-1, high=1, size=(n_hidden_units, n_outputs))\n",
155 | "# Note: there are much better ways to initialize weights, but our goal is simplicity here."
156 | ]
157 | },
158 | {
159 | "cell_type": "markdown",
160 | "metadata": {},
161 | "source": [
162 | "## Forward propagation"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "### Only in ```numpy``` version"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": 0,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "def forward_prop(x, w1, w2):\n",
179 | " \"\"\"Implements forward propagation.\n",
180 | " \n",
181 | " Args:\n",
182 | " x: the input vector.\n",
183 | " w1: first set of weights mapping the input to layer 1.\n",
184 | " w2: second set of weights mapping layer 1 to layer 2.\n",
185 | "\n",
186 | " Returns:\n",
187 | " u1: unactivated unit values from layer 1 in forward prop\n",
188 | " u2: unactivated unit values from layer 2 in forward prop\n",
189 | " a1: activated unit values from layer 1 in forward prop \n",
190 | " a2: activated unit values from layer 2 in forward prop\n",
191 | " lab: the output label\n",
192 | " \"\"\"\n",
193 | " u1 = np.dot(x, w1) # u for unactivated weighted sum unit (other authors might prefer to call it z)\n",
194 | " a1 = activ(u1) # a for activated unit\n",
195 | " u2 = np.dot(a1, w2)\n",
196 | " a2 = activ(u2)\n",
197 | " # Let's output predicted labels too, but converting continuous a2 to binary:\n",
198 | " lab = (a2 > 0.5).astype(int)\n",
199 | " return u1, u2, a1, a2, lab"
200 | ]
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {},
205 | "source": [
206 | "## Backward propagation"
207 | ]
208 | },
209 | {
210 | "cell_type": "markdown",
211 | "metadata": {},
212 | "source": [
213 | "### Only in ```numpy``` version"
214 | ]
215 | },
216 | {
217 | "cell_type": "code",
218 | "execution_count": 0,
219 | "metadata": {},
220 | "outputs": [],
221 | "source": [
222 | "def back_prop(x, t, u1, u2, a1, a2, w1, w2):\n",
223 | " \"\"\"Implements backward propagation.\n",
224 | " \n",
225 | " Args:\n",
226 | " x: the input vector\n",
227 | " t: the desired output vector.\n",
228 | " u1: unactivated unit values from layer 1 in forward prop\n",
229 | " u2: unactivated unit values from layer 2 in forward prop\n",
230 | " a1: activated unit values from layer 1 in forward prop \n",
231 | " a2: activated unit values from layer 2 in forward prop\n",
232 | " w1: first set of weights mapping the input to layer 1.\n",
233 | " w2: second set of weights mapping layer 1 to layer 2.\n",
234 | " Returns: \n",
235 | " d1: gradients for weights w1, used for updating w1\n",
236 | " d2: gradients for weights w2, used for updating w2\n",
237 | " \"\"\"\n",
238 | " e2 = loss_prime(a2, t) # e is for error; this is the \"error\" effect in the final layer\n",
239 | " d2 = np.outer(a1, e2) # d is for delta; this is the gradient value for updating weights w2\n",
240 | " e1 = np.dot(w2, e2) * activ_prime(u1) # e is for error\n",
241 | " d1 = np.outer(x, e1) # d is for delta; this is the gradient update for the first set of weights w1\n",
242 | " return d1, d2 # We only need the updates outputted"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "## Train the neural network!"
250 | ]
251 | },
252 | {
253 | "cell_type": "markdown",
254 | "metadata": {},
255 | "source": [
256 | "### Only in ```numpy``` version"
257 | ]
258 | },
259 | {
260 | "cell_type": "code",
261 | "execution_count": 0,
262 | "metadata": {},
263 | "outputs": [],
264 | "source": [
265 | "# Train\n",
266 | "for epoch in range(epochs):\n",
267 | " loss_tracker = []\n",
268 | "\n",
269 | " for i in range(training_vectors.shape[0]):\n",
270 | " # Input one obs at a time to become x = binary_vectors[i] (inputs) and t = xor_vectors[i] (targets)\n",
271 | " # Forward propagation:\n",
272 | " u1, u2, a1, a2, labels = forward_prop(training_vectors[i], weights1, weights2)\n",
273 | " # Backward propagation:\n",
274 | " d1, d2 = back_prop(training_vectors[i], xor_training_vectors[i],\n",
275 | " u1, u2, a1, a2, weights1, weights2)\n",
276 | " # Update the weights:\n",
277 | " weights1 -= learning_rate * d1\n",
278 | " weights2 -= learning_rate * d2\n",
279 | " loss_tracker.append(loss(prediction=a2, truth=xor_training_vectors[i]))\n",
280 | "\n",
281 | " print 'Epoch: %d, Average Loss: %.8f' % (epoch+1, np.mean(loss_tracker))"
282 | ]
283 | },
284 | {
285 | "cell_type": "markdown",
286 | "metadata": {},
287 | "source": [
288 | "## Validate"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "### Almost identical to ```keras``` version"
296 | ]
297 | },
298 | {
299 | "cell_type": "code",
300 | "execution_count": 0,
301 | "metadata": {},
302 | "outputs": [],
303 | "source": [
304 | "# Print performance to screen:\n",
305 | "def get_performance(n_valid, w1, w2):\n",
306 | " \"\"\"Computes performance and prints it to screen.\n",
307 | " \n",
308 | " Args:\n",
309 | " n_valid: number of validation instances we'd like to simulate.\n",
310 | " w1: first set of weights mapping the input to layer 1.\n",
311 | " w2: second set of weights mapping layer 1 to layer 2.\n",
312 | " \n",
313 | " Returns:\n",
314 | " None\n",
315 | " \"\"\"\n",
316 | " flawless_tracker = []\n",
317 | " validation_vectors = np.random.binomial(1, 0.5, (n_valid, n_outputs))\n",
318 | " xor_validation_vectors = validation_vectors ^ 1\n",
319 | "\n",
320 | " for i in range(n_valid):\n",
321 | " u1, u2, a1, a2, labels = forward_prop(validation_vectors[i], w1, w2)\n",
322 | " if i < 3:\n",
323 | " print('********')\n",
324 | " print('Challenge ' + str(i + 1) + ': ' + str(validation_vectors[i]))\n",
325 | " print('Predicted ' + str(i + 1) + ': ' + str(labels))\n",
326 | " print('Correct ' + str(i + 1) + ': ' + str(xor_validation_vectors[i]))\n",
327 | " instance_score = (np.array_equal(labels, xor_validation_vectors[i]))\n",
328 | " flawless_tracker.append(instance_score)\n",
329 | " \n",
330 | " print('\\nProportion of flawless instances on ' + str(n_valid) +\n",
331 | " ' new examples: ' + str(round(100*np.mean(flawless_tracker),0)) + '%')"
332 | ]
333 | },
334 | {
335 | "cell_type": "code",
336 | "execution_count": 0,
337 | "metadata": {},
338 | "outputs": [],
339 | "source": [
340 | "get_performance(5000, weights1, weights2)"
341 | ]
342 | }
343 | ],
344 | "metadata": {
345 | "kernelspec": {
346 | "display_name": "Python 2",
347 | "language": "python",
348 | "name": "python2"
349 | }
350 | },
351 | "nbformat": 4,
352 | "nbformat_minor": 2
353 | }
354 |
--------------------------------------------------------------------------------
/cats/nn_demo_part2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Basics of Neural Networks - Keras Demo"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "**Author(s):** ronbodkin@google.com, kozyr@google.com, bfoo@google.com"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "We show how to train a very simple neural network from scratch (but let's upgrade from ```numpy``` to ```keras```). Keras is a higher-level API that makes TensorFlow easier to work with."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "## Setup"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "### Identical to ```numpy``` version"
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 0,
41 | "metadata": {},
42 | "outputs": [],
43 | "source": [
44 | "import numpy as np\n",
45 | "\n",
46 | "# Set up the data and network:\n",
47 | "n_outputs = 5 # We're attempting to learn XOR in this example, so our inputs and outputs will have the same dimension.\n",
48 | "n_hidden_units = 10 # We'll use a single hidden layer with this number of hidden units in it.\n",
49 | "n_obs = 500 # How many observations of the XOR input to output vector will we use for learning?\n",
50 | "\n",
51 | "# How quickly do we want to update our weights?\n",
52 | "learning_rate = 0.1\n",
53 | "\n",
54 | "# How many times will we try to use each observation to improve the weights?\n",
55 | "epochs = 10 # Think of this as iterations if you like.\n",
56 | "\n",
57 | "# Set random seed so that the exercise works out the same way for everyone:\n",
58 | "np.random.seed(42)"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
65 | "### Only in ```keras``` version"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 0,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "import tensorflow as tf\n",
75 | "# Which version of TensorFlow are we using?\n",
76 | "print(tf.__version__)"
77 | ]
78 | },
79 | {
80 | "cell_type": "code",
81 | "execution_count": 0,
82 | "metadata": {},
83 | "outputs": [],
84 | "source": [
85 | "# Add keras to runtime\n",
86 | "!pip install keras"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 0,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "# Import keras and basic types of NN layers we will use\n",
96 | "# Keras is a higher-level API for neural networks that works with TensorFlow\n",
97 | "import keras\n",
98 | "from keras.models import Sequential\n",
99 | "from keras.layers import Dense, Activation\n",
100 | "import keras.utils as np_utils"
101 | ]
102 | },
103 | {
104 | "cell_type": "markdown",
105 | "metadata": {},
106 | "source": [
107 | "## Create some data to learn from"
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "### Identical to ```numpy``` version"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 0,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "# Create the inputs:\n",
124 | "training_vectors = np.random.binomial(1, 0.5, (n_obs, n_outputs))\n",
125 | "# Each row is a binary vector to learn from.\n",
126 | "print('One instance with ' + str(n_outputs) + ' features: ' + str(training_vectors[0]))\n",
127 | "\n",
128 | "# Create the correct XOR outputs (t is for target):\n",
129 | "xor_training_vectors = training_vectors ^ 1 # This is just XOR, everything is deterministic.\n",
130 | "print('Correct label (simply XOR): ' + str(xor_training_vectors[0]))"
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "## Build the network directly\n",
138 | "\n",
139 | "There's no need to write the loss and activation functions from scratch or compute their derivatives, or to write forward and backprop from scratch. We'll just select them and ```keras``` will take care of it. Thanks, ```keras```!"
140 | ]
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "metadata": {},
145 | "source": [
146 | "### Only in ```keras``` version"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 0,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "# 2 layer model with ReLU for hidden layer, sigmoid for output layer\n",
156 | "# Uncomment below to try a 3 layer model with two hidden layers\n",
157 | "model = Sequential()\n",
158 | "model.add(Dense(units=n_hidden_units, input_dim=n_outputs))\n",
159 | "model.add(Activation('relu'))\n",
160 | "#model.add(Dense(units=n_hidden_units))\n",
161 | "#model.add(Activation('sigmoid'))\n",
162 | "model.add(Dense(units=n_outputs))\n",
163 | "model.add(Activation('sigmoid'))"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 0,
169 | "metadata": {},
170 | "outputs": [],
171 | "source": [
172 | "# Time to choose an optimizer. Let's use SGD:\n",
173 | "sgd = keras.optimizers.SGD(lr=learning_rate, decay=1e-6, momentum=0.9, nesterov=True)"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 0,
179 | "metadata": {},
180 | "outputs": [],
181 | "source": [
182 | "# Set up model using cross-entropy loss with SGD optimizer:\n",
183 | "model.compile(optimizer=sgd,\n",
184 | " loss='binary_crossentropy',\n",
185 | " metrics=['accuracy'])"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "## Train the neural network!"
193 | ]
194 | },
195 | {
196 | "cell_type": "markdown",
197 | "metadata": {},
198 | "source": [
199 | "### Only in ```keras``` version"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 0,
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "# Fit model:\n",
209 | "model.fit(training_vectors, xor_training_vectors, epochs=epochs)"
210 | ]
211 | },
212 | {
213 | "cell_type": "markdown",
214 | "metadata": {},
215 | "source": [
216 | "## Validate"
217 | ]
218 | },
219 | {
220 | "cell_type": "markdown",
221 | "metadata": {},
222 | "source": [
223 | "### Almost identical to ```numpy``` version\n",
224 | "\n",
225 | "The only difference relates to the use of `model.predict()` and `model.evaluate()`. See `loss_and_metrics` and `predicted`."
226 | ]
227 | },
228 | {
229 | "cell_type": "code",
230 | "execution_count": 0,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "# Print performance to screen:\n",
235 | "def get_performance(n_valid):\n",
236 | " \"\"\"Computes performance and prints it to screen.\n",
237 | " \n",
238 | " Args:\n",
239 | " n_valid: number of validation instances we'd like to simulate.\n",
240 | " \n",
241 | " Returns:\n",
242 | " None\n",
243 | " \"\"\"\n",
244 | " flawless_tracker = []\n",
245 | " validation_vectors = np.random.binomial(1, 0.5, (n_valid, n_outputs))\n",
246 | " xor_validation_vectors = validation_vectors ^ 1\n",
247 | " \n",
248 | " loss_and_metrics = model.evaluate(validation_vectors,\n",
249 | " xor_validation_vectors, batch_size=n_valid)\n",
250 | " print(loss_and_metrics)\n",
251 | "\n",
252 | " for i in range(n_valid):\n",
253 | " predicted = model.predict(np.reshape(validation_vectors[i], (1,-1)), 1)\n",
254 | " labels = (predicted > 0.5).astype(int)[0,]\n",
255 | " if i < 3:\n",
256 | " print('********')\n",
257 | " print('Challenge ' + str(i + 1) + ': ' + str(validation_vectors[i]))\n",
258 | " print('Predicted ' + str(i + 1) + ': ' + str(labels))\n",
259 | " print('Correct ' + str(i + 1) + ': ' + str(xor_validation_vectors[i]))\n",
260 | " instance_score = (np.array_equal(labels, xor_validation_vectors[i]))\n",
261 | " flawless_tracker.append(instance_score)\n",
262 | " \n",
263 | " print('\\nProportion of flawless instances on ' + str(n_valid) +\n",
264 | " ' new examples: ' + str(round(100*np.mean(flawless_tracker),0)) + '%')"
265 | ]
266 | },
267 | {
268 | "cell_type": "code",
269 | "execution_count": 0,
270 | "metadata": {},
271 | "outputs": [],
272 | "source": [
273 | "get_performance(5000)"
274 | ]
275 | }
276 | ],
277 | "metadata": {
278 | "kernelspec": {
279 | "display_name": "Python 2",
280 | "language": "python",
281 | "name": "python2"
282 | }
283 | },
284 | "nbformat": 4,
285 | "nbformat_minor": 2
286 | }
287 |
--------------------------------------------------------------------------------
/cats/run_step_2a_query.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -eu
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Run bigquery command to write result of step_2a_query.sql to a table.
18 | #
19 |
20 | QUERY=$(cat 'step_2a_query.sql')
21 |
22 | bq query --use_legacy_sql=false --destination_table=dataset.catinfo "$QUERY" \
23 | || exit 1
24 |
25 | echo "Successfully wrote query output to table dataset.catinfo!"
--------------------------------------------------------------------------------
/cats/run_step_2b_get_images.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -eu
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # This script runs step 2 part B to collect positive and negative image samples!
18 | # Execute run_step_2a_query.sh first to collect cat-not-cat urls!
19 | #
20 | # Reads image urls and labels from $DATASET.$TABLE, resizes and outputs images
21 | # in a Google storage bucket $BUCKET under a directory $IMAGE_DIR/all_images.
22 |
23 | if [ "$#" -ne 5 ]; then
24 | echo "Usage: $0 project-name dataset table bucket-name image-dir"
25 | echo "dataset.table is the table that stores positive and negative cat labels, urls, and random numbers"
26 | echo "(bucket-name does not contain prefix gs://)"
27 | exit 1
28 | fi
29 |
30 | PROJECT=$1
31 | DATASET=$2
32 | TABLE=$3
33 | BUCKET=$4
34 | IMAGE_DIR=$5
35 |
36 | # This starts a DataflowRunner job on Google Cloud.
37 | # To manage read/write permissions and billing for Dataflow, the job requires
38 | # a project id (here, your current project), and it dispatches --num_workers
39 | # workers of type --worker_machine_type.
40 | python -m step_2b_get_images \
41 | --project $PROJECT \
42 | --runner DataflowRunner \
43 | --staging_location gs://$BUCKET/$IMAGE_DIR/staging \
44 | --temp_location gs://$BUCKET/$IMAGE_DIR/temp \
45 | --num_workers 50 \
46 | --worker_machine_type n1-standard-4 \
47 | --setup_file ./setup.py \
48 | --region $DATAFLOW_REGION \
49 |     --dataset $DATASET \
50 | --table $TABLE \
51 | --storage-bucket $BUCKET \
52 | --output-dir $IMAGE_DIR/all_images \
53 | --output-image-dim 128 \
54 | --cloud
55 |
--------------------------------------------------------------------------------
/cats/run_step_3_split_images.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -eu
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Note: Run step 2 before running this step!
18 | #
19 | # Takes images in $IMAGE_DIR/all_images and splits them to three separate
20 | # folders: $IMAGE_DIR/training_images, $IMAGE_DIR/validation_images,
21 | # and $IMAGE_DIR/test_images.
22 |
23 | if [ "$#" -lt 3 ]; then
24 | echo "Usage: $0 project-name bucket-name image-dir"
25 | echo "(bucket-name does not contain prefix gs://)"
26 |   echo "image-dir is the directory you used in step 2; it should contain"
27 |   echo "an 'all_images' folder"
28 | exit 1
29 | fi
30 |
31 | PROJECT=$1
32 | BUCKET=$2
33 | IMAGE_DIR=$3
34 |
35 | python -m step_3_split_images \
36 | --project $PROJECT \
37 | --runner DataflowRunner \
38 | --staging_location gs://$BUCKET/$IMAGE_DIR/staging \
39 | --temp_location gs://$BUCKET/$IMAGE_DIR/temp \
40 | --num_workers 20 \
41 | --worker_machine_type n1-standard-4 \
42 | --setup_file ./setup.py \
43 | --region $DATAFLOW_REGION \
44 | --storage-bucket $BUCKET \
45 | --source-image-dir $IMAGE_DIR/all_images \
46 | --dest-image-dir $IMAGE_DIR \
47 | --split-names training_images validation_images test_images \
48 | --split-fractions 0.5 0.3 0.2 \
49 | --cloud
50 |
51 |
--------------------------------------------------------------------------------
/cats/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Setup.py is called to install packages needed by cloud dataflow VMs.
18 | #
19 | # Some opencv c++ dependencies are needed in addition to python libraries.
20 | #
21 |
22 | from subprocess import STDOUT, check_call, CalledProcessError
23 | from sys import platform
24 | import setuptools
25 |
26 | NAME = 'dataflow_setup'
27 | VERSION = '1.0'
28 | CV = 'opencv-python==3.3.0.10'
29 | CV_CONTRIB = 'opencv-contrib-python==3.3.0.10'
30 |
31 | if __name__ == '__main__':
32 | if platform == "linux" or platform == "linux2":
33 | try:
34 | check_call(['apt-get', 'update'],
35 | stderr=STDOUT)
36 | check_call(['apt-get', '-y', 'upgrade'],
37 | stderr=STDOUT)
38 | # Install opencv-related libraries on workers
39 | check_call(['apt-get', 'install', '-y', 'libgtk2.0-dev', 'libsm6',
40 | 'libxrender1', 'libfontconfig1', 'libxext6'],
41 | stderr=STDOUT)
42 | except CalledProcessError:
43 | pass
44 | setuptools.setup(name=NAME,
45 | version=VERSION,
46 | packages=setuptools.find_packages(),
47 | install_requires=[CV,
48 | CV_CONTRIB])
49 |
--------------------------------------------------------------------------------
/cats/step_0_to_0.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Let's Get Started With Data Science, World!\n",
8 | "=================="
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "**Author(s):** kozyr@google.com\n",
16 | "\n",
17 | "**Reviewer(s):** nrh@google.com\n",
18 | "\n"
19 | ]
20 | },
21 | {
22 | "cell_type": "markdown",
23 | "metadata": {},
24 | "source": [
25 | "It's a beautiful day and we can do all kinds of pretty things. Here are some little examples to get you started."
26 | ]
27 | },
28 | {
29 | "cell_type": "markdown",
30 | "metadata": {},
31 | "source": [
32 | "## Print: ...something"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 0,
38 | "metadata": {},
39 | "outputs": [],
40 | "source": [
41 | "print('Hello world!')"
42 | ]
43 | },
44 | {
45 | "cell_type": "markdown",
46 | "metadata": {},
47 | "source": [
48 | "## Numpy: make some noise!\n",
49 | "\n",
50 | "```numpy``` is the essential package for working with numbers. Simulate some noise and make a straight line."
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": 0,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "import numpy as np\n",
60 | "n = 20\n",
61 | "intercept = -10\n",
62 | "slope = 5\n",
63 | "noise = 10\n",
64 | "error = np.random.normal(0, noise, n)\n",
65 | "x = np.array(range(n))\n",
66 | "y = intercept + slope * x + error\n",
67 | "\n",
68 | "print(x)\n",
69 | "print(np.round(y, 2))"
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": []
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "## Pandas: not just for chewing bamboo\n",
82 | "\n",
83 | "```pandas``` is the essential package for working with dataframes. Make a convenient dataframe for using our feature to predict our label."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 0,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "import pandas as pd\n",
93 | "df = pd.DataFrame({'feature': x, 'label': y})\n",
94 | "print(df)"
95 | ]
96 | },
97 | {
98 | "cell_type": "markdown",
99 | "metadata": {},
100 | "source": [
101 | "## Seaborn: pretty plotting\n",
102 | "\n",
103 | "A picture is worth a thousand numbers. ```seaborn``` puts some glamour in your plotting style."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 0,
109 | "metadata": {},
110 | "outputs": [],
111 | "source": [
112 | "import seaborn as sns\n",
113 | "%matplotlib inline\n",
114 | "sns.regplot(x='feature', y='label', data=df)"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 0,
120 | "metadata": {},
121 | "outputs": [],
122 | "source": [
123 | "%matplotlib inline\n",
124 | "sns.distplot(error, axlabel='residuals')"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 0,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "%matplotlib inline\n",
134 | "sns.jointplot(x='feature', y='label', data=df)"
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## TensorFlow: built for speed\n",
142 | "\n",
143 | "```tensorflow``` is the essential package for training neural networks efficiently at scale. In order to be efficient at scale, it only runs when it's required to. Let's ask it to greet us..."
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 0,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "import tensorflow as tf\n",
153 | "\n",
154 | "c = tf.constant('Hello, world!')\n",
155 | "\n",
156 | "with tf.Session() as sess:\n",
157 | "\n",
158 | " print sess.run(c)"
159 | ]
160 | },
161 | {
162 | "cell_type": "markdown",
163 | "metadata": {},
164 | "source": [
165 | "Finally, let's greet our tensorflow supported devices! Say hello to our CPU, and our GPU if we invited it to the party!"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 0,
171 | "metadata": {},
172 | "outputs": [],
173 | "source": [
174 | "from tensorflow.python.client import device_lib\n",
175 | "\n",
176 | "def get_devices():\n",
177 | " devices = device_lib.list_local_devices()\n",
178 | " return [x.name for x in devices]\n",
179 | "\n",
180 | "print(get_devices())"
181 | ]
182 | }
183 | ],
184 | "metadata": {
185 | "kernelspec": {
186 | "display_name": "Python 2",
187 | "language": "python",
188 | "name": "python2"
189 | }
190 | },
191 | "nbformat": 4,
192 | "nbformat_minor": 1
193 | }
194 |
--------------------------------------------------------------------------------
/cats/step_1_to_3.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "Performance Metric and Requirements\n",
8 | "=================="
9 | ]
10 | },
11 | {
12 | "cell_type": "markdown",
13 | "metadata": {},
14 | "source": [
15 | "**Author(s):** kozyr@google.com"
16 | ]
17 | },
18 | {
19 | "cell_type": "markdown",
20 | "metadata": {},
21 | "source": [
22 | "Before we get started on data, we have to choose our project performance metric and decide the statistical testing criteria. We'll make use of the metric code we write here when we get to Step 6 (Training) and we'll use the criteria in Step 9 (Testing)."
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": 0,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "# Required libraries:\n",
32 | "import numpy as np\n",
33 | "import pandas as pd\n",
34 | "import seaborn as sns"
35 | ]
36 | },
37 | {
38 | "cell_type": "markdown",
39 | "metadata": {},
40 | "source": [
41 | "## Performance Metric: Accuracy\n",
42 | "\n",
43 | "We've picked accuracy as our performance metric.\n",
44 | "\n",
45 | "Accuracy $ = \\frac{\\text{correct predictions}}{\\text{total predictions}}$"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": 0,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "# Accuracy metric:\n",
55 | "def get_accuracy(truth, predictions, threshold=0.5, roundoff=2):\n",
56 | " \"\"\"\n",
57 | " Args:\n",
58 | " truth: can be Boolean (False, True), int (0, 1), or float (0, 1)\n",
59 | " predictions: number between 0 and 1, inclusive\n",
60 | " threshold: we convert predictions to 1s if they're above this value\n",
61 | " roundoff: report accuracy to how many decimal places?\n",
62 | "\n",
63 | " Returns:\n",
64 | " accuracy: number correct divided by total predictions\n",
65 | " \"\"\"\n",
66 | "\n",
67 | "  truth = np.array(truth) == 1  # True == 1, so this handles bool, int, and float inputs\n",
68 | " predicted = np.array(predictions) >= threshold\n",
69 | " matches = sum(predicted == truth)\n",
70 | " accuracy = float(matches) / len(truth)\n",
71 | " return round(accuracy, roundoff)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 0,
77 | "metadata": {},
78 | "outputs": [],
79 | "source": [
80 | "# Try it out:\n",
81 | "acc = get_accuracy(truth=[0, False, 1], predictions=[0.2, 0.7, 0.6])\n",
82 | "print 'Accuracy is ' + str(acc) + '.'"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## Compare Loss Function with Performance Metric"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 0,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "def get_loss(predictions, truth):\n",
99 | " # Our methods will be using cross-entropy loss.\n",
100 | " return -np.mean(truth * np.log(predictions) + (1 - truth) * np.log(1 - predictions))"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 0,
106 | "metadata": {},
107 | "outputs": [],
108 | "source": [
109 | "# Simulate some situations:\n",
110 | "loss = []\n",
111 | "acc = []\n",
112 | "for i in range(1000):\n",
113 | " for n in [10, 100, 1000]:\n",
114 | " p = np.random.uniform(0.01, 0.99, (1, 1))\n",
115 | " y = np.random.binomial(1, p, (n, 1))\n",
116 | " x = np.random.uniform(0.01, 0.99, (n, 1))\n",
117 | " acc = np.append(acc, get_accuracy(truth=y, predictions=x, roundoff=6))\n",
118 | " loss = np.append(loss, get_loss(predictions=x, truth=y))\n",
119 | "\n",
120 | "df = pd.DataFrame({'accuracy': acc, 'cross-entropy': loss})"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 0,
126 | "metadata": {},
127 | "outputs": [],
128 | "source": [
129 | "# Visualize with Seaborn\n",
130 | "import seaborn as sns\n",
131 | "%matplotlib inline\n",
132 | "sns.regplot(x=\"accuracy\", y=\"cross-entropy\", data=df)"
133 | ]
134 | },
135 | {
136 | "cell_type": "markdown",
137 | "metadata": {},
138 | "source": [
139 | "## Hypothesis Testing Setup"
140 | ]
141 | },
142 | {
143 | "cell_type": "code",
144 | "execution_count": 0,
145 | "metadata": {},
146 | "outputs": [],
147 | "source": [
148 | "# Testing setup:\n",
149 | "SIGNIFICANCE_LEVEL = 0.05\n",
150 | "TARGET_ACCURACY = 0.80\n",
151 | "\n",
152 | "# Hypothesis test we'll use:\n",
153 | "from statsmodels.stats.proportion import proportions_ztest"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": 0,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "# Using standard notation for a one-sided test of one population proportion:\n",
163 | "n = 100 # Example number of predictions\n",
164 | "x = 95 # Example number of correct predictions\n",
165 | "p_value = proportions_ztest(count=x, nobs=n, value=TARGET_ACCURACY, alternative='larger')[1]\n",
166 | "if p_value < SIGNIFICANCE_LEVEL:\n",
167 | " print 'Congratulations! Your model is good enough to build. It passes testing. Awesome!'\n",
168 | "else:\n",
169 | " print 'Too bad. Better luck next project. To try again, you need a pristine test dataset.'"
170 | ]
171 | },
172 | {
173 | "cell_type": "markdown",
174 | "metadata": {},
175 | "source": [
176 | "# Step 2 - Get Data\n",
177 | "\n",
178 | "This part is done outside Jupyter and run in your VM using the shell script provided."
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "# Step 3 - Split Data\n",
186 | "\n",
187 | "This part is done outside Jupyter and run in your VM using the shell script provided."
188 | ]
189 | }
190 | ],
191 | "metadata": {
192 | "kernelspec": {
193 | "display_name": "Python 2",
194 | "language": "python",
195 | "name": "python2"
196 | }
197 | },
198 | "nbformat": 4,
199 | "nbformat_minor": 1
200 | }
201 |
--------------------------------------------------------------------------------
/cats/step_2a_query.sql:
--------------------------------------------------------------------------------
1 | WITH all_images_and_labels AS (
2 | SELECT i.original_url, l.label_name, l.confidence
3 | FROM `bigquery-public-data.open_images.images` i
4 | JOIN `bigquery-public-data.open_images.labels` l
5 | ON i.image_id = l.image_id
6 | )
7 | SELECT original_url, label, RAND() as randnum
8 | FROM
9 | (
10 | SELECT DISTINCT original_url, 1 as label
11 | FROM all_images_and_labels
12 | WHERE confidence = 1
13 | AND label_name LIKE '/m/01yrx'
14 | UNION ALL
15 | (
16 | SELECT DISTINCT all_images.original_url, 0 as label
17 | FROM all_images_and_labels all_images
18 | LEFT JOIN
19 | (
20 | SELECT original_url
21 | FROM all_images_and_labels
22 | WHERE confidence = 1
23 | AND NOT (label_name LIKE '/m/01yrx')
24 | ) not_cat
25 | ON all_images.original_url = not_cat.original_url
26 | WHERE not_cat.original_url IS NULL
27 | LIMIT 40000
28 | )
29 | )
30 | ORDER BY randnum
--------------------------------------------------------------------------------
/cats/step_2b_get_images.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 |
17 | """Runs a beam pipeline to resize and pad images from urls and save to storage.
18 |
19 | The images are read from the urls in the source bigquery table.
20 |
21 | The images are then filtered by removing any 'None' object returned from a bad
22 | URL, or the placeholder 'missing' image that Flickr returns for unavailable images.
23 |
24 | The remaining images are resized and padded using the opencv library such that
25 | they are square images of a particular number of pixels (default: 128).
26 |
27 | Finally, the images are written into a user provided output directory on Cloud
28 | Storage. The output filenames will look like:
29 |
30 | [index]_[randnum_times_1000]_[label].png,
31 |
32 | where:
33 |
34 | -index is an integer corresponding to the row index of the bigquery table,
35 | prepended with 0's until it is at least 6 digits long,
36 |
37 | -randnum_times_1000 is the randnum field from the step 2a query multiplied
38 | by 1000 and truncated at the decimal point, i.e. an integer between 0 and 999.
39 | This was done to make filenames more succinct and avoid extra periods where
40 | there need not be.
41 |
42 | -label is either 0 (not cat) or 1 (cat)
43 | """
44 |
45 | import argparse
46 | import logging
47 | import urllib
48 |
49 | import apache_beam as beam
50 | from apache_beam.io.filesystems import FileSystems
51 | from apache_beam.io.gcp import bigquery
52 | from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
53 |
54 | import cv2
55 | import numpy as np
56 |
57 | # Keys in dictionaries passed through pipeline
58 | RAND_KEY = 'randnum'
59 | IMAGE_KEY = 'img'
60 | INDEX_KEY = 'index'
61 | LABEL_KEY = 'label'
62 |
63 | # Bad flickr image: If the image is missing, there are specific dimensions to
64 | # the returned "missing image" from flickr with these dimensions. Furthermore,
65 | # the returned image has exactly a certain number of unique pixel values.
66 | # We will omit images of these exact dimensions and unique values.
67 | BAD_DIMENSIONS = [(374, 500, 3), (768, 1024, 3)]
68 | BAD_UNIQUE_PIXEL_VALUES = 86
69 |
70 |
71 | def run_pipeline(pipeline_args, known_args):
72 | """A beam pipeline to resize and pad images from urls and save to storage.
73 |
74 | Args:
75 | pipeline_args: Arguments consumed by the beam pipeline
76 | known_args: Extra args used to set various fields such as the dataset and
77 | table from which to read cat urls and labels, and the bucket
78 | and image directory to write processed images
79 |
80 | Returns:
81 | [nothing], just writes processed images to the image directory
82 | """
83 |
84 | # Specify pipeline options
85 | pipeline_options = PipelineOptions(pipeline_args)
86 | pipeline_options.view_as(SetupOptions).save_main_session = True
87 |
88 | # Determine bigquery source from dataset and table arguments
89 | query = ('SELECT ROW_NUMBER() OVER() as index, original_url, label, randnum'
90 | ' from [' + known_args.dataset + '.' + known_args.table + ']')
91 | bq_source = bigquery.BigQuerySource(query=query)
92 |
93 | logging.info('Starting image collection into directory '
94 | + known_args.output_dir)
95 |
96 | # Create destination directory if it doesn't exist
97 | output_dir = known_args.output_dir
98 | if known_args.cloud:
99 | output_dir = 'gs://' + known_args.storage_bucket + '/' + output_dir
100 |
101 | # Directory needs to be explicitly made on some filesystems.
102 | if not FileSystems.exists(output_dir):
103 | FileSystems.mkdirs(output_dir)
104 |
105 | # Run pipeline
106 | with beam.Pipeline(options=pipeline_options) as p:
107 | _ = (p
108 | | 'read_rows_from_cat_info_table'
109 | >> beam.io.Read(bq_source)
110 | | 'fetch_images_from_urls'
111 | >> beam.Map(fetch_image_from_url)
112 | | 'filter_bad_or_absent_images'
113 | >> beam.Filter(filter_bad_or_missing_image)
114 | | 'resize_and_pad_images'
115 | >> beam.Map(resize_and_pad,
116 | output_image_dim=known_args.output_image_dim)
117 | | 'write_images_to_storage'
118 | >> beam.Map(write_processed_image,
119 | output_dir=output_dir)
120 | )
121 |
122 | logging.info('Done collecting images')
123 |
124 |
125 | def fetch_image_from_url(row):
126 | """Replaces image url field with actual image from the url.
127 |
128 | The images are downloaded and decoded into 3 color bitmaps using opencv.
129 | All other entries are simply forwarded (index, label, randnum).
130 |
131 | Catches exceptions where the url is bad, the image is not found,
132 | or the default flickr "missing" image is returned, and returns `None`.
133 | This is used at the next step downstream (beam.filter) to remove bad images.
134 |
135 | Args:
136 | row: a dictionary with entries 'index', 'randnum', 'label', 'original_url'
137 |
138 | Returns:
139 | a dictionary with entries 'index', 'randnum', 'label', 'img'
140 |
141 | """
142 | url = row['original_url']
143 | try:
144 | resp = urllib.urlopen(url)
145 | img = np.asarray(bytearray(resp.read()), dtype='uint8')
146 | img = cv2.imdecode(img, cv2.IMREAD_COLOR)
147 |
148 | # Check whether the url was bad
149 | if img is None:
150 | logging.warn('Image ' + url + ' not found. Skipping.')
151 | return None
152 |
153 | return {
154 | INDEX_KEY: row[INDEX_KEY],
155 | IMAGE_KEY: img,
156 | LABEL_KEY: row[LABEL_KEY],
157 | RAND_KEY: row[RAND_KEY]
158 | }
159 | except IOError:
160 | logging.warn('Trouble reading image from ' + url + '. Skipping.')
161 |
162 |
163 | def filter_bad_or_missing_image(img_and_metadata):
164 | """Filter rows where images are either missing (`None`), or an error image.
165 |
166 | An error image from Flickr matches specific dimensions and unique pixel
167 | values. The likelihood of another image meeting the exact criteria is
168 | very small.
169 |
170 | Args:
171 | img_and_metadata: a dictionary/row, or `None`
172 |
173 | Returns:
174 | True if the row contains valid image data, False otherwise
175 | """
176 |
177 | if img_and_metadata is not None:
178 | img = img_and_metadata[IMAGE_KEY]
179 |
180 | # Check whether the image is the "missing" flickr image:
181 | if img.shape in BAD_DIMENSIONS:
182 | if len(np.unique(img)) == BAD_UNIQUE_PIXEL_VALUES:
183 |       logging.warn('Image number ' + str(img_and_metadata[INDEX_KEY]) +
184 | ' has dimensions ' + str(img.shape) +
185 | ' and ' + str(BAD_UNIQUE_PIXEL_VALUES) +
186 | ' unique pixels.' +
187 | ' Very likely to be a missing image on flickr.')
188 | return False
189 | return True
190 |
191 | return False
192 |
193 |
194 | def resize_and_pad(img_and_metadata, output_image_dim=128):
195 |   """Resize the image to make it output_image_dim x output_image_dim pixels in size.
196 |
197 | If an image is not square, it will pad the top/bottom or left/right
198 | with black pixels to ensure the image is square.
199 |
200 | Args:
201 | img_and_metadata: row containing a dictionary of the input image along with
202 | its index, label, and randnum
203 |
204 | Returns:
205 | dictionary with same values as input dictionary,
206 | but with image resized and padded
207 | """
208 |
209 | img = img_and_metadata[IMAGE_KEY]
210 | h, w = img.shape[:2]
211 |
212 | # interpolation method
213 | if h > output_image_dim or w > output_image_dim:
214 | # use preferred interpolation method for shrinking image
215 | interp = cv2.INTER_AREA
216 | else:
217 | # use preferred interpolation method for stretching image
218 | interp = cv2.INTER_CUBIC
219 |
220 | # aspect ratio of image
221 | aspect = float(w) / h
222 |
223 | # compute scaling and pad sizing
224 | if aspect > 1: # Image is "wide". Add black pixels on top and bottom.
225 | new_w = output_image_dim
226 | new_h = np.round(new_w / aspect)
227 | pad_vert = (output_image_dim - new_h) / 2
228 | pad_top, pad_bot = int(np.floor(pad_vert)), int(np.ceil(pad_vert))
229 | pad_left, pad_right = 0, 0
230 | elif aspect < 1: # Image is "tall". Add black pixels on left and right.
231 | new_h = output_image_dim
232 | new_w = np.round(new_h * aspect)
233 | pad_horz = (output_image_dim - new_w) / 2
234 | pad_left, pad_right = int(np.floor(pad_horz)), int(np.ceil(pad_horz))
235 | pad_top, pad_bot = 0, 0
236 | else: # square image
237 | new_h = output_image_dim
238 | new_w = output_image_dim
239 | pad_left, pad_right, pad_top, pad_bot = 0, 0, 0, 0
240 |
241 |   # scale to output_image_dim x output_image_dim and pad with zeros (black pixels)
242 | scaled_img = cv2.resize(img, (int(new_w), int(new_h)), interpolation=interp)
243 | scaled_img = cv2.copyMakeBorder(scaled_img,
244 | pad_top, pad_bot, pad_left, pad_right,
245 | borderType=cv2.BORDER_CONSTANT, value=0)
246 |
247 | return {
248 | INDEX_KEY: img_and_metadata[INDEX_KEY],
249 | IMAGE_KEY: scaled_img,
250 | LABEL_KEY: img_and_metadata[LABEL_KEY],
251 | RAND_KEY: img_and_metadata[RAND_KEY]
252 | }
253 |
254 |
255 | def write_processed_image(img_and_metadata, output_dir):
256 | """Encode the image as a png and write to google storage.
257 |
258 | Creates a function that will read processed images and save them to png
259 | files in directory output_dir.
260 |
261 | The output image filename is given by a unique index + '_' + randnum to
262 |   the 3rd decimal place + '_' + label + '.png'. The index is zero-padded so
263 |   that it is at least 6 characters long. This keeps the numeric and
264 |   lexicographical orderings of filenames by the index field consistent for
265 |   up to 1 million images.
266 |
267 | Args:
268 | img_and_metadata: image, index, randnum, and label
269 |     output_dir: output image directory (a local path or a gs:// location,
270 |       depending on the --cloud flag)
271 |
272 | Returns:
273 | [nothing] - just writes to file destination
274 | """
275 |
276 | # Construct image filename
277 | img_filename = (str(img_and_metadata[INDEX_KEY]).zfill(6) +
278 | '_' +
279 | '{0:.3f}'.format(img_and_metadata[RAND_KEY]).split('.')[1] +
280 | '_' + str(img_and_metadata[LABEL_KEY]) +
281 | '.png')
282 |
283 | # Encode image to png
284 | png_image = cv2.imencode('.png', img_and_metadata[IMAGE_KEY])[1].tostring()
285 |
286 | # Use beam.io.filesystems package to create local or gs file, and write image
287 | with FileSystems.create(output_dir + '/' + img_filename) as f:
288 | f.write(png_image)
289 |
290 |
291 | def run(argv=None):
292 |   """Main entry point for step 2b.
293 |
294 | Args:
295 |     argv: Command line arguments. See `run_step_2b_get_images.sh` for typical
296 | usage.
297 | """
298 | parser = argparse.ArgumentParser()
299 | parser.add_argument(
300 | '--storage-bucket',
301 | required=True,
302 | help='Google storage bucket used to store processed image outputs'
303 | )
304 | parser.add_argument(
305 | '--dataset',
306 | required=True,
307 | help='dataset of the table through which you will read image urls and'
308 | ' labels'
309 | )
310 | parser.add_argument(
311 | '--table',
312 | required=True,
313 | help='table through which you will read image urls and labels'
314 | )
315 | parser.add_argument(
316 | '--output-dir',
317 | help='determines where to store processed image results',
318 | type=str,
319 | default='cat_images'
320 | )
321 | parser.add_argument(
322 | '--output-image-dim',
323 | help='The number of pixels per side of the output image (square image)',
324 | type=int,
325 | default=128
326 | )
327 | parser.add_argument(
328 | '--cloud',
329 | dest='cloud',
330 | action='store_true',
331 | help='Run on the cloud. If this flag is absent, '
332 | 'it will write processed images to your local directory instead.'
333 | )
334 |
335 | known_args, pipeline_args = parser.parse_known_args(argv)
336 | run_pipeline(pipeline_args, known_args)
337 |
338 |
339 | if __name__ == '__main__':
340 | logging.getLogger().setLevel(logging.INFO)
341 | run()
--------------------------------------------------------------------------------
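A quick illustration of the filename convention used by `write_processed_image` above: the index is zero-padded to 6 digits, randnum is kept to 3 decimal places, and `step_3_split_images.py` later parses randnum back out of the name. A minimal sketch, using hypothetical values:

```
index, randnum, label = 1234, 0.157432, 1    # hypothetical example values
filename = (str(index).zfill(6) + '_' +
            '{0:.3f}'.format(randnum).split('.')[1] + '_' +
            str(label) + '.png')
print(filename)                              # -> 001234_157_1.png
# Step 3 recovers randnum from the filename for splitting:
print(float(filename.split('_')[-2]) / 1e3)  # -> 0.157
```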
/cats/step_3_split_images.py:
--------------------------------------------------------------------------------
1 | #
2 | # Copyright 2017 Google LLC
3 | #
4 | # Licensed under the Apache License, Version 2.0 (the "License");
5 | # you may not use this file except in compliance with the License.
6 | # You may obtain a copy of the License at
7 | #
8 | # https://www.apache.org/licenses/LICENSE-2.0
9 | #
10 | # Unless required by applicable law or agreed to in writing, software
11 | # distributed under the License is distributed on an "AS IS" BASIS,
12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 | # See the License for the specific language governing permissions and
14 | # limitations under the License.
15 | #
16 |
17 | """Splits image set into multiple directories using thresholds on randnum.
18 |
19 | The user provides a source directory, a destination directory, and a set of
20 | sub-directory names and fractions to use for splitting the image set.
21 | The pipeline will copy images from the source directory to one of the
22 | destination sub-directories by thresholding on the randnum field from step 2b.
23 | """
24 |
25 | import argparse
26 | import logging
27 |
28 | import apache_beam as beam
29 | from apache_beam.io.filebasedsource import FileBasedSource
30 | from apache_beam.io.filesystems import FileSystems
31 | from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
32 |
33 | import numpy as np
34 |
35 | # Keys in dictionaries passed through pipeline
36 | RAND_KEY = 'randnum'
37 | FILENAME_KEY = 'filename'
38 | IMAGE_KEY = 'img'
39 |
40 |
41 | def run_pipeline(pipeline_args, known_args):
42 | """Splits images into separate directories using thresholds on randnum.
43 |
44 | Args:
45 | pipeline_args: arguments ingested by beam pipeline
46 | known_args: additional arguments for this project, such as the storage
47 | bucket, source_image_dir, and dest_image_dir.
48 |
49 | Returns:
50 | [nothing] - runs beam pipeline and copies output files to different dirs
51 | """
52 | # Specify pipeline options
53 | pipeline_options = PipelineOptions(pipeline_args)
54 | pipeline_options.view_as(SetupOptions).save_main_session = True
55 |
56 | # Attach bucket prefix if running on cloud
57 | source_images_pattern = known_args.source_image_dir + '/*'
58 | dest_prefix = known_args.dest_image_dir + '/'
59 | if known_args.cloud:
60 | source_images_pattern = ('gs://' + known_args.storage_bucket +
61 | '/' + source_images_pattern)
62 | dest_prefix = ('gs://' + known_args.storage_bucket +
63 | '/' + dest_prefix)
64 |
65 | # Get output directories for split images
66 | split_names = known_args.split_names
67 | split_fractions = known_args.split_fractions
68 | dest_images_dirs = [dest_prefix + x + '/' for x in split_names]
69 |
70 | # Create output directories if they do not already exist (for local runs)
71 | for dest_images_dir in dest_images_dirs:
72 | if not FileSystems.exists(dest_images_dir):
73 | FileSystems.mkdirs(dest_images_dir)
74 |
75 | # Log information on source, destination, and split fractions
76 | split_log_list = [x[0] + '(' + str(x[1]) + ')' for x in
77 | zip(split_names, split_fractions)]
78 | logging.info('Starting ' + ' | '.join(split_log_list) +
79 | ' split from images with source file pattern ' +
80 | source_images_pattern)
81 | logging.info('Destination parent directory: ' + dest_prefix)
82 |
83 | with beam.Pipeline(options=pipeline_options) as p:
84 | # Read files and partition pipelines
85 | split_pipelines = (
86 | p
87 | | 'read_images'
88 | >> beam.io.Read(LabeledImageFileReader(source_images_pattern))
89 | | 'split_images'
90 | >> beam.Partition(
91 | generate_split_fn(split_fractions),
92 | len(split_fractions)
93 | )
94 | )
95 |
96 | # Write each pipeline to a corresponding output directory
97 | for partition, split_name_and_dest_dir in enumerate(
98 | zip(split_names, dest_images_dirs)):
99 | _ = (split_pipelines[partition]
100 | | 'write_' + split_name_and_dest_dir[0]
101 | >> beam.Map(write_to_directory,
102 | dst_dir=split_name_and_dest_dir[1]))
103 |
104 | logging.info('Done splitting image sets')
105 |
106 |
107 | def generate_split_fn(split_fractions):
108 | """Generate a partition function using the RAND_KEY field and split fractions.
109 |
110 |   The returned function compares a row's randnum against the cumulative sums of
111 |   split_fractions and returns the index of the first threshold it falls below,
112 |   i.e. an integer between 0 and len(split_fractions) - 1.
113 |
114 | Args:
115 | split_fractions: data split fractions organized in a list of floats
116 |
117 | Returns:
118 |     a partition function mapping each row to one of len(split_fractions) partitions
119 | """
120 |
121 | cumulative_split_fractions = np.cumsum(split_fractions)
122 |
123 | def _split_fn(data, num_partitions):
124 | for i, thresh in enumerate(cumulative_split_fractions):
125 | if data[RAND_KEY] < thresh:
126 | return i
127 | return _split_fn
128 |
129 |
130 | class LabeledImageFileReader(FileBasedSource):
131 | """A FileBasedSource that yields the entire file content with some metadata.
132 |
133 | A pipeline file source always has a function read_records() that iterates
134 | through entries within the file. Since each of our files is an image file,
135 | we don't have multiple entries, so the implementation here is just to read
136 | the entire file (using f.read()), and return the filename and the random key
137 | along with it.
138 |
139 |   Note that the "if not img: break" clause ensures that after the single full
140 |   read, the generator stops yielding anything further, and the pipeline moves
141 |   on to the next file.
142 | """
143 |
144 | def read_records(self, filename, offset_range_tracker):
145 | """Creates a generator that returns data needed for splitting the images.
146 |
147 | Args:
148 | filename: full path of the file
149 |       offset_range_tracker: tracks read progress within the file; unused here.
150 |
151 | Yields:
152 | The entire file content along with the filename and RAND_KEY for
153 | splitting
154 | """
155 | with self.open_file(filename) as f:
156 | while True:
157 | img = f.read()
158 | if not img:
159 | break
160 | # Change the RAND_KEY denominator's power to match the number of decimal
161 | # places for randnum in step 2
162 | yield {
163 | FILENAME_KEY: filename,
164 | RAND_KEY: float(filename.split('_')[-2]) / 1e3,
165 | IMAGE_KEY: img
166 | }
167 |
168 |
169 | def write_to_directory(img_and_metadata, dst_dir):
170 | """Write the serialized image data (png) to dst_dir/filename.
171 |
172 | Filename is the original filename of the image.
173 |
174 | Args:
175 | img_and_metadata: filename, randnum, and serialized image data (png)
176 | dst_dir: output directory
177 |
178 | Returns:
179 | [nothing] - this component serves as a sink
180 | """
181 | source = img_and_metadata[FILENAME_KEY]
182 | with FileSystems.create(dst_dir + FileSystems.split(source)[1]) as f:
183 | f.write(img_and_metadata[IMAGE_KEY])
184 |
185 |
186 | def run(argv=None):
187 | """Main entry point for step 3.
188 |
189 | Args:
190 | argv: Command line args. See the run_step_3_split_images.sh script for
191 | typical flag usage.
192 | """
193 | parser = argparse.ArgumentParser()
194 | parser.add_argument(
195 | '--storage-bucket',
196 | required=True,
197 | help='Google storage bucket used to store processed image outputs'
198 | )
199 | parser.add_argument(
200 | '--source-image-dir',
201 | help='determines where to read original images',
202 | type=str,
203 | default='catimages/all_images'
204 | )
205 | parser.add_argument(
206 | '--dest-image-dir',
207 | help='determines base output directory for splitting images',
208 | type=str,
209 | default='catimages'
210 | )
211 | parser.add_argument(
212 | '--split-names',
213 | help='Names of subsets (directories) split from source image set',
214 | nargs='+',
215 | default=['training_images', 'validation_images', 'test_images']
216 | )
217 | parser.add_argument(
218 | '--split-fractions',
219 | help='Fraction of data to split into subsets',
220 | type=float,
221 | nargs='+',
222 | default=[0.5, 0.3, 0.2]
223 | )
224 | parser.add_argument(
225 | '--cloud',
226 | dest='cloud',
227 | action='store_true',
228 | help='Run on the cloud. If this flag is absent, '
229 | 'it will write processed images to your local directory instead.'
230 | )
231 |
232 | known_args, pipeline_args = parser.parse_known_args(argv)
233 | run_pipeline(pipeline_args, known_args)
234 |
235 |
236 | if __name__ == '__main__':
237 | logging.getLogger().setLevel(logging.INFO)
238 | run()
--------------------------------------------------------------------------------
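To make the partitioning logic above concrete, here is a minimal standalone sketch (no Beam required, values assumed) of how the function returned by `generate_split_fn` buckets rows by randnum, using the default `--split-fractions` of 0.5/0.3/0.2:

```
import numpy as np

split_fractions = [0.5, 0.3, 0.2]        # default --split-fractions
thresholds = np.cumsum(split_fractions)  # [0.5, 0.8, 1.0]

def split_index(randnum):
  # Index of the first cumulative threshold that randnum falls below.
  for i, thresh in enumerate(thresholds):
    if randnum < thresh:
      return i

for r in (0.12, 0.63, 0.95):
  # 0 -> training_images, 1 -> validation_images, 2 -> test_images
  print('randnum %.2f -> partition %d' % (r, split_index(r)))
```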
/cats/step_4_to_4_part1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Exploring the Training Set\n",
8 | "\n",
9 | "**Author(s):** kozyr@google.com, bfoo@google.com\n",
10 | "\n",
11 | "In this notebook, we gather exploratory data from our training set to do feature engineering and model tuning. Before running this notebook, make sure that:\n",
12 | "\n",
13 | "* You have already run steps 2 and 3 to collect and split your data into training, validation, and test. \n",
14 | "* Your training data is in a Google storage folder such as gs://[your-bucket]/[dataprep-dir]/training_images/"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "In the spirit of learning to walk before learning to run, we'll write this notebook in a more basic style than you'll see in a professional setting."
22 | ]
23 | },
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "# Setup\n",
29 | "\n",
30 | "**TODO for you:** In Screen terminal 1 (to begin with Screen in the VM, first type\n",
31 | "`screen` and `Ctrl+a c`), go to the VM shell and type `Ctrl+a 1`,\n",
32 | "create a folder to store your training and debugging images, and then copy a small\n",
33 | "sample of training images from Cloud Storage:\n",
34 | "```\n",
35 | "mkdir -p ~/data/training_small\n",
36 | "gsutil -m cp gs://$BUCKET/catimages/training_images/000*.png ~/data/training_small/\n",
37 | "gsutil -m cp gs://$BUCKET/catimages/training_images/001*.png ~/data/training_small/\n",
38 | "mkdir -p ~/data/debugging_small\n",
39 | "gsutil -m cp gs://$BUCKET/catimages/training_images/002*.png ~/data/debugging_small\n",
40 | "echo \"done!\"\n",
41 | "```\n",
42 | "\n",
43 | "Note that we only take the images starting with those IDs to limit the total number we'll copy over to under 3 thousand images."
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 0,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "# Enter your username:\n",
53 | "YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address"
54 | ]
55 | },
56 | {
57 | "cell_type": "code",
58 | "execution_count": 0,
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "# Libraries for this section:\n",
63 | "import os\n",
64 | "import matplotlib.pyplot as plt\n",
65 | "import matplotlib.image as mpimg\n",
66 | "import numpy as np\n",
67 | "import pandas as pd\n",
68 | "import cv2\n",
69 | "import warnings\n",
70 | "warnings.filterwarnings('ignore')"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 0,
76 | "metadata": {},
77 | "outputs": [],
78 | "source": [
79 | "# Grab the filenames:\n",
80 | "TRAINING_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/training_small/')\n",
81 | "files = os.listdir(TRAINING_DIR) # Grab all the files in the VM images directory\n",
82 | "print(files[0:5]) # Let's see some filenames"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "# Eyes on the data!"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 0,
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "def show_pictures(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):\n",
99 | " \"\"\"Display the first few images.\n",
100 | "\n",
101 | " Args:\n",
102 | " filelist: list of filenames to pull from\n",
103 | " dir: directory where the files are stored\n",
104 | " img_rows: number of rows of images to display\n",
105 | " img_cols: number of columns of images to display\n",
106 | " figsize: sizing for inline plots\n",
107 | "\n",
108 | " Returns:\n",
109 | " None\n",
110 | " \"\"\"\n",
111 | " plt.close('all')\n",
112 | " fig = plt.figure(figsize=figsize)\n",
113 | "\n",
114 | " for i in range(img_rows * img_cols):\n",
115 | " a=fig.add_subplot(img_rows, img_cols,i+1)\n",
116 | " img = mpimg.imread(os.path.join(dir, filelist[i]))\n",
117 | " plt.imshow(img)\n",
118 | " plt.show()"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 0,
124 | "metadata": {},
125 | "outputs": [],
126 | "source": [
127 | "show_pictures(files, TRAINING_DIR)"
128 | ]
129 | },
130 | {
131 | "cell_type": "markdown",
132 | "metadata": {},
133 | "source": [
134 |     "Check out the colors at [rapidtables.com/web/color/RGB_Color](http://www.rapidtables.com/web/color/RGB_Color.htm), but don't forget to flip the order of the channels to BGR, since OpenCV loads images in BGR order."
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 0,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "# What does the actual image matrix look like? There are three channels:\n",
144 | "img = cv2.imread(os.path.join(TRAINING_DIR, files[0]))\n",
145 | "print('\\n***Colors in the middle of the first image***\\n')\n",
146 | "print('Blue channel:')\n",
147 | "print(img[63:67,63:67,0])\n",
148 | "print('Green channel:')\n",
149 | "print(img[63:67,63:67,1])\n",
150 | "print('Red channel:')\n",
151 | "print(img[63:67,63:67,2])"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": 0,
157 | "metadata": {},
158 | "outputs": [],
159 | "source": [
160 | "def show_bgr(filelist, dir, img_rows=2, img_cols=3, figsize=(20, 10)):\n",
161 | " \"\"\"Make histograms of the pixel color matrices of first few images.\n",
162 | "\n",
163 | " Args:\n",
164 | " filelist: list of filenames to pull from\n",
165 | " dir: directory where the files are stored\n",
166 | " img_rows: number of rows of images to display\n",
167 | " img_cols: number of columns of images to display\n",
168 | " figsize: sizing for inline plots\n",
169 | "\n",
170 | " Returns:\n",
171 | " None\n",
172 | " \"\"\"\n",
173 | " plt.close('all')\n",
174 | " fig = plt.figure(figsize=figsize)\n",
175 | " color = ('b','g','r')\n",
176 | "\n",
177 | " for i in range(img_rows * img_cols):\n",
178 | " a=fig.add_subplot(img_rows, img_cols, i + 1)\n",
179 |     "    img = cv2.imread(os.path.join(dir, filelist[i]))\n",
180 | " for c,col in enumerate(color):\n",
181 | " histr = cv2.calcHist([img],[c],None,[256],[0,256])\n",
182 | " plt.plot(histr,color = col)\n",
183 | " plt.xlim([0,256])\n",
184 | " plt.ylim([0,500])\n",
185 | " plt.show()"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": 0,
191 | "metadata": {},
192 | "outputs": [],
193 | "source": [
194 | "show_bgr(files, TRAINING_DIR)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "metadata": {},
200 | "source": [
201 | "# Do some sanity checks\n",
202 | "\n",
203 | "For example:\n",
204 | "* Do we have blank images?\n",
205 | "* Do we have images with very few colors?"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 0,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": [
214 | "# Pull in blue channel for each image, reshape to vector, count unique values:\n",
215 | "unique_colors = []\n",
216 | "landscape = []\n",
217 | "for f in files:\n",
218 | " img = np.array(cv2.imread(os.path.join(TRAINING_DIR, f)))[:,:,0]\n",
219 | " # Determine if landscape is more likely than portrait by comparing\n",
220 |     "    # the number of zero (black padding) pixels in the 3rd row vs. the 3rd column:\n",
221 | " landscape_likely = (np.count_nonzero(img[:,2]) > np.count_nonzero(img[2,:])) * 1\n",
222 | " # Count number of unique blue values:\n",
223 | " col_count = len(set(img.ravel()))\n",
224 | " # Append to array:\n",
225 | " unique_colors.append(col_count)\n",
226 | " landscape.append(landscape_likely)\n",
227 | " \n",
228 | "unique_colors = pd.DataFrame({'files': files, 'unique_colors': unique_colors,\n",
229 | " 'landscape': landscape})\n",
230 | "unique_colors = unique_colors.sort_values(by=['unique_colors'])\n",
231 | "print(unique_colors[0:10])"
232 | ]
233 | },
234 | {
235 | "cell_type": "code",
236 | "execution_count": 0,
237 | "metadata": {},
238 | "outputs": [],
239 | "source": [
240 | "# Plot the pictures with the lowest diversity of unique color values:\n",
241 | "suspicious = unique_colors['files'].tolist()\n",
242 | "show_pictures(suspicious, TRAINING_DIR, 1)"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "# Get labels\n",
250 | "\n",
251 | "Extract labels from the filename and create a pretty dataframe for analysis."
252 | ]
253 | },
254 | {
255 | "cell_type": "code",
256 | "execution_count": 0,
257 | "metadata": {},
258 | "outputs": [],
259 | "source": [
260 |     "def get_label(filename):\n",
261 |     "  \"\"\"\n",
262 |     "  Split out the label from the filename of the image, where we stored it.\n",
263 |     "  Args:\n",
264 |     "    filename: filename string.\n",
265 |     "  Returns:\n",
266 |     "    label: an integer 1 or 0\n",
267 |     "  \"\"\"\n",
268 |     "  split_filename = filename.split('_')\n",
269 | " label = int(split_filename[-1].split('.')[0])\n",
270 | " return(label)\n",
271 | "\n",
272 | "# Example:\n",
273 | "get_label('12550_0.1574_1.png')"
274 | ]
275 | },
276 | {
277 | "cell_type": "markdown",
278 | "metadata": {},
279 | "source": [
280 | "## Create DataFrame"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": 0,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "df = unique_colors[:]\n",
290 | "df['label'] = df['files'].apply(lambda x: get_label(x))\n",
291 | "df['landscape_likely'] = df['landscape']\n",
292 | "df = df.drop(['landscape', 'unique_colors'], axis=1)\n",
293 | "df[:10]"
294 | ]
295 | },
296 | {
297 | "cell_type": "markdown",
298 | "metadata": {},
299 | "source": [
300 | "# Basic Feature Engineering\n",
301 | "\n",
302 |     "Below, we show an example of a very simple set of features that can be derived from an image. For each image band (blue, green, or red), the function pulls the number of unique values, the count of nonzero pixels, and the mean, standard deviation, min, and max of the pixel values."
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": 0,
308 | "metadata": {},
309 | "outputs": [],
310 | "source": [
311 | "def general_img_features(band):\n",
312 | " \"\"\"\n",
313 | " Define a set of features that we can look at for each color band\n",
314 | " Args:\n",
315 | " band: array which is one of blue, green, or red\n",
316 | " Returns:\n",
317 | " features: unique colors, nonzero count, mean, standard deviation,\n",
318 | " min, and max of the channel's pixel values\n",
319 | " \"\"\"\n",
320 | " return [len(set(band.ravel())), np.count_nonzero(band),\n",
321 | " np.mean(band), np.std(band),\n",
322 | " band.min(), band.max()]\n",
323 | "\n",
324 | "def concat_all_band_features(file, dir):\n",
325 | " \"\"\"\n",
326 | " Extract features from a single image.\n",
327 | " Args:\n",
328 | " file - single image filename\n",
329 | " dir - directory where the files are stored\n",
330 | " Returns:\n",
331 | " features - descriptive statistics for pixels\n",
332 | " \"\"\"\n",
333 | " img = cv2.imread(os.path.join(dir, file))\n",
334 | " features = []\n",
335 | " blue = np.float32(img[:,:,0])\n",
336 | " green = np.float32(img[:,:,1])\n",
337 | " red = np.float32(img[:,:,2])\n",
338 |     "  features.extend(general_img_features(blue)) # indices 0-5\n",
339 |     "  features.extend(general_img_features(green)) # indices 6-11\n",
340 |     "  features.extend(general_img_features(red)) # indices 12-17\n",
341 | " return features"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": 0,
347 | "metadata": {},
348 | "outputs": [],
349 | "source": [
350 | "# Let's see an example:\n",
351 | "print(files[0] + '\\n')\n",
352 | "example = concat_all_band_features(files[0], TRAINING_DIR)\n",
353 | "print(example)"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": 0,
359 | "metadata": {},
360 | "outputs": [],
361 | "source": [
362 | "# Apply it to our dataframe:\n",
363 | "feature_names = ['blue_unique', 'blue_nonzero', 'blue_mean', 'blue_sd', 'blue_min', 'blue_max',\n",
364 | " 'green_unique', 'green_nonzero', 'green_mean', 'green_sd', 'green_min', 'green_max',\n",
365 | " 'red_unique', 'red_nonzero', 'red_mean', 'red_sd', 'red_min', 'red_max']\n",
366 | "\n",
367 | "# Compute a series holding all band features as lists\n",
368 | "band_features_series = df['files'].apply(lambda x: concat_all_band_features(x, TRAINING_DIR))\n",
369 | "\n",
370 | "# Loop through lists and distribute them across new columns in the dataframe\n",
371 | "for i in range(len(feature_names)):\n",
372 | " df[feature_names[i]] = band_features_series.apply(lambda x: x[i])\n",
373 | "df[:10]"
374 | ]
375 | },
376 | {
377 | "cell_type": "code",
378 | "execution_count": 0,
379 | "metadata": {},
380 | "outputs": [],
381 | "source": [
382 | "# Are these features good for finding cats?\n",
383 | "# Let's look at some basic correlations.\n",
384 | "df.corr().round(2)"
385 | ]
386 | },
387 | {
388 | "cell_type": "markdown",
389 | "metadata": {},
390 | "source": [
391 |     "These coarse features look pretty bad individually, largely because they capture absolute pixel values, and lighting can vary significantly between shots. What we end up with is mostly noise.\n",
392 | "\n",
393 | "Are there some better feature detectors we can consider? Why yes, there are! Several common features involve finding corners in pictures, and looking for pixel gradients (differences in pixel values between neighboring pixels in different directions)."
394 | ]
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "# Harris Corner Detector\n",
401 | "\n",
402 |     "The following snippet visualizes Harris corner detection for a few sample images. The threshold determines how strong a response we require before treating a pixel as a corner (high pixel gradients in all directions).\n",
403 | "\n",
404 | "Note that because a Harris corner detector returns another image map with values corresponding to the likelihood of a corner at that pixel, it can also be fed into general_img_features() to extract additional features. What do you notice about corners on cat images?"
405 | ]
406 | },
407 | {
408 | "cell_type": "code",
409 | "execution_count": 0,
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "THRESHOLD = 0.05\n",
414 | "\n",
415 | "def show_harris(filelist, dir, band=0, img_rows=4, img_cols=4, figsize=(20, 10)):\n",
416 | " \"\"\"\n",
417 | " Display Harris corner detection for the first few images.\n",
418 | " Args:\n",
419 | " filelist: list of filenames to pull from\n",
420 | " dir: directory where the files are stored\n",
421 | " band: 0 = 'blue', 1 = 'green', 2 = 'red'\n",
422 | " img_rows: number of rows of images to display\n",
423 | " img_cols: number of columns of images to display\n",
424 | " figsize: sizing for inline plots\n",
425 | " Returns:\n",
426 | " None\n",
427 | " \"\"\"\n",
428 | " plt.close('all')\n",
429 | " fig = plt.figure(figsize=figsize)\n",
430 | "\n",
431 | " def plot_bands(src, band_img):\n",
432 | " a=fig.add_subplot(img_rows, img_cols, i + 1)\n",
433 | " dst = cv2.cornerHarris(band_img, 2, 3, 0.04)\n",
434 | " dst = cv2.dilate(dst,None) # dilation makes the marks a little bigger\n",
435 | "\n",
436 | " # Threshold for an optimal value, it may vary depending on the image.\n",
437 | " new_img = src.copy()\n",
438 | " new_img[dst > THRESHOLD * dst.max()]=[0, 0, 255]\n",
439 | " # Note: openCV reverses the red-green-blue channels compared to matplotlib,\n",
440 | " # so we have to flip the image before showing it\n",
441 | " imgplot = plt.imshow(cv2.cvtColor(new_img, cv2.COLOR_BGR2RGB))\n",
442 | "\n",
443 | " for i in range(img_rows * img_cols):\n",
444 | " img = cv2.imread(os.path.join(dir, filelist[i]))\n",
445 | " plot_bands(img, img[:,:,band])\n",
446 | "\n",
447 | " plt.show()"
448 | ]
449 | },
450 | {
451 | "cell_type": "code",
452 | "execution_count": 0,
453 | "metadata": {},
454 | "outputs": [],
455 | "source": [
456 | "show_harris(files, TRAINING_DIR)"
457 | ]
458 | }
459 | ],
460 | "metadata": {
461 | "kernelspec": {
462 | "display_name": "Python 2",
463 | "language": "python",
464 | "name": "python2"
465 | }
466 | },
467 | "nbformat": 4,
468 | "nbformat_minor": 1
469 | }
470 |
--------------------------------------------------------------------------------
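A side note on the `landscape_likely` sanity check in the notebook above: step 2b pads wide photos with black rows at the top and bottom, so an early row is mostly zeros while an early column is mostly not. A minimal sketch on a synthetic, already-padded 128x128 channel (values assumed):

```
import numpy as np

img = np.zeros((128, 128), dtype=np.uint8)  # one channel of a padded image
img[32:96, :] = 200                         # a "wide" photo occupying the middle rows
landscape_likely = (np.count_nonzero(img[:, 2]) > np.count_nonzero(img[2, :])) * 1
print(landscape_likely)                     # -> 1: row 3 is all padding, column 3 is not
```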
/cats/step_4_to_4_part2.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Feature Engineering\n",
8 | "\n",
9 | "**Author(s):** bfoo@google.com, kozyr@google.com\n",
10 | "\n",
11 | "\n",
12 | "\n",
13 | "In this notebook, we gather exploratory data from our training set to do feature engineering and model tuning. Before running this notebook, make sure that:\n",
14 | "\n",
15 | "* You have already run steps 2 and 3 to collect and split your data into training, validation, and test. \n",
16 | "* Your entire training dataset is in a Cloud Storage Bucket such as gs://[your-bucket]/[dataprep-dir]/training_images/\n",
17 | "* You have a small subset of the training data available on your VM already (from the exploration we did in the previous notebook):\n",
18 | "\n",
19 | "\n",
20 | "```\n",
21 | "mkdir -p ~/data/training_small\n",
22 | "gsutil -m cp gs://$BUCKET/catimages/training_images/000*.png ~/data/training_small/\n",
23 | "gsutil -m cp gs://$BUCKET/catimages/training_images/001*.png ~/data/training_small/\n",
24 | "mkdir -p ~/data/debugging_small\n",
25 | "gsutil -m cp gs://$BUCKET/catimages/training_images/002*.png ~/data/debugging_small\n",
26 | "echo \"done!\"\n",
27 | "```\n",
28 | "\n",
29 | "Note that we only take the images starting with those IDs to limit the number we'll copy over to only a few thousand images."
30 | ]
31 | },
32 | {
33 | "cell_type": "markdown",
34 | "metadata": {},
35 | "source": [
36 | "# Setup\n"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": 0,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": [
45 | "# Enter your username:\n",
46 | "YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address"
47 | ]
48 | },
49 | {
50 | "cell_type": "code",
51 | "execution_count": 0,
52 | "metadata": {},
53 | "outputs": [],
54 | "source": [
55 | "# Libraries for this section:\n",
56 | "import os\n",
57 | "import cv2\n",
58 | "import pickle\n",
59 | "import numpy as np\n",
60 | "from sklearn import preprocessing"
61 | ]
62 | },
63 | {
64 | "cell_type": "code",
65 | "execution_count": 0,
66 | "metadata": {},
67 | "outputs": [],
68 | "source": [
69 | "# Directories:\n",
70 | "PREPROC_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/')\n",
71 | "TRAIN_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/training_small/') # Where the training dataset lives.\n",
72 | "DEBUG_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/debugging_small/') # Where the debugging dataset lives."
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "# Feature Engineering Functions\n"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "metadata": {},
85 | "source": [
86 | "## Basic features and concatenation"
87 | ]
88 | },
89 | {
90 | "cell_type": "code",
91 | "execution_count": 0,
92 | "metadata": {},
93 | "outputs": [],
94 | "source": [
95 | "def general_img_features(band):\n",
96 | " \"\"\"\n",
97 | " Define a set of features that we can look at for each color band\n",
98 | " Args:\n",
99 | " band: array which is one of blue, green, or red\n",
100 | " Returns:\n",
101 | " features: unique colors, nonzero count, mean, standard deviation,\n",
102 | " min, and max of the channel's pixel values\n",
103 | " \"\"\"\n",
104 | " return [len(set(band.ravel())), np.count_nonzero(band),\n",
105 | " np.mean(band), np.std(band),\n",
106 | " band.min(), band.max()]\n",
107 | "\n",
108 | "def concat_all_band_features(file, dir):\n",
109 | " \"\"\"\n",
110 | " Extract features from a single image.\n",
111 | " Args:\n",
112 | " file - single image filename\n",
113 | " dir - directory where the files are stored\n",
114 | " Returns:\n",
115 | " features - descriptive statistics for pixels\n",
116 | " \"\"\"\n",
117 | " img = cv2.imread(os.path.join(dir, file))\n",
118 | " features = []\n",
119 | " blue = np.float32(img[:,:,0])\n",
120 | " green = np.float32(img[:,:,1])\n",
121 | " red = np.float32(img[:,:,2])\n",
122 |     "  features.extend(general_img_features(blue)) # indices 0-5\n",
123 |     "  features.extend(general_img_features(green)) # indices 6-11\n",
124 |     "  features.extend(general_img_features(red)) # indices 12-17\n",
125 | " return features"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "metadata": {},
131 | "source": [
132 | "## Harris Corner Detector Histograms\n",
133 | "\n",
134 | "We'll create features based on the histogram of the number of corners detected in every small square in the picture. The threshold indicates how \"sharp\" that corner must be to be detected."
135 | ]
136 | },
137 | {
138 | "cell_type": "code",
139 | "execution_count": 0,
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "def harris_density(harris_img, square_size, threshold):\n",
144 | " \"\"\"Apply Harris Corner Detection to image and get count of corners.\n",
145 | "\n",
146 | " Args:\n",
147 | " harris_img: image already processed by Harris Corner Detector (in cv2 package).\n",
148 | " square_size: number of pixels per side of the window in which we detect corners. \n",
149 | " threshold: indicates how \"sharp\" that corner must be to be detected.\n",
150 | "\n",
151 | " Returns: \n",
152 | " bins - counts in each bin of histogram.\n",
153 | " \"\"\"\n",
154 | " max_val = harris_img.max()\n",
155 | " shape = harris_img.shape\n",
156 | " bins = [0] * (square_size * square_size + 1)\n",
157 | " for row in xrange(0, shape[0], square_size):\n",
158 | " for col in xrange(0, shape[1], square_size):\n",
159 | " bin_val = sum(sum(harris_img[row: row + square_size,\n",
160 | " col: col + square_size] > threshold * max_val))\n",
161 | " bins[int(bin_val)] += 1\n",
162 | " return bins"
163 | ]
164 | },
165 | {
166 | "cell_type": "markdown",
167 | "metadata": {},
168 | "source": [
169 | "# Building Feature Vectors\n",
170 | "\n",
171 |     "We've defined some functions and checked their outputs. Here is a sample feature vector constructor that pulls summary features from the grayscale, red, green, and blue channels, along with summary and density features computed from the Harris corner detector output."
172 | ]
173 | },
174 | {
175 | "cell_type": "code",
176 | "execution_count": 0,
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "def get_features(img_path):\n",
181 | " \"\"\"Engineer the features and output feature vectors.\n",
182 | " \n",
183 | " Args:\n",
184 | " img_path: filepath to image file\n",
185 | " \n",
186 | " Returns:\n",
187 | " features: np array of features\n",
188 | " \"\"\"\n",
189 | " img = cv2.imread(img_path)\n",
190 | " # Get the channels\n",
191 |     "    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cv2.imread loads images as BGR\n",
192 | " blue = np.float32(img[:, :, 0])\n",
193 | " green = np.float32(img[:, :, 1])\n",
194 | " red = np.float32(img[:, :, 2])\n",
195 | "\n",
196 | " # Run general summarization on each\n",
197 | " features = general_img_features(gray)\n",
198 | " features.extend(general_img_features(blue))\n",
199 | " features.extend(general_img_features(green))\n",
200 | " features.extend(general_img_features(red))\n",
201 | "\n",
202 | " # Get Harris corner detection output\n",
203 | " gray = cv2.cornerHarris(gray, 2, 3, 0.04)\n",
204 | " blue = cv2.cornerHarris(blue, 2, 3, 0.04)\n",
205 | " green = cv2.cornerHarris(green, 2, 3, 0.04)\n",
206 | " red = cv2.cornerHarris(red, 2, 3, 0.04)\n",
207 | "\n",
208 | " # Get general stats on each Harris detector results\n",
209 | " features.extend(general_img_features(gray))\n",
210 | " features.extend(general_img_features(blue))\n",
211 | " features.extend(general_img_features(green))\n",
212 | " features.extend(general_img_features(red))\n",
213 | "\n",
214 | " # Get density bins on Harris detector results\n",
215 | " features.extend(harris_density(gray, 4, 0.05))\n",
216 | "\n",
217 | " return features"
218 | ]
219 | },
220 | {
221 | "cell_type": "code",
222 | "execution_count": 0,
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "def get_features_and_labels(dir):\n",
227 | " \"\"\"Get preprocessed features and labels.\n",
228 | "\n",
229 | " Args:\n",
230 | " dir: directory containing image files\n",
231 | "\n",
232 | " Returns:\n",
233 | " features: np array of features\n",
234 | " labels: 1-d np array of binary labels\n",
235 | " \"\"\"\n",
236 | " i = 0\n",
237 | " features = None\n",
238 | " labels = []\n",
239 | " print('\\nImages processed (out of {:d})...'.format(len(os.listdir(dir))))\n",
240 | " for filename in os.listdir(dir):\n",
241 | " feature_row = np.array([get_features(os.path.join(dir, filename))])\n",
242 | " if features is not None:\n",
243 | " features = np.append(features, feature_row, axis=0)\n",
244 | " else:\n",
245 | " features = feature_row\n",
246 | " split_filename = filename.split('_')\n",
247 | " label = int(split_filename[-1].split('.')[0])\n",
248 | " labels = np.append(labels, label)\n",
249 | " i += 1\n",
250 | " if i % 100 == 0:\n",
251 | " print(features.shape[0])\n",
252 | " print(features.shape[0])\n",
253 | " return features, labels"
254 | ]
255 | },
256 | {
257 | "cell_type": "code",
258 | "execution_count": 0,
259 | "metadata": {},
260 | "outputs": [],
261 | "source": [
262 | "# Use a limited set of images, this is computationally expensive:\n",
263 | "training_features, training_labels = get_features_and_labels(TRAIN_DIR)\n",
264 | "debugging_features, debugging_labels = get_features_and_labels(DEBUG_DIR)\n",
265 | "\n",
266 | "print('\\nDone!')"
267 | ]
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "# Standardize and save\n",
274 | "\n",
275 | "If we don't want the magnitude of a feature column to have an undue influence on the results, we should standardize our features. **Standardization** is a process where the mean is subtracted from feature values, and the result is divided by the standard deviation."
276 | ]
277 | },
278 | {
279 | "cell_type": "code",
280 | "execution_count": 0,
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "# Standardize features:\n",
285 | "standardizer = preprocessing.StandardScaler().fit(training_features)\n",
286 | "training_std = standardizer.transform(training_features)\n",
287 | "debugging_std = standardizer.transform(debugging_features)\n",
288 | "\n",
289 | "# Save features as pkl:\n",
290 | "pickle.dump(training_std, open(os.path.join(PREPROC_DIR, 'training_std.pkl'), 'w'))\n",
291 | "pickle.dump(debugging_std, open(os.path.join(PREPROC_DIR, 'debugging_std.pkl'), 'w'))\n",
292 | "pickle.dump(training_labels, open(os.path.join(PREPROC_DIR, 'training_labels.pkl'), 'w'))\n",
293 | "pickle.dump(debugging_labels, open(os.path.join(PREPROC_DIR, 'debugging_labels.pkl'), 'w'))\n",
294 | "\n",
295 | "print ('\\nFeaturing engineering is complete!')"
296 | ]
297 | }
298 | ],
299 | "metadata": {
300 | "kernelspec": {
301 | "display_name": "Python 2",
302 | "language": "python",
303 | "name": "python2"
304 | }
305 | },
306 | "nbformat": 4,
307 | "nbformat_minor": 1
308 | }
309 |
--------------------------------------------------------------------------------
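To clarify what `harris_density` in the notebook above computes, here is a minimal standalone sketch on a synthetic 8x8 Harris response map (values assumed): each non-overlapping square_size x square_size window is scored by how many of its pixels exceed threshold * max, and those per-window counts are histogrammed.

```
import numpy as np

harris_img = np.zeros((8, 8), dtype=np.float32)
harris_img[0, 0] = 1.0  # one strong corner response in the first window
harris_img[5, 5] = 0.5  # two moderate responses in the last window
harris_img[5, 6] = 0.5

square_size, threshold = 4, 0.05
max_val = harris_img.max()
bins = [0] * (square_size * square_size + 1)
for row in range(0, harris_img.shape[0], square_size):
  for col in range(0, harris_img.shape[1], square_size):
    window = harris_img[row:row + square_size, col:col + square_size]
    bins[int(np.sum(window > threshold * max_val))] += 1

print(bins[:4])  # -> [2, 1, 1, 0]: two empty windows, one with 1 corner, one with 2
```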
/cats/step_5_to_6_part1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Logistic Regression Based on Extracted Features\n",
8 | "\n",
9 | "**Author(s):** bfoo@google.com, kozyr@google.com\n",
10 | "\n",
11 | "In this notebook, we will perform training over the features collected from step 4's image and feature analysis step. Two tools will be used in this demo:\n",
12 | "\n",
13 |     "* **Scikit-learn:** the widely used, single-machine Python machine learning library\n",
14 |     "* **TensorFlow:** Google's home-grown machine learning library, which supports distributed training"
15 | ]
16 | },
17 | {
18 | "cell_type": "markdown",
19 | "metadata": {},
20 | "source": [
21 | "# Setup\n",
22 | "\n",
23 |     "You need to have worked through the feature engineering notebook for this to work, since we'll be loading the pickled datasets we saved in Step 4. You might have to adjust the directories below if you changed the save directory in that notebook."
24 | ]
25 | },
26 | {
27 | "cell_type": "code",
28 | "execution_count": 0,
29 | "metadata": {},
30 | "outputs": [],
31 | "source": [
32 | "# Enter your username:\n",
33 | "YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address"
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": 0,
39 | "metadata": {},
40 | "outputs": [],
41 | "source": [
42 | "import cv2\n",
43 | "import numpy as np\n",
44 | "import os\n",
45 | "import pickle\n",
46 | "import shutil\n",
47 | "import sys\n",
48 | "import matplotlib.pyplot as plt\n",
49 | "import matplotlib.image as mpimg\n",
50 | "from random import random\n",
51 | "from scipy import stats\n",
52 | "from sklearn import preprocessing\n",
53 | "from sklearn import svm\n",
54 | "from sklearn.linear_model import LogisticRegression\n",
55 | "from sklearn.metrics import average_precision_score\n",
56 | "from sklearn.metrics import precision_recall_curve\n",
57 | "\n",
58 | "import tensorflow as tf\n",
59 | "from tensorflow.contrib.learn import LinearClassifier\n",
60 | "from tensorflow.contrib.learn import Experiment\n",
61 | "from tensorflow.contrib.learn.python.learn import learn_runner\n",
62 | "from tensorflow.contrib.layers import real_valued_column\n",
63 | "from tensorflow.contrib.learn import RunConfig"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": 0,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "# Directories:\n",
73 | "PREPROC_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/')\n",
74 | "OUTPUT_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/logreg/') # Does not need to exist yet."
75 | ]
76 | },
77 | {
78 | "cell_type": "markdown",
79 | "metadata": {},
80 | "source": [
81 | "## Load stored features and labels\n",
82 | "\n",
83 | "Load from the pkl files saved in step 4 and confirm that the feature length is correct."
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 0,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "training_std = pickle.load(open(PREPROC_DIR + 'training_std.pkl', 'r'))\n",
93 | "debugging_std = pickle.load(open(PREPROC_DIR + 'debugging_std.pkl', 'r'))\n",
94 | "training_labels = pickle.load(open(PREPROC_DIR + 'training_labels.pkl', 'r'))\n",
95 | "debugging_labels = pickle.load(open(PREPROC_DIR + 'debugging_labels.pkl', 'r'))\n",
96 | "\n",
97 | "FEATURE_LENGTH = training_std.shape[1]\n",
98 | "print FEATURE_LENGTH"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": 0,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": [
107 | "# Examine the shape of the feature data we loaded:\n",
108 | "print(type(training_std)) # Type will be numpy array.\n",
109 | "print(np.shape(training_std)) # Rows, columns."
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": 0,
115 | "metadata": {},
116 | "outputs": [],
117 | "source": [
118 | "# Examine the label data we loaded:\n",
119 | "print(type(training_labels)) # Type will be numpy array.\n",
120 | "print(np.shape(training_labels)) # How many datapoints?\n",
121 | "training_labels[:3] # First 3 training labels."
122 | ]
123 | },
124 | {
125 | "cell_type": "markdown",
126 | "metadata": {},
127 | "source": [
128 | "# Step 5: Enabling Logistic Regression to Run\n",
129 | "\n",
130 |     "Logistic regression is a generalized linear model that predicts the probability that each picture is a cat. Scikit-learn has a very easy interface for training a logistic regression model."
131 | ]
132 | },
133 | {
134 | "cell_type": "markdown",
135 | "metadata": {},
136 | "source": [
137 | "## Logistic Regression in scikit-learn\n",
138 | "\n",
139 |     "In logistic regression, one of the hyperparameters is the regularization parameter C, which in scikit-learn is the inverse of the regularization strength. Regularization is a penalty associated with the complexity of the model itself, such as the magnitude of its weights. The example below uses \"L1\" regularization, which has the following behavior: as C decreases (regularization gets stronger), the number of non-zero weights also decreases (complexity decreases). \n",
140 | "\n",
141 | "A high complexity model (high C) will fit very well to the training data, but will also capture the noise inherent in the training set. This could lead to poor performance when predicting labels on the debugging set.\n",
142 | "\n",
143 | "A low complexity model (low C) does not fit as well with training data, but will generalize better over unseen data. There is a delicate balance in this process, as oversimplifying the model also hurts its performance."
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": 0,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "# Plug into scikit-learn for logistic regression training\n",
153 | "model = LogisticRegression(penalty='l1', C=0.2) # C is inverse of the regularization strength\n",
154 | "model.fit(training_std, training_labels)\n",
155 | "\n",
156 |     "# Count the non-zero coefficients to check the regularization strength\n",
157 |     "print 'Non-zero weights', sum(model.coef_[0] != 0)"
158 | ]
159 | },
160 | {
161 | "cell_type": "markdown",
162 | "metadata": {},
163 | "source": [
164 | "# Step 6: Train Logistic Regression with scikit-learn\n",
165 | "\n",
166 | "Let's train!"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 0,
172 | "metadata": {},
173 | "outputs": [],
174 | "source": [
175 | "# Get the output predictions of the training and debugging inputs\n",
176 | "training_predictions = model.predict_proba(training_std)[:, 1]\n",
177 | "debugging_predictions = model.predict_proba(debugging_std)[:, 1]"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "That was easy! But how well did it do? Let's check the accuracy of the model we just trained."
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 0,
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "# Accuracy metric:\n",
194 | "def get_accuracy(truth, predictions, threshold=0.5, roundoff=2):\n",
195 | " \"\"\" \n",
196 | " Args:\n",
197 | " truth: can be Boolean (False, True), int (0, 1), or float (0, 1)\n",
198 | " predictions: number between 0 and 1, inclusive\n",
199 | " threshold: we convert predictions to 1s if they're above this value\n",
200 | " roundoff: report accuracy to how many decimal places?\n",
201 | "\n",
202 | " Returns: \n",
203 | " accuracy: number correct divided by total predictions\n",
204 | " \"\"\"\n",
205 | "\n",
206 |     "  truth = np.array(truth) == 1\n",
207 | " predicted = np.array(predictions) >= threshold\n",
208 | " matches = sum(predicted == truth)\n",
209 | " accuracy = float(matches) / len(truth)\n",
210 | " return round(accuracy, roundoff)\n",
211 | "\n",
212 | "# Compute our accuracy metric for training and debugging\n",
213 | "print 'Training accuracy is ' + str(get_accuracy(training_labels, training_predictions))\n",
214 | "print 'Debugging accuracy is ' + str(get_accuracy(debugging_labels, debugging_predictions))"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "# Step 5: Enabling Logistic Regression to Run v2.0"
222 | ]
223 | },
224 | {
225 | "cell_type": "markdown",
226 | "metadata": {},
227 | "source": [
228 | "## Tensorflow Model\n",
229 | "\n",
230 |     "TensorFlow is Google's home-grown tool for defining a model and running distributed training on it. In this notebook, we focus on the atomic pieces of building a TensorFlow model; however, everything here is trained locally."
231 | ]
232 | },
233 | {
234 | "cell_type": "markdown",
235 | "metadata": {},
236 | "source": [
237 | "## Input functions\n",
238 | "\n",
239 | "Tensorflow requires the user to define input functions, which are functions that return rows of feature vectors, and their corresponding labels. Tensorflow will periodically call these functions to obtain data as model training progresses. \n",
240 | "\n",
241 | "Why not just provide the feature vectors and labels upfront? Again, this comes down to the distributed aspect of Tensorflow, where data can be received from various sources, and not all data can fit on a single machine. For instance, you may have several million rows distributed across a cluster, but any one machine can only provide a few thousand rows. Tensorflow allows you to define the input function to pull data in from a queue rather than a numpy array, and that queue can contain training data that is available at that time.\n",
242 | "\n",
243 | "Another practical reason for supplying limited training data is that sometimes the feature vectors are very long, and only a few rows can fit within memory at a time. Finally, complex ML models (such as deep neural networks) take a long time to train and use up a lot of resources, and so limiting the training samples at each machine allows us to train faster and without memory issues.\n",
244 | "\n",
245 |     "The features returned by an input function are defined as a dictionary of scalar, categorical, or tensor-valued features. The labels are returned as a single tensor. In this notebook, we will simply return the entire set of features and labels with every function call."
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": 0,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": [
254 | "def train_input_fn():\n",
255 | " training_X_tf = tf.convert_to_tensor(training_std, dtype=tf.float32)\n",
256 | " training_y_tf = tf.convert_to_tensor(training_labels, dtype=tf.float32)\n",
257 | " return {'features': training_X_tf}, training_y_tf\n",
258 | "\n",
259 | "def eval_input_fn():\n",
260 | " debugging_X_tf = tf.convert_to_tensor(debugging_std, dtype=tf.float32)\n",
261 | " debugging_y_tf = tf.convert_to_tensor(debugging_labels, dtype=tf.float32)\n",
262 | " return {'features': debugging_X_tf}, debugging_y_tf"
263 | ]
264 | },
265 | {
266 | "cell_type": "markdown",
267 | "metadata": {},
268 | "source": [
269 | "## Logistic Regression with TensorFlow\n",
270 | "\n",
271 | "Tensorflow's linear classifiers, such as logistic regression, are structured as estimators. An estimator has the ability to compute the objective function of the ML model, and take a step towards reducing it. Tensorflow has built-in estimators such as \"LinearClassifier\", which is just a logistic regression trainer. These estimators have additional metrics that are calculated, such as the average accuracy at threshold = 0.5."
272 | ]
273 | },
274 | {
275 | "cell_type": "code",
276 | "execution_count": 0,
277 | "metadata": {},
278 | "outputs": [],
279 | "source": [
280 | "# Tweak this hyperparameter to improve debugging precision-recall AUC. \n",
281 |     "REG_L1 = 5.0  # Use the inverse of C in sklearn, i.e. 1/C.\n",
282 | "LEARNING_RATE = 2.0 # How aggressively to adjust coefficients during optimization?\n",
283 | "TRAINING_STEPS = 20000\n",
284 | "\n",
285 | "# The estimator requires an array of features from the dictionary of feature columns to use in the model\n",
286 | "feature_columns = [real_valued_column('features', dimension=FEATURE_LENGTH)]\n",
287 | "\n",
288 | "# We use Tensorflow's built-in LinearClassifier estimator, which implements a logistic regression.\n",
289 | "# You can go to the model_dir below to see what Tensorflow leaves behind during training.\n",
290 | "# Delete the directory if you wish to retrain.\n",
291 | "estimator = LinearClassifier(feature_columns=feature_columns,\n",
292 | " optimizer=tf.train.FtrlOptimizer(\n",
293 | " learning_rate=LEARNING_RATE,\n",
294 | " l1_regularization_strength=REG_L1),\n",
295 | " model_dir=OUTPUT_DIR + '-model-reg-' + str(REG_L1)\n",
296 | " )"
297 | ]
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "### Experiments and Runners\n",
304 | "\n",
305 | "An experiment is a TensorFlow object that stores the estimator, as well as several other parameters. It can also periodically write the model progress into checkpoints which can be loaded later if you would like to continue the model where the training last left off.\n",
306 | "\n",
307 | "Some of the parameters are:\n",
308 | "\n",
309 | "* train_steps: how many times to adjust model weights before stopping\n",
310 |     "* eval_steps: when a summary is written, the model, in its current state of progress, will try to predict the debugging data and calculate its accuracy. eval_steps is set to 1 because we only need to call the input function once (it already returns the entire evaluation dataset).\n",
311 |     "* The rest of the parameters just boil down to \"do evaluation once\".\n",
312 | "\n",
313 |     "(If you run the script below multiple times without changing REG_L1 or TRAINING_STEPS, you will notice that the model does not train further, because the checkpoint in model_dir has already been trained for that many steps under the given configuration.)"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": 0,
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "def generate_experiment_fn():\n",
323 | " def _experiment_fn(output_dir):\n",
324 | " return Experiment(estimator=estimator,\n",
325 | " train_input_fn=train_input_fn,\n",
326 | " eval_input_fn=eval_input_fn,\n",
327 | " train_steps=TRAINING_STEPS,\n",
328 | " eval_steps=1,\n",
329 | " min_eval_frequency=1)\n",
330 | " return _experiment_fn"
331 | ]
332 | },
333 | {
334 | "cell_type": "markdown",
335 | "metadata": {},
336 | "source": [
337 | "# Step 6: Train Logistic Regression with TensorFlow\n",
338 | "\n",
339 |     "Unless you change TensorFlow's verbosity, a lot of text is output. Such text can be useful when debugging a distributed training pipeline, but it is pretty noisy when running from a notebook locally. Look for the chunk at the end where \"accuracy\" is reported; that is the final result of the model."
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 0,
345 | "metadata": {},
346 | "outputs": [],
347 | "source": [
348 | "learn_runner.run(generate_experiment_fn(), OUTPUT_DIR + '-model-reg-' + str(REG_L1))"
349 | ]
350 | }
351 | ],
352 | "metadata": {
353 | "kernelspec": {
354 | "display_name": "Python 2",
355 | "language": "python",
356 | "name": "python2"
357 | }
358 | },
359 | "nbformat": 4,
360 | "nbformat_minor": 1
361 | }
362 |
--------------------------------------------------------------------------------
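As a small aside on the C-versus-sparsity behavior discussed in the notebook above, here is a minimal sketch on synthetic data (shapes and values assumed; solver='liblinear' is spelled out so the L1 penalty also works on newer scikit-learn versions) showing that smaller C means stronger regularization and fewer non-zero weights:

```
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # only 2 informative features

for C in (0.01, 0.1, 1.0):
  model = LogisticRegression(penalty='l1', C=C, solver='liblinear')
  model.fit(X, y)
  # The count of non-zero weights grows as C grows (regularization weakens).
  print('C=%.2f  non-zero weights: %d' % (C, int(np.sum(model.coef_[0] != 0))))
```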
/cats/step_8_to_9.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Feline Neural Network\n",
8 | "\n",
9 | "**Author(s):** kozyr@google.com, bfoo@google.com\n",
10 | "\n",
11 | "**Reviewer(s):** \n",
12 | "\n",
13 | "Let's train a basic convolutional neural network to recognize cats."
14 | ]
15 | },
16 | {
17 | "cell_type": "markdown",
18 | "metadata": {},
19 | "source": [
20 | "# Setup\n",
21 | "\n",
22 | "Download all of your image sets to the VM. Then set aside a couple thousand\n",
23 | "training images for debugging.\n",
24 | "\n",
25 | "```\n",
26 | "mkdir -p ~/data/training_images\n",
27 | "gsutil -m cp gs://$BUCKET/catimages/training_images/*.png ~/data/training_images/\n",
28 | "mkdir -p ~/data/validation_images\n",
29 | "gsutil -m cp gs://$BUCKET/catimages/validation_images/*.png ~/data/validation_images/\n",
30 | "mkdir -p ~/data/test_images\n",
31 | "gsutil -m cp gs://$BUCKET/catimages/test_images/*.png ~/data/test_images/\n",
32 | "mkdir -p ~/data/debugging_images\n",
33 | "mv ~/data/training_images/000*.png ~/data/debugging_images/\n",
34 | "mv ~/data/training_images/001*.png ~/data/debugging_images/\n",
35 | "echo \"done!\"\n",
36 | "```\n",
37 | "\n",
38 | "If you've already trained the model below once, SSH into your VM and run the\n",
39 | "following: ```rm -r ~/data/output_cnn_big``` so that you can start over."
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 0,
45 | "metadata": {},
46 | "outputs": [],
47 | "source": [
48 | "# Enter your username:\n",
49 | "YOUR_GMAIL_ACCOUNT = '******' # Whatever is before @gmail.com in your email address"
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": 0,
55 | "metadata": {},
56 | "outputs": [],
57 | "source": [
58 | "# Libraries for this section:\n",
59 | "import os\n",
60 | "import datetime\n",
61 | "import numpy as np\n",
62 | "import pandas as pd\n",
63 | "import cv2\n",
64 | "import matplotlib.pyplot as plt\n",
65 | "import matplotlib.image as mpimg\n",
66 | "import tensorflow as tf\n",
67 | "from tensorflow.contrib.learn import RunConfig, Experiment\n",
68 | "from tensorflow.contrib.learn.python.learn import learn_runner"
69 | ]
70 | },
71 | {
72 | "cell_type": "code",
73 | "execution_count": 0,
74 | "metadata": {},
75 | "outputs": [],
76 | "source": [
77 | "# Directory settings:\n",
78 | "TRAIN_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/training_images/') # Directory where the training dataset lives.\n",
79 | "DEBUG_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/debugging_images/') # Directory where the debugging dataset lives.\n",
80 | "VALID_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/validation_images/') # Directory where the validation dataset lives.\n",
81 | "TEST_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/test_images/') # Directory where the test dataset lives.\n",
82 | "OUTPUT_DIR = os.path.join('/home', YOUR_GMAIL_ACCOUNT, 'data/output_cnn_big/') # Directory where we store our logging and models.\n",
83 | "\n",
84 | "# TensorFlow setup:\n",
85 | "NUM_CLASSES = 2 # This code can be generalized beyond 2 classes (binary classification).\n",
86 | "QUEUE_CAP = 5000 # Number of images the TensorFlow queue can store during training.\n",
87 | "# For debugging, QUEUE_CAP is ignored in favor of using all images available.\n",
88 | "TRAIN_BATCH_SIZE = 500 # Number of images processed every training iteration.\n",
89 | "DEBUG_BATCH_SIZE = 100 # Number of images processed every debugging iteration.\n",
90 | "TRAIN_STEPS = 3000 # Number of batches to use for training.\n",
91 | "DEBUG_STEPS = 2 # Number of batches to use for debugging.\n",
92 |     "# Example: if the dataset is 5 batches ABCDE, train_steps = 2 uses AB, and train_steps = 7 uses ABCDEAB.\n",
93 | "\n",
94 | "# Monitoring setup:\n",
95 | "TRAINING_LOG_PERIOD_SECS = 60 # How often we want to log training metrics (from training hook in our model_fn).\n",
96 | "CHECKPOINT_PERIOD_SECS = 60 # How often we want to save a checkpoint.\n",
97 | "\n",
98 | "# Hyperparameters we'll tune in the tutorial:\n",
99 | "DROPOUT = 0.6 # Regularization parameter for neural networks - must be between 0 and 1.\n",
100 | "\n",
101 | "# Additional hyperparameters:\n",
102 | "LEARNING_RATE = 0.001 # Rate at which weights update.\n",
103 | "CNN_KERNEL_SIZE = 3 # Receptive field will be square window with this many pixels per side.\n",
104 | "CNN_STRIDES = 2 # Distance between consecutive receptive fields.\n",
105 | "CNN_FILTERS = 16 # Number of filters (new receptive fields to train, i.e. new channels) in first convolutional layer.\n",
106 | "FC_HIDDEN_UNITS = 512 # Number of hidden units in the fully connected layer of the network."
107 | ]
108 | },
109 | {
110 | "cell_type": "markdown",
111 | "metadata": {},
112 | "source": [
113 |     "Let's visualize what we're working with and get the pixel count for our images. They should be square for this to work, but luckily we already padded them with black pixels where needed in an earlier step."
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 0,
119 | "metadata": {},
120 | "outputs": [],
121 | "source": [
122 | "def show_inputs(dir, filelist=None, img_rows=1, img_cols=3, figsize=(20, 10)):\n",
123 | " \"\"\"Display the first few images.\n",
124 | " \n",
125 | " Args:\n",
126 | " dir: directory where the files are stored\n",
127 | " filelist: list of filenames to pull from, if left as default, all files will be used\n",
128 | " img_rows: number of rows of images to display\n",
129 | " img_cols: number of columns of images to display\n",
130 | " figsize: sizing for inline plots\n",
131 | " \n",
132 | " Returns:\n",
133 | " pixel_dims: pixel dimensions (height and width) of the image\n",
134 | " \"\"\"\n",
135 | " if filelist is None:\n",
136 | " filelist = os.listdir(dir) # Grab all the files in the directory\n",
137 | " filelist = np.array(filelist)\n",
138 | " plt.close('all')\n",
139 | " fig = plt.figure(figsize=figsize)\n",
140 | " print('File names:')\n",
141 | "\n",
142 | " for i in range(img_rows * img_cols):\n",
143 | " print(str(filelist[i]))\n",
144 | " a=fig.add_subplot(img_rows, img_cols,i + 1)\n",
145 | " img = mpimg.imread(os.path.join(dir, str(filelist[i])))\n",
146 | " plt.imshow(img)\n",
147 | " plt.show()\n",
148 | " return np.shape(img)"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 0,
154 | "metadata": {},
155 | "outputs": [],
156 | "source": [
157 | "pixel_dim = show_inputs(TRAIN_DIR)\n",
158 | "print('Images have ' + str(pixel_dim[0]) + 'x' + str(pixel_dim[1]) + ' pixels.')\n",
159 | "pixels = pixel_dim[0] * pixel_dim[1]"
160 | ]
161 | },
162 | {
163 | "cell_type": "markdown",
164 | "metadata": {},
165 | "source": [
166 | "# Step 5 - Get tooling for training convolutional neural networks\n",
167 | "\n",
168 |     "Here is where we enable training convolutional neural networks on data inputs like ours. We'll build it using a TensorFlow estimator. TensorFlow (TF) is designed for scale, which means it doesn't pull all of our data into memory at once; instead, it relies on lazy execution. We'll write functions that TF runs when it's efficient to do so, pulling in batches of our image data as needed.\n",
169 | "\n",
170 | "In order to make this work, we need to write code for the following:\n",
171 | "\n",
172 | "* Input function: generate_input_fn()\n",
173 | "* Neural network architecture: cnn()\n",
174 | "* Model function: generate_model_fn()\n",
175 | "* Estimator: tf.estimator.Estimator()\n",
176 | "* Experiment: generate_experiment_fn()\n",
177 | "* Prediction generator: cat_finder()"
178 | ]
179 | },
180 | {
181 | "cell_type": "markdown",
182 | "metadata": {},
183 | "source": [
184 | "## Input function\n",
185 | "\n",
186 | "The input function tells TensorFlow what format of feature and label data to expect. We'll set ours up to pull in all images in a directory we point it at. It expects images with filenames in the following format: number_number_label.extension, so if your file naming scheme is different, please edit the input function."
187 | ]
188 | },
189 | {
190 | "cell_type": "code",
191 | "execution_count": 0,
192 | "metadata": {},
193 | "outputs": [],
194 | "source": [
195 | "# Input function:\n",
196 | "def generate_input_fn(dir, batch_size, queue_capacity):\n",
197 | " \"\"\"Return _input_fn for use with TF Experiment.\n",
198 | " \n",
199 | " Will be called in the Experiment section below (see _experiment_fn).\n",
200 | " \n",
201 | " Args:\n",
202 | " dir: directory we're taking our files from, code is written to collect all files in this dir.\n",
203 |     "    batch_size: number of images ingested in each training iteration.\n",
204 | " queue_capacity: number of images the TF queue can store.\n",
205 | " \n",
206 | " Returns:\n",
207 | " _input_fn: a function that returns a batch of images and labels.\n",
208 | " \"\"\"\n",
209 | "\n",
210 | " file_pattern = os.path.join(dir, '*') # We're pulling in all files in the directory.\n",
211 | "\n",
212 | " def _input_fn():\n",
213 | " \"\"\"A function that returns a batch of images and labels.\n",
214 | "\n",
215 | " Args:\n",
216 | " None\n",
217 | "\n",
218 | " Returns:\n",
219 | " image_batch: 4-d tensor collection of images.\n",
220 | " label_batch: 1-d tensor of corresponding labels.\n",
221 | " \"\"\"\n",
222 | "\n",
223 | " height, width, channels = [pixel_dim[0], pixel_dim[1], 3] # [height, width, 3] because there are 3 channels per image.\n",
224 | " filenames_tensor = tf.train.match_filenames_once(file_pattern) # Collect the filenames\n",
225 | " # Queue that periodically reads in images from disk:\n",
226 |     "    # When ready to run an iteration, TF will take batch_size images out of filename_queue.\n",
227 | " filename_queue = tf.train.string_input_producer(\n",
228 | " filenames_tensor,\n",
229 | " shuffle=False) # Do not shuffle order of the images ingested.\n",
230 | " # Convert filenames from queue into contents (png images pulled into memory):\n",
231 | " reader = tf.WholeFileReader()\n",
232 | " filename, contents = reader.read(filename_queue)\n",
233 | " # Decodes contents pulled in into 3-d tensor per image:\n",
234 | " image = tf.image.decode_png(contents, channels=channels)\n",
235 | " # If dimensions mismatch, pad with zeros (black pixels) or crop to make it fit:\n",
236 | " image = tf.image.resize_image_with_crop_or_pad(image, height, width)\n",
237 | " # Parse out label from filename:\n",
238 | " label = tf.string_to_number(tf.string_split([tf.string_split([filename], '_').values[-1]], '.').values[0])\n",
239 | " # All your filenames should be in this format number_number_label.extension where label is 0 or 1.\n",
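240 |     "    # e.g. a (hypothetical) file named '0123_0456_1.png' splits on '_' to give '1.png', then on '.' to give '1', so its label is 1.0.\n",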
240 | " # Execute above in a batch of batch_size to create a 4-d tensor of collection of images:\n",
241 | " image_batch, label_batch = tf.train.batch(\n",
242 | " [image, label],\n",
243 | " batch_size,\n",
244 | " num_threads=1, # We'll decline the multithreading option so that everything stays in filename order.\n",
245 | " capacity=queue_capacity)\n",
246 | " # Normalization for better training:\n",
247 | " # Change scale from pixel uint8 values between 0 and 255 into normalized float32 values between 0 and 1:\n",
248 | " image_batch = tf.to_float(image_batch) / 255\n",
249 | " # Rescale from (0,1) to (-1,1) so that the \"center\" of the image range is 0:\n",
250 | " image_batch = (image_batch * 2) - 1\n",
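251 |     "    # e.g. a pixel value of 0 maps to -1.0, a mid-gray value of 128 maps to roughly 0, and 255 maps to +1.0.\n",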
251 | " return image_batch, label_batch\n",
252 | " return _input_fn"
253 | ]
254 | },
255 | {
256 | "cell_type": "markdown",
257 | "metadata": {},
258 | "source": [
259 | "## Neural network architecture\n",
260 | "\n",
261 |     "This is where we define the architecture of the neural network we're using, such as the number of hidden layers and units."
262 | ]
263 | },
264 | {
265 | "cell_type": "code",
266 | "execution_count": 0,
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "# CNN architecture:\n",
271 | "def cnn(features, dropout, reuse, is_training):\n",
272 | " \"\"\"Defines the architecture of the neural network.\n",
273 | " \n",
274 | " Will be called within generate_model_fn() below.\n",
275 | " \n",
276 | " Args: \n",
277 |     "    features: feature data as 4-d tensor (of batch_size) pulled in when _input_fn() is executed.\n",
278 | " dropout: regularization parameter in last layer (between 0 and 1, exclusive).\n",
279 | " reuse: a scoping safeguard. First time training: set to False, after that, set to True.\n",
280 |     "    is_training: if True, the model is being fit and dropout is applied; if False, dropout is skipped\n",
281 | " \n",
282 | " Returns:\n",
283 |     "    2-d tensor: each image's [logit(1-p), logit(p)] where p=Pr(1),\n",
284 | " i.e. probability that class is 1 (cat in our case).\n",
285 | " Note: logit(p) = logodds(p) = log(p / (1-p))\n",
286 | " \"\"\"\n",
287 | "\n",
288 | " # Next, we define a scope for reusing our variables, choosing our network architecture and naming our layers.\n",
289 | " with tf.variable_scope('cnn', reuse=reuse):\n",
290 | " layer_1 = tf.layers.conv2d( # 2-d convolutional layer; size of output image is (pixels/stride) a side with channels = filters.\n",
291 | " inputs=features, # previous layer (inputs) is features argument to the main function\n",
292 | " kernel_size=CNN_KERNEL_SIZE, # 3x3(x3 because we have 3 channels) receptive field (only square ones allowed)\n",
293 | " strides=CNN_STRIDES, # distance between consecutive receptive fields\n",
294 |     "      filters=CNN_FILTERS, # number of receptive fields to train; think of this as a CNN_FILTERS-channel image which is input to the next layer\n",
295 | " padding='SAME', # SAME uses zero padding if not all CNN_KERNEL_SIZE x CNN_KERNEL_SIZE positions are filled, VALID will ignore missing\n",
296 | " activation=tf.nn.relu) # activation function is ReLU which is f(x) = max(x, 0) \n",
297 | " \n",
298 | " # For simplicity, this neural network doubles the number of receptive fields (filters) with each layer.\n",
299 | " # By using more filters, we are able to preserve the spatial dimensions better by storing more information.\n",
300 | " #\n",
301 | " # To determine how much information is preserved by each layer, consider that with each layer,\n",
302 |     "    # the output width and height are each divided by the `strides` value.\n",
303 |     "    # When strides=2 for example, the input width W and height H are each reduced by 2x, resulting in\n",
304 | " # an \"image\" (formally, an activation field) for each filter output with dimensions W/2 x H/2.\n",
305 | " # By doubling the number of filters compared to the input number of filters, the total output\n",
306 | " # dimension becomes W/2 x H/2 x CNN_FILTERS*2, essentially compressing the input of the layer\n",
307 | " # (W x H x CNN_FILTERS) to half as many total \"pixels\" (hidden units) at the output.\n",
308 | " #\n",
309 | " # On the other hand, increasing the number of filters will also increase the training time proportionally,\n",
310 |     "    # as there are that many more weights and biases to train and convolutions to perform.\n",
311 | " #\n",
312 | " # As an exercise, you can play around with different numbers of filters, strides, and kernel_sizes.\n",
313 | " # To avoid very long training time, make sure to keep kernel sizes small (under 5),\n",
314 | " # strides at least 2 but no larger than kernel sizes (or you will skip pixels),\n",
315 | " # and bound the number of filters at each level (no more than 512).\n",
316 | " #\n",
317 | " # When modifying these values, it is VERY important to keep track of the size of your layer outputs,\n",
318 | " # i.e. the number of hidden units, since the final layer will need to be flattened into a 1D vector with size\n",
319 |     "    # equal to the total number of hidden units. For this reason, using strides that evenly divide the width\n",
320 | " # and height of the input may be the easiest way to avoid miscalculations from rounding.\n",
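321 |     "    # Worked example (assuming a hypothetical 128x128 input with CNN_STRIDES=2 and CNN_FILTERS=16):\n",
322 |     "    #   layer_1 is 64x64x16, layer_2 is 32x32x32, layer_3 is 16x16x64, layer_4 is 8x8x128, layer_5 is 4x4x256,\n",
323 |     "    #   so the flattened vector below has 4*4*256 = 4096 hidden units = 16*(2**4) * (128*128) / (2**5) / (2**5).\n",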
321 | " layer_2 = tf.layers.conv2d(\n",
322 | " inputs=layer_1,\n",
323 | " kernel_size=CNN_KERNEL_SIZE,\n",
324 | " strides=CNN_STRIDES,\n",
325 | " filters=CNN_FILTERS * (2 ** 1), # Double the number of filters from previous layer\n",
326 | " padding='SAME',\n",
327 | " activation=tf.nn.relu)\n",
328 | " \n",
329 | " layer_3 = tf.layers.conv2d(\n",
330 | " inputs=layer_2,\n",
331 | " kernel_size=CNN_KERNEL_SIZE,\n",
332 | " strides=CNN_STRIDES,\n",
333 | " filters=CNN_FILTERS * (2 ** 2), # Double the number of filters from previous layer\n",
334 | " padding='SAME',\n",
335 | " activation=tf.nn.relu)\n",
336 | " \n",
337 | " layer_4 = tf.layers.conv2d(\n",
338 | " inputs=layer_3,\n",
339 | " kernel_size=CNN_KERNEL_SIZE,\n",
340 | " strides=CNN_STRIDES,\n",
341 | " filters=CNN_FILTERS * (2 ** 3), # Double the number of filters from previous layer\n",
342 | " padding='SAME',\n",
343 | " activation=tf.nn.relu)\n",
344 | " \n",
345 | " layer_5 = tf.layers.conv2d(\n",
346 | " inputs=layer_4,\n",
347 | " kernel_size=CNN_KERNEL_SIZE,\n",
348 | " strides=CNN_STRIDES,\n",
349 | " filters=CNN_FILTERS * (2 ** 4), # Double the number of filters from previous layer\n",
350 | " padding='SAME',\n",
351 | " activation=tf.nn.relu)\n",
352 | " \n",
353 | " layer_5_flat = tf.reshape( # Flattening to 2-d tensor (1-d per image row for feedforward fully-connected layer)\n",
354 | " layer_5, \n",
355 | " shape=[-1, # Reshape final layer to 1-d tensor per image.\n",
356 | " CNN_FILTERS * (2 ** 4) * # Number of filters (depth), times...\n",
357 | " pixels / (CNN_STRIDES ** 5) / (CNN_STRIDES ** 5)]) # Number of hidden units per filter (input pixels / width decimation / height decimation)\n",
358 | " \n",
359 |     "    dense_layer = tf.layers.dense( # fully connected layer\n",
360 | " inputs=layer_5_flat,\n",
361 | " units=FC_HIDDEN_UNITS, # number of hidden units\n",
362 | " activation=tf.nn.relu)\n",
363 | " \n",
364 |     "    dropout_layer = tf.layers.dropout( # Dropout randomly zeroes out dropout*100% of the dense layer's hidden units during training (scaling the rest up to compensate); it has no effect during prediction.\n",
365 | " inputs=dense_layer, \n",
366 | " rate=dropout,\n",
367 | " training=is_training) \n",
368 | "\n",
369 | " return tf.layers.dense(inputs=dropout_layer, units=NUM_CLASSES) # 2-d tensor: [logit(1-p), logit(p)] for each image in batch. "
370 | ]
371 | },
372 | {
373 | "cell_type": "markdown",
374 | "metadata": {},
375 | "source": [
376 | "## Model function\n",
377 | "\n",
378 | "The model function tells TensorFlow how to call the model we designed above and what to do when we're in training vs evaluation vs prediction mode. This is where we define the loss function, the optimizer, and the performance metric (which we picked in Step 1)."
379 | ]
380 | },
381 | {
382 | "cell_type": "code",
383 | "execution_count": 0,
384 | "metadata": {},
385 | "outputs": [],
386 | "source": [
387 | "# Model function:\n",
388 | "def generate_model_fn(dropout):\n",
389 | " \"\"\"Return a function that determines how TF estimator operates.\n",
390 | "\n",
391 | " The estimator has 3 modes of operation:\n",
392 | " * train (fitting and updating the model)\n",
393 | " * eval (collecting and returning validation metrics)\n",
394 | " * predict (using the model to label unlabeled images)\n",
395 | "\n",
396 | " The returned function _cnn_model_fn below determines what to do depending\n",
397 | " on the mode of operation, and returns specs telling the estimator what to\n",
398 | " execute for that mode.\n",
399 | "\n",
400 | " Args:\n",
401 | " dropout: regularization parameter in last layer (between 0 and 1, exclusive)\n",
402 | "\n",
403 | " Returns:\n",
404 | " _cnn_model_fn: a function that returns specs for use with TF estimator\n",
405 | " \"\"\"\n",
406 | "\n",
407 | " def _cnn_model_fn(features, labels, mode):\n",
408 | " \"\"\"A function that determines specs for the TF estimator based on mode of operation.\n",
409 | " \n",
410 | " Args: \n",
411 | " features: actual data (which goes into scope within estimator function) as 4-d tensor (of batch_size),\n",
412 | " pulled in via tf executing _input_fn(), which is the output to generate_input_fn() and is in memory\n",
413 | " labels: 1-d tensor of 0s and 1s\n",
414 | " mode: TF object indicating whether we're in train, eval, or predict mode.\n",
415 | " \n",
416 | " Returns:\n",
417 | " estim_specs: collections of metrics and tensors that are required for training (e.g. prediction values, loss value, train_op tells model weights how to update)\n",
418 | " \"\"\"\n",
419 | "\n",
420 | " # Use the cnn() to compute logits:\n",
421 | " logits_train = cnn(features, dropout, reuse=False, is_training=True)\n",
422 | " logits_eval = cnn(features, dropout, reuse=True, is_training=False)\n",
423 | " # We'll be evaluating these later.\n",
424 | "\n",
425 | " # Transform logits into predictions:\n",
426 | " pred_classes = tf.argmax(logits_eval, axis=1) # Returns 0 or 1, whichever has larger logit.\n",
427 |     "    pred_prob = tf.nn.softmax(logits=logits_eval)[:, 1] # Applies softmax to get each image's class probabilities, then keeps the probability of class 1.\n",
428 |     "    # Note: we're not outputting pred_prob in this tutorial, that line just shows you\n",
429 |     "    # how to get it if you want it. Softmax[i] = exp(logit[i]) / sum(exp(logit[:]))\n",
430 | "\n",
431 | " # If we're in prediction mode, early return predicted class (0 or 1):\n",
432 | " if mode == tf.estimator.ModeKeys.PREDICT:\n",
433 | " return tf.estimator.EstimatorSpec(mode, predictions=pred_classes)\n",
434 | "\n",
435 | " # If we're not in prediction mode, define loss function and optimizer.\n",
436 | "\n",
437 | " # Loss function:\n",
438 | " # This is what the algorithm minimizes to learn the weights.\n",
439 | " # tf.reduce_mean() just takes the mean over a batch, giving back a scalar.\n",
440 | " # Inside tf.reduce_mean() we'll select any valid binary loss function we want to use.\n",
441 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(\n",
442 | " logits=logits_train, labels=tf.cast(labels, dtype=tf.int32)))\n",
443 | "\n",
444 | " # Optimizer:\n",
445 | " # This is the scheme the algorithm uses to update the weights.\n",
446 |     "    # AdamOptimizer uses adaptive moment estimation; feel free to replace it with an optimizer you prefer.\n",
447 | " optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)\n",
448 | "\n",
449 | " # The minimize method below doesn't minimize anything, it just takes a step.\n",
450 | " train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())\n",
451 | "\n",
452 | " # Performance metric:\n",
453 | " # Should be whatever we chose as we defined in Step 1. This is what you said you care about!\n",
454 | " # This output is for reporting only, it is not optimized directly.\n",
455 | " acc = tf.metrics.accuracy(labels=labels, predictions=pred_classes)\n",
456 | "\n",
457 | " # Hooks - pick what to log and show:\n",
458 | " # Hooks are designed for monitoring; every time TF writes a summary, it'll append these.\n",
459 | " logging_hook = tf.train.LoggingTensorHook({\n",
460 | " 'x-entropy loss': loss,\n",
461 | " 'training accuracy': acc[0],\n",
462 | " }, every_n_secs=TRAINING_LOG_PERIOD_SECS)\n",
463 | "\n",
464 | " # Stitch everything together into the estimator specs, which we'll output here so it can\n",
465 | " # later be passed to tf.estimator.Estimator()\n",
466 | " estim_specs = tf.estimator.EstimatorSpec(\n",
467 | " mode=mode,\n",
468 | " predictions=pred_classes,\n",
469 | " loss=loss,\n",
470 | " train_op=train_op,\n",
471 | " training_hooks=[logging_hook],\n",
472 | " eval_metric_ops={\n",
473 | " 'accuracy': acc, # This line is Step 7!\n",
474 | " }\n",
475 | " )\n",
476 | "\n",
477 |     "    # TF estim_specs defines a huge dict that stores different metrics and operations for use by the TF Estimator.\n",
478 | " # This gives you the interaction between your architecture in cnn() and the weights, etc. in the current iteration which\n",
479 | " # will be used as input in the next iteration.\n",
480 | " return estim_specs\n",
481 | "\n",
482 | " return _cnn_model_fn"
483 | ]
484 | },
485 | {
486 | "cell_type": "markdown",
487 | "metadata": {},
488 | "source": [
489 | "# TF Estimator\n",
490 | "\n",
491 |     "This is where it all comes together: the TF Estimator takes as input everything we've created thus far, and when executed it produces everything necessary for training (fitting a model), evaluation (reporting metrics), or prediction (generating predictions)."
492 | ]
493 | },
494 | {
495 | "cell_type": "code",
496 | "execution_count": 0,
497 | "metadata": {},
498 | "outputs": [],
499 | "source": [
500 | "# TF Estimator:\n",
501 | "# WARNING: Don't run this block of code more than once without first changing OUTPUT_DIR.\n",
502 | "estimator = tf.estimator.Estimator(\n",
503 | " model_fn=generate_model_fn(DROPOUT), # Call our generate_model_fn to create model function\n",
504 |     "    model_dir=OUTPUT_DIR, # Where to look for existing checkpoints and where to write output.\n",
505 | " config=RunConfig(\n",
506 | " save_checkpoints_secs=CHECKPOINT_PERIOD_SECS,\n",
507 | " keep_checkpoint_max=20,\n",
508 | " save_summary_steps=100,\n",
509 | " log_step_count_steps=100)\n",
510 | ")"
511 | ]
512 | },
513 | {
514 | "cell_type": "markdown",
515 | "metadata": {},
516 | "source": [
517 | "# TF Experiment\n",
518 | "\n",
519 | "A TF Experiment defines **how** to run your TF estimator during training and debugging only. TF Experiments are not necessary for prediction once training is complete.\n",
520 | "\n",
521 | "*TERMINOLOGY WARNING:* The word \"experiment\" here is not used the way it is used by typical scientists and statisticians.\n"
522 | ]
523 | },
524 | {
525 | "cell_type": "code",
526 | "execution_count": 0,
527 | "metadata": {},
528 | "outputs": [],
529 | "source": [
530 | "# TF Experiment:\n",
531 | "def experiment_fn(output_dir):\n",
532 |     "  \"\"\"Return a TF Experiment for training and evaluation.\n",
533 | "\n",
534 | " To be used with learn_runner, which we imported from tf.\n",
535 | "\n",
536 | " Args: \n",
537 |     "    output_dir: directory where we write our models and checkpoints.\n",
538 | " Returns: \n",
539 | " a TF Experiment\n",
540 | " \"\"\"\n",
541 | "\n",
542 | " return Experiment(\n",
543 | " estimator=estimator, # What is the estimator?\n",
544 | " train_input_fn=generate_input_fn(TRAIN_DIR, TRAIN_BATCH_SIZE, QUEUE_CAP), # Generate input function designed above.\n",
545 | " eval_input_fn=generate_input_fn(DEBUG_DIR, DEBUG_BATCH_SIZE, QUEUE_CAP),\n",
546 | " train_steps=TRAIN_STEPS, # Number of batches to use for training.\n",
547 | " eval_steps=DEBUG_STEPS, # Number of batches to use for eval.\n",
548 | " min_eval_frequency=1, # Run eval once every min_eval_frequency number of checkpoints.\n",
549 | " local_eval_frequency=1\n",
550 | " )"
551 | ]
552 | },
553 | {
554 | "cell_type": "markdown",
555 | "metadata": {},
556 | "source": [
557 | "# Step 6 - Train a model!\n",
558 | "\n",
559 |     "Let's run our lovely creation on our training data. In order to train, we need learn_runner.run() from the learn_runner module we imported from TensorFlow above. For prediction, we will only need estimator.predict()."
560 | ]
561 | },
562 | {
563 | "cell_type": "code",
564 | "execution_count": 0,
565 | "metadata": {},
566 | "outputs": [],
567 | "source": [
568 | "# Enable TF verbose output:\n",
569 | "tf.logging.set_verbosity(tf.logging.INFO)\n",
570 | "start_time = datetime.datetime.now()\n",
571 | "print('It\\'s {:%H:%M} in London'.format(start_time) + ' --- Let\\'s get started!')\n",
572 | "# Let the learning commence! Run the TF Experiment here.\n",
573 | "learn_runner.run(experiment_fn, OUTPUT_DIR)\n",
574 | "# Output lines using the word \"Validation\" are giving our metric on the non-training dataset (from DEBUG_DIR).\n",
575 | "end_time = datetime.datetime.now()\n",
576 | "print('\\nIt was {:%H:%M} in London when we started.'.format(start_time))\n",
577 | "print('\\nWe\\'re finished and it\\'s {:%H:%M} in London'.format(end_time))\n",
578 | "print('\\nCongratulations! Training is complete!')"
579 | ]
580 | },
581 | {
582 | "cell_type": "code",
583 | "execution_count": 0,
584 | "metadata": {},
585 | "outputs": [],
586 | "source": [
587 | "print('\\nIt was {:%H:%M} in London when we started.'.format(start_time))\n",
588 | "print('\\nWe\\'re finished and it\\'s {:%H:%M} in London'.format(end_time))\n",
589 | "print('\\nCongratulations! Training is complete!')"
590 | ]
591 | },
592 | {
593 | "cell_type": "code",
594 | "execution_count": 0,
595 | "metadata": {},
596 | "outputs": [],
597 | "source": [
598 | "# Observed labels from filenames:\n",
599 | "def get_labels(dir):\n",
600 | " \"\"\"Get labels from filenames.\n",
601 | " \n",
602 | " Filenames must be in the following format: number_number_label.png\n",
603 | " \n",
604 | " Args:\n",
605 | " dir: directory containing image files\n",
606 | " \n",
607 | " Returns:\n",
608 | " labels: 1-d np.array of binary labels\n",
609 | " \"\"\"\n",
610 | " filelist = os.listdir(dir) # Use all the files in the directory\n",
611 | " labels = np.array([])\n",
612 | " for f in filelist:\n",
613 | " split_filename = f.split('_')\n",
614 | " label = int(split_filename[-1].split('.')[0])\n",
615 | " labels = np.append(labels, label)\n",
616 | " return labels"
617 | ]
618 | },
619 | {
620 | "cell_type": "code",
621 | "execution_count": 0,
622 | "metadata": {},
623 | "outputs": [],
624 | "source": [
625 | "# Cat_finder function for getting predictions:\n",
626 | "def cat_finder(dir, model_version):\n",
627 | " \"\"\"Get labels from model.\n",
628 | "\n",
629 | " Args:\n",
630 | " dir: directory containing image files\n",
631 | "\n",
632 | " Returns:\n",
633 | " predictions: 1-d np array of binary labels\n",
634 | " \"\"\"\n",
635 | "\n",
636 | " num_predictions = len(os.listdir(dir))\n",
637 | " predictions = [] # Initialize array.\n",
638 | "\n",
639 | " # Estimator.predict() returns a generator g. Call next(g) to retrieve the next value.\n",
640 | " prediction_gen = estimator.predict(\n",
641 | " input_fn=generate_input_fn(dir=dir,\n",
642 | " batch_size=TRAIN_STEPS,\n",
643 | " queue_capacity=QUEUE_CAP\n",
644 | " ),\n",
645 | " checkpoint_path=model_version\n",
646 | " )\n",
647 | "\n",
648 | " # Use generator to ensure ordering is preserved and predictions match order of validation_labels:\n",
649 | " i = 1\n",
650 | " for pred in range(0, num_predictions):\n",
651 |     "    predictions.append(next(prediction_gen))  # Append the next value of the generator to the prediction array\n",
652 | " i += 1\n",
653 | " if i % 1000 == 0:\n",
654 | " print('{:d} predictions completed (out of {:d})...'.format(i, len(os.listdir(dir))))\n",
655 | " print('{:d} predictions completed (out of {:d})...'.format(len(os.listdir(dir)), len(os.listdir(dir))))\n",
656 | " return np.array(predictions)"
657 | ]
658 | },
659 | {
660 | "cell_type": "markdown",
661 | "metadata": {},
662 | "source": [
663 | "## Get training accuracy"
664 | ]
665 | },
666 | {
667 | "cell_type": "code",
668 | "execution_count": 0,
669 | "metadata": {},
670 | "outputs": [],
671 | "source": [
672 |     "def get_accuracy(truth, predictions, threshold=0.5, roundoff=2):\n",
673 | " \"\"\"Compares labels with model predictions and returns accuracy.\n",
674 | " \n",
675 | " Args:\n",
676 | " truth: can be bool (False, True), int (0, 1), or float (0, 1)\n",
677 | " predictions: number between 0 and 1, inclusive\n",
678 | " threshold: we convert the predictions to 1s if they're above this value\n",
679 | " roundoff: report accuracy to how many decimal places?\n",
680 | " \n",
681 | " Returns:\n",
682 | " accuracy: number correct divided by total predictions\n",
683 | " \"\"\"\n",
684 | " truth = np.array(truth) == (1|True)\n",
685 | " predicted = np.array(predictions) >= threshold\n",
686 | " matches = sum(predicted == truth)\n",
687 | " accuracy = float(matches) / len(truth)\n",
688 | " return round(accuracy, roundoff)"
689 | ]
690 | },
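691 |   {
692 |    "cell_type": "code",
693 |    "execution_count": 0,
694 |    "metadata": {},
695 |    "outputs": [],
696 |    "source": [
697 |     "# Quick sanity check of get_accuracy() on made-up labels (not part of the real pipeline):\n",
698 |     "# three of the four predictions below match the truth at threshold 0.5, so we expect 0.75.\n",
699 |     "print(get_accuracy(truth=[0, 1, 1, 0], predictions=[0.1, 0.9, 0.4, 0.2]))"
700 |    ]
701 |   },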
691 | {
692 | "cell_type": "markdown",
693 | "metadata": {},
694 | "source": [
695 | "## Get predictions and performance metrics\n",
696 | "\n",
697 |     "Use the functions above to get observed labels, predicted labels, and accuracy for the training set. Filenames must be in the following format: number_number_label.extension"
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 0,
703 | "metadata": {},
704 | "outputs": [],
705 | "source": [
706 | "files = os.listdir(TRAIN_DIR)\n",
707 | "model_version = OUTPUT_DIR + 'model.ckpt-' + str(TRAIN_STEPS)\n",
708 | "observed = get_labels(TRAIN_DIR)\n",
709 | "predicted = cat_finder(TRAIN_DIR, model_version)"
710 | ]
711 | },
712 | {
713 | "cell_type": "code",
714 | "execution_count": 0,
715 | "metadata": {},
716 | "outputs": [],
717 | "source": [
718 | "print('Training accuracy is ' + str(get_accuracy(observed, predicted)))"
719 | ]
720 | },
721 | {
722 | "cell_type": "markdown",
723 | "metadata": {},
724 | "source": [
725 | "# Step 7 - Debugging and Tuning\n",
726 | "\n",
727 | "## Debugging\n",
728 | "\n",
729 | "It's worth taking a look to see if there's something special about the images we misclassified."
730 | ]
731 | },
732 | {
733 | "cell_type": "code",
734 | "execution_count": 0,
735 | "metadata": {},
736 | "outputs": [],
737 | "source": [
738 | "files = os.listdir(DEBUG_DIR)\n",
739 | "predicted = cat_finder(DEBUG_DIR, model_version)\n",
740 | "observed = get_labels(DEBUG_DIR)"
741 | ]
742 | },
743 | {
744 | "cell_type": "code",
745 | "execution_count": 0,
746 | "metadata": {},
747 | "outputs": [],
748 | "source": [
749 | "print('Debugging accuracy is ' + str(get_accuracy(observed, predicted)))"
750 | ]
751 | },
752 | {
753 | "cell_type": "code",
754 | "execution_count": 0,
755 | "metadata": {},
756 | "outputs": [],
757 | "source": [
758 | "df = pd.DataFrame({'files': files, 'predicted': predicted, 'observed': observed})\n",
759 | "hit = df.files[df.observed == df.predicted]\n",
760 | "miss = df.files[df.observed != df.predicted]"
761 | ]
762 | },
763 | {
764 | "cell_type": "code",
765 | "execution_count": 0,
766 | "metadata": {},
767 | "outputs": [],
768 | "source": [
769 | "# Show successful classifications:\n",
770 | "show_inputs(DEBUG_DIR, hit, 3)"
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": 0,
776 | "metadata": {},
777 | "outputs": [],
778 | "source": [
779 | "# Show unsuccessful classifications:\n",
780 | "show_inputs(DEBUG_DIR, miss, 3)"
781 | ]
782 | },
783 | {
784 | "cell_type": "markdown",
785 | "metadata": {},
786 | "source": [
787 | "# Step 8 - Validation\n",
788 | "\n",
789 | "Apply cat_finder() to the validation dataset. Since this is validation, we'll only look at the final performance metric (accuracy) and nothing else."
790 | ]
791 | },
792 | {
793 | "cell_type": "code",
794 | "execution_count": 0,
795 | "metadata": {},
796 | "outputs": [],
797 | "source": [
798 | "files = os.listdir(VALID_DIR)\n",
799 | "predicted = cat_finder(VALID_DIR, model_version)\n",
800 | "observed = get_labels(VALID_DIR)\n",
801 | "print('\\nValidation accuracy is ' + str(get_accuracy(observed, predicted)))"
802 | ]
803 | },
804 | {
805 | "cell_type": "markdown",
806 | "metadata": {},
807 | "source": [
808 | "# Step 9 - Statistical Testing\n",
809 | "\n",
810 | "Apply cat_finder() to the test dataset ONE TIME ONLY. Since this is testing, we'll only look at the final performance metric (accuracy) and the results of the statistical hypothesis test."
811 | ]
812 | },
813 | {
814 | "cell_type": "code",
815 | "execution_count": 0,
816 | "metadata": {},
817 | "outputs": [],
818 | "source": [
819 | "# Hypothesis test we'll use:\n",
820 | "from statsmodels.stats.proportion import proportions_ztest\n",
821 | "\n",
822 | "# Testing setup:\n",
823 | "SIGNIFICANCE_LEVEL = 0.05\n",
824 | "TARGET_ACCURACY = 0.80"
825 | ]
826 | },
827 | {
828 | "cell_type": "code",
829 | "execution_count": 0,
830 | "metadata": {},
831 | "outputs": [],
832 | "source": [
833 | "files = os.listdir(TEST_DIR)\n",
834 | "predicted = cat_finder(TEST_DIR, model_version)\n",
835 | "observed = get_labels(TEST_DIR)\n",
836 | "print('\\nTest accuracy is ' + str(get_accuracy(observed, predicted, roundoff=4)))"
837 | ]
838 | },
839 | {
840 | "cell_type": "code",
841 | "execution_count": 0,
842 | "metadata": {},
843 | "outputs": [],
844 | "source": [
845 | "# Using standard notation for a one-sided test of one population proportion:\n",
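846 | "# Null hypothesis: the true accuracy is at most TARGET_ACCURACY; the alternative ('larger') is that it exceeds TARGET_ACCURACY.\n",
847 | "# A small p-value therefore means the observed accuracy x/n is convincingly above the target.\n",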
846 | "n = len(predicted)\n",
847 | "x = round(get_accuracy(observed, predicted, roundoff=4) * n)\n",
848 | "p_value = proportions_ztest(count=x, nobs=n, value=TARGET_ACCURACY, alternative='larger')[1]\n",
849 | "if p_value < SIGNIFICANCE_LEVEL:\n",
850 | " print('Congratulations! Your model is good enough to build. It passes testing. Awesome!')\n",
851 | "else:\n",
852 | " print('Too bad. Better luck next project. To try again, you need a pristine test dataset.')"
853 | ]
854 | }
855 | ],
856 | "metadata": {
857 | "kernelspec": {
858 | "display_name": "Python 2",
859 | "language": "python",
860 | "name": "python2"
861 | }
862 | },
863 | "nbformat": 4,
864 | "nbformat_minor": 1
865 | }
866 |
--------------------------------------------------------------------------------
/cleanup_notebooks.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 |
17 | """Cleans up a notebook's metadata and outputs: for use by contributors.
18 |
19 | If you would like to contribute or modify notebooks, please run this script
20 | to ensure that all notebooks in the project have cell outputs erased, and any
21 | metadata from other notebook editors (e.g. colab) are removed.
22 |
23 | Run this script as follows:
24 |
25 | python cleanup_notebooks.py [project-base-directory]
26 |
27 | If already in the base directory, run:
28 |
29 | python cleanup_notebooks.py .
30 | """
31 |
32 | import json
33 | import os
34 | from glob import glob
35 |
36 | # Remove all but valid metadata keys listed here:
37 | # https://ipython.org/ipython-doc/3/notebook/nbformat.html#metadata
38 | VALID_CELL_METADATA_KEYS = ['collapsed', 'autoscroll', 'deletable',
39 | 'format', 'name', 'tags']
40 | VALID_METADATA_KEYS = ['kernelspec', 'signature']
41 | EXECUTION_COUNT_KEY = 'execution_count'
42 |
43 | # Get the script's current directory (project base directory)
44 | base_dir = os.path.dirname(os.path.realpath(__file__))
45 |
46 | # Recursively find all notebooks
47 | notebook_paths = [pattern for path in os.walk(base_dir)
48 | for pattern in glob(os.path.join(path[0], '*.ipynb'))]
49 | notebook_paths = [x for x in notebook_paths if '.ipynb_checkpoints' not in x]
50 | # Print notebooks to be touched
51 | print(notebook_paths)
52 |
53 | # Pull JSON from each notebook
54 | for notebook_path in notebook_paths:
55 | with open(notebook_path) as notebook_file:
56 | notebook_json = json.load(notebook_file)
57 |
58 | # Remove keys from cells
59 | cell_array = notebook_json['cells']
60 | for cell in cell_array:
61 | cell_metadata = cell['metadata']
62 |         for key in list(cell_metadata.keys()):  # Copy keys so we can pop from the dict while iterating.
63 | if key not in VALID_CELL_METADATA_KEYS:
64 | cell_metadata.pop(key)
65 |
66 | # Reset execution counts for all code cells to 0
67 | if cell['cell_type'] == 'code':
68 | cell[EXECUTION_COUNT_KEY] = 0
69 | elif EXECUTION_COUNT_KEY in cell:
70 | cell.pop(EXECUTION_COUNT_KEY)
71 |
72 | # Empty outputs for all cells
73 | if 'outputs' in cell:
74 | cell['outputs'] = []
75 |
76 |     # Remove keys from metadata
77 | metadata = notebook_json['metadata']
78 |     for key in list(metadata.keys()):  # Copy keys so we can pop from the dict while iterating.
79 | if key not in VALID_METADATA_KEYS:
80 | metadata.pop(key)
81 |
82 | clean_string = json.dumps(notebook_json, indent=1, sort_keys=True)
83 | clean_string = ''.join(line.rstrip() + '\n'
84 | for line in clean_string.splitlines())
85 |
86 | with open(notebook_path, 'w') as notebook_file:
87 | notebook_file.write(clean_string)
88 |
--------------------------------------------------------------------------------
/imgs/step_2b.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kozyrkov/deep-learning-tutorial/8814ef156284a427b52a30d7f1e315e5f2aaab73/imgs/step_2b.png
--------------------------------------------------------------------------------
/imgs/step_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kozyrkov/deep-learning-tutorial/8814ef156284a427b52a30d7f1e315e5f2aaab73/imgs/step_3.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | six==1.10.0
2 | tensorflow==1.3.0
3 | tensorflow-transform==0.1.10
4 | opencv-python==3.3.0.10
5 | opencv-contrib-python==3.3.0.10
6 | scikit-learn==0.19.0
7 | sklearn
8 | statsmodels==0.8.0
9 | pandas==0.20.3
10 | jupyter==1.0.0
11 | grpc-google-iam-v1==0.11.1
12 | google-cloud-storage==1.4.0
13 | google-compute-engine==2.7.0
14 | matplotlib==2.1.0
15 | seaborn==0.8.1
16 | scipy==0.19.1
17 |
--------------------------------------------------------------------------------
/setup_step_1_cloud_project.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -eu
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Set up network, compute engine VM, storage, and dataset for the sc17 project.
18 | #
19 | # To ensure correct setup, run this script in the default cloud shell
20 | # in the GCP web console (the ">_" icon at the top right).
21 |
22 | err() {
23 | echo "ERROR [$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
24 | exit 1
25 | }
26 |
27 | # Check argument length
28 | if [ "$#" -lt 1 ]; then
29 | echo "Usage: $0 project-name [gpu-type] [compute-region] [dataflow-region]"
30 | echo " project-name is the project name you specified during setup"
31 | echo " gpu-type: either K80, P100, or None"
32 | echo " (default: None)"
33 | echo " compute-region: where you will be running data science notebooks"
34 | echo " (default: us-east1)"
35 | echo " dataflow-region: where you will run dataflow preprocessing jobs"
36 | echo " (default: us-central1)"
37 | exit 1
38 | fi
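39 | # Example invocation (with a hypothetical project name; the last three arguments are optional):
40 | #   ./setup_step_1_cloud_project.sh my-sc17-project K80 us-east1 us-central1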
39 |
40 | PROJECT=$1
41 |
42 | # Parse optional arguments
43 | if [ "$#" -ge 3 ]; then
44 | COMPUTE_REGION=$3
45 | else
46 | COMPUTE_REGION=us-east1
47 | fi
48 |
49 | if [ "$#" -ge 4 ]; then
50 | DATAFLOW_REGION=$4
51 | else
52 | DATAFLOW_REGION=us-central1
53 | fi
54 |
55 | if [ "$#" -ge 2 ]; then
56 | GPU_TYPE=$2
57 | else
58 | GPU_TYPE=None
59 | fi
60 |
61 | #### Check quotas ####
62 |
63 | #######################################
64 | # Check quota and throw error if quota value is below the lower bound.
65 | #
66 | # The purpose of enforcing these quota bounds is to ensure a good user
67 | # experience, since without quota increases some jobs and notebooks may take
68 | # very long to run.
69 | #
70 | # Globals:
71 | # None
72 | # Arguments:
73 | # quota_json: the quota json to parse
74 | # metric_name: the quota metric name
75 | # lower_bound: lower bound on the quota value
76 | # Returns:
77 | # None
78 | #######################################
79 | check_quota() {
80 | METRIC=$(echo "$1" \
81 | | jq ".quotas | map(select(.metric ==\"$2\")) | .[0].limit")
82 | if [ "$METRIC" -lt $3 ]; then
83 | err "Quota $2 must be at least $3: value = $METRIC. Please increase quota!"
84 | fi
85 | }
86 |
87 | GLOBAL_QUOTA=$(gcloud --format json compute project-info \
88 | describe --project $PROJECT)
89 | DATAFLOW_REGION_QUOTA=$(gcloud --format json compute regions \
90 | describe $DATAFLOW_REGION)
91 |
92 | # The quota requirements here may be lower than the recommended in the README,
93 | # but this is mainly to ensure a minimum level of performance.
94 |
95 | echo "Checking quotas for $DATAFLOW_REGION..."
96 | check_quota "$DATAFLOW_REGION_QUOTA" IN_USE_ADDRESSES 50
97 | check_quota "$DATAFLOW_REGION_QUOTA" DISKS_TOTAL_GB 65536
98 | check_quota "$DATAFLOW_REGION_QUOTA" CPUS 200
99 |
100 | echo "Checking global quotas..."
101 | check_quota "$GLOBAL_QUOTA" CPUS_ALL_REGIONS 200
102 |
103 | # If a gpu type is requested, check that you have at least one available
104 | # in the requested region.
105 | if [ "$GPU_TYPE" != "None" ]; then
106 | echo "Checking quotas for $COMPUTE_REGION..."
107 | # Get cloud compute info for your region
108 | COMPUTE_INFO=$(gcloud compute regions describe $COMPUTE_REGION) || exit 1
109 | # Verify that you have gpus available to create a VM
110 | GPUS_QUOTA=$(echo "$COMPUTE_INFO" \
111 | | grep -B1 "metric: NVIDIA_${GPU_TYPE}" \
112 | | grep limit \
113 | | cut -d' ' -f3 \
114 | | cut -d'.' -f1)
115 | # Get gpus in use by parsing the output
116 | GPUS_IN_USE=$(echo "$COMPUTE_INFO" \
117 | | grep -A1 "metric: NVIDIA_${GPU_TYPE}" \
118 | | grep usage \
119 | | cut -d' ' -f4 \
120 | | cut -d'.' -f1)
121 |
122 | # Verify that you still have gpus available
123 | if [ "$GPUS_QUOTA" -gt "$GPUS_IN_USE" ]; then
124 | echo "GPUs are available for creating a new VM!"
125 | else
126 | err "Not enough GPUs available! Either increase quota or stop existing VMs
127 | with GPUs. Quota: $GPUS_QUOTA, In use: $GPUS_IN_USE"
128 | fi
129 | fi
130 |
131 | # Get a default zone for the vm based on region and GPU availabilities
132 | # See https://cloud.google.com/compute/docs/gpus/#introduction.
133 | if [ "$GPU_TYPE" == "K80" ]; then
134 | ACCELERATOR_ARG="--accelerator type=nvidia-tesla-k80,count=1"
135 |
136 | case "$COMPUTE_REGION" in
137 | "us-east1")
138 | ZONE="us-east1-d"
139 | ;;
140 | "us-west1")
141 | ZONE="us-west1-b"
142 | ;;
143 | "europe-west1")
144 | ZONE="europe-west1-b"
145 | ;;
146 | "asia-east1")
147 | ZONE="asia-east1-b"
148 | ;;
149 | *)
150 | err "Zone does not contain any gpus of type $GPU_TYPE"
151 | ;;
152 | esac
153 | elif [ "$GPU_TYPE" == "P100" ]; then
154 | ACCELERATOR_ARG="--accelerator type=nvidia-tesla-p100,count=1"
155 |
156 | case "$COMPUTE_REGION" in
157 | "us-east1")
158 | ZONE="us-east1-c"
159 | ;;
160 | "us-west1")
161 | ZONE="us-west1-b"
162 | ;;
163 | "europe-west1")
164 | ZONE="europe-west1-d"
165 | ;;
166 | "asia-east1")
167 | ZONE="asia-east1-a"
168 | ;;
169 | *)
170 | err "Zone does not contain any gpus of type $GPU_TYPE"
171 | ;;
172 | esac
173 | elif [ "$GPU_TYPE" != "None" ]; then
174 | err "Unknown GPU type $GPU_TYPE"
175 | else
176 | ACCELERATOR_ARG=""
177 | # Assign zone which is the region name + "-a",
178 | # except regions without such zones
179 | case "$COMPUTE_REGION" in
180 | "europe-west1")
181 | ZONE="europe-west1-b"
182 | ;;
183 | "us-east1")
184 | ZONE="us-east1-b"
185 | ;;
186 | *)
187 | ZONE="$COMPUTE_REGION-a"
188 | ;;
189 | esac
190 | fi
191 |
192 | echo "All quota requirements passed!"
193 | echo "Using zone $ZONE for data science VM creation"
194 |
195 |
196 | #### CLOUD PROJECT APIS ####
197 | # Enable google apis if they aren't already enabled
198 | ENABLED_SERVICES=$(gcloud services list --enabled)
199 | check_and_enable_service() {
200 | SERVICE_ENABLED=$(echo "$ENABLED_SERVICES" | grep $1) || true
201 | if [[ -z "$SERVICE_ENABLED" ]]; then
202 | gcloud services enable $1 || err "could not enable service $1"
203 | fi
204 | }
205 |
206 | echo "Checking and Enabling Google APIs..."
207 | check_and_enable_service dataflow.googleapis.com
208 | check_and_enable_service logging.googleapis.com
209 | check_and_enable_service cloudresourcemanager.googleapis.com
210 | echo "APIs enabled!"
211 |
212 |
213 | #### CREATE VM FOR DATA SCIENCE ####
214 | # Create a VM to use for supercomputing!
215 | # Note that notebooks primarily run on a single cpu core, so we do not need too
216 | # many cores. Tensorflow will be configured to use a gpu, which significantly
217 | # speeds up the running time of deep net training.
218 |
219 | if [ "$GPU_TYPE" != "None" ]; then
220 | STARTUP_SCRIPT='#!/bin/bash
221 | echo "Checking for CUDA and installing."
222 | # Check for CUDA and try to install.
223 | if ! dpkg-query -W cuda-8-0; then
224 | curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
225 | dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
226 | apt-get update
227 | apt-get install cuda-8-0 -y
228 | fi'
229 | else
230 | STARTUP_SCRIPT=""
231 | fi
232 |
233 | COMPUTE_INSTANCE=$PROJECT-compute-instance
234 | gcloud compute instances create $COMPUTE_INSTANCE \
235 | --zone $ZONE \
236 | --custom-cpu 2 \
237 | --custom-memory 13 \
238 | $ACCELERATOR_ARG \
239 | --image-family ubuntu-1604-lts \
240 | --image-project ubuntu-os-cloud \
241 | --boot-disk-type pd-standard \
242 | --boot-disk-size 256GB \
243 | --scopes cloud-platform \
244 | --network default \
245 | --maintenance-policy TERMINATE \
246 | --restart-on-failure \
247 | --metadata startup-script="'$STARTUP_SCRIPT'" \
248 | || err "Error starting up compute instance"
249 |
250 | echo "Compute instance created successfully! GPU=$GPU_TYPE"
251 |
252 | #### SETUP VM ENVIRONMENT FOR DATA SCIENCE ####
253 |
254 | # Create bashrc modification script
255 | BUCKET=$PROJECT-bucket
256 | echo "
257 | #!/bin/bash
258 | sed -i '/export PROJECT.*/d' ~/.bashrc
259 | echo \"export PROJECT=$PROJECT\" >> ~/.bashrc
260 | echo \"Added PROJECT=$PROJECT to .bashrc\"
261 | sed -i '/export BUCKET.*/d' ~/.bashrc
262 | echo \"export BUCKET=$BUCKET\" >> ~/.bashrc
263 | echo \"Added BUCKET=$BUCKET to .bashrc\"
264 | sed -i '/export DATAFLOW_REGION.*/d' ~/.bashrc
265 | echo \"export DATAFLOW_REGION=$DATAFLOW_REGION\" >> ~/.bashrc
266 | echo \"Added the following lines to .bashrc\"
267 | tail -n 3 ~/.bashrc
268 | " > env_vars.sh
269 |
270 | retry=0
271 | # Retry SSH-ing 5 times
272 | until [ $retry -ge 5 ]
273 | do
274 | echo "SSH retries: $retry"
275 |
276 | # Create a temporary firewall rule for ssh connections
277 | # Ignore error if rule is already created.
278 | gcloud compute firewall-rules create allow-ssh \
279 | --allow tcp:22 \
280 | --source-ranges 0.0.0.0/0 \
281 | || true
282 |
283 | # SSH and execute script
284 | gcloud compute config-ssh
285 | gcloud compute ssh --zone $ZONE $COMPUTE_INSTANCE -- 'bash -s' < ~/env_vars.sh \
286 | && break
287 | retry=$[$retry + 1]
288 |
289 | echo "SSH connection failed. Machine may still be booting up. Retrying in 30 seconds..."
290 | sleep 30
291 | done
292 |
293 | if [ $retry -ge 5 ]; then
294 | err "Unsuccessful setting up bashrc environment for $COMPUTE_INSTANCE"
295 | fi
296 |
297 | #### STORAGE ####
298 | # Create a storage bucket for the demo if it doesn't exist.
299 | echo "Creating a storage bucket and dataset..."
300 | gsutil mb -p $PROJECT -c regional -l $COMPUTE_REGION gs://$PROJECT-bucket/ \
301 | || true
302 |
303 | # Create a new dataset for storing tables in this project if it doesn't exist
304 | # If this is your first time making a dataset, you may be prompted to select
305 | # a default project.
306 | bq mk dataset || true
307 | echo "Done!"
308 |
--------------------------------------------------------------------------------
/setup_step_2_install_software.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash -eu
2 | #
3 | # Copyright 2017 Google LLC
4 | #
5 | # Licensed under the Apache License, Version 2.0 (the "License");
6 | # you may not use this file except in compliance with the License.
7 | # You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 | # Set up a VM environment for data science with Tensorflow GPU support.
18 | #
19 | # Note: RUN THIS ON A COMPUTE ENGINE VM, not in the cloud shell!
20 | #
21 |
22 | #### Exit helper function ####
23 | err() {
24 | echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $@" >&2
25 | exit 1
26 | }
27 |
28 | #### Bash variables ####
29 | # The directory of the shell script, e.g. ~/code in the lecture slides.
30 | BASE_DIR=$(dirname "$0")
31 | # The remote port for Jupyter traffic.
32 | PORT=5000
33 | # Check if machine has GPU support.
34 | LSPCI_OUTPUT=$(lspci -vnn | grep NVIDIA || true)  # '|| true' keeps 'set -e' from aborting on machines without a GPU.
35 |
36 | #### APT-GET library installations ####
37 |
38 | # Update apt-get
39 | sudo apt-get update -y
40 |
41 | # Install python and pip libraries
42 | sudo apt-get install -y \
43 | python-pip \
44 | python-dev \
45 | build-essential \
46 | || err 'failed to install python/pip libraries'
47 |
48 | # Install opencv c++ library dependencies
49 | sudo apt-get install -y \
50 | libsm6 \
51 | libxrender1 \
52 | libfontconfig1 \
53 | libxext6 \
54 | || err 'failed to install opencv dependencies'
55 |
56 | # If we are using a GPU machine, install cuda libraries
57 | if [ -n "$LSPCI_OUTPUT" ]; then
58 | # The 16.04 installer works with 16.10.
59 | sudo curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb \
60 |     || err 'failed to find cuda repo for ubuntu 16.04'
61 | sudo dpkg -i ./cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
62 | sudo apt-get update
63 | sudo apt-get install cuda-8-0 -y \
64 | || err 'failed to install cuda 8.0'
65 |
66 | # Check for available cuda 8.0 libraries
67 | CUDA_LIBRARIES_URL=http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/
68 |
69 | CUDA_8_LIBRARIES=$(curl $CUDA_LIBRARIES_URL \
70 | | grep libcudnn6 \
71 | | grep amd64.deb \
72 | | grep cuda8.0 \
73 | | sed "s/^.*href='\(.*\)'>.*$/\1/") \
74 | || err 'failed to find cuda 8 libraries'
75 |
76 | # Get latest runtime and developer libraries for cuda 8.0
77 | # Download and install
78 | CUDA_8_RUNTIME_LIBRARY=$(echo "$CUDA_8_LIBRARIES" | grep -v dev | tail -n 1)
79 | CUDA_8_DEV_LIBRARY=$(echo "$CUDA_8_LIBRARIES" | grep dev | tail -n 1)
80 |
81 | sudo curl -O $CUDA_LIBRARIES_URL$CUDA_8_RUNTIME_LIBRARY \
82 | || err 'failed to download cuda runtime library'
83 | sudo curl -O $CUDA_LIBRARIES_URL$CUDA_8_DEV_LIBRARY \
84 | || err 'failed to download cuda developer library'
85 | sudo dpkg -i $CUDA_8_RUNTIME_LIBRARY \
86 | || err 'failed to install cuda runtime libraries'
87 | sudo dpkg -i $CUDA_8_DEV_LIBRARY \
88 | || err 'failed to install cuda developer libraries'
89 |
90 | # Point TensorFlow at the correct library path
91 | # Export to .bashrc so env variable is set when entering VM shell
92 | # Remove existing line in bashrc if it already exists.
93 | sed -i '/export LD_LIBRARY_PATH.*\/usr\/local\/cuda-8.0/d' $HOME/.bashrc
94 | echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64' \
95 | >> $HOME/.bashrc
96 |
97 | # Install cuda profiler tools development library
98 | sudo apt-get install -y libcupti-dev \
99 | || err 'failed to install cuda profiler tools'
100 | fi
101 |
102 |
103 | #### Python Virtual Environment Setup ####
104 |
105 | # Upgrade pip and virtual env
106 | sudo pip install --upgrade pip
107 | sudo pip install --upgrade virtualenv
108 |
109 | # Create a virtual environment.
110 | virtualenv $HOME/env
111 |
112 | # Create a function to activate environment in shell script.
113 | activate () {
114 | . $HOME/env/bin/activate
115 | }
116 | activate || err 'failed to activate virtual env'
117 |
118 | # Save activate command to bashrc so logging into the vm immediately starts env
119 | # Remove any other commands in bashrc that attempt to start a virtualenv
120 | sed -i '/source.*\/bin\/activate/d' $HOME/.bashrc
121 | echo 'source $HOME/env/bin/activate' >> $HOME/.bashrc
122 |
123 | # Install requirements.txt
124 | pip install -r $BASE_DIR/requirements.txt \
125 | || err 'failed to pip install a required library'
126 |
127 | # If this is a GPU machine, install tensorflow-gpu
128 | if [ -n "$LSPCI_OUTPUT" ]; then
129 | pip install tensorflow-gpu==1.3.0
130 | fi
131 |
132 | #### JUPYTER SETUP ####
133 |
134 | # Switch into $BASE_DIR, e.g. ~/code
135 | cd $BASE_DIR
136 |
137 | # Create a config file for jupyter. Defaults to location ~/.jupyter/jupyter_notebook_config.py
138 | jupyter notebook --generate-config
139 |
140 | # Append several lines to the config file
141 | echo "c = get_config()" >> ~/.jupyter/jupyter_notebook_config.py
142 | echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py
143 | echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
144 | echo "c.NotebookApp.port = $PORT" >> ~/.jupyter/jupyter_notebook_config.py
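145 | # With this config, Jupyter will listen on port $PORT on the VM; you can reach it from a browser,
146 | # e.g. through an SSH tunnel or a firewall rule that allows that port.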
145 |
146 | # Create a password for jupyter. This is necessary for remote web logins.
147 | # The password will be hashed and written into ~/.jupyter/jupyter_notebook_config.json
148 | jupyter notebook password
149 |
150 | # Hacky way to parse the json to pick up the hashed password and add to config file
151 | PASSWORD_HASH=$(cat ~/.jupyter/jupyter_notebook_config.json | grep password | cut -d"\"" -f4)
152 | echo "c.NotebookApp.password = u'$PASSWORD_HASH'" >> ~/.jupyter/jupyter_notebook_config.py
153 | echo "Done with jupyter setup!"
154 |
155 | # Remind the user to reload .bashrc so the new environment variables and virtualenv take effect.
156 | echo 'Done with installation! Make sure to type: . ~/.bashrc to finish setup.'
157 |
158 | #### DONE ####
--------------------------------------------------------------------------------