3 |
4 | # Federal University of Rio Grande do Norte
5 | ## Technology Center
6 | ### Graduate Program in Electrical and Computer Engineering
7 | #### Department of Computer Engineering and Automation
8 | ##### EEC1509 Machine Learning
9 |
10 | #### References
11 |
12 | - :books: Aurélien Géron. Hands on Machine Learning with Scikit-Learn, Keras and TensorFlow. [[Link]](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
13 | - :books: François Chollet. Deep Learning with Python. [[Link]](https://www.manning.com/books/deep-learning-with-python-second-edition)
14 | - :books: Hannes Hapke, Catherine Nelson. Building Machine Learning Pipelines. [[Link]](https://www.oreilly.com/library/view/building-machine-learning/9781492053187/)
15 | - :books: Noah Gift, Alfredo Deza. Practical MLOps: Operationalizing Machine Learning Models [[Link]](https://www.oreilly.com/library/view/practical-mlops/9781098103002/)
16 | - :fist_right: Dataquest Academic Program [[Link]](https://www.dataquest.io/academic-program/)
17 |
18 | **Week 01**: Course Outline [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_01/outline.pdf)
19 | - Motivation, syllabus, calendar, and other issues.
20 |
21 | **Weeks 02, 03, 04** Machine Learning Fundamentals and Decision Trees [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_02/ml_fundamentals_and_decision_trees.pdf)
22 | - Outline [](https://www.loom.com/share/4979782637e34d37a0bb8551835a5a00)
23 | - What is Machine Learning (ML)? [](https://www.loom.com/share/098676fae4c2464788dd67ac1b419340)
24 | - ML types [](https://www.loom.com/share/4005e7ef95d4431db1bd266979a6789c)
25 | - Main challenges of ML
26 | - Variables, pipeline, and controlling chaos [](https://www.loom.com/share/f5456342c6b643799c1824362020fc5e)
27 | - Train, dev and test sets [](https://www.loom.com/share/954298d6f4c1433488239956b5d7007e)
28 | - Bias vs Variance [](https://www.loom.com/share/c496098013c84911a9ac353fec7e3131)
29 | - Decision Trees
30 | - Introduction [](https://www.loom.com/share/4f10b2436c1943f2aaa84d0f56c9e8c3)
31 | - Mathematical foundations [](https://www.loom.com/share/a215906eceda4b9cb655b226261bfb21)
32 | - Evaluation metrics
33 | - How to choose an evaluation metric? [](https://www.loom.com/share/3dd9bd6dcb844704ba9cd1e1b34932c3)
34 | - Threshold metrics [](https://www.loom.com/share/efc3248b6f8747a3ab86cd22cadde993)
35 | - Ranking metrics [](https://www.loom.com/share/1394db7fc27e4592af6f538c06cebbd1)
36 | - :rocket: Case Study [](https://github.com/ivanovitchm/ppgeecmachinelearning/tree/main/lessons/week_02/sources)
37 | - Google Colaboratory [](https://www.loom.com/share/8a4f0d34b3cb4d9ea04b6dcf0b3d1aca) [](https://www.loom.com/share/d96cb0af7d9c4416bfe8145c93248a11)
38 | - Setup of the environment [](https://www.loom.com/share/fea2d097fc7d4de89e53da259ece6d25)
39 | - Extract, Transform and Load (ETL)
40 | - Exploratory Data Analysis (EDA) [](https://www.loom.com/share/799b9712c6274f2fa547a3eb4cd230df)
41 | - Fetch Data [](https://www.loom.com/share/9861e9013ba940aba2c6dd1db5a00ebf)
42 | - EDA using Pandas-Profiling [](https://www.loom.com/share/cf19e023208946938d3f70e6e52018b4)
43 | - Manual EDA [](https://www.loom.com/share/9cec1f4d529a41dc90af19f23ef2082a)
44 | - Preprocessing [](https://www.loom.com/share/51a2972c8ffc4949891e9e249f9f48a3)
45 | - Data Check [](https://www.loom.com/share/f359ca8430b149309f6ac0b1d9c6e233)
46 | - Data Segregation [](https://loom.com/share/25a491791e104c1694b2bf5615fe2c26)
47 | - Train
48 | - Train and validation component [](https://www.loom.com/share/3b708c0820b64ef199178b63fc4ef395)
49 | - Data preparation and outlier removal [](https://www.loom.com/share/140068a18a5e4c8d83b807868ebdd011)
50 | - Encoding the target variable [](https://www.loom.com/share/b0edb4ccb28a4e1884a2f37637b58deb)
51 | - Encoding the independent variables manually [](https://www.loom.com/share/4adce083a32b4d3787fd50b59da4fdb5)
52 | - Using a full-pipeline to prepare categorical features [](https://www.loom.com/share/12de69ebeb744ebdbf2524b07773c7c2)
53 | - Using a full-pipeline to prepare numerical features [](https://www.loom.com/share/3b92e3fd78df42ebbbdce36dbce1707a)
54 | - Creating a full-preprocessing pipeline [](https://www.loom.com/share/6796f0129b1d4865aeb277e68461da80)
55 | - Holdout training [](https://www.loom.com/share/188a610fb09542b883b89cc962d6a823)
56 | - Evaluation metrics [](https://www.loom.com/share/4b2a9dd0ae44465b914974cf886390f9)
57 | - Hyperparameter tuning using Wandb [](https://www.loom.com/share/7e3a9d52709843bbb6026f816fa49d90)
58 | - Configure, train and export the best model [](https://www.loom.com/share/1c7a30cd4e90400daeb3916ee4006534)
59 | - Test [](https://www.loom.com/share/7725679b69a7426c927c317cb634dec3)
60 | - Dataquest Courses
61 | - Elements of the Command Line [](https://www.dataquest.io/course/command-line-elements/)
62 | - You'll learn how to: a) employ the command line for Data Science, b) modify the behavior of commands with options, c) employ glob patterns and wildcards, d) define important command line concepts, e) navigate the filesystem, f) manage users and permissions.
63 | - Functions: Advanced - Best practices for writing functions [](https://www.dataquest.io/course/python-advanced-functions/)
64 | - Command Line Intermediate [](https://www.dataquest.io/course/command-line-intermediate/)
65 | - Learn more about the command line and how to use it in your data analysis workflow. You'll learn how to: a) employ the Jupyter console and b) process data from the command line.
66 | - Git and Version Control [](https://www.dataquest.io/course/git-and-vcs/)
67 | - You'll learn how to: a) organize your code using version control, b) resolve conflicts in version control, c) employ Git and GitHub to collaborate with others.
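
The case study above builds a full preprocessing pipeline (separate numerical and categorical branches) and trains a decision tree with a holdout split. Below is a minimal scikit-learn sketch of that idea; it is not the course's notebook code, and the column names (`age`, `hours_per_week`, `workclass`, `sex`, `high_income`) are assumptions based on the census dataset used in the case study.

```python
# Minimal sketch of a full-preprocessing pipeline + holdout training (illustrative,
# not the course notebooks). Column names are assumptions based on the census data.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("raw_data.csv")                      # assumed local copy of the census data
X, y = df.drop(columns=["high_income"]), df["high_income"]

num_cols = ["age", "hours_per_week"]
cat_cols = ["workclass", "sex"]

full_pipeline = ColumnTransformer([
    ("num_pipeline", Pipeline([("imputer", SimpleImputer(strategy="median")),
                               ("scaler", StandardScaler())]), num_cols),
    ("cat_pipeline", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                               ("encoder", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])

model = Pipeline([("full_pipeline", full_pipeline),
                  ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=41))])

# Holdout training: keep part of the data aside for evaluation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  stratify=y, random_state=41)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_val, y_val))
```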
68 |
69 | **Weeks 05 and 06** Deploy a ML Pipeline in Production [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_05/deploy_ml.pdf)
70 | - :neckbeard: Hands on [](https://github.com/ivanovitchm/colab2mlops)
71 | - Outline [](https://www.loom.com/share/8bc6b17050e14db1b5a644b614b9863b)
72 | - Recap of the previous lesson and next steps [](https://www.loom.com/share/2497e73815354083a0299c376c6b1bb7)
73 | - Install essential tools to configure the dev environment [](https://www.loom.com/share/5147cf6180e146689fe976e1212dfd60)
74 | - Environment management system using conda [](https://www.loom.com/share/b03a14eddae543319071f483e1f73728)
75 | - Using FastAPI to Build Web APIs [](https://www.loom.com/share/7c4ccaa0de28422db02522dbad03bba7)
76 | - Hello world using fastapi [](https://www.loom.com/share/d54ee20891d74c70bd2c866c68fbe4f6)
77 | - Implementing a post method [](https://www.loom.com/share/8514c74f1f3443d3b7a82b8160f9d271)
78 | - Path and query parameter [](https://www.loom.com/share/7b6f2e4a2fc345019b5f5e0081aec490)
79 | - Local API testing [](https://www.loom.com/share/db74b4cc2294486480e2c31f05cbe3d5)
80 | - API deployment with FastAPI [](https://www.loom.com/share/bad07405f31e4625ba0a45a632b4f9d7)
81 | - Running and consuming our RESTful API [](https://www.loom.com/share/bd04680bd2ba4c41bf5e33bd18e6e9c7)
82 | - Using pytest and FastAPI to test our RESTful API [](https://www.loom.com/share/90ec0a6c964a4e669c05d7c3d3f54347)
83 | - Fundamentals of CI/CD [](https://www.loom.com/share/9759c65ddb9b486fb9068ff603dda38c)
84 | - Configure a GitHub action [](https://www.loom.com/share/6551d576e4b340f2a1d7849edd910109)
85 | - Workflow file configuration (Continuous Integration step) [](https://www.loom.com/share/b7d932f842f64ea4805feeb5c11d82ed)
86 | - Deliver your API with Heroku
87 | - Sign up Heroku and create a new app [](https://www.loom.com/share/f2eeba8220cc45b2984813786df0c7f4)
88 | - Install the Heroku CLI; configure credentials, the remote repository, buildpacks, dynos and the Procfile; and push to complete the CD step [](https://www.loom.com/share/3f4f6a31147148418fa5052a545740d4)
89 | - Debugging and querying the live API [](https://www.loom.com/share/60782b538d25411780a8c9c1c14249f6)
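
For quick reference, here is a hedged sketch of the FastAPI topics listed above (hello world, a POST method, path and query parameters, and local testing); the endpoint names and payloads are illustrative, not the hands-on repository's code.

```python
# Illustrative FastAPI app (not the hands-on repo's code): hello world, a POST
# method, path/query parameters, and local testing with TestClient.
from typing import Optional

from fastapi import FastAPI
from fastapi.testclient import TestClient
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    name: str
    value: float

@app.get("/")
def hello():
    return {"greeting": "Hello World!"}

@app.get("/items/{item_id}")
def read_item(item_id: int, q: Optional[str] = None):   # path + query parameter
    return {"item_id": item_id, "q": q}

@app.post("/items")
def create_item(item: Item):                             # body parsed and validated by pydantic
    return {"received": item.dict()}

# Local API testing (pytest would collect similar assertions from a test_*.py file).
client = TestClient(app)
assert client.get("/").json() == {"greeting": "Hello World!"}
assert client.post("/items", json={"name": "x", "value": 1.0}).status_code == 200
```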
90 |
91 | **Weeks 07, 08 and 09** Project 01
92 | - Create an end-to-end machine learning pipeline
93 | - From data fetching to deployment
94 | - Using: scikit-learn, wandb, FastAPI, GitHub Actions, Heroku, and notebooks
95 |
96 | **Weeks 10 and 11** Fundamentals of Deep Learning [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_10/Week%2010%20Introduction%20to%20Deep%20Learning%20and%20TensorFlow.pdf)
97 | - Outline [](https://www.loom.com/share/694778c1c318458589ad1990c1bb9614)
98 | - The perceptron [](https://www.loom.com/share/8d0ed35c632a4f0e805c103376974ec6)
99 | - Building Neural Networks [](https://www.loom.com/share/4ed93e63f36f468dad163bde0ed4102c)
100 | - Matrix Dimension [](https://www.loom.com/share/ac46a8425264456ea91f9644df3d992a)
101 | - Applying Neural Networks [](https://www.loom.com/share/24f247d0c8a74a48b3e481985fd843bd)
102 | - Training a Neural Network [](https://www.loom.com/share/c96bebdd16d9444e9c4adf23a4a93398)
103 | - Backpropagation with Pencil & Paper [](https://www.loom.com/share/3098f18d6fdc4a039d5e382357bebc82)
104 | - Learning rate & Batch Size [](https://www.loom.com/share/4271c1e07f294ec181a0b40b93f604b7)
105 | - Exponentially Weighted Average [](https://www.loom.com/share/eb2e3905d23742d98572d14120fb3f57)
106 | - Adam, Momentum, RMSProp, Learning Rate Decay [](https://www.loom.com/share/6b4b3a14b3044dfdbe76a5606bc8e513)
107 | - Hands on :fire:
108 | - TensorFlow Crash Course [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_10/Notebooks/Week%2010%20Task%2001%20-%20TensorFlow%202.x%20%2B%20Keras%20Crash%20Course.ipynb)
109 | - Better Learning - Part I [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_10/Notebooks/Week%2010%20Task%2002%20-%20Better%20Learning%20part%20I.ipynb)
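
As a compact reference for the topics above (a dense network, backpropagation-based training with a chosen learning rate and batch size, and the Adam optimizer), the sketch below uses synthetic data and is purely illustrative, not one of the course notebooks.

```python
# Minimal TensorFlow/Keras sketch (synthetic data, illustrative only): a small
# dense network trained with backpropagation, a fixed learning rate/batch size,
# and the Adam optimizer.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 8).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")          # toy binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output unit ("perceptron")
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)
```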
110 |
111 | **Weeks 12 and 13** Better Generalization vs Better Learning [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_12/Better%20Generalizaton%20vs%20Better%20Learning.pdf)
112 | - Outline [](https://www.loom.com/share/f2069bde275c439abf9b8f8b0c774aa3)
113 | - Better Generalization
114 | - Splitting Data [](https://www.loom.com/share/8c127749600f4061ac1e4b233ab459a7)
115 | - Bias vs Variance [](https://www.loom.com/share/890fe680846f4c748668325ee4760b57)
116 | - Weight Regularization [](https://www.loom.com/share/159cdf86ae7140c489c35ea3561cc571)
117 | - Weight Constraint [](https://www.loom.com/share/6b4f15f754b8466a8eb4d46136747390)
118 | - Dropout [](https://www.loom.com/share/bc58906ee39c4269b0d8ec6f67091a47)
119 | - Promote Robustness with Noise [](https://www.loom.com/share/099bed79389b43d4976d35e6de32dcfb)
120 | - Early Stopping [](https://www.loom.com/share/8995488b79434b5d88e887440aa8c953)
121 | - Hands on :eyes:
122 | - Better Generalization - Part I [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_12/Better%20Generalization%20I.ipynb)
123 | - Better Generalization - Part II [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_12/Better%20Generalization%20II.ipynb)
124 | - Better Learning II
125 | - Data scaling [](https://www.loom.com/share/09f19beaffd946b897d1c61f3bf27f02)
126 | - Vanishing/Exploding Gradient [](https://www.loom.com/share/68ed7dd6c5284652a8a81396bb8465ec)
127 | - Fix Vanishing Gradient with Relu [](https://www.loom.com/share/d7fadbbf54c040d69facea3ce035d8f2)
128 | - Fix Exploding Gradient with Gradient Clipping [](https://www.loom.com/share/9dc3598a295a4bcca2f9717ba8f041f5)
129 | - Hands on :mortar_board:
130 | - Better Learning - Part II [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_12/Better%20Learning%20II.ipynb)
131 | - Better Learning - Part III [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_12/Better%20Learning%20III.ipynb)
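
A hedged Keras sketch combining several of the techniques above (data scaling, weight regularization, dropout, gradient clipping and early stopping) follows; it uses synthetic data and is not one of the course notebooks.

```python
# Illustrative only: data scaling, L2 weight regularization, dropout,
# gradient clipping and early stopping in a single small Keras model.
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32") * 100.0
y = (X[:, 0] > 50.0).astype("float32")

X = (X - X.mean(axis=0)) / X.std(axis=0)              # data scaling

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.5),                      # promote better generalization
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0),
              loss="binary_crossentropy", metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(X, y, epochs=100, batch_size=32, validation_split=0.3,
          callbacks=[early_stop], verbose=0)
```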
132 |
133 | **Week 14** - Hyperparameter Tuning & Batch Normalization [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_14/Hyperparameter%20Tuning%20and%20Batch%20Normalization.pdf)
134 |
135 | - Outline [](https://www.loom.com/share/a2c5d4d632b54dbfb27af28efa2a3ad0)
136 | - Hyperparameter Tuning Fundamentals [](https://www.loom.com/share/1993b8e11c764c23a87ac79f2a377275)
137 | - Keras Tuner and Weights & Biases [](https://www.loom.com/share/162156aa0c1f4e6c95279320cc7be0d7)
138 | - Wandb - Part 01 [](https://www.loom.com/share/0db14b6c327c46d292c07e9da85274ae)
139 | - Wandb - Part 02 [](https://www.loom.com/share/ae3b1ef0b4d542c39c03158ed75dc280)
140 | - Batch Normalization Fundamentals [](https://www.loom.com/share/1217ecc68b904391bf8cb4ab4edc0f7f)
141 | - Batch Normalization Math Details [](https://www.loom.com/share/c031045602fc4d0b8aa2043042f19b81)
142 | - Batch Normalization Case Study [](https://www.loom.com/share/b05d320e561b43658be90a812537c87c)
143 | - Hands on :bell:
144 | - Hyperparameter tuning using Keras Tuner [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_14/Task%20%2301%20Hyperparameter%20Tuning%20using%20Keras%20Tuner.ipynb)
145 | - Hyperparameter tuning using Weights & Biases [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_14/Task%20%2302%20Hyperparameter%20Tuning%20using%20Weights%20and%20Biases.ipynb)
146 | - Batch Normalization [](https://github.com/ivanovitchm/ppgeecmachinelearning/blob/main/lessons/week_14/Task%20%2303%20Batch%20Normalization.ipynb)
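
To connect the two themes of the week, the sketch below shows a hypothetical Keras Tuner random search over a small model that includes a Batch Normalization layer; the data and search space are illustrative, not the hands-on notebooks.

```python
# Hypothetical example: Keras Tuner random search over a model with BatchNormalization.
import numpy as np
import tensorflow as tf
import keras_tuner as kt

X = np.random.rand(500, 10).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        tf.keras.layers.BatchNormalization(),          # normalize activations batch-wise
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy",
                        max_trials=5, overwrite=True, directory="kt_demo")
tuner.search(X, y, epochs=10, validation_split=0.3, verbose=0)
best_model = tuner.get_best_models(num_models=1)[0]
```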
147 |
--------------------------------------------------------------------------------
/images/ct.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/images/ct.jpeg
--------------------------------------------------------------------------------
/images/gender_race.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/images/gender_race.png
--------------------------------------------------------------------------------
/images/gender_workclass.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/images/gender_workclass.png
--------------------------------------------------------------------------------
/images/workflow.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/images/workflow.png
--------------------------------------------------------------------------------
/lessons/week_01/outline.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/lessons/week_01/outline.pdf
--------------------------------------------------------------------------------
/lessons/week_02/ml_fundamentals_and_decision_trees.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/lessons/week_02/ml_fundamentals_and_decision_trees.pdf
--------------------------------------------------------------------------------
/lessons/week_02/sources/README.md:
--------------------------------------------------------------------------------
1 | # Model Card
2 |
3 | Model cards are a succinct approach for documenting the creation, use, and shortcomings of a model. The idea is to write documentation such that a non-expert can understand the model card's contents. For additional information, see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf
4 |
5 | ## Model Details
6 | Ivanovitch Silva created the model. A complete data pipeline was built using Google Colab, Scikit-Learn and Weights & Biases to train a Decision Tree model. The big picture of the data pipeline is shown below:
7 |
8 |
9 |
10 | For the sake of understanding, a simple hyperparameter tuning was conducted using a Wandb random sweep, and the hyperparameter values adopted for training were:
11 |
12 | - full_pipeline__num_pipeline__num_transformer__model: 2
13 | - classifier__criterion: 'entropy'
14 | - classifier__splitter: 'best'
15 | - classifier__random_state: 41
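
For illustration only, sweep values like the ones above could be applied to a scikit-learn pipeline through ``set_params``; the step names below are assumptions inferred from the parameter keys, not the project's exact code.

```python
# Hypothetical sketch: applying swept hyperparameters via set_params.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("full_pipeline", StandardScaler()),          # placeholder for the real preprocessing step
    ("classifier", DecisionTreeClassifier()),
])

pipe.set_params(classifier__criterion="entropy",
                classifier__splitter="best",
                classifier__random_state=41)
```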
16 |
17 | ## Intended Use
18 | This model is used as a proof of concept for the evaluation of an entire data pipeline incorporating Machine Learning fundamentals. The data pipeline is composed of the following stages: a) ``fetch data``, b) ``eda``, c) ``preprocess``, d) ``check data``, e) ``segregate``, f) ``train`` and g) ``test``.
19 |
20 | ## Training Data
21 |
22 | The dataset used in this project is based on individual income in the United States. The *data* is from the *1994 census*, and contains information on an individual's ``marital status, age, type of work, and more``. The target column, or what we want to predict, is whether individuals make *less than or equal to 50k a year*, or *more than 50k a year*.
23 |
24 | You can download the data from the University of California, Irvine's [website](http://archive.ics.uci.edu/ml/datasets/Adult).
25 |
26 | After the EDA stage of the data pipeline, it was noted that the training data is imbalanced when considering the target variable and some features (``sex``, ``race`` and ``workclass``).
27 |
28 |
29 |
30 |
31 | ## Evaluation Data
32 | The dataset under study is split into Train and Test during the ``Segregate`` stage of the data pipeline. 70% of the clean data is used to Train and the remaining 30% to Test. Additionally, 30% of the Train data is used for validation purposes (hyperparameter-tuning).
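
A minimal sketch of these proportions (70/30 train/test, then 30% of the train split reserved for validation) is shown below, assuming the clean data has already been downloaded as ``preprocessed_data.csv``.

```python
# Sketch of the split proportions described above (stratified by the target).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("preprocessed_data.csv")            # clean data artifact (assumed local copy)

train_df, test_df = train_test_split(df, test_size=0.30, random_state=41,
                                     stratify=df["high_income"])
train_df, val_df = train_test_split(train_df, test_size=0.30, random_state=41,
                                    stratify=train_df["high_income"])
```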
33 |
34 | ## Metrics
35 | In order to follow the performance of machine learning experiments, the project marked certain stage outputs of the data pipeline as metrics. The metrics adopted are: [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), [f1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score), [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score), [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score).
36 |
37 | To calculate the evaluation metrics, it is only necessary to run:
38 |
39 | The following results will be shown:
40 |
41 | **Stage [Run]** | **Accuracy** | **F1** | **Precision** | **Recall** |
42 | ---------------------------------|--------------|--------|---------------|------------|
43 | Train [distinctive-sweep-7](https://wandb.ai/ivanovitchm/decision_tree/runs/f40ujfaq/overview?workspace=user-ivanovitchm) | 0.8109 | 0.6075 | 0.6075 | 0.6075 |
44 | Test [crips-resonance-11](https://wandb.ai/ivanovitchm/decision_tree/runs/1wg7ibyy/overview?workspace=user-ivanovitchm) | 0.8019 | 0.5899 | 0.5884 | 0.5914 |
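
As a hedged illustration (not the project's actual test notebook), metrics like the ones above can be computed with scikit-learn as follows; the labels and the choice of positive class are toy assumptions that mirror the dataset's class names.

```python
# Toy illustration of the adopted metrics; in the pipeline, y_true/y_pred come
# from the Test stage of the trained model.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [" <=50K", " >50K", " <=50K", " >50K"]
y_pred = [" <=50K", " <=50K", " <=50K", " >50K"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred, pos_label=" >50K"))
print("Precision:", precision_score(y_true, y_pred, pos_label=" >50K"))
print("Recall   :", recall_score(y_true, y_pred, pos_label=" >50K"))
```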
45 |
46 |
47 | ## Ethical Considerations
48 |
49 | We may be tempted to claim that this dataset contains the only attributes capable of predicting someone's income. However, we know that is not true, and we will need to deal with the class imbalances somehow.
50 |
51 | ## Caveats and Recommendations
52 | It should be noted that the model trained in this project was used only for validation of a complete data pipeline. It is noteworthy that some important issues related to dataset imbalance exist, and adequate techniques need to be adopted in order to balance the data.
--------------------------------------------------------------------------------
/lessons/week_02/sources/data_check.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "data_check.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "0I4pgzLVtBTP"
23 | },
24 | "source": [
25 | "# 1.0 An end-to-end classification problem (Data Check)\n",
26 | "\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "id": "Dh34gim6KPtT"
33 | },
34 | "source": [
35 | "## 1.1 Dataset description"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {
41 | "id": "iE8OJoDZ5AFK"
42 | },
43 | "source": [
44 | "We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n",
45 | "\n",
46 | "You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n",
47 | "\n",
48 | "Let's take the following steps:\n",
49 | "\n",
50 | "1. ETL (done!!!)\n",
51 | "4. Data Checks\n",
52 | "\n",
53 | "
"
54 | ]
55 | },
56 | {
57 | "cell_type": "markdown",
58 | "metadata": {
59 | "id": "7UpxKxU1Ej7f"
60 | },
61 | "source": [
62 | "## 1.2 Install, load libraries and setup wandb"
63 | ]
64 | },
65 | {
66 | "cell_type": "code",
67 | "execution_count": null,
68 | "metadata": {
69 | "id": "t82KewAPWCYe"
70 | },
71 | "outputs": [],
72 | "source": [
73 | "!pip install wandb"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "source": [
79 | "!pip install pytest pytest-sugar"
80 | ],
81 | "metadata": {
82 | "id": "IumW4s8Sh9i_"
83 | },
84 | "execution_count": null,
85 | "outputs": []
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "metadata": {
91 | "id": "LASaVZuhRJlL"
92 | },
93 | "outputs": [],
94 | "source": [
95 | "import wandb"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "source": [
101 | "# Login to Weights & Biases\n",
102 | "!wandb login --relogin"
103 | ],
104 | "metadata": {
105 | "id": "QZXcN54GkP25"
106 | },
107 | "execution_count": null,
108 | "outputs": []
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "source": [
113 | "## 1.2 Pytest\n"
114 | ],
115 | "metadata": {
116 | "id": "MPjpyeU7d37l"
117 | }
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {
122 | "id": "WB1QY4sbPs4-"
123 | },
124 | "source": [
125 | "### 1.2.1 How pytest discovers tests\n"
126 | ]
127 | },
128 | {
129 | "cell_type": "markdown",
130 | "source": [
131 | "\n",
132 | "pytests uses the following [conventions](https://docs.pytest.org/en/latest/goodpractices.html#conventions-for-python-test-discovery) to automatically discovering tests:\n",
133 | " 1. files with tests should be called `test_*.py` or `*_test.py `\n",
134 | " 2. test function name should start with `test_`\n",
135 | "\n",
136 | "\n"
137 | ],
138 | "metadata": {
139 | "id": "gXlqh21wiW6P"
140 | }
141 | },
142 | {
143 | "cell_type": "markdown",
144 | "source": [
145 | "### 1.2.2 Fixture"
146 | ],
147 | "metadata": {
148 | "id": "JtTD3oxoiYF6"
149 | }
150 | },
151 | {
152 | "cell_type": "markdown",
153 | "source": [
154 | "\n",
155 | "An important aspect when using ``pytest`` is understanding the fixture's scope works. \n",
156 | "\n",
157 | "The scope of the fixture can have a few legal values, described [here](https://docs.pytest.org/en/6.2.x/fixture.html#fixture-scopes). We are going to consider only **session** and **function**: with the former, the fixture is executed only once in a pytest session and the value it returns is used for all the tests that need it; with the latter, every test function gets a fresh copy of the data. This is useful if the tests modify the input in a way that make the other tests fail, for example."
158 | ],
159 | "metadata": {
160 | "id": "mwNz2mgMevJJ"
161 | }
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {
166 | "id": "uJjWla1qxd3i"
167 | },
168 | "source": [
169 | "### 1.2.3 Create and run a test file\n"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "metadata": {
175 | "id": "9acpZigRVANF"
176 | },
177 | "source": [
178 | "%%file test_data.py\n",
179 | "import pytest\n",
180 | "import wandb\n",
181 | "import pandas as pd\n",
182 | "\n",
183 | "# This is global so all tests are collected under the same run\n",
184 | "run = wandb.init(project=\"decision_tree\", job_type=\"data_checks\")\n",
185 | "\n",
186 | "@pytest.fixture(scope=\"session\")\n",
187 | "def data():\n",
188 | "\n",
189 | " local_path = run.use_artifact(\"decision_tree/preprocessed_data.csv:latest\").file()\n",
190 | " df = pd.read_csv(local_path)\n",
191 | "\n",
192 | " return df\n",
193 | "\n",
194 | "def test_data_length(data):\n",
195 | " \"\"\"\n",
196 | " We test that we have enough data to continue\n",
197 | " \"\"\"\n",
198 | " assert len(data) > 1000\n",
199 | "\n",
200 | "\n",
201 | "def test_number_of_columns(data):\n",
202 | " \"\"\"\n",
203 | " We test that we have enough data to continue\n",
204 | " \"\"\"\n",
205 | " assert data.shape[1] == 15\n",
206 | "\n",
207 | "def test_column_presence_and_type(data):\n",
208 | "\n",
209 | " required_columns = {\n",
210 | " \"age\": pd.api.types.is_int64_dtype,\n",
211 | " \"workclass\": pd.api.types.is_object_dtype,\n",
212 | " \"fnlwgt\": pd.api.types.is_int64_dtype,\n",
213 | " \"education\": pd.api.types.is_object_dtype,\n",
214 | " \"education_num\": pd.api.types.is_int64_dtype,\n",
215 | " \"marital_status\": pd.api.types.is_object_dtype,\n",
216 | " \"occupation\": pd.api.types.is_object_dtype,\n",
217 | " \"relationship\": pd.api.types.is_object_dtype,\n",
218 | " \"race\": pd.api.types.is_object_dtype,\n",
219 | " \"sex\": pd.api.types.is_object_dtype,\n",
220 | " \"capital_gain\": pd.api.types.is_int64_dtype,\n",
221 | " \"capital_loss\": pd.api.types.is_int64_dtype, \n",
222 | " \"hours_per_week\": pd.api.types.is_int64_dtype,\n",
223 | " \"native_country\": pd.api.types.is_object_dtype,\n",
224 | " \"high_income\": pd.api.types.is_object_dtype\n",
225 | " }\n",
226 | "\n",
227 | " # Check column presence\n",
228 | " assert set(data.columns.values).issuperset(set(required_columns.keys()))\n",
229 | "\n",
230 | " for col_name, format_verification_funct in required_columns.items():\n",
231 | "\n",
232 | " assert format_verification_funct(data[col_name]), f\"Column {col_name} failed test {format_verification_funct}\"\n",
233 | "\n",
234 | "\n",
235 | "def test_class_names(data):\n",
236 | "\n",
237 | " # Check that only the known classes are present\n",
238 | " known_classes = [\n",
239 | " \" <=50K\",\n",
240 | " \" >50K\"\n",
241 | " ]\n",
242 | "\n",
243 | " assert data[\"high_income\"].isin(known_classes).all()\n",
244 | "\n",
245 | "\n",
246 | "def test_column_ranges(data):\n",
247 | "\n",
248 | " ranges = {\n",
249 | " \"age\": (17, 90),\n",
250 | " \"fnlwgt\": (1.228500e+04, 1.484705e+06),\n",
251 | " \"education_num\": (1, 16),\n",
252 | " \"capital_gain\": (0, 99999),\n",
253 | " \"capital_loss\": (0, 4356),\n",
254 | " \"hours_per_week\": (1, 99)\n",
255 | " }\n",
256 | "\n",
257 | " for col_name, (minimum, maximum) in ranges.items():\n",
258 | "\n",
259 | " assert data[col_name].dropna().between(minimum, maximum).all(), (\n",
260 | " f\"Column {col_name} failed the test. Should be between {minimum} and {maximum}, \"\n",
261 | " f\"instead min={data[col_name].min()} and max={data[col_name].max()}\"\n",
262 | " )"
263 | ],
264 | "execution_count": null,
265 | "outputs": []
266 | },
267 | {
268 | "cell_type": "markdown",
269 | "metadata": {
270 | "id": "XTBnZ3-vVe0p"
271 | },
272 | "source": [
273 | "Now lets run pytest"
274 | ]
275 | },
276 | {
277 | "cell_type": "code",
278 | "metadata": {
279 | "id": "DXBQkMc8VeD8"
280 | },
281 | "source": [
282 | "!pytest . -vv"
283 | ],
284 | "execution_count": null,
285 | "outputs": []
286 | },
287 | {
288 | "cell_type": "code",
289 | "source": [
290 | "# close the run\n",
291 | "# waiting a while after run the previous cell before execute this\n",
292 | "run.finish()"
293 | ],
294 | "metadata": {
295 | "id": "5284u1A7euMF"
296 | },
297 | "execution_count": null,
298 | "outputs": []
299 | },
300 | {
301 | "cell_type": "code",
302 | "source": [
303 | ""
304 | ],
305 | "metadata": {
306 | "id": "FpvJLXccv8Kz"
307 | },
308 | "execution_count": null,
309 | "outputs": []
310 | }
311 | ]
312 | }
--------------------------------------------------------------------------------
/lessons/week_02/sources/data_segregation.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "data_segregation.ipynb",
7 | "provenance": [],
8 | "collapsed_sections": []
9 | },
10 | "kernelspec": {
11 | "name": "python3",
12 | "display_name": "Python 3"
13 | },
14 | "language_info": {
15 | "name": "python"
16 | }
17 | },
18 | "cells": [
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {
22 | "id": "0I4pgzLVtBTP"
23 | },
24 | "source": [
25 | "# 1.0 An end-to-end classification problem (Data Segregation)\n",
26 | "\n"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {
32 | "id": "Dh34gim6KPtT"
33 | },
34 | "source": [
35 | "## 1.1 Dataset description"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {
41 | "id": "iE8OJoDZ5AFK"
42 | },
43 | "source": [
44 | "We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n",
45 | "\n",
46 | "You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n",
47 | "\n",
48 | "Let's take the following steps:\n",
49 | "\n",
50 | "1. ETL (done)\n",
51 | "2. Data Checks (done)\n",
52 | "3. Data Segregation\n",
53 | "\n",
54 | "
"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {
60 | "id": "7UpxKxU1Ej7f"
61 | },
62 | "source": [
63 | "## 1.2 Install, load libraries and setup wandb"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {
70 | "id": "t82KewAPWCYe"
71 | },
72 | "outputs": [],
73 | "source": [
74 | "!pip install wandb"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {
81 | "id": "LASaVZuhRJlL"
82 | },
83 | "outputs": [],
84 | "source": [
85 | "import logging\n",
86 | "import tempfile\n",
87 | "import pandas as pd\n",
88 | "import os\n",
89 | "import wandb\n",
90 | "from sklearn.model_selection import train_test_split"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "source": [
96 | "# Login to Weights & Biases\n",
97 | "!wandb login --relogin"
98 | ],
99 | "metadata": {
100 | "id": "QZXcN54GkP25"
101 | },
102 | "execution_count": null,
103 | "outputs": []
104 | },
105 | {
106 | "cell_type": "markdown",
107 | "source": [
108 | "## 1.3 Data Segregation"
109 | ],
110 | "metadata": {
111 | "id": "sa1PvzZYF4J2"
112 | }
113 | },
114 | {
115 | "cell_type": "code",
116 | "source": [
117 | "# global variables\n",
118 | "\n",
119 | "# ratio used to split train and test data\n",
120 | "test_size = 0.30\n",
121 | "\n",
122 | "# seed used to reproduce purposes\n",
123 | "seed = 41\n",
124 | "\n",
125 | "# reference (column) to stratify the data\n",
126 | "stratify = \"high_income\"\n",
127 | "\n",
128 | "# name of the input artifact\n",
129 | "artifact_input_name = \"decision_tree/preprocessed_data.csv:latest\"\n",
130 | "\n",
131 | "# type of the artifact\n",
132 | "artifact_type = \"segregated_data\""
133 | ],
134 | "metadata": {
135 | "id": "6Box9NiNKCgM"
136 | },
137 | "execution_count": null,
138 | "outputs": []
139 | },
140 | {
141 | "cell_type": "code",
142 | "source": [
143 | "# configure logging\n",
144 | "logging.basicConfig(level=logging.INFO,\n",
145 | " format=\"%(asctime)s %(message)s\",\n",
146 | " datefmt='%d-%m-%Y %H:%M:%S')\n",
147 | "\n",
148 | "# reference for a logging obj\n",
149 | "logger = logging.getLogger()\n",
150 | "\n",
151 | "# initiate wandb project\n",
152 | "run = wandb.init(project=\"decision_tree\", job_type=\"split_data\")\n",
153 | "\n",
154 | "logger.info(\"Downloading and reading artifact\")\n",
155 | "artifact = run.use_artifact(artifact_input_name)\n",
156 | "artifact_path = artifact.file()\n",
157 | "df = pd.read_csv(artifact_path)\n",
158 | "\n",
159 | "# Split firstly in train/test, then we further divide the dataset to train and validation\n",
160 | "logger.info(\"Splitting data into train and test\")\n",
161 | "splits = {}\n",
162 | "\n",
163 | "splits[\"train\"], splits[\"test\"] = train_test_split(df,\n",
164 | " test_size=test_size,\n",
165 | " random_state=seed,\n",
166 | " stratify=df[stratify])\n",
167 | "\n",
168 | "# Save the artifacts. We use a temporary directory so we do not leave any trace behind\n",
169 | "with tempfile.TemporaryDirectory() as tmp_dir:\n",
170 | "\n",
171 | " for split, df in splits.items():\n",
172 | "\n",
173 | " # Make the artifact name from the name of the split plus the provided root\n",
174 | " artifact_name = f\"{split}.csv\"\n",
175 | "\n",
176 | " # Get the path on disk within the temp directory\n",
177 | " temp_path = os.path.join(tmp_dir, artifact_name)\n",
178 | "\n",
179 | " logger.info(f\"Uploading the {split} dataset to {artifact_name}\")\n",
180 | "\n",
181 | " # Save then upload to W&B\n",
182 | " df.to_csv(temp_path,index=False)\n",
183 | "\n",
184 | " artifact = wandb.Artifact(name=artifact_name,\n",
185 | " type=artifact_type,\n",
186 | " description=f\"{split} split of dataset {artifact_input_name}\",\n",
187 | " )\n",
188 | " artifact.add_file(temp_path)\n",
189 | "\n",
190 | " logger.info(\"Logging artifact\")\n",
191 | " run.log_artifact(artifact)\n",
192 | "\n",
193 | " # This waits for the artifact to be uploaded to W&B. If you\n",
194 | " # do not add this, the temp directory might be removed before\n",
195 | " # W&B had a chance to upload the datasets, and the upload\n",
196 | " # might fail\n",
197 | " artifact.wait()"
198 | ],
199 | "metadata": {
200 | "id": "4tha7oPLF58G"
201 | },
202 | "execution_count": null,
203 | "outputs": []
204 | },
205 | {
206 | "cell_type": "code",
207 | "source": [
208 | "# close the run\n",
209 | "# waiting a while after run the previous cell before execute this\n",
210 | "run.finish()"
211 | ],
212 | "metadata": {
213 | "id": "s1IcDCzUO57y"
214 | },
215 | "execution_count": null,
216 | "outputs": []
217 | },
218 | {
219 | "cell_type": "code",
220 | "source": [
221 | ""
222 | ],
223 | "metadata": {
224 | "id": "s-decyPSPZ_d"
225 | },
226 | "execution_count": null,
227 | "outputs": []
228 | }
229 | ]
230 | }
--------------------------------------------------------------------------------
/lessons/week_02/sources/eda.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","metadata":{"id":"0I4pgzLVtBTP"},"source":["# 1.0 An end-to-end classification problem (ETL)\n","\n"]},{"cell_type":"markdown","metadata":{"id":"Dh34gim6KPtT"},"source":["## 1.1 Dataset description"]},{"cell_type":"markdown","metadata":{"id":"iE8OJoDZ5AFK"},"source":["\n","\n","We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n","\n","You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n","\n","Let's take the following steps:\n","\n","1. Load Libraries\n","2. Fetch Data, including EDA\n","3. Pre-procesing\n","4. Data Segregation\n","\n","
"]},{"cell_type":"markdown","metadata":{"id":"7UpxKxU1Ej7f"},"source":["## 1.2 Install and load libraries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"e3Zq4gmzWK6z"},"outputs":[],"source":["!pip install pandas-profiling==3.1.0"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"t82KewAPWCYe"},"outputs":[],"source":["!pip install wandb"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"LASaVZuhRJlL"},"outputs":[],"source":["import wandb\n","import matplotlib.pyplot as plt\n","import seaborn as sns\n","import pandas as pd\n","import numpy as np\n","from pandas_profiling import ProfileReport\n","import tempfile\n","import os"]},{"cell_type":"markdown","metadata":{"id":"Z74pHa-qHVrT"},"source":["## 1.3 Exploratory Data Analysis (EDA)"]},{"cell_type":"markdown","metadata":{"id":"MxzlOezZLWQ_"},"source":["### 1.3.1 Login wandb\n"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"xvG-dYfiVwM9"},"outputs":[],"source":["# Login to Weights & Biases\n","!wandb login --relogin"]},{"cell_type":"markdown","metadata":{"id":"18RAS5kFXPAe"},"source":["### 1.3.2 Download raw_data artifact from Wandb"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5kKY7pRNSGo9"},"outputs":[],"source":["# save_code tracking all changes of the notebook and sync with Wandb\n","run = wandb.init(project=\"decision_tree\", save_code=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"H6eyFp6XTLyf"},"outputs":[],"source":["# donwload the latest version of artifact raw_data.csv\n","artifact = run.use_artifact(\"decision_tree/raw_data.csv:latest\")\n","\n","# create a dataframe from the artifact\n","df = pd.read_csv(artifact.file())"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"SWMTKGEUYHeK"},"outputs":[],"source":["df.head()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"4WkUfOnkYI5l"},"outputs":[],"source":["df.info()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"0dVXWRhmGmOn"},"outputs":[],"source":["df.describe()"]},{"cell_type":"markdown","metadata":{"id":"PnBYLNwlBFrO"},"source":["### 1.3.3 Pandas Profilling"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"vGOWSZJoI-s_"},"outputs":[],"source":["ProfileReport(df, title=\"Pandas Profiling Report\", explorative=True)"]},{"cell_type":"markdown","metadata":{"id":"N3yBP7hxM0q1"},"source":["### 1.3.4 EDA Manually"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"JN2R1QPW4ZJo"},"outputs":[],"source":["# There are duplicated rows\n","df.duplicated().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"5Qscf9h04gO_"},"outputs":[],"source":["# Delete duplicated rows\n","df.drop_duplicates(inplace=True)\n","df.duplicated().sum()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"yCPerUdKl-yz"},"outputs":[],"source":["# what the sex column can help us?\n","pd.crosstab(df.high_income,df.sex,margins=True,normalize=False)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"Qrk81pTpUPi-"},"outputs":[],"source":["# income vs [sex & race]?\n","pd.crosstab(df.high_income,[df.sex,df.race],margins=True)"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"WTcnUzDtW7pX"},"outputs":[],"source":["%matplotlib inline\n","\n","sns.catplot(x=\"sex\", \n"," hue=\"race\", \n"," col=\"high_income\",\n"," data=df, kind=\"count\",\n"," height=4, aspect=.7)\n","plt.show()"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"iiNAy2H2mOZT"},"outputs":[],"source":["g = 
sns.catplot(x=\"sex\", \n"," hue=\"workclass\", \n"," col=\"high_income\",\n"," data=df, kind=\"count\",\n"," height=4, aspect=.7)\n","\n","g.savefig(\"HighIncome_Sex_Workclass.png\", dpi=100)\n","\n","run.log(\n"," {\n"," \"High_Income vs Sex vs Workclass\": wandb.Image(\"HighIncome_Sex_Workclass.png\")\n"," }\n"," )"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"w_-oFseHmmBc"},"outputs":[],"source":["df.isnull().sum()"]},{"cell_type":"code","source":["run.finish()"],"metadata":{"id":"cHDWOxTji0P0"},"execution_count":null,"outputs":[]}],"metadata":{"colab":{"collapsed_sections":[],"name":"eda.ipynb","provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.12"}},"nbformat":4,"nbformat_minor":0}
--------------------------------------------------------------------------------
/lessons/week_02/sources/fetch_data.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","metadata":{"id":"0I4pgzLVtBTP","tags":[]},"source":["# 1.0 An end-to-end classification problem (ETL)\n","\n"]},{"cell_type":"markdown","metadata":{"id":"Dh34gim6KPtT","tags":[]},"source":["## 1.1 Dataset description"]},{"cell_type":"markdown","metadata":{"id":"iE8OJoDZ5AFK"},"source":["\n","\n","We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n","\n","You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n","\n","Let's take the following steps:\n","\n","1. Load Libraries\n","2. Fetch Data, including EDA\n","3. Pre-procesing\n","4. Data Segregation\n","\n","
"]},{"cell_type":"markdown","metadata":{"id":"7UpxKxU1Ej7f"},"source":["## 1.2 Install and load libraries"]},{"cell_type":"code","source":["!pip install wandb"],"metadata":{"id":"Y6MAZ5-e7kgA"},"execution_count":null,"outputs":[]},{"cell_type":"code","execution_count":2,"metadata":{"id":"LASaVZuhRJlL","tags":[],"executionInfo":{"status":"ok","timestamp":1649100830251,"user_tz":180,"elapsed":914,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[],"source":["import wandb\n","import pandas as pd"]},{"cell_type":"markdown","metadata":{"id":"Z74pHa-qHVrT"},"source":["## 1.3 Fetch Data"]},{"cell_type":"markdown","metadata":{"id":"MxzlOezZLWQ_"},"source":["### 1.3.1 Create the raw_data artifact"]},{"cell_type":"code","execution_count":3,"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":339},"id":"i_n2KZu0usUv","outputId":"7160274a-1f05-456b-fda3-023f55353d51","tags":[],"executionInfo":{"status":"ok","timestamp":1649100856527,"user_tz":180,"elapsed":1404,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[{"output_type":"execute_result","data":{"text/plain":[" age workclass fnlwgt education education_num \\\n","0 39 State-gov 77516 Bachelors 13 \n","1 50 Self-emp-not-inc 83311 Bachelors 13 \n","2 38 Private 215646 HS-grad 9 \n","3 53 Private 234721 11th 7 \n","4 28 Private 338409 Bachelors 13 \n","\n"," marital_status occupation relationship race sex \\\n","0 Never-married Adm-clerical Not-in-family White Male \n","1 Married-civ-spouse Exec-managerial Husband White Male \n","2 Divorced Handlers-cleaners Not-in-family White Male \n","3 Married-civ-spouse Handlers-cleaners Husband Black Male \n","4 Married-civ-spouse Prof-specialty Wife Black Female \n","\n"," capital_gain capital_loss hours_per_week native_country high_income \n","0 2174 0 40 United-States <=50K \n","1 0 0 13 United-States <=50K \n","2 0 0 40 United-States <=50K \n","3 0 0 40 United-States <=50K \n","4 0 0 40 Cuba <=50K "],"text/html":["\n","
\n","
\n","
\n","\n","
\n"," \n","
\n","
\n","
age
\n","
workclass
\n","
fnlwgt
\n","
education
\n","
education_num
\n","
marital_status
\n","
occupation
\n","
relationship
\n","
race
\n","
sex
\n","
capital_gain
\n","
capital_loss
\n","
hours_per_week
\n","
native_country
\n","
high_income
\n","
\n"," \n"," \n","
\n","
0
\n","
39
\n","
State-gov
\n","
77516
\n","
Bachelors
\n","
13
\n","
Never-married
\n","
Adm-clerical
\n","
Not-in-family
\n","
White
\n","
Male
\n","
2174
\n","
0
\n","
40
\n","
United-States
\n","
<=50K
\n","
\n","
\n","
1
\n","
50
\n","
Self-emp-not-inc
\n","
83311
\n","
Bachelors
\n","
13
\n","
Married-civ-spouse
\n","
Exec-managerial
\n","
Husband
\n","
White
\n","
Male
\n","
0
\n","
0
\n","
13
\n","
United-States
\n","
<=50K
\n","
\n","
\n","
2
\n","
38
\n","
Private
\n","
215646
\n","
HS-grad
\n","
9
\n","
Divorced
\n","
Handlers-cleaners
\n","
Not-in-family
\n","
White
\n","
Male
\n","
0
\n","
0
\n","
40
\n","
United-States
\n","
<=50K
\n","
\n","
\n","
3
\n","
53
\n","
Private
\n","
234721
\n","
11th
\n","
7
\n","
Married-civ-spouse
\n","
Handlers-cleaners
\n","
Husband
\n","
Black
\n","
Male
\n","
0
\n","
0
\n","
40
\n","
United-States
\n","
<=50K
\n","
\n","
\n","
4
\n","
28
\n","
Private
\n","
338409
\n","
Bachelors
\n","
13
\n","
Married-civ-spouse
\n","
Prof-specialty
\n","
Wife
\n","
Black
\n","
Female
\n","
0
\n","
0
\n","
40
\n","
Cuba
\n","
<=50K
\n","
\n"," \n","
\n","
\n"," \n"," \n"," \n","\n"," \n","
\n","
\n"," "]},"metadata":{},"execution_count":3}],"source":["# columns used \n","columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',\n"," 'marital_status', 'occupation', 'relationship', 'race', \n"," 'sex','capital_gain', 'capital_loss', 'hours_per_week',\n"," 'native_country','high_income']\n","# importing the dataset\n","income = pd.read_csv(\"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data\",\n"," header=None,\n"," names=columns)\n","income.head()"]},{"cell_type":"code","execution_count":4,"metadata":{"id":"j3Otnz1-UYre","tags":[],"executionInfo":{"status":"ok","timestamp":1649100915009,"user_tz":180,"elapsed":465,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[],"source":["income.to_csv(\"raw_data.csv\",index=False)"]},{"cell_type":"code","execution_count":5,"metadata":{"tags":[],"colab":{"base_uri":"https://localhost:8080/"},"id":"sgSoPVLZ6-ps","executionInfo":{"status":"ok","timestamp":1649100978757,"user_tz":180,"elapsed":40646,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"ffcebab8-e0cc-4833-b4f2-d71658dbe66d"},"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n","\u001b[34m\u001b[1mwandb\u001b[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: \n","\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n"]}],"source":["# Login to Weights & Biases\n","!wandb login --relogin"]},{"cell_type":"code","execution_count":6,"metadata":{"id":"l7lg2qtMUFGW","tags":[],"colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1649101124875,"user_tz":180,"elapsed":15017,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"a7ac4446-02f6-4e45-ccc1-e249f3131cc0"},"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[34m\u001b[1mwandb\u001b[0m: Uploading file raw_data.csv to: \"ivanovitchm/decision_tree/raw_data.csv:latest\" (raw_data)\n","\u001b[34m\u001b[1mwandb\u001b[0m: Currently logged in as: \u001b[33mivanovitchm\u001b[0m (use `wandb login --relogin` to force relogin)\n","\u001b[34m\u001b[1mwandb\u001b[0m: Tracking run with wandb version 0.12.11\n","\u001b[34m\u001b[1mwandb\u001b[0m: Run data is saved locally in \u001b[35m\u001b[1m/content/wandb/run-20220404_193832-32qj2hph\u001b[0m\n","\u001b[34m\u001b[1mwandb\u001b[0m: Run \u001b[1m`wandb offline`\u001b[0m to turn off syncing.\n","\u001b[34m\u001b[1mwandb\u001b[0m: Syncing run \u001b[33mquiet-snow-1\u001b[0m\n","\u001b[34m\u001b[1mwandb\u001b[0m: ⭐️ View project at \u001b[34m\u001b[4mhttps://wandb.ai/ivanovitchm/decision_tree\u001b[0m\n","\u001b[34m\u001b[1mwandb\u001b[0m: 🚀 View run at \u001b[34m\u001b[4mhttps://wandb.ai/ivanovitchm/decision_tree/runs/32qj2hph\u001b[0m\n","Artifact uploaded, use this artifact in a run by adding:\n","\n"," artifact = run.use_artifact(\"ivanovitchm/decision_tree/raw_data.csv:latest\")\n","\n","\n","\u001b[34m\u001b[1mwandb\u001b[0m: Waiting for W&B process to finish... 
\u001b[32m(success).\u001b[0m\n","\u001b[34m\u001b[1mwandb\u001b[0m: \n","\u001b[34m\u001b[1mwandb\u001b[0m: Synced \u001b[33mquiet-snow-1\u001b[0m: \u001b[34m\u001b[4mhttps://wandb.ai/ivanovitchm/decision_tree/runs/32qj2hph\u001b[0m\n","\u001b[34m\u001b[1mwandb\u001b[0m: Synced 5 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)\n","\u001b[34m\u001b[1mwandb\u001b[0m: Find logs at: \u001b[35m\u001b[1m./wandb/run-20220404_193832-32qj2hph/logs\u001b[0m\n"]}],"source":["# Send the raw_data.csv to the Wandb storing it as an artifact\n","!wandb artifact put \\\n"," --name decision_tree/raw_data.csv \\\n"," --type raw_data \\\n"," --description \"The raw data from 1994 US Census\" raw_data.csv"]},{"cell_type":"code","source":[""],"metadata":{"id":"e8Z6xmn0Ffaj"},"execution_count":null,"outputs":[]}],"metadata":{"colab":{"collapsed_sections":[],"name":"fetch_data.ipynb","provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.12"}},"nbformat":4,"nbformat_minor":0}
--------------------------------------------------------------------------------
/lessons/week_02/sources/preprocessing.ipynb:
--------------------------------------------------------------------------------
1 | {"cells":[{"cell_type":"markdown","metadata":{"id":"0I4pgzLVtBTP"},"source":["# 1.0 An end-to-end classification problem (ETL)\n","\n"]},{"cell_type":"markdown","metadata":{"id":"Dh34gim6KPtT"},"source":["## 1.1 Dataset description"]},{"cell_type":"markdown","metadata":{"id":"iE8OJoDZ5AFK"},"source":["\n","\n","We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n","\n","You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n","\n","Let's take the following steps:\n","\n","1. Load Libraries\n","2. Fetch Data, including EDA\n","3. Pre-procesing\n","4. Data Segregation\n","\n","
"]},{"cell_type":"markdown","metadata":{"id":"7UpxKxU1Ej7f"},"source":["## 1.2 Install and load libraries"]},{"cell_type":"code","execution_count":null,"metadata":{"id":"t82KewAPWCYe"},"outputs":[],"source":["!pip install wandb"]},{"cell_type":"code","execution_count":20,"metadata":{"id":"LASaVZuhRJlL","executionInfo":{"status":"ok","timestamp":1649111549347,"user_tz":180,"elapsed":427,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[],"source":["import wandb\n","import pandas as pd"]},{"cell_type":"markdown","metadata":{"id":"Z74pHa-qHVrT"},"source":["## 1.3 Preprocessing"]},{"cell_type":"markdown","metadata":{"id":"MxzlOezZLWQ_"},"source":["### 1.3.1 Login wandb\n"]},{"cell_type":"code","execution_count":21,"metadata":{"id":"xvG-dYfiVwM9","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1649111573515,"user_tz":180,"elapsed":21048,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"2e7d5d03-12c7-47e3-c29f-ff93a0587be6"},"outputs":[{"output_type":"stream","name":"stdout","text":["\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n","\u001b[34m\u001b[1mwandb\u001b[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: \n","\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n"]}],"source":["# Login to Weights & Biases\n","!wandb login --relogin"]},{"cell_type":"markdown","source":["### 1.3.2 Artifacts"],"metadata":{"id":"3GrAiPvGm0kl"}},{"cell_type":"code","source":["input_artifact=\"decision_tree/raw_data.csv:latest\"\n","artifact_name=\"preprocessed_data.csv\"\n","artifact_type=\"clean_data\"\n","artifact_description=\"Data after preprocessing\""],"metadata":{"id":"dBiQMbchm78L","executionInfo":{"status":"ok","timestamp":1649111604334,"user_tz":180,"elapsed":539,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"execution_count":22,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"18RAS5kFXPAe"},"source":["### 1.3.3 Setup your wandb project and clean the dataset"]},{"cell_type":"markdown","source":["After the fetch step the raw data artifact was generated.\n","Now, we need to pre-processing the raw data to create a new artfiact (clean_data)."],"metadata":{"id":"C6YkQ8SOn3qo"}},{"cell_type":"code","execution_count":23,"metadata":{"id":"5kKY7pRNSGo9","colab":{"base_uri":"https://localhost:8080/","height":69},"executionInfo":{"status":"ok","timestamp":1649111639094,"user_tz":180,"elapsed":5526,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"eab22db6-2470-41ae-dd2d-a058ca3b9073"},"outputs":[{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Tracking run with wandb version 0.12.11"]},"metadata":{}},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Run data is saved locally in /content/wandb/run-20220404_223353-n93uggvj"]},"metadata":{}},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Syncing run likely-wind-6 to Weights & Biases (docs) "]},"metadata":{}}],"source":["# create a new job_type\n","run = wandb.init(project=\"decision_tree\", job_type=\"process_data\")"]},{"cell_type":"code","execution_count":24,"metadata":{"id":"H6eyFp6XTLyf","executionInfo":{"status":"ok","timestamp":1649111658361,"user_tz":180,"elapsed":2624,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[],"source":["# 
donwload the latest version of artifact raw_data.csv\n","artifact = run.use_artifact(input_artifact)\n","\n","# create a dataframe from the artifact\n","df = pd.read_csv(artifact.file())"]},{"cell_type":"code","execution_count":25,"metadata":{"id":"SWMTKGEUYHeK","executionInfo":{"status":"ok","timestamp":1649111669513,"user_tz":180,"elapsed":548,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}}},"outputs":[],"source":["# Delete duplicated rows\n","df.drop_duplicates(inplace=True)\n","\n","# Generate a \"clean data file\"\n","df.to_csv(artifact_name,index=False)"]},{"cell_type":"code","execution_count":26,"metadata":{"id":"4WkUfOnkYI5l","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1649111718114,"user_tz":180,"elapsed":435,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"2454d010-301f-4c7a-a8d2-c104c692cf13"},"outputs":[{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":26}],"source":["# Create a new artifact and configure with the necessary arguments\n","artifact = wandb.Artifact(name=artifact_name,\n"," type=artifact_type,\n"," description=artifact_description)\n","artifact.add_file(artifact_name)"]},{"cell_type":"code","execution_count":27,"metadata":{"id":"0dVXWRhmGmOn","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1649111732059,"user_tz":180,"elapsed":430,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"4742801e-ff1c-497c-9554-15ae996514ac"},"outputs":[{"output_type":"execute_result","data":{"text/plain":[""]},"metadata":{},"execution_count":27}],"source":["# Upload the artifact to Wandb\n","run.log_artifact(artifact)"]},{"cell_type":"code","source":["# close the run\n","# waiting a while after run the previous cell before execute this\n","run.finish()"],"metadata":{"colab":{"base_uri":"https://localhost:8080/","height":104,"referenced_widgets":["5a6fb731545e418cad4d849f125a7b75","cc467ffc17814d0595bcbefc7da605c0","343d716007a04ebaa529b0416e5fa984","494329a9abcc46a9a82b2c231e917752","5f92efb7fa3742f3b46eff74fc48ec2f","46aaa8f54fd6438a83af95b4413ca42e","778a30fa4e6b424a80f3c067f43499b2","1c60b15036a64ebd95f57faf6c22c15d"]},"id":"mqRqmqbDp3Zc","executionInfo":{"status":"ok","timestamp":1649111754546,"user_tz":180,"elapsed":5106,"user":{"displayName":"Ivanovitch Silva","userId":"06428777505436195303"}},"outputId":"d20a1136-0fb7-464d-b958-e567709b883d"},"execution_count":28,"outputs":[{"output_type":"stream","name":"stdout","text":["\n"]},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Waiting for W&B process to finish... 
(success)."]},"metadata":{}},{"output_type":"display_data","data":{"text/plain":["VBox(children=(Label(value='3.633 MB of 3.633 MB uploaded (0.000 MB deduped)\\r'), FloatProgress(value=1.0, max…"],"application/vnd.jupyter.widget-view+json":{"version_major":2,"version_minor":0,"model_id":"5a6fb731545e418cad4d849f125a7b75"}},"metadata":{}},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Synced likely-wind-6: https://wandb.ai/ivanovitchm/decision_tree/runs/n93uggvj Synced 4 W&B file(s), 0 media file(s), 1 artifact file(s) and 0 other file(s)"]},"metadata":{}},{"output_type":"display_data","data":{"text/plain":[""],"text/html":["Find logs at: ./wandb/run-20220404_223353-n93uggvj/logs"]},"metadata":{}}]},{"cell_type":"code","source":[""],"metadata":{"id":"i2jFX9TtqPLV"},"execution_count":null,"outputs":[]}],"metadata":{"colab":{"collapsed_sections":[],"name":"preprocessing.ipynb","provenance":[]},"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.12"},"widgets":{"application/vnd.jupyter.widget-state+json":{"5a6fb731545e418cad4d849f125a7b75":{"model_module":"@jupyter-widgets/controls","model_name":"VBoxModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"VBoxModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"VBoxView","box_style":"","children":["IPY_MODEL_cc467ffc17814d0595bcbefc7da605c0","IPY_MODEL_343d716007a04ebaa529b0416e5fa984"],"layout":"IPY_MODEL_494329a9abcc46a9a82b2c231e917752"}},"cc467ffc17814d0595bcbefc7da605c0":{"model_module":"@jupyter-widgets/controls","model_name":"LabelModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"LabelModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"LabelView","description":"","description_tooltip":null,"layout":"IPY_MODEL_5f92efb7fa3742f3b46eff74fc48ec2f","placeholder":"","style":"IPY_MODEL_46aaa8f54fd6438a83af95b4413ca42e","value":"3.640 MB of 3.640 MB uploaded (0.000 MB 
deduped)\r"}},"343d716007a04ebaa529b0416e5fa984":{"model_module":"@jupyter-widgets/controls","model_name":"FloatProgressModel","model_module_version":"1.5.0","state":{"_dom_classes":[],"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"FloatProgressModel","_view_count":null,"_view_module":"@jupyter-widgets/controls","_view_module_version":"1.5.0","_view_name":"ProgressView","bar_style":"","description":"","description_tooltip":null,"layout":"IPY_MODEL_778a30fa4e6b424a80f3c067f43499b2","max":1,"min":0,"orientation":"horizontal","style":"IPY_MODEL_1c60b15036a64ebd95f57faf6c22c15d","value":1}},"494329a9abcc46a9a82b2c231e917752":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"5f92efb7fa3742f3b46eff74fc48ec2f":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bottom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"46aaa8f54fd6438a83af95b4413ca42e":{"model_module":"@jupyter-widgets/controls","model_name":"DescriptionStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"DescriptionStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","description_width":""}},"778a30fa4e6b424a80f3c067f43499b2":{"model_module":"@jupyter-widgets/base","model_name":"LayoutModel","model_module_version":"1.2.0","state":{"_model_module":"@jupyter-widgets/base","_model_module_version":"1.2.0","_model_name":"LayoutModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"LayoutView","align_content":null,"align_items":null,"align_self":null,"border":null,"bo
ttom":null,"display":null,"flex":null,"flex_flow":null,"grid_area":null,"grid_auto_columns":null,"grid_auto_flow":null,"grid_auto_rows":null,"grid_column":null,"grid_gap":null,"grid_row":null,"grid_template_areas":null,"grid_template_columns":null,"grid_template_rows":null,"height":null,"justify_content":null,"justify_items":null,"left":null,"margin":null,"max_height":null,"max_width":null,"min_height":null,"min_width":null,"object_fit":null,"object_position":null,"order":null,"overflow":null,"overflow_x":null,"overflow_y":null,"padding":null,"right":null,"top":null,"visibility":null,"width":null}},"1c60b15036a64ebd95f57faf6c22c15d":{"model_module":"@jupyter-widgets/controls","model_name":"ProgressStyleModel","model_module_version":"1.5.0","state":{"_model_module":"@jupyter-widgets/controls","_model_module_version":"1.5.0","_model_name":"ProgressStyleModel","_view_count":null,"_view_module":"@jupyter-widgets/base","_view_module_version":"1.2.0","_view_name":"StyleView","bar_color":null,"description_width":""}}}}},"nbformat":4,"nbformat_minor":0}
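For reference, the preprocessing cells above can be collapsed into one standalone script. The sketch below reproduces the same flow (fetch the latest raw artifact, drop duplicated rows, version the result as a `clean_data` artifact); it assumes `wandb login` has already been done and that the fetch step created `decision_tree/raw_data.csv:latest`, with the same names and descriptions used in the notebook.

```python
# Sketch of the preprocessing step above as a single script.
# Assumes a prior `wandb login` and that the fetch step already
# logged the artifact decision_tree/raw_data.csv:latest.
import pandas as pd
import wandb

input_artifact = "decision_tree/raw_data.csv:latest"
artifact_name = "preprocessed_data.csv"

# open a run of job_type "process_data" in the same project
run = wandb.init(project="decision_tree", job_type="process_data")

# download the latest raw data artifact and load it into a dataframe
raw_path = run.use_artifact(input_artifact).file()
df = pd.read_csv(raw_path)

# the only cleaning performed at this stage: remove duplicated rows
df.drop_duplicates(inplace=True)
df.to_csv(artifact_name, index=False)

# version the cleaned file as a new artifact and close the run
clean_data = wandb.Artifact(name=artifact_name,
                            type="clean_data",
                            description="Data after preprocessing")
clean_data.add_file(artifact_name)
run.log_artifact(clean_data)
run.finish()
```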
--------------------------------------------------------------------------------
/lessons/week_02/sources/test.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "id": "0I4pgzLVtBTP"
7 | },
8 | "source": [
9 | "# 1.0 An end-to-end classification problem (Testing)\n",
10 | "\n"
11 | ]
12 | },
13 | {
14 | "cell_type": "markdown",
15 | "metadata": {
16 | "id": "Dh34gim6KPtT"
17 | },
18 | "source": [
19 | "## 1.1 Dataset description"
20 | ]
21 | },
22 | {
23 | "cell_type": "markdown",
24 | "metadata": {
25 | "id": "iE8OJoDZ5AFK"
26 | },
27 | "source": [
28 | "We'll be looking at individual income in the United States. The **data** is from the **1994 census**, and contains information on an individual's **marital status**, **age**, **type of work**, and more. The **target column**, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than **50k a year**.\n",
29 | "\n",
30 | "You can download the data from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).\n",
31 | "\n",
32 | "Let's take the following steps:\n",
33 | "\n",
34 | "1. ETL (done)\n",
35 | "2. Data Checks (done)\n",
36 | "3. Data Segregation (done)\n",
37 | "4. Training (done)\n",
38 | "5. Test\n",
39 | "\n",
40 | "Figure 1: Drop-out on the second hidden layer. At each iteration, you shut down (= set to zero) each neuron of a layer with probability $1 - keep\_prob$, or keep it with probability $keep\_prob$ (50% here). The dropped neurons don't contribute to training in either the forward or the backward pass of that iteration.\n","\n","Figure 2: Drop-out on the first and third hidden layers. $1^{st}$ layer: we shut down on average 40% of the neurons. $3^{rd}$ layer: we shut down on average 20% of the neurons.
\n","\n","\n","When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time. \n"]},{"cell_type":"code","metadata":{"id":"FpbWii871hzS"},"source":["# mlp with weight regularization for the moons dataset\n","from sklearn.datasets import make_moons\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.layers import Dropout\n","from tensorflow.keras.regularizers import l2\n","import matplotlib.pyplot as plt\n","\n","# generate 2d classification dataset\n","x, y = make_moons(n_samples=100, noise=0.2, random_state=1)\n","\n","# split into train and test sets\n","n_train = 30\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model_dropout = Sequential()\n","model_dropout.add(Dense(500, input_dim=2, activation='relu'))\n","model_dropout.add(Dropout(0.4))\n","model_dropout.add(Dense(1, activation='sigmoid'))\n","model_dropout.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n","\n","# callbacks tensorboard\n","logdir = os.path.join(\"logs\", datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n","tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=100)\n","\n","# fit model\n","history_dropout = model_dropout.fit(train_x, train_y, \n"," validation_data=(test_x, test_y),\n"," epochs=4000, verbose=0,\n"," callbacks=[MyCustomCallback(),tensorboard_callback])\n","\n","# evaluate the model\n","_, train_acc = model_dropout.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model_dropout.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history_dropout.history['loss'], label='train')\n","plt.plot(history_dropout.history['val_loss'], label='test')\n","plt.legend()\n","\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history_dropout.history['accuracy'], label='train')\n","plt.plot(history_dropout.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Z-nFSIaGqU-a"},"source":["from mlxtend.plotting import plot_decision_regions\n","# Plot decision boundary\n","plot_decision_regions(test_x,test_y.squeeze(), clf=model_dropout,zoom_factor=2.0)\n","plt.title(\"Model with dropout\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"5eoZoKAa2rnJ"},"source":["# Start TensorBoard within the notebook using magics\n","%tensorboard --logdir logs"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"q2cF42K2FPD3"},"source":["# 5 - L2 vs Dropout"]},{"cell_type":"code","metadata":{"id":"53fGLuKeMFgG"},"source":["def print_analysis(titles,history,loss=True):\n"," if loss:\n"," func = \"loss\"\n"," func_val = \"val_loss\"\n"," else:\n"," func = \"binary_accuracy\"\n"," func_val = \"val_binary_accuracy\"\n","\n"," f, axs = plt.subplots(1,len(titles),figsize=(12,6))\n"," \n"," for i, title in enumerate(titles):\n"," axs[i].set_title(title)\n"," 
axs[i].plot(history[i].history[func])\n"," axs[i].plot(history[i].history[func_val])\n"," axs[i].set_ylabel(func)\n"," axs[i].set_xlabel('epoch')\n"," axs[i].legend(['train', 'test'], loc='best')\n"," \n"," plt.tight_layout()\n"," plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"i1Q68J32Nrcj"},"source":["titles = ['Model without regularization','Model with regularization L2','Model with dropout']\n","hist = [history,history_l2,history_dropout]\n","print_analysis(titles,hist,loss=True)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"WyCRtKAIPgIY"},"source":["def print_regions(titles,models):\n","\n"," f, axs = plt.subplots(1,len(titles),figsize=(12,4))\n"," \n"," for i, title in enumerate(titles):\n"," plot_decision_regions(test_x,test_y.squeeze(), clf=models[i],zoom_factor=2.0,ax=axs[i])\n"," axs[i].set_title(title)\n"," plt.tight_layout()\n"," plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"668zLQ3BQFNA"},"source":["models = [model,model_l2,model_dropout]\n","print_regions(titles,models)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"3ZnGlZK43WBi"},"source":["# 6 - Force Small Weights with Weight Constraints"]},{"cell_type":"code","metadata":{"id":"loAB8_wRURLy"},"source":["# mlp overfit on the moons dataset with a unit norm constraint\n","from sklearn.datasets import make_moons\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.constraints import unit_norm\n","import matplotlib.pyplot as plt\n","import os\n","\n","# generate 2d classification dataset\n","x, y = make_moons(n_samples=100, noise=0.2, random_state=1)\n","\n","# split into train and test\n","n_train = 30\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model = Sequential()\n","model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))\n","#kernel_constraint=tf.keras.constraints.min_max_norm(min_value=-0.2, max_value=1.0)))\n","model.add(Dense(1, activation='sigmoid'))\n","model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n","\n","# callbacks tensorboard\n","logdir = os.path.join(\"logs\", datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n","tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=100)\n","\n","# fit model\n","history = model.fit(train_x, train_y,\n"," validation_data=(test_x, test_y),\n"," epochs=4000, verbose=0,\n"," callbacks=[MyCustomCallback(),tensorboard_callback])\n","\n","# evaluate the model\n","_, train_acc = model.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history.history['loss'], label='train')\n","plt.plot(history.history['val_loss'], label='test')\n","plt.legend()\n","\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history.history['accuracy'], label='train')\n","plt.plot(history.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"BHChR2tCUrpv"},"source":["from mlxtend.plotting import plot_decision_regions\n","# Plot decision 
boundary\n","plot_decision_regions(test_x,test_y.squeeze(), clf=model,zoom_factor=2.0)\n","plt.title(\"Model with weights constraints\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"PWus4c-GaYYH"},"source":["# Start TensorBoard within the notebook using magics\n","%tensorboard --logdir logs"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"FtdmGRF3iLPB"},"source":["filter = tf.keras.constraints.UnitNorm()\n","data = np.arange(3).reshape(3, 1).astype(np.float32)\n","print(data)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"2l86cAzCib0f"},"source":["filter(data)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"iaLEf9vqiiGQ"},"source":["np.linalg.norm(filter(data))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ngq7qmwWlvxm"},"source":["np.linalg.norm(data)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ymjSWu26l5pC"},"source":["data/np.linalg.norm(data)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"I7_8PhubebIR"},"source":["filter = tf.keras.constraints.UnitNorm()\n","data = np.arange(6).reshape(3, 2).astype(np.float32)\n","data"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"-l_ODs3RevKr"},"source":["filter(data)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"eCFtK1cqo8wI"},"source":["np.linalg.norm(filter(data),axis=0)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Ha4YL7P1e2Kk"},"source":["np.linalg.norm(data,axis=0)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fTtiuqSBfsBC"},"source":["data/np.linalg.norm(data,axis=0)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"O10e_wAiaxnv"},"source":["# mlp overfit on the moons dataset with a unit norm constraint\n","from sklearn.datasets import make_moons\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.constraints import unit_norm\n","import matplotlib.pyplot as plt\n","import os\n","\n","# generate 2d classification dataset\n","x, y = make_moons(n_samples=100, noise=0.2, random_state=1)\n","\n","# split into train and test\n","n_train = 30\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model = Sequential()\n","model.add(Dense(500, input_dim=2, activation='relu', \n"," kernel_constraint=tf.keras.constraints.min_max_norm(min_value=-0.2, max_value=1.0)))\n","model.add(Dense(1, activation='sigmoid'))\n","model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])\n","\n","# callbacks tensorboard\n","logdir = os.path.join(\"logs\", datetime.datetime.now().strftime(\"%Y%m%d-%H%M%S\"))\n","tensorboard_callback = tf.keras.callbacks.TensorBoard(logdir, histogram_freq=100)\n","\n","\n","# fit model\n","history = model.fit(train_x, train_y,\n"," validation_data=(test_x, test_y),\n"," epochs=4000, verbose=0,\n"," callbacks=[MyCustomCallback(),tensorboard_callback])\n","\n","# evaluate the model\n","_, train_acc = model.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history.history['loss'], 
label='train')\n","plt.plot(history.history['val_loss'], label='test')\n","plt.legend()\n","\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history.history['accuracy'], label='train')\n","plt.plot(history.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Hpt_bysOqgoP"},"source":["from mlxtend.plotting import plot_decision_regions\n","# Plot decision boundary\n","plot_decision_regions(test_x,test_y.squeeze(), clf=model,zoom_factor=2.0)\n","plt.title(\"Model with weights constraints\")\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"gYjUEqSarISm"},"source":[""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"XEVcChn-mXAd"},"source":[""],"execution_count":null,"outputs":[]}]}
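The figure captions above describe dropout in terms of $keep\_prob$, while the Keras cells use `Dropout(rate)` with `rate = 1 - keep_prob`. As a side illustration (not part of the notebook), the NumPy sketch below applies one inverted-dropout step to a toy activation matrix: each unit is shut down with probability `1 - keep_prob` and the survivors are rescaled by `1 / keep_prob` so the expected activation is unchanged, which is what Keras' `Dropout` layer does during training.

```python
# Toy inverted-dropout step on a small activation matrix (NumPy only).
import numpy as np

rng = np.random.default_rng(42)
keep_prob = 0.5                          # Figure 1 keeps units with p = 50%
a = rng.normal(size=(4, 3))              # activations of one hidden layer

mask = rng.random(a.shape) < keep_prob   # True = keep the unit, False = drop it
a_dropped = (a * mask) / keep_prob       # rescale so E[a_dropped] equals E[a]

print("fraction of units kept:", mask.mean())
print(a_dropped)
```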
--------------------------------------------------------------------------------
/lessons/week_12/Better Generalizaton vs Better Learning.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/lessons/week_12/Better Generalizaton vs Better Learning.pdf
--------------------------------------------------------------------------------
/lessons/week_14/Hyperparameter Tuning and Batch Normalization.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ivanovitchm/ppgeecmachinelearning/ec32e114013d044419593d4da7b9647439024501/lessons/week_14/Hyperparameter Tuning and Batch Normalization.pdf
--------------------------------------------------------------------------------
/lessons/week_14/Task #01 Hyperparameter Tuning using Keras Tuner.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"accelerator":"GPU","colab":{"name":"Week #06 Task #01 Hyperparameter Tuning using Keras Tuner.ipynb","provenance":[],"collapsed_sections":[]},"kernelspec":{"display_name":"Python 3","name":"python3"}},"cells":[{"cell_type":"markdown","metadata":{"id":"7DLkQmFwhbq4"},"source":["# A brief recap about DL Pipeline"]},{"cell_type":"markdown","metadata":{"id":"VH2FWhOPNzZO"},"source":["- Define the task\n"," - Frame the problem\n"," - Collect a dataset\n"," - Understand your data\n"," - Choose a measure of success\n","- Develop a model\n"," - Prepare the data\n"," - Choose an evaluation protocol\n"," - Beat a baseline\n"," - Scale up: develop a model that overfits\n"," - Regularize and tune your model\n","- Deploy your model\n"," - Explain your work to stakeholders and set expectations\n"," - Ship an inference model\n"," - Deploying a model as a rest API\n"," - Deploying a model on device\n"," - Deploying a model in the browser\n"," - Monitor your model in the wild\n"," - Maintain your model\n"]},{"cell_type":"markdown","metadata":{"id":"cTShha8tLAvY"},"source":["# 1.0 Baseline Model"]},{"cell_type":"markdown","metadata":{"id":"RD1hMKLZEObd"},"source":["## 1.1 Import Libraries"]},{"cell_type":"markdown","metadata":{"id":"4MR6rLenKnry"},"source":["Install and import the Keras Tuner."]},{"cell_type":"code","metadata":{"id":"rEYDnz5LKra8"},"source":["# pip install -q (quiet)\n","!pip install git+https://github.com/keras-team/keras-tuner.git -q"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"jtGJ6_Pi2yG9"},"source":["import tensorflow as tf\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import h5py\n","import time\n","import datetime\n","import pytz\n","import IPython\n","import keras_tuner as kt"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ccfIzxFaHeOb"},"source":["print('TF version:', tf.__version__)\n","print('KT version:', kt.__version__)\n","print('GPU devices:', tf.config.list_physical_devices('GPU'))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"0WGGeZQGESGP"},"source":["## 1.2 Utils Functions"]},{"cell_type":"code","metadata":{"id":"hw9z0s-QqSCd"},"source":["# download train_catvnoncat.h5\n","!gdown https://drive.google.com/uc?id=1ZPWKlEATuDjFtZJPgHCc5SURrcKaVP9Z\n","\n","# download test_catvnoncat.h5\n","!gdown https://drive.google.com/uc?id=1ndRNAwidOqEgqDHBurA0PGyXqHBlvzz-"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"wJYkR70dEGHh"},"source":["def load_dataset():\n"," # load the train data\n"," train_dataset = h5py.File('train_catvnoncat.h5', \"r\")\n","\n"," # your train set features\n"," train_set_x_orig = np.array(train_dataset[\"train_set_x\"][:]) \n","\n"," # your train set labels\n"," train_set_y_orig = np.array(train_dataset[\"train_set_y\"][:]) \n","\n"," # load the test data\n"," test_dataset = h5py.File('test_catvnoncat.h5', \"r\")\n","\n"," # your test set features\n"," test_set_x_orig = np.array(test_dataset[\"test_set_x\"][:]) \n","\n"," # your test set labels \n"," test_set_y_orig = np.array(test_dataset[\"test_set_y\"][:]) \n","\n"," # the list of classes\n"," classes = np.array(test_dataset[\"list_classes\"][:]) \n","\n"," # reshape the test data\n"," train_set_y_orig = train_set_y_orig.reshape((train_set_y_orig.shape[0],1))\n"," test_set_y_orig = test_set_y_orig.reshape((test_set_y_orig.shape[0],1))\n","\n"," return train_set_x_orig, train_set_y_orig, test_set_x_orig, 
test_set_y_orig, classes"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"exIeT2zMHhxa"},"source":["## 1.3 Load Dataset"]},{"cell_type":"code","metadata":{"id":"mjQYsrdPHSyh"},"source":["# Loading the data (cat/non-cat)\n","train_set_x_orig, train_y, test_set_x_orig, test_y, classes = load_dataset()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"6ihL83MbHlhc"},"source":["# Reshape the training and test examples\n","train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0],-1)\n","test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0],-1)\n","\n","# Standardize the dataset\n","train_x = train_set_x_flatten/255\n","test_x = test_set_x_flatten/255"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"sQn-Brt-IhK_"},"source":["print (\"train_x shape: \" + str(train_x.shape))\n","print (\"train_y shape: \" + str(train_y.shape))\n","print (\"test_x shape: \" + str(test_x.shape))\n","print (\"test_y shape: \" + str(test_y.shape))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Xa9dcSRx1__x"},"source":["# visualize a sample data\n","index = 13\n","plt.imshow(train_set_x_orig[index])"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"5zLrb3ORIlo_"},"source":["## 1.4 Model"]},{"cell_type":"code","metadata":{"id":"zJutkmXnIpG6"},"source":["class MyCustomCallback(tf.keras.callbacks.Callback):\n","\n"," def on_train_begin(self, batch, logs=None):\n"," self.begins = time.time()\n"," print('Training: begins at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime(\"%a, %d %b %Y %H:%M:%S\")))\n","\n"," def on_train_end(self, logs=None):\n"," print('Training: ends at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime(\"%a, %d %b %Y %H:%M:%S\")))\n"," print('Duration: {:.2f} seconds'.format(time.time() - self.begins)) "],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"SyfcUdH36lGG"},"source":["# Instantiate a simple classification model\n","model = tf.keras.Sequential([\n"," tf.keras.layers.Dense(8, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(8, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64')\n","])\n","\n","# Instantiate a logistic loss function that expects integer targets.\n","loss = tf.keras.losses.BinaryCrossentropy()\n","\n","# Instantiate an accuracy metric.\n","accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n","# Instantiate an optimizer.\n","optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)\n","\n","# configure the optimizer, loss, and metrics to monitor.\n","model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])\n","\n","# training \n","history = model.fit(x=train_x,\n"," y=train_y,\n"," batch_size=32,\n"," epochs=500,\n"," validation_data=(test_x,test_y),\n"," callbacks=[MyCustomCallback()],\n"," verbose=1)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"fugflUT5JtCe"},"source":["loss, acc = model.evaluate(x=train_x,y=train_y, batch_size=32)\n","print('Train loss: %.4f - acc: %.4f' % (loss, acc))\n","\n","loss_, acc_ = model.evaluate(x=test_x,y=test_y, batch_size=32)\n","print('Test loss: %.4f - acc: %.4f' % (loss_, acc_))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"q6CsWRrqJzFQ"},"source":["# 2.0 Hyperparameter Tuning using 
Keras-Tuner"]},{"cell_type":"markdown","metadata":{"id":"yff1zTsrLJ4J"},"source":["The [Keras Tuner](https://github.com/keras-team/keras-tuner) is a library that helps you pick the optimal set of hyperparameters for your TensorFlow program. The process of selecting the right set of hyperparameters for your machine learning (ML) application is called **hyperparameter tuning** or **hypertuning**. \n","\n","Hyperparameters are the variables that govern the training process and the topology of an ML model. These variables remain constant over the training process and directly impact the performance of your ML program. Hyperparameters are of two types:\n","1. **Model hyperparameters** which influence model selection such as the number and width of hidden layers\n","2. **Algorithm hyperparameters** which influence the speed and quality of the learning algorithm such as the learning rate for Stochastic Gradient Descent (SGD) and the number of nearest neighbors for a k Nearest Neighbors (KNN) classifier, among others.\n"]},{"cell_type":"markdown","metadata":{"id":"K5YEL2H2Ax3e"},"source":["## 2.1 Define the model\n"]},{"cell_type":"markdown","metadata":{"id":"md7FhKoYcuj7"},"source":["\n","When you build a model for hypertuning, you also define the hyperparameter search space in addition to the model architecture. The model you set up for hypertuning is called a **hypermodel**.\n","\n","You can define a hypermodel through two approaches:\n","\n","* By using a model builder function\n","* By subclassing the `HyperModel` class of the Keras Tuner API\n","\n","You can also use two pre-defined `HyperModel` classes - [HyperXception](https://keras-team.github.io/keras-tuner/documentation/hypermodels/#hyperxception-class) and [HyperResNet](https://keras-team.github.io/keras-tuner/documentation/hypermodels/#hyperresnet-class) for computer vision applications.\n","\n","In this section, you use a model builder function to define the image classification model. The model builder function returns a compiled model and uses hyperparameters you define inline to hypertune the model."]},{"cell_type":"code","metadata":{"id":"M3Iosz7dctGf"},"source":["def model_builder(hp):\n"," # Instantiate a simple classification model\n"," model = tf.keras.Sequential()\n","\n"," # Tune the number of units in the first Dense layer\n"," # Choose an optimal value between 8-32\n"," hp_units = hp.Int('units', min_value = 8, max_value = 32, step = 8)\n"," model.add(tf.keras.layers.Dense(hp_units, activation=tf.nn.relu, dtype='float64'))\n"," model.add(tf.keras.layers.Dense(8, activation=tf.nn.relu, dtype='float64'))\n"," model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64'))\n","\n"," # Instantiate a logistic loss function that expects integer targets.\n"," loss = tf.keras.losses.BinaryCrossentropy()\n","\n"," # Instantiate an accuracy metric.\n"," accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n"," # Instantiate an optimizer.\n"," optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)\n","\n"," # configure the optimizer, loss, and metrics to monitor.\n"," model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])\n","\n"," return model"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"L8bYqKpDeBlB"},"source":["## 2.2 Instantiate the tuner and perform hypertuning\n"]},{"cell_type":"markdown","metadata":{"id":"pYX1lhvDeDoz"},"source":["\n","Instantiate the tuner to perform the hypertuning. 
The Keras Tuner has [four tuners available](https://keras-team.github.io/keras-tuner/documentation/tuners/) - `RandomSearch`, `Hyperband`, `BayesianOptimization`, and `Sklearn`. \n","\n","Notice that in previous subsection we're not fitting there, and we're returning the compiled model. Let's continue to build out the rest of our program first, then we'll make things more dynamic. Adding the dynamic bits will all happen in the **model_builder** function, but we will need some other code that will use this function now. To start, we're going to import **RandomSearch** and after that we'll first define our tuner."]},{"cell_type":"code","metadata":{"id":"BOzyY-KLhQd5"},"source":["from keras_tuner.tuners import RandomSearch"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"V4JXxAYAe67h"},"source":["# path to store results\n","LOG_DIR = f\"{int(time.time())}\""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Ikjt5YbnepK5"},"source":["tuner = RandomSearch(model_builder,\n"," objective='val_binary_accuracy',\n"," max_trials=4, # how many model configurations would you like to test?\n"," executions_per_trial=1, # how many trials per variation? (same model could perform differently)\n"," directory=LOG_DIR,\n"," project_name=\"my_first_tuner\"\n"," )"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"d8LgEGSAhewb"},"source":["- Your objective here probably should be **validation accuracy**, but you can choose from other things like **val_loss** for example.\n","- **max_trials** allows you limit how many tests will be run. If you put 10 here, you will get 10 different tests (provided you've specified enough variability for 10 different combinations, anyway).\n","- **executions_per_trial** might be 1, but you might also do many more like 3,5, or even 10.\n","\n","Basically, if you're just hunting for a model that works, then you should just do 1 trial per variation. If you're attempting to seek out 1-3% on **validation accuracy**, then you should run 3+ trials most likely per model, because each time a model runs, you should see some variation in final values. So this will just depend on what kind of a search you're doing (just trying to find something that works vs fine tuning...or anything in between)."]},{"cell_type":"markdown","metadata":{"id":"fJwb5NZgiWXt"},"source":["Run the hyperparameter search. 
The arguments for the search method are the same as those used for [`tf.keras.model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit).\n","\n","Before running the hyperparameter search, define a callback to clear the training outputs at the end of every training step."]},{"cell_type":"code","metadata":{"id":"nK6XNozXlJK7"},"source":["class ClearTrainingOutput(tf.keras.callbacks.Callback):\n"," def on_train_end(*args, **kwargs):\n"," IPython.display.clear_output(wait = True)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"PyNf7zRakAcr"},"source":["tuner.search(train_x,\n"," train_y, \n"," epochs = 500, \n"," verbose=1,\n"," batch_size=32,\n"," validation_data = (test_x, test_y),\n"," callbacks = [ClearTrainingOutput()])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Th3JZnM5kX_k"},"source":["# print a summary of results\n","tuner.results_summary(num_trials=10)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ICBtBvvSm7AE"},"source":["# best hyperparameters is a dictionary\n","tuner.get_best_hyperparameters()[0].values"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"_O9cyI_kK0US"},"source":["# search space summary\n","tuner.search_space_summary()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"xfJ3aiJ5p9iv"},"source":["print(f\"\"\"The hyperparameter search is complete. The optimal number of units in the first densely-connected\n","layer is {tuner.get_best_hyperparameters()[0].values.get('units')}\"\"\")"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"uvyvW9oZwkgJ"},"source":["## 2.3 Playing with search space"]},{"cell_type":"markdown","metadata":{"id":"-cIpO1GkxTgZ"},"source":["We also can play with the search space in order to contain conditional hyperparameters. Below, we have a **for loop** creating a **tunable number of layers**, which themselves involve a tunable **units** parameter. This can be pushed to any level of parameter interdependency, including recursion. 
Note that all parameter names should be unique (here, in the loop over **i**, we name the inner parameters **'units_'** + **str(i)**)."]},{"cell_type":"code","metadata":{"id":"5ZPJ9lSLyEfg"},"source":["def model_builder_all(hp):\n"," # Instantiate a simple classification model\n"," model = tf.keras.Sequential()\n"," \n"," # Create a tunable number of layers 1,2,3,4\n"," for i in range(hp.Int('num_layers', 1, 4)):\n","\n"," # Tune the number of units in the Dense layer\n"," # Choose an optimal value between 8-32\n"," model.add(tf.keras.layers.Dense(units=hp.Int('units_' + str(i),\n"," min_value = 8,\n"," max_value = 32,\n"," step = 8),\n"," # Tune the activation functions\n"," activation= hp.Choice('dense_activation_' + str(i),\n"," values=['relu', 'tanh'],\n"," default='relu'),\n"," dtype='float64'))\n","\n"," model.add(tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64'))\n","\n"," # Instantiate a logistic loss function that expects integer targets.\n"," loss = tf.keras.losses.BinaryCrossentropy()\n","\n"," # Instantiate an accuracy metric.\n"," accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n"," optimizer = hp.Choice('optimizer', ['adam', 'SGD'])\n"," if optimizer == 'adam':\n"," opt = tf.keras.optimizers.Adam(learning_rate=hp.Float('lrate_adam',\n"," min_value=1e-4,\n"," max_value=1e-2, \n"," sampling='LOG'))\n"," else:\n"," opt = tf.keras.optimizers.SGD(learning_rate=hp.Float('lrate_sgd',\n"," min_value=1e-4,\n"," max_value=1e-2, \n"," sampling='LOG'))\n","\n"," # configure the optimizer, loss, and metrics to monitor.\n"," model.compile(optimizer=opt, loss=loss, metrics=[accuracy])\n","\n"," return model"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"3GuC7Drk2ZLC"},"source":["# path to store results\n","LOG_DIR = f\"{int(time.time())}\""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Xc7BSy102ZLI"},"source":["tuner_ = RandomSearch(model_builder_all,\n"," objective='val_binary_accuracy',\n"," max_trials=20, # how many model configurations would you like to test?\n"," executions_per_trial=1, # how many trials per variation? 
(same model could perform differently)\n"," directory=LOG_DIR,\n"," project_name=\"my_first_tuner\"\n"," )"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"p39oHKAl2aXK"},"source":["tuner_.search(train_x,\n"," train_y, \n"," epochs = 500,\n"," # verbose = 0 (silent) \n"," verbose=0,\n"," batch_size=32,\n"," validation_data = (test_x, test_y),\n"," callbacks = [ClearTrainingOutput()])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1CNxpcdg3phN"},"source":["tuner_.results_summary()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"IPU-lmv_3AHD"},"source":["tuner_.get_best_hyperparameters()[0].values"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"3TUz0_MtWf9e"},"source":["tuner_.search_space_summary()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"NmAuZCk8gAbP"},"source":["## 2.4 Retrain the model with the optimal hyperparameters"]},{"cell_type":"code","metadata":{"id":"pZi_vqSUcX5p"},"source":["# Build the model with the optimal hyperparameters and train it on the data\n","best_hps = tuner_.get_best_hyperparameters()[0]\n","model = tuner_.hypermodel.build(best_hps)\n","model.fit(train_x, train_y, epochs = 500, validation_data = (test_x, test_y),batch_size=32)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"8simDMO93_Cb"},"source":["loss_, acc_ = model.evaluate(x=test_x,y=test_y, batch_size=32)\n","print('Test loss: %.3f - acc: %.3f' % (loss_, acc_))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"TyDo-xOti6_4"},"source":["model.summary()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"m25iq5vkgpg0"},"source":["Exercise\n","\n","Hyperparameter tuning is a time-consuming task. The previous result was not so good. You can try to improve it the tuning considering:\n","- Other [Tuners](https://keras-team.github.io/keras-tuner/documentation/tuners/): BayesianOptimization, Hyperband\n","- Evaluate **max_trials** ranges over 100 or more?\n","- **executions_per_trial** values in [2,3]?\n","- How about you write an article on Medium about Keras Tuner?"]},{"cell_type":"code","metadata":{"id":"4BOmmZWxiInr"},"source":["# PUT YOUR CODE HERE"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"5QL9wGBN363h"},"source":["# 3.0 References"]},{"cell_type":"markdown","metadata":{"id":"2Hfqjn0sXQVh"},"source":["1. https://www.kaggle.com/fchollet/moa-keras-kerastuner-best-practices/\n","2. https://www.kaggle.com/fchollet/titanic-keras-kerastuner-best-practices\n","3. https://www.kaggle.com/fchollet/keras-kerastuner-best-practices\n","4. https://pythonprogramming.net/keras-tuner-optimizing-neural-network-tutorial/\n","5. https://github.com/keras-team/keras-tuner\n","6. https://machinelearningmastery.com/autokeras-for-classification-and-regression/"]}]}
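The closing exercise of the notebook suggests trying other tuners. A minimal sketch of one option follows, swapping `RandomSearch` for `Hyperband`; it reuses `model_builder`, the train/test arrays and the `ClearTrainingOutput` callback defined in the notebook above, and the `max_epochs`/`factor` values are only illustrative defaults, not recommendations.

```python
# Sketch for the exercise: same search space, Hyperband tuner instead
# of RandomSearch. Reuses model_builder, train_x/train_y, test_x/test_y
# and ClearTrainingOutput from the notebook above.
import time
import keras_tuner as kt

tuner_hb = kt.Hyperband(model_builder,
                        objective='val_binary_accuracy',
                        max_epochs=500,    # upper bound of epochs per trial
                        factor=3,          # successive-halving reduction factor
                        directory=f"{int(time.time())}",
                        project_name="hyperband_tuner")

tuner_hb.search(train_x, train_y,
                epochs=500,
                batch_size=32,
                validation_data=(test_x, test_y),
                callbacks=[ClearTrainingOutput()])

best_hps = tuner_hb.get_best_hyperparameters()[0]
print(best_hps.values)
```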
--------------------------------------------------------------------------------
/lessons/week_14/Task #02 Hyperparameter Tuning using Weights and Biases.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Week #06 Task #02 Hyperparameter Tuning using Weights and Biases.ipynb","provenance":[],"collapsed_sections":[],"authorship_tag":"ABX9TyNepLrtPIfGc4cSKw0HDKrn"},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"RIf7eFgwyZ06"},"source":["# 1 Import libraries"]},{"cell_type":"code","metadata":{"id":"L4OeKOISEobo"},"source":["import tensorflow as tf\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import h5py\n","import time\n","import datetime\n","import pytz\n","import IPython"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"aRdkrAFXxYcb"},"source":["print('TF version:', tf.__version__)\n","print('GPU devices:', tf.config.list_physical_devices('GPU'))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"MSK_tRj1yipL"},"source":["# 2 Data load and preprocessing"]},{"cell_type":"code","metadata":{"id":"z7Rhm9scwVx1"},"source":["# download train_catvnoncat.h5\n","!gdown https://drive.google.com/uc?id=1ZPWKlEATuDjFtZJPgHCc5SURrcKaVP9Z\n","\n","# download test_catvnoncat.h5\n","!gdown https://drive.google.com/uc?id=1ndRNAwidOqEgqDHBurA0PGyXqHBlvzz-"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"evhmAkbKxYy_"},"source":["def load_dataset():\n"," # load the train data\n"," train_dataset = h5py.File('train_catvnoncat.h5', \"r\")\n","\n"," # your train set features\n"," train_set_x_orig = np.array(train_dataset[\"train_set_x\"][:]) \n","\n"," # your train set labels\n"," train_set_y_orig = np.array(train_dataset[\"train_set_y\"][:]) \n","\n"," # load the test data\n"," test_dataset = h5py.File('test_catvnoncat.h5', \"r\")\n","\n"," # your test set features\n"," test_set_x_orig = np.array(test_dataset[\"test_set_x\"][:]) \n","\n"," # your test set labels \n"," test_set_y_orig = np.array(test_dataset[\"test_set_y\"][:]) \n","\n"," # the list of classes\n"," classes = np.array(test_dataset[\"list_classes\"][:]) \n","\n"," # reshape the test data\n"," train_set_y_orig = train_set_y_orig.reshape((train_set_y_orig.shape[0],1))\n"," test_set_y_orig = test_set_y_orig.reshape((test_set_y_orig.shape[0],1))\n","\n"," return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"mjQYsrdPHSyh"},"source":["# Loading the data (cat/non-cat)\n","train_set_x_orig, train_y, test_set_x_orig, test_y, classes = load_dataset()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"6ihL83MbHlhc"},"source":["# Reshape the training and test examples\n","train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0],-1)\n","test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0],-1)\n","\n","# Standardize the dataset\n","train_x = train_set_x_flatten/255\n","test_x = test_set_x_flatten/255"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"sQn-Brt-IhK_"},"source":["print (\"train_x shape: \" + str(train_x.shape))\n","print (\"train_y shape: \" + str(train_y.shape))\n","print (\"test_x shape: \" + str(test_x.shape))\n","print (\"test_y shape: \" + str(test_y.shape))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"ATILeNhddrcC"},"source":["# visualize a sample modified data\n","index = 
13\n","plt.imshow(train_x[index].reshape(64,64,3))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Xa9dcSRx1__x"},"source":["# visualize a sample raw data\n","index = 13\n","plt.imshow(train_set_x_orig[index])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"DvACru73ysxG"},"source":["class MyCustomCallback(tf.keras.callbacks.Callback):\n","\n"," def on_train_begin(self, batch, logs=None):\n"," self.begins = time.time()\n"," print('Training: begins at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime(\"%a, %d %b %Y %H:%M:%S\")))\n","\n"," def on_train_end(self, logs=None):\n"," print('Training: ends at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime(\"%a, %d %b %Y %H:%M:%S\")))\n"," print('Duration: {:.2f} seconds'.format(time.time() - self.begins)) "],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"iyHu7kEeyrh-"},"source":["# 3 Base Model"]},{"cell_type":"code","metadata":{"id":"HPkKk-ZayvfU"},"source":["# Instantiate a simple classification model\n","model = tf.keras.Sequential([\n"," tf.keras.layers.Dense(8, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(8, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64')\n","])\n","\n","# Instantiate a logistic loss function that expects integer targets.\n","loss = tf.keras.losses.BinaryCrossentropy()\n","\n","# Instantiate an accuracy metric.\n","accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n","# Instantiate an optimizer.\n","optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)\n","\n","# configure the optimizer, loss, and metrics to monitor.\n","model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])\n","\n","# training \n","history = model.fit(x=train_x,\n"," y=train_y,\n"," batch_size=32,\n"," epochs=500,\n"," validation_data=(test_x,test_y),\n"," callbacks=[MyCustomCallback()],\n"," verbose=1)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"nLdVTFjGyyDe"},"source":["loss, acc = model.evaluate(x=train_x,y=train_y, batch_size=32)\n","print('Train loss: %.4f - acc: %.4f' % (loss, acc))\n","\n","loss_, acc_ = model.evaluate(x=test_x,y=test_y, batch_size=32)\n","print('Test loss: %.4f - acc: %.4f' % (loss_, acc_))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"RooPTo2ezBBh"},"source":["# 4 Hyperparameter Tuning "]},{"cell_type":"code","metadata":{"id":"ESSHH5_UzQ3o"},"source":["%%capture\n","!pip install wandb==0.10.17"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Y_Tjh1Sbz1tJ"},"source":["!wandb login --relogin"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"kKxNLU83GV4q"},"source":["## 4.1 Monitoring a neural network"]},{"cell_type":"code","metadata":{"id":"Hi7lXapOz50x"},"source":["import wandb\n","from wandb.keras import WandbCallback\n","from tensorflow.keras.callbacks import EarlyStopping\n","\n","# Default values for hyperparameters\n","defaults = dict(layer_1 = 8,\n"," layer_2 = 8,\n"," learn_rate = 0.001,\n"," batch_size = 32,\n"," epoch = 500)\n","\n","wandb.init(project=\"week06\", config= defaults, name=\"week06_run_01\")\n","config = wandb.config"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"-bKBiY1q1QdJ"},"source":["# Instantiate a simple classification model\n","model = tf.keras.Sequential([\n"," tf.keras.layers.Dense(config.layer_1, 
activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64')\n","])\n","\n","# Instantiate a logistic loss function that expects integer targets.\n","loss = tf.keras.losses.BinaryCrossentropy()\n","\n","# Instantiate an accuracy metric.\n","accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n","# Instantiate an optimizer.\n","optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)\n","\n","# configure the optimizer, loss, and metrics to monitor.\n","model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"TuzJG3XG2jjw"},"source":["%%wandb\n","# Add WandbCallback() to the fit function\n","model.fit(x=train_x,\n"," y=train_y,\n"," batch_size=config.batch_size,\n"," epochs=config.epoch,\n"," validation_data=(test_x,test_y),\n"," callbacks=[WandbCallback(log_weights=True)],\n"," verbose=1)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"vh562csI3WFO"},"source":["## 4.2 Sweeps"]},{"cell_type":"code","metadata":{"id":"mo_cE96gG8Tq"},"source":[" # The sweep calls this function with each set of hyperparameters\n","def train():\n"," # Default values for hyper-parameters we're going to sweep over\n"," defaults = dict(layer_1 = 8,\n"," layer_2 = 8,\n"," learn_rate = 0.001,\n"," batch_size = 32,\n"," epoch = 500)\n"," \n"," # Initialize a new wandb run\n"," wandb.init(project=\"week06\", config= defaults)\n","\n"," # Config is a variable that holds and saves hyperparameters and inputs\n"," config = wandb.config\n"," \n"," # Instantiate a simple classification model\n"," model = tf.keras.Sequential([\n"," tf.keras.layers.Dense(config.layer_1, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu, dtype='float64'),\n"," tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, dtype='float64')\n"," ])\n","\n"," # Instantiate a logistic loss function that expects integer targets.\n"," loss = tf.keras.losses.BinaryCrossentropy()\n","\n"," # Instantiate an accuracy metric.\n"," accuracy = tf.keras.metrics.BinaryAccuracy()\n","\n"," # Instantiate an optimizer.\n"," optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)\n","\n"," # configure the optimizer, loss, and metrics to monitor.\n"," model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy]) \n","\n"," model.fit(train_x, train_y, batch_size=config.batch_size,\n"," epochs=config.epoch,\n"," validation_data=(test_x, test_y),\n"," callbacks=[WandbCallback(),\n"," EarlyStopping(patience=100)]\n"," ) "],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"S6whtzZ1eior"},"source":["# See the source code in order to see other parameters\n","# https://github.com/wandb/client/tree/master/wandb/sweeps"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1Ibov1wrLgS9"},"source":["# Configure the sweep – specify the parameters to search through, the search strategy, the optimization metric et all.\n","sweep_config = {\n"," 'method': 'random', #grid, random\n"," 'metric': {\n"," 'name': 'binary_accuracy',\n"," 'goal': 'maximize' \n"," },\n"," 'parameters': {\n"," 'layer_1': {\n"," 'max': 32,\n"," 'min': 8,\n"," 'distribution': 'int_uniform',\n"," },\n"," 'layer_2': {\n"," 'max': 32,\n"," 'min': 8,\n"," 'distribution': 'int_uniform',\n"," },\n"," 'learn_rate': {\n"," 'min': -4,\n"," 'max': 
-2,\n"," 'distribution': 'log_uniform', \n"," },\n"," 'epoch': {\n"," 'values': [300,400,600]\n"," },\n"," 'batch_size': {\n"," 'values': [32,64]\n"," }\n"," }\n","}"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1rCRIA2HMG1Y"},"source":["# Initialize a new sweep\n","# Arguments:\n","# – sweep_config: the sweep config dictionary defined above\n","# – entity: Set the username for the sweep\n","# – project: Set the project name for the sweep\n","sweep_id = wandb.sweep(sweep_config, entity=\"ivanovitchm\", project=\"week06\")"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"1V0Cobv4MahM"},"source":["# Initialize a new sweep\n","# Arguments:\n","# – sweep_id: the sweep_id to run - this was returned above by wandb.sweep()\n","# – function: function that defines your model architecture and trains it\n","wandb.agent(sweep_id = sweep_id, function=train,count=20)"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"P_dpK2SHjIlu"},"source":["### 4.2.1 Restore a model\n","\n","Restore a file, such as a model checkpoint, into your local run folder to access in your script.\n","\n","See [the restore docs](https://docs.wandb.com/library/restore) for more details."]},{"cell_type":"code","metadata":{"id":"-ikTvdN61mm0"},"source":["%%capture\n","!pip install wandb==0.10.17"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"V8dR38tFEpby"},"source":["!pip install wandb"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"xi-Fv1nfAU6q","colab":{"base_uri":"https://localhost:8080/","height":35},"executionInfo":{"status":"ok","timestamp":1628246847147,"user_tz":180,"elapsed":799,"user":{"displayName":"Ivanovitch Silva","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Git9r91cROvzBPiAlvwQtPMEFxLz44uDidMPM-PrQ=s64","userId":"06428777505436195303"}},"outputId":"cff675dc-f568-495d-d032-92d48ea5d47a"},"source":[" import wandb\n"," wandb.__version__"],"execution_count":2,"outputs":[{"output_type":"execute_result","data":{"application/vnd.google.colaboratory.intrinsic+json":{"type":"string"},"text/plain":["'0.11.2'"]},"metadata":{"tags":[]},"execution_count":2}]},{"cell_type":"code","metadata":{"id":"5zhf0Gix1nYD","colab":{"base_uri":"https://localhost:8080/"},"executionInfo":{"status":"ok","timestamp":1628246856152,"user_tz":180,"elapsed":5630,"user":{"displayName":"Ivanovitch Silva","photoUrl":"https://lh3.googleusercontent.com/a-/AOh14Git9r91cROvzBPiAlvwQtPMEFxLz44uDidMPM-PrQ=s64","userId":"06428777505436195303"}},"outputId":"7e4ab300-d488-4807-a689-15b628270563"},"source":["!wandb login"],"execution_count":3,"outputs":[{"output_type":"stream","text":["\u001b[34m\u001b[1mwandb\u001b[0m: You can find your API key in your browser here: https://wandb.ai/authorize\n","\u001b[34m\u001b[1mwandb\u001b[0m: Paste an API key from your profile and hit enter: \n","\u001b[34m\u001b[1mwandb\u001b[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"0LB6j3O-jIsd"},"source":["# restore the raw model file \"model-best.h5\" from a specific run by user \"ivanovitchm\"\n","# in project \"lesson04\" from run \"sqdv5ccj\"\n","best_model = wandb.restore('model-best.h5', run_path=\"ivanovitchm/week06/cbwfq70j\")"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"wo_JI5RPHzKu"},"source":["# restore the model for tf.keras\n","model = 
tf.keras.models.load_model(best_model.name)"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"aeM9gLcrDiz7"},"source":["# execute the loss and accuracy using the test dataset\n","loss_, acc_ = model.evaluate(x=test_x,y=test_y, batch_size=64)\n","print('Test loss: %.3f - acc: %.3f' % (loss_, acc_))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"a55_JCKuR4kJ"},"source":["# source: https://github.com/wandb/awesome-dl-projects/blob/master/ml-tutorial/EMNIST_Dense_Classification.ipynb\n","import seaborn as sns\n","from sklearn.metrics import confusion_matrix\n","\n","predictions = np.greater_equal(model.predict(test_x),0.5).astype(int)\n","cm = confusion_matrix(y_true = test_y, y_pred = predictions)\n","\n","plt.figure(figsize=(6,6));\n","sns.heatmap(cm, annot=True)\n","plt.savefig('confusion_matrix.png', bbox_inches='tight')\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"wTu0f6DiR7oW"},"source":["wandb.init(project=\"week06\")\n","wandb.log({\"image_confusion_matrix\": [wandb.Image('confusion_matrix.png')]})"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"XM5q_a_zqMY0"},"source":["# visualize the images and instances with error\n","# ground-truth\n","print(\"Ground-truth\\n\",test_y[~np.equal(predictions,test_y)])\n","\n","# predictions\n","print(\"Predictions\\n\",predictions[~np.equal(predictions,test_y)])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"9iJtJ-wWvvl5"},"source":["# Images predicted as non-cat\n","fig, ax = plt.subplots(2,6,figsize=(10,6))\n","wrong_images = (~np.equal(predictions,test_y)).astype(int)\n","index = np.where(wrong_images == 1)[0]\n","\n","for i,value in enumerate(index):\n"," ax[i//6,i%6].imshow(test_x[value].reshape(64,64,3))\n","plt.savefig('wrong_predictions.png', bbox_inches='tight')"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"LnywcEUHxvL_"},"source":["wandb.log({\"wrong_predictions\": [wandb.Image('wrong_predictions.png')]})"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"jhBR6ePBvHy7"},"source":["# 5 References"]},{"cell_type":"markdown","metadata":{"id":"Hb3mCmDDvJjw"},"source":["1. https://github.com/wandb/awesome-dl-projects\n","2. https://docs.wandb.ai/app/features/panels/parameter-importance\n","3. https://wandb.ai/wandb/DistHyperOpt/reports/Modern-Scalable-Hyperparameter-Tuning-Methods--VmlldzoyMTQxODM"]}]}
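In the restore cells above, the run path `ivanovitchm/week06/cbwfq70j` is hard-coded. One alternative, sketched below with the W&B public API, is to rank the project's runs by the metric they logged during the sweep; this assumes the runs stored `val_binary_accuracy` in their summary and that the entity/project are the same ones used in the notebook.

```python
# Sketch: find the best run of the week06 project programmatically,
# assuming each run's summary contains "val_binary_accuracy".
import wandb

api = wandb.Api()
runs = [r for r in api.runs("ivanovitchm/week06")
        if r.summary.get("val_binary_accuracy") is not None]

best_run = max(runs, key=lambda r: r.summary["val_binary_accuracy"])
print(best_run.id, best_run.summary["val_binary_accuracy"])

# the id can then replace the hard-coded run path used above, e.g.:
# wandb.restore('model-best.h5', run_path=f"ivanovitchm/week06/{best_run.id}")
```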
--------------------------------------------------------------------------------
/lessons/week_14/Task #03 Batch Normalization.ipynb:
--------------------------------------------------------------------------------
1 | {"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Week #06 Track #03 Batch Normalization.ipynb","provenance":[],"collapsed_sections":[],"toc_visible":true,"authorship_tag":"ABX9TyNGK9TN93fpHnMO0nkGhzBz"},"kernelspec":{"name":"python3","display_name":"Python 3"},"accelerator":"GPU"},"cells":[{"cell_type":"markdown","metadata":{"id":"r6LB4RyNr0pn"},"source":["# 1 - Accelerate Learning with Batch Normalization"]},{"cell_type":"markdown","metadata":{"id":"MwJEaq7SsJYQ"},"source":["**Training deep neural networks** with tens of layers is challenging as they can be **sensitive to the initial random weights** and configuration of the learning algorithm. \n","\n","One possible reason for this difficulty is: \n","\n","> the distribution of the inputs to layers deep in the network may change after\n","each minibatch when the weights are updated. \n","\n","This can cause the learning algorithm to chase a moving target forever. This change in the distribution of inputs to layers in the network is referred to by the technical name **internal covariate shift**. \n","\n","**Batch normalization** is a technique for training very deep neural networks that standardize each minibatch layer's inputs. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks. This section will discover the batch normalization method used to accelerate deep learning neural networks training. After\n","reading this section, you will know:\n","\n","- Deep neural networks are challenging to train, not least because the input from prior layers can change after weight updates.\n","\n","- Batch normalization is a technique to standardize the inputs to a network, applied to either the activations of a prior layer or inputs directly.\n","\n","- Batch normalization accelerates training, in some cases by halving the number of epochs (or better), and provides some regularization effect, reducing generalization error."]},{"cell_type":"markdown","metadata":{"id":"OH9mCk8fvRAH"},"source":["## 1.1 Batch Normalization"]},{"cell_type":"markdown","metadata":{"id":"m0DvdGy5EH1d"},"source":["Training deep neural networks, e.g., networks with tens of hidden layers, is challenging. One aspect of this challenge is that the model is updated layer-by-layer backward from the output to the input using an **estimate of error that assumes the weights in the layers prior to the current\n","the layer are fixed**.\n","\n","> Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously.\n","\n","**Because all layers are changed during an update**, the update procedure is forever chasing a moving target. For example, the weights of a layer are updated given an expectation that the prior layer outputs values with a given distribution. This distribution is likely changed after the\n","weights of the prior layer are updated.\n","\n","\n","> Training Deep Neural Networks is complicated by the fact that **the distribution of each layer's inputs changes during training as the parameters of the previous layers changes**. 
This slows down the training by **requiring lower learning rates** and **careful parameter initialization**, making it notoriously hard to train models with saturating nonlinearities."]},{"cell_type":"markdown","metadata":{"id":"YUR6FVtmE0My"},"source":["## 1.2 Standardize Layer Inputs"]},{"cell_type":"markdown","metadata":{"id":"6pCPWZmKF0yV"},"source":["Batch normalization, or **batch norm** for short, is [proposed as a technique](https://arxiv.org/pdf/1502.03167.pdf) to help coordinate the update of multiple layers in the model.\n","\n","> Batch normalization provides an elegant way of reparametrizing almost any deep network. The reparametrization significantly **reduces the problem of coordinating updates across many layers**.\n","\n","It does this by scaling the layer's output, specifically by **standardizing the activations of each input variable per minibatch**, such as the activations of a node from the previous layer. Recall that standardization refers to rescaling data to have a **mean of zero** and a **standard deviation of one**, e.g., a standard Gaussian.\n","\n","Standardizing the activations of the prior layer means that assumptions the subsequent layer **makes about the spread and distribution of inputs during the weight update will not change**, at least not dramatically. This has the effect of stabilizing and speeding-up the training process of deep neural networks.\n","\n","> Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning but allows the relationships between units and the nonlinear statistics of a single unit to change.\n","\n","Normalizing the inputs to the layer affects the model's training, dramatically reducing the number of epochs required. **It can also have a regularizing effect**, reducing generalization error much like the use of activation regularization.\n","\n","Although **reducing internal covariate shift** was a motivation in the development of the method,\n","there is some suggestion that instead batch normalization is effective because it smooths and, in\n","turn, **simplifies the optimization function that is being solved when training the network**.\n","\n","> According to a [recent paper](https://arxiv.org/pdf/1805.11604.pdf), BatchNorm impacts network training fundamentally: **it makes the landscape of the corresponding optimization problem be significantly more smooth**. This ensures, in particular, that the gradients are more predictive and thus allow for the use of a more extensive range of learning rates and faster network convergence."]},{"cell_type":"markdown","metadata":{"id":"I-o0htyLGIZT"},"source":["## 1.3 How to Standardize Layer Inputs"]},{"cell_type":"markdown","metadata":{"id":"KwnLPVyc9mLL"},"source":["Batch normalization can be **implemented during training by calculating each input variable's mean and standard deviation to a layer per minibatch** and using these statistics to perform the standardization. Alternately, a running average of mean and standard deviation can be\n","maintained across mini-batches but may result in unstable training.\n","\n","This standardization of inputs may be applied to input variables for the first hidden layer or the activations from a hidden layer for deeper layers. In practice, it is common to allow the layer to learn two new parameters, namely a new mean and standard deviation, **Beta** and\n","**Gamma** respectively, that allow the automatic scaling and shifting of the standardized layer inputs. 
The model learns these parameters as part of the training process.\n","\n","> Note that simply normalizing each input of a layer may change what the layer can represent. **These parameters are learned along with the original model parameters and restore the network's representation power**.\n","\n","Significantly, the backpropagation algorithm is updated to operate upon the transformed inputs, and error is also used to update the new scale and shifting parameters learned by the model. The standardization is applied to the inputs to the layer, namely the input variables or the output of the activation function from the last layer. Given the choice of activation function, the input distribution to the layer may be pretty non-Gaussian. In this case, there may be a benefit in standardizing the summed activation before the activation function in the previous layer.\n","\n","\n","> **We add the BN transform immediately before the nonlinearity**. We could have also normalized the layer inputs *u*, but since *u* is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift."]},{"cell_type":"markdown","metadata":{"id":"s_hUcvfK_1Fc"},"source":["## 1.4 Tips for Using Batch Normalization"]},{"cell_type":"markdown","metadata":{"id":"7_bQaaBhLA_0"},"source":["This section provides tips and suggestions for using batch normalization with your own neural networks.\n","\n","**Use With Different Network Types**\n","\n","> Batch normalization is a general technique that can be used to normalize the inputs to a layer. It can be used with most network types, such as **Multilayer Perceptrons**, **Convolutional Neural Networks**, and **Recurrent Neural Networks**.\n","\n","\n","**Probably Use Before the Activation**\n","\n","> Batch normalization may be used on the inputs to the layer before or after the activation function in the previous layer. It may be more **appropriate after the activation function for s-shaped functions** like the hyperbolic tangent and logistic function. It may be appropriate **before the activation function** for activations that may result in non-Gaussian distributions like\nthe **rectified linear activation function**, the modern default for most network types.\n","\n","The goal of Batch Normalization is to achieve a stable distribution of activation values throughout training. In experiments conducted in the [original paper](https://arxiv.org/pdf/1502.03167.pdf), the authors applied it before the nonlinearity since matching the first and second moments is more likely to result in a stable distribution.\n","\n","**Use Large Learning Rates**\n","\n","> Using batch normalization makes the network more stable during training. This may allow the use of much larger learning rates than usual, which may further speed up the learning process.\n","\n","**Less Sensitive to Weight Initialization**\n","\n","> Deep neural networks can be pretty sensitive to the technique used to initialize the weights before training. The stability brought to training by batch normalization can make training deep networks less sensitive to the choice of weight initialization method.\n","\n","**Do not Use With Dropout**\n","\n","> Batch normalization offers some regularization effect, reducing generalization error, perhaps no longer requiring dropout for regularization.\n","\n","Further, it may not be good to use batch normalization and dropout in the same network. 
The reason is that the statistics used to normalize the prior layer's activations may become noisy given the random dropping out of nodes during the dropout procedure."]},{"cell_type":"markdown","metadata":{"id":"WPSCaslMNf-1"},"source":["## 1.5 Batch Normalization Case Study"]},{"cell_type":"code","metadata":{"id":"Hwvh-fPgTd7Q"},"source":["# scatter plot of the circles dataset with points colored by class\n","from sklearn.datasets import make_circles\n","import numpy as np\n","import matplotlib.pyplot as plt\n","\n","# generate circles\n","x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\n","\n","# select indices of points with each class label\n","for i in range(2):\n","\tsamples_ix = np.where(y == i)\n","\tplt.scatter(x[samples_ix, 0], x[samples_ix, 1], label=str(i))\n","plt.legend()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"D0ZGE8hHUfBd"},"source":["### 1.5.1 Multilayer Perceptron Model"]},{"cell_type":"code","metadata":{"id":"HPNBZYQGzjeq"},"source":["%%capture\n","!pip install wandb==0.10.17"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"mhJxFQyFzmea"},"source":["!wandb login"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"zfmevaH6UNns"},"source":["# mlp for the two circles problem\n","from sklearn.datasets import make_circles\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.optimizers import SGD\n","import matplotlib.pyplot as plt\n","import wandb\n","from wandb.keras import WandbCallback"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"xrS0KFjJgQxl"},"source":["# Default values for hyperparameters\n","defaults = dict(layer_1 = 50,\n"," learn_rate = 0.01,\n"," batch_size = 32,\n"," epoch = 100)\n","\n","wandb.init(project=\"week06_bn\", \n"," config= defaults, \n"," name=\"week06_bn_run_01\")\n","config = wandb.config"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"QWWarzqNgPDf"},"source":["%%wandb\n","\n","# generate 2d classification dataset\n","x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\n","\n","# split into train and test\n","n_train = 500\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model = Sequential()\n","model.add(Dense(config.layer_1, input_dim=2, \n"," activation='relu', \n"," kernel_initializer='he_uniform'))\n","model.add(Dense(1, activation='sigmoid'))\n","opt = SGD(learning_rate=config.learn_rate, momentum=0.9)\n","model.compile(loss='binary_crossentropy', \n"," optimizer=opt, metrics=['accuracy'])\n","\n","# fit model\n","history = model.fit(train_x, train_y, \n"," validation_data=(test_x, test_y), \n"," epochs=config.epoch, verbose=0, \n"," batch_size=config.batch_size,\n"," callbacks=[WandbCallback(log_weights=True,\n"," log_gradients=True,\n"," training_data=(train_x,train_y))])\n","\n","# for more elaborate results please see the project in wandb"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"BM68FYWygrql"},"source":["# evaluate the model\n","_, train_acc = model.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history.history['loss'], 
label='train')\n","plt.plot(history.history['val_loss'], label='test')\n","plt.legend()\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history.history['accuracy'], label='train')\n","plt.plot(history.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"574ahaQXcmfs"},"source":["model.summary()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"Pgf9llT9VF2P"},"source":["### 1.5.2 Multilayer Perceptron with Batch Normalization"]},{"cell_type":"markdown","metadata":{"id":"bpQNrJNZWipY"},"source":["The model introduced in the previous section can be updated to add batch normalization. The expectation is that batch normalization would accelerate the training process, offering similar or better classification accuracy in fewer training epochs. Batch normalization is also reported as providing a subtle form of regularization, meaning that it may also offer a slight reduction in generalization error demonstrated by a small increase in classification accuracy on the holdout test dataset. A new BatchNormalization layer can be added to the model after the hidden layer before the output layer. Specifically, after the activation function of the last hidden layer."]},{"cell_type":"code","metadata":{"id":"tuKBhad8W6yv"},"source":["# mlp for the two circles problem with batchnorm after activation function\n","from sklearn.datasets import make_circles\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.layers import BatchNormalization\n","from tensorflow.keras.optimizers import SGD\n","import matplotlib.pyplot as plt\n","import wandb\n","from wandb.keras import WandbCallback"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"4PMvo2EmhZhH"},"source":["# Default values for hyperparameters\n","defaults = dict(layer_1 = 50,\n"," learn_rate = 0.01,\n"," batch_size = 32,\n"," epoch = 100)\n","\n","wandb.init(project=\"week06_bn\", config= defaults, name=\"week06_bn_run_02\")\n","config = wandb.config"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"NHw-g-THhQWv"},"source":["%%wandb\n","\n","# mlp for the two circles problem with batchnorm after activation function\n","from sklearn.datasets import make_circles\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.layers import BatchNormalization\n","from tensorflow.keras.optimizers import SGD\n","import matplotlib.pyplot as plt\n","\n","# generate 2d classification dataset\n","x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\n","\n","# split into train and test\n","n_train = 500\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model = Sequential()\n","model.add(Dense(config.layer_1, \n"," input_dim=2, \n"," activation='relu', kernel_initializer='he_uniform'))\n","model.add(BatchNormalization())\n","model.add(Dense(1, activation='sigmoid'))\n","opt = SGD(learning_rate=config.learn_rate, momentum=0.9)\n","model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\n","\n","# fit model\n","history = model.fit(train_x, train_y,\n"," validation_data=(test_x, test_y), \n"," epochs=config.epoch, verbose=0,\n"," batch_size=config.batch_size,\n"," 
callbacks=[WandbCallback(log_weights=True,\n"," log_gradients=True,\n"," training_data=(train_x,train_y))])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"btAlfnKkhoqM"},"source":["# evaluate the model\n","_, train_acc = model.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history.history['loss'], label='train')\n","plt.plot(history.history['val_loss'], label='test')\n","plt.legend()\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history.history['accuracy'], label='train')\n","plt.plot(history.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"rPOC2Z-uc6tU"},"source":["# tensorflow.keras uses non-trainable params with batch normalization\n","# in order to maintain auxiliary variables used in inference\n","model.summary()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"oksQ49r9XsZT"},"source":["In this case, we can see the model's comparable performance on both the train and test set of about 84% accuracy, very similar to what we saw in the previous section, if not a little bit better.\n","\n","A graph of the learning curves is also created, showing classification accuracy on the train and test sets at each training epoch. In this case, we can see that the model has learned the problem faster than the model in the previous section without batch normalization. Specifically,\n**we can see that classification accuracy on the train and test datasets leaps above 80% within the first 20 epochs instead of 30-to-40 epochs in the model without batch normalization**. The plot also shows the effect of batch normalization during training. We can see lower performance\non the training dataset than on the test dataset at the end of the training run. 
This is likely the effect of the normalization statistics collected and updated over each minibatch.\n"]},{"cell_type":"markdown","metadata":{"id":"eDa2D7dCY6f8"},"source":["We can also try a variation of the model where batch normalization is applied prior to the activation function of the hidden layer, instead of after the activation function."]},{"cell_type":"code","metadata":{"id":"Hcawb4h2ZJy7"},"source":["# mlp for the two circles problem with batchnorm before activation function\n","from sklearn.datasets import make_circles\n","from tensorflow.keras.models import Sequential\n","from tensorflow.keras.layers import Dense\n","from tensorflow.keras.layers import Activation\n","from tensorflow.keras.layers import BatchNormalization\n","from tensorflow.keras.optimizers import SGD\n","import matplotlib.pyplot as plt"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"edqyvj9cipVV"},"source":["# Default values for hyperparameters\n","defaults = dict(layer_1 = 50,\n"," learn_rate = 0.01,\n"," batch_size = 32,\n"," epoch = 100)\n","\n","wandb.init(project=\"week06_bn\", config= defaults, name=\"week06_bn_run_03\")\n","config = wandb.config"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"MUtckY_8iJzq"},"source":["%%wandb\n","# generate 2d classification dataset\n","x, y = make_circles(n_samples=1000, noise=0.1, random_state=1)\n","\n","# split into train and test\n","n_train = 500\n","train_x, test_x = x[:n_train, :], x[n_train:, :]\n","train_y, test_y = y[:n_train], y[n_train:]\n","\n","# define model\n","model = Sequential()\n","model.add(Dense(config.layer_1, input_dim=2, kernel_initializer='he_uniform'))\n","model.add(BatchNormalization())\n","model.add(Activation('relu'))\n","model.add(Dense(1, activation='sigmoid'))\n","opt = SGD(learning_rate=config.learn_rate, momentum=0.9)\n","model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])\n","\n","# fit model\n","history = model.fit(train_x, train_y, \n"," validation_data=(test_x, test_y), \n"," epochs=config.epoch, verbose=0,\n"," batch_size=config.batch_size,\n"," callbacks=[WandbCallback(log_weights=True,\n"," log_gradients=True,\n"," training_data=(train_x,train_y))])"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"kIaCk10iTnVX"},"source":["# tensorflow.keras uses non-trainable params with batch normalization\n","# in order to maintain auxiliary variables used in inference\n","model.summary()"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"rFTRwvw9iXXs"},"source":["# evaluate the model\n","_, train_acc = model.evaluate(train_x, train_y, verbose=0)\n","_, test_acc = model.evaluate(test_x, test_y, verbose=0)\n","print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))\n","\n","# plot loss learning curves\n","plt.subplot(211)\n","plt.title('Cross-Entropy Loss', pad=-40)\n","plt.plot(history.history['loss'], label='train')\n","plt.plot(history.history['val_loss'], label='test')\n","plt.legend()\n","# plot accuracy learning curves\n","plt.subplot(212)\n","plt.title('Accuracy', pad=-40)\n","plt.plot(history.history['accuracy'], label='train')\n","plt.plot(history.history['val_accuracy'], label='test')\n","plt.legend()\n","plt.tight_layout()\n","plt.show()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"eP4yiqS9Zqe0"},"source":["In this case, we can see the model's comparable performance on the train and test datasets, but slightly worse than the model without batch normalization.\n","\n","The line plot of the learning curves on the train 
and test sets also tells a different story. The plot shows the model learning perhaps at the same pace as the model without batch normalization, but the model's performance on the training dataset is much worse, hovering around 70% to 75% accuracy, again likely an effect of the statistics collected and used over each minibatch. At least for this model configuration on this specific dataset, it appears that batch normalization is more effective after the rectified linear activation function."]},{"cell_type":"markdown","metadata":{"id":"H8arWJv1ajh7"},"source":["### 1.5.3 Extensions"]},{"cell_type":"markdown","metadata":{"id":"5ABzuop7atwz"},"source":["This section lists some ideas for extending the case study that you may wish to explore.\n","\n","- **Without Beta and Gamma**: update the example to not use the beta and gamma parameters in the batch normalization layer and compare results.\n","- **Without Momentum**: update the example not to use momentum in the batch normalization layer during training and compare results.\n","- **Input Layer**: update the example to use batch normalization after the input to the model and compare results."]}]}
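
The case study above exercises Keras' `BatchNormalization` layer directly; the short NumPy sketch below makes the per-minibatch arithmetic described in Sections 1.2 and 1.3 explicit (standardize with the minibatch mean and variance, then scale by Gamma and shift by Beta). It is a didactic illustration with made-up values, not the layer's full implementation, which additionally tracks running statistics for use at inference time.

```python
import numpy as np

# one minibatch of activations: 4 examples, 3 units (made-up values)
a = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.5, 0.7],
              [2.0, 0.5, 0.9],
              [1.0, 1.0, 0.3]])

eps = 1e-3                 # small constant for numerical stability
mu = a.mean(axis=0)        # per-unit minibatch mean
var = a.var(axis=0)        # per-unit minibatch variance

# standardize: zero mean, unit variance per unit
a_hat = (a - mu) / np.sqrt(var + eps)

# learned parameters: gamma rescales and beta shifts the standardized values,
# restoring the layer's representational power (initialized to 1 and 0 here)
gamma = np.ones(a.shape[1])
beta = np.zeros(a.shape[1])
out = gamma * a_hat + beta

print(out.mean(axis=0))    # approximately 0 per unit
print(out.std(axis=0))     # approximately 1 per unit (with gamma=1, beta=0)
```

For the extensions listed above, one way to try "Without Beta and Gamma" is to build the layer as `BatchNormalization(center=False, scale=False)`, and setting `momentum=0.0` on the same layer is the analogous switch for the "Without Momentum" experiment.
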
--------------------------------------------------------------------------------