├── 9781801817479_ColorImages.pdf
├── Errata image
│   ├── Errata-Table5.1.PNG
│   └── Rsquared.png
├── Errata.md
├── LICENSE
├── README.md
├── The_Kaggle_Book.png
├── chapter_01
│   └── README.md
├── chapter_02
│   └── README.md
├── chapter_03
│   └── README.md
├── chapter_04
│   └── README.md
├── chapter_05
│   ├── README.md
│   ├── focal_loss.py
│   └── meta_kaggle.ipynb
├── chapter_06
│   ├── README.md
│   ├── adversarial-validation-example.ipynb
│   └── bootstrap.py
├── chapter_07
│   ├── README.md
│   ├── TargetEncode.py
│   ├── interesting-eda-tsne-umap.ipynb
│   ├── meta-features-and-target-encoding.ipynb
│   ├── really-not-missing-at-random.ipynb
│   ├── reduce_mem_usage.py
│   ├── seed_everything.py
│   └── tutorial-feature-selection-with-boruta-shap.ipynb
├── chapter_08
│   ├── README.md
│   ├── basic-optimization-practices.ipynb
│   ├── hacking-bayesian-optimization-for-dnns.ipynb
│   ├── hacking-bayesian-optimization.ipynb
│   ├── kerastuner-for-imdb.ipynb
│   ├── optuna-bayesian-optimization.ipynb
│   ├── scikit-optimize-for-lightgbm.ipynb
│   └── tutorial-bayesian-optimization-with-lightgbm.ipynb
├── chapter_09
│   ├── README.md
│   └── ensembling.ipynb
├── chapter_10
│   ├── README.md
│   ├── ch10-augmentations-examples.ipynb
│   ├── ch10-images-classification.ipynb
│   ├── ch10-prepare-annotations.ipynb
│   ├── ch10-segmentation-inference.ipynb
│   ├── ch10-segmentation.ipynb
│   └── chap10-object-detection-yolov5.ipynb
├── chapter_11
│   ├── README.md
│   ├── chap11-nlp-augmentations4.ipynb
│   ├── chapter11-nlp-augmentation1.ipynb
│   ├── chapter11-qanswering.ipynb
│   └── chapter11-sentiment-extraction.ipynb
├── chapter_12
│   ├── README.md
│   ├── chap12-connectx.ipynb
│   ├── chapter12-mab-santa.ipynb
│   └── chapter12-rps-notebook1.ipynb
├── chapter_13
│   └── README.md
├── chapter_14
│   └── README.md
├── contributors.jpg
└── cover.png
/9781801817479_ColorImages.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/9781801817479_ColorImages.pdf
--------------------------------------------------------------------------------
/Errata image/Errata-Table5.1.PNG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/Errata image/Errata-Table5.1.PNG
--------------------------------------------------------------------------------
/Errata image/Rsquared.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/Errata image/Rsquared.png
--------------------------------------------------------------------------------
/Errata.md:
--------------------------------------------------------------------------------
1 | # Errata, Corrections and Improvements
2 | ----------------------------------------------------
3 | If you find any mistakes in The Kaggle Book, or if you have suggestions for improvements, please [raise an issue in this repository](https://github.com/PacktPublishing/The-Kaggle-Book/issues) or email us.
4 |
5 |
6 | ## Chapter 05, Page no 116, Table 5.1 - Corrected description of the table cells
7 |
8 | (The corrected table image is available in `Errata image/Errata-Table5.1.PNG`.)
9 |
15 | Here is how we define the cells:
16 | * **TP (true positives)**: These are located in the `lower-right cell`, containing examples that have been correctly predicted as positive ones.
17 | * **FP (false positives)**: These are located in the `upper-right cell`, containing examples that have been predicted as positive but are actually negative.
18 | * **FN (false negatives)**: These are located in the `lower-left cell`, containing examples that have been predicted as negative but are actually positive.
19 | * **TN (true negatives)**: These are located in the `upper-left cell`, containing examples that have been correctly predicted as negative ones.
20 |
21 |
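For reference, the layout implied by these cell descriptions (rows are actual classes, columns are predicted classes) is:

| | Predicted negative | Predicted positive |
| :-- | :--: | :--: |
| **Actual negative** | TN | FP |
| **Actual positive** | FN | TP |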
22 | ## Chapter 05, Page no 110, Mean squared error (MSE) and R squared
23 |
24 | Following is the correct formula for `R squared`, the "coefficient of determination" (also shown in `Errata image/Rsquared.png`):
25 |
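In standard notation, with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the target:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}$$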
26 |
27 |
28 | ## Chapter 05, Page no 121, correct reference to the panel in an image
29 |
30 | The first paragraph says: "A bad classifier can be spotted by the ROC curve appearing very similar, if not identical, to the diagonal of the chart, which represents the performance of a purely random classifier, as in the top right of Figure 5.3; ROC-AUC scores near 0.5 are considered to be almost random results."
31 |
32 | ## Chapter 05, Page no 130, typo in the name of the model SSD
33 |
34 | The last line of the note should say: ".....YOLO (https://arxiv.org/abs/1506.02640v1), Faster R-CNN (https://arxiv.org/abs/1506.01497v1), or SSD (https://arxiv.org/abs/1512.02325)."
35 |
36 | ## Chapter 06, Page no 183, reference to test case number
37 |
38 | Change "only 1,495" to "about 24,500"
39 |
40 | ## Chapter 06, Page no 184, feature_19 and feature_54
41 |
42 | Instead of feature_19 and feature_54, the correct features that appear the most different between the training/test split are cont14, cont4, and cont5.
43 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2021 Packt
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |

2 |
3 | ## Machine Learning Summit 2025
4 | **Bridging Theory and Practice: ML Solutions for Today’s Challenges**
5 |
6 | 3 days, 20+ experts, and 25+ tech sessions and talks covering critical aspects of:
7 | - **Agentic and Generative AI**
8 | - **Applied Machine Learning in the Real World**
9 | - **ML Engineering and Optimization**
10 |
11 | 👉 [Book your ticket now >>](https://packt.link/mlsumgh)
12 |
13 | ---
14 |
15 | ## Join Our Newsletters 📬
16 |
17 | ### DataPro
18 | *The future of AI is unfolding. Don’t fall behind.*
19 |
20 | 
21 |
22 | Stay ahead with [**DataPro**](https://landing.packtpub.com/subscribe-datapronewsletter/?link_from_packtlink=yes), the free weekly newsletter for data scientists, AI/ML researchers, and data engineers.
23 | From trending tools like **PyTorch**, **scikit-learn**, **XGBoost**, and **BentoML** to hands-on insights on **database optimization** and real-world **ML workflows**, you’ll get what matters, fast.
24 |
25 | > Stay sharp with [DataPro](https://landing.packtpub.com/subscribe-datapronewsletter/?link_from_packtlink=yes). Join **115K+ data professionals** who never miss a beat.
26 |
27 | ---
28 |
29 | ### BIPro
30 | *Business runs on data. Make sure yours tells the right story.*
31 |
32 | 
33 |
34 | [**BIPro**](https://landing.packtpub.com/subscribe-bipro-newsletter/?link_from_packtlink=yes) is your free weekly newsletter for BI professionals, analysts, and data leaders.
35 | Get practical tips on **dashboarding**, **data visualization**, and **analytics strategy** with tools like **Power BI**, **Tableau**, **Looker**, **SQL**, and **dbt**.
36 |
37 | > Get smarter with [BIPro](https://landing.packtpub.com/subscribe-bipro-newsletter/?link_from_packtlink=yes). Trusted by **35K+ BI professionals**, see what you’re missing.
38 |
39 | # The Kaggle Book
40 | ## Data analysis and machine learning for competitive data science
41 | Code Repository for The Kaggle Book, Published by Packt Publishing
42 |
43 |
44 | "Luca and Konradˈs book helps make Kaggle even more accessible. They are both top-ranked users and well-respected members of the Kaggle community. Those who complete this book should expect to be able to engage confidently on Kaggle – and engaging confidently on Kaggle has many rewards."
45 | — Anthony Goldbloom, Kaggle Founder & CEO
46 |
47 |
48 |
49 |
50 |
51 |
53 |
54 | Key Features
55 |
56 | - Learn how Kaggle works and how to make the most of competitions from two expert Kaggle Grandmasters
57 | - Sharpen your modeling skills with ensembling, feature engineering, adversarial validation, AutoML, transfer learning, and techniques for parameter tuning
58 | - Challenge yourself with problems regarding tabular data, vision, natural language as well as simulation and optimization
59 | - Discover tips, tricks, and best practices for getting great results on Kaggle and becoming a better data scientist
60 | - Read interviews with 31 Kaggle Masters and Grandmasters sharing their experiences and tips
61 |
63 |
64 |
65 |
66 |
67 |
68 | Get a step ahead of your competitors with a concise collection of smart data handling and modeling techniques
69 |
70 | ## Getting started
71 |
72 |
73 |
74 | You can run these notebooks on cloud platforms like [Kaggle](https://www.kaggle.com/) or [Colab](https://colab.research.google.com/), or on your local machine. Note that most chapters require a GPU (and sometimes even a TPU) to run in a reasonable amount of time, so we recommend one of the cloud platforms, as they come with CUDA pre-installed.
75 |
76 |
77 |
78 | ### Running on a cloud platform
79 |
80 |
81 | To run these notebooks on a cloud platform, just click on one of the badges (Colab or Kaggle) in the table below. The code will be loaded from GitHub directly onto the chosen platform (you may have to add the necessary data before running it). Alternatively, we also provide links to the fully working original notebooks on Kaggle that you can copy and immediately run.
82 |
83 | |no| Chapter | Notebook | Colab | Kaggle |
84 | |:--| :-------- | :-------- | :-------: | :-------: |
85 | |05| Competition Tasks and Metrics| [meta_kaggle](https://www.kaggle.com/lucamassaron/meta-kaggle) | [](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_05/meta_kaggle.ipynb) | [](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_05/meta_kaggle.ipynb) |
86 | |06| Designing Good Validation| [adversarial-validation-example](https://www.kaggle.com/code/lucamassaron/adversarial-validation-example) | [](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_06/adversarial-validation-example.ipynb) | [](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_06/adversarial-validation-example.ipynb) |
87 | |07| Modeling for Tabular Competitions | [interesting-eda-tsne-umap](https://www.kaggle.com/lucamassaron/interesting-eda-tsne-umap) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/interesting-eda-tsne-umap.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/interesting-eda-tsne-umap.ipynb)|
88 | || | [meta-features-and-target-encoding](https://www.kaggle.com/lucamassaron/meta-features-and-target-encoding) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/meta-features-and-target-encoding.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/meta-features-and-target-encoding.ipynb)|
89 | || | [really-not-missing-at-random](https://www.kaggle.com/lucamassaron/really-not-missing-at-random) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/really-not-missing-at-random.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/really-not-missing-at-random.ipynb)|
90 | || | [tutorial-feature-selection-with-boruta-shap](https://www.kaggle.com/code/lucamassaron/tutorial-feature-selection-with-boruta-shap) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/tutorial-feature-selection-with-boruta-shap.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_07/tutorial-feature-selection-with-boruta-shap.ipynb)|
91 | |08| Hyperparameter Optimization | [basic-optimization-practices](https://www.kaggle.com/code/lucamassaron/basic-optimization-practices) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/basic-optimization-practices.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/basic-optimization-practices.ipynb)|
92 | || | [hacking-bayesian-optimization-for-dnns](https://www.kaggle.com/lucamassaron/hacking-bayesian-optimization-for-dnns) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/hacking-bayesian-optimization-for-dnns.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/hacking-bayesian-optimization-for-dnns.ipynb)|
93 | || | [hacking-bayesian-optimization](https://www.kaggle.com/lucamassaron/hacking-bayesian-optimization) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/hacking-bayesian-optimization.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/hacking-bayesian-optimization.ipynb)|
94 | || | [kerastuner-for-imdb](https://www.kaggle.com/lucamassaron/kerastuner-for-imdb/) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/kerastuner-for-imdb.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/kerastuner-for-imdb.ipynb)|
95 | || | [optuna-bayesian-optimization](https://www.kaggle.com/lucamassaron/optuna-bayesian-optimization) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/optuna-bayesian-optimization.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/optuna-bayesian-optimization.ipynb)|
96 | || | [scikit-optimize-for-lightgbm](https://www.kaggle.com/code/lucamassaron/scikit-optimize-for-lightgbm) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/scikit-optimize-for-lightgbm.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/scikit-optimize-for-lightgbm.ipynb)|
97 | || | [tutorial-bayesian-optimization-with-lightgbm](https://www.kaggle.com/lucamassaron/tutorial-bayesian-optimization-with-lightgbm) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/tutorial-bayesian-optimization-with-lightgbm.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_08/tutorial-bayesian-optimization-with-lightgbm.ipynb)|
98 | |09| Ensembling with Blending and Stacking Solutions| [ensembling](https://www.kaggle.com/code/lucamassaron/ensembling) |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_09/ensembling.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_09/ensembling.ipynb)|
99 | |10| Modeling for Computer Vision | augmentations-examples |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-augmentations-examples.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-augmentations-examples.ipynb)|
100 | || | images-classification |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-images-classification.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-images-classification.ipynb)|
101 | || | prepare-annotations |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-prepare-annotations.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-prepare-annotations.ipynb)|
102 | || | segmentation-inference |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-segmentation-inference.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-segmentation-inference.ipynb)|
103 | || | segmentation |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-segmentation.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/ch10-segmentation.ipynb)|
104 | || | object-detection-yolov5 |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/chap10-object-detection-yolov5.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_10/chap10-object-detection-yolov5.ipynb)|
105 | |11| Modeling for NLP | nlp-augmentations4 |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chap11-nlp-augmentations4.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/***)|
106 | || | nlp-augmentation1 |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-nlp-augmentation1.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-nlp-augmentation1.ipynb)|
107 | || | qanswering |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-qanswering.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-qanswering.ipynb)|
108 | || | sentiment-extraction |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-sentiment-extraction.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_11/chapter11-sentiment-extraction.ipynb)|
109 | |12| Simulation and Optimization Competitions | connectx |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chap12-connectx.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chap12-connectx.ipynb)|
110 | || | mab-santa |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chapter12-mab-santa.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chapter12-mab-santa.ipynb)|
111 | || | rps-notebook1 |[](https://colab.research.google.com/github/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chapter12-rps-notebook1.ipynb)|[](https://kaggle.com/kernels/welcome?src=https://github.com/PacktPublishing/The-Kaggle-Book/blob/main/chapter_12/chapter12-rps-notebook1.ipynb)|
112 |
113 | ## Book Description
114 | Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with the rest of the community, and gain valuable experience to help grow your career.
115 |
116 | The first book of its kind, Data Analysis and Machine Learning with Kaggle assembles the techniques and skills you’ll need for success in competitions, data science projects, and beyond. Two masters of Kaggle walk you through modeling strategies you won’t easily find elsewhere, and the tacit knowledge they’ve accumulated along the way. As well as Kaggle-specific tips, you’ll learn more general techniques for approaching tasks based on image data, tabular data, textual data, and reinforcement learning. You’ll design better validation schemes and work more comfortably with different evaluation metrics.
117 |
118 | Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.
119 |
120 | ## What you will learn
121 | * Get acquainted with Kaggle and other competition platforms
122 | * Make the most of Kaggle Notebooks, Datasets, and Discussion forums
123 | * Understand different modeling tasks including binary and multi-class classification, object detection, NLP (Natural Language Processing), and time series
124 | * Design good validation schemes, learning about k-fold, probabilistic, and adversarial validation
125 | * Get to grips with evaluation metrics including MSE and its variants, precision and recall, IoU, mean average precision at k, as well as never-before-seen metrics
126 | * Handle simulation and optimization competitions on Kaggle
127 | * Create a portfolio of projects and ideas to get further in your career
128 |
129 | ## Who This Book Is For
130 | This book is suitable for Kaggle users and data analysts/scientists with at least a basic proficiency in data science topics and Python who are trying to do better in Kaggle competitions and secure jobs with tech giants. At the time of completion of this book, there were 96,190 Kaggle novices (users who have just registered on the website) and 67,666 Kaggle contributors (users who have just filled in their profile) enlisted in Kaggle competitions. This book has been written with all of them in mind, as well as anyone else wanting to break the ice and start taking part in competitions on Kaggle and learning from them.
131 |
132 | ## Table of Contents
133 | ### Part 1
134 |
135 | 1. Introducing Kaggle and Other Data Science Competitions
136 | 2. Organizing Data with Datasets
137 | 3. Working and Learning with Kaggle Notebooks
138 | 4. Leveraging Discussion Forums
139 |
140 | ### Part 2
141 |
142 | 5. Competition Tasks and Metrics
143 | 6. Designing Good Validation
144 | 7. Modeling for Tabular Competitions
145 | 8. Hyperparameter Optimization
146 | 9. Ensembling with Blending and Stacking Solutions
147 | 10. Modeling for Computer Vision
148 | 11. Modeling for NLP
149 | 12. Simulation and Optimization Competitions
150 |
151 | ### Part 3
152 |
153 | 13. Creating Your Portfolio of Projects and Ideas
154 | 14. Finding New Professional Opportunities
155 | ### Download a free PDF
156 |
157 | If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost. Simply click on the link to claim your free PDF.
158 | https://packt.link/free-ebook/9781801817479
159 |
--------------------------------------------------------------------------------
/The_Kaggle_Book.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/The_Kaggle_Book.png
--------------------------------------------------------------------------------
/chapter_01/README.md:
--------------------------------------------------------------------------------
1 | # Introducing Kaggle and Other Data Science Competitions
2 |
3 | The chapter discusses how competitive programming evolved into data science competitions. It explains why the Kaggle platform is the most popular site for these competitions and provides you with an idea about how it works.
4 |
--------------------------------------------------------------------------------
/chapter_02/README.md:
--------------------------------------------------------------------------------
1 | # Organizing Data with Datasets
2 |
3 | This chapter introduces you to Kaggle Datasets, the standard method of data storage on the platform. We discuss setup, gathering data, and utilizing it in your work on Kaggle.
4 |
--------------------------------------------------------------------------------
/chapter_03/README.md:
--------------------------------------------------------------------------------
1 | # Working and Learning with Kaggle Notebooks
2 |
3 | This chapter discusses Kaggle Notebooks, the baseline coding environment. We talk about the basics of Notebook usage, how to leverage the GCP environment, and how to use Notebooks to build up your data science portfolio.
4 |
--------------------------------------------------------------------------------
/chapter_04/README.md:
--------------------------------------------------------------------------------
1 | # Leveraging Discussion Forums
2 |
3 | This chapter allows you to familiarize yourself with discussion forums, the primary means of communication and idea exchange on Kaggle.
4 |
--------------------------------------------------------------------------------
/chapter_05/README.md:
--------------------------------------------------------------------------------
1 | # Competition Tasks and Metrics
2 |
3 | This chapter details how evaluation metrics for certain kinds of problems strongly influence the way you can operate when building your model solution in a data science competition. The chapter also addresses the large variety of metrics available in Kaggle competitions.
4 |
--------------------------------------------------------------------------------
/chapter_05/focal_loss.py:
--------------------------------------------------------------------------------
1 | import numpy as np  # needed by the loss below; missing from the original snippet
2 | from scipy.misc import derivative  # note: removed in recent SciPy releases
3 | import xgboost as xgb
4 | def focal_loss(alpha, gamma):
5 | def loss_func(y_pred, y_true):
6 | a, g = alpha, gamma
7 | def get_loss(y_pred, y_true):
8 | p = 1 / (1 + np.exp(-y_pred))
9 | loss = (-(a * y_true + (1 - a)*(1 - y_true)) *
10 | ((1 - (y_true * p + (1 - y_true) *
11 | (1 - p)))**g) * (y_true * np.log(p) +
12 | (1 - y_true) * np.log(1 - p)))
13 | return loss
14 | partial_focal = lambda y_pred: get_loss(y_pred, y_true)
15 | grad = derivative(partial_focal, y_pred, n=1, dx=1e-6)
16 | hess = derivative(partial_focal, y_pred, n=2, dx=1e-6)
17 | return grad, hess
18 | return loss_func
19 |
--------------------------------------------------------------------------------
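A minimal sketch of how `focal_loss` could be plugged into XGBoost's native training API. The synthetic dataset and the `alpha`/`gamma` values are illustrative assumptions, not taken from the book, and `focal_loss.py` is assumed to be importable from the working directory:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

from focal_loss import focal_loss

# Illustrative imbalanced binary-classification data
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
dtrain = xgb.DMatrix(X, label=y)

def xgb_obj(preds, dmatrix):
    # Adapt loss_func's (y_pred, y_true) signature to xgb.train's
    # (preds, DMatrix) custom-objective convention
    return focal_loss(alpha=0.25, gamma=2.0)(preds, dmatrix.get_label())

booster = xgb.train(params={"max_depth": 3}, dtrain=dtrain,
                    num_boost_round=50, obj=xgb_obj)
```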
/chapter_05/meta_kaggle.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "import numpy as np\n",
10 | "import pandas as pd\n",
11 | "\n",
12 | "comps = pd.read_csv(\"/kaggle/input/meta-kaggle/Competitions.csv\")\n",
13 | "evaluation = ['EvaluationAlgorithmAbbreviation',\n",
14 | " 'EvaluationAlgorithmName',\n",
15 | " 'EvaluationAlgorithmDescription',]\n",
16 | "\n",
17 | "compt = ['Title', 'EnabledDate', 'HostSegmentTitle']\n",
18 | "\n",
19 | "df = comps[compt + evaluation].copy()\n",
20 | "\n",
21 | "df['year'] = pd.to_datetime(df.EnabledDate).dt.year.values\n",
22 | "df['comps'] = 1\n",
23 | "time_select = df.year >= 2015\n",
24 | "competition_type_select = df.HostSegmentTitle.isin(['Featured', 'Research'])\n",
25 | "\n",
26 | "pd.pivot_table(df[time_select&competition_type_select],\n",
27 | " values='comps',\n",
28 | " index=['EvaluationAlgorithmAbbreviation'],\n",
29 | " columns=['year'],\n",
30 | " fill_value=0.0,\n",
31 | " aggfunc=np.sum,\n",
32 | " margins=True\n",
33 | " ).sort_values(\n",
34 | " by=('All'), ascending=False).iloc[1:,:].head(20)"
35 | ]
36 | },
37 | {
38 | "cell_type": "code",
39 | "execution_count": null,
40 | "metadata": {},
41 | "outputs": [],
42 | "source": [
43 | "metric = 'AUC'\n",
44 | "metric_select = df['EvaluationAlgorithmAbbreviation']==metric\n",
45 | "print(df[time_select&competition_type_select&metric_select][['Title', 'year']])"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {},
52 | "outputs": [],
53 | "source": [
54 | "counts = (df[time_select&competition_type_select]\n",
55 | " .groupby('EvaluationAlgorithmAbbreviation'))\n",
56 | "total_comps_per_year = (df[time_select&competition_type_select]\n",
57 | " .groupby('year').sum())\n",
58 | "single_metrics_per_year = (counts.sum()[counts.sum().comps==1]\n",
59 | " .groupby('year').sum())\n",
60 | "table = (total_comps_per_year.rename(columns={'comps': 'n_comps'})\n",
61 | " .join(single_metrics_per_year / total_comps_per_year))\n",
62 | " \n",
63 | "print(table)"
64 | ]
65 | },
66 | {
67 | "cell_type": "code",
68 | "execution_count": null,
69 | "metadata": {},
70 | "outputs": [],
71 | "source": [
72 | "print(counts.sum()[counts.sum().comps==1].index.values)"
73 | ]
74 | }
75 | ],
76 | "metadata": {
77 | "kernelspec": {
78 | "display_name": "Python 3",
79 | "language": "python",
80 | "name": "python3"
81 | },
82 | "language_info": {
83 | "codemirror_mode": {
84 | "name": "ipython",
85 | "version": 3
86 | },
87 | "file_extension": ".py",
88 | "mimetype": "text/x-python",
89 | "name": "python",
90 | "nbconvert_exporter": "python",
91 | "pygments_lexer": "ipython3",
92 | "version": "3.7.6"
93 | }
94 | },
95 | "nbformat": 4,
96 | "nbformat_minor": 4
97 | }
98 |
--------------------------------------------------------------------------------
/chapter_06/README.md:
--------------------------------------------------------------------------------
1 | # Designing Good Validation
2 |
3 | This chapter will introduce you to the importance of validation in data competitions, discussing overfitting, shake-ups, leakage, adversarial validation, different kinds of validation strategies, and strategies for your final submissions.
4 |
--------------------------------------------------------------------------------
/chapter_06/adversarial-validation-example.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "832cbc95",
7 | "metadata": {
8 | "execution": {
9 | "iopub.execute_input": "2022-05-12T21:24:16.239591Z",
10 | "iopub.status.busy": "2022-05-12T21:24:16.239187Z",
11 | "iopub.status.idle": "2022-05-12T21:24:17.738312Z",
12 | "shell.execute_reply": "2022-05-12T21:24:17.737324Z"
13 | },
14 | "papermill": {
15 | "duration": 1.515163,
16 | "end_time": "2022-05-12T21:24:17.741395",
17 | "exception": false,
18 | "start_time": "2022-05-12T21:24:16.226232",
19 | "status": "completed"
20 | },
21 | "tags": []
22 | },
23 | "outputs": [],
24 | "source": [
25 | "import numpy as np \n",
26 | "import pandas as pd\n",
27 | "from sklearn.ensemble import RandomForestClassifier\n",
28 | "from sklearn.model_selection import cross_val_predict\n",
29 | "from sklearn.metrics import roc_auc_score"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 2,
35 | "id": "feccabad",
36 | "metadata": {
37 | "execution": {
38 | "iopub.execute_input": "2022-05-12T21:24:17.766209Z",
39 | "iopub.status.busy": "2022-05-12T21:24:17.765879Z",
40 | "iopub.status.idle": "2022-05-12T21:24:21.083665Z",
41 | "shell.execute_reply": "2022-05-12T21:24:21.082746Z"
42 | },
43 | "papermill": {
44 | "duration": 3.332449,
45 | "end_time": "2022-05-12T21:24:21.086361",
46 | "exception": false,
47 | "start_time": "2022-05-12T21:24:17.753912",
48 | "status": "completed"
49 | },
50 | "tags": []
51 | },
52 | "outputs": [],
53 | "source": [
54 | "train = pd.read_csv(\"../input/tabular-playground-series-jan-2021/train.csv\")\n",
55 | "test = pd.read_csv(\"../input/tabular-playground-series-jan-2021/test.csv\")"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 3,
61 | "id": "e4a08b03",
62 | "metadata": {
63 | "execution": {
64 | "iopub.execute_input": "2022-05-12T21:24:21.109156Z",
65 | "iopub.status.busy": "2022-05-12T21:24:21.108793Z",
66 | "iopub.status.idle": "2022-05-12T21:24:21.188984Z",
67 | "shell.execute_reply": "2022-05-12T21:24:21.187969Z"
68 | },
69 | "papermill": {
70 | "duration": 0.095271,
71 | "end_time": "2022-05-12T21:24:21.192002",
72 | "exception": false,
73 | "start_time": "2022-05-12T21:24:21.096731",
74 | "status": "completed"
75 | },
76 | "tags": []
77 | },
78 | "outputs": [],
79 | "source": [
80 | "train = train.fillna(-1).drop([\"id\", \"target\"], axis=1)\n",
81 | "test = test.fillna(-1).drop([\"id\"], axis=1)"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 4,
87 | "id": "b9263a7e",
88 | "metadata": {
89 | "execution": {
90 | "iopub.execute_input": "2022-05-12T21:24:21.214042Z",
91 | "iopub.status.busy": "2022-05-12T21:24:21.213574Z",
92 | "iopub.status.idle": "2022-05-12T21:24:21.245962Z",
93 | "shell.execute_reply": "2022-05-12T21:24:21.244879Z"
94 | },
95 | "papermill": {
96 | "duration": 0.046391,
97 | "end_time": "2022-05-12T21:24:21.248658",
98 | "exception": false,
99 | "start_time": "2022-05-12T21:24:21.202267",
100 | "status": "completed"
101 | },
102 | "tags": []
103 | },
104 | "outputs": [],
105 | "source": [
106 | "X = pd.concat([train, test], ignore_index=True)\n",
107 | "y = [0] * len(train) + [1] * len(test)"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": 5,
113 | "id": "7e1cf7e6",
114 | "metadata": {
115 | "execution": {
116 | "iopub.execute_input": "2022-05-12T21:24:21.270675Z",
117 | "iopub.status.busy": "2022-05-12T21:24:21.270308Z",
118 | "iopub.status.idle": "2022-05-12T21:54:18.335472Z",
119 | "shell.execute_reply": "2022-05-12T21:54:18.333817Z"
120 | },
121 | "papermill": {
122 | "duration": 1797.082831,
123 | "end_time": "2022-05-12T21:54:18.341861",
124 | "exception": false,
125 | "start_time": "2022-05-12T21:24:21.259030",
126 | "status": "completed"
127 | },
128 | "tags": []
129 | },
130 | "outputs": [],
131 | "source": [
132 | "model = RandomForestClassifier()\n",
133 | "cv_preds = cross_val_predict(model, X, y, cv=5, n_jobs=-1, method='predict_proba')"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": 6,
139 | "id": "dc5c98a6",
140 | "metadata": {
141 | "execution": {
142 | "iopub.execute_input": "2022-05-12T21:54:18.370805Z",
143 | "iopub.status.busy": "2022-05-12T21:54:18.369439Z",
144 | "iopub.status.idle": "2022-05-12T21:54:18.771394Z",
145 | "shell.execute_reply": "2022-05-12T21:54:18.770344Z"
146 | },
147 | "papermill": {
148 | "duration": 0.420268,
149 | "end_time": "2022-05-12T21:54:18.776677",
150 | "exception": false,
151 | "start_time": "2022-05-12T21:54:18.356409",
152 | "status": "completed"
153 | },
154 | "tags": []
155 | },
156 | "outputs": [
157 | {
158 | "name": "stdout",
159 | "output_type": "stream",
160 | "text": [
161 | "0.49981959930833336\n"
162 | ]
163 | }
164 | ],
165 | "source": [
166 | "print(roc_auc_score(y_true=y, y_score=cv_preds[:,1]))"
167 | ]
168 | },
169 | {
170 | "cell_type": "code",
171 | "execution_count": 7,
172 | "id": "371b4b6f",
173 | "metadata": {
174 | "execution": {
175 | "iopub.execute_input": "2022-05-12T21:54:18.805600Z",
176 | "iopub.status.busy": "2022-05-12T21:54:18.805239Z",
177 | "iopub.status.idle": "2022-05-12T21:54:18.813136Z",
178 | "shell.execute_reply": "2022-05-12T21:54:18.812068Z"
179 | },
180 | "papermill": {
181 | "duration": 0.023365,
182 | "end_time": "2022-05-12T21:54:18.816362",
183 | "exception": false,
184 | "start_time": "2022-05-12T21:54:18.792997",
185 | "status": "completed"
186 | },
187 | "tags": []
188 | },
189 | "outputs": [
190 | {
191 | "name": "stdout",
192 | "output_type": "stream",
193 | "text": [
194 | "24793\n"
195 | ]
196 | }
197 | ],
198 | "source": [
199 | "print(np.sum(cv_preds[:len(X), 1] > 0.5))"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 8,
205 | "id": "4b165023",
206 | "metadata": {
207 | "execution": {
208 | "iopub.execute_input": "2022-05-12T21:54:18.849718Z",
209 | "iopub.status.busy": "2022-05-12T21:54:18.849383Z",
210 | "iopub.status.idle": "2022-05-12T22:12:55.247629Z",
211 | "shell.execute_reply": "2022-05-12T22:12:55.246399Z"
212 | },
213 | "papermill": {
214 | "duration": 1116.431812,
215 | "end_time": "2022-05-12T22:12:55.265251",
216 | "exception": false,
217 | "start_time": "2022-05-12T21:54:18.833439",
218 | "status": "completed"
219 | },
220 | "tags": []
221 | },
222 | "outputs": [
223 | {
224 | "data": {
225 | "text/plain": [
226 | "RandomForestClassifier()"
227 | ]
228 | },
229 | "execution_count": 8,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "model.fit(X, y)"
236 | ]
237 | },
238 | {
239 | "cell_type": "code",
240 | "execution_count": 9,
241 | "id": "9a6872a9",
242 | "metadata": {
243 | "execution": {
244 | "iopub.execute_input": "2022-05-12T22:12:55.293966Z",
245 | "iopub.status.busy": "2022-05-12T22:12:55.293295Z",
246 | "iopub.status.idle": "2022-05-12T22:12:55.812658Z",
247 | "shell.execute_reply": "2022-05-12T22:12:55.811391Z"
248 | },
249 | "papermill": {
250 | "duration": 0.537481,
251 | "end_time": "2022-05-12T22:12:55.815373",
252 | "exception": false,
253 | "start_time": "2022-05-12T22:12:55.277892",
254 | "status": "completed"
255 | },
256 | "tags": []
257 | },
258 | "outputs": [
259 | {
260 | "name": "stdout",
261 | "output_type": "stream",
262 | "text": [
263 | "cont14 : 0.0720\n",
264 | "cont4 : 0.0718\n",
265 | "cont5 : 0.0718\n",
266 | "cont2 : 0.0717\n",
267 | "cont7 : 0.0716\n",
268 | "cont8 : 0.0716\n",
269 | "cont3 : 0.0715\n",
270 | "cont1 : 0.0714\n",
271 | "cont11 : 0.0713\n",
272 | "cont13 : 0.0713\n",
273 | "cont10 : 0.0712\n",
274 | "cont12 : 0.0711\n",
275 | "cont9 : 0.0710\n",
276 | "cont6 : 0.0708\n"
277 | ]
278 | }
279 | ],
280 | "source": [
281 | "ranks = sorted(list(zip(X.columns, model.feature_importances_)), \n",
282 | " key=lambda x: x[1], reverse=True)\n",
283 | "\n",
284 | "for feature, score in ranks:\n",
285 | " print(f\"{feature:10} : {score:0.4f}\")"
286 | ]
287 | }
288 | ],
289 | "metadata": {
290 | "kernelspec": {
291 | "display_name": "Python 3",
292 | "language": "python",
293 | "name": "python3"
294 | },
295 | "language_info": {
296 | "codemirror_mode": {
297 | "name": "ipython",
298 | "version": 3
299 | },
300 | "file_extension": ".py",
301 | "mimetype": "text/x-python",
302 | "name": "python",
303 | "nbconvert_exporter": "python",
304 | "pygments_lexer": "ipython3",
305 | "version": "3.7.12"
306 | },
307 | "papermill": {
308 | "default_parameters": {},
309 | "duration": 2933.329012,
310 | "end_time": "2022-05-12T22:12:58.563839",
311 | "environment_variables": {},
312 | "exception": null,
313 | "input_path": "__notebook__.ipynb",
314 | "output_path": "__notebook__.ipynb",
315 | "parameters": {},
316 | "start_time": "2022-05-12T21:24:05.234827",
317 | "version": "2.3.4"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 5
322 | }
323 |
--------------------------------------------------------------------------------
/chapter_06/bootstrap.py:
--------------------------------------------------------------------------------
1 | import random
2 |
3 | def Bootstrap(n, n_iter=3, random_state=None):
4 | """
5 | Random sampling with replacement cross-validation generator.
6 |     For each iteration, a bootstrap sample of the indexes [0, n) is
7 |     generated, and the generator yields the obtained sample together
8 |     with a list of all the excluded (out-of-bag) indexes.
9 | """
10 |     if random_state is not None:  # a seed of 0 is a valid random_state too
11 | random.seed(random_state)
12 | for j in range(n_iter):
13 | bs = [random.randint(0, n-1) for i in range(n)]
14 | out_bs = list({i for i in range(n)} - set(bs))
15 | yield bs, out_bs
16 |
--------------------------------------------------------------------------------
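A short usage sketch; the toy data and model are illustrative assumptions, with `bootstrap.py` assumed importable from the working directory:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

from bootstrap import Bootstrap

X, y = make_classification(n_samples=200, random_state=0)

# Each iteration yields the in-bag indexes (sampled with replacement)
# and the out-of-bag indexes that were never drawn
for in_bag, out_bag in Bootstrap(len(X), n_iter=3, random_state=42):
    model = LogisticRegression(max_iter=1000).fit(X[in_bag], y[in_bag])
    print(f"OOB accuracy: {accuracy_score(y[out_bag], model.predict(X[out_bag])):.3f}")
```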
/chapter_07/README.md:
--------------------------------------------------------------------------------
1 | # Modeling for Tabular Competitions
2 |
3 | This chapter discusses tabular competitions, mostly focusing on the more recent reality of Kaggle, the Tabular Playground Series. Tabular problems are standard practice for the majority of data scientists, and there is a lot to learn from Kaggle.
4 |
--------------------------------------------------------------------------------
/chapter_07/TargetEncode.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | import pandas as pd
3 |
4 | from sklearn.base import BaseEstimator, TransformerMixin
5 |
6 | class TargetEncode(BaseEstimator, TransformerMixin):
7 |
8 | def __init__(self, categories='auto', k=1, f=1,
9 | noise_level=0, random_state=None):
10 | if type(categories)==str and categories!='auto':
11 | self.categories = [categories]
12 | else:
13 | self.categories = categories
14 | self.k = k
15 | self.f = f
16 | self.noise_level = noise_level
17 | self.encodings = dict()
18 | self.prior = None
19 | self.random_state = random_state
20 |
21 | def add_noise(self, series, noise_level):
22 | return series * (1 + noise_level *
23 | np.random.randn(len(series)))
24 |
25 | def fit(self, X, y=None):
26 |         if isinstance(self.categories, str) and self.categories == 'auto':
27 |             # Select the object-dtype column names (the original compared
28 |             # type(...) to a string, so the 'auto' branch never triggered)
29 |             self.categories = X.columns[X.dtypes == object]
28 |
29 | temp = X.loc[:, self.categories].copy()
30 | temp['target'] = y
31 | self.prior = np.mean(y)
32 | for variable in self.categories:
33 | avg = (temp.groupby(by=variable)['target']
34 | .agg(['mean', 'count']))
35 | # Compute smoothing
36 | smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /
37 | self.f)))
38 | # The bigger the count the less full_avg is accounted
39 | self.encodings[variable] = dict(self.prior * (1 -
40 | smoothing) + avg['mean'] * smoothing)
41 |
42 | return self
43 |
44 | def transform(self, X):
45 | Xt = X.copy()
46 | for variable in self.categories:
47 | Xt[variable].replace(self.encodings[variable],
48 | inplace=True)
49 | unknown_value = {value:self.prior for value in
50 | X[variable].unique()
51 | if value not in
52 | self.encodings[variable].keys()}
53 | if len(unknown_value) > 0:
54 | Xt[variable].replace(unknown_value, inplace=True)
55 | Xt[variable] = Xt[variable].astype(float)
56 | if self.noise_level > 0:
57 | if self.random_state is not None:
58 | np.random.seed(self.random_state)
59 | Xt[variable] = self.add_noise(Xt[variable],
60 | self.noise_level)
61 | return Xt
62 |
63 | def fit_transform(self, X, y=None):
64 | self.fit(X, y)
65 | return self.transform(X)
66 |
--------------------------------------------------------------------------------
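The `fit` method blends each category's observed target mean with the global prior through a sigmoid smoothing factor; for a category observed $n$ times, with the `k` and `f` parameters above:

$$\text{smoothing} = \frac{1}{1 + e^{-(n - k)/f}}, \qquad \text{encoding} = \text{prior} \cdot (1 - \text{smoothing}) + \text{mean} \cdot \text{smoothing}$$

so rare categories shrink toward the prior, while frequent ones keep their own mean.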
/chapter_07/meta-features-and-target-encoding.ipynb:
--------------------------------------------------------------------------------
1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.12","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"import numpy as np\nimport pandas as pd\n\nfrom sklearn.base import BaseEstimator, TransformerMixin\n\nclass TargetEncode(BaseEstimator, TransformerMixin):\n \n def __init__(self, categories='auto', k=1, f=1, \n noise_level=0, random_state=None):\n if type(categories)==str and categories!='auto':\n self.categories = [categories]\n else:\n self.categories = categories\n self.k = k\n self.f = f\n self.noise_level = noise_level\n self.encodings = dict()\n self.prior = None\n self.random_state = random_state\n \n def add_noise(self, series, noise_level):\n return series * (1 + noise_level * \n np.random.randn(len(series)))\n \n def fit(self, X, y=None):\n if type(self.categories)=='auto':\n self.categories = np.where(X.dtypes == type(object()))[0]\n \n temp = X.loc[:, self.categories].copy()\n temp['target'] = y\n self.prior = np.mean(y)\n for variable in self.categories:\n avg = (temp.groupby(by=variable)['target']\n .agg(['mean', 'count']))\n # Compute smoothing \n smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) / \n self.f)))\n # The bigger the count the less full_avg is accounted\n self.encodings[variable] = dict(self.prior * (1 - \n smoothing) + avg['mean'] * smoothing)\n \n return self\n \n def transform(self, X):\n Xt = X.copy()\n for variable in self.categories:\n Xt[variable].replace(self.encodings[variable], \n inplace=True)\n unknown_value = {value:self.prior for value in \n X[variable].unique() \n if value not in \n self.encodings[variable].keys()}\n if len(unknown_value) > 0:\n Xt[variable].replace(unknown_value, inplace=True)\n Xt[variable] = Xt[variable].astype(float)\n if self.noise_level > 0:\n if self.random_state is not None:\n np.random.seed(self.random_state)\n Xt[variable] = self.add_noise(Xt[variable], \n self.noise_level)\n return Xt\n \n def fit_transform(self, X, y=None):\n self.fit(X, y)\n return self.transform(X)\n","metadata":{"execution":{"iopub.status.busy":"2022-02-20T23:24:46.722503Z","iopub.execute_input":"2022-02-20T23:24:46.723260Z","iopub.status.idle":"2022-02-20T23:24:46.740899Z","shell.execute_reply.started":"2022-02-20T23:24:46.723203Z","shell.execute_reply":"2022-02-20T23:24:46.740024Z"},"trusted":true},"execution_count":112,"outputs":[]},{"cell_type":"code","source":"train = pd.read_csv(\"../input/amazon-employee-access-challenge/train.csv\")","metadata":{"execution":{"iopub.status.busy":"2022-02-20T22:46:21.939860Z","iopub.execute_input":"2022-02-20T22:46:21.940273Z","iopub.status.idle":"2022-02-20T22:46:22.019494Z","shell.execute_reply.started":"2022-02-20T22:46:21.940209Z","shell.execute_reply":"2022-02-20T22:46:22.018713Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"code","source":"# Frequency count of a feature\nfeature_counts = train.groupby('ROLE_TITLE').size()\ntrain['ROLE_TITLE'].apply(lambda x: feature_counts[x])","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Frequency count of a feature grouped by another feature\nfeature_counts = train.groupby(['ROLE_DEPTNAME', 'ROLE_TITLE']).size()\ntrain[['ROLE_DEPTNAME', 'ROLE_TITLE']].apply(lambda x: feature_counts[x[0]][x[1]], 
axis=1)","metadata":{"execution":{"iopub.status.busy":"2022-02-20T23:02:24.143510Z","iopub.execute_input":"2022-02-20T23:02:24.144246Z","iopub.status.idle":"2022-02-20T23:02:30.325426Z","shell.execute_reply.started":"2022-02-20T23:02:24.144191Z","shell.execute_reply":"2022-02-20T23:02:30.324379Z"},"trusted":true},"execution_count":54,"outputs":[]},{"cell_type":"code","source":"te = TargetEncode(categories='ROLE_TITLE')\nte.fit(train, train['ACTION'])\nte.transform(train[['ROLE_TITLE']])","metadata":{"execution":{"iopub.status.busy":"2022-02-20T23:34:36.536917Z","iopub.execute_input":"2022-02-20T23:34:36.537484Z","iopub.status.idle":"2022-02-20T23:34:36.592922Z","shell.execute_reply.started":"2022-02-20T23:34:36.537436Z","shell.execute_reply":"2022-02-20T23:34:36.592111Z"},"trusted":true},"execution_count":129,"outputs":[]}]}
--------------------------------------------------------------------------------
/chapter_07/reduce_mem_usage.py:
--------------------------------------------------------------------------------
1 | import numpy as np  # needed for np.iinfo/np.finfo; missing from the original snippet
2 |
3 | def reduce_mem_usage(df, verbose=True):
4 |     numerics = ['int16', 'int32', 'int64',
5 |                 'float16', 'float32', 'float64']
6 |     start_mem = df.memory_usage().sum() / 1024**2
7 |     for col in df.columns:
8 |         col_type = df[col].dtypes
9 |         if col_type in numerics:
10 |             c_min = df[col].min()
11 |             c_max = df[col].max()
12 |             if str(col_type)[:3] == 'int':
13 |                 # Downcast integers to the smallest type that fits the range
14 |                 if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
15 |                     df[col] = df[col].astype(np.int8)
16 |                 elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
17 |                     df[col] = df[col].astype(np.int16)
18 |                 elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
19 |                     df[col] = df[col].astype(np.int32)
20 |                 elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
21 |                     df[col] = df[col].astype(np.int64)
22 |             else:
23 |                 # Floats are kept at float32 or above to limit precision loss
24 |                 if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
25 |                     df[col] = df[col].astype(np.float32)
26 |                 else:
27 |                     df[col] = df[col].astype(np.float64)
28 |     end_mem = df.memory_usage().sum() / 1024**2
29 |     if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
30 |     return df
31 |
--------------------------------------------------------------------------------
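A quick usage sketch; the DataFrame here is a hypothetical example, with `reduce_mem_usage.py` assumed importable from the working directory:

```python
import numpy as np
import pandas as pd

from reduce_mem_usage import reduce_mem_usage

# int64/float64 columns get downcast (here to int16 and float32)
df = pd.DataFrame({"a": np.arange(1000, dtype="int64"),
                   "b": np.random.rand(1000)})
df = reduce_mem_usage(df)
```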
/chapter_07/seed_everything.py:
--------------------------------------------------------------------------------
1 | import os
2 | import random
3 | import numpy as np
4 |
5 | def seed_everything(seed,
6 |                     tensorflow_init=True,
7 |                     pytorch_init=True):
8 |     """
9 |     Seeds basic parameters for reproducibility of results
10 |     """
11 |     random.seed(seed)
12 |     os.environ["PYTHONHASHSEED"] = str(seed)
13 |     np.random.seed(seed)
14 |     if tensorflow_init is True:
15 |         import tensorflow as tf  # lazy import keeps TensorFlow optional
16 |         tf.random.set_seed(seed)
17 |     if pytorch_init is True:
18 |         import torch  # lazy import keeps PyTorch optional
19 |         torch.manual_seed(seed)
20 |         torch.cuda.manual_seed(seed)
21 |         torch.backends.cudnn.deterministic = True
22 |         torch.backends.cudnn.benchmark = False
23 |
--------------------------------------------------------------------------------
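Typical usage, assuming `seed_everything.py` is importable and the relevant frameworks are installed (either can be skipped via the flags):

```python
from seed_everything import seed_everything

# Fix every relevant seed once, before creating any model or data split
seed_everything(42, tensorflow_init=True, pytorch_init=True)
```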
/chapter_08/README.md:
--------------------------------------------------------------------------------
1 | # Hyperparameter Optimization
2 |
3 | This chapter explores how to extend the cross-validation approach to find the best hyperparameters for your models – in other words, those that can generalize in the best way on the private leaderboard – under the pressure and scarcity of time and resources that you experience in Kaggle competitions.
4 |
--------------------------------------------------------------------------------
/chapter_08/basic-optimization-practices.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "4f70b76b",
7 | "metadata": {
8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
10 | "execution": {
11 | "iopub.execute_input": "2021-10-28T10:18:49.232672Z",
12 | "iopub.status.busy": "2021-10-28T10:18:49.227561Z",
13 | "iopub.status.idle": "2021-10-28T10:19:04.502716Z",
14 | "shell.execute_reply": "2021-10-28T10:19:04.503216Z",
15 | "shell.execute_reply.started": "2021-10-28T10:08:45.550332Z"
16 | },
17 | "papermill": {
18 | "duration": 15.287961,
19 | "end_time": "2021-10-28T10:19:04.503522",
20 | "exception": false,
21 | "start_time": "2021-10-28T10:18:49.215561",
22 | "status": "completed"
23 | },
24 | "tags": []
25 | },
26 | "outputs": [
27 | {
28 | "name": "stdout",
29 | "output_type": "stream",
30 | "text": [
31 | "Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.7/site-packages (0.23.2)\r\n",
32 | "Collecting scikit-learn\r\n",
33 | " Downloading scikit_learn-1.0.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.2 MB)\r\n",
34 | "\u001b[K |████████████████████████████████| 23.2 MB 545 kB/s \r\n",
35 | "\u001b[?25hRequirement already satisfied: joblib>=0.11 in /opt/conda/lib/python3.7/site-packages (from scikit-learn) (1.0.1)\r\n",
36 | "Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn) (2.2.0)\r\n",
37 | "Requirement already satisfied: numpy>=1.14.6 in /opt/conda/lib/python3.7/site-packages (from scikit-learn) (1.19.5)\r\n",
38 | "Requirement already satisfied: scipy>=1.1.0 in /opt/conda/lib/python3.7/site-packages (from scikit-learn) (1.7.1)\r\n",
39 | "Installing collected packages: scikit-learn\r\n",
40 | " Attempting uninstall: scikit-learn\r\n",
41 | " Found existing installation: scikit-learn 0.23.2\r\n",
42 | " Uninstalling scikit-learn-0.23.2:\r\n",
43 | " Successfully uninstalled scikit-learn-0.23.2\r\n",
44 | "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\r\n",
45 | "pdpbox 0.2.1 requires matplotlib==3.1.1, but you have matplotlib 3.4.3 which is incompatible.\r\n",
46 | "hypertools 0.7.0 requires scikit-learn!=0.22,<0.24,>=0.19.1, but you have scikit-learn 1.0.1 which is incompatible.\u001b[0m\r\n",
47 | "Successfully installed scikit-learn-1.0.1\r\n",
48 | "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\u001b[0m\r\n",
49 | "Note: you may need to restart the kernel to use updated packages.\n"
50 | ]
51 | }
52 | ],
53 | "source": [
54 | "%pip install scikit-learn -U"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 2,
60 | "id": "4b69c03e",
61 | "metadata": {
62 | "execution": {
63 | "iopub.execute_input": "2021-10-28T10:19:04.545681Z",
64 | "iopub.status.busy": "2021-10-28T10:19:04.544725Z",
65 | "iopub.status.idle": "2021-10-28T10:19:05.501277Z",
66 | "shell.execute_reply": "2021-10-28T10:19:05.500590Z",
67 | "shell.execute_reply.started": "2021-10-28T10:09:03.182001Z"
68 | },
69 | "papermill": {
70 | "duration": 0.979723,
71 | "end_time": "2021-10-28T10:19:05.501434",
72 | "exception": false,
73 | "start_time": "2021-10-28T10:19:04.521711",
74 | "status": "completed"
75 | },
76 | "tags": []
77 | },
78 | "outputs": [],
79 | "source": [
80 | "from sklearn.datasets import make_classification\n",
81 | "from sklearn.model_selection import train_test_split\n",
82 | "\n",
83 | "X, y = make_classification(n_samples=300, n_features=50, \n",
84 | " n_informative=10,\n",
85 | " n_redundant=25, n_repeated=15, \n",
86 | " n_clusters_per_class=5,\n",
87 | " flip_y=0.05, class_sep=0.5, \n",
88 | " random_state=0)"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 3,
94 | "id": "62513810",
95 | "metadata": {
96 | "execution": {
97 | "iopub.execute_input": "2021-10-28T10:19:05.542295Z",
98 | "iopub.status.busy": "2021-10-28T10:19:05.541309Z",
99 | "iopub.status.idle": "2021-10-28T10:19:05.563842Z",
100 | "shell.execute_reply": "2021-10-28T10:19:05.563327Z",
101 | "shell.execute_reply.started": "2021-10-28T10:09:04.087478Z"
102 | },
103 | "papermill": {
104 | "duration": 0.045382,
105 | "end_time": "2021-10-28T10:19:05.563986",
106 | "exception": false,
107 | "start_time": "2021-10-28T10:19:05.518604",
108 | "status": "completed"
109 | },
110 | "tags": []
111 | },
112 | "outputs": [],
113 | "source": [
114 | "from sklearn import svm\n",
115 | "\n",
116 | "svc = svm.SVC()\n",
117 | "svc = svm.SVC(probability=True, random_state=1)\n",
118 | "\n",
119 | "from sklearn import model_selection\n",
120 | "search_grid = [\n",
121 | " {'C': [1, 10, 100, 1000], 'kernel': ['linear']},\n",
122 | " {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001],\n",
123 | " 'kernel': ['rbf']}\n",
124 | " ]\n",
125 | " \n",
126 | "scorer = 'accuracy'"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 4,
132 | "id": "b949dbdf",
133 | "metadata": {
134 | "execution": {
135 | "iopub.execute_input": "2021-10-28T10:19:05.604674Z",
136 | "iopub.status.busy": "2021-10-28T10:19:05.603884Z",
137 | "iopub.status.idle": "2021-10-28T10:34:48.628686Z",
138 | "shell.execute_reply": "2021-10-28T10:34:48.629211Z"
139 | },
140 | "papermill": {
141 | "duration": 943.048701,
142 | "end_time": "2021-10-28T10:34:48.629569",
143 | "exception": false,
144 | "start_time": "2021-10-28T10:19:05.580868",
145 | "status": "completed"
146 | },
147 | "tags": []
148 | },
149 | "outputs": [
150 | {
151 | "name": "stdout",
152 | "output_type": "stream",
153 | "text": [
154 | "{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}\n",
155 | "0.7\n"
156 | ]
157 | }
158 | ],
159 | "source": [
160 | "search_func = model_selection.GridSearchCV(estimator=svc, \n",
161 | " param_grid=search_grid,\n",
162 | " scoring=scorer, \n",
163 | " n_jobs=-1,\n",
164 | " cv=5)\n",
165 | "search_func.fit(X, y)\n",
166 | "\n",
167 | "print (search_func.best_params_)\n",
168 | "print (search_func.best_score_)"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 5,
174 | "id": "98d9aa9d",
175 | "metadata": {
176 | "execution": {
177 | "iopub.execute_input": "2021-10-28T10:34:48.670290Z",
178 | "iopub.status.busy": "2021-10-28T10:34:48.669534Z",
179 | "iopub.status.idle": "2021-10-28T10:42:04.936145Z",
180 | "shell.execute_reply": "2021-10-28T10:42:04.936654Z"
181 | },
182 | "papermill": {
183 | "duration": 436.289573,
184 | "end_time": "2021-10-28T10:42:04.936849",
185 | "exception": false,
186 | "start_time": "2021-10-28T10:34:48.647276",
187 | "status": "completed"
188 | },
189 | "tags": []
190 | },
191 | "outputs": [
192 | {
193 | "name": "stdout",
194 | "output_type": "stream",
195 | "text": [
196 | "{'C': 300.1387499644088, 'gamma': 0.014479216173691752, 'kernel': 'rbf'}\n",
197 | "0.6666666666666667\n"
198 | ]
199 | }
200 | ],
201 | "source": [
202 | "import scipy.stats as stats\n",
203 | "from sklearn.utils.fixes import loguniform\n",
204 | "\n",
205 | "search_dict = {'kernel': ['linear', 'rbf'], \n",
206 | " 'C': loguniform(1, 1000),\n",
207 | " 'gamma': loguniform(0.0001, 0.1)\n",
208 | " }\n",
209 | "\n",
210 | "scorer = 'accuracy'\n",
211 | "\n",
212 | "search_func = model_selection.RandomizedSearchCV(estimator=svc,\n",
213 | " param_distributions=search_dict,\n",
214 | " n_iter=6,\n",
215 | " scoring=scorer,\n",
216 | " n_jobs=-1,\n",
217 | " cv=5\n",
218 | " )\n",
219 | "\n",
220 | "search_func.fit(X, y)\n",
221 | "\n",
222 | "print (search_func.best_params_)\n",
223 | "print (search_func.best_score_)"
224 | ]
225 | },
226 | {
227 | "cell_type": "code",
228 | "execution_count": 6,
229 | "id": "72e2b127",
230 | "metadata": {
231 | "execution": {
232 | "iopub.execute_input": "2021-10-28T10:42:04.981942Z",
233 | "iopub.status.busy": "2021-10-28T10:42:04.981239Z",
234 | "iopub.status.idle": "2021-10-28T10:43:10.549111Z",
235 | "shell.execute_reply": "2021-10-28T10:43:10.549612Z"
236 | },
237 | "papermill": {
238 | "duration": 65.594403,
239 | "end_time": "2021-10-28T10:43:10.549795",
240 | "exception": false,
241 | "start_time": "2021-10-28T10:42:04.955392",
242 | "status": "completed"
243 | },
244 | "tags": []
245 | },
246 | "outputs": [
247 | {
248 | "name": "stdout",
249 | "output_type": "stream",
250 | "text": [
251 | "{'C': 86.63279761354555, 'gamma': 0.002054762512591133, 'kernel': 'linear'}\n",
252 | "0.6166666666666667\n"
253 | ]
254 | }
255 | ],
256 | "source": [
257 | "from sklearn.experimental import enable_halving_search_cv\n",
258 | "from sklearn.model_selection import HalvingRandomSearchCV\n",
259 | "\n",
260 | "search_func = HalvingRandomSearchCV(estimator=svc,\n",
261 | " param_distributions=search_dict,\n",
262 | " resource='n_samples',\n",
263 | " max_resources=100,\n",
264 | " aggressive_elimination=True,\n",
265 | " scoring=scorer,\n",
266 | " n_jobs=-1,\n",
267 | " cv=5,\n",
268 | " random_state=0)\n",
269 | "\n",
270 | "search_func.fit(X, y)\n",
271 | "\n",
272 | "print (search_func.best_params_)\n",
273 | "print (search_func.best_score_)"
274 | ]
275 | }
276 | ],
277 | "metadata": {
278 | "kernelspec": {
279 | "display_name": "Python 3",
280 | "language": "python",
281 | "name": "python3"
282 | },
283 | "language_info": {
284 | "codemirror_mode": {
285 | "name": "ipython",
286 | "version": 3
287 | },
288 | "file_extension": ".py",
289 | "mimetype": "text/x-python",
290 | "name": "python",
291 | "nbconvert_exporter": "python",
292 | "pygments_lexer": "ipython3",
293 | "version": "3.7.10"
294 | },
295 | "papermill": {
296 | "default_parameters": {},
297 | "duration": 1470.658665,
298 | "end_time": "2021-10-28T10:43:11.481625",
299 | "environment_variables": {},
300 | "exception": null,
301 | "input_path": "__notebook__.ipynb",
302 | "output_path": "__notebook__.ipynb",
303 | "parameters": {},
304 | "start_time": "2021-10-28T10:18:40.822960",
305 | "version": "2.3.3"
306 | }
307 | },
308 | "nbformat": 4,
309 | "nbformat_minor": 5
310 | }
311 |
--------------------------------------------------------------------------------
/chapter_08/kerastuner-for-imdb.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "33bb4064",
7 | "metadata": {
8 | "execution": {
9 | "iopub.execute_input": "2021-10-01T22:23:10.623889Z",
10 | "iopub.status.busy": "2021-10-01T22:23:10.549685Z",
11 | "iopub.status.idle": "2021-10-01T22:23:15.402690Z",
12 | "shell.execute_reply": "2021-10-01T22:23:15.402079Z",
13 | "shell.execute_reply.started": "2021-10-01T22:21:52.563589Z"
14 | },
15 | "papermill": {
16 | "duration": 4.869378,
17 | "end_time": "2021-10-01T22:23:15.402846",
18 | "exception": false,
19 | "start_time": "2021-10-01T22:23:10.533468",
20 | "status": "completed"
21 | },
22 | "tags": []
23 | },
24 | "outputs": [
25 | {
26 | "name": "stderr",
27 | "output_type": "stream",
28 | "text": [
29 | "2021-10-01 22:23:11.088050: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n"
30 | ]
31 | }
32 | ],
33 | "source": [
34 | "import numpy as np\n",
35 | "import pandas as pd\n",
36 | "import tensorflow as tf\n",
37 | "from tensorflow import keras\n",
38 | "import tensorflow_addons as tfa\n",
39 | "from sklearn.model_selection import train_test_split\n",
40 | "\n",
41 | "pad_sequences = keras.preprocessing.sequence.pad_sequences"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 2,
47 | "id": "58798487",
48 | "metadata": {
49 | "execution": {
50 | "iopub.execute_input": "2021-10-01T22:23:15.431363Z",
51 | "iopub.status.busy": "2021-10-01T22:23:15.430816Z",
52 | "iopub.status.idle": "2021-10-01T22:23:23.367293Z",
53 | "shell.execute_reply": "2021-10-01T22:23:23.367709Z",
54 | "shell.execute_reply.started": "2021-10-01T22:21:52.577835Z"
55 | },
56 | "papermill": {
57 | "duration": 7.953133,
58 | "end_time": "2021-10-01T22:23:23.367860",
59 | "exception": false,
60 | "start_time": "2021-10-01T22:23:15.414727",
61 | "status": "completed"
62 | },
63 | "tags": []
64 | },
65 | "outputs": [
66 | {
67 | "name": "stdout",
68 | "output_type": "stream",
69 | "text": [
70 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz\n",
71 | "17465344/17464789 [==============================] - 0s 0us/step\n"
72 | ]
73 | },
74 | {
75 | "name": "stderr",
76 | "output_type": "stream",
77 | "text": [
78 | ":6: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
79 | "/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/datasets/imdb.py:159: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
80 | " x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])\n",
81 | "/opt/conda/lib/python3.7/site-packages/tensorflow/python/keras/datasets/imdb.py:160: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
82 | " x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])\n"
83 | ]
84 | }
85 | ],
86 | "source": [
87 | "imdb = keras.datasets.imdb\n",
88 | "(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)"
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": 3,
94 | "id": "f455bf51",
95 | "metadata": {
96 | "execution": {
97 | "iopub.execute_input": "2021-10-01T22:23:23.402679Z",
98 | "iopub.status.busy": "2021-10-01T22:23:23.402098Z",
99 | "iopub.status.idle": "2021-10-01T22:23:23.405774Z",
100 | "shell.execute_reply": "2021-10-01T22:23:23.406649Z",
101 | "shell.execute_reply.started": "2021-10-01T22:21:59.905413Z"
102 | },
103 | "papermill": {
104 | "duration": 0.02499,
105 | "end_time": "2021-10-01T22:23:23.406784",
106 | "exception": false,
107 | "start_time": "2021-10-01T22:23:23.381794",
108 | "status": "completed"
109 | },
110 | "tags": []
111 | },
112 | "outputs": [],
113 | "source": [
114 | "train_data, val_data, train_labels, val_labels = train_test_split(train_data, \n",
115 | " train_labels, \n",
116 | " test_size=0.30, \n",
117 | " shuffle=True,\n",
118 | " random_state=0)"
119 | ]
120 | },
121 | {
122 | "cell_type": "code",
123 | "execution_count": 4,
124 | "id": "99e24975",
125 | "metadata": {
126 | "execution": {
127 | "iopub.execute_input": "2021-10-01T22:23:23.439816Z",
128 | "iopub.status.busy": "2021-10-01T22:23:23.438959Z",
129 | "iopub.status.idle": "2021-10-01T22:23:23.577004Z",
130 | "shell.execute_reply": "2021-10-01T22:23:23.576328Z",
131 | "shell.execute_reply.started": "2021-10-01T22:21:59.915091Z"
132 | },
133 | "id": "QVBYZDyfPdXl",
134 | "papermill": {
135 | "duration": 0.157055,
136 | "end_time": "2021-10-01T22:23:23.577202",
137 | "exception": false,
138 | "start_time": "2021-10-01T22:23:23.420147",
139 | "status": "completed"
140 | },
141 | "tags": []
142 | },
143 | "outputs": [
144 | {
145 | "name": "stdout",
146 | "output_type": "stream",
147 | "text": [
148 | "Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json\n",
149 | "1646592/1641221 [==============================] - 0s 0us/step\n"
150 | ]
151 | }
152 | ],
153 | "source": [
154 | "# A dictionary mapping words to an integer index\n",
155 | "word_index = imdb.get_word_index()\n",
156 | "\n",
157 | "# The first indices are reserved\n",
158 | "word_index = {k:(v+3) for k,v in word_index.items()} \n",
159 | "word_index[\"\"] = 0\n",
160 | "word_index[\"\"] = 1\n",
161 | "word_index[\"\"] = 2 # unknown\n",
162 | "word_index[\"\"] = 3\n",
163 | "\n",
164 | "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
165 | "\n",
166 | "def decode_review(text):\n",
167 | " return ' '.join([reverse_word_index.get(i, '?') for i in text])"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 5,
173 | "id": "41d71b40",
174 | "metadata": {
175 | "execution": {
176 | "iopub.execute_input": "2021-10-01T22:23:23.611155Z",
177 | "iopub.status.busy": "2021-10-01T22:23:23.610592Z",
178 | "iopub.status.idle": "2021-10-01T22:23:23.614103Z",
179 | "shell.execute_reply": "2021-10-01T22:23:23.614535Z",
180 | "shell.execute_reply.started": "2021-10-01T22:22:00.063848Z"
181 | },
182 | "papermill": {
183 | "duration": 0.022789,
184 | "end_time": "2021-10-01T22:23:23.614659",
185 | "exception": false,
186 | "start_time": "2021-10-01T22:23:23.591870",
187 | "status": "completed"
188 | },
189 | "tags": []
190 | },
191 | "outputs": [],
192 | "source": [
193 | "from tensorflow.keras.models import Sequential\n",
194 | "from tensorflow.keras.layers import LeakyReLU\n",
195 | "from tensorflow.keras.layers import Activation\n",
196 | "from tensorflow.keras.optimizers import SGD, Adam\n",
197 | "\n",
198 | "from tensorflow.keras.wrappers.scikit_learn import KerasClassifier\n",
199 | "from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint"
200 | ]
201 | },
202 | {
203 | "cell_type": "code",
204 | "execution_count": 6,
205 | "id": "58b4a563",
206 | "metadata": {
207 | "execution": {
208 | "iopub.execute_input": "2021-10-01T22:23:23.652851Z",
209 | "iopub.status.busy": "2021-10-01T22:23:23.652349Z",
210 | "iopub.status.idle": "2021-10-01T22:23:25.776167Z",
211 | "shell.execute_reply": "2021-10-01T22:23:25.777364Z",
212 | "shell.execute_reply.started": "2021-10-01T22:22:00.071705Z"
213 | },
214 | "id": "C5vN2mLLPd28",
215 | "papermill": {
216 | "duration": 2.149247,
217 | "end_time": "2021-10-01T22:23:25.777606",
218 | "exception": false,
219 | "start_time": "2021-10-01T22:23:23.628359",
220 | "status": "completed"
221 | },
222 | "tags": []
223 | },
224 | "outputs": [],
225 | "source": [
226 | "pad_length = 256\n",
227 | "\n",
228 | "train_data = pad_sequences(train_data,\n",
229 | " value=word_index[\"\"],\n",
230 | " padding='post',\n",
231 | " maxlen=pad_length)\n",
232 | "\n",
233 | "val_data = pad_sequences(val_data,\n",
234 | " value=word_index[\"\"],\n",
235 | " padding='post',\n",
236 | " maxlen=pad_length)\n",
237 | "\n",
238 | "test_data = pad_sequences(test_data,\n",
239 | " value=word_index[\"\"],\n",
240 | " padding='post',\n",
241 | " maxlen=pad_length)"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": 7,
247 | "id": "9be744bb",
248 | "metadata": {
249 | "execution": {
250 | "iopub.execute_input": "2021-10-01T22:23:25.834489Z",
251 | "iopub.status.busy": "2021-10-01T22:23:25.833591Z",
252 | "iopub.status.idle": "2021-10-01T22:23:25.838518Z",
253 | "shell.execute_reply": "2021-10-01T22:23:25.839715Z",
254 | "shell.execute_reply.started": "2021-10-01T22:22:01.412838Z"
255 | },
256 | "papermill": {
257 | "duration": 0.038694,
258 | "end_time": "2021-10-01T22:23:25.839902",
259 | "exception": false,
260 | "start_time": "2021-10-01T22:23:25.801208",
261 | "status": "completed"
262 | },
263 | "tags": []
264 | },
265 | "outputs": [],
266 | "source": [
267 | "from tensorflow.keras.layers import Dense, Dropout\n",
268 | "from tensorflow.keras.layers import Flatten, RepeatVector, dot, multiply, Permute, Lambda\n",
269 | "K = keras.backend\n",
270 | "\n",
271 | "def attention(layer):\n",
272 | " # --- Attention is all you need --- #\n",
273 | " _,_,units = layer.shape.as_list()\n",
274 | " attention = Dense(1, activation='tanh')(layer)\n",
275 | " attention = Flatten()(attention)\n",
276 | " attention = Activation('softmax')(attention)\n",
277 | " attention = RepeatVector(units)(attention)\n",
278 | " attention = Permute([2, 1])(attention)\n",
279 | " representation = multiply([layer, attention])\n",
280 | " representation = Lambda(lambda x: K.sum(x, axis=-2), \n",
281 | " output_shape=(units,))(representation)\n",
282 | " # ---------------------------------- #\n",
283 | " return representation"
284 | ]
285 | },
286 | {
287 | "cell_type": "code",
288 | "execution_count": 8,
289 | "id": "ca9322c3",
290 | "metadata": {
291 | "execution": {
292 | "iopub.execute_input": "2021-10-01T22:23:25.901790Z",
293 | "iopub.status.busy": "2021-10-01T22:23:25.900560Z",
294 | "iopub.status.idle": "2021-10-01T22:23:25.905272Z",
295 | "shell.execute_reply": "2021-10-01T22:23:25.906086Z",
296 | "shell.execute_reply.started": "2021-10-01T22:22:01.422973Z"
297 | },
298 | "papermill": {
299 | "duration": 0.043292,
300 | "end_time": "2021-10-01T22:23:25.906265",
301 | "exception": false,
302 | "start_time": "2021-10-01T22:23:25.862973",
303 | "status": "completed"
304 | },
305 | "tags": []
306 | },
307 | "outputs": [],
308 | "source": [
309 | "def get_optimizer(option=0, learning_rate=0.001):\n",
310 | " if option==0:\n",
311 | " return tf.keras.optimizers.Adam(learning_rate)\n",
312 | " elif option==1:\n",
313 | " return tf.keras.optimizers.SGD(learning_rate, momentum=0.9, nesterov=True)\n",
314 | " elif option==2:\n",
315 | " return tfa.optimizers.RectifiedAdam(learning_rate)\n",
316 | " elif option==3:\n",
317 | " return tfa.optimizers.Lookahead(tf.optimizers.Adam(learning_rate), sync_period=3)\n",
318 | " elif option==4:\n",
319 | " return tfa.optimizers.SWA(tf.optimizers.Adam(learning_rate))\n",
320 | " elif option==5:\n",
321 | " return tfa.optimizers.SWA(tf.keras.optimizers.SGD(learning_rate, momentum=0.9, nesterov=True))\n",
322 | " else:\n",
323 | " return tf.keras.optimizers.Adam(learning_rate)"
324 | ]
325 | },
326 | {
327 | "cell_type": "code",
328 | "execution_count": 9,
329 | "id": "fdaffb64",
330 | "metadata": {
331 | "execution": {
332 | "iopub.execute_input": "2021-10-01T22:23:25.963239Z",
333 | "iopub.status.busy": "2021-10-01T22:23:25.962449Z",
334 | "iopub.status.idle": "2021-10-01T22:23:25.981807Z",
335 | "shell.execute_reply": "2021-10-01T22:23:25.982500Z",
336 | "shell.execute_reply.started": "2021-10-01T22:22:01.446501Z"
337 | },
338 | "id": "1qiGmUv0dQCI",
339 | "papermill": {
340 | "duration": 0.052597,
341 | "end_time": "2021-10-01T22:23:25.982670",
342 | "exception": false,
343 | "start_time": "2021-10-01T22:23:25.930073",
344 | "status": "completed"
345 | },
346 | "tags": []
347 | },
348 | "outputs": [],
349 | "source": [
350 | "layers = keras.layers\n",
351 | "models = keras.models\n",
352 | " \n",
353 | "def create_tunable_model(hp, vocab_size=10000, pad_length=256):\n",
354 | "\n",
355 | " # Instantiate model params\n",
356 | " embedding_size = hp.Int('embedding_size', min_value=8, max_value=512, step=8)\n",
357 | " spatial_dropout = hp.Float('spatial_dropout', min_value=0, max_value=0.5, step=0.05)\n",
358 | "\n",
359 | " conv_layers = hp.Int('conv_layers', min_value=1, max_value=5, step=1)\n",
360 | " rnn_layers = hp.Int('rnn_layers', min_value=1, max_value=5, step=1)\n",
361 | " dense_layers = hp.Int('dense_layers', min_value=1, max_value=3, step=1)\n",
362 | "\n",
363 | " conv_filters = hp.Int('conv_filters', min_value=32, max_value=512, step=32)\n",
364 | " conv_kernel = hp.Int('conv_kernel', min_value=1, max_value=8, step=1)\n",
365 | "\n",
366 | " concat_dropout = hp.Float('concat_dropout', min_value=0, max_value=0.5, step=0.05)\n",
367 | " dense_dropout = hp.Float('dense_dropout', min_value=0, max_value=0.5, step=0.05)\n",
368 | "\n",
369 | " inputs = layers.Input(name='inputs',shape=[pad_length])\n",
370 | " layer = layers.Embedding(vocab_size, embedding_size, input_length=pad_length)(inputs)\n",
371 | " layer = layers.SpatialDropout1D(spatial_dropout)(layer)\n",
372 | "\n",
373 | " for l in range(conv_layers):\n",
374 | " if l==0:\n",
375 | " conv = layers.Conv1D(filters=conv_filters, kernel_size=conv_kernel, \n",
376 | " padding='valid', kernel_initializer='he_uniform')(layer)\n",
377 | " else:\n",
378 | " conv = layers.Conv1D(filters=conv_filters, kernel_size=conv_kernel, \n",
379 | " padding='valid', kernel_initializer='he_uniform')(conv) \n",
380 | "\n",
381 | " avg_pool_conv = layers.GlobalAveragePooling1D()(conv)\n",
382 | " max_pool_conv = layers.GlobalMaxPooling1D()(conv)\n",
383 | "\n",
384 | " representations = list()\n",
385 | " for l in range(rnn_layers):\n",
386 | " \n",
387 | " use_bidirectional = hp.Choice(f'use_bidirectional_{l}', values=[0, 1])\n",
388 | " use_lstm = hp.Choice(f'use_lstm_{l}', values=[0, 1])\n",
389 | " units = hp.Int(f'units_{l}', min_value=8, max_value=512, step=8)\n",
390 | "\n",
391 | " if use_lstm == 1:\n",
392 | " rnl = layers.LSTM\n",
393 | " else:\n",
394 | " rnl = layers.GRU\n",
395 | "\n",
396 | " if use_bidirectional==1:\n",
397 | " layer = layers.Bidirectional(rnl(units, return_sequences=True))(layer)\n",
398 | " else:\n",
399 | " layer = rnl(units, return_sequences=True)(layer)\n",
400 | "\n",
401 | " representations.append(attention(layer))\n",
402 | "\n",
403 | " layer = layers.concatenate(representations + [avg_pool_conv, max_pool_conv])\n",
404 | " layer = layers.Dropout(concat_dropout)(layer)\n",
405 | "\n",
406 | " for l in range(dense_layers):\n",
407 | " dense_units = hp.Int(f'dense_units_{l}', min_value=8, max_value=512, step=8)\n",
408 | " layer = layers.Dense(dense_units)(layer)\n",
409 | " layer = layers.LeakyReLU()(layer)\n",
410 | " layer = layers.Dropout(dense_dropout)(layer)\n",
411 | "\n",
412 | " layer = layers.Dense(1, name='out_layer')(layer)\n",
413 | " outputs = layers.Activation('sigmoid')(layer)\n",
414 | "\n",
415 | " model = models.Model(inputs=inputs, outputs=outputs)\n",
416 | "\n",
417 | " hp_learning_rate = hp.Choice('learning_rate', values=[0.002, 0.001, 0.0005])\n",
418 | " optimizer_type = hp.Choice('optimizer', values=list(range(6)))\n",
419 | " optimizer = get_optimizer(option=optimizer_type, learning_rate=hp_learning_rate)\n",
420 | " \n",
421 | " model.compile(optimizer=optimizer,\n",
422 | " loss='binary_crossentropy',\n",
423 | " metrics=['acc'])\n",
424 | " \n",
425 | " return model"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 10,
431 | "id": "afba0940",
432 | "metadata": {
433 | "execution": {
434 | "iopub.execute_input": "2021-10-01T22:23:26.033835Z",
435 | "iopub.status.busy": "2021-10-01T22:23:26.033016Z",
436 | "iopub.status.idle": "2021-10-01T22:23:26.089471Z",
437 | "shell.execute_reply": "2021-10-01T22:23:26.090112Z",
438 | "shell.execute_reply.started": "2021-10-01T22:22:01.466858Z"
439 | },
440 | "id": "v4AaBohkWErD",
441 | "papermill": {
442 | "duration": 0.084179,
443 | "end_time": "2021-10-01T22:23:26.090283",
444 | "exception": false,
445 | "start_time": "2021-10-01T22:23:26.006104",
446 | "status": "completed"
447 | },
448 | "tags": []
449 | },
450 | "outputs": [],
451 | "source": [
452 | "import keras_tuner as kt"
453 | ]
454 | },
455 | {
456 | "cell_type": "code",
457 | "execution_count": 11,
458 | "id": "ac393d32",
459 | "metadata": {
460 | "execution": {
461 | "iopub.execute_input": "2021-10-01T22:23:26.144616Z",
462 | "iopub.status.busy": "2021-10-01T22:23:26.143871Z",
463 | "iopub.status.idle": "2021-10-02T06:39:51.443315Z",
464 | "shell.execute_reply": "2021-10-02T06:39:51.443722Z"
465 | },
466 | "id": "3vznWJPcdQGt",
467 | "outputId": "57a3e20b-e211-4143-c553-d720c28040ac",
468 | "papermill": {
469 | "duration": 29785.329552,
470 | "end_time": "2021-10-02T06:39:51.443885",
471 | "exception": false,
472 | "start_time": "2021-10-01T22:23:26.114333",
473 | "status": "completed"
474 | },
475 | "tags": []
476 | },
477 | "outputs": [
478 | {
479 | "name": "stdout",
480 | "output_type": "stream",
481 | "text": [
482 | "Trial 100 Complete [00h 03m 42s]\n",
483 | "val_acc: 0.876800000667572\n",
484 | "\n",
485 | "Best val_acc So Far: 0.8925333619117737\n",
486 | "Total elapsed time: 08h 16m 22s\n"
487 | ]
488 | }
489 | ],
490 | "source": [
491 | "tuner = kt.BayesianOptimization(hypermodel=create_tunable_model,\n",
492 | " objective='val_acc',\n",
493 | " max_trials=100,\n",
494 | " num_initial_points=3,\n",
495 | " directory='storage',\n",
496 | " project_name='imdb',\n",
497 | " seed=42)\n",
498 | "\n",
499 | "tuner.search(train_data, train_labels, \n",
500 | " epochs=30,\n",
501 | " batch_size=64, \n",
502 | " validation_data=(val_data, val_labels),\n",
503 | " shuffle=True,\n",
504 | " verbose=2,\n",
505 | " callbacks = [EarlyStopping('val_acc', patience=3, restore_best_weights=True)]\n",
506 | " )"
507 | ]
508 | },
509 | {
510 | "cell_type": "code",
511 | "execution_count": 13,
512 | "id": "55270e2c",
513 | "metadata": {
514 | "execution": {
515 | "iopub.execute_input": "2021-10-02T06:39:51.510338Z",
516 | "iopub.status.busy": "2021-10-02T06:39:51.509805Z",
517 | "iopub.status.idle": "2021-10-02T06:39:52.579690Z",
518 | "shell.execute_reply": "2021-10-02T06:39:52.579220Z"
519 | },
520 | "id": "vYf3mlBbVX25",
521 | "papermill": {
522 | "duration": 1.087822,
523 | "end_time": "2021-10-02T06:39:52.579824",
524 | "exception": false,
525 | "start_time": "2021-10-02T06:39:51.492002",
526 | "status": "completed"
527 | },
528 | "tags": []
529 | },
530 | "outputs": [],
531 | "source": [
532 | "best_hps = tuner.get_best_hyperparameters()[0]\n",
533 | "model = tuner.hypermodel.build(best_hps)"
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": 14,
539 | "id": "2324c102",
540 | "metadata": {
541 | "execution": {
542 | "iopub.execute_input": "2021-10-02T06:39:52.614190Z",
543 | "iopub.status.busy": "2021-10-02T06:39:52.612562Z",
544 | "iopub.status.idle": "2021-10-02T06:39:52.616192Z",
545 | "shell.execute_reply": "2021-10-02T06:39:52.615736Z"
546 | },
547 | "papermill": {
548 | "duration": 0.022146,
549 | "end_time": "2021-10-02T06:39:52.616297",
550 | "exception": false,
551 | "start_time": "2021-10-02T06:39:52.594151",
552 | "status": "completed"
553 | },
554 | "tags": []
555 | },
556 | "outputs": [
557 | {
558 | "name": "stdout",
559 | "output_type": "stream",
560 | "text": [
561 | "{'embedding_size': 264, 'spatial_dropout': 0.2, 'conv_layers': 1, 'rnn_layers': 2, 'dense_layers': 1, 'conv_filters': 192, 'conv_kernel': 3, 'concat_dropout': 0.4, 'dense_dropout': 0.15000000000000002, 'use_bidirectional_0': 0, 'use_lstm_0': 0, 'units_0': 464, 'dense_units_0': 384, 'learning_rate': 0.002, 'optimizer': 3, 'use_bidirectional_1': 0, 'use_lstm_1': 1, 'units_1': 512, 'dense_units_1': 136, 'dense_units_2': 360}\n"
562 | ]
563 | }
564 | ],
565 | "source": [
566 | "print(best_hps.values)"
567 | ]
568 | },
569 | {
570 | "cell_type": "code",
571 | "execution_count": 15,
572 | "id": "a150d52f",
573 | "metadata": {
574 | "execution": {
575 | "iopub.execute_input": "2021-10-02T06:39:52.660556Z",
576 | "iopub.status.busy": "2021-10-02T06:39:52.650521Z",
577 | "iopub.status.idle": "2021-10-02T06:39:52.665019Z",
578 | "shell.execute_reply": "2021-10-02T06:39:52.664596Z"
579 | },
580 | "id": "upzApqVhdP-j",
581 | "papermill": {
582 | "duration": 0.034479,
583 | "end_time": "2021-10-02T06:39:52.665136",
584 | "exception": false,
585 | "start_time": "2021-10-02T06:39:52.630657",
586 | "status": "completed"
587 | },
588 | "tags": []
589 | },
590 | "outputs": [
591 | {
592 | "name": "stdout",
593 | "output_type": "stream",
594 | "text": [
595 | "Model: \"model\"\n",
596 | "__________________________________________________________________________________________________\n",
597 | "Layer (type) Output Shape Param # Connected to \n",
598 | "==================================================================================================\n",
599 | "inputs (InputLayer) [(None, 256)] 0 \n",
600 | "__________________________________________________________________________________________________\n",
601 | "embedding (Embedding) (None, 256, 264) 2640000 inputs[0][0] \n",
602 | "__________________________________________________________________________________________________\n",
603 | "spatial_dropout1d (SpatialDropo (None, 256, 264) 0 embedding[0][0] \n",
604 | "__________________________________________________________________________________________________\n",
605 | "gru (GRU) (None, 256, 464) 1016160 spatial_dropout1d[0][0] \n",
606 | "__________________________________________________________________________________________________\n",
607 | "lstm (LSTM) (None, 256, 512) 2000896 gru[0][0] \n",
608 | "__________________________________________________________________________________________________\n",
609 | "dense (Dense) (None, 256, 1) 465 gru[0][0] \n",
610 | "__________________________________________________________________________________________________\n",
611 | "dense_1 (Dense) (None, 256, 1) 513 lstm[0][0] \n",
612 | "__________________________________________________________________________________________________\n",
613 | "flatten (Flatten) (None, 256) 0 dense[0][0] \n",
614 | "__________________________________________________________________________________________________\n",
615 | "flatten_1 (Flatten) (None, 256) 0 dense_1[0][0] \n",
616 | "__________________________________________________________________________________________________\n",
617 | "activation (Activation) (None, 256) 0 flatten[0][0] \n",
618 | "__________________________________________________________________________________________________\n",
619 | "activation_1 (Activation) (None, 256) 0 flatten_1[0][0] \n",
620 | "__________________________________________________________________________________________________\n",
621 | "repeat_vector (RepeatVector) (None, 464, 256) 0 activation[0][0] \n",
622 | "__________________________________________________________________________________________________\n",
623 | "repeat_vector_1 (RepeatVector) (None, 512, 256) 0 activation_1[0][0] \n",
624 | "__________________________________________________________________________________________________\n",
625 | "permute (Permute) (None, 256, 464) 0 repeat_vector[0][0] \n",
626 | "__________________________________________________________________________________________________\n",
627 | "permute_1 (Permute) (None, 256, 512) 0 repeat_vector_1[0][0] \n",
628 | "__________________________________________________________________________________________________\n",
629 | "multiply (Multiply) (None, 256, 464) 0 gru[0][0] \n",
630 | " permute[0][0] \n",
631 | "__________________________________________________________________________________________________\n",
632 | "multiply_1 (Multiply) (None, 256, 512) 0 lstm[0][0] \n",
633 | " permute_1[0][0] \n",
634 | "__________________________________________________________________________________________________\n",
635 | "conv1d (Conv1D) (None, 254, 192) 152256 spatial_dropout1d[0][0] \n",
636 | "__________________________________________________________________________________________________\n",
637 | "lambda (Lambda) (None, 464) 0 multiply[0][0] \n",
638 | "__________________________________________________________________________________________________\n",
639 | "lambda_1 (Lambda) (None, 512) 0 multiply_1[0][0] \n",
640 | "__________________________________________________________________________________________________\n",
641 | "global_average_pooling1d (Globa (None, 192) 0 conv1d[0][0] \n",
642 | "__________________________________________________________________________________________________\n",
643 | "global_max_pooling1d (GlobalMax (None, 192) 0 conv1d[0][0] \n",
644 | "__________________________________________________________________________________________________\n",
645 | "concatenate (Concatenate) (None, 1360) 0 lambda[0][0] \n",
646 | " lambda_1[0][0] \n",
647 | " global_average_pooling1d[0][0] \n",
648 | " global_max_pooling1d[0][0] \n",
649 | "__________________________________________________________________________________________________\n",
650 | "dropout (Dropout) (None, 1360) 0 concatenate[0][0] \n",
651 | "__________________________________________________________________________________________________\n",
652 | "dense_2 (Dense) (None, 384) 522624 dropout[0][0] \n",
653 | "__________________________________________________________________________________________________\n",
654 | "leaky_re_lu (LeakyReLU) (None, 384) 0 dense_2[0][0] \n",
655 | "__________________________________________________________________________________________________\n",
656 | "dropout_1 (Dropout) (None, 384) 0 leaky_re_lu[0][0] \n",
657 | "__________________________________________________________________________________________________\n",
658 | "out_layer (Dense) (None, 1) 385 dropout_1[0][0] \n",
659 | "__________________________________________________________________________________________________\n",
660 | "activation_2 (Activation) (None, 1) 0 out_layer[0][0] \n",
661 | "==================================================================================================\n",
662 | "Total params: 6,333,299\n",
663 | "Trainable params: 6,333,299\n",
664 | "Non-trainable params: 0\n",
665 | "__________________________________________________________________________________________________\n"
666 | ]
667 | }
668 | ],
669 | "source": [
670 | "model.summary()"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": 16,
676 | "id": "094a66cc",
677 | "metadata": {
678 | "execution": {
679 | "iopub.execute_input": "2021-10-02T06:39:52.708408Z",
680 | "iopub.status.busy": "2021-10-02T06:39:52.707506Z",
681 | "iopub.status.idle": "2021-10-02T06:39:52.785156Z",
682 | "shell.execute_reply": "2021-10-02T06:39:52.784664Z"
683 | },
684 | "papermill": {
685 | "duration": 0.105375,
686 | "end_time": "2021-10-02T06:39:52.785298",
687 | "exception": false,
688 | "start_time": "2021-10-02T06:39:52.679923",
689 | "status": "completed"
690 | },
691 | "tags": []
692 | },
693 | "outputs": [],
694 | "source": [
695 | "model.save(\"best_model.h5\")"
696 | ]
697 | }
698 | ],
699 | "metadata": {
700 | "kernelspec": {
701 | "display_name": "Python 3",
702 | "language": "python",
703 | "name": "python3"
704 | },
705 | "language_info": {
706 | "codemirror_mode": {
707 | "name": "ipython",
708 | "version": 3
709 | },
710 | "file_extension": ".py",
711 | "mimetype": "text/x-python",
712 | "name": "python",
713 | "nbconvert_exporter": "python",
714 | "pygments_lexer": "ipython3",
715 | "version": "3.7.10"
716 | },
717 | "papermill": {
718 | "default_parameters": {},
719 | "duration": 29811.874925,
720 | "end_time": "2021-10-02T06:39:56.415559",
721 | "environment_variables": {},
722 | "exception": null,
723 | "input_path": "__notebook__.ipynb",
724 | "output_path": "__notebook__.ipynb",
725 | "parameters": {},
726 | "start_time": "2021-10-01T22:23:04.540634",
727 | "version": "2.3.3"
728 | }
729 | },
730 | "nbformat": 4,
731 | "nbformat_minor": 5
732 | }
733 |
--------------------------------------------------------------------------------
/chapter_09/README.md:
--------------------------------------------------------------------------------
1 | # Ensembling with Blending and Stacking Solutions
2 |
3 | This chapter explains techniques for ensembling multiple models, such as averaging, blending, and stacking. We provide some theory, some practice, and code examples you can use as templates when building your own solutions on Kaggle.
4 |
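5 | As a minimal sketch of the simplest of these ideas, probability averaging, here is a short example that assumes only scikit-learn; the models and data are illustrative placeholders, not the chapter's own code:
6 | 
7 | ```python
8 | import numpy as np
9 | from sklearn.datasets import make_classification
10 | from sklearn.ensemble import RandomForestClassifier
11 | from sklearn.linear_model import LogisticRegression
12 | from sklearn.metrics import roc_auc_score
13 | from sklearn.model_selection import train_test_split
14 | 
15 | # Toy data and two diverse base models (illustrative choices)
16 | X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
17 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
18 | 
19 | models = [LogisticRegression(max_iter=1000),
20 |           RandomForestClassifier(random_state=0)]
21 | 
22 | # Average the predicted positive-class probabilities of the base models
23 | probas = [m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in models]
24 | ensemble = np.mean(probas, axis=0)
25 | print(f"Averaged ROC-AUC: {roc_auc_score(y_test, ensemble):0.5f}")
26 | ```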
--------------------------------------------------------------------------------
/chapter_09/ensembling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 29,
6 | "id": "b5994ad4",
7 | "metadata": {
8 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
9 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
10 | "execution": {
11 | "iopub.execute_input": "2021-09-19T19:53:00.546001Z",
12 | "iopub.status.busy": "2021-09-19T19:53:00.545095Z",
13 | "iopub.status.idle": "2021-09-19T19:53:01.555192Z",
14 | "shell.execute_reply": "2021-09-19T19:53:01.556144Z",
15 | "shell.execute_reply.started": "2021-09-19T19:36:55.776083Z"
16 | },
17 | "papermill": {
18 | "duration": 1.052157,
19 | "end_time": "2021-09-19T19:53:01.556613",
20 | "exception": false,
21 | "start_time": "2021-09-19T19:53:00.504456",
22 | "status": "completed"
23 | },
24 | "tags": []
25 | },
26 | "outputs": [],
27 | "source": [
28 | "from sklearn.datasets import make_classification\n",
29 | "from sklearn.model_selection import train_test_split\n",
30 | "\n",
31 | "X, y = make_classification(n_samples=5000, n_features=50, \n",
32 | " n_informative=10,\n",
33 | " n_redundant=25, n_repeated=15, \n",
34 | " n_clusters_per_class=5,\n",
35 | " flip_y=0.05, class_sep=0.5, \n",
36 | " random_state=0)\n",
37 | "\n",
38 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 30,
44 | "id": "fd75371e",
45 | "metadata": {
46 | "execution": {
47 | "iopub.execute_input": "2021-09-19T19:53:01.651956Z",
48 | "iopub.status.busy": "2021-09-19T19:53:01.650857Z",
49 | "iopub.status.idle": "2021-09-19T19:53:01.655983Z",
50 | "shell.execute_reply": "2021-09-19T19:53:01.656880Z",
51 | "shell.execute_reply.started": "2021-09-19T19:36:55.805534Z"
52 | },
53 | "papermill": {
54 | "duration": 0.058079,
55 | "end_time": "2021-09-19T19:53:01.657143",
56 | "exception": false,
57 | "start_time": "2021-09-19T19:53:01.599064",
58 | "status": "completed"
59 | },
60 | "tags": []
61 | },
62 | "outputs": [
63 | {
64 | "data": {
65 | "text/plain": [
66 | "\"\\n# As an alternative to the make_classification synthetic data,\\n# you may decide to use the Madelon dataset by using the code\\n# in this commented cell\\n\\nfrom sklearn.datasets import fetch_openml\\nfrom sklearn.model_selection import train_test_split\\ndata = fetch_openml(name='madelon')\\nX = data.data.astype(float)\\nX = X / X.mean()\\ny = (data.target=='2').astype(float)\\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)\\n\""
67 | ]
68 | },
69 | "execution_count": 30,
70 | "metadata": {},
71 | "output_type": "execute_result"
72 | }
73 | ],
74 | "source": [
75 | "\"\"\"\n",
76 | "# As an alternative to the make_classification synthetic data,\n",
77 | "# you may decide to use the Madelon dataset by using the code\n",
78 | "# in this commented cell\n",
79 | "\n",
80 | "from sklearn.datasets import fetch_openml\n",
81 | "from sklearn.model_selection import train_test_split\n",
82 | "data = fetch_openml(name='madelon')\n",
83 | "X = data.data.astype(float)\n",
84 | "X = X / X.mean()\n",
85 | "y = (data.target=='2').astype(float)\n",
86 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)\n",
87 | "\"\"\""
88 | ]
89 | },
90 | {
91 | "cell_type": "code",
92 | "execution_count": 31,
93 | "id": "d972036a",
94 | "metadata": {
95 | "execution": {
96 | "iopub.execute_input": "2021-09-19T19:53:01.710840Z",
97 | "iopub.status.busy": "2021-09-19T19:53:01.710166Z",
98 | "iopub.status.idle": "2021-09-19T19:53:01.886501Z",
99 | "shell.execute_reply": "2021-09-19T19:53:01.885970Z",
100 | "shell.execute_reply.started": "2021-09-19T19:36:55.814685Z"
101 | },
102 | "papermill": {
103 | "duration": 0.203251,
104 | "end_time": "2021-09-19T19:53:01.886647",
105 | "exception": false,
106 | "start_time": "2021-09-19T19:53:01.683396",
107 | "status": "completed"
108 | },
109 | "tags": []
110 | },
111 | "outputs": [],
112 | "source": [
113 | "from sklearn.svm import SVC\n",
114 | "from sklearn.ensemble import RandomForestClassifier\n",
115 | "from sklearn.neighbors import KNeighborsClassifier\n",
116 | "from sklearn.metrics import log_loss, roc_auc_score, accuracy_score\n",
117 | "\n",
118 | "model_1 = SVC(probability=True, random_state=0)\n",
119 | "model_2 = RandomForestClassifier(random_state=0)\n",
120 | "model_3 = KNeighborsClassifier()"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 32,
126 | "id": "b4d7a61e",
127 | "metadata": {
128 | "execution": {
129 | "iopub.execute_input": "2021-09-19T19:53:01.947741Z",
130 | "iopub.status.busy": "2021-09-19T19:53:01.946910Z",
131 | "iopub.status.idle": "2021-09-19T19:53:08.844771Z",
132 | "shell.execute_reply": "2021-09-19T19:53:08.844108Z",
133 | "shell.execute_reply.started": "2021-09-19T19:36:55.828384Z"
134 | },
135 | "papermill": {
136 | "duration": 6.933691,
137 | "end_time": "2021-09-19T19:53:08.844919",
138 | "exception": false,
139 | "start_time": "2021-09-19T19:53:01.911228",
140 | "status": "completed"
141 | },
142 | "tags": []
143 | },
144 | "outputs": [
145 | {
146 | "data": {
147 | "text/plain": [
148 | "KNeighborsClassifier()"
149 | ]
150 | },
151 | "execution_count": 32,
152 | "metadata": {},
153 | "output_type": "execute_result"
154 | }
155 | ],
156 | "source": [
157 | "model_1.fit(X_train, y_train)\n",
158 | "model_2.fit(X_train, y_train)\n",
159 | "model_3.fit(X_train, y_train)"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": 33,
165 | "id": "cbfd3fe9",
166 | "metadata": {
167 | "execution": {
168 | "iopub.execute_input": "2021-09-19T19:53:08.902240Z",
169 | "iopub.status.busy": "2021-09-19T19:53:08.901254Z",
170 | "iopub.status.idle": "2021-09-19T19:53:09.956348Z",
171 | "shell.execute_reply": "2021-09-19T19:53:09.955686Z",
172 | "shell.execute_reply.started": "2021-09-19T19:37:02.706154Z"
173 | },
174 | "papermill": {
175 | "duration": 1.086808,
176 | "end_time": "2021-09-19T19:53:09.956506",
177 | "exception": false,
178 | "start_time": "2021-09-19T19:53:08.869698",
179 | "status": "completed"
180 | },
181 | "tags": []
182 | },
183 | "outputs": [],
184 | "source": [
185 | "import numpy as np\n",
186 | "from scipy.stats import mode\n",
187 | "\n",
188 | "preds = np.stack([model_1.predict(X_test),\n",
189 | " model_2.predict(X_test),\n",
190 | " model_3.predict(X_test)]).T\n",
191 | "\n",
192 | "max_voting = np.apply_along_axis(mode, 1, preds)[:,0]"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 34,
198 | "id": "cd45079d",
199 | "metadata": {
200 | "execution": {
201 | "iopub.execute_input": "2021-09-19T19:53:10.011723Z",
202 | "iopub.status.busy": "2021-09-19T19:53:10.011033Z",
203 | "iopub.status.idle": "2021-09-19T19:53:10.013722Z",
204 | "shell.execute_reply": "2021-09-19T19:53:10.014233Z",
205 | "shell.execute_reply.started": "2021-09-19T19:37:03.767203Z"
206 | },
207 | "papermill": {
208 | "duration": 0.033448,
209 | "end_time": "2021-09-19T19:53:10.014394",
210 | "exception": false,
211 | "start_time": "2021-09-19T19:53:09.980946",
212 | "status": "completed"
213 | },
214 | "tags": []
215 | },
216 | "outputs": [
217 | {
218 | "name": "stdout",
219 | "output_type": "stream",
220 | "text": [
221 | "0.24\n"
222 | ]
223 | }
224 | ],
225 | "source": [
226 | "discordant = np.sum(np.var(preds, axis=1) > 0) / len(y_test)\n",
227 | "print(f\"{discordant:0.2f}\") "
228 | ]
229 | },
230 | {
231 | "cell_type": "code",
232 | "execution_count": 35,
233 | "id": "c6e7ab06",
234 | "metadata": {
235 | "execution": {
236 | "iopub.execute_input": "2021-09-19T19:53:10.070976Z",
237 | "iopub.status.busy": "2021-09-19T19:53:10.068520Z",
238 | "iopub.status.idle": "2021-09-19T19:53:10.076497Z",
239 | "shell.execute_reply": "2021-09-19T19:53:10.075983Z",
240 | "shell.execute_reply.started": "2021-09-19T19:37:03.774166Z"
241 | },
242 | "papermill": {
243 | "duration": 0.037388,
244 | "end_time": "2021-09-19T19:53:10.076631",
245 | "exception": false,
246 | "start_time": "2021-09-19T19:53:10.039243",
247 | "status": "completed"
248 | },
249 | "tags": []
250 | },
251 | "outputs": [
252 | {
253 | "name": "stdout",
254 | "output_type": "stream",
255 | "text": [
256 | "Accuracy for model SVC is: 0.804\n",
257 | "Accuracy for model RF is: 0.793\n",
258 | "Accuracy for model KNN is: 0.805\n"
259 | ]
260 | }
261 | ],
262 | "source": [
263 | "for i, model in enumerate(['SVC', 'RF ', 'KNN']):\n",
264 | " acc = accuracy_score(y_true=y_test, y_pred=preds[:, i])\n",
265 | " print(f\"Accuracy for model {model} is: {acc:0.3f}\")"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 36,
271 | "id": "5088fff6",
272 | "metadata": {
273 | "execution": {
274 | "iopub.execute_input": "2021-09-19T19:53:10.132894Z",
275 | "iopub.status.busy": "2021-09-19T19:53:10.130437Z",
276 | "iopub.status.idle": "2021-09-19T19:53:10.135142Z",
277 | "shell.execute_reply": "2021-09-19T19:53:10.135760Z",
278 | "shell.execute_reply.started": "2021-09-19T19:37:03.791838Z"
279 | },
280 | "papermill": {
281 | "duration": 0.034314,
282 | "end_time": "2021-09-19T19:53:10.135943",
283 | "exception": false,
284 | "start_time": "2021-09-19T19:53:10.101629",
285 | "status": "completed"
286 | },
287 | "tags": []
288 | },
289 | "outputs": [
290 | {
291 | "name": "stdout",
292 | "output_type": "stream",
293 | "text": [
294 | "Accuracy for majority voting is: 0.817\n"
295 | ]
296 | }
297 | ],
298 | "source": [
299 | "max_voting_accuray = accuracy_score(y_true=y_test, y_pred=max_voting)\n",
300 | "print(f\"Accuracy for majority voting is: {max_voting_accuray:0.3f}\")"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 37,
306 | "id": "b8769a36",
307 | "metadata": {
308 | "execution": {
309 | "iopub.execute_input": "2021-09-19T19:53:10.192984Z",
310 | "iopub.status.busy": "2021-09-19T19:53:10.192323Z",
311 | "iopub.status.idle": "2021-09-19T19:53:11.000348Z",
312 | "shell.execute_reply": "2021-09-19T19:53:11.000905Z",
313 | "shell.execute_reply.started": "2021-09-19T19:37:03.804667Z"
314 | },
315 | "papermill": {
316 | "duration": 0.839711,
317 | "end_time": "2021-09-19T19:53:11.001092",
318 | "exception": false,
319 | "start_time": "2021-09-19T19:53:10.161381",
320 | "status": "completed"
321 | },
322 | "tags": []
323 | },
324 | "outputs": [],
325 | "source": [
326 | "proba = np.stack([model_1.predict_proba(X_test)[:, 1],\n",
327 | " model_2.predict_proba(X_test)[:, 1],\n",
328 | " model_3.predict_proba(X_test)[:, 1]]).T"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": 38,
334 | "id": "db12a6a7",
335 | "metadata": {
336 | "execution": {
337 | "iopub.execute_input": "2021-09-19T19:53:11.057741Z",
338 | "iopub.status.busy": "2021-09-19T19:53:11.055316Z",
339 | "iopub.status.idle": "2021-09-19T19:53:11.068234Z",
340 | "shell.execute_reply": "2021-09-19T19:53:11.068758Z",
341 | "shell.execute_reply.started": "2021-09-19T19:37:04.622064Z"
342 | },
343 | "papermill": {
344 | "duration": 0.042392,
345 | "end_time": "2021-09-19T19:53:11.068932",
346 | "exception": false,
347 | "start_time": "2021-09-19T19:53:11.026540",
348 | "status": "completed"
349 | },
350 | "tags": []
351 | },
352 | "outputs": [
353 | {
354 | "name": "stdout",
355 | "output_type": "stream",
356 | "text": [
357 | "ROC-AUC for model SVC is: 0.88126\n",
358 | "ROC-AUC for model RF is: 0.87685\n",
359 | "ROC-AUC for model KNN is: 0.87511\n"
360 | ]
361 | }
362 | ],
363 | "source": [
364 | "for i, model in enumerate(['SVC', 'RF ', 'KNN']):\n",
365 | " ras = roc_auc_score(y_true=y_test, y_score=proba[:, i])\n",
366 | " print(f\"ROC-AUC for model {model} is: {ras:0.5f}\")"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": 39,
372 | "id": "5763c515",
373 | "metadata": {
374 | "execution": {
375 | "iopub.execute_input": "2021-09-19T19:53:11.124412Z",
376 | "iopub.status.busy": "2021-09-19T19:53:11.123754Z",
377 | "iopub.status.idle": "2021-09-19T19:53:11.129682Z",
378 | "shell.execute_reply": "2021-09-19T19:53:11.130132Z",
379 | "shell.execute_reply.started": "2021-09-19T19:37:04.636518Z"
380 | },
381 | "papermill": {
382 | "duration": 0.035446,
383 | "end_time": "2021-09-19T19:53:11.130305",
384 | "exception": false,
385 | "start_time": "2021-09-19T19:53:11.094859",
386 | "status": "completed"
387 | },
388 | "tags": []
389 | },
390 | "outputs": [
391 | {
392 | "name": "stdout",
393 | "output_type": "stream",
394 | "text": [
395 | "Mean averaging ROC-AUC is: 0.90192\n"
396 | ]
397 | }
398 | ],
399 | "source": [
400 | "arithmetic = proba.mean(axis=1)\n",
401 | "ras = roc_auc_score(y_true=y_test, y_score=arithmetic)\n",
402 | "print(f\"Mean averaging ROC-AUC is: {ras:0.5f}\")"
403 | ]
404 | },
405 | {
406 | "cell_type": "code",
407 | "execution_count": 40,
408 | "id": "ccbed85a",
409 | "metadata": {
410 | "execution": {
411 | "iopub.execute_input": "2021-09-19T19:53:11.187806Z",
412 | "iopub.status.busy": "2021-09-19T19:53:11.187129Z",
413 | "iopub.status.idle": "2021-09-19T19:53:11.192851Z",
414 | "shell.execute_reply": "2021-09-19T19:53:11.192314Z",
415 | "shell.execute_reply.started": "2021-09-19T19:37:04.646724Z"
416 | },
417 | "papermill": {
418 | "duration": 0.036553,
419 | "end_time": "2021-09-19T19:53:11.192994",
420 | "exception": false,
421 | "start_time": "2021-09-19T19:53:11.156441",
422 | "status": "completed"
423 | },
424 | "tags": []
425 | },
426 | "outputs": [
427 | {
428 | "name": "stdout",
429 | "output_type": "stream",
430 | "text": [
431 | "Geometric averaging ROC-AUC is: 0.89857\n"
432 | ]
433 | }
434 | ],
435 | "source": [
436 | "geometric = proba.prod(axis=1)**(1/3)\n",
437 | "ras = roc_auc_score(y_true=y_test, y_score=geometric)\n",
438 | "print(f\"Geometric averaging ROC-AUC is: {ras:0.5f}\")"
439 | ]
440 | },
441 | {
442 | "cell_type": "code",
443 | "execution_count": 41,
444 | "id": "d1f76981",
445 | "metadata": {
446 | "execution": {
447 | "iopub.execute_input": "2021-09-19T19:53:11.249592Z",
448 | "iopub.status.busy": "2021-09-19T19:53:11.248971Z",
449 | "iopub.status.idle": "2021-09-19T19:53:11.255580Z",
450 | "shell.execute_reply": "2021-09-19T19:53:11.256053Z",
451 | "shell.execute_reply.started": "2021-09-19T19:37:04.659367Z"
452 | },
453 | "papermill": {
454 | "duration": 0.03682,
455 | "end_time": "2021-09-19T19:53:11.256222",
456 | "exception": false,
457 | "start_time": "2021-09-19T19:53:11.219402",
458 | "status": "completed"
459 | },
460 | "tags": []
461 | },
462 | "outputs": [
463 | {
464 | "name": "stdout",
465 | "output_type": "stream",
466 | "text": [
467 | "Geometric averaging ROC-AUC is: 0.89916\n"
468 | ]
469 | }
470 | ],
471 | "source": [
472 | "harmonic = 1 / np.mean(1. / (proba + 0.00001), axis=1)\n",
473 | "ras = roc_auc_score(y_true=y_test, y_score=harmonic)\n",
474 | "print(f\"Geometric averaging ROC-AUC is: {ras:0.5f}\")"
475 | ]
476 | },
477 | {
478 | "cell_type": "code",
479 | "execution_count": 66,
480 | "id": "63e39bf0",
481 | "metadata": {},
482 | "outputs": [
483 | {
484 | "name": "stdout",
485 | "output_type": "stream",
486 | "text": [
487 | "Mean of powers averaging ROC-AUC is: 0.89996\n"
488 | ]
489 | }
490 | ],
491 | "source": [
492 | "n = 3\n",
493 | "mean_of_powers = np.mean(proba**n, axis=1)**(1/n)\n",
494 | "ras = roc_auc_score(y_true=y_test, y_score=mean_of_powers)\n",
495 | "print(f\"Mean of powers averaging ROC-AUC is: {ras:0.5f}\")"
496 | ]
497 | },
498 | {
499 | "cell_type": "code",
500 | "execution_count": 45,
501 | "id": "87d398a4",
502 | "metadata": {},
503 | "outputs": [
504 | {
505 | "name": "stdout",
506 | "output_type": "stream",
507 | "text": [
508 | "Logarithmic averaging ROC-AUC is: 0.90179\n"
509 | ]
510 | }
511 | ],
512 | "source": [
513 | "logarithmic = np.expm1(np.mean(np.log1p(proba), axis=1))\n",
514 | "ras = roc_auc_score(y_true=y_test, y_score=logarithmic)\n",
515 | "print(f\"Logarithmic averaging ROC-AUC is: {ras:0.5f}\")"
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": 20,
521 | "id": "9791814a",
522 | "metadata": {
523 | "execution": {
524 | "iopub.execute_input": "2021-09-19T19:53:11.317142Z",
525 | "iopub.status.busy": "2021-09-19T19:53:11.313830Z",
526 | "iopub.status.idle": "2021-09-19T19:53:11.323748Z",
527 | "shell.execute_reply": "2021-09-19T19:53:11.323201Z",
528 | "shell.execute_reply.started": "2021-09-19T19:37:04.671686Z"
529 | },
530 | "papermill": {
531 | "duration": 0.040622,
532 | "end_time": "2021-09-19T19:53:11.323885",
533 | "exception": false,
534 | "start_time": "2021-09-19T19:53:11.283263",
535 | "status": "completed"
536 | },
537 | "tags": []
538 | },
539 | "outputs": [
540 | {
541 | "name": "stdout",
542 | "output_type": "stream",
543 | "text": [
544 | "Weighted averaging ROC-AUC is: 0.90206\n"
545 | ]
546 | }
547 | ],
548 | "source": [
549 | "cormat = np.corrcoef(proba.T)\n",
550 | "np.fill_diagonal(cormat, 0.0)\n",
551 | "W = 1 / np.mean(cormat, axis=1)\n",
552 | "W = W / sum(W) # normalizing to sum==1.0\n",
553 | "weighted = proba.dot(W)\n",
554 | "ras = roc_auc_score(y_true=y_test, y_score=weighted)\n",
555 | "print(f\"Weighted averaging ROC-AUC is: {ras:0.5f}\")"
556 | ]
557 | },
558 | {
559 | "cell_type": "code",
560 | "execution_count": 21,
561 | "id": "24d1dc68",
562 | "metadata": {
563 | "execution": {
564 | "iopub.execute_input": "2021-09-19T19:53:11.384852Z",
565 | "iopub.status.busy": "2021-09-19T19:53:11.381529Z",
566 | "iopub.status.idle": "2021-09-19T19:53:11.390582Z",
567 | "shell.execute_reply": "2021-09-19T19:53:11.391056Z",
568 | "shell.execute_reply.started": "2021-09-19T19:37:04.686934Z"
569 | },
570 | "papermill": {
571 | "duration": 0.040061,
572 | "end_time": "2021-09-19T19:53:11.391229",
573 | "exception": false,
574 | "start_time": "2021-09-19T19:53:11.351168",
575 | "status": "completed"
576 | },
577 | "tags": []
578 | },
579 | "outputs": [
580 | {
581 | "name": "stdout",
582 | "output_type": "stream",
583 | "text": [
584 | "Mean averaging ROC-AUC is: 0.90180\n"
585 | ]
586 | }
587 | ],
588 | "source": [
589 | "from sklearn.preprocessing import MinMaxScaler\n",
590 | "arithmetic = MinMaxScaler().fit_transform(proba).mean(axis=1)\n",
591 | "ras = roc_auc_score(y_true=y_test, y_score=arithmetic)\n",
592 | "print(f\"Mean averaging ROC-AUC is: {ras:0.5f}\")"
593 | ]
594 | },
595 | {
596 | "cell_type": "code",
597 | "execution_count": 22,
598 | "id": "d548eb72",
599 | "metadata": {
600 | "execution": {
601 | "iopub.execute_input": "2021-09-19T19:53:11.450375Z",
602 | "iopub.status.busy": "2021-09-19T19:53:11.449373Z",
603 | "iopub.status.idle": "2021-09-19T19:53:36.050533Z",
604 | "shell.execute_reply": "2021-09-19T19:53:36.050008Z",
605 | "shell.execute_reply.started": "2021-09-19T19:37:04.703326Z"
606 | },
607 | "papermill": {
608 | "duration": 24.631864,
609 | "end_time": "2021-09-19T19:53:36.050703",
610 | "exception": false,
611 | "start_time": "2021-09-19T19:53:11.418839",
612 | "status": "completed"
613 | },
614 | "tags": []
615 | },
616 | "outputs": [
617 | {
618 | "name": "stdout",
619 | "output_type": "stream",
620 | "text": [
621 | "FOLD 0 Mean averaging ROC-AUC is: 0.88202\n",
622 | "FOLD 1 Mean averaging ROC-AUC is: 0.87379\n",
623 | "FOLD 2 Mean averaging ROC-AUC is: 0.91092\n",
624 | "FOLD 3 Mean averaging ROC-AUC is: 0.87909\n",
625 | "FOLD 4 Mean averaging ROC-AUC is: 0.89224\n",
626 | "CV Mean averaging ROC-AUC is: 0.88761\n"
627 | ]
628 | }
629 | ],
630 | "source": [
631 | "from sklearn.model_selection import KFold\n",
632 | "\n",
633 | "kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
634 | "scores = list()\n",
635 | "\n",
636 | "for k, (train_index, test_index) in enumerate(kf.split(X_train)):\n",
637 | " model_1.fit(X_train[train_index, :], y_train[train_index])\n",
638 | " model_2.fit(X_train[train_index, :], y_train[train_index])\n",
639 | " model_3.fit(X_train[train_index, :], y_train[train_index])\n",
640 | " \n",
641 | " proba = np.stack([model_1.predict_proba(X_train[test_index, :])[:, 1],\n",
642 | " model_2.predict_proba(X_train[test_index, :])[:, 1],\n",
643 | " model_3.predict_proba(X_train[test_index, :])[:, 1]]).T\n",
644 | " \n",
645 | " arithmetic = proba.mean(axis=1)\n",
646 | " ras = roc_auc_score(y_true=y_train[test_index], y_score=arithmetic)\n",
647 | " scores.append(ras)\n",
648 | " print(f\"FOLD {k} Mean averaging ROC-AUC is: {ras:0.5f}\")\n",
649 | " \n",
650 | "print(f\"CV Mean averaging ROC-AUC is: {np.mean(scores):0.5f}\")"
651 | ]
652 | },
653 | {
654 | "cell_type": "code",
655 | "execution_count": 24,
656 | "id": "36b279fd",
657 | "metadata": {
658 | "execution": {
659 | "iopub.execute_input": "2021-09-19T19:53:36.116721Z",
660 | "iopub.status.busy": "2021-09-19T19:53:36.116042Z",
661 | "iopub.status.idle": "2021-09-19T19:53:40.575193Z",
662 | "shell.execute_reply": "2021-09-19T19:53:40.574567Z",
663 | "shell.execute_reply.started": "2021-09-19T19:37:29.529871Z"
664 | },
665 | "papermill": {
666 | "duration": 4.495479,
667 | "end_time": "2021-09-19T19:53:40.575333",
668 | "exception": false,
669 | "start_time": "2021-09-19T19:53:36.079854",
670 | "status": "completed"
671 | },
672 | "tags": []
673 | },
674 | "outputs": [],
675 | "source": [
676 | "X_blend, X_holdout, y_blend, y_holdout = train_test_split(X_train, y_train, test_size=0.25, random_state=0)\n",
677 | "\n",
678 | "model_1.fit(X_blend, y_blend)\n",
679 | "model_2.fit(X_blend, y_blend)\n",
680 | "model_3.fit(X_blend, y_blend)\n",
681 | "\n",
682 | "proba = np.stack([model_1.predict_proba(X_holdout)[:, 1],\n",
683 | " model_2.predict_proba(X_holdout)[:, 1],\n",
684 | " model_3.predict_proba(X_holdout)[:, 1]]).T"
685 | ]
686 | },
687 | {
688 | "cell_type": "code",
689 | "execution_count": 26,
690 | "id": "ff9a3f7f",
691 | "metadata": {},
692 | "outputs": [],
693 | "source": [
694 | "from sklearn.preprocessing import StandardScaler\n",
695 | "\n",
696 | "scaler = StandardScaler()\n",
697 | "proba = scaler.fit_transform(proba)"
698 | ]
699 | },
700 | {
701 | "cell_type": "code",
702 | "execution_count": 27,
703 | "id": "cfc010d9",
704 | "metadata": {
705 | "execution": {
706 | "iopub.execute_input": "2021-09-19T19:53:40.638551Z",
707 | "iopub.status.busy": "2021-09-19T19:53:40.637926Z",
708 | "iopub.status.idle": "2021-09-19T19:53:40.644748Z",
709 | "shell.execute_reply": "2021-09-19T19:53:40.644105Z",
710 | "shell.execute_reply.started": "2021-09-19T19:37:34.042872Z"
711 | },
712 | "papermill": {
713 | "duration": 0.040697,
714 | "end_time": "2021-09-19T19:53:40.644890",
715 | "exception": false,
716 | "start_time": "2021-09-19T19:53:40.604193",
717 | "status": "completed"
718 | },
719 | "tags": []
720 | },
721 | "outputs": [
722 | {
723 | "data": {
724 | "text/plain": [
725 | "LogisticRegression(solver='liblinear')"
726 | ]
727 | },
728 | "execution_count": 27,
729 | "metadata": {},
730 | "output_type": "execute_result"
731 | }
732 | ],
733 | "source": [
734 | "from sklearn.linear_model import LogisticRegression\n",
735 | "blender = LogisticRegression(solver='liblinear')\n",
736 | "blender.fit(proba, y_holdout)"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 28,
742 | "id": "a3857934",
743 | "metadata": {
744 | "execution": {
745 | "iopub.execute_input": "2021-09-19T19:53:40.708724Z",
746 | "iopub.status.busy": "2021-09-19T19:53:40.706100Z",
747 | "iopub.status.idle": "2021-09-19T19:53:40.712069Z",
748 | "shell.execute_reply": "2021-09-19T19:53:40.711360Z",
749 | "shell.execute_reply.started": "2021-09-19T19:37:34.051890Z"
750 | },
751 | "papermill": {
752 | "duration": 0.038045,
753 | "end_time": "2021-09-19T19:53:40.712221",
754 | "exception": false,
755 | "start_time": "2021-09-19T19:53:40.674176",
756 | "status": "completed"
757 | },
758 | "tags": []
759 | },
760 | "outputs": [
761 | {
762 | "name": "stdout",
763 | "output_type": "stream",
764 | "text": [
765 | "[[0.78911314 0.47202077 0.75115854]]\n"
766 | ]
767 | }
768 | ],
769 | "source": [
770 | "print(blender.coef_)"
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": 20,
776 | "id": "0dca82ee",
777 | "metadata": {
778 | "execution": {
779 | "iopub.execute_input": "2021-09-19T19:53:40.778198Z",
780 | "iopub.status.busy": "2021-09-19T19:53:40.777508Z",
781 | "iopub.status.idle": "2021-09-19T19:53:41.430060Z",
782 | "shell.execute_reply": "2021-09-19T19:53:41.429513Z",
783 | "shell.execute_reply.started": "2021-09-19T19:37:34.065878Z"
784 | },
785 | "papermill": {
786 | "duration": 0.688341,
787 | "end_time": "2021-09-19T19:53:41.430202",
788 | "exception": false,
789 | "start_time": "2021-09-19T19:53:40.741861",
790 | "status": "completed"
791 | },
792 | "tags": []
793 | },
794 | "outputs": [
795 | {
796 | "name": "stdout",
797 | "output_type": "stream",
798 | "text": [
799 | "ROC-AUC for linear blending KNN is: 0.88621\n"
800 | ]
801 | }
802 | ],
803 | "source": [
804 | "test_proba = np.stack([model_1.predict_proba(X_test)[:, 1],\n",
805 | " model_2.predict_proba(X_test)[:, 1],\n",
806 | " model_3.predict_proba(X_test)[:, 1]]).T\n",
807 | "\n",
808 | "blending = blender.predict_proba(test_proba)[:, 1]\n",
809 | "ras = roc_auc_score(y_true=y_test, y_score=blending)\n",
810 | "print(f\"ROC-AUC for linear blending {model} is: {ras:0.5f}\")"
811 | ]
812 | },
813 | {
814 | "cell_type": "code",
815 | "execution_count": 21,
816 | "id": "689e7d97",
817 | "metadata": {
818 | "execution": {
819 | "iopub.execute_input": "2021-09-19T19:53:41.533221Z",
820 | "iopub.status.busy": "2021-09-19T19:53:41.496952Z",
821 | "iopub.status.idle": "2021-09-19T19:53:42.424367Z",
822 | "shell.execute_reply": "2021-09-19T19:53:42.424909Z",
823 | "shell.execute_reply.started": "2021-09-19T19:37:34.727721Z"
824 | },
825 | "papermill": {
826 | "duration": 0.964892,
827 | "end_time": "2021-09-19T19:53:42.425083",
828 | "exception": false,
829 | "start_time": "2021-09-19T19:53:41.460191",
830 | "status": "completed"
831 | },
832 | "tags": []
833 | },
834 | "outputs": [
835 | {
836 | "name": "stdout",
837 | "output_type": "stream",
838 | "text": [
839 | "ROC-AUC for non-linear blending KNN is: 0.83862\n"
840 | ]
841 | }
842 | ],
843 | "source": [
844 | "blender = RandomForestClassifier()\n",
845 | "blender.fit(proba, y_holdout)\n",
846 | "\n",
847 | "test_proba = np.stack([model_1.predict_proba(X_test)[:, 1],\n",
848 | " model_2.predict_proba(X_test)[:, 1],\n",
849 | " model_3.predict_proba(X_test)[:, 1]]).T\n",
850 | "\n",
851 | "blending = blender.predict_proba(test_proba)[:, 1]\n",
852 | "ras = roc_auc_score(y_true=y_test, y_score=blending)\n",
853 | "print(f\"ROC-AUC for non-linear blending {model} is: {ras:0.5f}\")"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": 22,
859 | "id": "d483ef3c",
860 | "metadata": {
861 | "execution": {
862 | "iopub.execute_input": "2021-09-19T19:53:42.494830Z",
863 | "iopub.status.busy": "2021-09-19T19:53:42.494176Z",
864 | "iopub.status.idle": "2021-09-19T19:53:45.127241Z",
865 | "shell.execute_reply": "2021-09-19T19:53:45.126342Z",
866 | "shell.execute_reply.started": "2021-09-19T19:37:35.641083Z"
867 | },
868 | "papermill": {
869 | "duration": 2.671601,
870 | "end_time": "2021-09-19T19:53:45.127386",
871 | "exception": false,
872 | "start_time": "2021-09-19T19:53:42.455785",
873 | "status": "completed"
874 | },
875 | "tags": []
876 | },
877 | "outputs": [],
878 | "source": [
879 | "X_blend, X_holdout, y_blend, y_holdout = train_test_split(X_train, y_train, test_size=0.5, random_state=0)\n",
880 | "\n",
881 | "model_1.fit(X_blend, y_blend)\n",
882 | "model_2.fit(X_blend, y_blend)\n",
883 | "model_3.fit(X_blend, y_blend)\n",
884 | "\n",
885 | "proba = np.stack([model_1.predict_proba(X_holdout)[:, 1],\n",
886 | " model_2.predict_proba(X_holdout)[:, 1],\n",
887 | " model_3.predict_proba(X_holdout)[:, 1]]).T"
888 | ]
889 | },
890 | {
891 | "cell_type": "code",
892 | "execution_count": 23,
893 | "id": "0c5c393c",
894 | "metadata": {
895 | "execution": {
896 | "iopub.execute_input": "2021-09-19T19:53:45.191275Z",
897 | "iopub.status.busy": "2021-09-19T19:53:45.190635Z",
898 | "iopub.status.idle": "2021-09-19T19:53:45.744296Z",
899 | "shell.execute_reply": "2021-09-19T19:53:45.743734Z",
900 | "shell.execute_reply.started": "2021-09-19T19:37:38.247690Z"
901 | },
902 | "papermill": {
903 | "duration": 0.586706,
904 | "end_time": "2021-09-19T19:53:45.744483",
905 | "exception": false,
906 | "start_time": "2021-09-19T19:53:45.157777",
907 | "status": "completed"
908 | },
909 | "tags": []
910 | },
911 | "outputs": [
912 | {
913 | "name": "stdout",
914 | "output_type": "stream",
915 | "text": [
916 | "starting baseline is 0.50000\n",
917 | "Adding model_3 to the ensemble: ROC-AUC increases score to 0.84298\n",
918 | "Adding model_2 to the ensemble: ROC-AUC increases score to 0.86533\n",
919 | "Adding model_1 to the ensemble: ROC-AUC increases score to 0.86644\n",
920 | "Adding model_3 to the ensemble: ROC-AUC increases score to 0.86691\n",
921 | "Adding model_2 to the ensemble: ROC-AUC increases score to 0.86779\n",
922 | "Cannot improve furthermore - Stopping\n"
923 | ]
924 | }
925 | ],
926 | "source": [
927 | "iterations = 100\n",
928 | "\n",
929 | "proba = np.stack([model_1.predict_proba(X_holdout)[:, 1],\n",
930 | " model_2.predict_proba(X_holdout)[:, 1],\n",
931 | " model_3.predict_proba(X_holdout)[:, 1]]).T\n",
932 | "\n",
933 | "baseline = 0.5\n",
934 | "print(f\"starting baseline is {baseline:0.5f}\")\n",
935 | "\n",
936 | "models = []\n",
937 | "\n",
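938 | "# Greedy hill climbing with replacement: at each step, add the model that\n",
939 | "# most improves the averaged ROC-AUC; repeated picks effectively act as weights.\n",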
938 | "for i in range(iterations):\n",
939 | " challengers = list()\n",
940 | " for j in range(proba.shape[1]):\n",
941 | " new_proba = np.stack(proba[:, models + [j]])\n",
942 | " score = roc_auc_score(y_true=y_holdout, \n",
943 | " y_score=np.mean(new_proba, axis=1))\n",
944 | " challengers.append([score, j])\n",
945 | " \n",
946 | " challengers = sorted(challengers, key=lambda x: x[0], reverse=True)\n",
947 | " best_score, best_model = challengers[0]\n",
948 | " if best_score > baseline:\n",
949 | " print(f\"Adding model_{best_model+1} to the ensemble\", end=': ') \n",
950 | " print(f\"ROC-AUC increases score to {best_score:0.5f}\")\n",
951 | " models.append(best_model)\n",
952 | " baseline = best_score\n",
953 | " else:\n",
954 | " print(\"Cannot improve further - Stopping\")\n",
955 | " break"
956 | ]
957 | },
958 | {
959 | "cell_type": "code",
960 | "execution_count": 24,
961 | "id": "b00263bd",
962 | "metadata": {
963 | "execution": {
964 | "iopub.execute_input": "2021-09-19T19:53:45.810300Z",
965 | "iopub.status.busy": "2021-09-19T19:53:45.809596Z",
966 | "iopub.status.idle": "2021-09-19T19:53:45.814804Z",
967 | "shell.execute_reply": "2021-09-19T19:53:45.815353Z",
968 | "shell.execute_reply.started": "2021-09-19T19:37:38.801526Z"
969 | },
970 | "papermill": {
971 | "duration": 0.039426,
972 | "end_time": "2021-09-19T19:53:45.815519",
973 | "exception": false,
974 | "start_time": "2021-09-19T19:53:45.776093",
975 | "status": "completed"
976 | },
977 | "tags": []
978 | },
979 | "outputs": [
980 | {
981 | "name": "stdout",
982 | "output_type": "stream",
983 | "text": [
984 | "{2: 0.4, 1: 0.4, 0: 0.2}\n"
985 | ]
986 | }
987 | ],
988 | "source": [
989 | "from collections import Counter\n",
990 | "\n",
991 | "freqs = Counter(models)\n",
992 | "weights = {key: freq/len(models) for key, freq in freqs.items()}\n",
993 | "print(weights)"
994 | ]
995 | },
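996 | {
997 | "cell_type": "markdown",
998 | "metadata": {},
999 | "source": [
1000 | "A minimal sketch, assuming the probability columns follow the `model_1`..`model_3` order used above: the hill-climbing frequencies can serve as blending weights and be scored on the test set."
1001 | ]
1002 | },
1003 | {
1004 | "cell_type": "code",
1005 | "execution_count": null,
1006 | "metadata": {},
1007 | "outputs": [],
1008 | "source": [
1009 | "# Sketch: weighted average of the test probabilities using the derived weights\n",
1010 | "test_proba = np.stack([model_1.predict_proba(X_test)[:, 1],\n",
1011 | "                       model_2.predict_proba(X_test)[:, 1],\n",
1012 | "                       model_3.predict_proba(X_test)[:, 1]]).T\n",
1013 | "weighted = np.sum([test_proba[:, key] * weight\n",
1014 | "                   for key, weight in weights.items()], axis=0)\n",
1015 | "ras = roc_auc_score(y_true=y_test, y_score=weighted)\n",
1016 | "print(f\"Weighted blending ROC-AUC is: {ras:0.5f}\")"
1017 | ]
1018 | },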
996 | {
997 | "cell_type": "code",
998 | "execution_count": 25,
999 | "id": "6e9d0916",
1000 | "metadata": {
1001 | "execution": {
1002 | "iopub.execute_input": "2021-09-19T19:53:45.887399Z",
1003 | "iopub.status.busy": "2021-09-19T19:53:45.886745Z",
1004 | "iopub.status.idle": "2021-09-19T19:54:10.573509Z",
1005 | "shell.execute_reply": "2021-09-19T19:54:10.572880Z",
1006 | "shell.execute_reply.started": "2021-09-19T19:37:38.808131Z"
1007 | },
1008 | "papermill": {
1009 | "duration": 24.726948,
1010 | "end_time": "2021-09-19T19:54:10.573682",
1011 | "exception": false,
1012 | "start_time": "2021-09-19T19:53:45.846734",
1013 | "status": "completed"
1014 | },
1015 | "tags": []
1016 | },
1017 | "outputs": [],
1018 | "source": [
1019 | "from sklearn.model_selection import KFold\n",
1020 | "\n",
1021 | "kf = KFold(n_splits=5, shuffle=True, random_state=0)\n",
1022 | "scores = list()\n",
1023 | "\n",
1024 | "first_lvl_oof = np.zeros((len(X_train), 3))\n",
1025 | "first_lvl_preds = np.zeros((len(X_test), 3))\n",
1026 | "\n",
1027 | "for k, (train_index, val_index) in enumerate(kf.split(X_train)):\n",
1028 | " model_1.fit(X_train[train_index, :], y_train[train_index])\n",
1029 | " first_lvl_oof[val_index, 0] = model_1.predict_proba(X_train[val_index, :])[:, 1]\n",
1030 | " \n",
1031 | " model_2.fit(X_train[train_index, :], y_train[train_index])\n",
1032 | " first_lvl_oof[val_index, 1] = model_2.predict_proba(X_train[val_index, :])[:, 1]\n",
1033 | " \n",
1034 | " model_3.fit(X_train[train_index, :], y_train[train_index])\n",
1035 | " first_lvl_oof[val_index, 2] = model_3.predict_proba(X_train[val_index, :])[:, 1]"
1036 | ]
1037 | },
1038 | {
1039 | "cell_type": "code",
1040 | "execution_count": 26,
1041 | "id": "337c5322",
1042 | "metadata": {
1043 | "execution": {
1044 | "iopub.execute_input": "2021-09-19T19:54:10.643901Z",
1045 | "iopub.status.busy": "2021-09-19T19:54:10.643251Z",
1046 | "iopub.status.idle": "2021-09-19T19:54:18.328525Z",
1047 | "shell.execute_reply": "2021-09-19T19:54:18.327989Z",
1048 | "shell.execute_reply.started": "2021-09-19T19:38:03.642445Z"
1049 | },
1050 | "papermill": {
1051 | "duration": 7.723602,
1052 | "end_time": "2021-09-19T19:54:18.328687",
1053 | "exception": false,
1054 | "start_time": "2021-09-19T19:54:10.605085",
1055 | "status": "completed"
1056 | },
1057 | "tags": []
1058 | },
1059 | "outputs": [],
1060 | "source": [
1061 | "model_1.fit(X_train, y_train)\n",
1062 | "first_lvl_preds[:, 0] = model_1.predict_proba(X_test)[:, 1]\n",
1063 | "\n",
1064 | "model_2.fit(X_train, y_train)\n",
1065 | "first_lvl_preds[:, 1] = model_2.predict_proba(X_test)[:, 1]\n",
1066 | "\n",
1067 | "model_3.fit(X_train, y_train)\n",
1068 | "first_lvl_preds[:, 2] = model_3.predict_proba(X_test)[:, 1]"
1069 | ]
1070 | },
1071 | {
1072 | "cell_type": "code",
1073 | "execution_count": 27,
1074 | "id": "1cdd4d8d",
1075 | "metadata": {
1076 | "execution": {
1077 | "iopub.execute_input": "2021-09-19T19:54:18.399091Z",
1078 | "iopub.status.busy": "2021-09-19T19:54:18.398383Z",
1079 | "iopub.status.idle": "2021-09-19T19:54:43.019163Z",
1080 | "shell.execute_reply": "2021-09-19T19:54:43.018593Z",
1081 | "shell.execute_reply.started": "2021-09-19T19:38:11.321141Z"
1082 | },
1083 | "papermill": {
1084 | "duration": 24.659889,
1085 | "end_time": "2021-09-19T19:54:43.019324",
1086 | "exception": false,
1087 | "start_time": "2021-09-19T19:54:18.359435",
1088 | "status": "completed"
1089 | },
1090 | "tags": []
1091 | },
1092 | "outputs": [],
1093 | "source": [
1094 | "second_lvl_oof = np.zeros((len(X_train), 3))\n",
1095 | "second_lvl_preds = np.zeros((len(X_test), 3))\n",
1096 | "skip_X_train = np.hstack([X_train, first_lvl_oof])\n",
1097 | "\n",
1098 | "for k, (train_index, val_index) in enumerate(kf.split(X_train)):\n",
1099 | " model_1.fit(skip_X_train[train_index, :], y_train[train_index])\n",
1100 | " second_lvl_oof[val_index, 0] = model_1.predict_proba(skip_X_train[val_index, :])[:, 1]\n",
1101 | " \n",
1102 | " model_2.fit(skip_X_train[train_index, :], y_train[train_index])\n",
1103 | " second_lvl_oof[val_index, 1] = model_2.predict_proba(skip_X_train[val_index, :])[:, 1]\n",
1104 | " \n",
1105 | " model_3.fit(skip_X_train[train_index, :], y_train[train_index])\n",
1106 | " second_lvl_oof[val_index, 2] = model_3.predict_proba(skip_X_train[val_index, :])[:, 1]"
1107 | ]
1108 | },
1109 | {
1110 | "cell_type": "code",
1111 | "execution_count": 28,
1112 | "id": "e9f454b8",
1113 | "metadata": {
1114 | "execution": {
1115 | "iopub.execute_input": "2021-09-19T19:54:43.087800Z",
1116 | "iopub.status.busy": "2021-09-19T19:54:43.087125Z",
1117 | "iopub.status.idle": "2021-09-19T19:54:50.759678Z",
1118 | "shell.execute_reply": "2021-09-19T19:54:50.760214Z",
1119 | "shell.execute_reply.started": "2021-09-19T19:38:36.043148Z"
1120 | },
1121 | "papermill": {
1122 | "duration": 7.710139,
1123 | "end_time": "2021-09-19T19:54:50.760457",
1124 | "exception": false,
1125 | "start_time": "2021-09-19T19:54:43.050318",
1126 | "status": "completed"
1127 | },
1128 | "tags": []
1129 | },
1130 | "outputs": [],
1131 | "source": [
1132 | "skip_X_test = np.hstack([X_test, fist_lvl_preds])\n",
1133 | "\n",
1134 | "model_1.fit(skip_X_train, y_train)\n",
1135 | "second_lvl_preds[:, 0] = model_1.predict_proba(skip_X_test)[:, 1]\n",
1136 | "\n",
1137 | "model_2.fit(skip_X_train, y_train)\n",
1138 | "second_lvl_preds[:, 1] = model_2.predict_proba(skip_X_test)[:, 1]\n",
1139 | "\n",
1140 | "model_3.fit(skip_X_train, y_train)\n",
1141 | "second_lvl_preds[:, 2] = model_3.predict_proba(skip_X_test)[:, 1]"
1142 | ]
1143 | },
1144 | {
1145 | "cell_type": "code",
1146 | "execution_count": 29,
1147 | "id": "c140496e",
1148 | "metadata": {
1149 | "execution": {
1150 | "iopub.execute_input": "2021-09-19T19:54:50.832068Z",
1151 | "iopub.status.busy": "2021-09-19T19:54:50.831395Z",
1152 | "iopub.status.idle": "2021-09-19T19:54:50.837431Z",
1153 | "shell.execute_reply": "2021-09-19T19:54:50.836918Z",
1154 | "shell.execute_reply.started": "2021-09-19T19:38:43.714635Z"
1155 | },
1156 | "papermill": {
1157 | "duration": 0.04513,
1158 | "end_time": "2021-09-19T19:54:50.837572",
1159 | "exception": false,
1160 | "start_time": "2021-09-19T19:54:50.792442",
1161 | "status": "completed"
1162 | },
1163 | "tags": []
1164 | },
1165 | "outputs": [
1166 | {
1167 | "name": "stdout",
1168 | "output_type": "stream",
1169 | "text": [
1170 | "Stacking ROC-AUC is: 0.90424\n"
1171 | ]
1172 | }
1173 | ],
1174 | "source": [
1175 | "arithmetic = second_lvl_preds.mean(axis=1)\n",
1176 | "ras = roc_auc_score(y_true=y_test, y_score=arithmetic)\n",
1177 | "scores.append(ras)\n",
1178 | "print(f\"Stacking ROC-AUC is: {ras:0.5f}\")"
1179 | ]
1180 | }
1181 | ],
1182 | "metadata": {
1183 | "kernelspec": {
1184 | "display_name": "Python 3",
1185 | "language": "python",
1186 | "name": "python3"
1187 | },
1188 | "language_info": {
1189 | "codemirror_mode": {
1190 | "name": "ipython",
1191 | "version": 3
1192 | },
1193 | "file_extension": ".py",
1194 | "mimetype": "text/x-python",
1195 | "name": "python",
1196 | "nbconvert_exporter": "python",
1197 | "pygments_lexer": "ipython3",
1198 | "version": "3.8.9"
1199 | },
1200 | "papermill": {
1201 | "default_parameters": {},
1202 | "duration": 119.797966,
1203 | "end_time": "2021-09-19T19:54:52.277615",
1204 | "environment_variables": {},
1205 | "exception": null,
1206 | "input_path": "__notebook__.ipynb",
1207 | "output_path": "__notebook__.ipynb",
1208 | "parameters": {},
1209 | "start_time": "2021-09-19T19:52:52.479649",
1210 | "version": "2.3.3"
1211 | }
1212 | },
1213 | "nbformat": 4,
1214 | "nbformat_minor": 5
1215 | }
1216 |
--------------------------------------------------------------------------------
/chapter_10/README.md:
--------------------------------------------------------------------------------
1 | # Modeling for Computer Vision
2 |
3 | In this chapter, we discuss problems related to computer vision, one of the most popular topics in AI in general, and on Kaggle specifically. We demonstrate full pipelines for building solutions to challenges in image classification, object detection, and image segmentation.
4 |
--------------------------------------------------------------------------------
/chapter_11/README.md:
--------------------------------------------------------------------------------
1 | # Modeling for NLP
2 |
3 | This chapter focuses on the frequently encountered types of Kaggle challenges related to natural language processing. We demonstrate how to build an end-to-end solution for popular problems like open domain question answering.
4 |
--------------------------------------------------------------------------------
/chapter_11/chapter11-nlp-augmentation1.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "papermill": {
7 | "duration": 0.026049,
8 | "end_time": "2022-01-20T13:42:48.322998",
9 | "exception": false,
10 | "start_time": "2022-01-20T13:42:48.296949",
11 | "status": "completed"
12 | },
13 | "tags": []
14 | },
15 | "source": [
16 | " # Augmentations in NLP\n",
17 | "\n",
18 | "Data Augmentation techniques in NLP show substantial improvements on datasets with less than 500 observations, as illustrated by the original paper.\n",
19 | "\n",
20 | "https://arxiv.org/abs/1901.11196\n",
21 | "\n",
22 | "The Paper Considered here is EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks\n",
23 | "\n",
24 | "\n"
25 | ]
26 | },
27 | {
28 | "cell_type": "code",
29 | "execution_count": 1,
30 | "metadata": {
31 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
32 | "_kg_hide-output": true,
33 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
34 | "execution": {
35 | "iopub.execute_input": "2022-01-20T13:42:48.375525Z",
36 | "iopub.status.busy": "2022-01-20T13:42:48.371750Z",
37 | "iopub.status.idle": "2022-01-20T13:42:48.384611Z",
38 | "shell.execute_reply": "2022-01-20T13:42:48.383761Z",
39 | "shell.execute_reply.started": "2021-10-30T20:27:45.698216Z"
40 | },
41 | "papermill": {
42 | "duration": 0.038939,
43 | "end_time": "2022-01-20T13:42:48.384789",
44 | "exception": false,
45 | "start_time": "2022-01-20T13:42:48.345850",
46 | "status": "completed"
47 | },
48 | "tags": []
49 | },
50 | "outputs": [
51 | {
52 | "name": "stdout",
53 | "output_type": "stream",
54 | "text": [
55 | "/kaggle/input/tweet-sentiment-extraction/sample_submission.csv\n",
56 | "/kaggle/input/tweet-sentiment-extraction/train.csv\n",
57 | "/kaggle/input/tweet-sentiment-extraction/test.csv\n"
58 | ]
59 | }
60 | ],
61 | "source": [
62 | "# This Python 3 environment comes with many helpful analytics libraries installed\n",
63 | "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n",
64 | "# For example, here's several helpful packages to load\n",
65 | "\n",
66 | "import numpy as np # linear algebra\n",
67 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
68 | "\n",
69 | "# Input data files are available in the read-only \"../input/\" directory\n",
70 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n",
71 | "\n",
72 | "import os\n",
73 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n",
74 | " for filename in filenames:\n",
75 | " print(os.path.join(dirname, filename))\n",
76 | "\n",
77 | "# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n",
78 | "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session"
79 | ]
80 | },
81 | {
82 | "cell_type": "markdown",
83 | "metadata": {
84 | "papermill": {
85 | "duration": 0.023217,
86 | "end_time": "2022-01-20T13:42:48.433644",
87 | "exception": false,
88 | "start_time": "2022-01-20T13:42:48.410427",
89 | "status": "completed"
90 | },
91 | "tags": []
92 | },
93 | "source": [
94 | "# ***Simple Data Augmentatons Techniques* are:**\n",
95 | "1. SR : Synonym Replacement \n",
96 | "2. RD : Random Deletion\n",
97 | "3. RS : Random Swap\n",
98 | "4. RI : Random Insertion\n",
99 | "\n"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 2,
105 | "metadata": {
106 | "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
107 | "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
108 | "execution": {
109 | "iopub.execute_input": "2022-01-20T13:42:48.483430Z",
110 | "iopub.status.busy": "2022-01-20T13:42:48.482425Z",
111 | "iopub.status.idle": "2022-01-20T13:42:48.671855Z",
112 | "shell.execute_reply": "2022-01-20T13:42:48.671164Z",
113 | "shell.execute_reply.started": "2021-10-30T20:28:39.115622Z"
114 | },
115 | "papermill": {
116 | "duration": 0.215349,
117 | "end_time": "2022-01-20T13:42:48.671972",
118 | "exception": false,
119 | "start_time": "2022-01-20T13:42:48.456623",
120 | "status": "completed"
121 | },
122 | "tags": []
123 | },
124 | "outputs": [],
125 | "source": [
126 | "data = pd.read_csv('../input/tweet-sentiment-extraction/train.csv')"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 3,
132 | "metadata": {
133 | "execution": {
134 | "iopub.execute_input": "2022-01-20T13:42:48.723181Z",
135 | "iopub.status.busy": "2022-01-20T13:42:48.722197Z",
136 | "iopub.status.idle": "2022-01-20T13:42:48.742206Z",
137 | "shell.execute_reply": "2022-01-20T13:42:48.742726Z",
138 | "shell.execute_reply.started": "2021-10-30T20:28:40.860125Z"
139 | },
140 | "papermill": {
141 | "duration": 0.047323,
142 | "end_time": "2022-01-20T13:42:48.742872",
143 | "exception": false,
144 | "start_time": "2022-01-20T13:42:48.695549",
145 | "status": "completed"
146 | },
147 | "tags": []
148 | },
149 | "outputs": [
150 | {
151 | "data": {
152 | "text/html": [
153 | "\n",
154 | "\n",
167 | "
\n",
168 | " \n",
169 | " \n",
170 | " | \n",
171 | " textID | \n",
172 | " text | \n",
173 | " selected_text | \n",
174 | " sentiment | \n",
175 | "
\n",
176 | " \n",
177 | " \n",
178 | " \n",
179 | " 0 | \n",
180 | " cb774db0d1 | \n",
181 | " I`d have responded, if I were going | \n",
182 | " I`d have responded, if I were going | \n",
183 | " neutral | \n",
184 | "
\n",
185 | " \n",
186 | " 1 | \n",
187 | " 549e992a42 | \n",
188 | " Sooo SAD I will miss you here in San Diego!!! | \n",
189 | " Sooo SAD | \n",
190 | " negative | \n",
191 | "
\n",
192 | " \n",
193 | " 2 | \n",
194 | " 088c60f138 | \n",
195 | " my boss is bullying me... | \n",
196 | " bullying me | \n",
197 | " negative | \n",
198 | "
\n",
199 | " \n",
200 | " 3 | \n",
201 | " 9642c003ef | \n",
202 | " what interview! leave me alone | \n",
203 | " leave me alone | \n",
204 | " negative | \n",
205 | "
\n",
206 | " \n",
207 | " 4 | \n",
208 | " 358bd9e861 | \n",
209 | " Sons of ****, why couldn`t they put them on t... | \n",
210 | " Sons of ****, | \n",
211 | " negative | \n",
212 | "
\n",
213 | " \n",
214 | "
\n",
215 | "
"
216 | ],
217 | "text/plain": [
218 | " textID text \\\n",
219 | "0 cb774db0d1 I`d have responded, if I were going \n",
220 | "1 549e992a42 Sooo SAD I will miss you here in San Diego!!! \n",
221 | "2 088c60f138 my boss is bullying me... \n",
222 | "3 9642c003ef what interview! leave me alone \n",
223 | "4 358bd9e861 Sons of ****, why couldn`t they put them on t... \n",
224 | "\n",
225 | " selected_text sentiment \n",
226 | "0 I`d have responded, if I were going neutral \n",
227 | "1 Sooo SAD negative \n",
228 | "2 bullying me negative \n",
229 | "3 leave me alone negative \n",
230 | "4 Sons of ****, negative "
231 | ]
232 | },
233 | "execution_count": 3,
234 | "metadata": {},
235 | "output_type": "execute_result"
236 | }
237 | ],
238 | "source": [
239 | "data.head()"
240 | ]
241 | },
242 | {
243 | "cell_type": "code",
244 | "execution_count": 4,
245 | "metadata": {
246 | "execution": {
247 | "iopub.execute_input": "2022-01-20T13:42:48.793883Z",
248 | "iopub.status.busy": "2022-01-20T13:42:48.792856Z",
249 | "iopub.status.idle": "2022-01-20T13:42:48.802902Z",
250 | "shell.execute_reply": "2022-01-20T13:42:48.803522Z",
251 | "shell.execute_reply.started": "2021-10-30T20:28:45.014361Z"
252 | },
253 | "papermill": {
254 | "duration": 0.037286,
255 | "end_time": "2022-01-20T13:42:48.803669",
256 | "exception": false,
257 | "start_time": "2022-01-20T13:42:48.766383",
258 | "status": "completed"
259 | },
260 | "tags": []
261 | },
262 | "outputs": [],
263 | "source": [
264 | "list_to_drop = ['textID','selected_text','sentiment']\n",
265 | "data.drop(list_to_drop,axis=1,inplace=True)"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 5,
271 | "metadata": {
272 | "execution": {
273 | "iopub.execute_input": "2022-01-20T13:42:48.858761Z",
274 | "iopub.status.busy": "2022-01-20T13:42:48.857176Z",
275 | "iopub.status.idle": "2022-01-20T13:42:48.869115Z",
276 | "shell.execute_reply": "2022-01-20T13:42:48.869634Z",
277 | "shell.execute_reply.started": "2021-10-30T20:28:53.374962Z"
278 | },
279 | "papermill": {
280 | "duration": 0.042016,
281 | "end_time": "2022-01-20T13:42:48.869790",
282 | "exception": false,
283 | "start_time": "2022-01-20T13:42:48.827774",
284 | "status": "completed"
285 | },
286 | "tags": []
287 | },
288 | "outputs": [
289 | {
290 | "data": {
291 | "text/html": [
292 | "\n",
293 | "\n",
306 | "
\n",
307 | " \n",
308 | " \n",
309 | " | \n",
310 | " text | \n",
311 | "
\n",
312 | " \n",
313 | " \n",
314 | " \n",
315 | " 0 | \n",
316 | " I`d have responded, if I were going | \n",
317 | "
\n",
318 | " \n",
319 | " 1 | \n",
320 | " Sooo SAD I will miss you here in San Diego!!! | \n",
321 | "
\n",
322 | " \n",
323 | " 2 | \n",
324 | " my boss is bullying me... | \n",
325 | "
\n",
326 | " \n",
327 | " 3 | \n",
328 | " what interview! leave me alone | \n",
329 | "
\n",
330 | " \n",
331 | " 4 | \n",
332 | " Sons of ****, why couldn`t they put them on t... | \n",
333 | "
\n",
334 | " \n",
335 | "
\n",
336 | "
"
337 | ],
338 | "text/plain": [
339 | " text\n",
340 | "0 I`d have responded, if I were going\n",
341 | "1 Sooo SAD I will miss you here in San Diego!!!\n",
342 | "2 my boss is bullying me...\n",
343 | "3 what interview! leave me alone\n",
344 | "4 Sons of ****, why couldn`t they put them on t..."
345 | ]
346 | },
347 | "execution_count": 5,
348 | "metadata": {},
349 | "output_type": "execute_result"
350 | }
351 | ],
352 | "source": [
353 | "data.head()"
354 | ]
355 | },
356 | {
357 | "cell_type": "code",
358 | "execution_count": 6,
359 | "metadata": {
360 | "execution": {
361 | "iopub.execute_input": "2022-01-20T13:42:48.920753Z",
362 | "iopub.status.busy": "2022-01-20T13:42:48.920132Z",
363 | "iopub.status.idle": "2022-01-20T13:42:48.925841Z",
364 | "shell.execute_reply": "2022-01-20T13:42:48.926345Z"
365 | },
366 | "papermill": {
367 | "duration": 0.032704,
368 | "end_time": "2022-01-20T13:42:48.926499",
369 | "exception": false,
370 | "start_time": "2022-01-20T13:42:48.893795",
371 | "status": "completed"
372 | },
373 | "tags": []
374 | },
375 | "outputs": [
376 | {
377 | "name": "stdout",
378 | "output_type": "stream",
379 | "text": [
380 | "Total number of examples to be used is : 27481\n"
381 | ]
382 | }
383 | ],
384 | "source": [
385 | "print(f\"Total number of examples to be used is : {len(data)}\")"
386 | ]
387 | },
388 | {
389 | "cell_type": "markdown",
390 | "metadata": {
391 | "papermill": {
392 | "duration": 0.023903,
393 | "end_time": "2022-01-20T13:42:48.974896",
394 | "exception": false,
395 | "start_time": "2022-01-20T13:42:48.950993",
396 | "status": "completed"
397 | },
398 | "tags": []
399 | },
400 | "source": [
401 | "# 1. Synonym Replacement :\n",
402 | "\n",
403 | "Synonym replacement is a technique in which we replace a word by one of its synonyms\n",
404 | "\n",
405 | "For identifying relevent Synonyms we use WordNet"
406 | ]
407 | },
408 | {
409 | "cell_type": "code",
410 | "execution_count": null,
411 | "metadata": {
412 | "execution": {
413 | "iopub.execute_input": "2021-10-30T20:28:57.741387Z",
414 | "iopub.status.busy": "2021-10-30T20:28:57.740759Z",
415 | "iopub.status.idle": "2021-10-30T20:28:59.44851Z",
416 | "shell.execute_reply": "2021-10-30T20:28:59.4473Z",
417 | "shell.execute_reply.started": "2021-10-30T20:28:57.741353Z"
418 | },
419 | "papermill": {
420 | "duration": 0.02503,
421 | "end_time": "2022-01-20T13:42:49.024269",
422 | "exception": false,
423 | "start_time": "2022-01-20T13:42:48.999239",
424 | "status": "completed"
425 | },
426 | "tags": []
427 | },
428 | "outputs": [],
429 | "source": []
430 | },
431 | {
432 | "cell_type": "markdown",
433 | "metadata": {
434 | "papermill": {
435 | "duration": 0.024967,
436 | "end_time": "2022-01-20T13:42:49.074263",
437 | "exception": false,
438 | "start_time": "2022-01-20T13:42:49.049296",
439 | "status": "completed"
440 | },
441 | "tags": []
442 | },
443 | "source": [
444 | "The get_synonyms funtion will return pre-processed list of synonyms of given word\n",
445 | "\n",
446 | "Now we will replace the words with synonyms"
447 | ]
448 | },
449 | {
450 | "cell_type": "code",
451 | "execution_count": 7,
452 | "metadata": {
453 | "_kg_hide-output": true,
454 | "execution": {
455 | "iopub.execute_input": "2022-01-20T13:42:49.127362Z",
456 | "iopub.status.busy": "2022-01-20T13:42:49.126363Z",
457 | "iopub.status.idle": "2022-01-20T13:42:50.804321Z",
458 | "shell.execute_reply": "2022-01-20T13:42:50.804955Z",
459 | "shell.execute_reply.started": "2021-10-30T20:29:04.106201Z"
460 | },
461 | "papermill": {
462 | "duration": 1.70632,
463 | "end_time": "2022-01-20T13:42:50.805141",
464 | "exception": false,
465 | "start_time": "2022-01-20T13:42:49.098821",
466 | "status": "completed"
467 | },
468 | "tags": []
469 | },
470 | "outputs": [
471 | {
472 | "name": "stdout",
473 | "output_type": "stream",
474 | "text": [
475 | "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
476 | ]
477 | }
478 | ],
479 | "source": [
480 | "from nltk.corpus import stopwords\n",
481 | "stop_words = []\n",
482 | "for w in stopwords.words('english'):\n",
483 | " stop_words.append(w)\n",
484 | "print(stop_words)"
485 | ]
486 | },
487 | {
488 | "cell_type": "code",
489 | "execution_count": 8,
490 | "metadata": {
491 | "execution": {
492 | "iopub.execute_input": "2022-01-20T13:42:50.858973Z",
493 | "iopub.status.busy": "2022-01-20T13:42:50.857974Z",
494 | "iopub.status.idle": "2022-01-20T13:42:50.867549Z",
495 | "shell.execute_reply": "2022-01-20T13:42:50.868167Z",
496 | "shell.execute_reply.started": "2021-10-30T20:29:08.154452Z"
497 | },
498 | "papermill": {
499 | "duration": 0.038282,
500 | "end_time": "2022-01-20T13:42:50.868321",
501 | "exception": false,
502 | "start_time": "2022-01-20T13:42:50.830039",
503 | "status": "completed"
504 | },
505 | "tags": []
506 | },
507 | "outputs": [],
508 | "source": [
509 | "import random\n",
510 | "from nltk.corpus import wordnet\n"
511 | ]
512 | },
513 | {
514 | "cell_type": "code",
515 | "execution_count": 9,
516 | "metadata": {
517 | "execution": {
518 | "iopub.execute_input": "2022-01-20T13:42:50.922246Z",
519 | "iopub.status.busy": "2022-01-20T13:42:50.921239Z",
520 | "iopub.status.idle": "2022-01-20T13:42:50.928950Z",
521 | "shell.execute_reply": "2022-01-20T13:42:50.929598Z"
522 | },
523 | "papermill": {
524 | "duration": 0.036512,
525 | "end_time": "2022-01-20T13:42:50.929749",
526 | "exception": false,
527 | "start_time": "2022-01-20T13:42:50.893237",
528 | "status": "completed"
529 | },
530 | "tags": []
531 | },
532 | "outputs": [],
533 | "source": [
534 | "\n",
535 | "def get_synonyms(word):\n",
536 | " \n",
537 | " synonyms = set()\n",
538 | " \n",
539 | " for syn in wordnet.synsets(word):\n",
540 | " for l in syn.lemmas():\n",
541 | " synonym = l.name().replace(\"_\", \" \").replace(\"-\", \" \").lower()\n",
542 | " synonym = \"\".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])\n",
543 | " synonyms.add(synonym) \n",
544 | " if word in synonyms:\n",
545 | " synonyms.remove(word)\n",
546 | " \n",
547 | " return list(synonyms)"
548 | ]
549 | },
550 | {
551 | "cell_type": "code",
552 | "execution_count": 10,
553 | "metadata": {
554 | "execution": {
555 | "iopub.execute_input": "2022-01-20T13:42:50.990343Z",
556 | "iopub.status.busy": "2022-01-20T13:42:50.989619Z",
557 | "iopub.status.idle": "2022-01-20T13:42:50.992573Z",
558 | "shell.execute_reply": "2022-01-20T13:42:50.992063Z",
559 | "shell.execute_reply.started": "2021-10-30T20:29:12.301235Z"
560 | },
561 | "papermill": {
562 | "duration": 0.037801,
563 | "end_time": "2022-01-20T13:42:50.992688",
564 | "exception": false,
565 | "start_time": "2022-01-20T13:42:50.954887",
566 | "status": "completed"
567 | },
568 | "tags": []
569 | },
570 | "outputs": [],
571 | "source": [
572 | "def synonym_replacement(words, n): \n",
573 | " words = words.split() \n",
574 | " new_words = words.copy()\n",
575 | " random_word_list = list(set([word for word in words if word not in stop_words]))\n",
576 | " random.shuffle(random_word_list)\n",
577 | " num_replaced = 0\n",
578 | " \n",
579 | " for random_word in random_word_list:\n",
580 | " synonyms = get_synonyms(random_word)\n",
581 | " \n",
582 | " if len(synonyms) >= 1:\n",
583 | " synonym = random.choice(list(synonyms))\n",
584 | " new_words = [synonym if word == random_word else word for word in new_words]\n",
585 | " num_replaced += 1\n",
586 | " \n",
587 | " if num_replaced >= n: #only replace up to n words\n",
588 | " break\n",
589 | " sentence = ' '.join(new_words)\n",
590 | " return sentence"
591 | ]
592 | },
593 | {
594 | "cell_type": "code",
595 | "execution_count": 11,
596 | "metadata": {
597 | "execution": {
598 | "iopub.execute_input": "2022-01-20T13:42:51.048587Z",
599 | "iopub.status.busy": "2022-01-20T13:42:51.047913Z",
600 | "iopub.status.idle": "2022-01-20T13:42:53.248433Z",
601 | "shell.execute_reply": "2022-01-20T13:42:53.249126Z",
602 | "shell.execute_reply.started": "2021-10-30T20:30:20.915417Z"
603 | },
604 | "papermill": {
605 | "duration": 2.231642,
606 | "end_time": "2022-01-20T13:42:53.249329",
607 | "exception": false,
608 | "start_time": "2022-01-20T13:42:51.017687",
609 | "status": "completed"
610 | },
611 | "tags": []
612 | },
613 | "outputs": [
614 | {
615 | "name": "stdout",
616 | "output_type": "stream",
617 | "text": [
618 | " Example of Synonym Replacement: The spry brown university fox jumpstart over the lazy detent\n"
619 | ]
620 | }
621 | ],
622 | "source": [
623 | "print(f\" Example of Synonym Replacement: {synonym_replacement('The quick brown fox jumps over the lazy dog',4)}\")"
624 | ]
625 | },
626 | {
627 | "cell_type": "markdown",
628 | "metadata": {
629 | "papermill": {
630 | "duration": 0.025037,
631 | "end_time": "2022-01-20T13:42:53.300284",
632 | "exception": false,
633 | "start_time": "2022-01-20T13:42:53.275247",
634 | "status": "completed"
635 | },
636 | "tags": []
637 | },
638 | "source": [
639 | "To Get Larger Diversity of Sentences we could try replacing 1,2 3, .. Words in the given sentence.\n",
640 | "\n",
641 | "Now lets get an example from out dataset and try augmenting it so that we could create 3 additional sentences per tweet "
642 | ]
643 | },
644 | {
645 | "cell_type": "code",
646 | "execution_count": 12,
647 | "metadata": {
648 | "execution": {
649 | "iopub.execute_input": "2022-01-20T13:42:53.355887Z",
650 | "iopub.status.busy": "2022-01-20T13:42:53.354846Z",
651 | "iopub.status.idle": "2022-01-20T13:42:53.363428Z",
652 | "shell.execute_reply": "2022-01-20T13:42:53.362892Z",
653 | "shell.execute_reply.started": "2021-10-30T20:30:39.213323Z"
654 | },
655 | "papermill": {
656 | "duration": 0.037159,
657 | "end_time": "2022-01-20T13:42:53.363547",
658 | "exception": false,
659 | "start_time": "2022-01-20T13:42:53.326388",
660 | "status": "completed"
661 | },
662 | "tags": []
663 | },
664 | "outputs": [
665 | {
666 | "name": "stdout",
667 | "output_type": "stream",
668 | "text": [
669 | "the free fillin` app on my ipod is fun, im addicted\n"
670 | ]
671 | }
672 | ],
673 | "source": [
674 | "trial_sent = data['text'][25]\n",
675 | "print(trial_sent)\n"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": 13,
681 | "metadata": {
682 | "execution": {
683 | "iopub.execute_input": "2022-01-20T13:42:53.419866Z",
684 | "iopub.status.busy": "2022-01-20T13:42:53.419144Z",
685 | "iopub.status.idle": "2022-01-20T13:42:53.442262Z",
686 | "shell.execute_reply": "2022-01-20T13:42:53.441671Z",
687 | "shell.execute_reply.started": "2021-10-30T20:30:42.115145Z"
688 | },
689 | "papermill": {
690 | "duration": 0.053023,
691 | "end_time": "2022-01-20T13:42:53.442379",
692 | "exception": false,
693 | "start_time": "2022-01-20T13:42:53.389356",
694 | "status": "completed"
695 | },
696 | "tags": []
697 | },
698 | "outputs": [
699 | {
700 | "name": "stdout",
701 | "output_type": "stream",
702 | "text": [
703 | " Example of Synonym Replacement: the free fillin` app on my ipod is fun, im addict\n",
704 | " Example of Synonym Replacement: the innocent fillin` app on my ipod is fun, im addicted\n",
705 | " Example of Synonym Replacement: the relinquish fillin` app on my ipod is fun, im addict\n"
706 | ]
707 | }
708 | ],
709 | "source": [
710 | "# Create 3 Augmented Sentences per data \n",
711 | "\n",
712 | "for n in range(3):\n",
713 | " print(f\" Example of Synonym Replacement: {synonym_replacement(trial_sent,n)}\")"
714 | ]
715 | },
716 | {
717 | "cell_type": "markdown",
718 | "metadata": {
719 | "papermill": {
720 | "duration": 0.025845,
721 | "end_time": "2022-01-20T13:42:53.494461",
722 | "exception": false,
723 | "start_time": "2022-01-20T13:42:53.468616",
724 | "status": "completed"
725 | },
726 | "tags": []
727 | },
728 | "source": [
729 | "Now we are able to augment this Data :)\n",
730 | "\n",
731 | "You can create New colums for the Same text-id in our tweet - sentiment Dataset"
732 | ]
733 | },
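734 | {
735 | "cell_type": "markdown",
736 | "metadata": {},
737 | "source": [
738 | "A minimal sketch of that idea (the `sr_text_*` column names are illustrative): store three synonym-replacement variants per tweet as extra columns, with `astype(str)` guarding against any missing tweets; the row index stands in for `textID`, since that column was dropped above."
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {},
745 | "outputs": [],
746 | "source": [
747 | "# Sketch: three synonym-replacement variants per tweet, stored as new columns\n",
748 | "augmented = data.copy()\n",
749 | "for i in range(1, 4):\n",
750 | "    augmented[f'sr_text_{i}'] = augmented['text'].astype(str).apply(\n",
751 | "        lambda s: synonym_replacement(s, i))\n",
752 | "augmented.head()"
753 | ]
754 | },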
734 | {
735 | "cell_type": "markdown",
736 | "metadata": {
737 | "papermill": {
738 | "duration": 0.026065,
739 | "end_time": "2022-01-20T13:42:53.546762",
740 | "exception": false,
741 | "start_time": "2022-01-20T13:42:53.520697",
742 | "status": "completed"
743 | },
744 | "tags": []
745 | },
746 | "source": [
747 | "# 2.Random Deletion (RD)\n",
748 | "\n",
749 | "In Random Deletion, we randomly delete a word if a uniformly generated number between 0 and 1 is smaller than a pre-defined threshold. This allows for a random deletion of some words of the sentence.\n",
750 | "\n"
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": 14,
756 | "metadata": {
757 | "execution": {
758 | "iopub.execute_input": "2022-01-20T13:42:53.609176Z",
759 | "iopub.status.busy": "2022-01-20T13:42:53.608134Z",
760 | "iopub.status.idle": "2022-01-20T13:42:53.610488Z",
761 | "shell.execute_reply": "2022-01-20T13:42:53.610937Z",
762 | "shell.execute_reply.started": "2021-10-30T20:30:45.075507Z"
763 | },
764 | "papermill": {
765 | "duration": 0.037489,
766 | "end_time": "2022-01-20T13:42:53.611097",
767 | "exception": false,
768 | "start_time": "2022-01-20T13:42:53.573608",
769 | "status": "completed"
770 | },
771 | "tags": []
772 | },
773 | "outputs": [],
774 | "source": [
775 | "def random_deletion(words, p):\n",
776 | "\n",
777 | " words = words.split()\n",
778 | " \n",
779 | " #obviously, if there's only one word, don't delete it\n",
780 | " if len(words) == 1:\n",
781 | " return words\n",
782 | "\n",
783 | " #randomly delete words with probability p\n",
784 | " new_words = []\n",
785 | " for word in words:\n",
786 | " r = random.uniform(0, 1)\n",
787 | " if r > p:\n",
788 | " new_words.append(word)\n",
789 | "\n",
790 | " #if you end up deleting all words, just return a random word\n",
791 | " if len(new_words) == 0:\n",
792 | " rand_int = random.randint(0, len(words)-1)\n",
793 | " return [words[rand_int]]\n",
794 | "\n",
795 | " sentence = ' '.join(new_words)\n",
796 | " \n",
797 | " return sentence"
798 | ]
799 | },
800 | {
801 | "cell_type": "markdown",
802 | "metadata": {
803 | "papermill": {
804 | "duration": 0.025516,
805 | "end_time": "2022-01-20T13:42:53.662613",
806 | "exception": false,
807 | "start_time": "2022-01-20T13:42:53.637097",
808 | "status": "completed"
809 | },
810 | "tags": []
811 | },
812 | "source": [
813 | "Lets test out this Augmentation with our test_sample"
814 | ]
815 | },
816 | {
817 | "cell_type": "code",
818 | "execution_count": 15,
819 | "metadata": {
820 | "execution": {
821 | "iopub.execute_input": "2022-01-20T13:42:53.720829Z",
822 | "iopub.status.busy": "2022-01-20T13:42:53.720180Z",
823 | "iopub.status.idle": "2022-01-20T13:42:53.724102Z",
824 | "shell.execute_reply": "2022-01-20T13:42:53.724740Z",
825 | "shell.execute_reply.started": "2021-10-30T20:30:49.749246Z"
826 | },
827 | "papermill": {
828 | "duration": 0.036255,
829 | "end_time": "2022-01-20T13:42:53.724883",
830 | "exception": false,
831 | "start_time": "2022-01-20T13:42:53.688628",
832 | "status": "completed"
833 | },
834 | "tags": []
835 | },
836 | "outputs": [
837 | {
838 | "name": "stdout",
839 | "output_type": "stream",
840 | "text": [
841 | "the free fillin` app on my is fun, addicted\n",
842 | "free fillin` app on my ipod is im addicted\n",
843 | "the free on my ipod is fun, im\n"
844 | ]
845 | }
846 | ],
847 | "source": [
848 | "print(random_deletion(trial_sent,0.2))\n",
849 | "print(random_deletion(trial_sent,0.3))\n",
850 | "print(random_deletion(trial_sent,0.4))"
851 | ]
852 | },
853 | {
854 | "cell_type": "markdown",
855 | "metadata": {
856 | "papermill": {
857 | "duration": 0.025912,
858 | "end_time": "2022-01-20T13:42:53.777171",
859 | "exception": false,
860 | "start_time": "2022-01-20T13:42:53.751259",
861 | "status": "completed"
862 | },
863 | "tags": []
864 | },
865 | "source": [
866 | "This Could help us in reducing Overfitting and may help to imporve our Model Accuracy "
867 | ]
868 | },
869 | {
870 | "cell_type": "markdown",
871 | "metadata": {
872 | "papermill": {
873 | "duration": 0.025805,
874 | "end_time": "2022-01-20T13:42:53.829355",
875 | "exception": false,
876 | "start_time": "2022-01-20T13:42:53.803550",
877 | "status": "completed"
878 | },
879 | "tags": []
880 | },
881 | "source": [
882 | "\n",
883 | "# 3. Random Swap (RS)\n",
884 | "\n",
885 | "In Random Swap, we randomly swap the order of two words in a sentence.\n"
886 | ]
887 | },
888 | {
889 | "cell_type": "code",
890 | "execution_count": 16,
891 | "metadata": {
892 | "execution": {
893 | "iopub.execute_input": "2022-01-20T13:42:53.885998Z",
894 | "iopub.status.busy": "2022-01-20T13:42:53.885374Z",
895 | "iopub.status.idle": "2022-01-20T13:42:53.892176Z",
896 | "shell.execute_reply": "2022-01-20T13:42:53.892768Z",
897 | "shell.execute_reply.started": "2021-10-30T20:30:52.542732Z"
898 | },
899 | "papermill": {
900 | "duration": 0.037349,
901 | "end_time": "2022-01-20T13:42:53.892924",
902 | "exception": false,
903 | "start_time": "2022-01-20T13:42:53.855575",
904 | "status": "completed"
905 | },
906 | "tags": []
907 | },
908 | "outputs": [],
909 | "source": [
910 | "def swap_word(new_words): \n",
911 | " random_idx_1 = random.randint(0, len(new_words)-1)\n",
912 | " random_idx_2 = random_idx_1\n",
913 | " counter = 0 \n",
914 | " while random_idx_2 == random_idx_1:\n",
915 | " random_idx_2 = random.randint(0, len(new_words)-1)\n",
916 | " counter += 1 \n",
917 | " if counter > 3:\n",
918 | " return new_words\n",
919 | " \n",
920 | " new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] \n",
921 | " return new_words"
922 | ]
923 | },
924 | {
925 | "cell_type": "code",
926 | "execution_count": 17,
927 | "metadata": {
928 | "execution": {
929 | "iopub.execute_input": "2022-01-20T13:42:53.949331Z",
930 | "iopub.status.busy": "2022-01-20T13:42:53.948625Z",
931 | "iopub.status.idle": "2022-01-20T13:42:53.954549Z",
932 | "shell.execute_reply": "2022-01-20T13:42:53.955130Z"
933 | },
934 | "papermill": {
935 | "duration": 0.0358,
936 | "end_time": "2022-01-20T13:42:53.955283",
937 | "exception": false,
938 | "start_time": "2022-01-20T13:42:53.919483",
939 | "status": "completed"
940 | },
941 | "tags": []
942 | },
943 | "outputs": [],
944 | "source": [
945 | "def random_swap(words, n): \n",
946 | " words = words.split()\n",
947 | " new_words = words.copy()\n",
948 | " # n is the number of words to be swapped\n",
949 | " for _ in range(n):\n",
950 | " new_words = swap_word(new_words)\n",
951 | " \n",
952 | " sentence = ' '.join(new_words) \n",
953 | " return sentence"
954 | ]
955 | },
956 | {
957 | "cell_type": "code",
958 | "execution_count": 18,
959 | "metadata": {
960 | "execution": {
961 | "iopub.execute_input": "2022-01-20T13:42:54.012526Z",
962 | "iopub.status.busy": "2022-01-20T13:42:54.011866Z",
963 | "iopub.status.idle": "2022-01-20T13:42:54.019318Z",
964 | "shell.execute_reply": "2022-01-20T13:42:54.018553Z"
965 | },
966 | "papermill": {
967 | "duration": 0.037555,
968 | "end_time": "2022-01-20T13:42:54.019479",
969 | "exception": false,
970 | "start_time": "2022-01-20T13:42:53.981924",
971 | "status": "completed"
972 | },
973 | "tags": []
974 | },
975 | "outputs": [
976 | {
977 | "name": "stdout",
978 | "output_type": "stream",
979 | "text": [
980 | "the free addicted app on my ipod is fun, im fillin`\n",
981 | "fun, free fillin` app on my ipod is im the addicted\n",
982 | "free app fillin` the on addicted ipod is fun, im my\n"
983 | ]
984 | }
985 | ],
986 | "source": [
987 | "print(random_swap(trial_sent,1))\n",
988 | "print(random_swap(trial_sent,2))\n",
989 | "print(random_swap(trial_sent,3))"
990 | ]
991 | },
992 | {
993 | "cell_type": "markdown",
994 | "metadata": {
995 | "papermill": {
996 | "duration": 0.027264,
997 | "end_time": "2022-01-20T13:42:54.074453",
998 | "exception": false,
999 | "start_time": "2022-01-20T13:42:54.047189",
1000 | "status": "completed"
1001 | },
1002 | "tags": []
1003 | },
1004 | "source": [
1005 | "This Random Swapping will help to make our models robust and may inturn help in text classification. \n",
1006 | "\n",
1007 | "High order of swapping may downgrade the model\n",
1008 | "\n",
1009 | "There is a high chance to loose semantics of language so be careful while using this augmentaion.\n",
1010 | "\n"
1011 | ]
1012 | },
1013 | {
1014 | "cell_type": "markdown",
1015 | "metadata": {
1016 | "papermill": {
1017 | "duration": 0.026613,
1018 | "end_time": "2022-01-20T13:42:54.128375",
1019 | "exception": false,
1020 | "start_time": "2022-01-20T13:42:54.101762",
1021 | "status": "completed"
1022 | },
1023 | "tags": []
1024 | },
1025 | "source": [
1026 | "# 4. Random Insertion (RI)\n",
1027 | "Finally, in Random Insertion, we randomly insert synonyms of a word at a random position.\n",
1028 | "\n",
1029 | "Data augmentation\n",
1030 | "operations should not change the true label of\n",
1031 | "a sentence, as that would introduce unnecessary\n",
1032 | "noise into the data. Inserting a synonym of a word\n",
1033 | "in a sentence, opposed to a random word, is more\n",
1034 | "likely to be relevant to the context and retain the\n",
1035 | "original label of the sentence."
1036 | ]
1037 | },
1038 | {
1039 | "cell_type": "code",
1040 | "execution_count": 19,
1041 | "metadata": {
1042 | "execution": {
1043 | "iopub.execute_input": "2022-01-20T13:42:54.185704Z",
1044 | "iopub.status.busy": "2022-01-20T13:42:54.185007Z",
1045 | "iopub.status.idle": "2022-01-20T13:42:54.194800Z",
1046 | "shell.execute_reply": "2022-01-20T13:42:54.195378Z"
1047 | },
1048 | "papermill": {
1049 | "duration": 0.040002,
1050 | "end_time": "2022-01-20T13:42:54.195536",
1051 | "exception": false,
1052 | "start_time": "2022-01-20T13:42:54.155534",
1053 | "status": "completed"
1054 | },
1055 | "tags": []
1056 | },
1057 | "outputs": [],
1058 | "source": [
1059 | "def random_insertion(words, n): \n",
1060 | " words = words.split()\n",
1061 | " new_words = words.copy() \n",
1062 | " for _ in range(n):\n",
1063 | " add_word(new_words) \n",
1064 | " sentence = ' '.join(new_words)\n",
1065 | " return sentence\n",
1066 | "\n",
1067 | "def add_word(new_words): \n",
1068 | " synonyms = []\n",
1069 | " counter = 0\n",
1070 | " \n",
1071 | " while len(synonyms) < 1:\n",
1072 | " random_word = new_words[random.randint(0, len(new_words)-1)]\n",
1073 | " synonyms = get_synonyms(random_word)\n",
1074 | " counter += 1\n",
1075 | " if counter >= 10:\n",
1076 | " return \n",
1077 | " random_synonym = synonyms[0]\n",
1078 | " random_idx = random.randint(0, len(new_words)-1)\n",
1079 | " new_words.insert(random_idx, random_synonym)"
1080 | ]
1081 | },
1082 | {
1083 | "cell_type": "code",
1084 | "execution_count": 20,
1085 | "metadata": {
1086 | "execution": {
1087 | "iopub.execute_input": "2022-01-20T13:42:54.253628Z",
1088 | "iopub.status.busy": "2022-01-20T13:42:54.252968Z",
1089 | "iopub.status.idle": "2022-01-20T13:42:54.261794Z",
1090 | "shell.execute_reply": "2022-01-20T13:42:54.262332Z"
1091 | },
1092 | "papermill": {
1093 | "duration": 0.039465,
1094 | "end_time": "2022-01-20T13:42:54.262479",
1095 | "exception": false,
1096 | "start_time": "2022-01-20T13:42:54.223014",
1097 | "status": "completed"
1098 | },
1099 | "tags": []
1100 | },
1101 | "outputs": [
1102 | {
1103 | "name": "stdout",
1104 | "output_type": "stream",
1105 | "text": [
1106 | "the free fillin` app on my addict ipod is fun, im addicted\n",
1107 | "the complimentary free fillin` app on my ipod along is fun, im addicted\n",
1108 | "the free along fillin` app addict on my ipod along is fun, im addicted\n"
1109 | ]
1110 | }
1111 | ],
1112 | "source": [
1113 | "print(random_insertion(trial_sent,1))\n",
1114 | "print(random_insertion(trial_sent,2))\n",
1115 | "print(random_insertion(trial_sent,3))"
1116 | ]
1117 | },
1118 | {
1119 | "cell_type": "code",
1120 | "execution_count": 21,
1121 | "metadata": {
1122 | "execution": {
1123 | "iopub.execute_input": "2022-01-20T13:42:54.321738Z",
1124 | "iopub.status.busy": "2022-01-20T13:42:54.321012Z",
1125 | "iopub.status.idle": "2022-01-20T13:42:54.325880Z",
1126 | "shell.execute_reply": "2022-01-20T13:42:54.326405Z"
1127 | },
1128 | "papermill": {
1129 | "duration": 0.036453,
1130 | "end_time": "2022-01-20T13:42:54.326548",
1131 | "exception": false,
1132 | "start_time": "2022-01-20T13:42:54.290095",
1133 | "status": "completed"
1134 | },
1135 | "tags": []
1136 | },
1137 | "outputs": [],
1138 | "source": [
1139 | "def aug(sent,n,p):\n",
1140 | " print(f\" Original Sentence : {sent}\")\n",
1141 | " print(f\" SR Augmented Sentence : {synonym_replacement(sent,n)}\")\n",
1142 | " print(f\" RD Augmented Sentence : {random_deletion(sent,p)}\")\n",
1143 | " print(f\" RS Augmented Sentence : {random_swap(sent,n)}\")\n",
1144 | " print(f\" RI Augmented Sentence : {random_insertion(sent,n)}\")"
1145 | ]
1146 | },
1147 | {
1148 | "cell_type": "code",
1149 | "execution_count": 22,
1150 | "metadata": {
1151 | "execution": {
1152 | "iopub.execute_input": "2022-01-20T13:42:54.387083Z",
1153 | "iopub.status.busy": "2022-01-20T13:42:54.386400Z",
1154 | "iopub.status.idle": "2022-01-20T13:42:54.402037Z",
1155 | "shell.execute_reply": "2022-01-20T13:42:54.401238Z"
1156 | },
1157 | "papermill": {
1158 | "duration": 0.046853,
1159 | "end_time": "2022-01-20T13:42:54.402403",
1160 | "exception": false,
1161 | "start_time": "2022-01-20T13:42:54.355550",
1162 | "status": "completed"
1163 | },
1164 | "tags": []
1165 | },
1166 | "outputs": [
1167 | {
1168 | "name": "stdout",
1169 | "output_type": "stream",
1170 | "text": [
1171 | " Original Sentence : the free fillin` app on my ipod is fun, im addicted\n",
1172 | " SR Augmented Sentence : the disembarrass fillin` app on my ipod is fun, im hook\n",
1173 | " RD Augmented Sentence : the free app on my ipod fun, im addicted\n",
1174 | " RS Augmented Sentence : on free fillin` ipod is my the app fun, im addicted\n",
1175 | " RI Augmented Sentence : the free fillin` app on gratis addict my ipod is complimentary make up fun, im addicted\n"
1176 | ]
1177 | }
1178 | ],
1179 | "source": [
1180 | "aug(trial_sent,4,0.3)"
1181 | ]
1182 | }
1183 | ],
1184 | "metadata": {
1185 | "kernelspec": {
1186 | "display_name": "Python 3",
1187 | "language": "python",
1188 | "name": "python3"
1189 | },
1190 | "language_info": {
1191 | "codemirror_mode": {
1192 | "name": "ipython",
1193 | "version": 3
1194 | },
1195 | "file_extension": ".py",
1196 | "mimetype": "text/x-python",
1197 | "name": "python",
1198 | "nbconvert_exporter": "python",
1199 | "pygments_lexer": "ipython3",
1200 | "version": "3.7.6"
1201 | },
1202 | "papermill": {
1203 | "duration": 10.912265,
1204 | "end_time": "2022-01-20T13:42:54.539428",
1205 | "environment_variables": {},
1206 | "exception": null,
1207 | "input_path": "__notebook__.ipynb",
1208 | "output_path": "__notebook__.ipynb",
1209 | "parameters": {},
1210 | "start_time": "2022-01-20T13:42:43.627163",
1211 | "version": "2.1.0"
1212 | }
1213 | },
1214 | "nbformat": 4,
1215 | "nbformat_minor": 4
1216 | }
1217 |
--------------------------------------------------------------------------------
/chapter_12/README.md:
--------------------------------------------------------------------------------
1 | # Simulation and Optimization Competitions
2 |
3 | This chapter provides an overview of simulation competitions, a new class of contests gaining popularity on Kaggle over the last few years.
4 |
--------------------------------------------------------------------------------
/chapter_13/README.md:
--------------------------------------------------------------------------------
1 | # Creating Your Portfolio of Projects and Ideas
2 |
3 | This chapter explores ways you can stand out by showcasing your work on Kaggle itself and other sites in an appropriate way.
4 |
--------------------------------------------------------------------------------
/chapter_14/README.md:
--------------------------------------------------------------------------------
1 | # Finding New Professional Opportunities
2 |
3 | This chapter concludes the overview of how Kaggle can positively affect your career by discussing the best ways to leverage all your Kaggle experience in order to find new professional opportunities.
4 |
--------------------------------------------------------------------------------
/contributors.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/contributors.jpg
--------------------------------------------------------------------------------
/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PacktPublishing/The-Kaggle-Book/9e89503758f3afbb742f3b92367815cff897543d/cover.png
--------------------------------------------------------------------------------