├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── LICENSE-DATA
├── LICENSE-NOTEBOOKS
├── MLU-MAIN.ipynb
├── README.md
├── data
    ├── MLU_Logo.png
    ├── final_project
    │   ├── perfect_submission.csv
    │   ├── test_features.csv
    │   ├── training.csv
    │   └── y_test.csv
    └── review
    │   ├── Austin_Animal_Center_Intakes.csv
    │   ├── Austin_Animal_Center_Intakes_Outcomes.csv
    │   ├── Austin_Animal_Center_Outcomes.csv
    │   └── review_dataset.csv
├── notebooks
    ├── MLA-TAB-DAY1-EDA.ipynb
    ├── MLA-TAB-DAY1-FINAL.ipynb
    ├── MLA-TAB-DAY1-KNN.ipynb
    ├── MLA-TAB-DAY1-MODEL.ipynb
    ├── MLA-TAB-DAY2-TEXT-PROCESS.ipynb
    ├── MLA-TAB-DAY2-TREE.ipynb
    ├── MLA-TAB-DAY3-AUTOML.ipynb
    ├── MLA-TAB-DAY3-NN.ipynb
    ├── MLA-TAB-DAY3-PYTORCH.ipynb
    └── mluvisuals.py
├── requirements.txt
└── slides
    ├── MLA-TAB-Lecture1.pptx
    ├── MLA-TAB-Lecture2.pptx
    └── MLA-TAB-Lecture3.pptx


/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | ## Code of Conduct
2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
4 | opensource-codeofconduct@amazon.com with any additional questions or comments.
5 | 


--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
 1 | # Contributing Guidelines
 2 | 
 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional
 4 | documentation, we greatly value feedback and contributions from our community.
 5 | 
 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary
 7 | information to effectively respond to your bug report or contribution.
 8 | 
 9 | 
10 | ## Reporting Bugs/Feature Requests
11 | 
12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features.
13 | 
14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already
15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful:
16 | 
17 | * A reproducible test case or series of steps
18 | * The version of our code being used
19 | * Any modifications you've made relevant to the bug
20 | * Anything unusual about your environment or deployment
21 | 
22 | 
23 | ## Contributing via Pull Requests
24 | Contributions via pull requests are much appreciated. Before sending us a pull request, please ensure that:
25 | 
26 | 1. You are working against the latest source on the *master* branch.
27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already.
28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted.
29 | 
30 | To send us a pull request, please:
31 | 
32 | 1. Fork the repository.
33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change.
34 | 3. Ensure local tests pass.
35 | 4. Commit to your fork using clear commit messages.
36 | 5. Send us a pull request, answering any default questions in the pull request interface.
37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation.
38 | 
39 | GitHub provides additional document on [forking a repository](https://help.github.com/articles/fork-a-repo/) and
40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/).
41 | 
42 | 
43 | ## Finding contributions to work on
44 | Looking at the existing issues is a great way to find something to contribute on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start.
45 | 
46 | 
47 | ## Code of Conduct
48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct).
49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact
50 | opensource-codeofconduct@amazon.com with any additional questions or comments.
51 | 
52 | 
53 | ## Security issue notifications
54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue.
55 | 
56 | 
57 | ## Licensing
58 | 
59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution.
60 | 
61 | We may ask you to sign a [Contributor License Agreement (CLA)](http://en.wikipedia.org/wiki/Contributor_License_Agreement) for larger changes.
62 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
  1 | Creative Commons Attribution-ShareAlike 4.0 International Public License
  2 | 
  3 | By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-ShareAlike 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
  4 | 
  5 | Section 1 – Definitions.
  6 | 	
  7 |      a.	Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
  8 | 	
  9 |      b.	Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
 10 | 	
 11 |      c.	BY-SA Compatible License means a license listed at creativecommons.org/compatiblelicenses, approved by Creative Commons as essentially the equivalent of this Public License.
 12 | 	
 13 |      d.	Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not Copyright and Similar Rights.
 14 | 	
 15 |      e.	Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
 16 | 	
 17 |      f.	Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
 18 | 	
 19 |      g.	License Elements means the license attributes listed in the name of a Creative Commons Public License. The License Elements of this Public License are Attribution and ShareAlike.
 20 | 	
 21 |      h.	Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
 22 | 	
 23 |      i.	Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
 24 | 	
 25 |      j.	Licensor means the individual(s) or entity(ies) granting rights under this Public License.
 26 | 	
 27 |      k.	Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
 28 | 	
 29 |      l.	Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
 30 | 	
 31 |      m.	You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
 32 | 
 33 | Section 2 – Scope.
 34 | 	
 35 |      a.	License grant.
 36 | 	
 37 |           1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
 38 | 
 39 |                A. reproduce and Share the Licensed Material, in whole or in part; and	
 40 | 
 41 |                B. produce, reproduce, and Share Adapted Material.
 42 | 	
 43 |           2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
 44 | 	
 45 |           3. Term. The term of this Public License is specified in Section 6(a).
 46 | 	
 47 |           4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
 48 | 	
 49 |           5. Downstream recipients.
 50 | 
 51 |                A. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
 52 | 	
 53 |                B. Additional offer from the Licensor – Adapted Material. Every recipient of Adapted Material from You automatically receives an offer from the Licensor to exercise the Licensed Rights in the Adapted Material under the conditions of the Adapter’s License You apply.
 54 | 	
 55 |                C. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
 56 | 	
 57 |           6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
 58 | 	
 59 |      b.	Other rights.
 60 | 	
 61 |           1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
 62 | 	
 63 |           2. Patent and trademark rights are not licensed under this Public License.
 64 | 	
 65 |           3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties.
 66 | 
 67 | Section 3 – License Conditions.
 68 | 
 69 | Your exercise of the Licensed Rights is expressly made subject to the following conditions.
 70 | 	
 71 |      a.	Attribution.
 72 | 	
 73 |           1. If You Share the Licensed Material (including in modified form), You must:
 74 | 
 75 |                A. retain the following if it is supplied by the Licensor with the Licensed Material:
 76 | 
 77 |                     i.	identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
 78 | 
 79 |                     ii.	a copyright notice;
 80 | 
 81 |                     iii. a notice that refers to this Public License;
 82 | 
 83 |                     iv.	a notice that refers to the disclaimer of warranties;
 84 | 
 85 |                     v.	a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
 86 | 
 87 |                B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
 88 | 
 89 |                C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
 90 | 	
 91 |           2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
 92 | 	
 93 |           3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
 94 | 	
 95 |      b.	ShareAlike.In addition to the conditions in Section 3(a), if You Share Adapted Material You produce, the following conditions also apply.
 96 | 	
 97 |           1. The Adapter’s License You apply must be a Creative Commons license with the same License Elements, this version or later, or a BY-SA Compatible License.
 98 | 	
 99 |           2. You must include the text of, or the URI or hyperlink to, the Adapter's License You apply. You may satisfy this condition in any reasonable manner based on the medium, means, and context in which You Share Adapted Material.
100 | 	
101 |           3. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, Adapted Material that restrict exercise of the rights granted under the Adapter's License You apply.
102 | 
103 | Section 4 – Sui Generis Database Rights.
104 | 
105 | Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
106 | 	
107 |      a.	for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database;
108 | 	
109 |      b.	if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material, including for purposes of Section 3(b); and
110 | 	
111 |      c.	You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
112 | For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
113 | 
114 | Section 5 – Disclaimer of Warranties and Limitation of Liability.
115 | 	
116 |      a.	Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
117 | 	
118 |      b.	To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
119 | 	
120 |      c.	The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
121 | 
122 | Section 6 – Term and Termination.
123 | 	
124 |      a.	This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
125 | 	
126 |      b.	Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
127 | 	
128 |           1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
129 | 	
130 |           2. upon express reinstatement by the Licensor.
131 | 	
132 |      c.	For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
133 | 	
134 |      d.	For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
135 | 	
136 |      e.	Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
137 | 
138 | Section 7 – Other Terms and Conditions.
139 | 	
140 |      a.	The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
141 | 	
142 |      b.	Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
143 | 
144 | Section 8 – Interpretation.
145 | 	
146 |      a.	For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
147 | 	
148 |      b.	To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
149 | 	
150 |      c.	No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
151 | 	
152 |      d.	Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.


--------------------------------------------------------------------------------
/LICENSE-DATA:
--------------------------------------------------------------------------------
1 | AMAZON LICENSE
2 | 
3 | Data set for the course is being provided to you by permission of Amazon and is subject to the terms of the Amazon License and Access (available at https://www.amazon.com/gp/help/customer/display.html?nodeId=201909000 ). You are expressly prohibited from copying, modifying, selling, exporting or using this data set in any way other than for the purpose of completing this course.


--------------------------------------------------------------------------------
/LICENSE-NOTEBOOKS:
--------------------------------------------------------------------------------
 1 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
 2 | 
 3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 4 | this software and associated documentation files (the "Software"), to deal in
 5 | the Software without restriction, including without limitation the rights to
 6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 7 | the Software, and to permit persons to whom the Software is furnished to do so.
 8 | 
 9 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
10 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
11 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
12 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
13 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
14 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
15 | 
16 | 


--------------------------------------------------------------------------------
/MLU-MAIN.ipynb:
--------------------------------------------------------------------------------
 1 | {
 2 |  "cells": [
 3 |   {
 4 |    "cell_type": "markdown",
 5 |    "id": "610a5a2e",
 6 |    "metadata": {},
 7 |    "source": [
 8 |     "[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/MLU-MAIN.ipynb)"
 9 |    ]
10 |   },
11 |   {
12 |    "cell_type": "markdown",
13 |    "id": "42951fd3",
14 |    "metadata": {},
15 |    "source": [
16 |     "![MLU Logo](data/MLU_Logo.png)"
17 |    ]
18 |   },
19 |   {
20 |    "cell_type": "markdown",
21 |    "id": "ab8947dc",
22 |    "metadata": {},
23 |    "source": [
24 |     "# Machine Learning University\n",
25 |     "Welcome to the GitHub page of __Machine Learning University (MLU)__. Our mission is to make machine learning accessible to anyone, anywhere, anytime. We have courses available across many sub-domains of machine learning.\n",
26 |     "## Getting Started\n",
27 |     "There are just two steps to start your deep learning journey!\n",
28 |     "### Step 1: Start Instance\n",
29 |     "\n",
30 |     "Choose `CPU` or `GPU` and click `Start instance`.\n",
31 |     "First-time users without GPU experience\n",
32 |     "are recommended to start with `CPU`.\n",
33 |     "\n",
34 |     "### Step 2: Copy to Project\n",
35 |     "\n",
36 |     "Click `Copy to project` and install the environment file (.yml).\n",
37 |     "\n",
38 |     "## Course List\n",
39 |     "Learners have access to jupyter notebooks, slides and accompanying video lectures. See the MLU course list below.\n",
40 |     "* ### [Natural Language Processing](https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp) [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-nlp/blob/master/notebooks/MLA-NLP-Lecture1-Text-Process.ipynb)\n",
41 |     "This course is designed to help you get started with Natural Language Processing (NLP) and learn how to use NLP in various use cases. You can view the [GitHub](https://github.com/aws-samples/aws-machine-learning-university-accelerated-nlp) repository of this class for more details.\n",
42 |     "* ### [Tabular Data](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab) [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/master/MLU-MAIN.ipynb)\n",
43 |     "Learn how to get started with tabular data (spreadsheet-like data) and the widely used machine learning techniques to manipulate tabular data. See the [GitHub](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab) page for more info and hands-on notebooks.\n",
44 |     "* ### [Computer Vision](https://github.com/aws-samples/aws-machine-learning-university-accelerated-cv) [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-cv/blob/master/notebooks/MLA-CV-DAY1-NN.ipynb)\n",
45 |     "Through this course, you will gain the necessary skills to get started with computer vision and use it in practical problems. See the [GitHub](https://github.com/aws-samples/aws-machine-learning-university-accelerated-cv) page for more info.\n",
46 |     "* ### [Decision Trees and Ensemble Methods](https://github.com/aws-samples/aws-machine-learning-university-dte) [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-dte/blob/main/notebooks/lecture_1/DTE-LECTURE-1-PRUNE.ipynb)\n",
47 |     "Get started with tree-based and ensemble models in this class. Visit the [GitHub](https://github.com/aws-samples/aws-machine-learning-university-dte) page to start learning."
48 |    ]
49 |   }
50 |  ],
51 |  "metadata": {
52 |   "kernelspec": {
53 |    "display_name": "conda_python3",
54 |    "language": "python",
55 |    "name": "conda_python3"
56 |   },
57 |   "language_info": {
58 |    "codemirror_mode": {
59 |     "name": "ipython",
60 |     "version": 3
61 |    },
62 |    "file_extension": ".py",
63 |    "mimetype": "text/x-python",
64 |    "name": "python",
65 |    "nbconvert_exporter": "python",
66 |    "pygments_lexer": "ipython3",
67 |    "version": "3.6.13"
68 |   }
69 |  },
70 |  "nbformat": 4,
71 |  "nbformat_minor": 5
72 | }
73 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | ![logo](data/MLU_Logo.png)
 2 | ## Machine Learning University: Accelerated Tabular Data Class
 3 | This repository contains __slides__, __notebooks__, and __datasets__ for the __Machine Learning University (MLU) Accelerated Tabular Data__ class. Our mission is to make Machine Learning accessible to everyone. We have courses available across many topics of machine learning and believe knowledge of ML can be a key enabler for success. This class is designed to help you get started with tabular data (spreadsheet-like tables), learn about widely used Machine Learning techniques for tabular data, and apply them to real-world problems.
 4 | 
 5 | ## YouTube
 6 | Watch all Tabular Data class video recordings in this [YouTube playlist](https://www.youtube.com/playlist?list=PL8P_Z6C4GcuVQZCYf_ZnMoIWLLKGx9Mi2) from our [YouTube channel](https://www.youtube.com/channel/UC12LqyqTQYbXatYS9AA7Nuw/playlists).
 7 | 
 8 | [![Playlist](https://img.youtube.com/vi/kj-sPC6pai4/0.jpg)](https://www.youtube.com/playlist?list=PL8P_Z6C4GcuVQZCYf_ZnMoIWLLKGx9Mi2)
 9 | 
10 | ## Course Overview
11 | 
12 | There are three lectures and one final project for this class.
13 | Lecture 1
14 | | title | studio lab |
15 | | :---: | ---: |
16 | | Introduction to ML | - |
17 | | Sample ML Model | - |
18 | | Model Evaluation | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY1-MODEL.ipynb) |
19 | | Exploratory Data Analysis | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY1-EDA.ipynb) |
20 | | K Nearest Neighbors (KNN) | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY1-KNN.ipynb) |
21 | | Final Project | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY1-FINAL.ipynb) |
22 | 
23 | Lecture 2
24 | 
25 | | title | studio lab |
26 | | :---: | ---: |
27 | |Feature Engineering | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY2-TEXT-PROCESS.ipynb) |
28 | | Tree-based Models | [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY2-TREE.ipynb) |
29 | | Bagging | - |
30 | | Hyperparameter Tuning | - |
31 | | AWS AI/ML Services |[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY2-SAGEMAKER.ipynb)  |
32 | 
33 | Lecture 3 
34 | 
35 | | title | studio lab |
36 | | :---: | ---: |
37 | | Optimization | - |
38 | | Regression Models | - |
39 | | Boosting | - |
40 | | Neural Networks |NN [![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY3-NN.ipynb) |
41 | | AutoML |[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY3-AUTOML.ipynb) |
42 | 
43 | 
44 | **Final Project:** Practice working with a "real-world" tabular dataset for the final project. Final project dataset is in the [data/final_project folder](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/tree/main/data/final_project). For more details on the final project, check out [this notebook](https://github.com/aws-samples/aws-machine-learning-university-accelerated-tab/blob/main/notebooks/MLA-TAB-DAY1-FINAL.ipynb).
45 | 
46 | ## Interactives/Visuals
47 | Interested in visual, interactive explanations of core machine learning concepts? Check out our [MLU-Explain articles](https://mlu-explain.github.io/) to learn at your own pace! 
48 | 
49 | ## Contribute
50 | If you would like to contribute to the project, see [CONTRIBUTING](CONTRIBUTING.md) for more information.
51 | 
52 | ## License
53 | The license for this repository depends on the section.  Data set for the course is being provided to you by permission of Amazon and is subject to the terms of the [Amazon License and Access](https://www.amazon.com/gp/help/customer/display.html?nodeId=201909000). You are expressly prohibited from copying, modifying, selling, exporting or using this data set in any way other than for the purpose of completing this course. The lecture slides are released under the CC-BY-SA-4.0 License.  The code examples are released under the MIT-0 License. See each section's LICENSE file for details.
54 | 


--------------------------------------------------------------------------------
/data/MLU_Logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-tab/03005a6ef0ede849a0dd13c389a7511c30a0d2ea/data/MLU_Logo.png


--------------------------------------------------------------------------------
/notebooks/MLA-TAB-DAY1-FINAL.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "![MLU Logo](../data/MLU_Logo.png)"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "# <a name=\"0\">Machine Learning Accelerator - Tabular Data - Lecture 1</a>\n",
 15 |     "\n",
 16 |     "\n",
 17 |     "## Final Project \n",
 18 |     "\n",
 19 |     "In this notebook, we build a ML model to predict the __Time at Center__ field of our final project dataset.\n",
 20 |     "\n",
 21 |     "1. <a href=\"#1\">Read the dataset</a> (Given) \n",
 22 |     "2. <a href=\"#2\">Train a model</a> (Implement)\n",
 23 |     "    * <a href=\"#21\">Exploratory Data Analysis</a>\n",
 24 |     "    * <a href=\"#22\">Select features to build the model</a>\n",
 25 |     "    * <a href=\"#23\">Data processing</a>\n",
 26 |     "    * <a href=\"#24\">Model training</a>\n",
 27 |     "3. <a href=\"#3\">Make predictions on the test dataset</a> (Implement)\n",
 28 |     "4. <a href=\"#4\">Write the test predictions to a CSV file</a> (Given)\n",
 29 |     "\n",
 30 |     "__Austin Animal Center Dataset__:\n",
 31 |     "\n",
 32 |     "In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). \n",
 33 |     "\n",
 34 |     "In order to work with a single table, we joined the intake and outcome tables using the \"Animal ID\" column and created a training.csv, test_features.csv and y_test.csv files. Similar to our review dataset, we didn't consider animals with multiple entries to the facility to keep it simple. If you want to see the original datasets, they are available under data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv.\n",
 35 |     "\n",
 36 |     "__Dataset schema:__ \n",
 37 |     "- __Pet ID__ - Unique ID of pet\n",
 38 |     "- __Outcome Type__ - State of pet at the time of recording the outcome\n",
 39 |     "- __Sex upon Outcome__ - Sex of pet at outcome\n",
 40 |     "- __Name__ - Name of pet \n",
 41 |     "- __Found Location__ - Found location of pet before entered the center\n",
 42 |     "- __Intake Type__ - Circumstances bringing the pet to the center\n",
 43 |     "- __Intake Condition__ - Health condition of pet when entered the center\n",
 44 |     "- __Pet Type__ - Type of pet\n",
 45 |     "- __Sex upon Intake__ - Sex of pet when entered the center\n",
 46 |     "- __Breed__ - Breed of pet \n",
 47 |     "- __Color__ - Color of pet \n",
 48 |     "- __Age upon Intake Days__ - Age of pet when entered the center (days)\n",
 49 |     "- __Time at Center__ - Time at center (0 = less than 30 days; 1 = more than 30 days). This is the value to predict. \n"
 50 |    ]
 51 |   },
 52 |   {
 53 |    "cell_type": "code",
 54 |    "execution_count": 1,
 55 |    "metadata": {},
 56 |    "outputs": [],
 57 |    "source": [
 58 |     "%%capture\n",
 59 |     "%pip install -q -r ../requirements.txt"
 60 |    ]
 61 |   },
 62 |   {
 63 |    "cell_type": "markdown",
 64 |    "metadata": {},
 65 |    "source": [
 66 |     "## 1. <a name=\"1\">Read the datasets</a> (Given)\n",
 67 |     "(<a href=\"#0\">Go to top</a>)\n",
 68 |     "\n",
 69 |     "Let's read the datasets into dataframes, using Pandas."
 70 |    ]
 71 |   },
 72 |   {
 73 |    "cell_type": "code",
 74 |    "execution_count": 2,
 75 |    "metadata": {},
 76 |    "outputs": [
 77 |     {
 78 |      "name": "stdout",
 79 |      "output_type": "stream",
 80 |      "text": [
 81 |       "The shape of the training dataset is: (71538, 13)\n",
 82 |       "The shape of the test dataset is: (23846, 12)\n"
 83 |      ]
 84 |     }
 85 |    ],
 86 |    "source": [
 87 |     "import pandas as pd\n",
 88 |     "import numpy as np\n",
 89 |     "\n",
 90 |     "import warnings\n",
 91 |     "warnings.filterwarnings(\"ignore\")\n",
 92 |     "  \n",
 93 |     "training_data = pd.read_csv('../data/final_project/training.csv')\n",
 94 |     "test_data = pd.read_csv('../data/final_project/test_features.csv')\n",
 95 |     "\n",
 96 |     "print('The shape of the training dataset is:', training_data.shape)\n",
 97 |     "print('The shape of the test dataset is:', test_data.shape)\n"
 98 |    ]
 99 |   },
100 |   {
101 |    "cell_type": "markdown",
102 |    "metadata": {},
103 |    "source": [
104 |     "## 2. <a name=\"2\">Train a model</a> (Implement)\n",
105 |     "(<a href=\"#0\">Go to top</a>)\n",
106 |     "\n",
107 |     " * <a href=\"#21\">Exploratory Data Analysis</a>\n",
108 |     " * <a href=\"#22\">Select features to build the model</a>\n",
109 |     " * <a href=\"#23\">Data processing</a>\n",
110 |     " * <a href=\"#24\">Model training</a>\n",
111 |     "\n",
112 |     "### 2.1 <a name=\"21\">Exploratory Data Analysis</a> \n",
113 |     "(<a href=\"#2\">Go to Train a model</a>)\n",
114 |     "\n",
115 |     "We look at number of rows, columns and some simple statistics of the dataset."
116 |    ]
117 |   },
118 |   {
119 |    "cell_type": "code",
120 |    "execution_count": 3,
121 |    "metadata": {},
122 |    "outputs": [
123 |     {
124 |      "data": {
125 |       "text/html": [
126 |        "<div>\n",
127 |        "<style scoped>\n",
128 |        "    .dataframe tbody tr th:only-of-type {\n",
129 |        "        vertical-align: middle;\n",
130 |        "    }\n",
131 |        "\n",
132 |        "    .dataframe tbody tr th {\n",
133 |        "        vertical-align: top;\n",
134 |        "    }\n",
135 |        "\n",
136 |        "    .dataframe thead th {\n",
137 |        "        text-align: right;\n",
138 |        "    }\n",
139 |        "</style>\n",
140 |        "<table border=\"1\" class=\"dataframe\">\n",
141 |        "  <thead>\n",
142 |        "    <tr style=\"text-align: right;\">\n",
143 |        "      <th></th>\n",
144 |        "      <th>Pet ID</th>\n",
145 |        "      <th>Outcome Type</th>\n",
146 |        "      <th>Sex upon Outcome</th>\n",
147 |        "      <th>Name</th>\n",
148 |        "      <th>Found Location</th>\n",
149 |        "      <th>Intake Type</th>\n",
150 |        "      <th>Intake Condition</th>\n",
151 |        "      <th>Pet Type</th>\n",
152 |        "      <th>Sex upon Intake</th>\n",
153 |        "      <th>Breed</th>\n",
154 |        "      <th>Color</th>\n",
155 |        "      <th>Age upon Intake Days</th>\n",
156 |        "      <th>Time at Center</th>\n",
157 |        "    </tr>\n",
158 |        "  </thead>\n",
159 |        "  <tbody>\n",
160 |        "    <tr>\n",
161 |        "      <th>0</th>\n",
162 |        "      <td>A745079</td>\n",
163 |        "      <td>Transfer</td>\n",
164 |        "      <td>Unknown</td>\n",
165 |        "      <td>NaN</td>\n",
166 |        "      <td>7920 Old Lockhart in Travis (TX)</td>\n",
167 |        "      <td>Stray</td>\n",
168 |        "      <td>Normal</td>\n",
169 |        "      <td>Cat</td>\n",
170 |        "      <td>Unknown</td>\n",
171 |        "      <td>Domestic Shorthair Mix</td>\n",
172 |        "      <td>Blue</td>\n",
173 |        "      <td>3</td>\n",
174 |        "      <td>0</td>\n",
175 |        "    </tr>\n",
176 |        "    <tr>\n",
177 |        "      <th>1</th>\n",
178 |        "      <td>A801765</td>\n",
179 |        "      <td>Transfer</td>\n",
180 |        "      <td>Intact Female</td>\n",
181 |        "      <td>NaN</td>\n",
182 |        "      <td>5006 Table Top in Austin (TX)</td>\n",
183 |        "      <td>Stray</td>\n",
184 |        "      <td>Normal</td>\n",
185 |        "      <td>Cat</td>\n",
186 |        "      <td>Intact Female</td>\n",
187 |        "      <td>Domestic Shorthair</td>\n",
188 |        "      <td>Brown Tabby/White</td>\n",
189 |        "      <td>28</td>\n",
190 |        "      <td>0</td>\n",
191 |        "    </tr>\n",
192 |        "    <tr>\n",
193 |        "      <th>2</th>\n",
194 |        "      <td>A667965</td>\n",
195 |        "      <td>Transfer</td>\n",
196 |        "      <td>Neutered Male</td>\n",
197 |        "      <td>NaN</td>\n",
198 |        "      <td>14100 Thermal Dr in Austin (TX)</td>\n",
199 |        "      <td>Stray</td>\n",
200 |        "      <td>Normal</td>\n",
201 |        "      <td>Dog</td>\n",
202 |        "      <td>Neutered Male</td>\n",
203 |        "      <td>Chihuahua Shorthair Mix</td>\n",
204 |        "      <td>Brown/Tan</td>\n",
205 |        "      <td>1825</td>\n",
206 |        "      <td>0</td>\n",
207 |        "    </tr>\n",
208 |        "    <tr>\n",
209 |        "      <th>3</th>\n",
210 |        "      <td>A687551</td>\n",
211 |        "      <td>Transfer</td>\n",
212 |        "      <td>Intact Male</td>\n",
213 |        "      <td>NaN</td>\n",
214 |        "      <td>5811 Cedardale Dr in Austin (TX)</td>\n",
215 |        "      <td>Stray</td>\n",
216 |        "      <td>Normal</td>\n",
217 |        "      <td>Cat</td>\n",
218 |        "      <td>Intact Male</td>\n",
219 |        "      <td>Domestic Shorthair Mix</td>\n",
220 |        "      <td>Brown Tabby</td>\n",
221 |        "      <td>28</td>\n",
222 |        "      <td>0</td>\n",
223 |        "    </tr>\n",
224 |        "    <tr>\n",
225 |        "      <th>4</th>\n",
226 |        "      <td>A773004</td>\n",
227 |        "      <td>Adoption</td>\n",
228 |        "      <td>Neutered Male</td>\n",
229 |        "      <td>*Boris</td>\n",
230 |        "      <td>Highway 290 And Arterial A in Austin (TX)</td>\n",
231 |        "      <td>Stray</td>\n",
232 |        "      <td>Normal</td>\n",
233 |        "      <td>Dog</td>\n",
234 |        "      <td>Intact Male</td>\n",
235 |        "      <td>Chihuahua Shorthair Mix</td>\n",
236 |        "      <td>Tricolor/Cream</td>\n",
237 |        "      <td>365</td>\n",
238 |        "      <td>0</td>\n",
239 |        "    </tr>\n",
240 |        "  </tbody>\n",
241 |        "</table>\n",
242 |        "</div>"
243 |       ],
244 |       "text/plain": [
245 |        "    Pet ID Outcome Type Sex upon Outcome    Name  \\\n",
246 |        "0  A745079     Transfer          Unknown     NaN   \n",
247 |        "1  A801765     Transfer    Intact Female     NaN   \n",
248 |        "2  A667965     Transfer    Neutered Male     NaN   \n",
249 |        "3  A687551     Transfer      Intact Male     NaN   \n",
250 |        "4  A773004     Adoption    Neutered Male  *Boris   \n",
251 |        "\n",
252 |        "                              Found Location Intake Type Intake Condition  \\\n",
253 |        "0           7920 Old Lockhart in Travis (TX)       Stray           Normal   \n",
254 |        "1              5006 Table Top in Austin (TX)       Stray           Normal   \n",
255 |        "2            14100 Thermal Dr in Austin (TX)       Stray           Normal   \n",
256 |        "3           5811 Cedardale Dr in Austin (TX)       Stray           Normal   \n",
257 |        "4  Highway 290 And Arterial A in Austin (TX)       Stray           Normal   \n",
258 |        "\n",
259 |        "  Pet Type Sex upon Intake                    Breed              Color  \\\n",
260 |        "0      Cat         Unknown   Domestic Shorthair Mix               Blue   \n",
261 |        "1      Cat   Intact Female       Domestic Shorthair  Brown Tabby/White   \n",
262 |        "2      Dog   Neutered Male  Chihuahua Shorthair Mix          Brown/Tan   \n",
263 |        "3      Cat     Intact Male   Domestic Shorthair Mix        Brown Tabby   \n",
264 |        "4      Dog     Intact Male  Chihuahua Shorthair Mix     Tricolor/Cream   \n",
265 |        "\n",
266 |        "   Age upon Intake Days  Time at Center  \n",
267 |        "0                     3               0  \n",
268 |        "1                    28               0  \n",
269 |        "2                  1825               0  \n",
270 |        "3                    28               0  \n",
271 |        "4                   365               0  "
272 |       ]
273 |      },
274 |      "execution_count": 3,
275 |      "metadata": {},
276 |      "output_type": "execute_result"
277 |     }
278 |    ],
279 |    "source": [
280 |     "# Implement here\n",
281 |     "\n",
282 |     "training_data.head()"
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "code",
287 |    "execution_count": 4,
288 |    "metadata": {},
289 |    "outputs": [
290 |     {
291 |      "data": {
292 |       "text/html": [
293 |        "<div>\n",
294 |        "<style scoped>\n",
295 |        "    .dataframe tbody tr th:only-of-type {\n",
296 |        "        vertical-align: middle;\n",
297 |        "    }\n",
298 |        "\n",
299 |        "    .dataframe tbody tr th {\n",
300 |        "        vertical-align: top;\n",
301 |        "    }\n",
302 |        "\n",
303 |        "    .dataframe thead th {\n",
304 |        "        text-align: right;\n",
305 |        "    }\n",
306 |        "</style>\n",
307 |        "<table border=\"1\" class=\"dataframe\">\n",
308 |        "  <thead>\n",
309 |        "    <tr style=\"text-align: right;\">\n",
310 |        "      <th></th>\n",
311 |        "      <th>Pet ID</th>\n",
312 |        "      <th>Outcome Type</th>\n",
313 |        "      <th>Sex upon Outcome</th>\n",
314 |        "      <th>Name</th>\n",
315 |        "      <th>Found Location</th>\n",
316 |        "      <th>Intake Type</th>\n",
317 |        "      <th>Intake Condition</th>\n",
318 |        "      <th>Pet Type</th>\n",
319 |        "      <th>Sex upon Intake</th>\n",
320 |        "      <th>Breed</th>\n",
321 |        "      <th>Color</th>\n",
322 |        "      <th>Age upon Intake Days</th>\n",
323 |        "    </tr>\n",
324 |        "  </thead>\n",
325 |        "  <tbody>\n",
326 |        "    <tr>\n",
327 |        "      <th>0</th>\n",
328 |        "      <td>A782657</td>\n",
329 |        "      <td>Adoption</td>\n",
330 |        "      <td>Spayed Female</td>\n",
331 |        "      <td>NaN</td>\n",
332 |        "      <td>1911 Dear Run Drive in Austin (TX)</td>\n",
333 |        "      <td>Stray</td>\n",
334 |        "      <td>Normal</td>\n",
335 |        "      <td>Dog</td>\n",
336 |        "      <td>Intact Female</td>\n",
337 |        "      <td>Labrador Retriever Mix</td>\n",
338 |        "      <td>Black</td>\n",
339 |        "      <td>60</td>\n",
340 |        "    </tr>\n",
341 |        "    <tr>\n",
342 |        "      <th>1</th>\n",
343 |        "      <td>A804622</td>\n",
344 |        "      <td>Adoption</td>\n",
345 |        "      <td>Neutered Male</td>\n",
346 |        "      <td>NaN</td>\n",
347 |        "      <td>702 Grand Canyon in Austin (TX)</td>\n",
348 |        "      <td>Stray</td>\n",
349 |        "      <td>Normal</td>\n",
350 |        "      <td>Dog</td>\n",
351 |        "      <td>Intact Male</td>\n",
352 |        "      <td>Boxer/Anatol Shepherd</td>\n",
353 |        "      <td>Brown/Tricolor</td>\n",
354 |        "      <td>60</td>\n",
355 |        "    </tr>\n",
356 |        "    <tr>\n",
357 |        "      <th>2</th>\n",
358 |        "      <td>A786693</td>\n",
359 |        "      <td>Return to Owner</td>\n",
360 |        "      <td>Neutered Male</td>\n",
361 |        "      <td>Zeus</td>\n",
362 |        "      <td>Austin (TX)</td>\n",
363 |        "      <td>Public Assist</td>\n",
364 |        "      <td>Normal</td>\n",
365 |        "      <td>Dog</td>\n",
366 |        "      <td>Neutered Male</td>\n",
367 |        "      <td>Australian Cattle Dog/Pit Bull</td>\n",
368 |        "      <td>Black/White</td>\n",
369 |        "      <td>3285</td>\n",
370 |        "    </tr>\n",
371 |        "    <tr>\n",
372 |        "      <th>3</th>\n",
373 |        "      <td>A693330</td>\n",
374 |        "      <td>Adoption</td>\n",
375 |        "      <td>Spayed Female</td>\n",
376 |        "      <td>Hope</td>\n",
377 |        "      <td>Levander Loop &amp; Airport Blvd in Austin (TX)</td>\n",
378 |        "      <td>Stray</td>\n",
379 |        "      <td>Normal</td>\n",
380 |        "      <td>Dog</td>\n",
381 |        "      <td>Intact Female</td>\n",
382 |        "      <td>Miniature Poodle</td>\n",
383 |        "      <td>Gray</td>\n",
384 |        "      <td>1825</td>\n",
385 |        "    </tr>\n",
386 |        "    <tr>\n",
387 |        "      <th>4</th>\n",
388 |        "      <td>A812431</td>\n",
389 |        "      <td>Adoption</td>\n",
390 |        "      <td>Neutered Male</td>\n",
391 |        "      <td>NaN</td>\n",
392 |        "      <td>Austin (TX)</td>\n",
393 |        "      <td>Owner Surrender</td>\n",
394 |        "      <td>Injured</td>\n",
395 |        "      <td>Cat</td>\n",
396 |        "      <td>Intact Male</td>\n",
397 |        "      <td>Domestic Shorthair</td>\n",
398 |        "      <td>Blue/White</td>\n",
399 |        "      <td>210</td>\n",
400 |        "    </tr>\n",
401 |        "  </tbody>\n",
402 |        "</table>\n",
403 |        "</div>"
404 |       ],
405 |       "text/plain": [
406 |        "    Pet ID     Outcome Type Sex upon Outcome  Name  \\\n",
407 |        "0  A782657         Adoption    Spayed Female   NaN   \n",
408 |        "1  A804622         Adoption    Neutered Male   NaN   \n",
409 |        "2  A786693  Return to Owner    Neutered Male  Zeus   \n",
410 |        "3  A693330         Adoption    Spayed Female  Hope   \n",
411 |        "4  A812431         Adoption    Neutered Male   NaN   \n",
412 |        "\n",
413 |        "                                Found Location      Intake Type  \\\n",
414 |        "0           1911 Dear Run Drive in Austin (TX)            Stray   \n",
415 |        "1              702 Grand Canyon in Austin (TX)            Stray   \n",
416 |        "2                                  Austin (TX)    Public Assist   \n",
417 |        "3  Levander Loop & Airport Blvd in Austin (TX)            Stray   \n",
418 |        "4                                  Austin (TX)  Owner Surrender   \n",
419 |        "\n",
420 |        "  Intake Condition Pet Type Sex upon Intake                           Breed  \\\n",
421 |        "0           Normal      Dog   Intact Female          Labrador Retriever Mix   \n",
422 |        "1           Normal      Dog     Intact Male           Boxer/Anatol Shepherd   \n",
423 |        "2           Normal      Dog   Neutered Male  Australian Cattle Dog/Pit Bull   \n",
424 |        "3           Normal      Dog   Intact Female                Miniature Poodle   \n",
425 |        "4          Injured      Cat     Intact Male              Domestic Shorthair   \n",
426 |        "\n",
427 |        "            Color  Age upon Intake Days  \n",
428 |        "0           Black                    60  \n",
429 |        "1  Brown/Tricolor                    60  \n",
430 |        "2     Black/White                  3285  \n",
431 |        "3            Gray                  1825  \n",
432 |        "4      Blue/White                   210  "
433 |       ]
434 |      },
435 |      "execution_count": 4,
436 |      "metadata": {},
437 |      "output_type": "execute_result"
438 |     }
439 |    ],
440 |    "source": [
441 |     "# Implement here\n",
442 |     "\n",
443 |     "test_data.head()"
444 |    ]
445 |   },
446 |   {
447 |    "cell_type": "markdown",
448 |    "metadata": {},
449 |    "source": [
450 |     "### 2.2 <a name=\"22\">Select features to build the model</a> \n",
451 |     "(<a href=\"#2\">Go to Train a model</a>)\n"
452 |    ]
453 |   },
454 |   {
455 |    "cell_type": "code",
456 |    "execution_count": 5,
457 |    "metadata": {},
458 |    "outputs": [],
459 |    "source": [
460 |     "# Implement here\n",
461 |     "\n",
462 |     "# numerical_features = ..."
463 |    ]
464 |   },
465 |   {
466 |    "cell_type": "markdown",
467 |    "metadata": {},
468 |    "source": [
469 |     "### 2.3 <a name=\"23\">Data Processing</a> \n",
470 |     "(<a href=\"#2\">Go to Train a model</a>)\n"
471 |    ]
472 |   },
473 |   {
474 |    "cell_type": "code",
475 |    "execution_count": 6,
476 |    "metadata": {
477 |     "scrolled": true
478 |    },
479 |    "outputs": [],
480 |    "source": [
481 |     "# Implement here\n"
482 |    ]
483 |   },
484 |   {
485 |    "cell_type": "markdown",
486 |    "metadata": {},
487 |    "source": [
488 |     "### 2.4 <a name=\"24\">Model training</a> \n",
489 |     "(<a href=\"#2\">Go to Train a model</a>)\n"
490 |    ]
491 |   },
492 |   {
493 |    "cell_type": "code",
494 |    "execution_count": 7,
495 |    "metadata": {
496 |     "scrolled": true
497 |    },
498 |    "outputs": [],
499 |    "source": [
500 |     "# Implement here\n"
501 |    ]
502 |   },
503 |   {
504 |    "cell_type": "markdown",
505 |    "metadata": {},
506 |    "source": [
507 |     "## 3. <a name=\"3\">Make predictions on the test dataset</a> (Implement)\n",
508 |     "(<a href=\"#0\">Go to top</a>)\n",
509 |     "\n",
510 |     "Use the test set to make predictions with the trained model."
511 |    ]
512 |   },
513 |   {
514 |    "cell_type": "code",
515 |    "execution_count": 8,
516 |    "metadata": {
517 |     "tags": []
518 |    },
519 |    "outputs": [],
520 |    "source": [
521 |     "# Implement here\n",
522 |     "\n",
523 |     "# test_predictions = ..."
524 |    ]
525 |   }
526 |  ],
527 |  "metadata": {
528 |   "kernelspec": {
529 |    "display_name": "sagemaker-distribution:Python",
530 |    "language": "python",
531 |    "name": "conda-env-sagemaker-distribution-py"
532 |   },
533 |   "language_info": {
534 |    "codemirror_mode": {
535 |     "name": "ipython",
536 |     "version": 3
537 |    },
538 |    "file_extension": ".py",
539 |    "mimetype": "text/x-python",
540 |    "name": "python",
541 |    "nbconvert_exporter": "python",
542 |    "pygments_lexer": "ipython3",
543 |    "version": "3.10.14"
544 |   }
545 |  },
546 |  "nbformat": 4,
547 |  "nbformat_minor": 4
548 | }
549 | 


--------------------------------------------------------------------------------
/notebooks/MLA-TAB-DAY2-TEXT-PROCESS.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "![MLU Logo](../data/MLU_Logo.png)"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "# <a name=\"0\">Machine Learning Accelerator - Tabular Data - Lecture 2</a>\n",
 15 |     "\n",
 16 |     "\n",
 17 |     "## Text Preprocessing\n",
 18 |     "\n",
 19 |     "In this notebok we explore techniques to clean and convert text features into numerical features that machine learning algoritms can work with. \n",
 20 |     "\n",
 21 |     "1. <a href=\"#1\">Common text pre-processing</a>\n",
 22 |     "2. <a href=\"#2\">Lexicon-based text processing</a>\n",
 23 |     "3. <a href=\"#3\">Feature Extraction - Bag of Words</a>\n",
 24 |     "4. <a href=\"#4\">Putting it all together</a>\n",
 25 |     "\n"
 26 |    ]
 27 |   },
 28 |   {
 29 |    "cell_type": "code",
 30 |    "execution_count": 1,
 31 |    "metadata": {},
 32 |    "outputs": [],
 33 |    "source": [
 34 |     "%%capture\n",
 35 |     "%pip install -q -r ../requirements.txt"
 36 |    ]
 37 |   },
 38 |   {
 39 |    "cell_type": "markdown",
 40 |    "metadata": {},
 41 |    "source": [
 42 |     "## 1. <a name=\"1\">Common text pre-processing</a>\n",
 43 |     "(<a href=\"#0\">Go to top</a>)\n",
 44 |     "\n",
 45 |     "In this section, we will do some general purpose text cleaning."
 46 |    ]
 47 |   },
 48 |   {
 49 |    "cell_type": "code",
 50 |    "execution_count": 2,
 51 |    "metadata": {},
 52 |    "outputs": [],
 53 |    "source": [
 54 |     "text = \"   This is a message to be cleaned. It may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  \""
 55 |    ]
 56 |   },
 57 |   {
 58 |    "cell_type": "markdown",
 59 |    "metadata": {},
 60 |    "source": [
 61 |     "Let's first lowercase our text. "
 62 |    ]
 63 |   },
 64 |   {
 65 |    "cell_type": "code",
 66 |    "execution_count": 3,
 67 |    "metadata": {},
 68 |    "outputs": [
 69 |     {
 70 |      "name": "stdout",
 71 |      "output_type": "stream",
 72 |      "text": [
 73 |       "   this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .  \n"
 74 |      ]
 75 |     }
 76 |    ],
 77 |    "source": [
 78 |     "text = text.lower()\n",
 79 |     "print(text)"
 80 |    ]
 81 |   },
 82 |   {
 83 |    "cell_type": "markdown",
 84 |    "metadata": {},
 85 |    "source": [
 86 |     "We can get rid of leading/trailing whitespace with the following:"
 87 |    ]
 88 |   },
 89 |   {
 90 |    "cell_type": "code",
 91 |    "execution_count": 4,
 92 |    "metadata": {},
 93 |    "outputs": [
 94 |     {
 95 |      "name": "stdout",
 96 |      "output_type": "stream",
 97 |      "text": [
 98 |       "this is a message to be cleaned. it may involve some things like: <br>, ?, :, ''  adjacent spaces and tabs     .\n"
 99 |      ]
100 |     }
101 |    ],
102 |    "source": [
103 |     "text = text.strip()\n",
104 |     "print(text)"
105 |    ]
106 |   },
107 |   {
108 |    "cell_type": "markdown",
109 |    "metadata": {},
110 |    "source": [
111 |     "Remove HTML tags/markups:"
112 |    ]
113 |   },
114 |   {
115 |    "cell_type": "code",
116 |    "execution_count": 5,
117 |    "metadata": {},
118 |    "outputs": [
119 |     {
120 |      "name": "stdout",
121 |      "output_type": "stream",
122 |      "text": [
123 |       "this is a message to be cleaned. it may involve some things like: , ?, :, ''  adjacent spaces and tabs     .\n"
124 |      ]
125 |     }
126 |    ],
127 |    "source": [
128 |     "import re\n",
129 |     "\n",
130 |     "text = re.compile('<.*?>').sub('', text)\n",
131 |     "print(text)"
132 |    ]
133 |   },
134 |   {
135 |    "cell_type": "markdown",
136 |    "metadata": {},
137 |    "source": [
138 |     "Replace punctuation with space"
139 |    ]
140 |   },
141 |   {
142 |    "cell_type": "code",
143 |    "execution_count": 6,
144 |    "metadata": {},
145 |    "outputs": [
146 |     {
147 |      "name": "stdout",
148 |      "output_type": "stream",
149 |      "text": [
150 |       "this is a message to be cleaned  it may involve some things like              adjacent spaces and tabs      \n"
151 |      ]
152 |     }
153 |    ],
154 |    "source": [
155 |     "import re, string\n",
156 |     "\n",
157 |     "text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n",
158 |     "print(text)"
159 |    ]
160 |   },
161 |   {
162 |    "cell_type": "markdown",
163 |    "metadata": {},
164 |    "source": [
165 |     "Remove extra space and tabs"
166 |    ]
167 |   },
168 |   {
169 |    "cell_type": "code",
170 |    "execution_count": 7,
171 |    "metadata": {},
172 |    "outputs": [
173 |     {
174 |      "name": "stdout",
175 |      "output_type": "stream",
176 |      "text": [
177 |       "this is a message to be cleaned it may involve some things like adjacent spaces and tabs \n"
178 |      ]
179 |     }
180 |    ],
181 |    "source": [
182 |     "import re\n",
183 |     "\n",
184 |     "text = re.sub('\\s+', ' ', text)\n",
185 |     "print(text)"
186 |    ]
187 |   },
188 |   {
189 |    "cell_type": "markdown",
190 |    "metadata": {},
191 |    "source": [
192 |     "## 2. <a name=\"2\">Lexicon-based text processing</a>\n",
193 |     "(<a href=\"#0\">Go to top</a>)\n",
194 |     "\n",
195 |     "In section 1, we saw some general purpose text pre-processing methods. Lexicon based methods are usually used __to normalize sentences in our dataset__ and later in section 3, we will use these normalized sentences for feature extraction. <br/>\n",
196 |     "By normalization, here, __we mean putting words in the sentences into a similar format that will enhance similarities (if any) between sentences__. \n",
197 |     "\n",
198 |     "__Stop word removal:__ There can be some words in our sentences that occur very frequently and don't contribute too much to the overall meaning of the sentences. We usually have a list of these words and remove them from each our sentences. For example: \"a\", \"an\", \"the\", \"this\", \"that\", \"is\""
199 |    ]
200 |   },
201 |   {
202 |    "cell_type": "code",
203 |    "execution_count": 8,
204 |    "metadata": {},
205 |    "outputs": [],
206 |    "source": [
207 |     "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n",
208 |     "\n",
209 |     "filtered_sentence = []\n",
210 |     "words = text.split(\" \")\n",
211 |     "for w in words:\n",
212 |     "    if w not in stop_words:\n",
213 |     "        filtered_sentence.append(w)\n",
214 |     "text = \" \".join(filtered_sentence)"
215 |    ]
216 |   },
217 |   {
218 |    "cell_type": "code",
219 |    "execution_count": 9,
220 |    "metadata": {},
221 |    "outputs": [
222 |     {
223 |      "name": "stdout",
224 |      "output_type": "stream",
225 |      "text": [
226 |       "message be cleaned may involve some things like adjacent spaces tabs \n"
227 |      ]
228 |     }
229 |    ],
230 |    "source": [
231 |     "print(text)"
232 |    ]
233 |   },
234 |   {
235 |    "cell_type": "markdown",
236 |    "metadata": {},
237 |    "source": [
238 |     "__Stemming:__ Stemming is a rule-based system to __convert words into their root form__. <br/>\n",
239 |     "It removes suffixes from words. This helps us enhace similarities (if any) between sentences. \n",
240 |     "\n",
241 |     "Example:\n",
242 |     "\n",
243 |     "\"jumping\", \"jumped\" -> \"jump\"\n",
244 |     "\n",
245 |     "\"cars\" -> \"car\""
246 |    ]
247 |   },
248 |   {
249 |    "cell_type": "code",
250 |    "execution_count": 10,
251 |    "metadata": {},
252 |    "outputs": [],
253 |    "source": [
254 |     "# We use the NLTK library\n",
255 |     "import nltk\n",
256 |     "from nltk.stem import SnowballStemmer\n",
257 |     "\n",
258 |     "# Initialize the stemmer\n",
259 |     "snow = SnowballStemmer('english')\n",
260 |     "\n",
261 |     "stemmed_sentence = []\n",
262 |     "words = text.split(\" \")\n",
263 |     "for w in words:\n",
264 |     "    stemmed_sentence.append(snow.stem(w))\n",
265 |     "text = \" \".join(stemmed_sentence)"
266 |    ]
267 |   },
268 |   {
269 |    "cell_type": "code",
270 |    "execution_count": 11,
271 |    "metadata": {},
272 |    "outputs": [
273 |     {
274 |      "name": "stdout",
275 |      "output_type": "stream",
276 |      "text": [
277 |       "messag be clean may involv some thing like adjac space tab \n"
278 |      ]
279 |     }
280 |    ],
281 |    "source": [
282 |     "print(text)"
283 |    ]
284 |   },
285 |   {
286 |    "cell_type": "markdown",
287 |    "metadata": {},
288 |    "source": [
289 |     "## 3. <a name=\"3\">Feature Extraction - Bag of Words</a>\n",
290 |     "(<a href=\"#0\">Go to top</a>)\n",
291 |     "\n",
292 |     "In this section, we assume we will first apply the common and lexicon based pre-processing to our text. After those, we will convert our text data into numerical data with the __Bag of Words (BoW)__ representation. \n",
293 |     "\n",
294 |     "__Bag of Words (BoW)__: A modeling technique to convert text information into numerical representation. <br/>\n",
295 |     "__Machine learning models expect numerical or categorical values as input and won't work with raw text data__. \n",
296 |     "\n",
297 |     "Steps:\n",
298 |     "1. Create vocabulary of known words\n",
299 |     "2. Measure presence of the known words in sentences\n",
300 |     "\n",
301 |     "Let's seen an interactive example for ourselves:"
302 |    ]
303 |   },
304 |   {
305 |    "cell_type": "code",
306 |    "execution_count": 12,
307 |    "metadata": {},
308 |    "outputs": [
309 |     {
310 |      "data": {
311 |       "text/html": [
312 |        "\n",
313 |        "        <script>var BagOfWords=function(){\"use strict\";var pt=document.createElement(\"style\");pt.textContent=`input.svelte-5idota{width:300px;overflow:scroll;height:38px}.description.svelte-5idota{font-size:16px;opacity:.9}#title-div.svelte-5idota{display:flex;align-items:center}section.svelte-5idota{margin:auto;padding-bottom:15px}table.svelte-5idota{font-size:16px;background-color:#fff;border-collapse:collapse}.table-head.svelte-5idota{padding:16px;font-size:16px;text-align:left;border:none}td.svelte-5idota{padding:16px;background-color:#fff;border-bottom:1px solid black;font-size:18px;text-align:center}button.svelte-5idota{padding:6px 10px;font-size:16px;border:1px solid black;text-align:center;margin:2px 8px;font-weight:700}button.svelte-5idota:hover,.active.svelte-5idota{color:#fff;background-color:coral}thead.svelte-5idota{border-bottom:1px solid black}\n",
314 |        "`,document.head.appendChild(pt);function Q(){}function gt(t){return t()}function vt(){return Object.create(null)}function R(t){t.forEach(gt)}function bt(t){return typeof t==\"function\"}function It(t,e){return t!=t?e==e:t!==e||t&&typeof t==\"object\"||typeof t==\"function\"}function Pt(t){return Object.keys(t).length===0}function s(t,e){t.appendChild(e)}function U(t,e,n){t.insertBefore(e,n||null)}function M(t){t.parentNode&&t.parentNode.removeChild(t)}function Z(t,e){for(let n=0;n<t.length;n+=1)t[n]&&t[n].d(e)}function r(t){return document.createElement(t)}function B(t){return document.createTextNode(t)}function m(){return B(\" \")}function V(t,e,n,i){return t.addEventListener(e,n,i),()=>t.removeEventListener(e,n,i)}function h(t,e,n){n==null?t.removeAttribute(e):t.getAttribute(e)!==n&&t.setAttribute(e,n)}function qt(t){return Array.from(t.childNodes)}function tt(t,e){e=\"\"+e,t.wholeText!==e&&(t.data=e)}function D(t,e){t.value=e==null?\"\":e}function et(t,e,n){t.classList[n?\"add\":\"remove\"](e)}let ut;function X(t){ut=t}const Y=[],mt=[],nt=[],kt=[],Ft=Promise.resolve();let ft=!1;function Gt(){ft||(ft=!0,Ft.then(yt))}function dt(t){nt.push(t)}const ht=new Set;let lt=0;function yt(){const t=ut;do{for(;lt<Y.length;){const e=Y[lt];lt++,X(e),Ht(e.$$)}for(X(null),Y.length=0,lt=0;mt.length;)mt.pop()();for(let e=0;e<nt.length;e+=1){const n=nt[e];ht.has(n)||(ht.add(n),n())}nt.length=0}while(Y.length);for(;kt.length;)kt.pop()();ft=!1,ht.clear(),X(t)}function Ht(t){if(t.fragment!==null){t.update(),R(t.before_update);const e=t.dirty;t.dirty=[-1],t.fragment&&t.fragment.p(t.ctx,e),t.after_update.forEach(dt)}}const Jt=new Set;function Kt(t,e){t&&t.i&&(Jt.delete(t),t.i(e))}function Qt(t,e,n,i){const{fragment:c,after_update:u}=t.$$;c&&c.m(e,n),i||dt(()=>{const y=t.$$.on_mount.map(gt).filter(bt);t.$$.on_destroy?t.$$.on_destroy.push(...y):R(y),t.$$.on_mount=[]}),u.forEach(dt)}function Rt(t,e){const n=t.$$;n.fragment!==null&&(R(n.on_destroy),n.fragment&&n.fragment.d(e),n.on_destroy=n.fragment=null,n.ctx=[])}function Ut(t,e){t.$$.dirty[0]===-1&&(Y.push(t),Gt(),t.$$.dirty.fill(0)),t.$$.dirty[e/31|0]|=1<<e%31}function Vt(t,e,n,i,c,u,y,L=[-1]){const d=ut;X(t);const a=t.$$={fragment:null,ctx:[],props:u,update:Q,not_equal:c,bound:vt(),on_mount:[],on_destroy:[],on_disconnect:[],before_update:[],after_update:[],context:new Map(e.context||(d?d.$$.context:[])),callbacks:vt(),dirty:L,skip_bound:!1,root:e.target||d.$$.root};y&&y(a.root);let x=!1;if(a.ctx=n?n(t,e.props||{},(f,$,...k)=>{const W=k.length?k[0]:$;return a.ctx&&c(a.ctx[f],a.ctx[f]=W)&&(!a.skip_bound&&a.bound[f]&&a.bound[f](W),x&&Ut(t,f)),$}):[],a.update(),x=!0,R(a.before_update),a.fragment=i?i(a.ctx):!1,e.target){if(e.hydrate){const f=qt(e.target);a.fragment&&a.fragment.l(f),f.forEach(M)}else a.fragment&&a.fragment.c();e.intro&&Kt(t.$$.fragment),Qt(t,e.target,e.anchor,e.customElement),yt()}X(d)}class Xt{$destroy(){Rt(this,1),this.$destroy=Q}$on(e,n){if(!bt(n))return Q;const i=this.$$.callbacks[e]||(this.$$.callbacks[e]=[]);return i.push(n),()=>{const c=i.indexOf(n);c!==-1&&i.splice(c,1)}}$set(e){this.$$set&&!Pt(e)&&(this.$$.skip_bound=!0,this.$$set(e),this.$$.skip_bound=!1)}}const ee=\"\";function xt(t,e,n){const i=t.slice();return i[15]=e[n],i}function wt(t,e,n){const i=t.slice();return i[15]=e[n],i}function Ct(t,e,n){const i=t.slice();return i[15]=e[n],i}function $t(t,e,n){const i=t.slice();return i[15]=e[n],i}function Et(t){let e,n=t[15]+\"\",i;return{c(){e=r(\"th\"),i=B(n),h(e,\"class\",\"table-head svelte-5idota\")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&16&&n!==(n=c[15]+\"\")&&tt(i,n)},d(c){c&&M(e)}}}function St(t){let e,n=t[7][t[15]]+\"\",i;return{c(){e=r(\"td\"),i=B(n),h(e,\"class\",\"svelte-5idota\")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&144&&n!==(n=c[7][c[15]]+\"\")&&tt(i,n)},d(c){c&&M(e)}}}function Bt(t){let e,n=t[6][t[15]]+\"\",i;return{c(){e=r(\"td\"),i=B(n),h(e,\"class\",\"svelte-5idota\")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&80&&n!==(n=c[6][c[15]]+\"\")&&tt(i,n)},d(c){c&&M(e)}}}function Ot(t){let e,n=t[5][t[15]]+\"\",i;return{c(){e=r(\"td\"),i=B(n),h(e,\"class\",\"svelte-5idota\")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&48&&n!==(n=c[5][c[15]]+\"\")&&tt(i,n)},d(c){c&&M(e)}}}function Yt(t){let e,n,i,c,u,y,L,d,a,x,f,$,k,W,O,I,z,P,ot,it,w,E,S,A,zt,j,At,Lt,q,ct,st,Wt,N,jt,Nt,F,rt,at,Tt,T,Mt,_t,Dt,G=t[4],p=[];for(let o=0;o<G.length;o+=1)p[o]=Et($t(t,G,o));let H=t[4],g=[];for(let o=0;o<H.length;o+=1)g[o]=St(Ct(t,H,o));let J=t[4],v=[];for(let o=0;o<J.length;o+=1)v[o]=Bt(wt(t,J,o));let K=t[4],b=[];for(let o=0;o<K.length;o+=1)b[o]=Ot(xt(t,K,o));return{c(){e=r(\"section\"),n=r(\"h2\"),n.textContent=\"Bag of Words Demo\",i=m(),c=r(\"p\"),c.textContent=`Edit the sentences and watch the corresponding vocabulary and cell-counts\n",
315 |        "    update:`,u=m(),y=r(\"br\"),L=m(),d=r(\"div\"),a=r(\"p\"),a.textContent=\"Count Method:\",x=m(),f=r(\"button\"),f.textContent=\"Count\",$=m(),k=r(\"button\"),k.textContent=\"Binary\",W=m(),O=r(\"table\"),I=r(\"thead\"),z=r(\"tr\"),P=r(\"th\"),ot=m();for(let o=0;o<p.length;o+=1)p[o].c();it=m(),w=r(\"tbody\"),E=r(\"tr\"),S=r(\"td\"),A=r(\"div\"),zt=B(`Sentence 1:\n",
316 |        "            `),j=r(\"input\"),At=m();for(let o=0;o<g.length;o+=1)g[o].c();Lt=m(),q=r(\"tr\"),ct=r(\"td\"),st=r(\"div\"),Wt=B(`Sentence 2:\n",
317 |        "            `),N=r(\"input\"),jt=m();for(let o=0;o<v.length;o+=1)v[o].c();Nt=m(),F=r(\"tr\"),rt=r(\"td\"),at=r(\"div\"),Tt=B(`Sentence 3:\n",
318 |        "            `),T=r(\"input\"),Mt=m();for(let o=0;o<b.length;o+=1)b[o].c();h(c,\"class\",\"description svelte-5idota\"),h(a,\"class\",\"description svelte-5idota\"),h(f,\"class\",\"svelte-5idota\"),et(f,\"active\",t[3]===\"count\"),h(k,\"class\",\"svelte-5idota\"),et(k,\"active\",t[3]===\"ohe\"),h(d,\"id\",\"title-div\"),h(d,\"class\",\"svelte-5idota\"),h(I,\"class\",\"svelte-5idota\"),h(j,\"class\",\"svelte-5idota\"),h(S,\"class\",\"table-head svelte-5idota\"),h(N,\"class\",\"svelte-5idota\"),h(ct,\"class\",\"table-head svelte-5idota\"),h(T,\"class\",\"svelte-5idota\"),h(rt,\"class\",\"table-head svelte-5idota\"),h(O,\"class\",\"svelte-5idota\"),h(e,\"class\",\"svelte-5idota\")},m(o,_){U(o,e,_),s(e,n),s(e,i),s(e,c),s(e,u),s(e,y),s(e,L),s(e,d),s(d,a),s(d,x),s(d,f),s(d,$),s(d,k),s(e,W),s(e,O),s(O,I),s(I,z),s(z,P),s(z,ot);for(let l=0;l<p.length;l+=1)p[l].m(z,null);s(O,it),s(O,w),s(w,E),s(E,S),s(S,A),s(A,zt),s(A,j),D(j,t[0]),s(E,At);for(let l=0;l<g.length;l+=1)g[l].m(E,null);s(w,Lt),s(w,q),s(q,ct),s(ct,st),s(st,Wt),s(st,N),D(N,t[1]),s(q,jt);for(let l=0;l<v.length;l+=1)v[l].m(q,null);s(w,Nt),s(w,F),s(F,rt),s(rt,at),s(at,Tt),s(at,T),D(T,t[2]),s(F,Mt);for(let l=0;l<b.length;l+=1)b[l].m(F,null);_t||(Dt=[V(f,\"click\",t[9]),V(k,\"click\",t[10]),V(j,\"input\",t[11]),V(N,\"input\",t[12]),V(T,\"input\",t[13])],_t=!0)},p(o,[_]){if(_&8&&et(f,\"active\",o[3]===\"count\"),_&8&&et(k,\"active\",o[3]===\"ohe\"),_&16){G=o[4];let l;for(l=0;l<G.length;l+=1){const C=$t(o,G,l);p[l]?p[l].p(C,_):(p[l]=Et(C),p[l].c(),p[l].m(z,null))}for(;l<p.length;l+=1)p[l].d(1);p.length=G.length}if(_&1&&j.value!==o[0]&&D(j,o[0]),_&144){H=o[4];let l;for(l=0;l<H.length;l+=1){const C=Ct(o,H,l);g[l]?g[l].p(C,_):(g[l]=St(C),g[l].c(),g[l].m(E,null))}for(;l<g.length;l+=1)g[l].d(1);g.length=H.length}if(_&2&&N.value!==o[1]&&D(N,o[1]),_&80){J=o[4];let l;for(l=0;l<J.length;l+=1){const C=wt(o,J,l);v[l]?v[l].p(C,_):(v[l]=Bt(C),v[l].c(),v[l].m(q,null))}for(;l<v.length;l+=1)v[l].d(1);v.length=J.length}if(_&4&&T.value!==o[2]&&D(T,o[2]),_&48){K=o[4];let l;for(l=0;l<K.length;l+=1){const C=xt(o,K,l);b[l]?b[l].p(C,_):(b[l]=Ot(C),b[l].c(),b[l].m(F,null))}for(;l<b.length;l+=1)b[l].d(1);b.length=K.length}},i:Q,o:Q,d(o){o&&M(e),Z(p,o),Z(g,o),Z(v,o),Z(b,o),_t=!1,R(Dt)}}}function Zt(t,e,n){let i,c,u,y,L,d=\"I\",a=\"love dogs\",x=\"dogs dogs dogs\",f=\"count\";function $(P,ot,it){let w={},E=ot.replace(/[^\\w\\s]/gi,\"\").toLowerCase();for(const S of P)it===\"count\"?w[S]=E.split(\" \").filter(A=>A===S).length:w[S]=E.split(\" \").filter(A=>A===S).length?1:0;return w}const k=()=>{n(3,f=\"count\")},W=()=>{n(3,f=\"ohe\")};function O(){d=this.value,n(0,d)}function I(){a=this.value,n(1,a)}function z(){x=this.value,n(2,x)}return t.$$.update=()=>{t.$$.dirty&7&&n(8,i=(d+\" \"+a+\" \"+x).replace(/[^\\w\\s]/gi,\"\").toLowerCase()),t.$$.dirty&256&&console.log(\"tokens\",i),t.$$.dirty&256&&n(4,c=Array.from(new Set(i.split(\" \"))).filter(P=>P!==\"\")),t.$$.dirty&16&&console.log(\"vocab\",c),t.$$.dirty&25&&n(7,u=$(c,d,f)),t.$$.dirty&26&&n(6,y=$(c,a,f)),t.$$.dirty&28&&n(5,L=$(c,x,f))},[d,a,x,f,c,L,y,u,i,k,W,O,I,z]}class te extends Xt{constructor(e){super(),Vt(this,e,Zt,Yt,It,{})}}return te}();\n",
319 |        "</script>\n",
320 |        "        \n",
321 |        "        <div id=\"BagOfWords-328cd234\"></div>\n",
322 |        "        <script>\n",
323 |        "        (() => {\n",
324 |        "            var data = {};\n",
325 |        "            window.BagOfWords_data = data;\n",
326 |        "            var BagOfWords_inst = new BagOfWords({\n",
327 |        "                \"target\": document.getElementById(\"BagOfWords-328cd234\"),\n",
328 |        "                \"props\": data\n",
329 |        "            });\n",
330 |        "        })();\n",
331 |        "        </script>\n",
332 |        "        \n",
333 |        "        "
334 |       ],
335 |       "text/plain": [
336 |        "<mluvisuals.BagOfWords at 0x7fa5101db4f0>"
337 |       ]
338 |      },
339 |      "execution_count": 12,
340 |      "metadata": {},
341 |      "output_type": "execute_result"
342 |     }
343 |    ],
344 |    "source": [
345 |     "from mluvisuals import *\n",
346 |     "\n",
347 |     "BagOfWords()"
348 |    ]
349 |   },
350 |   {
351 |    "cell_type": "markdown",
352 |    "metadata": {},
353 |    "source": [
354 |     "We will use the sklearn library's Bag of Words implementation:\n",
355 |     "\n",
356 |     "`from sklearn.feature_extraction.text import CountVectorizer`\n",
357 |     "\n",
358 |     "`countVectorizer = CountVectorizer(binary=True)`"
359 |    ]
360 |   },
361 |   {
362 |    "cell_type": "code",
363 |    "execution_count": 13,
364 |    "metadata": {},
365 |    "outputs": [],
366 |    "source": [
367 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
368 |     "countVectorizer = CountVectorizer(binary=True)\n",
369 |     "\n",
370 |     "sentences = [\n",
371 |     "    'This is the first document.',\n",
372 |     "    'This is the second second document.',\n",
373 |     "    'And the third one.',\n",
374 |     "    'Is this the first document?'\n",
375 |     "]\n",
376 |     "X = countVectorizer.fit_transform(sentences)"
377 |    ]
378 |   },
379 |   {
380 |    "cell_type": "markdown",
381 |    "metadata": {},
382 |    "source": [
383 |     "Let's print the vocabulary below. <br/>\n",
384 |     "Each number next to a word shows the index of it in the vocabulary (From 0 to 8 here).<br/>\n",
385 |     "They are alphabetically ordered-> and:0, document:1, first:2, ..."
386 |    ]
387 |   },
388 |   {
389 |    "cell_type": "code",
390 |    "execution_count": 14,
391 |    "metadata": {},
392 |    "outputs": [
393 |     {
394 |      "name": "stdout",
395 |      "output_type": "stream",
396 |      "text": [
397 |       "{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}\n"
398 |      ]
399 |     }
400 |    ],
401 |    "source": [
402 |     "print(countVectorizer.vocabulary_)"
403 |    ]
404 |   },
405 |   {
406 |    "cell_type": "markdown",
407 |    "metadata": {},
408 |    "source": [
409 |     "__Note:__ Sklearn automatically removes punctuation, but doesn't do the other extra pre-processing methods we discussed here. <br/>\n",
410 |     "Lexicon-based methods are also not automaticaly applied, we need to call those methods before feature extraction."
411 |    ]
412 |   },
413 |   {
414 |    "cell_type": "code",
415 |    "execution_count": 15,
416 |    "metadata": {},
417 |    "outputs": [
418 |     {
419 |      "name": "stdout",
420 |      "output_type": "stream",
421 |      "text": [
422 |       "[[0 1 1 1 0 0 1 0 1]\n",
423 |       " [0 1 0 1 0 1 1 0 1]\n",
424 |       " [1 0 0 0 1 0 1 1 0]\n",
425 |       " [0 1 1 1 0 0 1 0 1]]\n"
426 |      ]
427 |     }
428 |    ],
429 |    "source": [
430 |     "print(X.toarray())"
431 |    ]
432 |   },
433 |   {
434 |    "cell_type": "markdown",
435 |    "metadata": {},
436 |    "source": [
437 |     "__What happens when we encounter a new word during prediction?__ \n",
438 |     "\n",
439 |     "__New words will be skipped__. <br/>\n",
440 |     "This usually happens when we are making predictions. For our test and validation data/text, we need to use the __.transform()__ function this time. <br/>\n",
441 |     "This simulates a real-time prediction case where we cannot re-train the model quickly whenever we receive new words."
442 |    ]
443 |   },
444 |   {
445 |    "cell_type": "code",
446 |    "execution_count": 16,
447 |    "metadata": {},
448 |    "outputs": [
449 |     {
450 |      "name": "stdout",
451 |      "output_type": "stream",
452 |      "text": [
453 |       "[[0 1 0 0 0 0 0 0 1]\n",
454 |       " [0 0 0 1 1 0 0 0 1]]\n"
455 |      ]
456 |     }
457 |    ],
458 |    "source": [
459 |     "test_sentences = [\"this document has some new words\",\n",
460 |     "                 \"this one is new too\"]\n",
461 |     "\n",
462 |     "count_vectors = countVectorizer.transform(test_sentences)\n",
463 |     "print(count_vectors.toarray())"
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "markdown",
468 |    "metadata": {},
469 |    "source": [
470 |     "See that these last two vectors have the same lenght 9 (same vocabulary) like the ones before."
471 |    ]
472 |   },
473 |   {
474 |    "cell_type": "markdown",
475 |    "metadata": {},
476 |    "source": [
477 |     "## 4. <a name=\"4\">Putting it all together</a>\n",
478 |     "(<a href=\"#0\">Go to top</a>)\n",
479 |     "\n",
480 |     "Let's have a full example here. We will apply everything discussed in this notebook."
481 |    ]
482 |   },
483 |   {
484 |    "cell_type": "code",
485 |    "execution_count": 17,
486 |    "metadata": {},
487 |    "outputs": [],
488 |    "source": [
489 |     "# Prepare cleaning functions\n",
490 |     "import re, string\n",
491 |     "import nltk\n",
492 |     "from nltk.stem import SnowballStemmer\n",
493 |     "\n",
494 |     "stop_words = [\"a\", \"an\", \"the\", \"this\", \"that\", \"is\", \"it\", \"to\", \"and\"]\n",
495 |     "\n",
496 |     "stemmer = SnowballStemmer('english')\n",
497 |     "\n",
498 |     "def preProcessText(text):\n",
499 |     "    # lowercase and strip leading/trailing white space\n",
500 |     "    text = text.lower().strip()\n",
501 |     "    \n",
502 |     "    # remove HTML tags\n",
503 |     "    text = re.compile('<.*?>').sub('', text)\n",
504 |     "    \n",
505 |     "    # remove punctuation\n",
506 |     "    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)\n",
507 |     "    \n",
508 |     "    # remove extra white space\n",
509 |     "    text = re.sub('\\s+', ' ', text)\n",
510 |     "    \n",
511 |     "    return text\n",
512 |     "\n",
513 |     "def lexiconProcess(text, stop_words, stemmer):\n",
514 |     "    filtered_sentence = []\n",
515 |     "    words = text.split(\" \")\n",
516 |     "    for w in words:\n",
517 |     "        if w not in stop_words:\n",
518 |     "            filtered_sentence.append(stemmer.stem(w))\n",
519 |     "    text = \" \".join(filtered_sentence)\n",
520 |     "    \n",
521 |     "    return text\n",
522 |     "\n",
523 |     "def cleanSentence(text, stop_words, stemmer):\n",
524 |     "    return lexiconProcess(preProcessText(text), stop_words, stemmer)"
525 |    ]
526 |   },
527 |   {
528 |    "cell_type": "code",
529 |    "execution_count": 18,
530 |    "metadata": {},
531 |    "outputs": [],
532 |    "source": [
533 |     "# Prepare vectorizer \n",
534 |     "from sklearn.feature_extraction.text import CountVectorizer\n",
535 |     "\n",
536 |     "textvectorizer = CountVectorizer(binary=True)# can also limit vocabulary size here, with say, max_features=50"
537 |    ]
538 |   },
539 |   {
540 |    "cell_type": "code",
541 |    "execution_count": 19,
542 |    "metadata": {
543 |     "tags": []
544 |    },
545 |    "outputs": [
546 |     {
547 |      "name": "stdout",
548 |      "output_type": "stream",
549 |      "text": [
550 |       "4\n",
551 |       "Vocabulary: \n",
552 |       " {'like': 11, 'materi': 13, 'color': 4, 'overal': 19, 'how': 10, 'look': 12, 'work': 29, 'okay': 18, 'first': 7, 'two': 27, 'time': 26, 'use': 28, 'but': 3, 'third': 24, 'burn': 2, 'my': 15, 'face': 6, 'am': 1, 'not': 17, 'sure': 23, 'about': 0, 'product': 21, 'never': 16, 'thought': 25, 'would': 30, 'pay': 20, 'so': 22, 'much': 14, 'for': 8, 'hair': 9, 'dryer': 5}\n",
553 |       "Bag of Words Binary Features: \n",
554 |       " [[0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]\n",
555 |       " [0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1 0]\n",
556 |       " [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0]\n",
557 |       " [0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 1]]\n",
558 |       "(4, 31)\n"
559 |      ]
560 |     }
561 |    ],
562 |    "source": [
563 |     "# Clean and vectorize a text feature with four samples\n",
564 |     "text_feature = [\"I liked the material, color and overall how it looks.<br /><br />\",\n",
565 |     "             \"Worked okay first two times I used it, but third time burned my face.\",\n",
566 |     "             \"I am not sure about this product.\",\n",
567 |     "             \"I never thought I would pay so much for a hair dryer.\",\n",
568 |     "            ]\n",
569 |     "\n",
570 |     "print(len(text_feature))\n",
571 |     "\n",
572 |     "# Clean up the text\n",
573 |     "text_feature_cleaned = [cleanSentence(item, stop_words, stemmer) for item in text_feature]\n",
574 |     "\n",
575 |     "# Vectorize the cleaned text\n",
576 |     "text_feature_vectorized = textvectorizer.fit_transform(text_feature_cleaned)\n",
577 |     "print('Vocabulary: \\n', textvectorizer.vocabulary_)\n",
578 |     "print('Bag of Words Binary Features: \\n', text_feature_vectorized.toarray())\n",
579 |     "\n",
580 |     "print(text_feature_vectorized.shape)"
581 |    ]
582 |   }
583 |  ],
584 |  "metadata": {
585 |   "kernelspec": {
586 |    "display_name": "sagemaker-distribution:Python",
587 |    "language": "python",
588 |    "name": "conda-env-sagemaker-distribution-py"
589 |   },
590 |   "language_info": {
591 |    "codemirror_mode": {
592 |     "name": "ipython",
593 |     "version": 3
594 |    },
595 |    "file_extension": ".py",
596 |    "mimetype": "text/x-python",
597 |    "name": "python",
598 |    "nbconvert_exporter": "python",
599 |    "pygments_lexer": "ipython3",
600 |    "version": "3.10.14"
601 |   }
602 |  },
603 |  "nbformat": 4,
604 |  "nbformat_minor": 4
605 | }
606 | 


--------------------------------------------------------------------------------
/notebooks/MLA-TAB-DAY3-AUTOML.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "![MLU Logo](../data/MLU_Logo.png)"
   8 |    ]
   9 |   },
  10 |   {
  11 |    "cell_type": "markdown",
  12 |    "metadata": {},
  13 |    "source": [
  14 |     "# <a name=\"0\">Machine Learning Accelerator - Tabular Data - Lecture 3</a>\n",
  15 |     "\n",
  16 |     "\n",
  17 |     "## AutoGluon\n",
  18 |     "\n",
  19 |     "In this notebook, we use __AutoGluon__ to predict the __Outcome Type__ field of our review dataset.\n",
  20 |     "\n",
  21 |     "\n",
  22 |     "[AutoGluon](https://auto.gluon.ai/stable/index.html) implements many of the best practices that we have discussed in this class, and more!  In particular, it sets itself apart from other AutoML solutions by having excellent automated feature engineering that can handle text data and missing values without any hand-coded solutions (See their [paper](https://arxiv.org/abs/2003.06505) for details).  It is too new to be in an existing Sagemaker kernel, so let's install it.\n",
  23 |     "\n",
  24 |     "1. <a href=\"#1\">Set up AutoGluon</a>\n",
  25 |     "2. <a href=\"#2\">Read the datasets</a>\n",
  26 |     "3. <a href=\"#3\">Train a classifier with AutoGluon</a>\n",
  27 |     "4. <a href=\"#4\">Model evaluation</a>\n",
  28 |     "5. <a href=\"#5\">Clean up model artifacts</a>\n",
  29 |     "\n",
  30 |     "__Austin Animal Center Dataset__:\n",
  31 |     "\n",
  32 |     "In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). \n",
  33 |     "\n",
  34 |     "In order to work with a single table, we joined the intake and outcome tables using the \"Animal ID\" column and created a single __review.csv__ file. We also didn't consider animals with multiple entries to the facility to keep our dataset simple. If you want to see the original datasets and the merged data with multiple entries, they are available under data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv and Austin_Animal_Center_Intakes_Outcomes.csv.\n",
  35 |     "\n",
  36 |     "__Dataset schema:__ \n",
  37 |     "- __Pet ID__ - Unique ID of pet\n",
  38 |     "- __Outcome Type__ - State of pet at the time of recording the outcome (0 = not placed, 1 = placed). This is the field to predict.\n",
  39 |     "- __Sex upon Outcome__ - Sex of pet at outcome\n",
  40 |     "- __Name__ - Name of pet \n",
  41 |     "- __Found Location__ - Found location of pet before entered the center\n",
  42 |     "- __Intake Type__ - Circumstances bringing the pet to the center\n",
  43 |     "- __Intake Condition__ - Health condition of pet when entered the center\n",
  44 |     "- __Pet Type__ - Type of pet\n",
  45 |     "- __Sex upon Intake__ - Sex of pet when entered the center\n",
  46 |     "- __Breed__ - Breed of pet \n",
  47 |     "- __Color__ - Color of pet \n",
  48 |     "- __Age upon Intake Days__ - Age of pet when entered the center (days)\n",
  49 |     "- __Age upon Outcome Days__ - Age of pet at outcome (days))"
  50 |    ]
  51 |   },
  52 |   {
  53 |    "cell_type": "markdown",
  54 |    "metadata": {},
  55 |    "source": [
  56 |     "## 1. <a name=\"1\">Set up AutoGluon</a>\n",
  57 |     "(<a href=\"#0\">Go to top</a>)"
  58 |    ]
  59 |   },
  60 |   {
  61 |    "cell_type": "code",
  62 |    "execution_count": 1,
  63 |    "metadata": {
  64 |     "tags": []
  65 |    },
  66 |    "outputs": [],
  67 |    "source": [
  68 |     "%%capture\n",
  69 |     "%pip install -q -r ../requirements.txt"
  70 |    ]
  71 |   },
  72 |   {
  73 |    "cell_type": "markdown",
  74 |    "metadata": {},
  75 |    "source": [
  76 |     "## 2. <a name=\"2\">Read the dataset</a>\n",
  77 |     "(<a href=\"#0\">Go to top</a>)\n",
  78 |     "\n",
  79 |     "Let's read the dataset into a dataframe, using Pandas, and split the dataset into train and test sets (AutoGluon will handle the validation itself)."
  80 |    ]
  81 |   },
  82 |   {
  83 |    "cell_type": "code",
  84 |    "execution_count": 2,
  85 |    "metadata": {
  86 |     "tags": []
  87 |    },
  88 |    "outputs": [],
  89 |    "source": [
  90 |     "import pandas as pd\n",
  91 |     "\n",
  92 |     "df = pd.read_csv('../data/review/review_dataset.csv')"
  93 |    ]
  94 |   },
  95 |   {
  96 |    "cell_type": "code",
  97 |    "execution_count": 3,
  98 |    "metadata": {
  99 |     "tags": []
 100 |    },
 101 |    "outputs": [],
 102 |    "source": [
 103 |     "from sklearn.model_selection import train_test_split\n",
 104 |     "\n",
 105 |     "train_data, test_data = train_test_split(df, test_size=0.1, shuffle=True, random_state=23)"
 106 |    ]
 107 |   },
 108 |   {
 109 |    "cell_type": "markdown",
 110 |    "metadata": {},
 111 |    "source": [
 112 |     "## 3. <a name=\"3\">Train a classifier with AutoGluon</a>\n",
 113 |     "(<a href=\"#0\">Go to top</a>)\n",
 114 |     "\n",
 115 |     "We can run AutoGluon with a short snippet. For fitting, we just call the __.fit()__ function. In this exercise, we used the data frame objects, but this tool also accepts the raw csv files as input. To use this tool with simple csv files, you can follow the code snippet below.\n",
 116 |     "\n",
 117 |     "```python\n",
 118 |     "from autogluon.tabular import TabularDataset, TabularPredictor\n",
 119 |     "\n",
 120 |     "train_data = TabularDataset(file_path='path_to_dataset/train.csv')\n",
 121 |     "test_data = TabularDataset(file_path='path_to_dataset/test.csv')\n",
 122 |     "\n",
 123 |     "predictor = TabularPredictor(label='label_column').fit(train_data)\n",
 124 |     "test_predictions = predictor.predict(test_data)\n",
 125 |     "```\n",
 126 |     "\n",
 127 |     "We have our separate __data frames__ for training and test data, so we work with them below. We grab the first 10000 data points for a quick demo. You can also pass the full dataset."
 128 |    ]
 129 |   },
 130 |   {
 131 |    "cell_type": "code",
 132 |    "execution_count": 4,
 133 |    "metadata": {
 134 |     "scrolled": true,
 135 |     "tags": []
 136 |    },
 137 |    "outputs": [
 138 |     {
 139 |      "name": "stderr",
 140 |      "output_type": "stream",
 141 |      "text": [
 142 |       "No path specified. Models will be saved in: \"AutogluonModels/ag-20241001_183703/\"\n",
 143 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/core/utils/utils.py:549: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.\n",
 144 |       "  with pd.option_context(\"mode.use_inf_as_na\", True):  # treat None, NaN, INF, NINF as NA\n",
 145 |       "Beginning AutoGluon training ...\n",
 146 |       "AutoGluon will save models to \"AutogluonModels/ag-20241001_183703/\"\n",
 147 |       "AutoGluon Version:  0.8.3\n",
 148 |       "Python Version:     3.10.14\n",
 149 |       "Operating System:   Linux\n",
 150 |       "Platform Machine:   x86_64\n",
 151 |       "Platform Version:   #1 SMP Tue Sep 10 22:02:55 UTC 2024\n",
 152 |       "Disk Space Avail:   10.77 GB / 26.83 GB (40.1%)\n",
 153 |       "Train Data Rows:    10000\n",
 154 |       "Train Data Columns: 12\n",
 155 |       "Label Column: Outcome Type\n",
 156 |       "Preprocessing data ...\n",
 157 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/core/utils/utils.py:549: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.\n",
 158 |       "  with pd.option_context(\"mode.use_inf_as_na\", True):  # treat None, NaN, INF, NINF as NA\n",
 159 |       "AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
 160 |       "\t2 unique label values:  [1.0, 0.0]\n",
 161 |       "\tIf 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
 162 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/tabular/learner/default_learner.py:215: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.\n",
 163 |       "  with pd.option_context(\"mode.use_inf_as_na\", True):  # treat None, NaN, INF, NINF as NA\n",
 164 |       "Selected class <--> label mapping:  class 1 = 1, class 0 = 0\n",
 165 |       "Using Feature Generators to preprocess the data ...\n",
 166 |       "Fitting AutoMLPipelineFeatureGenerator...\n",
 167 |       "\tAvailable Memory:                    12980.59 MB\n",
 168 |       "\tTrain Data (Original)  Memory Usage: 6.86 MB (0.1% of available memory)\n",
 169 |       "\tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
 170 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 171 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 172 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 173 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 174 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 175 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 176 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 177 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 178 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 179 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 180 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 181 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 182 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 183 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 184 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 185 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 186 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 187 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 188 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/features/infer_types.py:118: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.\n",
 189 |       "  result = pd.to_datetime(X, errors=\"coerce\")\n",
 190 |       "\tStage 1 Generators:\n",
 191 |       "\t\tFitting AsTypeFeatureGenerator...\n",
 192 |       "\tStage 2 Generators:\n",
 193 |       "\t\tFitting FillNaFeatureGenerator...\n",
 194 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/features/generators/fillna.py:58: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.\n",
 195 |       "  X.fillna(self._fillna_feature_map, inplace=True, downcast=False)\n",
 196 |       "\tStage 3 Generators:\n",
 197 |       "\t\tFitting IdentityFeatureGenerator...\n",
 198 |       "\t\tFitting CategoryFeatureGenerator...\n",
 199 |       "\t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
 200 |       "\t\tFitting TextSpecialFeatureGenerator...\n",
 201 |       "\t\t\tFitting BinnedFeatureGenerator...\n",
 202 |       "\t\t\tFitting DropDuplicatesFeatureGenerator...\n",
 203 |       "\t\tFitting TextNgramFeatureGenerator...\n",
 204 |       "\t\t\tFitting CountVectorizer for text features: ['Found Location']\n",
 205 |       "\t\t\tCountVectorizer fit with vocabulary size = 198\n",
 206 |       "\tStage 4 Generators:\n",
 207 |       "\t\tFitting DropUniqueFeatureGenerator...\n",
 208 |       "\tStage 5 Generators:\n",
 209 |       "\t\tFitting DropDuplicatesFeatureGenerator...\n",
 210 |       "\tUnused Original Features (Count: 1): ['Pet ID']\n",
 211 |       "\t\tThese features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.\n",
 212 |       "\t\tFeatures can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.\n",
 213 |       "\t\tThese features do not need to be present at inference time.\n",
 214 |       "\t\t('object', []) : 1 | ['Pet ID']\n",
 215 |       "\tTypes of features in original data (raw dtype, special dtypes):\n",
 216 |       "\t\t('int', [])          : 2 | ['Age upon Intake Days', 'Age upon Outcome Days']\n",
 217 |       "\t\t('object', [])       : 8 | ['Sex upon Outcome', 'Name', 'Intake Type', 'Intake Condition', 'Pet Type', ...]\n",
 218 |       "\t\t('object', ['text']) : 1 | ['Found Location']\n",
 219 |       "\tTypes of features in processed data (raw dtype, special dtypes):\n",
 220 |       "\t\t('category', [])                    :   8 | ['Sex upon Outcome', 'Name', 'Intake Type', 'Intake Condition', 'Pet Type', ...]\n",
 221 |       "\t\t('category', ['text_as_category'])  :   1 | ['Found Location']\n",
 222 |       "\t\t('int', [])                         :   2 | ['Age upon Intake Days', 'Age upon Outcome Days']\n",
 223 |       "\t\t('int', ['binned', 'text_special']) :  12 | ['Found Location.char_count', 'Found Location.word_count', 'Found Location.capital_ratio', 'Found Location.lower_ratio', 'Found Location.digit_ratio', ...]\n",
 224 |       "\t\t('int', ['text_ngram'])             : 176 | ['__nlp__.183', '__nlp__.183 and', '__nlp__.183 in', '__nlp__.1st', '__nlp__.290', ...]\n",
 225 |       "\t10.9s = Fit runtime\n",
 226 |       "\t11 features in original data used to generate 199 features in processed data.\n",
 227 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '20637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 228 |       "  memory_usage[column] = (\n",
 229 |       "\tTrain Data (Processed) Memory Usage: 3.94 MB (0.0% of available memory)\n",
 230 |       "Data preprocessing and feature engineering runtime = 11.05s ...\n",
 231 |       "AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
 232 |       "\tTo change this, specify the eval_metric parameter of Predictor()\n",
 233 |       "Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 9000, Val Rows: 1000\n",
 234 |       "User-specified model hyperparameters to be fit:\n",
 235 |       "{\n",
 236 |       "\t'NN_TORCH': {},\n",
 237 |       "\t'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],\n",
 238 |       "\t'CAT': {},\n",
 239 |       "\t'XGB': {},\n",
 240 |       "\t'FASTAI': {},\n",
 241 |       "\t'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n",
 242 |       "\t'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],\n",
 243 |       "\t'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],\n",
 244 |       "}\n",
 245 |       "Fitting 13 L1 models ...\n",
 246 |       "Fitting model: KNeighborsUnif ...\n",
 247 |       "\t0.653\t = Validation score   (accuracy)\n",
 248 |       "\t2.99s\t = Training   runtime\n",
 249 |       "\t0.35s\t = Validation runtime\n",
 250 |       "Fitting model: KNeighborsDist ...\n",
 251 |       "\t0.663\t = Validation score   (accuracy)\n",
 252 |       "\t0.05s\t = Training   runtime\n",
 253 |       "\t0.15s\t = Validation runtime\n",
 254 |       "Fitting model: LightGBMXT ...\n",
 255 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 256 |       "  memory_usage[column] = (\n",
 257 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: \n",
 258 |       "Dask dataframe query planning is disabled because dask-expr is not installed.\n",
 259 |       "\n",
 260 |       "You can install it with `pip install dask[dataframe]` or `conda install dask`.\n",
 261 |       "This will raise in a future version.\n",
 262 |       "\n",
 263 |       "  warnings.warn(msg, FutureWarning)\n",
 264 |       "\t0.848\t = Validation score   (accuracy)\n",
 265 |       "\t4.29s\t = Training   runtime\n",
 266 |       "\t0.06s\t = Validation runtime\n",
 267 |       "Fitting model: LightGBM ...\n",
 268 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 269 |       "  memory_usage[column] = (\n",
 270 |       "\t0.853\t = Validation score   (accuracy)\n",
 271 |       "\t3.05s\t = Training   runtime\n",
 272 |       "\t0.03s\t = Validation runtime\n",
 273 |       "Fitting model: RandomForestGini ...\n",
 274 |       "\t0.853\t = Validation score   (accuracy)\n",
 275 |       "\t5.8s\t = Training   runtime\n",
 276 |       "\t0.21s\t = Validation runtime\n",
 277 |       "Fitting model: RandomForestEntr ...\n",
 278 |       "\t0.85\t = Validation score   (accuracy)\n",
 279 |       "\t5.78s\t = Training   runtime\n",
 280 |       "\t0.22s\t = Validation runtime\n",
 281 |       "Fitting model: CatBoost ...\n",
 282 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 283 |       "  memory_usage[column] = (\n",
 284 |       "\t0.854\t = Validation score   (accuracy)\n",
 285 |       "\t17.46s\t = Training   runtime\n",
 286 |       "\t0.06s\t = Validation runtime\n",
 287 |       "Fitting model: ExtraTreesGini ...\n",
 288 |       "\t0.836\t = Validation score   (accuracy)\n",
 289 |       "\t5.81s\t = Training   runtime\n",
 290 |       "\t0.2s\t = Validation runtime\n",
 291 |       "Fitting model: ExtraTreesEntr ...\n",
 292 |       "\t0.844\t = Validation score   (accuracy)\n",
 293 |       "\t6.1s\t = Training   runtime\n",
 294 |       "\t0.19s\t = Validation runtime\n",
 295 |       "Fitting model: NeuralNetFastAI ...\n",
 296 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 297 |       "  memory_usage[column] = (\n",
 298 |       "No improvement since epoch 7: early stopping\n",
 299 |       "\t0.819\t = Validation score   (accuracy)\n",
 300 |       "\t34.9s\t = Training   runtime\n",
 301 |       "\t0.1s\t = Validation runtime\n",
 302 |       "Fitting model: XGBoost ...\n",
 303 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 304 |       "  memory_usage[column] = (\n",
 305 |       "\t0.854\t = Validation score   (accuracy)\n",
 306 |       "\t8.64s\t = Training   runtime\n",
 307 |       "\t0.02s\t = Validation runtime\n",
 308 |       "Fitting model: NeuralNetTorch ...\n",
 309 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 310 |       "  memory_usage[column] = (\n",
 311 |       "\t0.853\t = Validation score   (accuracy)\n",
 312 |       "\t45.23s\t = Training   runtime\n",
 313 |       "\t0.04s\t = Validation runtime\n",
 314 |       "Fitting model: LightGBMLarge ...\n",
 315 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/common/utils/pandas_utils.py:50: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '18637.401015228428' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.\n",
 316 |       "  memory_usage[column] = (\n",
 317 |       "\t0.847\t = Validation score   (accuracy)\n",
 318 |       "\t4.97s\t = Training   runtime\n",
 319 |       "\t0.03s\t = Validation runtime\n",
 320 |       "Fitting model: WeightedEnsemble_L2 ...\n",
 321 |       "\t0.871\t = Validation score   (accuracy)\n",
 322 |       "\t2.38s\t = Training   runtime\n",
 323 |       "\t0.0s\t = Validation runtime\n",
 324 |       "AutoGluon training complete, total runtime = 161.57s ... Best model: \"WeightedEnsemble_L2\"\n",
 325 |       "TabularPredictor saved. To load, use: predictor = TabularPredictor.load(\"AutogluonModels/ag-20241001_183703/\")\n"
 326 |      ]
 327 |     }
 328 |    ],
 329 |    "source": [
 330 |     "from autogluon.tabular import TabularDataset, TabularPredictor\n",
 331 |     "\n",
 332 |     "k = 10000 # grab less data for a quick demo\n",
 333 |     "#k = train_data.shape[0] # grad the whole dataset\n",
 334 |     "\n",
 335 |     "predictor = TabularPredictor(label='Outcome Type').fit(train_data.head(k))"
 336 |    ]
 337 |   },
 338 |   {
 339 |    "cell_type": "markdown",
 340 |    "metadata": {},
 341 |    "source": [
 342 |     "We can also summarize what happened during fit."
 343 |    ]
 344 |   },
 345 |   {
 346 |    "cell_type": "code",
 347 |    "execution_count": 5,
 348 |    "metadata": {
 349 |     "tags": []
 350 |    },
 351 |    "outputs": [
 352 |     {
 353 |      "name": "stdout",
 354 |      "output_type": "stream",
 355 |      "text": [
 356 |       "*** Summary of fit() ***\n",
 357 |       "Estimated performance of each model:\n",
 358 |       "                  model  score_val  pred_time_val   fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order\n",
 359 |       "0   WeightedEnsemble_L2      0.871       0.625275  92.942297                0.004572           2.378834            2       True         14\n",
 360 |       "1               XGBoost      0.854       0.023012   8.635196                0.023012           8.635196            1       True         11\n",
 361 |       "2              CatBoost      0.854       0.064999  17.457963                0.064999          17.457963            1       True          7\n",
 362 |       "3              LightGBM      0.853       0.025760   3.052832                0.025760           3.052832            1       True          4\n",
 363 |       "4        NeuralNetTorch      0.853       0.039637  45.233796                0.039637          45.233796            1       True         12\n",
 364 |       "5      RandomForestGini      0.853       0.214309   5.797882                0.214309           5.797882            1       True          5\n",
 365 |       "6      RandomForestEntr      0.850       0.218737   5.784430                0.218737           5.784430            1       True          6\n",
 366 |       "7            LightGBMXT      0.848       0.061231   4.287037                0.061231           4.287037            1       True          3\n",
 367 |       "8         LightGBMLarge      0.847       0.030208   4.965655                0.030208           4.965655            1       True         13\n",
 368 |       "9        ExtraTreesEntr      0.844       0.191755   6.098757                0.191755           6.098757            1       True          9\n",
 369 |       "10       ExtraTreesGini      0.836       0.204940   5.812799                0.204940           5.812799            1       True          8\n",
 370 |       "11      NeuralNetFastAI      0.819       0.097966  34.902269                0.097966          34.902269            1       True         10\n",
 371 |       "12       KNeighborsDist      0.663       0.154423   0.054532                0.154423           0.054532            1       True          2\n",
 372 |       "13       KNeighborsUnif      0.653       0.347130   2.985137                0.347130           2.985137            1       True          1\n",
 373 |       "Number of models trained: 14\n",
 374 |       "Types of models trained:\n",
 375 |       "{'NNFastAiTabularModel', 'TabularNeuralNetTorchModel', 'WeightedEnsembleModel', 'RFModel', 'XGBoostModel', 'CatBoostModel', 'LGBModel', 'XTModel', 'KNNModel'}\n",
 376 |       "Bagging used: False \n",
 377 |       "Multi-layer stack-ensembling used: False \n",
 378 |       "Feature Metadata (Processed):\n",
 379 |       "(raw dtype, special dtypes):\n",
 380 |       "('category', [])                    :   8 | ['Sex upon Outcome', 'Name', 'Intake Type', 'Intake Condition', 'Pet Type', ...]\n",
 381 |       "('category', ['text_as_category'])  :   1 | ['Found Location']\n",
 382 |       "('int', [])                         :   2 | ['Age upon Intake Days', 'Age upon Outcome Days']\n",
 383 |       "('int', ['binned', 'text_special']) :  12 | ['Found Location.char_count', 'Found Location.word_count', 'Found Location.capital_ratio', 'Found Location.lower_ratio', 'Found Location.digit_ratio', ...]\n",
 384 |       "('int', ['text_ngram'])             : 176 | ['__nlp__.183', '__nlp__.183 and', '__nlp__.183 in', '__nlp__.1st', '__nlp__.290', ...]\n",
 385 |       "*** End of fit() summary ***\n"
 386 |      ]
 387 |     },
 388 |     {
 389 |      "name": "stderr",
 390 |      "output_type": "stream",
 391 |      "text": [
 392 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/core/utils/plots.py:169: UserWarning: AutoGluon summary plots cannot be created because bokeh is not installed. To see plots, please do: \"pip install bokeh==2.0.1\"\n",
 393 |       "  warnings.warn('AutoGluon summary plots cannot be created because bokeh is not installed. To see plots, please do: \"pip install bokeh==2.0.1\"')\n"
 394 |      ]
 395 |     },
 396 |     {
 397 |      "data": {
 398 |       "text/plain": [
 399 |        "{'model_types': {'KNeighborsUnif': 'KNNModel',\n",
 400 |        "  'KNeighborsDist': 'KNNModel',\n",
 401 |        "  'LightGBMXT': 'LGBModel',\n",
 402 |        "  'LightGBM': 'LGBModel',\n",
 403 |        "  'RandomForestGini': 'RFModel',\n",
 404 |        "  'RandomForestEntr': 'RFModel',\n",
 405 |        "  'CatBoost': 'CatBoostModel',\n",
 406 |        "  'ExtraTreesGini': 'XTModel',\n",
 407 |        "  'ExtraTreesEntr': 'XTModel',\n",
 408 |        "  'NeuralNetFastAI': 'NNFastAiTabularModel',\n",
 409 |        "  'XGBoost': 'XGBoostModel',\n",
 410 |        "  'NeuralNetTorch': 'TabularNeuralNetTorchModel',\n",
 411 |        "  'LightGBMLarge': 'LGBModel',\n",
 412 |        "  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},\n",
 413 |        " 'model_performance': {'KNeighborsUnif': 0.653,\n",
 414 |        "  'KNeighborsDist': 0.663,\n",
 415 |        "  'LightGBMXT': 0.848,\n",
 416 |        "  'LightGBM': 0.853,\n",
 417 |        "  'RandomForestGini': 0.853,\n",
 418 |        "  'RandomForestEntr': 0.85,\n",
 419 |        "  'CatBoost': 0.854,\n",
 420 |        "  'ExtraTreesGini': 0.836,\n",
 421 |        "  'ExtraTreesEntr': 0.844,\n",
 422 |        "  'NeuralNetFastAI': 0.819,\n",
 423 |        "  'XGBoost': 0.854,\n",
 424 |        "  'NeuralNetTorch': 0.853,\n",
 425 |        "  'LightGBMLarge': 0.847,\n",
 426 |        "  'WeightedEnsemble_L2': 0.871},\n",
 427 |        " 'model_best': 'WeightedEnsemble_L2',\n",
 428 |        " 'model_paths': {'KNeighborsUnif': 'AutogluonModels/ag-20241001_183703/models/KNeighborsUnif/',\n",
 429 |        "  'KNeighborsDist': 'AutogluonModels/ag-20241001_183703/models/KNeighborsDist/',\n",
 430 |        "  'LightGBMXT': 'AutogluonModels/ag-20241001_183703/models/LightGBMXT/',\n",
 431 |        "  'LightGBM': 'AutogluonModels/ag-20241001_183703/models/LightGBM/',\n",
 432 |        "  'RandomForestGini': 'AutogluonModels/ag-20241001_183703/models/RandomForestGini/',\n",
 433 |        "  'RandomForestEntr': 'AutogluonModels/ag-20241001_183703/models/RandomForestEntr/',\n",
 434 |        "  'CatBoost': 'AutogluonModels/ag-20241001_183703/models/CatBoost/',\n",
 435 |        "  'ExtraTreesGini': 'AutogluonModels/ag-20241001_183703/models/ExtraTreesGini/',\n",
 436 |        "  'ExtraTreesEntr': 'AutogluonModels/ag-20241001_183703/models/ExtraTreesEntr/',\n",
 437 |        "  'NeuralNetFastAI': 'AutogluonModels/ag-20241001_183703/models/NeuralNetFastAI/',\n",
 438 |        "  'XGBoost': 'AutogluonModels/ag-20241001_183703/models/XGBoost/',\n",
 439 |        "  'NeuralNetTorch': 'AutogluonModels/ag-20241001_183703/models/NeuralNetTorch/',\n",
 440 |        "  'LightGBMLarge': 'AutogluonModels/ag-20241001_183703/models/LightGBMLarge/',\n",
 441 |        "  'WeightedEnsemble_L2': 'AutogluonModels/ag-20241001_183703/models/WeightedEnsemble_L2/'},\n",
 442 |        " 'model_fit_times': {'KNeighborsUnif': 2.9851365089416504,\n",
 443 |        "  'KNeighborsDist': 0.05453157424926758,\n",
 444 |        "  'LightGBMXT': 4.287037134170532,\n",
 445 |        "  'LightGBM': 3.0528316497802734,\n",
 446 |        "  'RandomForestGini': 5.797881841659546,\n",
 447 |        "  'RandomForestEntr': 5.784429550170898,\n",
 448 |        "  'CatBoost': 17.45796298980713,\n",
 449 |        "  'ExtraTreesGini': 5.812798976898193,\n",
 450 |        "  'ExtraTreesEntr': 6.098757028579712,\n",
 451 |        "  'NeuralNetFastAI': 34.90226936340332,\n",
 452 |        "  'XGBoost': 8.6351957321167,\n",
 453 |        "  'NeuralNetTorch': 45.23379588127136,\n",
 454 |        "  'LightGBMLarge': 4.965655326843262,\n",
 455 |        "  'WeightedEnsemble_L2': 2.3788342475891113},\n",
 456 |        " 'model_pred_times': {'KNeighborsUnif': 0.34712982177734375,\n",
 457 |        "  'KNeighborsDist': 0.15442276000976562,\n",
 458 |        "  'LightGBMXT': 0.06123089790344238,\n",
 459 |        "  'LightGBM': 0.02575993537902832,\n",
 460 |        "  'RandomForestGini': 0.2143092155456543,\n",
 461 |        "  'RandomForestEntr': 0.21873688697814941,\n",
 462 |        "  'CatBoost': 0.06499910354614258,\n",
 463 |        "  'ExtraTreesGini': 0.204939603805542,\n",
 464 |        "  'ExtraTreesEntr': 0.1917552947998047,\n",
 465 |        "  'NeuralNetFastAI': 0.09796595573425293,\n",
 466 |        "  'XGBoost': 0.023012399673461914,\n",
 467 |        "  'NeuralNetTorch': 0.03963661193847656,\n",
 468 |        "  'LightGBMLarge': 0.030208110809326172,\n",
 469 |        "  'WeightedEnsemble_L2': 0.0045719146728515625},\n",
 470 |        " 'num_bag_folds': 0,\n",
 471 |        " 'max_stack_level': 2,\n",
 472 |        " 'num_classes': 2,\n",
 473 |        " 'model_hyperparams': {'KNeighborsUnif': {'weights': 'uniform'},\n",
 474 |        "  'KNeighborsDist': {'weights': 'distance'},\n",
 475 |        "  'LightGBMXT': {'learning_rate': 0.05, 'extra_trees': True},\n",
 476 |        "  'LightGBM': {'learning_rate': 0.05},\n",
 477 |        "  'RandomForestGini': {'n_estimators': 300,\n",
 478 |        "   'max_leaf_nodes': 15000,\n",
 479 |        "   'n_jobs': -1,\n",
 480 |        "   'random_state': 0,\n",
 481 |        "   'bootstrap': True,\n",
 482 |        "   'criterion': 'gini'},\n",
 483 |        "  'RandomForestEntr': {'n_estimators': 300,\n",
 484 |        "   'max_leaf_nodes': 15000,\n",
 485 |        "   'n_jobs': -1,\n",
 486 |        "   'random_state': 0,\n",
 487 |        "   'bootstrap': True,\n",
 488 |        "   'criterion': 'entropy'},\n",
 489 |        "  'CatBoost': {'iterations': 10000,\n",
 490 |        "   'learning_rate': 0.05,\n",
 491 |        "   'random_seed': 0,\n",
 492 |        "   'allow_writing_files': False,\n",
 493 |        "   'eval_metric': 'Accuracy'},\n",
 494 |        "  'ExtraTreesGini': {'n_estimators': 300,\n",
 495 |        "   'max_leaf_nodes': 15000,\n",
 496 |        "   'n_jobs': -1,\n",
 497 |        "   'random_state': 0,\n",
 498 |        "   'bootstrap': True,\n",
 499 |        "   'criterion': 'gini'},\n",
 500 |        "  'ExtraTreesEntr': {'n_estimators': 300,\n",
 501 |        "   'max_leaf_nodes': 15000,\n",
 502 |        "   'n_jobs': -1,\n",
 503 |        "   'random_state': 0,\n",
 504 |        "   'bootstrap': True,\n",
 505 |        "   'criterion': 'entropy'},\n",
 506 |        "  'NeuralNetFastAI': {'layers': None,\n",
 507 |        "   'emb_drop': 0.1,\n",
 508 |        "   'ps': 0.1,\n",
 509 |        "   'bs': 'auto',\n",
 510 |        "   'lr': 0.01,\n",
 511 |        "   'epochs': 'auto',\n",
 512 |        "   'early.stopping.min_delta': 0.0001,\n",
 513 |        "   'early.stopping.patience': 20,\n",
 514 |        "   'smoothing': 0.0},\n",
 515 |        "  'XGBoost': {'n_estimators': 10000,\n",
 516 |        "   'learning_rate': 0.1,\n",
 517 |        "   'n_jobs': -1,\n",
 518 |        "   'proc.max_category_levels': 100,\n",
 519 |        "   'objective': 'binary:logistic',\n",
 520 |        "   'booster': 'gbtree'},\n",
 521 |        "  'NeuralNetTorch': {'num_epochs': 500,\n",
 522 |        "   'epochs_wo_improve': 20,\n",
 523 |        "   'activation': 'relu',\n",
 524 |        "   'embedding_size_factor': 1.0,\n",
 525 |        "   'embed_exponent': 0.56,\n",
 526 |        "   'max_embedding_dim': 100,\n",
 527 |        "   'y_range': None,\n",
 528 |        "   'y_range_extend': 0.05,\n",
 529 |        "   'dropout_prob': 0.1,\n",
 530 |        "   'optimizer': 'adam',\n",
 531 |        "   'learning_rate': 0.0003,\n",
 532 |        "   'weight_decay': 1e-06,\n",
 533 |        "   'proc.embed_min_categories': 4,\n",
 534 |        "   'proc.impute_strategy': 'median',\n",
 535 |        "   'proc.max_category_levels': 100,\n",
 536 |        "   'proc.skew_threshold': 0.99,\n",
 537 |        "   'use_ngram_features': False,\n",
 538 |        "   'num_layers': 4,\n",
 539 |        "   'hidden_size': 128,\n",
 540 |        "   'max_batch_size': 512,\n",
 541 |        "   'use_batchnorm': False,\n",
 542 |        "   'loss_function': 'auto'},\n",
 543 |        "  'LightGBMLarge': {'learning_rate': 0.03,\n",
 544 |        "   'num_leaves': 128,\n",
 545 |        "   'feature_fraction': 0.9,\n",
 546 |        "   'min_data_in_leaf': 5},\n",
 547 |        "  'WeightedEnsemble_L2': {'use_orig_features': False,\n",
 548 |        "   'max_base_models': 25,\n",
 549 |        "   'max_base_models_per_type': 5,\n",
 550 |        "   'save_bag_folds': True}},\n",
 551 |        " 'leaderboard':                   model  score_val  pred_time_val   fit_time  \\\n",
 552 |        " 0   WeightedEnsemble_L2      0.871       0.625275  92.942297   \n",
 553 |        " 1               XGBoost      0.854       0.023012   8.635196   \n",
 554 |        " 2              CatBoost      0.854       0.064999  17.457963   \n",
 555 |        " 3              LightGBM      0.853       0.025760   3.052832   \n",
 556 |        " 4        NeuralNetTorch      0.853       0.039637  45.233796   \n",
 557 |        " 5      RandomForestGini      0.853       0.214309   5.797882   \n",
 558 |        " 6      RandomForestEntr      0.850       0.218737   5.784430   \n",
 559 |        " 7            LightGBMXT      0.848       0.061231   4.287037   \n",
 560 |        " 8         LightGBMLarge      0.847       0.030208   4.965655   \n",
 561 |        " 9        ExtraTreesEntr      0.844       0.191755   6.098757   \n",
 562 |        " 10       ExtraTreesGini      0.836       0.204940   5.812799   \n",
 563 |        " 11      NeuralNetFastAI      0.819       0.097966  34.902269   \n",
 564 |        " 12       KNeighborsDist      0.663       0.154423   0.054532   \n",
 565 |        " 13       KNeighborsUnif      0.653       0.347130   2.985137   \n",
 566 |        " \n",
 567 |        "     pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  \\\n",
 568 |        " 0                 0.004572           2.378834            2       True   \n",
 569 |        " 1                 0.023012           8.635196            1       True   \n",
 570 |        " 2                 0.064999          17.457963            1       True   \n",
 571 |        " 3                 0.025760           3.052832            1       True   \n",
 572 |        " 4                 0.039637          45.233796            1       True   \n",
 573 |        " 5                 0.214309           5.797882            1       True   \n",
 574 |        " 6                 0.218737           5.784430            1       True   \n",
 575 |        " 7                 0.061231           4.287037            1       True   \n",
 576 |        " 8                 0.030208           4.965655            1       True   \n",
 577 |        " 9                 0.191755           6.098757            1       True   \n",
 578 |        " 10                0.204940           5.812799            1       True   \n",
 579 |        " 11                0.097966          34.902269            1       True   \n",
 580 |        " 12                0.154423           0.054532            1       True   \n",
 581 |        " 13                0.347130           2.985137            1       True   \n",
 582 |        " \n",
 583 |        "     fit_order  \n",
 584 |        " 0          14  \n",
 585 |        " 1          11  \n",
 586 |        " 2           7  \n",
 587 |        " 3           4  \n",
 588 |        " 4          12  \n",
 589 |        " 5           5  \n",
 590 |        " 6           6  \n",
 591 |        " 7           3  \n",
 592 |        " 8          13  \n",
 593 |        " 9           9  \n",
 594 |        " 10          8  \n",
 595 |        " 11         10  \n",
 596 |        " 12          2  \n",
 597 |        " 13          1  }"
 598 |       ]
 599 |      },
 600 |      "execution_count": 5,
 601 |      "metadata": {},
 602 |      "output_type": "execute_result"
 603 |     }
 604 |    ],
 605 |    "source": [
 606 |     "predictor.fit_summary()"
 607 |    ]
 608 |   },
 609 |   {
 610 |    "cell_type": "markdown",
 611 |    "metadata": {},
 612 |    "source": [
 613 |     "## 4. <a name=\"4\">Model evaluation</a>\n",
 614 |     "(<a href=\"#0\">Go to top</a>)\n",
 615 |     "\n",
 616 |     "Next, we load a separate test data to demonstrate how to make predictions on new examples at inference time."
 617 |    ]
 618 |   },
 619 |   {
 620 |    "cell_type": "code",
 621 |    "execution_count": 6,
 622 |    "metadata": {
 623 |     "tags": []
 624 |    },
 625 |    "outputs": [
 626 |     {
 627 |      "name": "stderr",
 628 |      "output_type": "stream",
 629 |      "text": [
 630 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/features/generators/fillna.py:58: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.\n",
 631 |       "  X.fillna(self._fillna_feature_map, inplace=True, downcast=False)\n",
 632 |       "Evaluation: accuracy on test data: 0.8593570007330611\n",
 633 |       "Evaluations on test data:\n",
 634 |       "{\n",
 635 |       "    \"accuracy\": 0.8593570007330611,\n",
 636 |       "    \"balanced_accuracy\": 0.846874357214327,\n",
 637 |       "    \"mcc\": 0.7158381734029869,\n",
 638 |       "    \"f1\": 0.8834504903237004,\n",
 639 |       "    \"precision\": 0.8336062888961677,\n",
 640 |       "    \"recall\": 0.9396344840317519\n",
 641 |       "}\n"
 642 |      ]
 643 |     },
 644 |     {
 645 |      "data": {
 646 |       "text/plain": [
 647 |        "{'accuracy': 0.8593570007330611,\n",
 648 |        " 'balanced_accuracy': 0.846874357214327,\n",
 649 |        " 'mcc': 0.7158381734029869,\n",
 650 |        " 'f1': 0.8834504903237004,\n",
 651 |        " 'precision': 0.8336062888961677,\n",
 652 |        " 'recall': 0.9396344840317519}"
 653 |       ]
 654 |      },
 655 |      "execution_count": 6,
 656 |      "metadata": {},
 657 |      "output_type": "execute_result"
 658 |     }
 659 |    ],
 660 |    "source": [
 661 |     "# First predictions\n",
 662 |     "y_pred = predictor.predict(test_data.head(k))\n",
 663 |     "\n",
 664 |     "# Then, evaluations\n",
 665 |     "predictor.evaluate_predictions(y_true=test_data['Outcome Type'],\n",
 666 |     "                               y_pred=y_pred,\n",
 667 |     "                               auxiliary_metrics=True)"
 668 |    ]
 669 |   },
 670 |   {
 671 |    "cell_type": "markdown",
 672 |    "metadata": {},
 673 |    "source": [
 674 |     "We can see the performance of each individual trained model on the test data:"
 675 |    ]
 676 |   },
 677 |   {
 678 |    "cell_type": "code",
 679 |    "execution_count": 7,
 680 |    "metadata": {
 681 |     "tags": []
 682 |    },
 683 |    "outputs": [
 684 |     {
 685 |      "name": "stderr",
 686 |      "output_type": "stream",
 687 |      "text": [
 688 |       "/opt/conda/envs/sagemaker-distribution/lib/python3.10/site-packages/autogluon/features/generators/fillna.py:58: FutureWarning: The 'downcast' keyword in fillna is deprecated and will be removed in a future version. Use res.infer_objects(copy=False) to infer non-object dtype, or pd.to_numeric with the 'downcast' keyword to downcast numeric results.\n",
 689 |       "  X.fillna(self._fillna_feature_map, inplace=True, downcast=False)\n"
 690 |      ]
 691 |     },
 692 |     {
 693 |      "data": {
 694 |       "text/html": [
 695 |        "<div>\n",
 696 |        "<style scoped>\n",
 697 |        "    .dataframe tbody tr th:only-of-type {\n",
 698 |        "        vertical-align: middle;\n",
 699 |        "    }\n",
 700 |        "\n",
 701 |        "    .dataframe tbody tr th {\n",
 702 |        "        vertical-align: top;\n",
 703 |        "    }\n",
 704 |        "\n",
 705 |        "    .dataframe thead th {\n",
 706 |        "        text-align: right;\n",
 707 |        "    }\n",
 708 |        "</style>\n",
 709 |        "<table border=\"1\" class=\"dataframe\">\n",
 710 |        "  <thead>\n",
 711 |        "    <tr style=\"text-align: right;\">\n",
 712 |        "      <th></th>\n",
 713 |        "      <th>model</th>\n",
 714 |        "      <th>score_test</th>\n",
 715 |        "      <th>score_val</th>\n",
 716 |        "      <th>pred_time_test</th>\n",
 717 |        "      <th>pred_time_val</th>\n",
 718 |        "      <th>fit_time</th>\n",
 719 |        "      <th>pred_time_test_marginal</th>\n",
 720 |        "      <th>pred_time_val_marginal</th>\n",
 721 |        "      <th>fit_time_marginal</th>\n",
 722 |        "      <th>stack_level</th>\n",
 723 |        "      <th>can_infer</th>\n",
 724 |        "      <th>fit_order</th>\n",
 725 |        "    </tr>\n",
 726 |        "  </thead>\n",
 727 |        "  <tbody>\n",
 728 |        "    <tr>\n",
 729 |        "      <th>0</th>\n",
 730 |        "      <td>WeightedEnsemble_L2</td>\n",
 731 |        "      <td>0.859357</td>\n",
 732 |        "      <td>0.871</td>\n",
 733 |        "      <td>3.335518</td>\n",
 734 |        "      <td>0.625275</td>\n",
 735 |        "      <td>92.942297</td>\n",
 736 |        "      <td>0.006877</td>\n",
 737 |        "      <td>0.004572</td>\n",
 738 |        "      <td>2.378834</td>\n",
 739 |        "      <td>2</td>\n",
 740 |        "      <td>True</td>\n",
 741 |        "      <td>14</td>\n",
 742 |        "    </tr>\n",
 743 |        "    <tr>\n",
 744 |        "      <th>1</th>\n",
 745 |        "      <td>RandomForestEntr</td>\n",
 746 |        "      <td>0.855482</td>\n",
 747 |        "      <td>0.850</td>\n",
 748 |        "      <td>1.125673</td>\n",
 749 |        "      <td>0.218737</td>\n",
 750 |        "      <td>5.784430</td>\n",
 751 |        "      <td>1.125673</td>\n",
 752 |        "      <td>0.218737</td>\n",
 753 |        "      <td>5.784430</td>\n",
 754 |        "      <td>1</td>\n",
 755 |        "      <td>True</td>\n",
 756 |        "      <td>6</td>\n",
 757 |        "    </tr>\n",
 758 |        "    <tr>\n",
 759 |        "      <th>2</th>\n",
 760 |        "      <td>CatBoost</td>\n",
 761 |        "      <td>0.854016</td>\n",
 762 |        "      <td>0.854</td>\n",
 763 |        "      <td>0.107882</td>\n",
 764 |        "      <td>0.064999</td>\n",
 765 |        "      <td>17.457963</td>\n",
 766 |        "      <td>0.107882</td>\n",
 767 |        "      <td>0.064999</td>\n",
 768 |        "      <td>17.457963</td>\n",
 769 |        "      <td>1</td>\n",
 770 |        "      <td>True</td>\n",
 771 |        "      <td>7</td>\n",
 772 |        "    </tr>\n",
 773 |        "    <tr>\n",
 774 |        "      <th>3</th>\n",
 775 |        "      <td>RandomForestGini</td>\n",
 776 |        "      <td>0.854016</td>\n",
 777 |        "      <td>0.853</td>\n",
 778 |        "      <td>0.954611</td>\n",
 779 |        "      <td>0.214309</td>\n",
 780 |        "      <td>5.797882</td>\n",
 781 |        "      <td>0.954611</td>\n",
 782 |        "      <td>0.214309</td>\n",
 783 |        "      <td>5.797882</td>\n",
 784 |        "      <td>1</td>\n",
 785 |        "      <td>True</td>\n",
 786 |        "      <td>5</td>\n",
 787 |        "    </tr>\n",
 788 |        "    <tr>\n",
 789 |        "      <th>4</th>\n",
 790 |        "      <td>LightGBM</td>\n",
 791 |        "      <td>0.850141</td>\n",
 792 |        "      <td>0.853</td>\n",
 793 |        "      <td>0.200560</td>\n",
 794 |        "      <td>0.025760</td>\n",
 795 |        "      <td>3.052832</td>\n",
 796 |        "      <td>0.200560</td>\n",
 797 |        "      <td>0.025760</td>\n",
 798 |        "      <td>3.052832</td>\n",
 799 |        "      <td>1</td>\n",
 800 |        "      <td>True</td>\n",
 801 |        "      <td>4</td>\n",
 802 |        "    </tr>\n",
 803 |        "    <tr>\n",
 804 |        "      <th>5</th>\n",
 805 |        "      <td>XGBoost</td>\n",
 806 |        "      <td>0.849618</td>\n",
 807 |        "      <td>0.854</td>\n",
 808 |        "      <td>0.175631</td>\n",
 809 |        "      <td>0.023012</td>\n",
 810 |        "      <td>8.635196</td>\n",
 811 |        "      <td>0.175631</td>\n",
 812 |        "      <td>0.023012</td>\n",
 813 |        "      <td>8.635196</td>\n",
 814 |        "      <td>1</td>\n",
 815 |        "      <td>True</td>\n",
 816 |        "      <td>11</td>\n",
 817 |        "    </tr>\n",
 818 |        "    <tr>\n",
 819 |        "      <th>6</th>\n",
 820 |        "      <td>NeuralNetTorch</td>\n",
 821 |        "      <td>0.846895</td>\n",
 822 |        "      <td>0.853</td>\n",
 823 |        "      <td>0.158293</td>\n",
 824 |        "      <td>0.039637</td>\n",
 825 |        "      <td>45.233796</td>\n",
 826 |        "      <td>0.158293</td>\n",
 827 |        "      <td>0.039637</td>\n",
 828 |        "      <td>45.233796</td>\n",
 829 |        "      <td>1</td>\n",
 830 |        "      <td>True</td>\n",
 831 |        "      <td>12</td>\n",
 832 |        "    </tr>\n",
 833 |        "    <tr>\n",
 834 |        "      <th>7</th>\n",
 835 |        "      <td>LightGBMLarge</td>\n",
 836 |        "      <td>0.846686</td>\n",
 837 |        "      <td>0.847</td>\n",
 838 |        "      <td>0.137499</td>\n",
 839 |        "      <td>0.030208</td>\n",
 840 |        "      <td>4.965655</td>\n",
 841 |        "      <td>0.137499</td>\n",
 842 |        "      <td>0.030208</td>\n",
 843 |        "      <td>4.965655</td>\n",
 844 |        "      <td>1</td>\n",
 845 |        "      <td>True</td>\n",
 846 |        "      <td>13</td>\n",
 847 |        "    </tr>\n",
 848 |        "    <tr>\n",
 849 |        "      <th>8</th>\n",
 850 |        "      <td>LightGBMXT</td>\n",
 851 |        "      <td>0.846686</td>\n",
 852 |        "      <td>0.848</td>\n",
 853 |        "      <td>0.347909</td>\n",
 854 |        "      <td>0.061231</td>\n",
 855 |        "      <td>4.287037</td>\n",
 856 |        "      <td>0.347909</td>\n",
 857 |        "      <td>0.061231</td>\n",
 858 |        "      <td>4.287037</td>\n",
 859 |        "      <td>1</td>\n",
 860 |        "      <td>True</td>\n",
 861 |        "      <td>3</td>\n",
 862 |        "    </tr>\n",
 863 |        "    <tr>\n",
 864 |        "      <th>9</th>\n",
 865 |        "      <td>ExtraTreesGini</td>\n",
 866 |        "      <td>0.842811</td>\n",
 867 |        "      <td>0.836</td>\n",
 868 |        "      <td>1.077193</td>\n",
 869 |        "      <td>0.204940</td>\n",
 870 |        "      <td>5.812799</td>\n",
 871 |        "      <td>1.077193</td>\n",
 872 |        "      <td>0.204940</td>\n",
 873 |        "      <td>5.812799</td>\n",
 874 |        "      <td>1</td>\n",
 875 |        "      <td>True</td>\n",
 876 |        "      <td>8</td>\n",
 877 |        "    </tr>\n",
 878 |        "    <tr>\n",
 879 |        "      <th>10</th>\n",
 880 |        "      <td>ExtraTreesEntr</td>\n",
 881 |        "      <td>0.841030</td>\n",
 882 |        "      <td>0.844</td>\n",
 883 |        "      <td>1.383755</td>\n",
 884 |        "      <td>0.191755</td>\n",
 885 |        "      <td>6.098757</td>\n",
 886 |        "      <td>1.383755</td>\n",
 887 |        "      <td>0.191755</td>\n",
 888 |        "      <td>6.098757</td>\n",
 889 |        "      <td>1</td>\n",
 890 |        "      <td>True</td>\n",
 891 |        "      <td>9</td>\n",
 892 |        "    </tr>\n",
 893 |        "    <tr>\n",
 894 |        "      <th>11</th>\n",
 895 |        "      <td>NeuralNetFastAI</td>\n",
 896 |        "      <td>0.827416</td>\n",
 897 |        "      <td>0.819</td>\n",
 898 |        "      <td>0.574561</td>\n",
 899 |        "      <td>0.097966</td>\n",
 900 |        "      <td>34.902269</td>\n",
 901 |        "      <td>0.574561</td>\n",
 902 |        "      <td>0.097966</td>\n",
 903 |        "      <td>34.902269</td>\n",
 904 |        "      <td>1</td>\n",
 905 |        "      <td>True</td>\n",
 906 |        "      <td>10</td>\n",
 907 |        "    </tr>\n",
 908 |        "    <tr>\n",
 909 |        "      <th>12</th>\n",
 910 |        "      <td>KNeighborsDist</td>\n",
 911 |        "      <td>0.651168</td>\n",
 912 |        "      <td>0.663</td>\n",
 913 |        "      <td>1.443214</td>\n",
 914 |        "      <td>0.154423</td>\n",
 915 |        "      <td>0.054532</td>\n",
 916 |        "      <td>1.443214</td>\n",
 917 |        "      <td>0.154423</td>\n",
 918 |        "      <td>0.054532</td>\n",
 919 |        "      <td>1</td>\n",
 920 |        "      <td>True</td>\n",
 921 |        "      <td>2</td>\n",
 922 |        "    </tr>\n",
 923 |        "    <tr>\n",
 924 |        "      <th>13</th>\n",
 925 |        "      <td>KNeighborsUnif</td>\n",
 926 |        "      <td>0.648654</td>\n",
 927 |        "      <td>0.653</td>\n",
 928 |        "      <td>1.576504</td>\n",
 929 |        "      <td>0.347130</td>\n",
 930 |        "      <td>2.985137</td>\n",
 931 |        "      <td>1.576504</td>\n",
 932 |        "      <td>0.347130</td>\n",
 933 |        "      <td>2.985137</td>\n",
 934 |        "      <td>1</td>\n",
 935 |        "      <td>True</td>\n",
 936 |        "      <td>1</td>\n",
 937 |        "    </tr>\n",
 938 |        "  </tbody>\n",
 939 |        "</table>\n",
 940 |        "</div>"
 941 |       ],
 942 |       "text/plain": [
 943 |        "                  model  score_test  score_val  pred_time_test  pred_time_val  \\\n",
 944 |        "0   WeightedEnsemble_L2    0.859357      0.871        3.335518       0.625275   \n",
 945 |        "1      RandomForestEntr    0.855482      0.850        1.125673       0.218737   \n",
 946 |        "2              CatBoost    0.854016      0.854        0.107882       0.064999   \n",
 947 |        "3      RandomForestGini    0.854016      0.853        0.954611       0.214309   \n",
 948 |        "4              LightGBM    0.850141      0.853        0.200560       0.025760   \n",
 949 |        "5               XGBoost    0.849618      0.854        0.175631       0.023012   \n",
 950 |        "6        NeuralNetTorch    0.846895      0.853        0.158293       0.039637   \n",
 951 |        "7         LightGBMLarge    0.846686      0.847        0.137499       0.030208   \n",
 952 |        "8            LightGBMXT    0.846686      0.848        0.347909       0.061231   \n",
 953 |        "9        ExtraTreesGini    0.842811      0.836        1.077193       0.204940   \n",
 954 |        "10       ExtraTreesEntr    0.841030      0.844        1.383755       0.191755   \n",
 955 |        "11      NeuralNetFastAI    0.827416      0.819        0.574561       0.097966   \n",
 956 |        "12       KNeighborsDist    0.651168      0.663        1.443214       0.154423   \n",
 957 |        "13       KNeighborsUnif    0.648654      0.653        1.576504       0.347130   \n",
 958 |        "\n",
 959 |        "     fit_time  pred_time_test_marginal  pred_time_val_marginal  \\\n",
 960 |        "0   92.942297                 0.006877                0.004572   \n",
 961 |        "1    5.784430                 1.125673                0.218737   \n",
 962 |        "2   17.457963                 0.107882                0.064999   \n",
 963 |        "3    5.797882                 0.954611                0.214309   \n",
 964 |        "4    3.052832                 0.200560                0.025760   \n",
 965 |        "5    8.635196                 0.175631                0.023012   \n",
 966 |        "6   45.233796                 0.158293                0.039637   \n",
 967 |        "7    4.965655                 0.137499                0.030208   \n",
 968 |        "8    4.287037                 0.347909                0.061231   \n",
 969 |        "9    5.812799                 1.077193                0.204940   \n",
 970 |        "10   6.098757                 1.383755                0.191755   \n",
 971 |        "11  34.902269                 0.574561                0.097966   \n",
 972 |        "12   0.054532                 1.443214                0.154423   \n",
 973 |        "13   2.985137                 1.576504                0.347130   \n",
 974 |        "\n",
 975 |        "    fit_time_marginal  stack_level  can_infer  fit_order  \n",
 976 |        "0            2.378834            2       True         14  \n",
 977 |        "1            5.784430            1       True          6  \n",
 978 |        "2           17.457963            1       True          7  \n",
 979 |        "3            5.797882            1       True          5  \n",
 980 |        "4            3.052832            1       True          4  \n",
 981 |        "5            8.635196            1       True         11  \n",
 982 |        "6           45.233796            1       True         12  \n",
 983 |        "7            4.965655            1       True         13  \n",
 984 |        "8            4.287037            1       True          3  \n",
 985 |        "9            5.812799            1       True          8  \n",
 986 |        "10           6.098757            1       True          9  \n",
 987 |        "11          34.902269            1       True         10  \n",
 988 |        "12           0.054532            1       True          2  \n",
 989 |        "13           2.985137            1       True          1  "
 990 |       ]
 991 |      },
 992 |      "execution_count": 7,
 993 |      "metadata": {},
 994 |      "output_type": "execute_result"
 995 |     }
 996 |    ],
 997 |    "source": [
 998 |     "predictor.leaderboard(test_data, silent=True)"
 999 |    ]
1000 |   },
1001 |   {
1002 |    "cell_type": "markdown",
1003 |    "metadata": {},
1004 |    "source": [
1005 |     "## 5. <a name=\"5\">Clean up model artifacts</a>\n",
1006 |     "(<a href=\"#0\">Go to top</a>)"
1007 |    ]
1008 |   },
1009 |   {
1010 |    "cell_type": "code",
1011 |    "execution_count": 8,
1012 |    "metadata": {
1013 |     "tags": []
1014 |    },
1015 |    "outputs": [],
1016 |    "source": [
1017 |     "!rm -r AutogluonModels"
1018 |    ]
1019 |   }
1020 |  ],
1021 |  "metadata": {
1022 |   "kernelspec": {
1023 |    "display_name": "sagemaker-distribution:Python",
1024 |    "language": "python",
1025 |    "name": "conda-env-sagemaker-distribution-py"
1026 |   },
1027 |   "language_info": {
1028 |    "codemirror_mode": {
1029 |     "name": "ipython",
1030 |     "version": 3
1031 |    },
1032 |    "file_extension": ".py",
1033 |    "mimetype": "text/x-python",
1034 |    "name": "python",
1035 |    "nbconvert_exporter": "python",
1036 |    "pygments_lexer": "ipython3",
1037 |    "version": "3.10.14"
1038 |   }
1039 |  },
1040 |  "nbformat": 4,
1041 |  "nbformat_minor": 4
1042 | }
1043 | 


--------------------------------------------------------------------------------
/notebooks/MLA-TAB-DAY3-PYTORCH.ipynb:
--------------------------------------------------------------------------------
  1 | {
  2 |  "cells": [
  3 |   {
  4 |    "cell_type": "markdown",
  5 |    "metadata": {},
  6 |    "source": [
  7 |     "![MLU Logo](../data/MLU_Logo.png)"
  8 |    ]
  9 |   },
 10 |   {
 11 |    "cell_type": "markdown",
 12 |    "metadata": {},
 13 |    "source": [
 14 |     "# <a name=\"0\">Machine Learning Accelerator - Tabular Data - Lecture 3</a>\n",
 15 |     "\n",
 16 |     "\n",
 17 |     "## PyTorch\n",
 18 |     "\n",
 19 |     "1. <a href=\"#1\">PyTorch: Tensors and Autograd</a>\n",
 20 |     "2. <a href=\"#2\">PyTorch: Building a Neural Network</a>\n"
 21 |    ]
 22 |   },
 23 |   {
 24 |    "cell_type": "code",
 25 |    "execution_count": 1,
 26 |    "metadata": {
 27 |     "tags": []
 28 |    },
 29 |    "outputs": [],
 30 |    "source": [
 31 |     "%%capture\n",
 32 |     "%pip install -q -r ../requirements.txt"
 33 |    ]
 34 |   },
 35 |   {
 36 |    "cell_type": "markdown",
 37 |    "metadata": {},
 38 |    "source": [
 39 |     "## 1. <a name=\"1\">PyTorch: Tensors and Autograd</a>\n",
 40 |     "<a href=\"#0\">Go to top</a>\n",
 41 |     "\n",
 42 |     "This tutorial follows the concepts from the original MXNet tutorial but uses PyTorch instead.\n",
 43 |     "\n",
 44 |     "To get started, let's import PyTorch and NumPy.\n"
 45 |    ]
 46 |   },
 47 |   {
 48 |    "cell_type": "code",
 49 |    "execution_count": 2,
 50 |    "metadata": {},
 51 |    "outputs": [],
 52 |    "source": [
 53 |     "import torch"
 54 |    ]
 55 |   },
 56 |   {
 57 |    "cell_type": "markdown",
 58 |    "metadata": {},
 59 |    "source": [
 60 |     "Next, let's see how to create a 2D tensor (also called a matrix) with values from two sets of numbers: 1, 2, 3 and 4, 5, 6."
 61 |    ]
 62 |   },
 63 |   {
 64 |    "cell_type": "code",
 65 |    "execution_count": 3,
 66 |    "metadata": {},
 67 |    "outputs": [
 68 |     {
 69 |      "data": {
 70 |       "text/plain": [
 71 |        "tensor([[1, 2, 3],\n",
 72 |        "        [5, 6, 7]])"
 73 |       ]
 74 |      },
 75 |      "execution_count": 3,
 76 |      "metadata": {},
 77 |      "output_type": "execute_result"
 78 |     }
 79 |    ],
 80 |    "source": [
 81 |     "torch.tensor([[1,2,3],[5,6,7]])"
 82 |    ]
 83 |   },
 84 |   {
 85 |    "cell_type": "markdown",
 86 |    "metadata": {},
 87 |    "source": [
 88 |     "We can also create a very simple matrix with the same shape (2 rows by 3 columns), but fill it with 1s."
 89 |    ]
 90 |   },
 91 |   {
 92 |    "cell_type": "code",
 93 |    "execution_count": 4,
 94 |    "metadata": {},
 95 |    "outputs": [
 96 |     {
 97 |      "data": {
 98 |       "text/plain": [
 99 |        "tensor([[1., 1., 1.],\n",
100 |        "        [1., 1., 1.]])"
101 |       ]
102 |      },
103 |      "execution_count": 4,
104 |      "metadata": {},
105 |      "output_type": "execute_result"
106 |     }
107 |    ],
108 |    "source": [
109 |     "x = torch.ones((2,3))\n",
110 |     "x"
111 |    ]
112 |   },
113 |   {
114 |    "cell_type": "markdown",
115 |    "metadata": {},
116 |    "source": [
117 |     "Often we'll want to create tensors whose values are sampled randomly. For example, sampling values uniformly between -1 and 1."
118 |    ]
119 |   },
120 |   {
121 |    "cell_type": "code",
122 |    "execution_count": 5,
123 |    "metadata": {},
124 |    "outputs": [
125 |     {
126 |      "data": {
127 |       "text/plain": [
128 |        "tensor([[ 0.6748,  0.4310,  0.6130],\n",
129 |        "        [-0.9225, -0.8389, -0.4594]])"
130 |       ]
131 |      },
132 |      "execution_count": 5,
133 |      "metadata": {},
134 |      "output_type": "execute_result"
135 |     }
136 |    ],
137 |    "source": [
138 |     "y = torch.rand(2, 3) * 2 - 1  # Values between -1 and 1\n",
139 |     "y"
140 |    ]
141 |   },
142 |   {
143 |    "cell_type": "markdown",
144 |    "metadata": {},
145 |    "source": [
146 |     "You can also fill a tensor of a given shape with a given value, such as 2.0."
147 |    ]
148 |   },
149 |   {
150 |    "cell_type": "code",
151 |    "execution_count": 6,
152 |    "metadata": {},
153 |    "outputs": [
154 |     {
155 |      "data": {
156 |       "text/plain": [
157 |        "tensor([[2., 2., 2.],\n",
158 |        "        [2., 2., 2.]])"
159 |       ]
160 |      },
161 |      "execution_count": 6,
162 |      "metadata": {},
163 |      "output_type": "execute_result"
164 |     }
165 |    ],
166 |    "source": [
167 |     "x = torch.full((2,3), 2.0)\n",
168 |     "x"
169 |    ]
170 |   },
171 |   {
172 |    "cell_type": "markdown",
173 |    "metadata": {},
174 |    "source": [
175 |     "As with NumPy, the dimensions of each tensor are accessible by accessing the .shape attribute. We can also query its size and data type."
176 |    ]
177 |   },
178 |   {
179 |    "cell_type": "code",
180 |    "execution_count": 7,
181 |    "metadata": {},
182 |    "outputs": [
183 |     {
184 |      "data": {
185 |       "text/plain": [
186 |        "(torch.Size([2, 3]), 6, torch.float32)"
187 |       ]
188 |      },
189 |      "execution_count": 7,
190 |      "metadata": {},
191 |      "output_type": "execute_result"
192 |     }
193 |    ],
194 |    "source": [
195 |     "(x.shape, x.numel(), x.dtype)"
196 |    ]
197 |   },
198 |   {
199 |    "cell_type": "markdown",
200 |    "metadata": {},
201 |    "source": [
202 |     "### Operations\n",
203 |     "\n",
204 |     "PyTorch supports a large number of standard mathematical operations. Such as element-wise multiplication:"
205 |    ]
206 |   },
207 |   {
208 |    "cell_type": "code",
209 |    "execution_count": 8,
210 |    "metadata": {},
211 |    "outputs": [
212 |     {
213 |      "data": {
214 |       "text/plain": [
215 |        "tensor([[ 1.3496,  0.8619,  1.2259],\n",
216 |        "        [-1.8450, -1.6778, -0.9188]])"
217 |       ]
218 |      },
219 |      "execution_count": 8,
220 |      "metadata": {},
221 |      "output_type": "execute_result"
222 |     }
223 |    ],
224 |    "source": [
225 |     "x * y"
226 |    ]
227 |   },
228 |   {
229 |    "cell_type": "markdown",
230 |    "metadata": {},
231 |    "source": [
232 |     "Exponentiation:"
233 |    ]
234 |   },
235 |   {
236 |    "cell_type": "code",
237 |    "execution_count": 9,
238 |    "metadata": {},
239 |    "outputs": [
240 |     {
241 |      "data": {
242 |       "text/plain": [
243 |        "tensor([[1.9637, 1.5387, 1.8459],\n",
244 |        "        [0.3975, 0.4322, 0.6317]])"
245 |       ]
246 |      },
247 |      "execution_count": 9,
248 |      "metadata": {},
249 |      "output_type": "execute_result"
250 |     }
251 |    ],
252 |    "source": [
253 |     "y.exp()"
254 |    ]
255 |   },
256 |   {
257 |    "cell_type": "markdown",
258 |    "metadata": {},
259 |    "source": [
260 |     "And matrix multiplication:"
261 |    ]
262 |   },
263 |   {
264 |    "cell_type": "code",
265 |    "execution_count": 10,
266 |    "metadata": {},
267 |    "outputs": [
268 |     {
269 |      "data": {
270 |       "text/plain": [
271 |        "tensor([[ 3.4375, -4.4415],\n",
272 |        "        [ 3.4375, -4.4415]])"
273 |       ]
274 |      },
275 |      "execution_count": 10,
276 |      "metadata": {},
277 |      "output_type": "execute_result"
278 |     }
279 |    ],
280 |    "source": [
281 |     "torch.mm(x, y.t())"
282 |    ]
283 |   },
284 |   {
285 |    "cell_type": "markdown",
286 |    "metadata": {},
287 |    "source": [
288 |     "### Indexing\n",
289 |     "\n",
290 |     "PyTorch tensors support slicing in all the ways you might imagine accessing your data. Here's an example of reading a particular element, which returns a scalar tensor."
291 |    ]
292 |   },
293 |   {
294 |    "cell_type": "code",
295 |    "execution_count": 11,
296 |    "metadata": {},
297 |    "outputs": [
298 |     {
299 |      "data": {
300 |       "text/plain": [
301 |        "tensor(-0.4594)"
302 |       ]
303 |      },
304 |      "execution_count": 11,
305 |      "metadata": {},
306 |      "output_type": "execute_result"
307 |     }
308 |    ],
309 |    "source": [
310 |     "y[1,2]"
311 |    ]
312 |   },
313 |   {
314 |    "cell_type": "markdown",
315 |    "metadata": {},
316 |    "source": [
317 |     "Read the second and third columns from y."
318 |    ]
319 |   },
320 |   {
321 |    "cell_type": "code",
322 |    "execution_count": 12,
323 |    "metadata": {},
324 |    "outputs": [
325 |     {
326 |      "data": {
327 |       "text/plain": [
328 |        "tensor([[ 0.4310,  0.6130],\n",
329 |        "        [-0.8389, -0.4594]])"
330 |       ]
331 |      },
332 |      "execution_count": 12,
333 |      "metadata": {},
334 |      "output_type": "execute_result"
335 |     }
336 |    ],
337 |    "source": [
338 |     "y[:,1:3]"
339 |    ]
340 |   },
341 |   {
342 |    "cell_type": "markdown",
343 |    "metadata": {},
344 |    "source": [
345 |     "and writing to a specific element"
346 |    ]
347 |   },
348 |   {
349 |    "cell_type": "code",
350 |    "execution_count": 13,
351 |    "metadata": {},
352 |    "outputs": [
353 |     {
354 |      "data": {
355 |       "text/plain": [
356 |        "tensor([[ 0.6748,  2.0000,  2.0000],\n",
357 |        "        [-0.9225,  2.0000,  2.0000]])"
358 |       ]
359 |      },
360 |      "execution_count": 13,
361 |      "metadata": {},
362 |      "output_type": "execute_result"
363 |     }
364 |    ],
365 |    "source": [
366 |     "y[:,1:3] = 2\n",
367 |     "y"
368 |    ]
369 |   },
370 |   {
371 |    "cell_type": "markdown",
372 |    "metadata": {},
373 |    "source": [
374 |     "Multi-dimensional slicing is also supported."
375 |    ]
376 |   },
377 |   {
378 |    "cell_type": "code",
379 |    "execution_count": 14,
380 |    "metadata": {},
381 |    "outputs": [
382 |     {
383 |      "data": {
384 |       "text/plain": [
385 |        "tensor([[0.6748, 2.0000, 2.0000],\n",
386 |        "        [4.0000, 4.0000, 2.0000]])"
387 |       ]
388 |      },
389 |      "execution_count": 14,
390 |      "metadata": {},
391 |      "output_type": "execute_result"
392 |     }
393 |    ],
394 |    "source": [
395 |     "y[1:2,0:2] = 4\n",
396 |     "y"
397 |    ]
398 |   },
399 |   {
400 |    "cell_type": "markdown",
401 |    "metadata": {},
402 |    "source": [
403 |     "### Automatic differentiation with autograd\n",
404 |     "\n",
405 |     "PyTorch provides automatic differentiation through its autograd package. Let's see how it works with a simple example."
406 |    ]
407 |   },
408 |   {
409 |    "cell_type": "code",
410 |    "execution_count": 15,
411 |    "metadata": {},
412 |    "outputs": [
413 |     {
414 |      "data": {
415 |       "text/plain": [
416 |        "tensor([[1., 2.],\n",
417 |        "        [3., 4.]], requires_grad=True)"
418 |       ]
419 |      },
420 |      "execution_count": 15,
421 |      "metadata": {},
422 |      "output_type": "execute_result"
423 |     }
424 |    ],
425 |    "source": [
426 |     "x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)\n",
427 |     "x"
428 |    ]
429 |   },
430 |   {
431 |    "cell_type": "markdown",
432 |    "metadata": {},
433 |    "source": [
434 |     "Now let's define a function $y=f(x) = 0.6x^2$"
435 |    ]
436 |   },
437 |   {
438 |    "cell_type": "code",
439 |    "execution_count": 16,
440 |    "metadata": {},
441 |    "outputs": [
442 |     {
443 |      "data": {
444 |       "text/plain": [
445 |        "tensor([[0.6000, 2.4000],\n",
446 |        "        [5.4000, 9.6000]], grad_fn=<MulBackward0>)"
447 |       ]
448 |      },
449 |      "execution_count": 16,
450 |      "metadata": {},
451 |      "output_type": "execute_result"
452 |     }
453 |    ],
454 |    "source": [
455 |     "y = 0.6 * x * x\n",
456 |     "y"
457 |    ]
458 |   },
459 |   {
460 |    "cell_type": "markdown",
461 |    "metadata": {},
462 |    "source": [
463 |     "Let's compute the gradients"
464 |    ]
465 |   },
466 |   {
467 |    "cell_type": "code",
468 |    "execution_count": 17,
469 |    "metadata": {},
470 |    "outputs": [
471 |     {
472 |      "data": {
473 |       "text/plain": [
474 |        "tensor([[1.2000, 2.4000],\n",
475 |        "        [3.6000, 4.8000]])"
476 |       ]
477 |      },
478 |      "execution_count": 17,
479 |      "metadata": {},
480 |      "output_type": "execute_result"
481 |     }
482 |    ],
483 |    "source": [
484 |     "y.sum().backward()\n",
485 |     "x.grad"
486 |    ]
487 |   },
488 |   {
489 |    "cell_type": "markdown",
490 |    "metadata": {},
491 |    "source": [
492 |     "## 2. <a name=\"2\">PyTorch: Building a Neural Network</a>\n",
493 |     "<a href=\"#0\">Go to top</a>"
494 |    ]
495 |   },
496 |   {
497 |    "cell_type": "markdown",
498 |    "metadata": {},
499 |    "source": [
500 |     "### Implement a network with sequential mode \n",
501 |     "\n",
502 |     "Let's implement a simple neural network with two hidden layers of size 64 and 128 using the sequential mode. We will have 5 inputs, 1 output and some dropouts between the layers."
503 |    ]
504 |   },
505 |   {
506 |    "cell_type": "code",
507 |    "execution_count": 18,
508 |    "metadata": {},
509 |    "outputs": [
510 |     {
511 |      "data": {
512 |       "text/plain": [
513 |        "Sequential(\n",
514 |        "  (0): Linear(in_features=5, out_features=64, bias=True)\n",
515 |        "  (1): ReLU()\n",
516 |        "  (2): Dropout(p=0.4, inplace=False)\n",
517 |        "  (3): Linear(in_features=64, out_features=128, bias=True)\n",
518 |        "  (4): ReLU()\n",
519 |        "  (5): Dropout(p=0.3, inplace=False)\n",
520 |        "  (6): Linear(in_features=128, out_features=1, bias=True)\n",
521 |        "  (7): Sigmoid()\n",
522 |        ")"
523 |       ]
524 |      },
525 |      "execution_count": 18,
526 |      "metadata": {},
527 |      "output_type": "execute_result"
528 |     }
529 |    ],
530 |    "source": [
531 |     "import torch.nn as nn\n",
532 |     "\n",
533 |     "net = nn.Sequential(\n",
534 |     "    nn.Linear(5, 64),\n",
535 |     "    nn.ReLU(),\n",
536 |     "    nn.Dropout(0.4),\n",
537 |     "    nn.Linear(64, 128),\n",
538 |     "    nn.ReLU(),\n",
539 |     "    nn.Dropout(0.3),\n",
540 |     "    nn.Linear(128, 1),\n",
541 |     "    nn.Sigmoid()\n",
542 |     ")\n",
543 |     "net"
544 |    ]
545 |   },
546 |   {
547 |    "cell_type": "markdown",
548 |    "metadata": {},
549 |    "source": [
550 |     "Let's send a batch of data to this network (batch size is 4 in this case)"
551 |    ]
552 |   },
553 |   {
554 |    "cell_type": "code",
555 |    "execution_count": 19,
556 |    "metadata": {},
557 |    "outputs": [
558 |     {
559 |      "name": "stdout",
560 |      "output_type": "stream",
561 |      "text": [
562 |       "Random input data with shape torch.Size([4, 5])\n",
563 |       "tensor([[0.6891, 0.5221, 0.7773, 0.9408, 0.7547],\n",
564 |       "        [0.2574, 0.5219, 0.3243, 0.9965, 0.1699],\n",
565 |       "        [0.5062, 0.1165, 0.5882, 0.4178, 0.0667],\n",
566 |       "        [0.7801, 0.5441, 0.5210, 0.3496, 0.3415]])\n",
567 |       "\n",
568 |       "Output shape: torch.Size([4, 1])\n",
569 |       "Network output:  tensor([[0.4622],\n",
570 |       "        [0.5201],\n",
571 |       "        [0.5014],\n",
572 |       "        [0.4897]], grad_fn=<SigmoidBackward0>)\n"
573 |      ]
574 |     }
575 |    ],
576 |    "source": [
577 |     "# Input shape is (batch_size, data length)\n",
578 |     "x = torch.rand(4, 5)\n",
579 |     "y = net(x)\n",
580 |     "\n",
581 |     "print(\"Random input data with shape\", x.shape)\n",
582 |     "print(x)\n",
583 |     "print(\"\\nOutput shape:\", y.shape)\n",
584 |     "print(\"Network output: \", y)"
585 |    ]
586 |   },
587 |   {
588 |    "cell_type": "markdown",
589 |    "metadata": {},
590 |    "source": [
591 |     "We can also see the initialized weights for each layer."
592 |    ]
593 |   },
594 |   {
595 |    "cell_type": "code",
596 |    "execution_count": 20,
597 |    "metadata": {},
598 |    "outputs": [
599 |     {
600 |      "name": "stdout",
601 |      "output_type": "stream",
602 |      "text": [
603 |       "torch.Size([64, 5]) torch.Size([64])\n",
604 |       "Parameter containing:\n",
605 |       "tensor([[ 0.4140, -0.2097,  0.1934,  0.0987, -0.3828],\n",
606 |       "        [-0.3258, -0.1371, -0.2716,  0.2433,  0.3157],\n",
607 |       "        [ 0.3060,  0.2025, -0.1249, -0.2841, -0.1136],\n",
608 |       "        [ 0.0635, -0.2865,  0.3451, -0.2566,  0.2379],\n",
609 |       "        [-0.2022, -0.3182, -0.1616,  0.1147,  0.0196],\n",
610 |       "        [ 0.0514,  0.4180, -0.1799, -0.3582, -0.3167],\n",
611 |       "        [-0.2233, -0.0761,  0.3520, -0.1367,  0.0231],\n",
612 |       "        [ 0.0652,  0.0074, -0.1976,  0.0652, -0.0874],\n",
613 |       "        [ 0.2888,  0.1323,  0.2426, -0.3566, -0.1998],\n",
614 |       "        [-0.2552,  0.4010, -0.3824, -0.0141, -0.0860],\n",
615 |       "        [-0.2668, -0.2012, -0.0907, -0.2436,  0.1911],\n",
616 |       "        [ 0.1006, -0.0848, -0.3372,  0.4433,  0.1452],\n",
617 |       "        [ 0.0564,  0.0578, -0.0198, -0.2309, -0.0589],\n",
618 |       "        [-0.1424,  0.3267, -0.4456,  0.3973, -0.2852],\n",
619 |       "        [-0.4185, -0.0388,  0.3620,  0.2704, -0.0656],\n",
620 |       "        [-0.3409,  0.0460, -0.2915, -0.3246,  0.0052],\n",
621 |       "        [ 0.0496, -0.3019,  0.3156,  0.0079,  0.3143],\n",
622 |       "        [ 0.3830, -0.3231,  0.4193,  0.2370, -0.4453],\n",
623 |       "        [ 0.0963, -0.2967,  0.2495, -0.0356, -0.2095],\n",
624 |       "        [-0.0252, -0.1415, -0.3344, -0.0490, -0.3190],\n",
625 |       "        [-0.1498,  0.2223, -0.3334, -0.1432,  0.2012],\n",
626 |       "        [ 0.2746, -0.1717,  0.0109,  0.1719,  0.1868],\n",
627 |       "        [-0.2521,  0.1618,  0.2235, -0.4178,  0.3538],\n",
628 |       "        [-0.4126,  0.3020, -0.3663,  0.0462,  0.0851],\n",
629 |       "        [-0.0646,  0.4186, -0.2545, -0.3375, -0.0655],\n",
630 |       "        [-0.1856,  0.3097, -0.4052,  0.1449,  0.2151],\n",
631 |       "        [-0.1731, -0.1986, -0.1555,  0.1463, -0.0857],\n",
632 |       "        [-0.2523, -0.1973, -0.2736, -0.2426,  0.0587],\n",
633 |       "        [-0.3090, -0.1566, -0.1199,  0.3582, -0.2981],\n",
634 |       "        [ 0.3307,  0.2290, -0.0395, -0.2179, -0.1259],\n",
635 |       "        [ 0.3688, -0.1597,  0.3606,  0.0557, -0.0646],\n",
636 |       "        [ 0.2586, -0.3155, -0.0124, -0.2741, -0.1273],\n",
637 |       "        [-0.2071,  0.3514, -0.3882, -0.0621,  0.3038],\n",
638 |       "        [-0.0540,  0.2552, -0.3168,  0.1888,  0.4385],\n",
639 |       "        [-0.4350,  0.0270,  0.3162, -0.3843, -0.2997],\n",
640 |       "        [-0.3382,  0.2364,  0.0146,  0.0499,  0.3829],\n",
641 |       "        [ 0.1828, -0.3370,  0.3974, -0.1320,  0.2109],\n",
642 |       "        [ 0.0316, -0.2776,  0.2335, -0.1636, -0.3523],\n",
643 |       "        [ 0.4066, -0.0690,  0.3488,  0.3690, -0.0343],\n",
644 |       "        [-0.0262, -0.3873, -0.1189, -0.1093, -0.1183],\n",
645 |       "        [-0.0731, -0.1124, -0.2861,  0.3533, -0.3186],\n",
646 |       "        [ 0.0551, -0.2362, -0.4419,  0.2498, -0.1034],\n",
647 |       "        [-0.0657,  0.2276, -0.1839,  0.1906, -0.3480],\n",
648 |       "        [ 0.1351,  0.3720, -0.4355,  0.3825, -0.4155],\n",
649 |       "        [ 0.0468,  0.0226,  0.2082,  0.0353, -0.4345],\n",
650 |       "        [ 0.0359, -0.2988,  0.2885,  0.2160, -0.4355],\n",
651 |       "        [ 0.1941,  0.0895,  0.1975,  0.4031,  0.2917],\n",
652 |       "        [-0.2787, -0.1937,  0.3792, -0.0090,  0.2317],\n",
653 |       "        [-0.3598,  0.1516, -0.1411, -0.0970, -0.0474],\n",
654 |       "        [-0.3468,  0.0296, -0.4169, -0.0196, -0.4110],\n",
655 |       "        [-0.0034,  0.3747, -0.0232, -0.0106,  0.4303],\n",
656 |       "        [-0.0273,  0.3280, -0.1235, -0.0130,  0.0794],\n",
657 |       "        [ 0.1583, -0.2897, -0.3968, -0.1599,  0.3241],\n",
658 |       "        [-0.4112, -0.0183,  0.1791,  0.3945,  0.2804],\n",
659 |       "        [-0.3166, -0.3587, -0.0840, -0.3551,  0.2014],\n",
660 |       "        [-0.0169, -0.0654, -0.4339, -0.2892, -0.0567],\n",
661 |       "        [-0.3501,  0.0951,  0.2189,  0.2135, -0.3416],\n",
662 |       "        [-0.4256,  0.0879, -0.2271,  0.0058, -0.1469],\n",
663 |       "        [ 0.0039, -0.2761, -0.2123,  0.2835, -0.2394],\n",
664 |       "        [-0.0166, -0.3109,  0.0727,  0.3113,  0.1122],\n",
665 |       "        [-0.0071, -0.1357,  0.1317, -0.0891,  0.0404],\n",
666 |       "        [ 0.0461,  0.0357, -0.3066, -0.3605, -0.4040],\n",
667 |       "        [-0.0420,  0.3559, -0.3655,  0.2689,  0.2067],\n",
668 |       "        [ 0.4344, -0.2565,  0.2187, -0.1426, -0.3401]], requires_grad=True) Parameter containing:\n",
669 |       "tensor([ 0.2083,  0.2013,  0.0666,  0.1022,  0.0034, -0.0214,  0.1302,  0.4317,\n",
670 |       "         0.3050,  0.0675, -0.0308,  0.0456, -0.0562, -0.3867,  0.3498, -0.0969,\n",
671 |       "        -0.1095,  0.4283, -0.0587,  0.3590,  0.1086, -0.1134, -0.4071, -0.4229,\n",
672 |       "        -0.3123,  0.1790,  0.4012, -0.4471, -0.0255,  0.3238,  0.0350, -0.4072,\n",
673 |       "        -0.3451,  0.1151, -0.4271, -0.2166, -0.3191,  0.1175, -0.3801,  0.3896,\n",
674 |       "        -0.0230, -0.3635,  0.0548, -0.0588,  0.4303,  0.0133,  0.1301, -0.0525,\n",
675 |       "        -0.3908, -0.0770,  0.1977, -0.3945, -0.1251,  0.2640, -0.0665, -0.1348,\n",
676 |       "        -0.0917, -0.3470,  0.2834,  0.0611, -0.2251,  0.3852, -0.2869,  0.1219],\n",
677 |       "       requires_grad=True)\n"
678 |      ]
679 |     }
680 |    ],
681 |    "source": [
682 |     "print(net[0].weight.shape, net[0].bias.shape)\n",
683 |     "print(net[0].weight, net[0].bias)"
684 |    ]
685 |   },
686 |   {
687 |    "cell_type": "markdown",
688 |    "metadata": {},
689 |    "source": [
690 |     "### Implement the network flexibly:\n",
691 |     "\n",
692 |     "Now let's implement the same network using a custom module, which gives more flexibility in defining the forward pass."
693 |    ]
694 |   },
695 |   {
696 |    "cell_type": "code",
697 |    "execution_count": 21,
698 |    "metadata": {},
699 |    "outputs": [
700 |     {
701 |      "data": {
702 |       "text/plain": [
703 |        "MixMLP(\n",
704 |        "  (fc1): Linear(in_features=5, out_features=64, bias=True)\n",
705 |        "  (fc2): Linear(in_features=64, out_features=128, bias=True)\n",
706 |        "  (fc3): Linear(in_features=128, out_features=1, bias=True)\n",
707 |        "  (dropout1): Dropout(p=0.4, inplace=False)\n",
708 |        "  (dropout2): Dropout(p=0.3, inplace=False)\n",
709 |        ")"
710 |       ]
711 |      },
712 |      "execution_count": 21,
713 |      "metadata": {},
714 |      "output_type": "execute_result"
715 |     }
716 |    ],
717 |    "source": [
718 |     "class MixMLP(nn.Module):\n",
719 |     "    def __init__(self):\n",
720 |     "        super(MixMLP, self).__init__()\n",
721 |     "        self.fc1 = nn.Linear(5, 64)\n",
722 |     "        self.fc2 = nn.Linear(64, 128)\n",
723 |     "        self.fc3 = nn.Linear(128, 1)\n",
724 |     "        self.dropout1 = nn.Dropout(0.4)\n",
725 |     "        self.dropout2 = nn.Dropout(0.3)\n",
726 |     "        \n",
727 |     "    def forward(self, x):\n",
728 |     "        x = torch.relu(self.fc1(x))\n",
729 |     "        x = self.dropout1(x)\n",
730 |     "        x = torch.relu(self.fc2(x))\n",
731 |     "        x = self.dropout2(x)\n",
732 |     "        x = torch.sigmoid(self.fc3(x))\n",
733 |     "        return x\n",
734 |     "\n",
735 |     "net = MixMLP()\n",
736 |     "net"
737 |    ]
738 |   },
739 |   {
740 |    "cell_type": "markdown",
741 |    "metadata": {},
742 |    "source": [
743 |     "The usage of net is similar as before."
744 |    ]
745 |   },
746 |   {
747 |    "cell_type": "code",
748 |    "execution_count": 22,
749 |    "metadata": {
750 |     "tags": []
751 |    },
752 |    "outputs": [
753 |     {
754 |      "data": {
755 |       "text/plain": [
756 |        "tensor([[0.4729],\n",
757 |        "        [0.4819],\n",
758 |        "        [0.4444],\n",
759 |        "        [0.4414]], grad_fn=<SigmoidBackward0>)"
760 |       ]
761 |      },
762 |      "execution_count": 22,
763 |      "metadata": {},
764 |      "output_type": "execute_result"
765 |     }
766 |    ],
767 |    "source": [
768 |     "# Input shape is (batch_size, data length)\n",
769 |     "x = torch.rand(4, 5)\n",
770 |     "net(x)"
771 |    ]
772 |   }
773 |  ],
774 |  "metadata": {
775 |   "kernelspec": {
776 |    "display_name": "sagemaker-distribution:Python",
777 |    "language": "python",
778 |    "name": "conda-env-sagemaker-distribution-py"
779 |   },
780 |   "language_info": {
781 |    "codemirror_mode": {
782 |     "name": "ipython",
783 |     "version": 3
784 |    },
785 |    "file_extension": ".py",
786 |    "mimetype": "text/x-python",
787 |    "name": "python",
788 |    "nbconvert_exporter": "python",
789 |    "pygments_lexer": "ipython3",
790 |    "version": "3.10.14"
791 |   }
792 |  },
793 |  "nbformat": 4,
794 |  "nbformat_minor": 4
795 | }
796 | 


--------------------------------------------------------------------------------
/notebooks/mluvisuals.py:
--------------------------------------------------------------------------------
 1 | import json 
 2 | 
 3 | from typing import Dict, Any 
 4 | 
 5 | class BagOfWords:
 6 |     def __init__(self, **kwargs: Any) -> None:
 7 |         """
 8 |         Initialize the component.
 9 |         """
10 |         self.name = 'BagOfWords'
11 |         self.iife_script = '''<script>var BagOfWords=function(){"use strict";var pt=document.createElement("style");pt.textContent=`input.svelte-5idota{width:300px;overflow:scroll;height:38px}.description.svelte-5idota{font-size:16px;opacity:.9}#title-div.svelte-5idota{display:flex;align-items:center}section.svelte-5idota{margin:auto;padding-bottom:15px}table.svelte-5idota{font-size:16px;background-color:#fff;border-collapse:collapse}.table-head.svelte-5idota{padding:16px;font-size:16px;text-align:left;border:none}td.svelte-5idota{padding:16px;background-color:#fff;border-bottom:1px solid black;font-size:18px;text-align:center}button.svelte-5idota{padding:6px 10px;font-size:16px;border:1px solid black;text-align:center;margin:2px 8px;font-weight:700}button.svelte-5idota:hover,.active.svelte-5idota{color:#fff;background-color:coral}thead.svelte-5idota{border-bottom:1px solid black}
12 | `,document.head.appendChild(pt);function Q(){}function gt(t){return t()}function vt(){return Object.create(null)}function R(t){t.forEach(gt)}function bt(t){return typeof t=="function"}function It(t,e){return t!=t?e==e:t!==e||t&&typeof t=="object"||typeof t=="function"}function Pt(t){return Object.keys(t).length===0}function s(t,e){t.appendChild(e)}function U(t,e,n){t.insertBefore(e,n||null)}function M(t){t.parentNode&&t.parentNode.removeChild(t)}function Z(t,e){for(let n=0;n<t.length;n+=1)t[n]&&t[n].d(e)}function r(t){return document.createElement(t)}function B(t){return document.createTextNode(t)}function m(){return B(" ")}function V(t,e,n,i){return t.addEventListener(e,n,i),()=>t.removeEventListener(e,n,i)}function h(t,e,n){n==null?t.removeAttribute(e):t.getAttribute(e)!==n&&t.setAttribute(e,n)}function qt(t){return Array.from(t.childNodes)}function tt(t,e){e=""+e,t.wholeText!==e&&(t.data=e)}function D(t,e){t.value=e==null?"":e}function et(t,e,n){t.classList[n?"add":"remove"](e)}let ut;function X(t){ut=t}const Y=[],mt=[],nt=[],kt=[],Ft=Promise.resolve();let ft=!1;function Gt(){ft||(ft=!0,Ft.then(yt))}function dt(t){nt.push(t)}const ht=new Set;let lt=0;function yt(){const t=ut;do{for(;lt<Y.length;){const e=Y[lt];lt++,X(e),Ht(e.$$)}for(X(null),Y.length=0,lt=0;mt.length;)mt.pop()();for(let e=0;e<nt.length;e+=1){const n=nt[e];ht.has(n)||(ht.add(n),n())}nt.length=0}while(Y.length);for(;kt.length;)kt.pop()();ft=!1,ht.clear(),X(t)}function Ht(t){if(t.fragment!==null){t.update(),R(t.before_update);const e=t.dirty;t.dirty=[-1],t.fragment&&t.fragment.p(t.ctx,e),t.after_update.forEach(dt)}}const Jt=new Set;function Kt(t,e){t&&t.i&&(Jt.delete(t),t.i(e))}function Qt(t,e,n,i){const{fragment:c,after_update:u}=t.$$;c&&c.m(e,n),i||dt(()=>{const y=t.$$.on_mount.map(gt).filter(bt);t.$$.on_destroy?t.$$.on_destroy.push(...y):R(y),t.$$.on_mount=[]}),u.forEach(dt)}function Rt(t,e){const n=t.$$;n.fragment!==null&&(R(n.on_destroy),n.fragment&&n.fragment.d(e),n.on_destroy=n.fragment=null,n.ctx=[])}function Ut(t,e){t.$$.dirty[0]===-1&&(Y.push(t),Gt(),t.$$.dirty.fill(0)),t.$$.dirty[e/31|0]|=1<<e%31}function Vt(t,e,n,i,c,u,y,L=[-1]){const d=ut;X(t);const a=t.$$={fragment:null,ctx:[],props:u,update:Q,not_equal:c,bound:vt(),on_mount:[],on_destroy:[],on_disconnect:[],before_update:[],after_update:[],context:new Map(e.context||(d?d.$$.context:[])),callbacks:vt(),dirty:L,skip_bound:!1,root:e.target||d.$$.root};y&&y(a.root);let x=!1;if(a.ctx=n?n(t,e.props||{},(f,$,...k)=>{const W=k.length?k[0]:$;return a.ctx&&c(a.ctx[f],a.ctx[f]=W)&&(!a.skip_bound&&a.bound[f]&&a.bound[f](W),x&&Ut(t,f)),$}):[],a.update(),x=!0,R(a.before_update),a.fragment=i?i(a.ctx):!1,e.target){if(e.hydrate){const f=qt(e.target);a.fragment&&a.fragment.l(f),f.forEach(M)}else a.fragment&&a.fragment.c();e.intro&&Kt(t.$$.fragment),Qt(t,e.target,e.anchor,e.customElement),yt()}X(d)}class Xt{$destroy(){Rt(this,1),this.$destroy=Q}$on(e,n){if(!bt(n))return Q;const i=this.$$.callbacks[e]||(this.$$.callbacks[e]=[]);return i.push(n),()=>{const c=i.indexOf(n);c!==-1&&i.splice(c,1)}}$set(e){this.$$set&&!Pt(e)&&(this.$$.skip_bound=!0,this.$$set(e),this.$$.skip_bound=!1)}}const ee="";function xt(t,e,n){const i=t.slice();return i[15]=e[n],i}function wt(t,e,n){const i=t.slice();return i[15]=e[n],i}function Ct(t,e,n){const i=t.slice();return i[15]=e[n],i}function $t(t,e,n){const i=t.slice();return i[15]=e[n],i}function Et(t){let e,n=t[15]+"",i;return{c(){e=r("th"),i=B(n),h(e,"class","table-head svelte-5idota")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&16&&n!==(n=c[15]+"")&&tt(i,n)},d(c){c&&M(e)}}}function St(t){let e,n=t[7][t[15]]+"",i;return{c(){e=r("td"),i=B(n),h(e,"class","svelte-5idota")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&144&&n!==(n=c[7][c[15]]+"")&&tt(i,n)},d(c){c&&M(e)}}}function Bt(t){let e,n=t[6][t[15]]+"",i;return{c(){e=r("td"),i=B(n),h(e,"class","svelte-5idota")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&80&&n!==(n=c[6][c[15]]+"")&&tt(i,n)},d(c){c&&M(e)}}}function Ot(t){let e,n=t[5][t[15]]+"",i;return{c(){e=r("td"),i=B(n),h(e,"class","svelte-5idota")},m(c,u){U(c,e,u),s(e,i)},p(c,u){u&48&&n!==(n=c[5][c[15]]+"")&&tt(i,n)},d(c){c&&M(e)}}}function Yt(t){let e,n,i,c,u,y,L,d,a,x,f,$,k,W,O,I,z,P,ot,it,w,E,S,A,zt,j,At,Lt,q,ct,st,Wt,N,jt,Nt,F,rt,at,Tt,T,Mt,_t,Dt,G=t[4],p=[];for(let o=0;o<G.length;o+=1)p[o]=Et($t(t,G,o));let H=t[4],g=[];for(let o=0;o<H.length;o+=1)g[o]=St(Ct(t,H,o));let J=t[4],v=[];for(let o=0;o<J.length;o+=1)v[o]=Bt(wt(t,J,o));let K=t[4],b=[];for(let o=0;o<K.length;o+=1)b[o]=Ot(xt(t,K,o));return{c(){e=r("section"),n=r("h2"),n.textContent="Bag of Words Demo",i=m(),c=r("p"),c.textContent=`Edit the sentences and watch the corresponding vocabulary and cell-counts
13 |     update:`,u=m(),y=r("br"),L=m(),d=r("div"),a=r("p"),a.textContent="Count Method:",x=m(),f=r("button"),f.textContent="Count",$=m(),k=r("button"),k.textContent="Binary",W=m(),O=r("table"),I=r("thead"),z=r("tr"),P=r("th"),ot=m();for(let o=0;o<p.length;o+=1)p[o].c();it=m(),w=r("tbody"),E=r("tr"),S=r("td"),A=r("div"),zt=B(`Sentence 1:
14 |             `),j=r("input"),At=m();for(let o=0;o<g.length;o+=1)g[o].c();Lt=m(),q=r("tr"),ct=r("td"),st=r("div"),Wt=B(`Sentence 2:
15 |             `),N=r("input"),jt=m();for(let o=0;o<v.length;o+=1)v[o].c();Nt=m(),F=r("tr"),rt=r("td"),at=r("div"),Tt=B(`Sentence 3:
16 |             `),T=r("input"),Mt=m();for(let o=0;o<b.length;o+=1)b[o].c();h(c,"class","description svelte-5idota"),h(a,"class","description svelte-5idota"),h(f,"class","svelte-5idota"),et(f,"active",t[3]==="count"),h(k,"class","svelte-5idota"),et(k,"active",t[3]==="ohe"),h(d,"id","title-div"),h(d,"class","svelte-5idota"),h(I,"class","svelte-5idota"),h(j,"class","svelte-5idota"),h(S,"class","table-head svelte-5idota"),h(N,"class","svelte-5idota"),h(ct,"class","table-head svelte-5idota"),h(T,"class","svelte-5idota"),h(rt,"class","table-head svelte-5idota"),h(O,"class","svelte-5idota"),h(e,"class","svelte-5idota")},m(o,_){U(o,e,_),s(e,n),s(e,i),s(e,c),s(e,u),s(e,y),s(e,L),s(e,d),s(d,a),s(d,x),s(d,f),s(d,$),s(d,k),s(e,W),s(e,O),s(O,I),s(I,z),s(z,P),s(z,ot);for(let l=0;l<p.length;l+=1)p[l].m(z,null);s(O,it),s(O,w),s(w,E),s(E,S),s(S,A),s(A,zt),s(A,j),D(j,t[0]),s(E,At);for(let l=0;l<g.length;l+=1)g[l].m(E,null);s(w,Lt),s(w,q),s(q,ct),s(ct,st),s(st,Wt),s(st,N),D(N,t[1]),s(q,jt);for(let l=0;l<v.length;l+=1)v[l].m(q,null);s(w,Nt),s(w,F),s(F,rt),s(rt,at),s(at,Tt),s(at,T),D(T,t[2]),s(F,Mt);for(let l=0;l<b.length;l+=1)b[l].m(F,null);_t||(Dt=[V(f,"click",t[9]),V(k,"click",t[10]),V(j,"input",t[11]),V(N,"input",t[12]),V(T,"input",t[13])],_t=!0)},p(o,[_]){if(_&8&&et(f,"active",o[3]==="count"),_&8&&et(k,"active",o[3]==="ohe"),_&16){G=o[4];let l;for(l=0;l<G.length;l+=1){const C=$t(o,G,l);p[l]?p[l].p(C,_):(p[l]=Et(C),p[l].c(),p[l].m(z,null))}for(;l<p.length;l+=1)p[l].d(1);p.length=G.length}if(_&1&&j.value!==o[0]&&D(j,o[0]),_&144){H=o[4];let l;for(l=0;l<H.length;l+=1){const C=Ct(o,H,l);g[l]?g[l].p(C,_):(g[l]=St(C),g[l].c(),g[l].m(E,null))}for(;l<g.length;l+=1)g[l].d(1);g.length=H.length}if(_&2&&N.value!==o[1]&&D(N,o[1]),_&80){J=o[4];let l;for(l=0;l<J.length;l+=1){const C=wt(o,J,l);v[l]?v[l].p(C,_):(v[l]=Bt(C),v[l].c(),v[l].m(q,null))}for(;l<v.length;l+=1)v[l].d(1);v.length=J.length}if(_&4&&T.value!==o[2]&&D(T,o[2]),_&48){K=o[4];let l;for(l=0;l<K.length;l+=1){const C=xt(o,K,l);b[l]?b[l].p(C,_):(b[l]=Ot(C),b[l].c(),b[l].m(F,null))}for(;l<b.length;l+=1)b[l].d(1);b.length=K.length}},i:Q,o:Q,d(o){o&&M(e),Z(p,o),Z(g,o),Z(v,o),Z(b,o),_t=!1,R(Dt)}}}function Zt(t,e,n){let i,c,u,y,L,d="I",a="love dogs",x="dogs dogs dogs",f="count";function $(P,ot,it){let w={},E=ot.replace(/[^\w\s]/gi,"").toLowerCase();for(const S of P)it==="count"?w[S]=E.split(" ").filter(A=>A===S).length:w[S]=E.split(" ").filter(A=>A===S).length?1:0;return w}const k=()=>{n(3,f="count")},W=()=>{n(3,f="ohe")};function O(){d=this.value,n(0,d)}function I(){a=this.value,n(1,a)}function z(){x=this.value,n(2,x)}return t.$$.update=()=>{t.$$.dirty&7&&n(8,i=(d+" "+a+" "+x).replace(/[^\w\s]/gi,"").toLowerCase()),t.$$.dirty&256&&console.log("tokens",i),t.$$.dirty&256&&n(4,c=Array.from(new Set(i.split(" "))).filter(P=>P!=="")),t.$$.dirty&16&&console.log("vocab",c),t.$$.dirty&25&&n(7,u=$(c,d,f)),t.$$.dirty&26&&n(6,y=$(c,a,f)),t.$$.dirty&28&&n(5,L=$(c,x,f))},[d,a,x,f,c,L,y,u,i,k,W,O,I,z]}class te extends Xt{constructor(e){super(),Vt(this,e,Zt,Yt,It,{})}}return te}();
17 | </script>'''
18 |         self.div_id = 'BagOfWords-328cd234'
19 |         self.props = []
20 |         self.markup = ""
21 |         self.add_params(kwargs)
22 | 
23 |     def add_params(self, params: Dict[str, Any]) -> None:
24 |         """
25 |         Add parameters to the component and serve in html.
26 | 
27 |         Parameters
28 |         ----------
29 |         params : dict
30 |             The parameters to add to the component.
31 |         """
32 |         js_data = json.dumps(params, indent=0)
33 |         self.markup = f"""
34 |         <div id="{self.div_id}"></div>
35 |         <script>
36 |         (() => {{
37 |             var data = {js_data};
38 |             window.{self.name}_data = data;
39 |             var {self.name}_inst = new {self.name}({{
40 |                 "target": document.getElementById("{self.div_id}"),
41 |                 "props": data
42 |             }});
43 |         }})();
44 |         </script>
45 |         """
46 | 
47 |     def _repr_html_(self) -> str:
48 |         """
49 |         Return the component as an HTML string.
50 |         """
51 |         return f"""
52 |         {self.iife_script}
53 |         {self.markup}
54 |         """
55 | 
56 |     def __call__(self, **kwargs: Any) -> "BagOfWords":
57 |         """
58 |         Call the component with the given kwargs.
59 | 
60 |         Parameters
61 |         ----------
62 |         kwargs : any
63 |             The kwargs to pass to the component.
64 | 
65 |         Returns
66 |         -------
67 |         PackagedComponent
68 |             A python class representing the svelte component, renderable in Jupyter.
69 |         """
70 |         # render with given arguments
71 |         self.add_params(kwargs)
72 |         return self
73 | 
74 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | seaborn>=0.13.2


--------------------------------------------------------------------------------
/slides/MLA-TAB-Lecture1.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-tab/03005a6ef0ede849a0dd13c389a7511c30a0d2ea/slides/MLA-TAB-Lecture1.pptx


--------------------------------------------------------------------------------
/slides/MLA-TAB-Lecture2.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-tab/03005a6ef0ede849a0dd13c389a7511c30a0d2ea/slides/MLA-TAB-Lecture2.pptx


--------------------------------------------------------------------------------
/slides/MLA-TAB-Lecture3.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-tab/03005a6ef0ede849a0dd13c389a7511c30a0d2ea/slides/MLA-TAB-Lecture3.pptx


--------------------------------------------------------------------------------