├── .github
│   ├── ISSUE_TEMPLATE
│   │   ├── bug-report.md
│   │   ├── documentation-improvement.md
│   │   ├── feature-request.md
│   │   ├── improvement.md
│   │   ├── new-exercise.md
│   │   └── submit-question.md
│   ├── pull_request_template.md
│   └── workflows
│       └── lint.yml
├── .gitignore
├── 000
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── hello_world.py
│       └── readme.md
├── 001
│   ├── data
│   │   ├── test.csv
│   │   └── train.csv
│   ├── exercise
│   │   ├── readme.md
│   │   └── starter-notebook.ipynb
│   └── solution
│       ├── README.md
│       ├── prediction.csv
│       ├── prediction_pt.csv
│       ├── submission.csv
│       ├── titanic_classical.ipynb
│       ├── titanic_pt_nn.ipynb
│       └── titanic_tf_nn.ipynb
├── 002
│   ├── data
│   │   └── housing_prices.csv
│   ├── exercise
│   │   ├── linear_regression.ipynb
│   │   └── readme.md
│   └── solution
│       ├── linear_regression.ipynb
│       └── readme.md
├── 003
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── README.md
│       ├── digit_recog_nn.ipynb
│       ├── images
│       │   ├── ANN.jpg
│       │   └── cnn-procedure.png
│       ├── test_cnn.h5
│       └── test_nn.h5
├── 004
│   ├── data
│   │   └── AM.txt
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── readme.md
│       └── text_generation_model.ipynb
├── 005
│   ├── data
│   │   ├── books_small.json
│   │   └── books_small_10000.json
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── readme.md
│       └── sentiment_analysis.ipynb
├── 006
│   ├── data
│   │   ├── breast_cancer_diagnosis.csv
│   │   └── readme.md
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── ensemble_techniques.ipynb
│       └── readme.md
├── 007
│   ├── data
│   │   ├── employee_attrition.csv
│   │   └── readme.md
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       └── readme.md
├── 008
│   ├── exercise
│   │   ├── NaiveBayes - exercise starter.ipynb
│   │   └── README.md
│   └── solution
│       └── NaiveBayes Solution.ipynb
├── 009
│   ├── data
│   │   ├── test.csv
│   │   └── train.csv
│   ├── exercise
│   │   └── readme.md
│   └── solution
│       ├── insurance_cross_sell.ipynb
│       └── readme.md
├── 010
│   ├── exercise
│   │   ├── knn_starter_exercise.ipynb
│   │   └── readme.md
│   └── solution
│       ├── knn_from_scratch.ipynb
│       ├── knn_using_sklearn.ipynb
│       └── readme.md
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── Resources
│   ├── Numpy
│   │   └── NumPy Tutorial.ipynb
│   └── Pandas
│       ├── Pandas Tutorial.ipynb
│       ├── pokemon_data.csv
│       ├── pokemon_data.txt
│       ├── pokemon_data.xlsx
│       ├── ufo-sightings.csv
│       └── user.txt
└── __init__.py
/.github/ISSUE_TEMPLATE/bug-report.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Bug report
3 | about: Create a report to help us improve
4 | title: "[BUG]"
5 | labels: bug
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Describe the bug**
11 | A clear and concise description of what the bug is.
12 |
13 | **To Reproduce**
14 | Steps to reproduce the behavior:
15 | 1. Go to '...'
16 | 2. Click on '....'
17 | 3. Scroll down to '....'
18 | 4. See error
19 |
20 | **Expected behavior**
21 | A clear and concise description of what you expected to happen.
22 |
23 | **Screenshots**
24 | If applicable, add screenshots to help explain your problem.
25 |
26 | **Desktop (please complete the following information):**
27 | - OS: [e.g. iOS]
28 | - Browser [e.g. chrome, safari]
29 | - Version [e.g. 22]
30 |
31 | **Smartphone (please complete the following information):**
32 | - Device: [e.g. iPhone6]
33 | - OS: [e.g. iOS8.1]
34 | - Browser [e.g. stock browser, safari]
35 | - Version [e.g. 22]
36 |
37 | **Additional context**
38 | Add any other context about the problem here.
39 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/documentation-improvement.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Documentation Improvement
3 | about: Report wrong/missing documentation or improvement of documentation
4 | title: "[DOC]"
5 | labels: documentation
6 | assignees: ''
7 |
8 | ---
9 |
10 |
11 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/feature-request.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Feature request
3 | about: Suggest an idea for this project
4 | title: "[FEA]"
5 | labels: 'feature'
6 | assignees: ''
7 |
8 | ---
9 |
10 | **Is your feature request related to a problem? Please describe.**
11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
12 |
13 | **Describe the solution you'd like**
14 | A clear and concise description of what you want to happen.
15 |
16 | **Describe alternatives you've considered**
17 | A clear and concise description of any alternative solutions or features you've considered.
18 |
19 | **Additional context**
20 | Add any other context or screenshots about the feature request here.
21 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/improvement.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Exercise/solution Improvement
3 | about: Suggest improvement/enhancement of a particular exercise/solution
4 | title: "[IMP]"
5 | labels: improvement
6 | assignees: ''
7 |
8 | ---
9 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/new-exercise.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: New exercise
3 | about: Suggest a new exercise
4 | title: "[EXE]"
5 | labels: idea for exercise
6 | assignees: ''
7 |
8 | ---
9 |
10 | #### Learning Goals
11 | [Learning goals, bulleted/numbered list is preferred]
12 | [e.g. learn the concept and the use of train/validation/test dataset using scikit-learn ]
13 |
14 | ### Exercise Statement
15 | [Explain and describe what the exercise is]
16 | [e.g. apply simple random-forest model to classify titanic survivability from titanic data ]
17 |
18 |
19 | ### Prerequisites
20 | [Prerequisites, in terms of concepts or other exercises in this repo]
21 | [e.g. random-forest model, stochastic gradient descent, exercise #32]
22 |
23 | ### Data source/summary:
24 | [Provide a succinct summary of what the data is and where it is from]
25 | [e.g. This involves covid19 fatality dataset from John Hopkin's website (links..) ]
26 |
27 | ### (Optional) Suggest/Propose Solutions
28 | [e.g. I have the solution using PyTorch, will be happy to create pull request to include the exercise statement/solution]
29 | [e.g. I think chapter 3 of A. Geron's textbook works out the solution for this exercise]
30 | [e.g. fast.ai's chapter 5 has the perfect solution for this]
31 |
32 |
33 | ### (Optional) Further Links/Credits to Relevant Resources:
34 | [e.g. This exercise and solution's proposal came from a lab session from DL2020]
35 |
--------------------------------------------------------------------------------
/.github/ISSUE_TEMPLATE/submit-question.md:
--------------------------------------------------------------------------------
1 | ---
2 | name: Submit Question
3 | about: Ask any question on our projects
4 | title: "[QST]"
5 | labels: ''
6 | assignees: ''
7 |
8 | ---
9 |
10 |
11 |
--------------------------------------------------------------------------------
/.github/pull_request_template.md:
--------------------------------------------------------------------------------
1 | #### Reference Issues/PRs
2 |
5 |
6 |
7 | #### What does this implement/fix? Explain your changes.
8 |
9 |
10 |
11 | #### Any other comments?
12 |
13 |
14 |
15 |
--------------------------------------------------------------------------------
/.github/workflows/lint.yml:
--------------------------------------------------------------------------------
1 | # This workflow installs flake8 and runs lint with a single version of Python (3.8.6)
2 |
3 | name: Flake8
4 |
5 | on:
6 | push:
7 | branches: [ master ]
8 | pull_request:
9 | branches: [ master ]
10 |
11 | jobs:
12 | flake8_py3:
13 | runs-on: ubuntu-latest
14 | steps:
15 | - name: Setup Python
16 | uses: actions/setup-python@v1
17 | with:
18 | python-version: 3.8.6
19 | architecture: x64
20 |
21 |
22 | - name: Install flake8
23 | run: pip install flake8
24 |
25 | - name: Run flake8
26 | uses: suo/flake8-github-action@releases/v1
27 | with:
28 | checkName: 'flake8_py3' # NOTE: this needs to be the same as the job name
29 | env:
30 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
31 |
32 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Created by https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
2 | # Edit at https://www.toptal.com/developers/gitignore?templates=python,jupyternotebooks
3 |
4 | ### JupyterNotebooks ###
5 | # gitignore template for Jupyter Notebooks
6 | # website: http://jupyter.org/
7 |
8 | .ipynb_checkpoints
9 | */.ipynb_checkpoints/*
10 |
11 | # IPython
12 | profile_default/
13 | ipython_config.py
14 |
15 | # Remove previous ipynb_checkpoints
16 | # git rm -r .ipynb_checkpoints/
17 |
18 | ### Python ###
19 | # Byte-compiled / optimized / DLL files
20 | __pycache__/
21 | *.py[cod]
22 | *$py.class
23 |
24 | # C extensions
25 | *.so
26 |
27 | # Distribution / packaging
28 | .Python
29 | build/
30 | develop-eggs/
31 | dist/
32 | downloads/
33 | eggs/
34 | .eggs/
35 | lib/
36 | lib64/
37 | parts/
38 | sdist/
39 | var/
40 | wheels/
41 | pip-wheel-metadata/
42 | share/python-wheels/
43 | *.egg-info/
44 | .installed.cfg
45 | *.egg
46 | MANIFEST
47 |
48 | # PyInstaller
49 | # Usually these files are written by a python script from a template
50 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
51 | *.manifest
52 | *.spec
53 |
54 | # Installer logs
55 | pip-log.txt
56 | pip-delete-this-directory.txt
57 |
58 | # Unit test / coverage reports
59 | htmlcov/
60 | .tox/
61 | .nox/
62 | .coverage
63 | .coverage.*
64 | .cache
65 | nosetests.xml
66 | coverage.xml
67 | *.cover
68 | *.py,cover
69 | .hypothesis/
70 | .pytest_cache/
71 |
72 | # Translations
73 | *.mo
74 | *.pot
75 |
76 | # Django stuff:
77 | *.log
78 | local_settings.py
79 | db.sqlite3
80 | db.sqlite3-journal
81 |
82 | # Flask stuff:
83 | instance/
84 | .webassets-cache
85 |
86 | # Scrapy stuff:
87 | .scrapy
88 |
89 | # Sphinx documentation
90 | docs/_build/
91 |
92 | # PyBuilder
93 | target/
94 |
95 | # Jupyter Notebook
96 |
97 | # IPython
98 |
99 | # pyenv
100 | .python-version
101 |
102 | # pipenv
103 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
104 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
105 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
106 | # install all needed dependencies.
107 | #Pipfile.lock
108 |
109 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow
110 | __pypackages__/
111 |
112 | # Celery stuff
113 | celerybeat-schedule
114 | celerybeat.pid
115 |
116 | # SageMath parsed files
117 | *.sage.py
118 |
119 | # Environments
120 | .env
121 | .venv
122 | env/
123 | venv/
124 | ENV/
125 | env.bak/
126 | venv.bak/
127 |
128 | # Spyder project settings
129 | .spyderproject
130 | .spyproject
131 |
132 | # Rope project settings
133 | .ropeproject
134 |
135 | # mkdocs documentation
136 | /site
137 |
138 | # mypy
139 | .mypy_cache/
140 | .dmypy.json
141 | dmypy.json
142 |
143 | # Pyre type checker
144 | .pyre/
145 |
146 | # pytype static type analyzer
147 | .pytype/
148 |
149 | # End of https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks
150 |
--------------------------------------------------------------------------------
/000/exercise/readme.md:
--------------------------------------------------------------------------------
1 | # Exercise goal
2 | - Learn the `print` statement in Python 3.xx
3 |
4 | # Task
5 | - Print out `hello world!`
6 |
--------------------------------------------------------------------------------
/000/solution/hello_world.py:
--------------------------------------------------------------------------------
1 | def print_hello():
2 | print("Hello world !")
3 |
4 |
5 | print_hello()
6 |
--------------------------------------------------------------------------------
/000/solution/readme.md:
--------------------------------------------------------------------------------
1 | # My solution
2 | - Use the handy `print` function of Python 3.x
3 | - The `hello_world.py` file implements that.
4 |
--------------------------------------------------------------------------------
/001/exercise/readme.md:
--------------------------------------------------------------------------------
1 | # 👋🛳️ Ahoy, Welcome to your first ML project.
2 |
3 | This is one of the first newbie challenges you should take on to enter the world of ML.
4 |
5 | We will use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
6 |
7 | # THE CHALLENGE
8 |
9 | The sinking of the Titanic is one of the most infamous shipwrecks in history.
10 | On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
11 |
12 | While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
13 |
14 | In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).
15 | In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is in `./data/train.csv` and the other is in `./data/test.csv`.
16 |
17 | The file `./data/train.csv` will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”.
18 |
19 | The `./data/test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes.
20 |
21 | Using the patterns you find in the `./data/train.csv` data, predict whether the other 418 passengers on board (found in `./data/test.csv`) survived.
22 |
23 | # Tasks
24 |
25 | Most machine learning projects follow a similar outline (a set of phases), listed below to help you get started.
26 |
27 | - Inspecting Data (this process is where you familiarize yourself with the data)
28 | - Inspect the data
29 | - Check for null/missing values
30 | - View statistical details using ``describe()``
31 |
32 | What can you infer from the statistical measures? Like possible outliers?
33 |
34 | - Data Analysis and Visualization (this process is where you explore the data, clean it and infer some insights from it)
35 |
36 | - Delete columns irrelevant or not useful for prediction
37 | - Get the average rate of survival by Gender, Pclass
38 | - Plotting the number of people who survived and who didn't survive
39 | - Plot the percentage of survival by gender
40 | - Handle null/missing values
41 | - Plot the survival rate by Age
42 | - Handle categorical text values and turn them into numerical
43 | - Plot the correlation between features and label
44 |
45 | What do you infer from the data? What can you conclude from it? Who is most likely to survive, based on the data?
46 |
47 | - Model Building and Evaluation (this process is where you start building models and choose a model based on accuracy metrics)
48 | - Choose a model suitable for classification
49 | - Fit the training data
50 | - Use cross-validation to get the average accuracy for model selection or as an accuracy benchmark (see the sketch after this list)
51 | - Find out how accurate the model performs on test data using some metrics
52 | - **Bonus**: Try other classification algorithms and compare the accuracy metrics (and/or F1 score) by presenting them in a readable, easy to compare format
53 | - Choose the model with the best accuracy metric
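
To make the model-selection step concrete, here is a minimal sketch of the cross-validation comparison. It assumes you have already built a numeric feature matrix `X` and label vector `y` from `train.csv`; the candidate models are only an illustration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Candidate classifiers -- any scikit-learn estimator works here
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'random forest': RandomForestClassifier(n_estimators=100),
}

# 5-fold cross-validation accuracy on the training data only
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```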
54 |
55 | - Hint: To get you started, we have written a minimal template-like notebook:
56 |
57 | `starter-notebook.ipynb`
58 |
59 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/exercise/starter-notebook.ipynb)
60 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/exercise/starter-notebook.ipynb)
61 |
62 |
63 | Project and Data Source: https://www.kaggle.com/c/titanic
64 |
--------------------------------------------------------------------------------
/001/exercise/starter-notebook.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "metadata": {
3 | "language_info": {
4 | "codemirror_mode": {
5 | "name": "ipython",
6 | "version": 3
7 | },
8 | "file_extension": ".py",
9 | "mimetype": "text/x-python",
10 | "name": "python",
11 | "nbconvert_exporter": "python",
12 | "pygments_lexer": "ipython3",
13 | "version": "3.8.2-final"
14 | },
15 | "orig_nbformat": 2,
16 | "kernelspec": {
17 | "name": "python38264bitpandassconda285c25d0d8784f5bba9542830bc5427b",
18 | "display_name": "Python 3.8.2 64-bit ('pandass': conda)"
19 | }
20 | },
21 | "nbformat": 4,
22 | "nbformat_minor": 2,
23 | "cells": [
24 | {
25 | "cell_type": "markdown",
26 | "metadata": {},
27 | "source": [
28 | "(HTML banner content omitted)"
33 | ]
34 | },
35 | {
36 | "cell_type": "markdown",
37 | "metadata": {},
38 | "source": [
39 | "## A starter notebook with tasks listed to help you get started on your first machine learning project"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "metadata": {},
45 | "source": [
46 | "### First phase: Inspecting the data\n",
47 | "(this process is where you familiarize yourself with the data)\n",
48 | "\n",
49 | "Tasks:\n",
50 | "- Inspect the data\n",
51 | "- Check for null/missing values\n",
52 | "- View statistical details using ``describe()``\n",
53 | "\n",
54 | "Step one has been done for you \n",
55 | "After finishing the tasks think of the following:\n",
56 | "\n",
57 | "What can you infer from the statistical measures? Like possible outliers?"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "**Important note:**\n",
65 | "\n",
66 | "The data is already split into train.csv and test.csv. You should build your model using train.csv **without** using any test.csv data, as that would leak test information into your model and bias its evaluation.\n",
67 | "\n",
68 | "#### Reminder: you do not have to stick to the tasks listed word for word; they are only a guide.\n",
69 | "#### Feel free to explore more and do more on your own!"
70 | ]
71 | },
72 | {
73 | "cell_type": "code",
74 | "execution_count": 2,
75 | "metadata": {},
76 | "outputs": [],
77 | "source": [
78 | "# imports \n",
79 | "import pandas as pd"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 3,
85 | "metadata": {},
86 | "outputs": [
87 | {
88 | "output_type": "execute_result",
89 | "data": {
90 | "text/plain": " PassengerId Survived Pclass \\\n0 1 0 3 \n1 2 1 1 \n2 3 1 3 \n3 4 1 1 \n4 5 0 3 \n\n Name Sex Age SibSp \\\n0 Braund, Mr. Owen Harris male 22.0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n2 Heikkinen, Miss. Laina female 26.0 0 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n4 Allen, Mr. William Henry male 35.0 0 \n\n Parch Ticket Fare Cabin Embarked \n0 0 A/5 21171 7.2500 NaN S \n1 0 PC 17599 71.2833 C85 C \n2 0 STON/O2. 3101282 7.9250 NaN S \n3 0 113803 53.1000 C123 S \n4 0 373450 8.0500 NaN S ",
91 | "text/html": "(HTML table rendering of train.head() omitted)"
92 | },
93 | "metadata": {},
94 | "execution_count": 3
95 | }
96 | ],
97 | "source": [
98 |
99 | "project_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/'\n",
100 | "data_path = 'master/001/data/'\n",
101 | "train = pd.read_csv(project_url+data_path+'train.csv')\n",
102 | "test = pd.read_csv(project_url+data_path+'test.csv')\n",
103 | "\n",
104 | "train.head()"
105 | ]
106 | },
107 | {
108 | "cell_type": "code",
109 | "execution_count": 4,
110 | "metadata": {},
111 | "outputs": [
112 | {
113 | "output_type": "execute_result",
114 | "data": {
115 | "text/plain": " PassengerId Pclass Name Sex \\\n0 892 3 Kelly, Mr. James male \n1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n2 894 2 Myles, Mr. Thomas Francis male \n3 895 3 Wirz, Mr. Albert male \n4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n\n Age SibSp Parch Ticket Fare Cabin Embarked \n0 34.5 0 0 330911 7.8292 NaN Q \n1 47.0 1 0 363272 7.0000 NaN S \n2 62.0 0 0 240276 9.6875 NaN Q \n3 27.0 0 0 315154 8.6625 NaN S \n4 22.0 1 1 3101298 12.2875 NaN S ",
116 | "text/html": "(HTML table rendering of test.head() omitted)"
117 | },
118 | "metadata": {},
119 | "execution_count": 4
120 | }
121 | ],
122 | "source": [
123 | "test.head()"
124 | ]
125 | },
126 | {
127 | "cell_type": "code",
128 | "execution_count": 5,
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "# Check for null/missing values \n"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": null,
138 | "metadata": {},
139 | "outputs": [],
140 | "source": [
141 | "# Inspect the statistical measures\n"
142 | ]
143 | },
144 | {
145 | "cell_type": "markdown",
146 | "metadata": {},
147 | "source": [
148 | "### Second phase: Data Analysis and Visualization (this process is where you explore the data, clean it and infer some insights from it)\n",
149 | "Tasks:\n",
150 | "- Plot the gender distribution of passengers on board\n",
151 | "- Plot the rate of men and women who survived and who didn't survive\n",
152 | "- Plot the survival rate by \"Pclass\" (male and female counted together in each Pclass)\n",
153 | "- Plot the survival rate by males only in each \"Pclass\"\n",
154 | "- Plot the survival rate by females only in each \"Pclass\"\n",
155 | "\n",
156 | "Think about the plots that you just did, what can you infer from them?\n",
157 | "Now let's have a look at the columns; there are always unnecessary columns that do not contribute to the prediction.\n",
158 | "\n",
159 | "- Remove unnecessary columns from both train and test datasets\n",
160 | "- Handle categorical text values and turn them into numerical\n",
161 | "- Plot number of people who survived over age and passenger class\n",
162 | "\n",
163 | "- Deal with null values in the Age column\n",
164 | "- Group the data in the Age column into groups for better prediction.\n",
165 | "- Get survival rates by age groups \n",
166 | "- Plot the correlation between features and label\n",
167 | "\n",
168 | "\n",
169 | "What do you infer from the data? What can you conclude from it? Who is most likely to survive, based on the data?\n"
170 | ]
171 | },
172 | {
173 | "cell_type": "code",
174 | "execution_count": null,
175 | "metadata": {},
176 | "outputs": [],
177 | "source": []
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": null,
182 | "metadata": {},
183 | "outputs": [],
184 | "source": []
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "metadata": {},
190 | "outputs": [],
191 | "source": []
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "metadata": {},
196 | "source": [
197 | "### Third phase:\n",
198 | "### Building the model and testing the accuracy (this process is where you start building models and choose a model based on accuracy metrics)\n",
199 | "- Choose a model suitable for classification\n",
200 | "- Fit the data\n",
201 | "- Find out how \"well\" the model performs using some metrics\n",
202 | "- Use cross-validation to get the average accuracy of the model you chose\n",
203 | "- **Bonus 1** \n",
204 | " - Try other classification algorithms \n",
205 | " - Compare the accuracy metrics (including cross-validation) for the classification algorithms used by presenting them in a readable, easy to compare format\n",
206 | " - Choose a model with the best cross validation accuracy metric\n",
207 | "- **Bonus 2**: Get the feature importance of your features using random forests\n",
208 | " (if you are not sure what that is or how to do it, google it!)"
209 | ]
210 | },
211 | {
212 | "cell_type": "code",
213 | "execution_count": null,
214 | "metadata": {},
215 | "outputs": [],
216 | "source": []
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": null,
221 | "metadata": {},
222 | "outputs": [],
223 | "source": []
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": null,
228 | "metadata": {},
229 | "outputs": [],
230 | "source": []
231 | },
232 | {
233 | "cell_type": "markdown",
234 | "metadata": {},
235 | "source": []
236 | }
237 | ]
238 | }
239 |
--------------------------------------------------------------------------------
/001/solution/README.md:
--------------------------------------------------------------------------------
1 | # My Solution
2 |
3 | In this folder, you will find two approaches to modeling the Titanic survival prediction classification problem:
4 |
5 | ## Classical approach:
6 |
7 | `titanic_classical.ipynb`
8 |
9 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_classical.ipynb)
10 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_classical.ipynb)
11 |
12 |
13 |
14 | In this file, you will find a detailed data wrangling approach to clean and prepare the data. Moreover, classical machine learning classifiers from the list below are used:
15 |
16 | - Logistic Regression
17 | - Support Vector Machines
18 | - KNN or K-Nearest Neighbors
19 | - Decision Trees
20 | - SGDClassifier
21 | - Random Forest
22 | - Gaussian Naive Bayes
23 |
24 |
25 | ## Simple neural network in TensorFlow and Keras:
26 |
27 | `titanic_tf_nn.ipynb`
28 |
29 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_tf_nn.ipynb)
30 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_tf_nn.ipynb)
31 |
32 |
33 | In this file, you will find a simple 5-layer neural network approach performing the survival classification task, implemented in TensorFlow with Keras as the frontend/API.
34 |
35 | ## Simple neural network in PyTorch:
36 |
37 | `titanic_pt_nn.ipynb`
38 |
39 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_pt_nn.ipynb)
40 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_pt_nn.ipynb)
41 |
42 | In this file, you will find a simple 4-layer neural network similar to the solution above, but built with the PyTorch framework; a minimal sketch of such a model follows.
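
Layer widths and activations below are illustrative assumptions, and `n_features` stands in for the number of engineered input columns:

```python
import torch
import torch.nn as nn

n_features = 8  # assumed number of engineered input columns

# Hypothetical 4-layer MLP for binary survival classification
model = nn.Sequential(
    nn.Linear(n_features, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),  # outputs P(survived)
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```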
43 |
--------------------------------------------------------------------------------
/001/solution/prediction.csv:
--------------------------------------------------------------------------------
1 | PassengerId,Survived
2 | 892,0
3 | 893,0
4 | 894,0
5 | 895,0
6 | 896,0
7 | 897,0
8 | 898,0
9 | 899,0
10 | 900,0
11 | 901,0
12 | 902,0
13 | 903,0
14 | 904,1
15 | 905,0
16 | 906,1
17 | 907,0
18 | 908,0
19 | 909,0
20 | 910,0
21 | 911,0
22 | 912,0
23 | 913,0
24 | 914,1
25 | 915,0
26 | 916,0
27 | 917,0
28 | 918,0
29 | 919,0
30 | 920,0
31 | 921,0
32 | 922,0
33 | 923,0
34 | 924,0
35 | 925,0
36 | 926,0
37 | 927,0
38 | 928,0
39 | 929,0
40 | 930,0
41 | 931,0
42 | 932,0
43 | 933,0
44 | 934,0
45 | 935,0
46 | 936,1
47 | 937,0
48 | 938,0
49 | 939,0
50 | 940,1
51 | 941,0
52 | 942,0
53 | 943,0
54 | 944,0
55 | 945,0
56 | 946,0
57 | 947,0
58 | 948,0
59 | 949,0
60 | 950,0
61 | 951,1
62 | 952,0
63 | 953,0
64 | 954,0
65 | 955,0
66 | 956,0
67 | 957,0
68 | 958,0
69 | 959,0
70 | 960,0
71 | 961,0
72 | 962,0
73 | 963,0
74 | 964,0
75 | 965,0
76 | 966,0
77 | 967,0
78 | 968,0
79 | 969,1
80 | 970,0
81 | 971,0
82 | 972,0
83 | 973,0
84 | 974,0
85 | 975,0
86 | 976,0
87 | 977,0
88 | 978,0
89 | 979,0
90 | 980,0
91 | 981,0
92 | 982,0
93 | 983,0
94 | 984,0
95 | 985,0
96 | 986,0
97 | 987,0
98 | 988,0
99 | 989,0
100 | 990,0
101 | 991,0
102 | 992,0
103 | 993,0
104 | 994,0
105 | 995,0
106 | 996,0
107 | 997,0
108 | 998,0
109 | 999,0
110 | 1000,0
111 | 1001,0
112 | 1002,0
113 | 1003,0
114 | 1004,0
115 | 1005,0
116 | 1006,1
117 | 1007,0
118 | 1008,0
119 | 1009,0
120 | 1010,0
121 | 1011,0
122 | 1012,0
123 | 1013,0
124 | 1014,0
125 | 1015,0
126 | 1016,0
127 | 1017,0
128 | 1018,0
129 | 1019,0
130 | 1020,0
131 | 1021,0
132 | 1022,0
133 | 1023,0
134 | 1024,0
135 | 1025,0
136 | 1026,0
137 | 1027,0
138 | 1028,0
139 | 1029,0
140 | 1030,0
141 | 1031,0
142 | 1032,0
143 | 1033,0
144 | 1034,0
145 | 1035,0
146 | 1036,0
147 | 1037,0
148 | 1038,0
149 | 1039,0
150 | 1040,0
151 | 1041,0
152 | 1042,0
153 | 1043,0
154 | 1044,0
155 | 1045,0
156 | 1046,0
157 | 1047,0
158 | 1048,1
159 | 1049,0
160 | 1050,0
161 | 1051,0
162 | 1052,0
163 | 1053,0
164 | 1054,0
165 | 1055,0
166 | 1056,0
167 | 1057,0
168 | 1058,0
169 | 1059,0
170 | 1060,1
171 | 1061,0
172 | 1062,0
173 | 1063,0
174 | 1064,0
175 | 1065,0
176 | 1066,0
177 | 1067,0
178 | 1068,0
179 | 1069,0
180 | 1070,0
181 | 1071,0
182 | 1072,0
183 | 1073,0
184 | 1074,1
185 | 1075,0
186 | 1076,0
187 | 1077,0
188 | 1078,0
189 | 1079,0
190 | 1080,0
191 | 1081,0
192 | 1082,0
193 | 1083,0
194 | 1084,0
195 | 1085,0
196 | 1086,0
197 | 1087,0
198 | 1088,0
199 | 1089,0
200 | 1090,0
201 | 1091,0
202 | 1092,0
203 | 1093,0
204 | 1094,0
205 | 1095,0
206 | 1096,0
207 | 1097,0
208 | 1098,0
209 | 1099,0
210 | 1100,1
211 | 1101,0
212 | 1102,0
213 | 1103,0
214 | 1104,0
215 | 1105,0
216 | 1106,0
217 | 1107,0
218 | 1108,0
219 | 1109,0
220 | 1110,0
221 | 1111,0
222 | 1112,0
223 | 1113,0
224 | 1114,0
225 | 1115,0
226 | 1116,0
227 | 1117,0
228 | 1118,0
229 | 1119,0
230 | 1120,0
231 | 1121,0
232 | 1122,0
233 | 1123,0
234 | 1124,0
235 | 1125,0
236 | 1126,0
237 | 1127,0
238 | 1128,0
239 | 1129,0
240 | 1130,0
241 | 1131,0
242 | 1132,1
243 | 1133,0
244 | 1134,0
245 | 1135,0
246 | 1136,0
247 | 1137,0
248 | 1138,0
249 | 1139,0
250 | 1140,0
251 | 1141,0
252 | 1142,0
253 | 1143,0
254 | 1144,0
255 | 1145,0
256 | 1146,0
257 | 1147,0
258 | 1148,0
259 | 1149,0
260 | 1150,0
261 | 1151,0
262 | 1152,0
263 | 1153,0
264 | 1154,0
265 | 1155,0
266 | 1156,0
267 | 1157,0
268 | 1158,0
269 | 1159,0
270 | 1160,0
271 | 1161,0
272 | 1162,0
273 | 1163,0
274 | 1164,0
275 | 1165,0
276 | 1166,0
277 | 1167,0
278 | 1168,0
279 | 1169,0
280 | 1170,0
281 | 1171,0
282 | 1172,0
283 | 1173,0
284 | 1174,0
285 | 1175,0
286 | 1176,0
287 | 1177,0
288 | 1178,0
289 | 1179,0
290 | 1180,0
291 | 1181,0
292 | 1182,0
293 | 1183,0
294 | 1184,0
295 | 1185,0
296 | 1186,0
297 | 1187,0
298 | 1188,0
299 | 1189,0
300 | 1190,0
301 | 1191,0
302 | 1192,0
303 | 1193,0
304 | 1194,0
305 | 1195,0
306 | 1196,0
307 | 1197,1
308 | 1198,0
309 | 1199,0
310 | 1200,0
311 | 1201,0
312 | 1202,0
313 | 1203,0
314 | 1204,0
315 | 1205,0
316 | 1206,0
317 | 1207,0
318 | 1208,0
319 | 1209,0
320 | 1210,0
321 | 1211,0
322 | 1212,0
323 | 1213,0
324 | 1214,0
325 | 1215,0
326 | 1216,0
327 | 1217,0
328 | 1218,0
329 | 1219,0
330 | 1220,0
331 | 1221,0
332 | 1222,0
333 | 1223,0
334 | 1224,0
335 | 1225,0
336 | 1226,0
337 | 1227,0
338 | 1228,0
339 | 1229,0
340 | 1230,0
341 | 1231,0
342 | 1232,0
343 | 1233,0
344 | 1234,0
345 | 1235,0
346 | 1236,0
347 | 1237,0
348 | 1238,0
349 | 1239,0
350 | 1240,0
351 | 1241,0
352 | 1242,0
353 | 1243,0
354 | 1244,0
355 | 1245,0
356 | 1246,0
357 | 1247,0
358 | 1248,0
359 | 1249,0
360 | 1250,0
361 | 1251,0
362 | 1252,0
363 | 1253,0
364 | 1254,0
365 | 1255,0
366 | 1256,0
367 | 1257,0
368 | 1258,0
369 | 1259,0
370 | 1260,1
371 | 1261,0
372 | 1262,0
373 | 1263,1
374 | 1264,0
375 | 1265,0
376 | 1266,1
377 | 1267,0
378 | 1268,0
379 | 1269,0
380 | 1270,0
381 | 1271,0
382 | 1272,0
383 | 1273,0
384 | 1274,0
385 | 1275,0
386 | 1276,0
387 | 1277,0
388 | 1278,0
389 | 1279,0
390 | 1280,0
391 | 1281,0
392 | 1282,0
393 | 1283,0
394 | 1284,0
395 | 1285,0
396 | 1286,0
397 | 1287,0
398 | 1288,0
399 | 1289,0
400 | 1290,0
401 | 1291,0
402 | 1292,0
403 | 1293,0
404 | 1294,0
405 | 1295,0
406 | 1296,0
407 | 1297,0
408 | 1298,0
409 | 1299,0
410 | 1300,0
411 | 1301,0
412 | 1302,0
413 | 1303,0
414 | 1304,0
415 | 1305,0
416 | 1306,0
417 | 1307,0
418 | 1308,0
419 | 1309,0
420 |
--------------------------------------------------------------------------------
/001/solution/prediction_pt.csv:
--------------------------------------------------------------------------------
1 | PassengerId,Survived
2 | 892,0
3 | 893,0
4 | 894,0
5 | 895,0
6 | 896,0
7 | 897,0
8 | 898,1
9 | 899,0
10 | 900,1
11 | 901,0
12 | 902,0
13 | 903,0
14 | 904,1
15 | 905,0
16 | 906,1
17 | 907,1
18 | 908,0
19 | 909,0
20 | 910,1
21 | 911,0
22 | 912,0
23 | 913,0
24 | 914,1
25 | 915,1
26 | 916,1
27 | 917,0
28 | 918,1
29 | 919,0
30 | 920,0
31 | 921,0
32 | 922,0
33 | 923,0
34 | 924,0
35 | 925,0
36 | 926,1
37 | 927,0
38 | 928,1
39 | 929,1
40 | 930,0
41 | 931,0
42 | 932,0
43 | 933,0
44 | 934,0
45 | 935,1
46 | 936,1
47 | 937,0
48 | 938,0
49 | 939,0
50 | 940,1
51 | 941,0
52 | 942,0
53 | 943,0
54 | 944,1
55 | 945,1
56 | 946,0
57 | 947,0
58 | 948,0
59 | 949,0
60 | 950,0
61 | 951,1
62 | 952,0
63 | 953,0
64 | 954,0
65 | 955,1
66 | 956,1
67 | 957,1
68 | 958,1
69 | 959,0
70 | 960,0
71 | 961,1
72 | 962,1
73 | 963,0
74 | 964,1
75 | 965,0
76 | 966,1
77 | 967,0
78 | 968,0
79 | 969,1
80 | 970,0
81 | 971,1
82 | 972,0
83 | 973,0
84 | 974,0
85 | 975,0
86 | 976,0
87 | 977,0
88 | 978,1
89 | 979,1
90 | 980,1
91 | 981,0
92 | 982,1
93 | 983,0
94 | 984,1
95 | 985,0
96 | 986,1
97 | 987,0
98 | 988,1
99 | 989,0
100 | 990,1
101 | 991,0
102 | 992,1
103 | 993,0
104 | 994,0
105 | 995,0
106 | 996,1
107 | 997,0
108 | 998,0
109 | 999,0
110 | 1000,0
111 | 1001,0
112 | 1002,0
113 | 1003,1
114 | 1004,1
115 | 1005,1
116 | 1006,1
117 | 1007,0
118 | 1008,0
119 | 1009,1
120 | 1010,0
121 | 1011,1
122 | 1012,1
123 | 1013,0
124 | 1014,1
125 | 1015,0
126 | 1016,0
127 | 1017,1
128 | 1018,0
129 | 1019,1
130 | 1020,0
131 | 1021,0
132 | 1022,0
133 | 1023,0
134 | 1024,0
135 | 1025,0
136 | 1026,0
137 | 1027,0
138 | 1028,0
139 | 1029,0
140 | 1030,1
141 | 1031,0
142 | 1032,0
143 | 1033,1
144 | 1034,1
145 | 1035,0
146 | 1036,0
147 | 1037,0
148 | 1038,0
149 | 1039,0
150 | 1040,0
151 | 1041,0
152 | 1042,1
153 | 1043,0
154 | 1044,0
155 | 1045,0
156 | 1046,0
157 | 1047,0
158 | 1048,1
159 | 1049,1
160 | 1050,0
161 | 1051,1
162 | 1052,1
163 | 1053,0
164 | 1054,1
165 | 1055,0
166 | 1056,0
167 | 1057,0
168 | 1058,0
169 | 1059,0
170 | 1060,1
171 | 1061,1
172 | 1062,0
173 | 1063,0
174 | 1064,0
175 | 1065,0
176 | 1066,0
177 | 1067,1
178 | 1068,1
179 | 1069,0
180 | 1070,1
181 | 1071,1
182 | 1072,0
183 | 1073,1
184 | 1074,1
185 | 1075,0
186 | 1076,1
187 | 1077,0
188 | 1078,1
189 | 1079,0
190 | 1080,0
191 | 1081,0
192 | 1082,0
193 | 1083,0
194 | 1084,0
195 | 1085,0
196 | 1086,0
197 | 1087,0
198 | 1088,1
199 | 1089,1
200 | 1090,0
201 | 1091,1
202 | 1092,1
203 | 1093,0
204 | 1094,1
205 | 1095,1
206 | 1096,0
207 | 1097,0
208 | 1098,1
209 | 1099,0
210 | 1100,1
211 | 1101,0
212 | 1102,0
213 | 1103,0
214 | 1104,0
215 | 1105,1
216 | 1106,0
217 | 1107,0
218 | 1108,1
219 | 1109,0
220 | 1110,1
221 | 1111,0
222 | 1112,1
223 | 1113,0
224 | 1114,1
225 | 1115,0
226 | 1116,1
227 | 1117,1
228 | 1118,0
229 | 1119,1
230 | 1120,0
231 | 1121,0
232 | 1122,0
233 | 1123,1
234 | 1124,0
235 | 1125,0
236 | 1126,1
237 | 1127,0
238 | 1128,0
239 | 1129,0
240 | 1130,1
241 | 1131,1
242 | 1132,1
243 | 1133,1
244 | 1134,1
245 | 1135,0
246 | 1136,0
247 | 1137,0
248 | 1138,1
249 | 1139,0
250 | 1140,1
251 | 1141,1
252 | 1142,1
253 | 1143,0
254 | 1144,1
255 | 1145,0
256 | 1146,0
257 | 1147,0
258 | 1148,0
259 | 1149,0
260 | 1150,1
261 | 1151,0
262 | 1152,0
263 | 1153,0
264 | 1154,1
265 | 1155,1
266 | 1156,0
267 | 1157,0
268 | 1158,0
269 | 1159,0
270 | 1160,1
271 | 1161,0
272 | 1162,0
273 | 1163,0
274 | 1164,1
275 | 1165,1
276 | 1166,0
277 | 1167,1
278 | 1168,0
279 | 1169,0
280 | 1170,0
281 | 1171,0
282 | 1172,1
283 | 1173,0
284 | 1174,1
285 | 1175,1
286 | 1176,1
287 | 1177,0
288 | 1178,0
289 | 1179,0
290 | 1180,0
291 | 1181,0
292 | 1182,0
293 | 1183,1
294 | 1184,0
295 | 1185,0
296 | 1186,0
297 | 1187,0
298 | 1188,1
299 | 1189,0
300 | 1190,0
301 | 1191,0
302 | 1192,0
303 | 1193,0
304 | 1194,0
305 | 1195,0
306 | 1196,1
307 | 1197,1
308 | 1198,1
309 | 1199,0
310 | 1200,0
311 | 1201,0
312 | 1202,0
313 | 1203,0
314 | 1204,0
315 | 1205,1
316 | 1206,1
317 | 1207,1
318 | 1208,0
319 | 1209,0
320 | 1210,0
321 | 1211,0
322 | 1212,0
323 | 1213,0
324 | 1214,0
325 | 1215,0
326 | 1216,1
327 | 1217,0
328 | 1218,1
329 | 1219,0
330 | 1220,0
331 | 1221,0
332 | 1222,1
333 | 1223,0
334 | 1224,0
335 | 1225,1
336 | 1226,0
337 | 1227,0
338 | 1228,0
339 | 1229,0
340 | 1230,0
341 | 1231,0
342 | 1232,0
343 | 1233,0
344 | 1234,0
345 | 1235,1
346 | 1236,0
347 | 1237,1
348 | 1238,0
349 | 1239,1
350 | 1240,0
351 | 1241,1
352 | 1242,1
353 | 1243,0
354 | 1244,0
355 | 1245,0
356 | 1246,1
357 | 1247,0
358 | 1248,1
359 | 1249,0
360 | 1250,0
361 | 1251,0
362 | 1252,0
363 | 1253,1
364 | 1254,1
365 | 1255,0
366 | 1256,1
367 | 1257,0
368 | 1258,0
369 | 1259,1
370 | 1260,1
371 | 1261,0
372 | 1262,0
373 | 1263,1
374 | 1264,0
375 | 1265,0
376 | 1266,1
377 | 1267,1
378 | 1268,0
379 | 1269,0
380 | 1270,0
381 | 1271,0
382 | 1272,0
383 | 1273,0
384 | 1274,1
385 | 1275,1
386 | 1276,0
387 | 1277,1
388 | 1278,0
389 | 1279,0
390 | 1280,0
391 | 1281,0
392 | 1282,0
393 | 1283,1
394 | 1284,0
395 | 1285,0
396 | 1286,0
397 | 1287,1
398 | 1288,0
399 | 1289,1
400 | 1290,0
401 | 1291,0
402 | 1292,1
403 | 1293,0
404 | 1294,1
405 | 1295,0
406 | 1296,1
407 | 1297,0
408 | 1298,0
409 | 1299,1
410 | 1300,1
411 | 1301,1
412 | 1302,1
413 | 1303,1
414 | 1304,1
415 | 1305,0
416 | 1306,1
417 | 1307,0
418 | 1308,0
419 | 1309,0
420 |
--------------------------------------------------------------------------------
/001/solution/submission.csv:
--------------------------------------------------------------------------------
1 | PassengerId,Survived
2 | 892,0
3 | 893,1
4 | 894,0
5 | 895,0
6 | 896,1
7 | 897,0
8 | 898,1
9 | 899,0
10 | 900,1
11 | 901,0
12 | 902,0
13 | 903,0
14 | 904,1
15 | 905,0
16 | 906,1
17 | 907,1
18 | 908,0
19 | 909,0
20 | 910,1
21 | 911,1
22 | 912,1
23 | 913,0
24 | 914,1
25 | 915,1
26 | 916,1
27 | 917,0
28 | 918,1
29 | 919,0
30 | 920,0
31 | 921,0
32 | 922,0
33 | 923,0
34 | 924,1
35 | 925,1
36 | 926,1
37 | 927,0
38 | 928,1
39 | 929,1
40 | 930,0
41 | 931,0
42 | 932,0
43 | 933,0
44 | 934,0
45 | 935,1
46 | 936,1
47 | 937,0
48 | 938,1
49 | 939,0
50 | 940,1
51 | 941,1
52 | 942,1
53 | 943,0
54 | 944,1
55 | 945,1
56 | 946,0
57 | 947,0
58 | 948,0
59 | 949,0
60 | 950,0
61 | 951,1
62 | 952,0
63 | 953,0
64 | 954,0
65 | 955,1
66 | 956,0
67 | 957,1
68 | 958,1
69 | 959,0
70 | 960,1
71 | 961,1
72 | 962,1
73 | 963,0
74 | 964,1
75 | 965,1
76 | 966,1
77 | 967,1
78 | 968,0
79 | 969,1
80 | 970,0
81 | 971,1
82 | 972,0
83 | 973,0
84 | 974,0
85 | 975,0
86 | 976,0
87 | 977,0
88 | 978,1
89 | 979,1
90 | 980,1
91 | 981,0
92 | 982,1
93 | 983,0
94 | 984,1
95 | 985,0
96 | 986,1
97 | 987,0
98 | 988,1
99 | 989,0
100 | 990,1
101 | 991,0
102 | 992,1
103 | 993,0
104 | 994,0
105 | 995,0
106 | 996,1
107 | 997,0
108 | 998,0
109 | 999,0
110 | 1000,0
111 | 1001,0
112 | 1002,0
113 | 1003,1
114 | 1004,1
115 | 1005,1
116 | 1006,1
117 | 1007,0
118 | 1008,0
119 | 1009,1
120 | 1010,1
121 | 1011,1
122 | 1012,1
123 | 1013,0
124 | 1014,1
125 | 1015,0
126 | 1016,0
127 | 1017,1
128 | 1018,0
129 | 1019,1
130 | 1020,0
131 | 1021,0
132 | 1022,0
133 | 1023,1
134 | 1024,0
135 | 1025,0
136 | 1026,0
137 | 1027,0
138 | 1028,0
139 | 1029,0
140 | 1030,1
141 | 1031,0
142 | 1032,0
143 | 1033,1
144 | 1034,0
145 | 1035,0
146 | 1036,0
147 | 1037,0
148 | 1038,0
149 | 1039,0
150 | 1040,0
151 | 1041,0
152 | 1042,1
153 | 1043,0
154 | 1044,0
155 | 1045,1
156 | 1046,0
157 | 1047,0
158 | 1048,1
159 | 1049,1
160 | 1050,0
161 | 1051,1
162 | 1052,1
163 | 1053,0
164 | 1054,1
165 | 1055,0
166 | 1056,0
167 | 1057,1
168 | 1058,1
169 | 1059,0
170 | 1060,1
171 | 1061,1
172 | 1062,0
173 | 1063,0
174 | 1064,0
175 | 1065,0
176 | 1066,0
177 | 1067,1
178 | 1068,1
179 | 1069,1
180 | 1070,1
181 | 1071,1
182 | 1072,0
183 | 1073,1
184 | 1074,1
185 | 1075,0
186 | 1076,1
187 | 1077,0
188 | 1078,1
189 | 1079,0
190 | 1080,0
191 | 1081,0
192 | 1082,0
193 | 1083,0
194 | 1084,0
195 | 1085,0
196 | 1086,0
197 | 1087,0
198 | 1088,1
199 | 1089,1
200 | 1090,0
201 | 1091,1
202 | 1092,1
203 | 1093,0
204 | 1094,1
205 | 1095,1
206 | 1096,0
207 | 1097,1
208 | 1098,1
209 | 1099,0
210 | 1100,1
211 | 1101,0
212 | 1102,0
213 | 1103,0
214 | 1104,0
215 | 1105,1
216 | 1106,0
217 | 1107,0
218 | 1108,1
219 | 1109,0
220 | 1110,1
221 | 1111,0
222 | 1112,1
223 | 1113,0
224 | 1114,1
225 | 1115,0
226 | 1116,1
227 | 1117,1
228 | 1118,0
229 | 1119,1
230 | 1120,0
231 | 1121,0
232 | 1122,0
233 | 1123,1
234 | 1124,0
235 | 1125,0
236 | 1126,1
237 | 1127,0
238 | 1128,1
239 | 1129,0
240 | 1130,1
241 | 1131,1
242 | 1132,1
243 | 1133,1
244 | 1134,1
245 | 1135,0
246 | 1136,0
247 | 1137,0
248 | 1138,1
249 | 1139,0
250 | 1140,1
251 | 1141,1
252 | 1142,1
253 | 1143,0
254 | 1144,1
255 | 1145,0
256 | 1146,0
257 | 1147,0
258 | 1148,0
259 | 1149,0
260 | 1150,1
261 | 1151,0
262 | 1152,0
263 | 1153,0
264 | 1154,1
265 | 1155,1
266 | 1156,0
267 | 1157,0
268 | 1158,0
269 | 1159,0
270 | 1160,1
271 | 1161,0
272 | 1162,1
273 | 1163,0
274 | 1164,1
275 | 1165,1
276 | 1166,0
277 | 1167,1
278 | 1168,0
279 | 1169,0
280 | 1170,0
281 | 1171,0
282 | 1172,1
283 | 1173,0
284 | 1174,1
285 | 1175,1
286 | 1176,1
287 | 1177,0
288 | 1178,0
289 | 1179,0
290 | 1180,0
291 | 1181,0
292 | 1182,0
293 | 1183,1
294 | 1184,0
295 | 1185,0
296 | 1186,0
297 | 1187,0
298 | 1188,1
299 | 1189,0
300 | 1190,0
301 | 1191,0
302 | 1192,0
303 | 1193,0
304 | 1194,0
305 | 1195,0
306 | 1196,1
307 | 1197,1
308 | 1198,0
309 | 1199,0
310 | 1200,0
311 | 1201,1
312 | 1202,0
313 | 1203,0
314 | 1204,0
315 | 1205,1
316 | 1206,1
317 | 1207,1
318 | 1208,1
319 | 1209,0
320 | 1210,0
321 | 1211,0
322 | 1212,0
323 | 1213,0
324 | 1214,0
325 | 1215,0
326 | 1216,1
327 | 1217,0
328 | 1218,1
329 | 1219,1
330 | 1220,0
331 | 1221,0
332 | 1222,1
333 | 1223,1
334 | 1224,0
335 | 1225,1
336 | 1226,0
337 | 1227,0
338 | 1228,0
339 | 1229,0
340 | 1230,0
341 | 1231,0
342 | 1232,0
343 | 1233,0
344 | 1234,0
345 | 1235,1
346 | 1236,0
347 | 1237,1
348 | 1238,0
349 | 1239,1
350 | 1240,0
351 | 1241,1
352 | 1242,1
353 | 1243,0
354 | 1244,0
355 | 1245,0
356 | 1246,1
357 | 1247,0
358 | 1248,1
359 | 1249,0
360 | 1250,0
361 | 1251,1
362 | 1252,0
363 | 1253,1
364 | 1254,1
365 | 1255,0
366 | 1256,1
367 | 1257,0
368 | 1258,0
369 | 1259,1
370 | 1260,1
371 | 1261,0
372 | 1262,0
373 | 1263,1
374 | 1264,0
375 | 1265,0
376 | 1266,1
377 | 1267,1
378 | 1268,1
379 | 1269,0
380 | 1270,0
381 | 1271,0
382 | 1272,0
383 | 1273,0
384 | 1274,1
385 | 1275,1
386 | 1276,0
387 | 1277,1
388 | 1278,0
389 | 1279,0
390 | 1280,0
391 | 1281,0
392 | 1282,0
393 | 1283,1
394 | 1284,0
395 | 1285,0
396 | 1286,0
397 | 1287,1
398 | 1288,0
399 | 1289,1
400 | 1290,0
401 | 1291,0
402 | 1292,1
403 | 1293,0
404 | 1294,1
405 | 1295,0
406 | 1296,1
407 | 1297,0
408 | 1298,0
409 | 1299,1
410 | 1300,1
411 | 1301,1
412 | 1302,1
413 | 1303,1
414 | 1304,1
415 | 1305,0
416 | 1306,1
417 | 1307,0
418 | 1308,0
419 | 1309,0
420 |
--------------------------------------------------------------------------------
/002/exercise/readme.md:
--------------------------------------------------------------------------------
1 | # Exercise goal
2 | - Learn the basics of Machine Learning
3 | - Learn what an ML model is
4 | - Learn your first ML model, Linear Regression
5 | - Learn how to evaluate models through a 'Loss function'
6 | - Learn how to optimize a model through Gradient Descent (a minimal sketch follows this list)
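
As a taste of the last two goals, here is a minimal sketch of fitting a one-variable linear model by gradient descent on a mean-squared-error loss (toy data for illustration only; the notebook works through the real exercise):

```python
import numpy as np

# Toy data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

w, b = 0.0, 0.0  # model parameters
lr = 0.01        # learning rate

for _ in range(1000):
    y_hat = w * x + b            # model prediction
    error = y_hat - y
    dw = 2 * np.mean(error * x)  # gradient of MSE w.r.t. w
    db = 2 * np.mean(error)      # gradient of MSE w.r.t. b
    w -= lr * dw
    b -= lr * db

print(f'w = {w:.2f}, b = {b:.2f}')  # should approach 2 and 1
```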
7 |
8 | # Data
9 |
10 | The file `housing_prices.csv` (see [./data/housing_prices.csv](https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/002/data/housing_prices.csv)).
11 |
12 | Data source/credit: [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data).
13 |
14 | # Task
15 | - Follow the Jupyter Notebook and complete the required tasks:
16 |
17 |
18 | `linear_regression.ipynb`
19 |
20 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/002/exercise/linear_regression.ipynb)
21 |
22 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/002/exercise/linear_regression.ipynb)
23 |
24 |
25 | # Resources on linear regression
26 |
27 | - [Mandatory for Beginners] In-depth theoretical videos on Linear Regression from Andrew Ng's Machine Learning Course
28 | - [[Video] Linear Regression in one variable (Refer videos from 2.1 - 2.7)](https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN)
29 | - [[Video] Linear Regression with multiple variables (Refer videos from 4.1 - 4.7)](https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN)
30 | - [Optional] Alternative Explanation for Linear Regression
31 | - [[Video] StatQuest: Linear Models Pt.1 - Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo)
32 |
33 | - [Optional] Read this article if you have a basic idea of Linear Regression
34 | - [[Blog] Everything You Need To Know About Linear Regression](https://towardsdatascience.com/everything-you-need-to-know-about-linear-regression-b791e8f4bd7a)
35 | - [Mandatory for Beginners] After getting the theoretical background for Linear Regression, learn how to implement it practically using sklearn
36 | - [[Video] Linear Regression Python Sklearn [FROM SCRATCH]](https://www.youtube.com/watch?v=b0L47BeklTE)
37 |
38 |
--------------------------------------------------------------------------------
/002/solution/readme.md:
--------------------------------------------------------------------------------
1 | # My solution
2 | - Follow the solution notebook:
3 |
4 | `linear_regression.ipynb`
5 |
6 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/002/solution/linear_regression.ipynb)
7 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/002/solution/linear_regression.ipynb)
8 |
9 |
--------------------------------------------------------------------------------
/003/exercise/readme.md:
--------------------------------------------------------------------------------
1 | ## Description
2 | MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.
3 |
4 | In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.
5 |
6 | ## Practice Skills
7 | - Computer vision fundamentals including simple neural networks
8 |
9 | - Classification methods such as SVM and K-nearest neighbors
10 |
11 | ## Acknowledgements
12 | More details about the dataset, including algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html. The dataset is made available under a Creative Commons Attribution-Share Alike 3.0 license.
13 |
14 | At the practical level, if you are familiar with `keras`, you can access the data using:
15 |
16 | ```python
17 | from keras.datasets import mnist
18 |
19 | (x_train, y_train), (x_test, y_test) = mnist.load_data()
20 | ```
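
Continuing from the snippet above, a minimal two-layer dense network is one possible starting point (a sketch only, not a reference solution; layer sizes are illustrative):

```python
from keras.models import Sequential
from keras.layers import Dense, Flatten

model = Sequential([
    Flatten(input_shape=(28, 28)),    # 28x28 image -> 784 inputs
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),  # one output per digit class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Scale pixel values to [0, 1] before training
model.fit(x_train / 255.0, y_train, epochs=5, validation_split=0.1)
model.evaluate(x_test / 255.0, y_test)
```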
21 |
--------------------------------------------------------------------------------
/003/solution/README.md:
--------------------------------------------------------------------------------
1 | # My Solution
2 |
3 | In this folder, you will find a Jupyter notebook with the solution to the MNIST handwritten digit recognition problem, done via a simple neural network and a CNN:
4 |
5 |
6 |
7 | `digit_recog_nn.ipynb`
8 |
9 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/003/solution/digit_recog_nn.ipynb)
10 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/003/solution/digit_recog_nn.ipynb)
11 |
12 | ## Simple neural network:
13 |
14 | In this file, you will find a simple 2-layer neural network approach performing the digit recognition task with an accuracy of around 98%. This is done in TensorFlow, with Keras as the frontend/API.
15 |
16 |
17 | ## Convolutional Neural Network (CNN):
18 |
19 | In this file, you will also find a convolutional neural network approach that incorporates techniques such as Batch Normalization and Dropout to perform the digit recognition task, again in TensorFlow with Keras as the frontend/API.
20 | Its accuracy on the test set is above 99%. A rough sketch of such a CNN follows.
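
The layer sizes and the placement of Batch Normalization/Dropout below are illustrative assumptions, not the notebook's exact configuration:

```python
from keras.models import Sequential
from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                          Dropout, Flatten, Dense)

# A small CNN in the spirit described above (illustrative sizes)
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Conv2D(64, (3, 3), activation='relu'),
    BatchNormalization(),
    MaxPooling2D((2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```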
21 |
--------------------------------------------------------------------------------
/003/solution/images/ANN.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/images/ANN.jpg
--------------------------------------------------------------------------------
/003/solution/images/cnn-procedure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/images/cnn-procedure.png
--------------------------------------------------------------------------------
/003/solution/test_cnn.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/test_cnn.h5
--------------------------------------------------------------------------------
/003/solution/test_nn.h5:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/test_nn.h5
--------------------------------------------------------------------------------
/004/exercise/readme.md:
--------------------------------------------------------------------------------
1 | # Exercise goal
2 | - Implement simple code to train a text generation model using Keras and TensorFlow to produce a brand new Arctic Monkeys song!
3 |
4 | # Data
5 | - The file `AM.txt` (see [./data/AM.txt](https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/004/data/AM.txt)).
6 | - The data has been collected by combining the lyrics of multiple [Arctic Monkeys songs](https://www.arcticmonkeys.com/). You can also form your own dataset by combining the lyrics of songs from your favorite artists. The dataset should have a decent number of lines; follow the dataset provided with this exercise to understand the format.
7 |
8 | # Task
9 | - Tokenize the text (a minimal sketch follows this list).
10 | - Create a simple neural network with an LSTM layer to train on the text.
11 | - Use some seed text as input to the trained network to generate new lyrics.
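
A minimal sketch of the tokenization step, assuming the lyrics have been read into a list of strings called `lines` (the names here are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
vocab_size = len(tokenizer.word_index) + 1

# Build n-gram training sequences: every prefix of each line
# (of length >= 2) becomes one training example
sequences = []
for line in lines:
    tokens = tokenizer.texts_to_sequences([line])[0]
    for i in range(2, len(tokens) + 1):
        sequences.append(tokens[:i])

# Pad to a common length; the last token of each sequence is the label
max_len = max(len(s) for s in sequences)
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')
X, y = sequences[:, :-1], sequences[:, -1]
```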
12 |
--------------------------------------------------------------------------------
/004/solution/readme.md:
--------------------------------------------------------------------------------
1 | # My Solution
2 |
3 | In this folder, you will find the following Jupyter notebook using a simple LSTM neural network for text generation:
4 |
5 | `text_generation_model.ipynb`
6 |
7 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/004/solution/text_generation_model.ipynb)
8 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/004/solution/text_generation_model.ipynb)
9 |
10 | The first part of the notebook consists of tokenization of the text.
11 |
12 | This is followed by constructing a six-layer neural network using `TensorFlow`, consisting of an embedding layer, a bidirectional LSTM layer, a dropout layer, an LSTM layer, and finally two standard dense layers.
13 |
14 | After training, the seed text `I really like the Arctic Monkeys and ` is fed to the trained network to produce new lyrics. A rough sketch of the architecture described above follows.
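
Layer sizes in this sketch are assumptions rather than the notebook's exact values, and `vocab_size`/`max_len` are taken from the tokenization step:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential([
    Embedding(vocab_size, 100, input_length=max_len - 1),
    Bidirectional(LSTM(150, return_sequences=True)),
    Dropout(0.2),
    LSTM(100),
    Dense(128, activation='relu'),
    Dense(vocab_size, activation='softmax'),  # predict the next word
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
```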
15 |
--------------------------------------------------------------------------------
/005/exercise/readme.md:
--------------------------------------------------------------------------------
1 | # Exercise Statement
2 |
3 | In this exercise, you will learn how to analyse the sentiment of text data using various conventional and advanced algorithms, together with textual data processing techniques. The models covered (see the sketch after this list) include:
4 |
5 | * Linear SVM
6 | * Decision tree
7 | * Naive Bayes
8 | * Logistic regression
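
All four classifiers share scikit-learn's `fit`/`predict` interface, so a minimal comparison sketch looks like this (it assumes lists `texts` and `labels` built from the review JSON; the names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Bag-of-words features
vectorizer = CountVectorizer(binary=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

for clf in [LinearSVC(), DecisionTreeClassifier(),
            MultinomialNB(), LogisticRegression(max_iter=1000)]:
    clf.fit(X_train_vec, y_train)
    print(type(clf).__name__, clf.score(X_test_vec, y_test))
```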
9 |
10 | # Prerequisites
11 |
12 | This exercise progresses from basic methods to advanced ones, so there are no hard prerequisites. That said, familiarity with the basic ML workflow is recommended to grasp things conveniently.
13 |
14 | # Data source/summary:
15 | The two files in `./data` contain product reviews and metadata from the Amazon online sales platform.
16 | They have been taken from [jmcauley.ucsd.edu/data/amazon](http://jmcauley.ucsd.edu/data/amazon/).
17 |
--------------------------------------------------------------------------------
/005/solution/readme.md:
--------------------------------------------------------------------------------
1 | # My Solution
2 |
3 | In this folder, you will find the following Jupyter notebook using
4 | a variety of classification methods, including linear SVM, decision tree, naive Bayes and logistic regression:
5 |
6 | `sentiment_analysis.ipynb`
7 |
8 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/005/solution/sentiment_analysis.ipynb)
9 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/005/solution/sentiment_analysis.ipynb)
10 |
11 | The first part of the notebook consists of loading and exploring the data.
12 |
13 | This is followed by a simple bag-of-words vectorization.
14 |
15 | Finally, we will perform training and evaluation of the classification task using the aforementioned models.
16 |
--------------------------------------------------------------------------------
/005/solution/sentiment_analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "(HTML banner content omitted)"
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 2,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "import numpy as np\n",
21 | "import random\n",
22 | "from sklearn.model_selection import train_test_split\n",
23 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
24 | "from sklearn.metrics import f1_score\n",
25 | "\n",
26 | "import json, urllib.request"
27 | ]
28 | },
29 | {
30 | "cell_type": "markdown",
31 | "metadata": {},
32 | "source": [
33 | "### Load Data"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {},
39 | "source": [
40 | "##### Data Class"
41 | ]
42 | },
43 | {
44 | "cell_type": "markdown",
45 | "metadata": {},
46 | "source": [
47 | "Our first model will be automatically classifying positive and negative comments"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "Creating data to train the models is not an good approach, getting data by some other sources or by web crawling is one the best techniques, for negative and positive sentence data you can craw the amazon's review column of any product."
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 3,
60 | "metadata": {
61 | "tags": []
62 | },
63 | "outputs": [
64 | {
65 | "output_type": "stream",
66 | "name": "stdout",
67 | "text": "I bought both boxed sets, books 1-5. Really a great series! Start book 1 three weeks ago and just finished book 5. Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved! Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page! These are books you won't be disappointed with.\n5.0\n"
68 | }
69 | ],
70 | "source": [
71 | "# Storing the Path of file in a variable\n",
72 | "project_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/005/'\n",
73 | "file_name = project_url+'data/books_small_10000.json'\n",
74 | "\n",
75 | "# Opening JSON file and reading it line by line.\n",
76 | "with urllib.request.urlopen(file_name) as f:\n",
77 | " for line in f:\n",
78 | " review = json.loads(line)\n",
79 | " # Getting review text\n",
80 | " print(review['reviewText'])\n",
81 | " # Getting the Overall rating\n",
82 | " print(review['overall'])\n",
83 | " break"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 4,
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "# Storing the Path of file in a variable\n",
93 | "file_name = project_url+'data/books_small_10000.json'\n",
94 | "\n",
95 | "# Create empty list to store tuple objects of every data\n",
96 | "reviews = []\n",
97 | "\n",
98 | "# Opening JSON file and reading it line by line.\n",
99 | "with urllib.request.urlopen(file_name) as f:\n",
100 | " for line in f:\n",
101 | " review = json.loads(line)\n",
102 | " reviews.append((review['reviewText'], review['overall']))"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": 4,
108 | "metadata": {},
109 | "outputs": [
110 | {
111 | "data": {
112 | "text/plain": [
113 | "5.0"
114 | ]
115 | },
116 | "execution_count": 4,
117 | "metadata": {},
118 | "output_type": "execute_result"
119 | }
120 | ],
121 | "source": [
122 | "# Printing Random object from the reviews\n",
123 | "reviews[5]\n",
124 | "reviews[5][1]"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 5,
130 | "metadata": {},
131 | "outputs": [],
132 | "source": [
133 | "# Now creating a Data class for our reviews and we are gonn initialise it with text and score\n",
134 | "# Now insted of appending this tuple we will create a Review object and pass in text and score.\n",
135 | "\n",
136 | "class Review:\n",
137 | " def __init__(self, text, score):\n",
138 | " self.text = text\n",
139 | " self.score = score\n",
140 | " \n",
141 | "\n",
142 | " # Storing the Path of file in a variable\n",
143 | "file_name = project_url+'data/books_small_10000.json'\n",
144 | "\n",
145 | "# Create empty list to store tuple objects of every data\n",
146 | "reviews = []\n",
147 | "\n",
148 | "# Opening JSON file and reading it line by line.\n",
149 | "with urllib.request.urlopen(file_name) as f:\n",
150 | " for line in f:\n",
151 | " review = json.loads(line)\n",
152 | " reviews.append(Review(review['reviewText'], review['overall']))\n"
153 | ]
154 | },
155 | {
156 | "cell_type": "code",
157 | "execution_count": 6,
158 | "metadata": {},
159 | "outputs": [
160 | {
161 | "data": {
162 | "text/plain": [
163 | "5.0"
164 | ]
165 | },
166 | "execution_count": 6,
167 | "metadata": {},
168 | "output_type": "execute_result"
169 | }
170 | ],
171 | "source": [
172 | "# Getting score \n",
173 | "reviews[5].score"
174 | ]
175 | },
176 | {
177 | "cell_type": "code",
178 | "execution_count": 6,
179 | "metadata": {},
180 | "outputs": [
181 | {
182 | "output_type": "execute_result",
183 | "data": {
184 | "text/plain": "'I hoped for Mia to have some peace in this book, but her story is so real and raw. Broken World was so touching and emotional because you go from Mia\\'s trauma to her trying to cope. I love the way the story displays how there is no \"just bouncing back\" from being sexually assaulted. Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings. I found myself wishing I could give her some of my courage and strength or even just to be there for her. Thank you Lizzy for putting a great character\\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\\'s.'"
185 | },
186 | "metadata": {},
187 | "execution_count": 6
188 | }
189 | ],
190 | "source": [
191 | "reviews[5].text"
192 | ]
193 | },
194 | {
195 | "cell_type": "code",
196 | "execution_count": 8,
197 | "metadata": {},
198 | "outputs": [],
199 | "source": [
200 | "import random\n",
201 | "\n",
202 | "class Sentiment:\n",
203 | " NEGATIVE = 'NEGATIVE'\n",
204 | " NEUTRAL = 'NEUTRAL'\n",
205 | " POSITIVE = 'POSITIVE'\n",
206 | "\n",
207 | "class Review:\n",
208 | " def __init__(self, text, score):\n",
209 | " self.text = text\n",
210 | " self.score = score\n",
211 | " # initialising sentiments 4/5 stars means +ve and 1/2 stars means -ve\n",
212 | " self.sentiments = self.get_sentiments()\n",
213 | " \n",
214 | " def get_sentiments(self):\n",
215 | " if self.score <= 2:\n",
216 | " return Sentiment.NEGATIVE\n",
217 | " elif self.score >= 4:\n",
218 | " return Sentiment.POSITIVE\n",
219 | " elif self.score == 3:\n",
220 | " return Sentiment.NEUTRAL\n",
221 | " \n",
222 | "\n",
223 | "class ReviewContainer:\n",
224 | " def __init__(self, reviews):\n",
225 | " self.reviews = reviews\n",
226 | " \n",
227 | " def get_text(self):\n",
228 | " [x.text for x in training]\n",
229 | " \n",
230 | " def get_sentiment(self):\n",
231 | " [x.sentiment for x in self.sentiment]\n",
232 | " \n",
233 | " def evenly_distribute(self):\n",
234 | " negative = list(filter(lambda x: x.sentiments == Sentiment.NEGATIVE, self.reviews))\n",
235 | "# Its looking all the reviews mapping every sentiment, its basically filtering based upon negative sentiments, keeping \n",
236 | "# track of that in the negative list\n",
237 | " positive = list(filter(lambda x: x.sentiments == Sentiment.POSITIVE, self.reviews))\n",
238 | "# Distribute evenly in the prep Data cell 10.\n",
239 | " positive_shrunk = positive[:len(negative)]\n",
240 | " self.reviews = negative + positive_shrunk\n",
241 | " # Shuffle so u wont know what comes when\n",
242 | " random.shuffle(self.reviews)"
243 | ]
244 | },
245 | {
246 | "cell_type": "markdown",
247 | "metadata": {},
248 | "source": [
249 | "### Load Data"
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": 7,
255 | "metadata": {},
256 | "outputs": [],
257 | "source": [
258 | "# Storing the Path of file in a variable\n",
259 | "file_name =project_url+ 'data/books_small_10000.json'\n",
260 | "\n",
261 | "# Create empty list to store tuple objects of every data\n",
262 | "reviews = []\n",
263 | "\n",
264 | "# Opening JSON file and reading it line by line.\n",
265 | "with urllib.request.urlopen(file_name) as f:\n",
266 | " for line in f:\n",
267 | " review = json.loads(line)\n",
268 | " reviews.append(Review(review['reviewText'], review['overall']))"
269 | ]
270 | },
271 | {
272 | "cell_type": "code",
273 | "execution_count": 10,
274 | "metadata": {},
275 | "outputs": [
276 | {
277 | "data": {
278 | "text/plain": [
279 | "'POSITIVE'"
280 | ]
281 | },
282 | "execution_count": 10,
283 | "metadata": {},
284 | "output_type": "execute_result"
285 | }
286 | ],
287 | "source": [
288 | "reviews[5].sentiments"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "metadata": {},
294 | "source": [
295 | "Now the Machine learning algorithms loves the numerical data and its kinda hard to work with the strings,So what we are gonna doing here, we will be using count vectoriser here and break down the sentence in a dictionary\n",
296 | "\n",
297 | "like we have two sentences,\n",
298 | "1. This book is great !\n",
299 | "2. This book was so bad.\n",
300 | "\n",
301 | "So the dictionary of the words will include This, book, is, great, was, so, bad.\n",
302 | "so we will map these dict with the sentences itself to see what words does a sentence have so\n",
303 | "\n",
304 | " This book is great was so bad\n",
305 | " 1. This book is great ! 1 | 1 | 1 | 1 | 0 |0 | 0|\n",
306 | " 2. This book was so bad 1 | 1 | 0 | 0 | 1 |1 | 1|\n",
307 | " 3. Was a great book 0 | 1 | 0 | 1 | 1 |0 | 0|\n",
308 | "\n",
309 | "\n",
310 | "so 1 means sentence have that word and 0 means sentence doesnt have that word, 3 rd sentence is that sentence we have never seen before but we can also map that using the knowlegede of previous words in the dictionary but we cant handle 'a' here because that is not included in the dictionary during the training time."
311 | ]
312 | },
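{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this mapping, we can run `CountVectorizer` on the three example sentences above (the `example_` names are ours, not part of the main pipeline; the column order is the fitted vocabulary, which is alphabetical):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"example_sentences = ['This book is great !', 'This book was so bad', 'Was a great book']\n",
"example_vectorizer = CountVectorizer()\n",
"example_vectors = example_vectorizer.fit_transform(example_sentences)\n",
"\n",
"# Columns are the learned vocabulary; rows are the sentences\n",
"print(example_vectorizer.get_feature_names_out()) # use get_feature_names() on scikit-learn < 1.0\n",
"print(example_vectors.toarray())"
]
},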
313 | {
314 | "cell_type": "code",
315 | "execution_count": 11,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "# Using train test split to split data into test and training\n",
320 | "# What list you are passing here you will get 2 times of that\n",
321 | "\n",
322 | "training, test = train_test_split(reviews, test_size = 0.33, random_state = 42)\n",
323 | "\n",
324 | "cont = ReviewContainer(training)\n",
325 | "# We will use evenly distribute method.\n",
326 | "\n",
327 | "cont.evenly_distribute()\n"
328 | ]
329 | },
330 | {
331 | "cell_type": "code",
332 | "execution_count": 12,
333 | "metadata": {},
334 | "outputs": [
335 | {
336 | "data": {
337 | "text/plain": [
338 | "6700"
339 | ]
340 | },
341 | "execution_count": 12,
342 | "metadata": {},
343 | "output_type": "execute_result"
344 | }
345 | ],
346 | "source": [
347 | "len(training)"
348 | ]
349 | },
350 | {
351 | "cell_type": "code",
352 | "execution_count": 13,
353 | "metadata": {},
354 | "outputs": [],
355 | "source": [
356 | "# Now we build a classifier on that training set and after that we will test everything on our test data\n",
357 | "# We will have to pass our tarining data into the vectorizer, as we have to take text and predict if it is +ve or -ve \n",
358 | "# So what we are gonna pass into vectorizer is X which is our sentence and y = sentiments corresponding to that."
359 | ]
360 | },
361 | {
362 | "cell_type": "code",
363 | "execution_count": 14,
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "train_x = [x.text for x in training]\n",
368 | "train_y = [x.sentiments for x in training]\n",
369 | "\n",
370 | "\n",
371 | "test_x = [x.text for x in test]\n",
372 | "test_y = [x.sentiments for x in test]\n",
373 | "\n",
374 | "# We are getting the same text again\n",
375 | "# train_x[0]\n",
376 | "# train_y[0]\n"
377 | ]
378 | },
379 | {
380 | "cell_type": "markdown",
381 | "metadata": {},
382 | "source": [
383 | "### Bag of Words Vectorization"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "In order to perform machine learning on text documents, we first need to turn the content into numerical feature vectors "
391 | ]
392 | },
393 | {
394 | "cell_type": "code",
395 | "execution_count": 15,
396 | "metadata": {},
397 | "outputs": [],
398 | "source": [
399 | "vectorizer = CountVectorizer()\n",
400 | "\n",
401 | "# Transforming String Data to numerical data.\n",
402 | "# Now this is the main data we want to use while training.\n",
403 | "train_x_vectors = vectorizer.fit_transform(train_x) # These are 2 steps fit and transform, we can also do them individually.\n",
404 | "test_x_vectors = vectorizer.transform(test_x) # We just wanna transform the test data not to fit that.\n",
405 | "\n",
406 | "\n",
407 | "# So now our main data will be train_x_vectors and train_y and we wanna fit our data around these."
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "### Classification"
415 | ]
416 | },
417 | {
418 | "cell_type": "markdown",
419 | "metadata": {},
420 | "source": [
421 | "#### Linear SVM"
422 | ]
423 | },
424 | {
425 | "cell_type": "code",
426 | "execution_count": 16,
427 | "metadata": {},
428 | "outputs": [],
429 | "source": [
430 | "from sklearn import svm\n",
431 | "\n",
432 | "clf_svm = svm.SVC(kernel = 'linear')"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": 17,
438 | "metadata": {},
439 | "outputs": [
440 | {
441 | "data": {
442 | "text/plain": [
443 | "array(['POSITIVE'], dtype='\n",
35 | "\n",
36 | "As already mentioned, naive bayes is a probabilistic model and depends on bayes theorem for prediction. Hence, understanding of conditional probability and bayes theorem is the key to understanding this algorithm.\n",
37 | "\n",
38 | "Conditional probability is defined as the probability of an event occuring given that another event has already occured. For example, suppose we rolled a dice and we know that the number that came out is an even number. Now if we want to find the probability of getting a 2 on the dice, it is expressed using conditional probability. Mathematically, conditional probability is defined as follows:-"
39 | ]
40 | },
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "$$ \\large \\large P(A|B) = \\frac{P(A \\bigcap B)}{P(B)} $$"
46 | ]
47 | },
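{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the die example above, let A be the event of rolling a 2 and B the event of rolling an even number. Then $P(A \\bigcap B) = 1/6$ (only the roll 2 satisfies both) and $P(B) = 1/2$, so $P(A|B) = \\frac{1/6}{1/2} = \\frac{1}{3}$."
]
},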
48 | {
49 | "cell_type": "markdown",
50 | "metadata": {},
51 | "source": [
52 | "Bayes theorem is a very elegant theroem based on conditional probability that describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is named after Thomas Bayes. It is mathematically defined as follows:-\n",
53 | "\n",
54 | "$$ \\large P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n",
55 | "\n",
56 | "where A and B are two events.\n",
57 | "\n",
58 | "Each term in the above equation have been given a special name:-\n",
59 | "\n",
60 | "P(A|B) is known as Posterior Probability \n",
61 | "P(B|A) is known as Likelihood Probability \n",
62 | "P(A) is known as Prior Probability, and \n",
63 | "P(B) is known as Evidence Probability "
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## The Mathematical model of Naive Bayes Algorithm \n",
71 | "\n",
72 | "Suppose we have a point $x$ as follows:-\n",
73 | "\n",
74 | "$$ \\large x=[x_1, x_2, x_3,....,x_n]$$\n",
75 | "\n",
76 | "Our task is to assign a class or a label to this point 'k'. If we have 'k' classes, then we have to find the probability of the point $x$ belonging to class $C_k$. The class with highest probability will be assigned as the label of $x$. The probablity of a class $C_k$ given $x$ can be calculated using Bayes Theorem as follows:-\n",
77 | "\n",
78 | "$$\\large P(C_k|x) = \\frac{P(x|C_k)P(C_k)}{P(x)} \\;\\;\\;\\;\\;\\;\\; - \\;\\;(i)$$\n",
79 | "\n",
80 | "So, to summarize, if our dataset has 3 classes (setosa, virginica and versicolor for example, then we have to calculate P(setosa|x), P(virginica|x) and P(versicolor|x) and the highest probability will be assigned as the label x.\n",
81 | "\n",
82 | "Now, in our algorithm, we can omit the Evidence term, because is will remain constant for all the probabilities. This is done just to simplify the computations.\n",
83 | "\n",
84 | "Now, $P(C_k|x)$ can also be written as $P(C_k, x)$, and if we replace $x$ with its value, we get $P(C_k, x_1, x_2, x_3, ...., x_n)$. So, till now, we have basically transformed $P(C_k, x)$ into $$P(C_k, x_1, x_2, x_3, ...., x_n)\\;\\;\\;\\;\\;\\;\\; -\\;\\;\\;-(ii) $$. Things will start to get interesting now. \n",
85 | "\n",
86 | "In eq (ii), we can interchanging the terms inside the parenthesis won't change it's meaning. So, I am shifting the $C_k$ to the end. So, our equation will look like this - $P(x_1, x_2, x_3, ...., x_n, C_k)$. Now, if consider $x_1$ as event A and remaining terms as event B and apply bayes theorem, we will get:- \n",
87 | "\n",
88 | "$$\\large P(x_1, x_2, x_3, ...., x_n, C_k) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)\\;\\;\\;\\;\\;\\;-(iii)\\;\\;\\;$$\n",
89 | "\n",
90 | "(Omitting the deniminator term as discussed [here](#omitting))\n",
91 | "\n",
92 | "If we keep applying bayes theorem in equation (iii), we will get:-\n",
93 | "\n",
94 | "$ \\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(C_k, x_1, x_2, x_3, ...., x_n)$ \n",
95 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)$ \n",
96 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k)$ \n",
97 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = \\;\\;...$ \n",
98 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k) \\;\\;-(iv)$ \n",
99 | "\n",
100 | "Now, in naive bayes, we assume that the features are conditionally independent of each other. If features are independent then:-\n",
101 | "\n",
102 | "$$ \\large P(x_i|x_{i+1}, ..., x_n, C_k) = P(x_i|C_k)$$\n",
103 | "\n",
104 | "If we apply this rule in equation (iv) we get:-\n",
105 | "\n",
106 | "$\\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k)$\n",
107 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k).....$ \n",
108 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)\\pi_{i=0}^{n}P(x_i|C_k)$\n",
109 | "\n",
110 | "Hence,\n",
111 | "\n",
112 | "$$ \\large P(C_k|x_n) = P(C_k)\\prod_{i=0}^{n}P(x_i|C_k) $$\n",
113 | "\n",
114 | "This is how we predict in Naive Bayes Algorithm"
115 | ]
116 | },
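{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, since $\\log$ is monotonic, we can predict with $\\hat{y} = \\arg\\max_k \\left[ \\log P(C_k) + \\sum_{i=1}^{n} \\log P(x_i|C_k) \\right]$, which replaces a long product of small numbers with a numerically stable sum. This log form is the one you are asked to implement in the exercise below."
]
},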
117 | {
118 | "cell_type": "markdown",
119 | "metadata": {},
120 | "source": [
121 | "## Naive Bayes with a simple example\n",
122 | "\n",
123 | "To understand the naive bayes with ta simple example, check out [this](http://shatterline.com/blog/2013/09/12/not-so-naive-classification-with-the-naive-bayes-classifier/) blog by [shatterline](http://shatterline.com/blog)."
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {},
129 | "source": [
130 | "## Exercise\n",
131 | "\n",
132 | "The objective of this excercise is to solidify your understanding of Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. After implementing this class, you will run it on the same dataset shown in [this](#ex) example.\n",
133 | "\n",
134 | "You can refer the solution if you feel stuck somewhere. Also, one thing you need to be make sure is to use log of probabilities that you calculate. This is important because the probabilities you calculate will have very large decimal places but python will store only the 1st 16 places. This could lead to some discrepancies in the result so make sure to use log probabilities. If would also encourage you to comment your code as it is a good practice.\n",
135 | "\n",
136 | "Good luck"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "metadata": {},
142 | "source": [
143 | "## Importing the libraries and loding the data"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "metadata": {},
150 | "outputs": [],
151 | "source": [
152 | "import numpy as np\n",
153 | "import pandas as pd"
154 | ]
155 | },
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": [
162 | "df = pd.DataFrame(\n",
163 | " { \n",
164 | " 'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'], \n",
165 | " 'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', \"Hot\", 'Mild'],\n",
166 | " 'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],\n",
167 | " 'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],\n",
168 | " 'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']\n",
169 | " }\n",
170 | ")"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "df"
180 | ]
181 | },
182 | {
183 | "cell_type": "code",
184 | "execution_count": null,
185 | "metadata": {},
186 | "outputs": [],
187 | "source": [
188 | "X = df.iloc[:, :-1]\n",
189 | "y = df.iloc[:, -1:]"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "metadata": {},
195 | "source": [
196 | "# Start coding here..."
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": null,
202 | "metadata": {},
203 | "outputs": [],
204 | "source": [
205 | "class NaiveBayes:\n",
206 | " def __init__(self, X, y):\n",
207 | " '''\n",
208 | " This method initializes all the required data fields of the NaiveBayes class\n",
209 | " \n",
210 | " Input -\n",
211 | " X: A pandas dataframe consisting of all the dependent variables\n",
212 | " y: A pandas dataframe consisting of labels\n",
213 | " '''\n",
214 | " # Initializing the Dependent and Independent Variables\n",
215 | " self.X = X\n",
216 | " self.y = y\n",
217 | " \n",
218 | " # Initializing the column name y. (came in handy for me. If you do not require it, then you can delete it)\n",
219 | " self.y_label = y.columns[0]\n",
220 | " \n",
221 | " # Initializing the variables to store class priors. Initiallt set to None. The will the assigned the correct values by\n",
222 | " # executing the calculate_prior method\n",
223 | " # p_pos is probability of positive class\n",
224 | " # p_neg is probability of negative class\n",
225 | " self.p_pos = None\n",
226 | " self.p_neg = None\n",
227 | " \n",
228 | " # A dictionary to store all likelihood probabilities\n",
229 | " self.likelihoods = {}\n",
230 | " \n",
231 | " # Executing calculate_prior and calculate_likelihood to calculate prior and likelihood probabilities\n",
232 | " self.calculate_prior()\n",
233 | " self.calculate_likelihood()\n",
234 | " \n",
235 | " \n",
236 | " def calculate_prior(self):\n",
237 | " '''\n",
238 | " Method for calculating the prior probabilities\n",
239 | " \n",
240 | " Input - None\n",
241 | " \n",
242 | " Expected output: Expected to assign p_pos and p_neg their correct log probability values. No need to return anything\n",
243 | " '''\n",
244 | " # write your code here ...\n",
245 | " \n",
246 | " def calculate_likelihood(self):\n",
247 | " '''\n",
248 | " Method for calculating the all the likelihood probabilities\n",
249 | " \n",
250 | " Input - None\n",
251 | " \n",
252 | " Expected output: Expected to create a dictionary of likelihood probabilities and assign it to likelihoods.\n",
253 | " '''\n",
254 | " # write your code here ...\n",
255 | " \n",
256 | " def predict(self, test_data):\n",
257 | " '''\n",
258 | " A method to predict the label for the input\n",
259 | " \n",
260 | " Input -\n",
261 | " test_data: A dataframe of dependent variables\n",
262 | " \n",
263 | " Expected output: Expected to return a dataframe of predictions. The column name of dataframe should match column name of y\n",
264 | " '''\n",
265 | " # write your code here ...\n"
266 | ]
267 | },
268 | {
269 | "cell_type": "markdown",
270 | "metadata": {},
271 | "source": [
272 | "# Test if your Code is working as expected..."
273 | ]
274 | },
275 | {
276 | "cell_type": "code",
277 | "execution_count": null,
278 | "metadata": {},
279 | "outputs": [],
280 | "source": [
281 | "# Create the object\n",
282 | "nb = NaiveBayes(X, y)"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {},
289 | "outputs": [],
290 | "source": [
291 | "# Check if your code is predicting correctly\n",
292 | "assert nb.predict(X).equals(pd.DataFrame({'Play': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']})) == True, 'The prediction received is wrong. Kindly recheck your code. Refer the solution if you find yourself stuck somewhere'"
293 | ]
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": []
299 | }
300 | ],
301 | "metadata": {
302 | "kernelspec": {
303 | "display_name": "Python 3",
304 | "language": "python",
305 | "name": "python3"
306 | },
307 | "language_info": {
308 | "codemirror_mode": {
309 | "name": "ipython",
310 | "version": 3
311 | },
312 | "file_extension": ".py",
313 | "mimetype": "text/x-python",
314 | "name": "python",
315 | "nbconvert_exporter": "python",
316 | "pygments_lexer": "ipython3",
317 | "version": "3.8.3"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 4
322 | }
323 |
--------------------------------------------------------------------------------
/008/exercise/README.md:
--------------------------------------------------------------------------------
1 | # Naive Bayes Classifier from scratch
2 |
3 | ## Exercise
4 |
5 | The objective of this exercise is to solidify your understanding of the Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. The exercise notebook contains detailed notes on conditional probability, Bayes' theorem and the Naive Bayes algorithm, along with some starter code and instructions for the parts you have to complete.
6 |
7 | You can refer to the solution if you feel stuck somewhere. Also, one thing you need to make sure of is to use the log of the probabilities you calculate. This is important because multiplying many small probabilities together quickly exceeds the precision of a Python float, which stores only about 16 significant digits, and that can lead to discrepancies in the result, so make sure to use log probabilities. I would also encourage you to comment your code, as it is good practice.
8 |
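As a quick sketch of why the log matters, multiplying many small likelihoods underflows a float, while summing their logs stays finite:

```python
import numpy as np

probs = np.full(1000, 1e-4)   # 1000 likelihoods of 1e-4 each
print(np.prod(probs))         # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))  # about -9210.34 -- the log-sum is perfectly usable
```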
9 | Good luck
--------------------------------------------------------------------------------
/008/solution/NaiveBayes Solution.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Naive Bayes Classifier from scratch"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
14 | "## Table of Contents\n",
15 | "\n",
16 | "- [Introduction](#Introduction)\n",
17 | "- [Conditional Probability and Bayes Theorem refresher](#cp-and-bt)\n",
18 | "- [The Mathematical model of Naive Bayes Algorithm](#mm)\n",
19 | "- [Naive Bayes with a simple example](#ex)\n",
20 | "- [Exercise](#exer)\n",
21 | "\n",
22 | "\n",
23 | "## Introduction\n",
24 | "\n",
25 | "Naive Bayes Algorithm is a probability based classification technique for classifying labelled data. It makes use of Bayes Theorem in order to predict the class of the given data. It is called Naive because it makes an assumption that the features of the dependent variable are mutually independent of each other. Although it is named naive, it is a very efficient model, often used as a baseline for text classification and recommender systems.\n",
26 | "\n",
27 | "## Conditional Probability and Bayes Theorem Refresher \n",
28 | "\n",
29 | "As already mentioned, naive bayes is a probabilistic model and depends on bayes theorem for prediction. Hence, understanding of conditional probability and bayes theorem is the key to understanding this algorithm.\n",
30 | "\n",
31 | "Conditional probability is defined as the probability of an event occuring given that another event has already occured. For example, suppose we rolled a dice and we know that the number that came out is an even number. Now if we want to find the probability of getting a 2 on the dice, it is expressed using conditional probability. Mathematically, conditional probability is defined as follows:-"
32 | ]
33 | },
34 | {
35 | "cell_type": "markdown",
36 | "metadata": {},
37 | "source": [
38 | "$$ \\large \\large P(A|B) = \\frac{P(A \\bigcap B)}{P(B)} $$"
39 | ]
40 | },
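{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the die example above, let A be the event of rolling a 2 and B the event of rolling an even number. Then $P(A \\bigcap B) = 1/6$ (only the roll 2 satisfies both) and $P(B) = 1/2$, so $P(A|B) = \\frac{1/6}{1/2} = \\frac{1}{3}$."
]
},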
41 | {
42 | "cell_type": "markdown",
43 | "metadata": {},
44 | "source": [
45 | "Bayes theorem is a very elegant theroem based on conditional probability that describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is named after Thomas Bayes. It is mathematically defined as follows:-\n",
46 | "\n",
47 | "$$ \\large P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n",
48 | "\n",
49 | "where A and B are two events.\n",
50 | "\n",
51 | "Each term in the above equation have been given a special name:-\n",
52 | "\n",
53 | "P(A|B) is known as Posterior Probability \n",
54 | "P(B|A) is known as Likelihood Probability \n",
55 | "P(A) is known as Prior Probability, and \n",
56 | "P(B) is known as Evidence Probability "
57 | ]
58 | },
59 | {
60 | "cell_type": "markdown",
61 | "metadata": {},
62 | "source": [
63 | "## The Mathematical model of Naive Bayes Algorithm \n",
64 | "\n",
65 | "Suppose we have a point $x$ as follows:-\n",
66 | "\n",
67 | "$$ \\large x=[x_1, x_2, x_3,....,x_n]$$\n",
68 | "\n",
69 | "Our task is to assign a class or a label to this point 'k'. If we have 'k' classes, then we have to find the probability of the point $x$ belonging to class $C_k$. The class with highest probability will be assigned as the label of $x$. The probablity of a class $C_k$ given $x$ can be calculated using Bayes Theorem as follows:-\n",
70 | "\n",
71 | "$$\\large P(C_k|x) = \\frac{P(x|C_k)P(C_k)}{P(x)} \\;\\;\\;\\;\\;\\;\\; - \\;\\;(i)$$\n",
72 | "\n",
73 | "So, to summarize, if our dataset has 3 classes (setosa, virginica and versicolor for example, then we have to calculate P(setosa|x), P(virginica|x) and P(versicolor|x) and the highest probability will be assigned as the label x.\n",
74 | "\n",
75 | "Now, in our algorithm, we can omit the Evidence term, because is will remain constant for all the probabilities. This is done just to simplify the computations.\n",
76 | "\n",
77 | "Now, $P(C_k|x)$ can also be written as $P(C_k, x)$, and if we replace $x$ with its value, we get $P(C_k, x_1, x_2, x_3, ...., x_n)$. So, till now, we have basically transformed $P(C_k, x)$ into $$P(C_k, x_1, x_2, x_3, ...., x_n)\\;\\;\\;\\;\\;\\;\\; -\\;\\;\\;-(ii) $$. Things will start to get interesting now. \n",
78 | "\n",
79 | "In eq (ii), we can interchanging the terms inside the parenthesis won't change it's meaning. So, I am shifting the $C_k$ to the end. So, our equation will look like this - $P(x_1, x_2, x_3, ...., x_n, C_k)$. Now, if consider $x_1$ as event A and remaining terms as event B and apply bayes theorem, we will get:- \n",
80 | "\n",
81 | "$$\\large P(x_1, x_2, x_3, ...., x_n, C_k) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)\\;\\;\\;\\;\\;\\;-(iii)\\;\\;\\;$$\n",
82 | "\n",
83 | "(Omitting the deniminator term as discussed [here](#omitting))\n",
84 | "\n",
85 | "If we keep applying bayes theorem in equation (iii), we will get:-\n",
86 | "\n",
87 | "$ \\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(C_k, x_1, x_2, x_3, ...., x_n)$ \n",
88 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)$ \n",
89 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k)$ \n",
90 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = \\;\\;...$ \n",
91 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k) \\;\\;-(iv)$ \n",
92 | "\n",
93 | "Now, in naive bayes, we assume that the features are conditionally independent of each other. If features are independent then:-\n",
94 | "\n",
95 | "$$ \\large P(x_i|x_{i+1}, ..., x_n, C_k) = P(x_i|C_k)$$\n",
96 | "\n",
97 | "If we apply this rule in equation (iv) we get:-\n",
98 | "\n",
99 | "$\\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k)$\n",
100 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k).....$ \n",
101 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)\\pi_{i=0}^{n}P(x_i|C_k)$\n",
102 | "\n",
103 | "Hence,\n",
104 | "\n",
105 | "$$ \\large P(C_k|x_n) = P(C_k)\\prod_{i=0}^{n}P(x_i|C_k) $$\n",
106 | "\n",
107 | "This is how we predict in Naive Bayes Algorithm"
108 | ]
109 | },
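{
"cell_type": "markdown",
"metadata": {},
"source": [
"Equivalently, since $\\log$ is monotonic, we can predict with $\\hat{y} = \\arg\\max_k \\left[ \\log P(C_k) + \\sum_{i=1}^{n} \\log P(x_i|C_k) \\right]$, which replaces a long product of small numbers with a numerically stable sum. This log form is the one implemented in the solution below."
]
},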
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "## Naive Bayes with a simple example\n",
115 | "\n",
116 | "To understand the naive bayes with ta simple example, check out [this](http://shatterline.com/blog/2013/09/12/not-so-naive-classification-with-the-naive-bayes-classifier/) blog by [shatterline](http://shatterline.com/blog)."
117 | ]
118 | },
119 | {
120 | "cell_type": "markdown",
121 | "metadata": {},
122 | "source": [
123 | "## Exercise\n",
124 | "\n",
125 | "The objective of this excercise is to solidify your understanding of Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. After implementing this class, you will run it on the same dataset shown in [this](#ex) example.\n",
126 | "\n",
127 | "You can refer the solution if you feel stuck somewhere. Also, one thing you need to be make sure is to use log of probabilities that you calculate. This is important because the probabilities you calculate will have very large decimal places but python will store only the 1st 16 places. This could lead to some discrepancies in the result so make sure to use log probabilities. If would also encourage you to comment your code as it is a good practice.\n",
128 | "\n",
129 | "Good luck"
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "## Importing the libraries and loding the data"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 1,
142 | "metadata": {},
143 | "outputs": [],
144 | "source": [
145 | "import numpy as np\n",
146 | "import pandas as pd"
147 | ]
148 | },
149 | {
150 | "cell_type": "code",
151 | "execution_count": 2,
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "df = pd.DataFrame(\n",
156 | " { \n",
157 | " 'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'], \n",
158 | " 'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', \"Hot\", 'Mild'],\n",
159 | " 'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],\n",
160 | " 'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],\n",
161 | " 'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']\n",
162 | " }\n",
163 | ")"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 3,
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "data": {
173 | "text/html": [
174 | "
\n",
175 | "\n",
188 | "
\n",
189 | " \n",
190 | "
\n",
191 | "
\n",
192 | "
Outlook
\n",
193 | "
Temperature
\n",
194 | "
Humidity
\n",
195 | "
Wind
\n",
196 | "
Play
\n",
197 | "
\n",
198 | " \n",
199 | " \n",
200 | "
\n",
201 | "
0
\n",
202 | "
Sunny
\n",
203 | "
Hot
\n",
204 | "
High
\n",
205 | "
Weak
\n",
206 | "
No
\n",
207 | "
\n",
208 | "
\n",
209 | "
1
\n",
210 | "
Sunny
\n",
211 | "
Hot
\n",
212 | "
High
\n",
213 | "
Strong
\n",
214 | "
No
\n",
215 | "
\n",
216 | "
\n",
217 | "
2
\n",
218 | "
Overcast
\n",
219 | "
Hot
\n",
220 | "
High
\n",
221 | "
Weak
\n",
222 | "
Yes
\n",
223 | "
\n",
224 | "
\n",
225 | "
3
\n",
226 | "
Rain
\n",
227 | "
Mild
\n",
228 | "
High
\n",
229 | "
Weak
\n",
230 | "
Yes
\n",
231 | "
\n",
232 | "
\n",
233 | "
4
\n",
234 | "
Rain
\n",
235 | "
Cool
\n",
236 | "
Normal
\n",
237 | "
Weak
\n",
238 | "
Yes
\n",
239 | "
\n",
240 | "
\n",
241 | "
5
\n",
242 | "
Rain
\n",
243 | "
Cool
\n",
244 | "
Normal
\n",
245 | "
Strong
\n",
246 | "
No
\n",
247 | "
\n",
248 | "
\n",
249 | "
6
\n",
250 | "
Overcast
\n",
251 | "
Cool
\n",
252 | "
Normal
\n",
253 | "
Strong
\n",
254 | "
Yes
\n",
255 | "
\n",
256 | "
\n",
257 | "
7
\n",
258 | "
Sunny
\n",
259 | "
Mild
\n",
260 | "
High
\n",
261 | "
Weak
\n",
262 | "
No
\n",
263 | "
\n",
264 | "
\n",
265 | "
8
\n",
266 | "
Sunny
\n",
267 | "
Cool
\n",
268 | "
Normal
\n",
269 | "
Weak
\n",
270 | "
Yes
\n",
271 | "
\n",
272 | "
\n",
273 | "
9
\n",
274 | "
Rain
\n",
275 | "
Mild
\n",
276 | "
Normal
\n",
277 | "
Weak
\n",
278 | "
Yes
\n",
279 | "
\n",
280 | "
\n",
281 | "
10
\n",
282 | "
Sunny
\n",
283 | "
Mild
\n",
284 | "
Normal
\n",
285 | "
Strong
\n",
286 | "
Yes
\n",
287 | "
\n",
288 | "
\n",
289 | "
11
\n",
290 | "
Overcast
\n",
291 | "
Mild
\n",
292 | "
High
\n",
293 | "
Strong
\n",
294 | "
Yes
\n",
295 | "
\n",
296 | "
\n",
297 | "
12
\n",
298 | "
Overcast
\n",
299 | "
Hot
\n",
300 | "
Normal
\n",
301 | "
Weak
\n",
302 | "
Yes
\n",
303 | "
\n",
304 | "
\n",
305 | "
13
\n",
306 | "
Rain
\n",
307 | "
Mild
\n",
308 | "
High
\n",
309 | "
Strong
\n",
310 | "
No
\n",
311 | "
\n",
312 | " \n",
313 | "
\n",
314 | "
"
315 | ],
316 | "text/plain": [
317 | " Outlook Temperature Humidity Wind Play\n",
318 | "0 Sunny Hot High Weak No\n",
319 | "1 Sunny Hot High Strong No\n",
320 | "2 Overcast Hot High Weak Yes\n",
321 | "3 Rain Mild High Weak Yes\n",
322 | "4 Rain Cool Normal Weak Yes\n",
323 | "5 Rain Cool Normal Strong No\n",
324 | "6 Overcast Cool Normal Strong Yes\n",
325 | "7 Sunny Mild High Weak No\n",
326 | "8 Sunny Cool Normal Weak Yes\n",
327 | "9 Rain Mild Normal Weak Yes\n",
328 | "10 Sunny Mild Normal Strong Yes\n",
329 | "11 Overcast Mild High Strong Yes\n",
330 | "12 Overcast Hot Normal Weak Yes\n",
331 | "13 Rain Mild High Strong No"
332 | ]
333 | },
334 | "execution_count": 3,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "df"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 4,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": [
349 | "X = df.iloc[:, :-1]\n",
350 | "y = df.iloc[:, -1:]"
351 | ]
352 | },
353 | {
354 | "cell_type": "markdown",
355 | "metadata": {},
356 | "source": [
357 | "# Start coding here..."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": 5,
363 | "metadata": {},
364 | "outputs": [],
365 | "source": [
366 | "class NaiveBayes:\n",
367 | " def __init__(self, X, y):\n",
368 | " '''\n",
369 | " This method initializes all the required data fields of the NaiveBayes class\n",
370 | " \n",
371 | " Input -\n",
372 | " X: A pandas dataframe consisting of all the dependent variables\n",
373 | " y: A pandas dataframe consisting of labels\n",
374 | " '''\n",
375 | " # Initializing the Dependent and Independent Variables\n",
376 | " self.X = X\n",
377 | " self.y = y\n",
378 | " \n",
379 | " # Initializing the column name y. (came in handy for me. If you do not require it, then you can delete it)\n",
380 | " self.y_label = y.columns[0]\n",
381 | " \n",
382 | " # Initializing the variables to store class priors. Initiallt set to None. The will the assigned the correct values by\n",
383 | " # executing the calculate_prior method\n",
384 | " # p_pos is probability of positive class\n",
385 | " # p_neg is probability of negative class\n",
386 | " self.p_pos = None\n",
387 | " self.p_neg = None\n",
388 | " \n",
389 | " # A dictionary to store all likelihood probabilities\n",
390 | " self.likelihoods = {}\n",
391 | " \n",
392 | " # Executing calculate_prior and calculate_likelihood to calculate prior and likelihood probabilities\n",
393 | " self.calculate_prior()\n",
394 | " self.calculate_likelihood()\n",
395 | " \n",
396 | " \n",
397 | " def calculate_prior(self):\n",
398 | " '''\n",
399 | " Method for calculating the prior probabilities\n",
400 | " \n",
401 | " Input - None\n",
402 | " \n",
403 | " Expected output: Expected to assign p_pos and p_neg their correct log probability values. No need to return anything\n",
404 | " '''\n",
405 | " # Get the total number of positive points\n",
406 | " total_positive = df[self.y_label][df[self.y_label] == 'Yes'].count()\n",
407 | " \n",
408 | " # Get the total number of negative points\n",
409 | " total_negative = df[self.y_label][df[self.y_label] == 'No'].count()\n",
410 | " \n",
411 | " # Get the total number of points\n",
412 | " total = df['Play'].count()\n",
413 | " \n",
414 | " # Calculate log probability of positive class\n",
415 | " self.p_pos = np.log(total_positive / total)\n",
416 | " # Calculate log probability of negative class\n",
417 | " self.p_neg = np.log(total_negative / total)\n",
418 | " \n",
419 | " def calculate_likelihood(self):\n",
420 | " '''\n",
421 | " Method for calculating the all the likelihood probabilities\n",
422 | " \n",
423 | " Input - None\n",
424 | " \n",
425 | " Expected output: Expected to create a dictionary of likelihood probabilities and assign it to likelihoods.\n",
426 | " '''\n",
427 | " # Concatenating X and y for easy access to features and labels\n",
428 | " df = pd.concat([self.X, self.y], axis=1)\n",
429 | " \n",
430 | " # Getting all unique class labels (Yes and No)\n",
431 | " labels = df[self.y_label].unique()\n",
432 | " \n",
433 | " # Get the count of all positive and negative points\n",
434 | " total_positive = y[y[self.y_label] == 'Yes'][self.y_label].count()\n",
435 | " total_negative = y[y[self.y_label] == 'No'][self.y_label].count()\n",
436 | " \n",
437 | " # Traversing through each column of the dataframe\n",
438 | " for feature_name in X.columns:\n",
439 | " # Storing likelihood for each value this column\n",
440 | " self.likelihoods[feature_name] = {}\n",
441 | " \n",
442 | " # Traversing through each unique value in the column\n",
443 | " for feature in df.loc[:, feature_name].unique():\n",
444 | " # Calculate P(feature_name|'yes')\n",
445 | " feature_given_yes = df[(df[feature_name] == feature) & (df[self.y_label] == 'Yes')][feature_name].count()\n",
446 | " \n",
447 | " # Get the log probability\n",
448 | " feature_given_yes = 0 if feature_given_yes == 0 else np.log( feature_given_yes / total_positive )\n",
449 | " \n",
450 | " # Calculate P(feature_name|'yes')\n",
451 | " feature_given_no = df[(df[feature_name] == feature) & (df[self.y_label] == 'No')][feature_name].count()\n",
452 | " \n",
453 | " # Get the log probability\n",
454 | " feature_given_no = 0 if feature_given_no == 0 else np.log( feature_given_no / total_negative )\n",
455 | " \n",
456 | " # Store the likelihood the the dict\n",
457 | " self.likelihoods[feature_name][f'{feature}|yes'] = feature_given_yes\n",
458 | " self.likelihoods[feature_name][f'{feature}|no'] = feature_given_no\n",
459 | " \n",
460 | " def predict(self, test_data):\n",
461 | " '''\n",
462 | " A method to predict the label for the input\n",
463 | " \n",
464 | " Input -\n",
465 | " test_data: A dataframe of dependent variables\n",
466 | " \n",
467 | " Expected output: Expected to return a dataframe of predictions. The column name of dataframe should match column name of y\n",
468 | " '''\n",
469 | " feature_names = test_data.columns\n",
470 | " # List to store the predictions\n",
471 | " prediction = []\n",
472 | " \n",
473 | " # Traversing through the dataframe\n",
474 | " for row in test_data.itertuples():\n",
475 | " # A list to store P(y=yes|X) and P(y=no|X)\n",
476 | " p_yes_given_X = []\n",
477 | " p_no_given_X = []\n",
478 | " \n",
479 | " # Traversing through each row of datadrame to get the value of each column\n",
480 | " for i in range(len(row) - 1):\n",
481 | " \n",
482 | " # Slicingthe 1st element as it is not needed (index)\n",
483 | " row = row[1:]\n",
484 | " \n",
485 | " # getting the likelihood probabilities from likelihood dict and storing them in the list\n",
486 | " p_yes_given_X.append(self.likelihoods[feature_names[i]][f'{row[0]}|yes'])\n",
487 | " p_no_given_X.append(self.likelihoods[feature_names[i]][f'{row[0]}|no'])\n",
488 | " \n",
489 | " # Adding probability of positive and negative class to the list\n",
490 | " p_yes_given_X.append(self.p_pos)\n",
491 | " p_no_given_X.append(self.p_neg)\n",
492 | " \n",
493 | " # Since we are using log probabilities, we can ad them instead of multiplying since log(a*b) = log(a) + log(b)\n",
494 | " p_yes_given_X = np.sum(p_yes_given_X)\n",
495 | " p_no_given_X = np.sum(p_no_given_X)\n",
496 | "\n",
497 | " # If p_yes_given_X > p_no_given_X, the we assign positive class i.e F=True else False\n",
498 | " # Add the prediction to the prediction list\n",
499 | " prediction.append(p_yes_given_X > p_no_given_X)\n",
500 | " \n",
501 | " # Creating the prediction dataframe\n",
502 | " prediction = pd.DataFrame({self.y_label: prediction})\n",
503 | " \n",
504 | " # Converting True to yes and False to No\n",
505 | " prediction[self.y_label] = prediction[self.y_label].map({True: 'Yes', False: 'No'})\n",
506 | " \n",
507 | " # return the prediction\n",
508 | " return prediction"
509 | ]
510 | },
511 | {
512 | "cell_type": "markdown",
513 | "metadata": {},
514 | "source": [
515 | "# Test if your Code is working as expected..."
516 | ]
517 | },
518 | {
519 | "cell_type": "code",
520 | "execution_count": 6,
521 | "metadata": {},
522 | "outputs": [],
523 | "source": [
524 | "# Create the object\n",
525 | "nb = NaiveBayes(X, y)"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": 7,
531 | "metadata": {},
532 | "outputs": [],
533 | "source": [
534 | "# Check if your code is predicting correctly\n",
535 | "assert nb.predict(X).equals(pd.DataFrame({'Play': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']})) == True, 'The prediction received is wrong. Kindly recheck your code. Refer the solution if you find yourself stuck somewhere'"
536 | ]
537 | },
538 | {
539 | "cell_type": "markdown",
540 | "metadata": {},
541 | "source": []
542 | }
543 | ],
544 | "metadata": {
545 | "kernelspec": {
546 | "display_name": "Python 3",
547 | "language": "python",
548 | "name": "python3"
549 | },
550 | "language_info": {
551 | "codemirror_mode": {
552 | "name": "ipython",
553 | "version": 3
554 | },
555 | "file_extension": ".py",
556 | "mimetype": "text/x-python",
557 | "name": "python",
558 | "nbconvert_exporter": "python",
559 | "pygments_lexer": "ipython3",
560 | "version": "3.8.3"
561 | }
562 | },
563 | "nbformat": 4,
564 | "nbformat_minor": 4
565 | }
566 |
--------------------------------------------------------------------------------
/009/exercise/readme.md:
--------------------------------------------------------------------------------
1 | ### Problem Statement
2 | The main idea is to predict whether a current health-insurance client is interested in getting vehicle insurance.
3 |
4 | ### Task Details
5 |
6 | Your client is an insurance company that has provided health insurance to its customers. They now need your help in building a model to predict whether policyholders (customers) from the past year will also be interested in the vehicle insurance provided by the company.
7 |
8 | In order to predict whether a customer would be interested in vehicle insurance, you have information about demographics (gender, age, region code type), vehicles (vehicle age, damage) and the policy (premium, sourcing channel), etc.
9 |
10 | ### Evaluation Metric
11 | The evaluation metric for this hackathon is the ROC AUC score.
12 |
13 | ### Suggested Model
14 | KNN classifier
15 |
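A minimal sketch of that setup is given below (it assumes `train.csv` in `../data` with a `Response` target column, as in the linked Kaggle dataset; adjust the column name if it differs):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# One-hot encode the categorical columns; 'Response' is assumed to be the 0/1 target
df = pd.get_dummies(pd.read_csv('../data/train.csv'), drop_first=True)
X, y = df.drop(columns=['Response']), df['Response']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# kNN is distance-based, so scale the features first
scaler = StandardScaler()
X_train, X_val = scaler.fit_transform(X_train), scaler.transform(X_val)

knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)
print(roc_auc_score(y_val, knn.predict_proba(X_val)[:, 1]))
```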
16 | Source : https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction/tasks?taskId=2055
17 |
--------------------------------------------------------------------------------
/009/solution/readme.md:
--------------------------------------------------------------------------------
1 | # My Solution
2 |
3 | See the following jupyter notebook:
4 |
5 | `insurance_cross_sell.ipynb`
6 |
7 | [](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/009/solution/insurance_cross_sell.ipynb)
8 | [](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/009/solution/insurance_cross_sell.ipynb)
9 |
10 | It contains a kNN solution for the classification problem.
11 |
--------------------------------------------------------------------------------
/010/exercise/knn_starter_exercise.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "