├── .github ├── ISSUE_TEMPLATE │ ├── bug-report.md │ ├── documentation-improvement.md │ ├── feature-request.md │ ├── improvement.md │ ├── new-exercise.md │ └── submit-question.md ├── pull_request_template.md └── workflows │ └── lint.yml ├── .gitignore ├── 000 ├── exercise │ └── readme.md └── solution │ ├── hello_world.py │ └── readme.md ├── 001 ├── data │ ├── test.csv │ └── train.csv ├── exercise │ ├── readme.md │ └── starter-notebook.ipynb └── solution │ ├── README.md │ ├── prediction.csv │ ├── prediction_pt.csv │ ├── submission.csv │ ├── titanic_classical.ipynb │ ├── titanic_pt_nn.ipynb │ └── titanic_tf_nn.ipynb ├── 002 ├── data │ └── housing_prices.csv ├── exercise │ ├── linear_regression.ipynb │ └── readme.md └── solution │ ├── linear_regression.ipynb │ └── readme.md ├── 003 ├── exercise │ └── readme.md └── solution │ ├── README.md │ ├── digit_recog_nn.ipynb │ ├── images │ ├── ANN.jpg │ └── cnn-procedure.png │ ├── test_cnn.h5 │ └── test_nn.h5 ├── 004 ├── data │ └── AM.txt ├── exercise │ └── readme.md └── solution │ ├── readme.md │ └── text_generation_model.ipynb ├── 005 ├── data │ ├── books_small.json │ └── books_small_10000.json ├── exercise │ └── readme.md └── solution │ ├── readme.md │ └── sentiment_analysis.ipynb ├── 006 ├── data │ ├── breast_cancer_diagnosis.csv │ └── readme.md ├── exercise │ └── readme.md └── solution │ ├── ensemble_techniques.ipynb │ └── readme.md ├── 007 ├── data │ ├── employee_attrition.csv │ └── readme.md ├── exercise │ └── readme.md └── solution │ └── readme.md ├── 008 ├── exercise │ ├── NaiveBayes - exercise starter.ipynb │ └── README.md └── solution │ └── NaiveBayes Solution.ipynb ├── 009 ├── data │ ├── test.csv │ └── train.csv ├── exercise │ └── readme.md └── solution │ ├── insurance_cross_sell.ipynb │ └── readme.md ├── 010 ├── exercise │ ├── knn_starter_exercise.ipynb │ └── readme.md └── solution │ ├── knn_from_scratch.ipynb │ ├── knn_using_sklearn.ipynb │ └── readme.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── Resources ├── Numpy │ └── NumPy Tutorial.ipynb └── Pandas │ ├── Pandas Tutorial.ipynb │ ├── pokemon_data.csv │ ├── pokemon_data.txt │ ├── pokemon_data.xlsx │ ├── ufo-sightings.csv │ └── user.txt └── __init__.py /.github/ISSUE_TEMPLATE/bug-report.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Bug report 3 | about: Create a report to help us improve 4 | title: "[BUG]" 5 | labels: bug 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Describe the bug** 11 | A clear and concise description of what the bug is. 12 | 13 | **To Reproduce** 14 | Steps to reproduce the behavior: 15 | 1. Go to '...' 16 | 2. Click on '....' 17 | 3. Scroll down to '....' 18 | 4. See error 19 | 20 | **Expected behavior** 21 | A clear and concise description of what you expected to happen. 22 | 23 | **Screenshots** 24 | If applicable, add screenshots to help explain your problem. 25 | 26 | **Desktop (please complete the following information):** 27 | - OS: [e.g. iOS] 28 | - Browser [e.g. chrome, safari] 29 | - Version [e.g. 22] 30 | 31 | **Smartphone (please complete the following information):** 32 | - Device: [e.g. iPhone6] 33 | - OS: [e.g. iOS8.1] 34 | - Browser [e.g. stock browser, safari] 35 | - Version [e.g. 22] 36 | 37 | **Additional context** 38 | Add any other context about the problem here. 
39 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/documentation-improvement.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Documentation Improvement 3 | about: Report wrong/missing documentation or improvement of documentation 4 | title: "[DOC]" 5 | labels: documentation 6 | assignees: '' 7 | 8 | --- 9 | 10 | 11 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/feature-request.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Feature request 3 | about: Suggest an idea for this project 4 | title: "[FEA]" 5 | labels: 'feature' 6 | assignees: '' 7 | 8 | --- 9 | 10 | **Is your feature request related to a problem? Please describe.** 11 | A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 12 | 13 | **Describe the solution you'd like** 14 | A clear and concise description of what you want to happen. 15 | 16 | **Describe alternatives you've considered** 17 | A clear and concise description of any alternative solutions or features you've considered. 18 | 19 | **Additional context** 20 | Add any other context or screenshots about the feature request here. 21 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/improvement.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Exercise/solution Improvement 3 | about: Suggest improvement/enhancement of a particular exercise/solution 4 | title: "[IMP]" 5 | labels: improvement 6 | assignees: '' 7 | 8 | --- 9 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/new-exercise.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: New exercise 3 | about: Suggest a new exercise 4 | title: "[EXE]" 5 | labels: idea for exercise 6 | assignees: '' 7 | 8 | --- 9 | 10 | ### Learning Goals 11 | [Learning goals; a bulleted/numbered list is preferred] 12 | [e.g. learn the concept and the use of train/validation/test datasets using scikit-learn ] 13 | 14 | ### Exercise Statement 15 | [Explain and describe what the exercise is] 16 | [e.g. apply a simple random-forest model to classify Titanic survivability from the Titanic data ] 17 | 18 | 19 | ### Prerequisites 20 | [Prerequisites, in terms of concepts or other exercises in this repo] 21 | [e.g. random-forest model, stochastic gradient descent, exercise #32] 22 | 23 | ### Data source/summary: 24 | [Provide a succinct summary of what the data is and where it is from] 25 | [e.g. This involves the COVID-19 fatality dataset from the Johns Hopkins website (links..) ] 26 | 27 | ### (Optional) Suggest/Propose Solutions 28 | [e.g. I have the solution using PyTorch and will be happy to create a pull request to include the exercise statement/solution] 29 | [e.g. I think chapter 3 of A. Geron's textbook works out the solution for this exercise] 30 | [e.g. fast.ai's chapter 5 has the perfect solution for this] 31 | 32 | 33 | ### (Optional) Further Links/Credits to Relevant Resources: 34 | [e.g. 
This exercise and solution's proposal came from a lab session from DL2020] 35 | -------------------------------------------------------------------------------- /.github/ISSUE_TEMPLATE/submit-question.md: -------------------------------------------------------------------------------- 1 | --- 2 | name: Submit Question 3 | about: Ask any question on our projects 4 | title: "[QST]" 5 | labels: '' 6 | assignees: '' 7 | 8 | --- 9 | 10 | 11 | -------------------------------------------------------------------------------- /.github/pull_request_template.md: -------------------------------------------------------------------------------- 1 | #### Reference Issues/PRs 2 | 5 | 6 | 7 | #### What does this implement/fix? Explain your changes. 8 | 9 | 10 | 11 | #### Any other comments? 12 | 13 | 14 | 15 | -------------------------------------------------------------------------------- /.github/workflows/lint.yml: -------------------------------------------------------------------------------- 1 | # This workflow will install Python dependencies and run lint with a single version of Python (3.8.6) 2 | 3 | name: Flake8 4 | 5 | on: 6 | push: 7 | branches: [ master ] 8 | pull_request: 9 | branches: [ master ] 10 | 11 | jobs: 12 | flake8_py3: 13 | runs-on: ubuntu-latest 14 | steps: 15 | - name: Setup Python 16 | uses: actions/setup-python@v1 17 | with: 18 | python-version: 3.8.6 19 | architecture: x64 20 | 21 | 22 | - name: Install flake8 23 | run: pip install flake8 24 | 25 | - name: Run flake8 26 | uses: suo/flake8-github-action@releases/v1 27 | with: 28 | checkName: 'flake8_py3' # NOTE: this needs to be the same as the job name 29 | env: 30 | GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 31 | 32 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Created by https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks 2 | # Edit at https://www.toptal.com/developers/gitignore?templates=python,jupyternotebooks 3 | 4 | ### JupyterNotebooks ### 5 | # gitignore template for Jupyter Notebooks 6 | # website: http://jupyter.org/ 7 | 8 | .ipynb_checkpoints 9 | */.ipynb_checkpoints/* 10 | 11 | # IPython 12 | profile_default/ 13 | ipython_config.py 14 | 15 | # Remove previous ipynb_checkpoints 16 | # git rm -r .ipynb_checkpoints/ 17 | 18 | ### Python ### 19 | # Byte-compiled / optimized / DLL files 20 | __pycache__/ 21 | *.py[cod] 22 | *$py.class 23 | 24 | # C extensions 25 | *.so 26 | 27 | # Distribution / packaging 28 | .Python 29 | build/ 30 | develop-eggs/ 31 | dist/ 32 | downloads/ 33 | eggs/ 34 | .eggs/ 35 | lib/ 36 | lib64/ 37 | parts/ 38 | sdist/ 39 | var/ 40 | wheels/ 41 | pip-wheel-metadata/ 42 | share/python-wheels/ 43 | *.egg-info/ 44 | .installed.cfg 45 | *.egg 46 | MANIFEST 47 | 48 | # PyInstaller 49 | # Usually these files are written by a python script from a template 50 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
51 | *.manifest 52 | *.spec 53 | 54 | # Installer logs 55 | pip-log.txt 56 | pip-delete-this-directory.txt 57 | 58 | # Unit test / coverage reports 59 | htmlcov/ 60 | .tox/ 61 | .nox/ 62 | .coverage 63 | .coverage.* 64 | .cache 65 | nosetests.xml 66 | coverage.xml 67 | *.cover 68 | *.py,cover 69 | .hypothesis/ 70 | .pytest_cache/ 71 | 72 | # Translations 73 | *.mo 74 | *.pot 75 | 76 | # Django stuff: 77 | *.log 78 | local_settings.py 79 | db.sqlite3 80 | db.sqlite3-journal 81 | 82 | # Flask stuff: 83 | instance/ 84 | .webassets-cache 85 | 86 | # Scrapy stuff: 87 | .scrapy 88 | 89 | # Sphinx documentation 90 | docs/_build/ 91 | 92 | # PyBuilder 93 | target/ 94 | 95 | # Jupyter Notebook 96 | 97 | # IPython 98 | 99 | # pyenv 100 | .python-version 101 | 102 | # pipenv 103 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. 104 | # However, in case of collaboration, if having platform-specific dependencies or dependencies 105 | # having no cross-platform support, pipenv may install dependencies that don't work, or not 106 | # install all needed dependencies. 107 | #Pipfile.lock 108 | 109 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow 110 | __pypackages__/ 111 | 112 | # Celery stuff 113 | celerybeat-schedule 114 | celerybeat.pid 115 | 116 | # SageMath parsed files 117 | *.sage.py 118 | 119 | # Environments 120 | .env 121 | .venv 122 | env/ 123 | venv/ 124 | ENV/ 125 | env.bak/ 126 | venv.bak/ 127 | 128 | # Spyder project settings 129 | .spyderproject 130 | .spyproject 131 | 132 | # Rope project settings 133 | .ropeproject 134 | 135 | # mkdocs documentation 136 | /site 137 | 138 | # mypy 139 | .mypy_cache/ 140 | .dmypy.json 141 | dmypy.json 142 | 143 | # Pyre type checker 144 | .pyre/ 145 | 146 | # pytype static type analyzer 147 | .pytype/ 148 | 149 | # End of https://www.toptal.com/developers/gitignore/api/python,jupyternotebooks 150 | -------------------------------------------------------------------------------- /000/exercise/readme.md: -------------------------------------------------------------------------------- 1 | # Exercise goal 2 | - Learn the `print` function in Python 3.x 3 | 4 | # Task 5 | - Print out `hello world!` 6 | -------------------------------------------------------------------------------- /000/solution/hello_world.py: -------------------------------------------------------------------------------- 1 | def print_hello(): 2 | print("Hello world!") 3 | 4 | 5 | print_hello() 6 | -------------------------------------------------------------------------------- /000/solution/readme.md: -------------------------------------------------------------------------------- 1 | # My solution 2 | - Use the handy `print` function of Python 3.x 3 | - The `hello_world.py` file implements that. 4 | -------------------------------------------------------------------------------- /001/exercise/readme.md: -------------------------------------------------------------------------------- 1 | # 👋🛳️ Ahoy, welcome to your first ML project. 2 | 3 | This is one of the first newbie challenges that you should tackle to enter the world of ML. 4 | 5 | We will use machine learning to create a model that predicts which passengers survived the Titanic shipwreck. 6 | 7 | # THE CHALLENGE 8 | 9 | The sinking of the Titanic is one of the most infamous shipwrecks in history. 10 | On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. 
Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. 11 | 12 | While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. 13 | 14 | In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.). 15 | In this competition, you’ll gain access to two similar datasets that include passenger information like name, age, gender, socio-economic class, etc. One dataset is in `./data/train.csv` and the other is in `./data/test.csv`. 16 | 17 | The file `./data/train.csv` will contain the details of a subset of the passengers on board (891 to be exact) and importantly, will reveal whether they survived or not, also known as the “ground truth”. 18 | 19 | The `./data/test.csv` dataset contains similar information but does not disclose the “ground truth” for each passenger. It’s your job to predict these outcomes. 20 | 21 | Using the patterns you find in the `./data/train.csv` data, predict whether the other 418 passengers on board (found in `./data/test.csv`) survived. 22 | 23 | # Tasks 24 | 25 | Most machine learning projects follow a similar outline (a set of phases), which is listed below to help you get started. 26 | 27 | - Inspecting Data (this process is where you familiarize yourself with the data) 28 | - Inspect the data 29 | - Check for null/missing values 30 | - View statistical details using ``describe()`` 31 | 32 | What can you infer from the statistical measures? Like possible outliers? 33 | 34 | - Data Analysis and Visualization (this process is where you explore the data, clean it and infer some insights from it) 35 | 36 | - Delete columns that are irrelevant or not useful for prediction 37 | - Get the average rate of survival by Gender, Pclass 38 | - Plot the number of people who survived and who didn't survive 39 | - Plot the percentage of survival by gender 40 | - Handle null/missing values 41 | - Plot the survival rate by Age 42 | - Handle categorical text values and turn them into numerical values 43 | - Plot the correlation between the features and the label 44 | 45 | What do you infer from the data? What can you conclude from it? Who is most likely to survive, based on the data? A minimal code sketch of these phases is shown below. 
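Here is a minimal, hedged sketch of what these phases can look like in code. It loads the data from this repo's raw GitHub URL (the same one the starter notebook uses), previews the model-building phase described next, and makes simplifying choices — which columns to drop, how to fill and encode values — that you should revisit yourself:

```python
# A minimal sketch of the three phases -- not the only (or best) way to do it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/001/data/'
train = pd.read_csv(data_url + 'train.csv')

# Phase 1: inspect the data.
print(train.isnull().sum())   # missing values per column
print(train.describe())       # statistical summary

# Phase 2: a few of the analysis and cleaning tasks.
print(train.groupby('Sex')['Survived'].mean())     # survival rate by gender
print(train.groupby('Pclass')['Survived'].mean())  # survival rate by class
train = train.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
train['Embarked'] = train['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})

# Phase 3: fit a baseline classifier and estimate accuracy with cross-validation.
X, y = train.drop(columns=['Survived']), train['Survived']
model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X, y, cv=5).mean())
```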
46 | 47 | - Model Building and Evaluation (this process is where you start building models and choose a model based on accuracy metrics) 48 | - Choose a model suitable for classification 49 | - Fit the training data 50 | - Use cross-validation to get the average accuracy for model selection or an accuracy benchmark 51 | - Find out how accurately the model performs on test data using some metrics 52 | - **Bonus**: Try other classification algorithms and compare the accuracy metrics (and/or F1 score) by presenting them in a readable, easy-to-compare format 53 | - Choose the model with the best accuracy metric 54 | 55 | - Hint: To get you started, we have written a minimal template-like notebook: 56 | 57 | `starter-notebook.ipynb` 58 | 59 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/exercise/starter-notebook.ipynb) 60 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/exercise/starter-notebook.ipynb) 61 | 62 | 63 | Project and Data Source: https://www.kaggle.com/c/titanic 64 | -------------------------------------------------------------------------------- /001/exercise/starter-notebook.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "metadata": { 3 | "language_info": { 4 | "codemirror_mode": { 5 | "name": "ipython", 6 | "version": 3 7 | }, 8 | "file_extension": ".py", 9 | "mimetype": "text/x-python", 10 | "name": "python", 11 | "nbconvert_exporter": "python", 12 | "pygments_lexer": "ipython3", 13 | "version": "3.8.2-final" 14 | }, 15 | "orig_nbformat": 2, 16 | "kernelspec": { 17 | "name": "python38264bitpandassconda285c25d0d8784f5bba9542830bc5427b", 18 | "display_name": "Python 3.8.2 64-bit ('pandass': conda)" 19 | } 20 | }, 21 | "nbformat": 4, 22 | "nbformat_minor": 2, 23 | "cells": [ 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "<table class=\"tfo-notebook-buttons\" align=\"left\">\n", 29 | " <td>\n",
\n", 30 | " Run in Google Colab\n", 31 | "
" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "## A starter notebook with tasks listed to help you get started on your first machine learning project" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "### First phase: Inspecting the data\n", 47 | "(this process is where you familiarize yourself with the data)\n", 48 | "\n", 49 | "Tasks:\n", 50 | "- Inspect the data\n", 51 | "- Check for null/missing values\n", 52 | "- View statistical details using ``describe()``\n", 53 | "\n", 54 | "Step one has been done for you \n", 55 | "After finishing the tasks think of the following:\n", 56 | "\n", 57 | "What can you infer from the statistical measures? like possible outliers? " 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "**Important note:**\n", 65 | "\n", 66 | "The data is already split into train.csv and test.csv, you generally would build your model using the train.csv **without** using any test.csv data as that would lead to overfitting your model\n", 67 | "\n", 68 | "#### Reminder: you do not have to stick to the tasks listed word for word, the tasks are listed as a guide to guide you.\n", 69 | "#### Feel free to explore more and do more on your own!" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 2, 75 | "metadata": {}, 76 | "outputs": [], 77 | "source": [ 78 | "# imports \n", 79 | "import pandas as pd" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "output_type": "execute_result", 89 | "data": { 90 | "text/plain": " PassengerId Survived Pclass \\\n0 1 0 3 \n1 2 1 1 \n2 3 1 3 \n3 4 1 1 \n4 5 0 3 \n\n Name Sex Age SibSp \\\n0 Braund, Mr. Owen Harris male 22.0 1 \n1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n2 Heikkinen, Miss. Laina female 26.0 0 \n3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n4 Allen, Mr. William Henry male 35.0 0 \n\n Parch Ticket Fare Cabin Embarked \n0 0 A/5 21171 7.2500 NaN S \n1 0 PC 17599 71.2833 C85 C \n2 0 STON/O2. 3101282 7.9250 NaN S \n3 0 113803 53.1000 C123 S \n4 0 373450 8.0500 NaN S ", 91 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n
" 92 | }, 93 | "metadata": {}, 94 | "execution_count": 3 95 | } 96 | ], 97 | "source": [ 98 | 99 | "project_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/'\n", 100 | "data_path = 'master/001/data/'\n", 101 | "train = pd.read_csv(project_url+data_path+'train.csv')\n", 102 | "test = pd.read_csv(project_url+data_path+'test.csv')\n", 103 | "\n", 104 | "train.head()" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "metadata": {}, 111 | "outputs": [ 112 | { 113 | "output_type": "execute_result", 114 | "data": { 115 | "text/plain": " PassengerId Pclass Name Sex \\\n0 892 3 Kelly, Mr. James male \n1 893 3 Wilkes, Mrs. James (Ellen Needs) female \n2 894 2 Myles, Mr. Thomas Francis male \n3 895 3 Wirz, Mr. Albert male \n4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female \n\n Age SibSp Parch Ticket Fare Cabin Embarked \n0 34.5 0 0 330911 7.8292 NaN Q \n1 47.0 1 0 363272 7.0000 NaN S \n2 62.0 0 0 240276 9.6875 NaN Q \n3 27.0 0 0 315154 8.6625 NaN S \n4 22.0 1 1 3101298 12.2875 NaN S ", 116 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
PassengerIdPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
08923Kelly, Mr. Jamesmale34.5003309117.8292NaNQ
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NaNS
28942Myles, Mr. Thomas Francismale62.0002402769.6875NaNQ
38953Wirz, Mr. Albertmale27.0003151548.6625NaNS
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NaNS
\n
" 117 | }, 118 | "metadata": {}, 119 | "execution_count": 4 120 | } 121 | ], 122 | "source": [ 123 | "test.head()" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": {}, 130 | "outputs": [], 131 | "source": [ 132 | "# Check for null/missing values \n" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "# Inspect the statistical measures\n" 142 | ] 143 | }, 144 | { 145 | "cell_type": "markdown", 146 | "metadata": {}, 147 | "source": [ 148 | "### Second phase: Data Analysis and Visualization (this process is where you explore the data, clean it and infer some insights from it)\n", 149 | "Tasks:\n", 150 | "- Plot the gender distribution of passengers on board\n", 151 | "- Plot the rate of men and women who survived and who didnt survive\n", 152 | "- Plot the survival rate by \"Pclass\" (male and female counted together in each Pclass)\n", 153 | "- Plot the survival rate by males only in each \"Pclass\"\n", 154 | "- Plot the survival rate by females only in each \"Pclass\"\n", 155 | "\n", 156 | "Think about the plots that you just did, what can you infer from them?\n", 157 | "Now lets have a look at the columns, there are always unnecessary columns that do not contribute to the prediction.\n", 158 | "\n", 159 | "- Remove unnecessary columns from both train and test datasets\n", 160 | "- Handle categorical text values and turn them into numerical\n", 161 | "- Plot number of people who survived over age and passenger class\n", 162 | "\n", 163 | "- Deal with null values in the Age column\n", 164 | "- Group the data in the Age column into groups for better prediction.\n", 165 | "- Get survival rates by age groups \n", 166 | "- Plot the correlation between features and label\n", 167 | "\n", 168 | "\n", 169 | "What do you infer from the data? 
What can you conclude from it? Who is most likely to survive, based on the data?\n" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": null, 182 | "metadata": {}, 183 | "outputs": [], 184 | "source": [] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "metadata": {}, 190 | "outputs": [], 191 | "source": [] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "### Third phase:\n", 198 | "### Building the model and testing the accuracy (this process is where you start building models and choose a model based on accuracy metrics)\n", 199 | "- Choose a model suitable for classification\n", 200 | "- Fit the data\n", 201 | "- Find out how \"well\" the model performs using some metrics\n", 202 | "- Use cross-validation to get the average accuracy of the model you chose\n", 203 | "- **Bonus 1** \n", 204 | " - Try other classification algorithms \n", 205 | " - Compare the accuracy metrics (including cross-validation) for the classification algorithms used by presenting them in a readable, easy-to-compare format\n", 206 | " - Choose the model with the best cross-validation accuracy metric\n", 207 | "- **Bonus 2**: Get the feature importance of your features using random forests\n", 208 | " (if you are not sure what that is or how to do it, google it!)" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": {}, 222 | "outputs": [], 223 | "source": [] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": null, 228 | "metadata": {}, 229 | "outputs": [], 230 | "source": [] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [] 236 | } 237 | ] 238 | } 239 | -------------------------------------------------------------------------------- /001/solution/README.md: -------------------------------------------------------------------------------- 1 | # My Solution 2 | 3 | In this folder, you will find three approaches to modeling the Titanic survival prediction classification problem: 4 | 5 | ## Classical approach: 6 | 7 | `titanic_classical.ipynb` 8 | 9 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_classical.ipynb) 10 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_classical.ipynb) 11 | 12 | 13 | 14 | In this file, you will find a detailed data wrangling approach to clean and prepare the data. 
Moreover, classical machine learning classifiers from the list below are used: 15 | 16 | - Logistic Regression 17 | - Support Vector Machines 18 | - KNN or K-Nearest Neighbors 19 | - Decision Trees 20 | - SGDClassifier 21 | - Random Forest 22 | - Gaussian Naive Bayes 23 | 24 | 25 | ## Simple neural network in TensorFlow and Keras: 26 | 27 | `titanic_tf_nn.ipynb` 28 | 29 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_tf_nn.ipynb) 30 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_tf_nn.ipynb) 31 | 32 | 33 | In this file, you will find a simple 5-layer neural network approach performing the survival classification task. This is done in TensorFlow, using Keras as the frontend/API. 34 | 35 | ## Simple neural network in PyTorch: 36 | 37 | `titanic_pt_nn.ipynb` 38 | 39 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_pt_nn.ipynb) 40 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/001/solution/titanic_pt_nn.ipynb) 41 | 42 | In this file is a simple 4-layer neural network similar to the solution above, but using the PyTorch framework. 43 | -------------------------------------------------------------------------------- /001/solution/prediction.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,0 7 | 897,0 8 | 898,0 9 | 899,0 10 | 900,0 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,0 18 | 908,0 19 | 909,0 20 | 910,0 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,0 26 | 916,0 27 | 917,0 28 | 918,0 29 | 919,0 30 | 920,0 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,0 35 | 925,0 36 | 926,0 37 | 927,0 38 | 928,0 39 | 929,0 40 | 930,0 41 | 931,0 42 | 932,0 43 | 933,0 44 | 934,0 45 | 935,0 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,0 52 | 942,0 53 | 943,0 54 | 944,0 55 | 945,0 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 | 954,0 65 | 955,0 66 | 956,0 67 | 957,0 68 | 958,0 69 | 959,0 70 | 960,0 71 | 961,0 72 | 962,0 73 | 963,0 74 | 964,0 75 | 965,0 76 | 966,0 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,0 82 | 972,0 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,0 89 | 979,0 90 | 980,0 91 | 981,0 92 | 982,0 93 | 983,0 94 | 984,0 95 | 985,0 96 | 986,0 97 | 987,0 98 | 988,0 99 | 989,0 100 | 990,0 101 | 991,0 102 | 992,0 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,0 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,0 114 | 1004,0 115 | 1005,0 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,0 120 | 1010,0 121 | 1011,0 122 | 1012,0 123 | 1013,0 124 | 1014,0 125 | 1015,0 126 | 1016,0 127 | 1017,0 128 | 1018,0 129 | 1019,0 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,0 139 | 1029,0 140 | 1030,0 141 | 1031,0 142 | 1032,0 143 | 1033,0 144 | 1034,0 145 | 1035,0 146 | 1036,0 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,0 
151 | 1041,0 152 | 1042,0 153 | 1043,0 154 | 1044,0 155 | 1045,0 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,0 160 | 1050,0 161 | 1051,0 162 | 1052,0 163 | 1053,0 164 | 1054,0 165 | 1055,0 166 | 1056,0 167 | 1057,0 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,0 172 | 1062,0 173 | 1063,0 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,0 178 | 1068,0 179 | 1069,0 180 | 1070,0 181 | 1071,0 182 | 1072,0 183 | 1073,0 184 | 1074,1 185 | 1075,0 186 | 1076,0 187 | 1077,0 188 | 1078,0 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,0 195 | 1085,0 196 | 1086,0 197 | 1087,0 198 | 1088,0 199 | 1089,0 200 | 1090,0 201 | 1091,0 202 | 1092,0 203 | 1093,0 204 | 1094,0 205 | 1095,0 206 | 1096,0 207 | 1097,0 208 | 1098,0 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,0 216 | 1106,0 217 | 1107,0 218 | 1108,0 219 | 1109,0 220 | 1110,0 221 | 1111,0 222 | 1112,0 223 | 1113,0 224 | 1114,0 225 | 1115,0 226 | 1116,0 227 | 1117,0 228 | 1118,0 229 | 1119,0 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,0 234 | 1124,0 235 | 1125,0 236 | 1126,0 237 | 1127,0 238 | 1128,0 239 | 1129,0 240 | 1130,0 241 | 1131,0 242 | 1132,1 243 | 1133,0 244 | 1134,0 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,0 249 | 1139,0 250 | 1140,0 251 | 1141,0 252 | 1142,0 253 | 1143,0 254 | 1144,0 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,0 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,0 265 | 1155,0 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,0 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,0 275 | 1165,0 276 | 1166,0 277 | 1167,0 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,0 283 | 1173,0 284 | 1174,0 285 | 1175,0 286 | 1176,0 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,0 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,0 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,0 307 | 1197,1 308 | 1198,0 309 | 1199,0 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,0 314 | 1204,0 315 | 1205,0 316 | 1206,0 317 | 1207,0 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,0 326 | 1216,0 327 | 1217,0 328 | 1218,0 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,0 333 | 1223,0 334 | 1224,0 335 | 1225,0 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 344 | 1234,0 345 | 1235,0 346 | 1236,0 347 | 1237,0 348 | 1238,0 349 | 1239,0 350 | 1240,0 351 | 1241,0 352 | 1242,0 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,0 357 | 1247,0 358 | 1248,0 359 | 1249,0 360 | 1250,0 361 | 1251,0 362 | 1252,0 363 | 1253,0 364 | 1254,0 365 | 1255,0 366 | 1256,0 367 | 1257,0 368 | 1258,0 369 | 1259,0 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,0 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,0 385 | 1275,0 386 | 1276,0 387 | 1277,0 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,0 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,0 398 | 1288,0 399 | 1289,0 400 | 1290,0 401 | 1291,0 402 | 1292,0 403 | 1293,0 404 | 1294,0 405 | 1295,0 406 | 1296,0 407 | 1297,0 408 | 1298,0 409 | 1299,0 410 | 1300,0 411 | 1301,0 412 | 1302,0 413 | 1303,0 414 | 1304,0 415 | 1305,0 416 | 1306,0 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | 
-------------------------------------------------------------------------------- /001/solution/prediction_pt.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,0 7 | 897,0 8 | 898,1 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,1 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,1 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,0 30 | 920,0 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,0 35 | 925,0 36 | 926,1 37 | 927,0 38 | 928,1 39 | 929,1 40 | 930,0 41 | 931,0 42 | 932,0 43 | 933,0 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,0 52 | 942,0 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 | 954,0 65 | 955,1 66 | 956,1 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,0 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,1 75 | 965,0 76 | 966,1 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,0 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,1 90 | 980,1 91 | 981,0 92 | 982,1 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,1 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,1 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,0 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,0 139 | 1029,0 140 | 1030,1 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,1 145 | 1035,0 146 | 1036,0 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,0 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,0 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,0 161 | 1051,1 162 | 1052,1 163 | 1053,0 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,0 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,1 172 | 1062,0 173 | 1063,0 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,0 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,1 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,0 195 | 1085,0 196 | 1086,0 197 | 1087,0 198 | 1088,1 199 | 1089,1 200 | 1090,0 201 | 1091,1 202 | 1092,1 203 | 1093,0 204 | 1094,1 205 | 1095,1 206 | 1096,0 207 | 1097,0 208 | 1098,1 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,1 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,1 237 | 1127,0 238 | 1128,0 239 | 1129,0 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,1 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,1 252 | 1142,1 253 | 1143,0 254 | 1144,1 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,1 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,1 
275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,1 283 | 1173,0 284 | 1174,1 285 | 1175,1 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,1 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,1 309 | 1199,0 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,0 314 | 1204,0 315 | 1205,1 316 | 1206,1 317 | 1207,1 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,0 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,0 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,1 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,0 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,1 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,1 385 | 1275,1 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,1 407 | 1297,0 408 | 1298,0 409 | 1299,1 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,1 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /001/solution/submission.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,1 4 | 894,0 5 | 895,0 6 | 896,1 7 | 897,0 8 | 898,1 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,1 21 | 911,1 22 | 912,1 23 | 913,0 24 | 914,1 25 | 915,1 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,0 30 | 920,0 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,1 35 | 925,1 36 | 926,1 37 | 927,0 38 | 928,1 39 | 929,1 40 | 930,0 41 | 931,0 42 | 932,0 43 | 933,0 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,1 49 | 939,0 50 | 940,1 51 | 941,1 52 | 942,1 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 | 954,0 65 | 955,1 66 | 956,0 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,1 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,1 75 | 965,1 76 | 966,1 77 | 967,1 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,0 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,1 90 | 980,1 91 | 981,0 92 | 982,1 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,1 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,1 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,1 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 
1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,1 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,0 139 | 1029,0 140 | 1030,1 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,0 145 | 1035,0 146 | 1036,0 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,0 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,1 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,0 161 | 1051,1 162 | 1052,1 163 | 1053,0 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,1 168 | 1058,1 169 | 1059,0 170 | 1060,1 171 | 1061,1 172 | 1062,0 173 | 1063,0 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,1 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,1 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,0 195 | 1085,0 196 | 1086,0 197 | 1087,0 198 | 1088,1 199 | 1089,1 200 | 1090,0 201 | 1091,1 202 | 1092,1 203 | 1093,0 204 | 1094,1 205 | 1095,1 206 | 1096,0 207 | 1097,1 208 | 1098,1 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,1 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,1 237 | 1127,0 238 | 1128,1 239 | 1129,0 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,1 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,1 252 | 1142,1 253 | 1143,0 254 | 1144,1 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,1 271 | 1161,0 272 | 1162,1 273 | 1163,0 274 | 1164,1 275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,1 283 | 1173,0 284 | 1174,1 285 | 1175,1 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,1 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,0 309 | 1199,0 310 | 1200,0 311 | 1201,1 312 | 1202,0 313 | 1203,0 314 | 1204,0 315 | 1205,1 316 | 1206,1 317 | 1207,1 318 | 1208,1 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,0 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,1 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,1 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,1 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,1 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,1 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,1 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,1 385 | 1275,1 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 
1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,1 407 | 1297,0 408 | 1298,0 409 | 1299,1 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,1 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /002/exercise/readme.md: -------------------------------------------------------------------------------- 1 | # Exercise goal 2 | - Learn the basics of Machine Learning 3 | - Learn what an ML model is 4 | - Learn your first ML model, which is Linear Regression 5 | - Learn how to evaluate models through a loss function 6 | - Learn how to optimize a model through Gradient Descent 7 | 8 | # Data 9 | 10 | The file `housing_prices.csv` (see [./data/housing_prices.csv](https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/002/data/housing_prices.csv)). 11 | 12 | Data source/credit: [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). 13 | 14 | # Task 15 | - Follow the Jupyter notebook and complete the required tasks: 16 | 17 | 18 | `linear_regression.ipynb` 19 | 20 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/002/exercise/linear_regression.ipynb) 21 | 22 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/002/exercise/linear_regression.ipynb) 23 | 24 | 25 | # Resources on linear regression 26 | 27 | - [Mandatory for Beginners] In-depth theoretical videos for Linear Regression from Andrew Ng's Machine Learning course 28 | - [[Video] Linear Regression in one variable (refer to videos 2.1 - 2.7)](https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN) 29 | - [[Video] Linear Regression with multiple variables (refer to videos 4.1 - 4.7)](https://www.youtube.com/playlist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN) 30 | - [Optional] Alternative explanation of Linear Regression 31 | - [[Video] StatQuest: Linear Models Pt.1 - Linear Regression](https://www.youtube.com/watch?v=nk2CQITm_eo) 32 | 33 | - [Optional] Read this article if you have a basic idea of Linear Regression 34 | - [[Blog] Everything You Need To Know About Linear Regression](https://towardsdatascience.com/everything-you-need-to-know-about-linear-regression-b791e8f4bd7a) 35 | - [Mandatory for Beginners] After getting the theoretical background for Linear Regression, learn how to implement it practically using sklearn 36 | - [[Video] Linear Regression Python Sklearn [FROM SCRATCH]](https://www.youtube.com/watch?v=b0L47BeklTE) 37 | 38 | -------------------------------------------------------------------------------- /002/solution/readme.md: -------------------------------------------------------------------------------- 1 | # My solution 2 | - Follow the solution notebook: 3 | 4 | `linear_regression.ipynb` 5 | 6 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/002/solution/linear_regression.ipynb) 7 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/002/solution/linear_regression.ipynb) 8 | 9 | -------------------------------------------------------------------------------- 
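To make exercise 002's goals concrete — a model, a loss function, and gradient descent — here is a toy sketch in plain NumPy on synthetic data. It is illustrative only, not the solution notebook's code, and the learning rate and iteration count are arbitrary choices:

```python
# Fit y = w*x + b by minimizing the mean-squared-error loss with gradient descent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)                  # a single synthetic feature
y = 3.0 * x + 5.0 + rng.normal(0, 1, 100)    # noisy linear target

w, b = 0.0, 0.0   # model parameters
lr = 0.02         # learning rate

for _ in range(1000):
    y_hat = w * x + b                # model prediction
    error = y_hat - y
    loss = np.mean(error ** 2)       # MSE loss function
    grad_w = 2 * np.mean(error * x)  # dLoss/dw
    grad_b = 2 * np.mean(error)      # dLoss/db
    w -= lr * grad_w                 # gradient descent step
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final MSE={loss:.3f}")
```

Try varying `lr` to see how the learning rate affects whether and how fast the loss converges.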
/003/exercise/readme.md: -------------------------------------------------------------------------------- 1 | ## Description 2 | MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike. 3 | 4 | In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare. 5 | 6 | ## Practice Skills 7 | - Computer vision fundamentals including simple neural networks 8 | 9 | - Classification methods such as SVM and K-nearest neighbors 10 | 11 | ## Acknowledgements 12 | More details about the dataset, including algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html. The dataset is made available under a Creative Commons Attribution-Share Alike 3.0 license. 13 | 14 | At the practical level, if you are familiar with `keras`, you can access the data using: 15 | 16 | ```python 17 | from keras.datasets import mnist 18 | 19 | (x_train, y_train), (x_test, y_test) = mnist.load_data() 20 | ``` 21 | -------------------------------------------------------------------------------- /003/solution/README.md: -------------------------------------------------------------------------------- 1 | # My Solution 2 | 3 | In this folder, you will find a Jupyter notebook containing the solution to the MNIST handwritten digit recognition problem, solved via a simple neural network and a CNN: 4 | 5 | 6 | 7 | `digit_recog_nn.ipynb` 8 | 9 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/003/solution/digit_recog_nn.ipynb) 10 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/003/solution/digit_recog_nn.ipynb) 11 | 12 | ## Simple neural network: 13 | 14 | In this file, you will find a simple 2-layer neural network approach performing the digit recognition task with an accuracy of around 98%. This is done in TensorFlow, using Keras as the frontend/API. 15 | 16 | 17 | ## Convolutional Neural Network (CNN): 18 | 19 | In this file, you will find a convolutional neural network approach that includes techniques like Batch Normalization and Dropout to perform the digit recognition task. This is done in TensorFlow, using Keras as the frontend/API. 20 | It reaches more than 99% accuracy on the test set. 
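For orientation, here is a hedged sketch of the simple dense-network variant — the layer sizes, epochs, and other settings below are illustrative, not necessarily the notebook's actual choices:

```python
# A minimal dense network on MNIST in TensorFlow/Keras (illustrative settings).
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # scale pixels to [0, 1]

model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),     # 28x28 image -> 784-vector
    layers.Dense(128, activation='relu'),
    layers.Dense(10, activation='softmax'),   # one output per digit class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print(model.evaluate(x_test, y_test))   # around 98% test accuracy is typical
```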
21 | -------------------------------------------------------------------------------- /003/solution/images/ANN.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/images/ANN.jpg -------------------------------------------------------------------------------- /003/solution/images/cnn-procedure.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/images/cnn-procedure.png -------------------------------------------------------------------------------- /003/solution/test_cnn.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/test_cnn.h5 -------------------------------------------------------------------------------- /003/solution/test_nn.h5: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/003/solution/test_nn.h5 -------------------------------------------------------------------------------- /004/exercise/readme.md: -------------------------------------------------------------------------------- 1 | # Exercise goal 2 | - Implement simple code to train a text generation model using Keras and TensorFlow to produce a brand-new Arctic Monkeys song! 3 | 4 | # Data 5 | - The file `AM.txt` (see [./data/AM.txt](https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/004/data/AM.txt)). 6 | - The data has been collected by combining the lyrics of multiple [Arctic Monkeys songs](https://www.arcticmonkeys.com/). You can also form your own dataset by combining the lyrics of songs from your favorite artists. The dataset should have a decent number of lines; you can follow the dataset provided with this exercise to understand the format. 7 | 8 | # Task 9 | - Tokenize the text. 10 | - Create a simple neural network with an LSTM layer to train on the text. 11 | - Use some seed text as input to the trained network to generate new lyrics. 12 | -------------------------------------------------------------------------------- /004/solution/readme.md: -------------------------------------------------------------------------------- 1 | # My Solution 2 | 3 | In this folder, you will find the following Jupyter notebook using a simple LSTM neural network for text generation: 4 | 5 | `text_generation_model.ipynb` 6 | 7 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/004/solution/text_generation_model.ipynb) 8 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/004/solution/text_generation_model.ipynb) 9 | 10 | The first part of the notebook consists of tokenization of the text. 11 | 12 | This is followed by constructing a six-layer simple neural network using `TensorFlow`, consisting of an embedding layer, a bidirectional LSTM layer, a dropout layer, an LSTM layer, and finally two standard dense layers. 
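As a rough sketch of that architecture — `vocab_size`, `max_len`, and the layer sizes below are illustrative placeholders, not the notebook's actual values:

```python
# The six-layer architecture described above, assembled in TensorFlow/Keras.
import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size, max_len = 2000, 20   # would come from the tokenizer / padded sequences

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, 100),                               # embedding layer
    layers.Bidirectional(layers.LSTM(150, return_sequences=True)),   # bidirectional LSTM
    layers.Dropout(0.2),                                             # dropout layer
    layers.LSTM(100),                                                # LSTM layer
    layers.Dense(vocab_size // 2, activation='relu'),                # dense layer
    layers.Dense(vocab_size, activation='softmax'),                  # predict the next word
])
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```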
13 | 14 | After training, you will find a seed text `I really like the Arctic Monkeys and ` as input to the trained network to produce new lyrics. 15 | -------------------------------------------------------------------------------- /005/exercise/readme.md: -------------------------------------------------------------------------------- 1 | # Exercise Statement 2 | 3 | In this exercise, you will learn how to analyse the sentiment of text data using textual data processing techniques along with various conventional and advanced algorithms, such as: 4 | 5 | * Linear SVM 6 | * Decision tree 7 | * Naive Bayes 8 | * Logistic regression 9 | 10 | # Prerequisites 11 | 12 | This exercise goes from basic methods to advanced ones, so there are no hard prerequisites. It is recommended, though, that you know the basic ML workflow to grasp things conveniently. 13 | 14 | # Data source/summary: 15 | The two files in `./data` contain product reviews and metadata from the Amazon online sales platform. 16 | They have been taken from [jmcauley.ucsd.edu/data/amazon](http://jmcauley.ucsd.edu/data/amazon/). 17 | -------------------------------------------------------------------------------- /005/solution/readme.md: -------------------------------------------------------------------------------- 1 | # My Solution 2 | 3 | In this folder, you will find the following Jupyter notebook using 4 | a variety of classification methods, including linear SVM, decision tree, naive Bayes and logistic regression: 5 | 6 | `sentiment_analysis.ipynb` 7 | 8 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/005/solution/sentiment_analysis.ipynb) 9 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/005/solution/sentiment_analysis.ipynb) 10 | 11 | The first part of the notebook consists of loading and exploring the data. 12 | 13 | This is followed by a simple bag-of-words vectorization. 14 | 15 | Finally, we will perform training and evaluation of the classification task using the aforementioned models. 16 | -------------------------------------------------------------------------------- /005/solution/sentiment_analysis.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "<table class=\"tfo-notebook-buttons\" align=\"left\">\n", 8 | " <td>\n",
\n", 9 | " Run in Google Colab\n", 10 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 2, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "import numpy as np\n", 21 | "import random\n", 22 | "from sklearn.model_selection import train_test_split\n", 23 | "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n", 24 | "from sklearn.metrics import f1_score\n", 25 | "\n", 26 | "import json, urllib.request" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### Load Data" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "##### Data Class" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "Our first model will be automatically classifying positive and negative comments" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "Creating data to train the models is not an good approach, getting data by some other sources or by web crawling is one the best techniques, for negative and positive sentence data you can craw the amazon's review column of any product." 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 3, 60 | "metadata": { 61 | "tags": [] 62 | }, 63 | "outputs": [ 64 | { 65 | "output_type": "stream", 66 | "name": "stdout", 67 | "text": "I bought both boxed sets, books 1-5. Really a great series! Start book 1 three weeks ago and just finished book 5. Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved! Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page! These are books you won't be disappointed with.\n5.0\n" 68 | } 69 | ], 70 | "source": [ 71 | "# Storing the Path of file in a variable\n", 72 | "project_url = 'https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/master/005/'\n", 73 | "file_name = project_url+'data/books_small_10000.json'\n", 74 | "\n", 75 | "# Opening JSON file and reading it line by line.\n", 76 | "with urllib.request.urlopen(file_name) as f:\n", 77 | " for line in f:\n", 78 | " review = json.loads(line)\n", 79 | " # Getting review text\n", 80 | " print(review['reviewText'])\n", 81 | " # Getting the Overall rating\n", 82 | " print(review['overall'])\n", 83 | " break" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 4, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "# Storing the Path of file in a variable\n", 93 | "file_name = project_url+'data/books_small_10000.json'\n", 94 | "\n", 95 | "# Create empty list to store tuple objects of every data\n", 96 | "reviews = []\n", 97 | "\n", 98 | "# Opening JSON file and reading it line by line.\n", 99 | "with urllib.request.urlopen(file_name) as f:\n", 100 | " for line in f:\n", 101 | " review = json.loads(line)\n", 102 | " reviews.append((review['reviewText'], review['overall']))" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": 4, 108 | "metadata": {}, 109 | "outputs": [ 110 | { 111 | "data": { 112 | "text/plain": [ 113 | "5.0" 114 | ] 115 | }, 116 | "execution_count": 4, 117 | "metadata": {}, 118 | "output_type": "execute_result" 119 | } 120 | ], 121 | "source": [ 122 | "# Printing Random object from the reviews\n", 123 | "reviews[5]\n", 124 | "reviews[5][1]" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 5, 130 | "metadata": {}, 131 | "outputs": [], 
132 | "source": [ 133 | "# Now creating a Data class for our reviews and we are gonn initialise it with text and score\n", 134 | "# Now insted of appending this tuple we will create a Review object and pass in text and score.\n", 135 | "\n", 136 | "class Review:\n", 137 | " def __init__(self, text, score):\n", 138 | " self.text = text\n", 139 | " self.score = score\n", 140 | " \n", 141 | "\n", 142 | " # Storing the Path of file in a variable\n", 143 | "file_name = project_url+'data/books_small_10000.json'\n", 144 | "\n", 145 | "# Create empty list to store tuple objects of every data\n", 146 | "reviews = []\n", 147 | "\n", 148 | "# Opening JSON file and reading it line by line.\n", 149 | "with urllib.request.urlopen(file_name) as f:\n", 150 | " for line in f:\n", 151 | " review = json.loads(line)\n", 152 | " reviews.append(Review(review['reviewText'], review['overall']))\n" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 6, 158 | "metadata": {}, 159 | "outputs": [ 160 | { 161 | "data": { 162 | "text/plain": [ 163 | "5.0" 164 | ] 165 | }, 166 | "execution_count": 6, 167 | "metadata": {}, 168 | "output_type": "execute_result" 169 | } 170 | ], 171 | "source": [ 172 | "# Getting score \n", 173 | "reviews[5].score" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 6, 179 | "metadata": {}, 180 | "outputs": [ 181 | { 182 | "output_type": "execute_result", 183 | "data": { 184 | "text/plain": "'I hoped for Mia to have some peace in this book, but her story is so real and raw. Broken World was so touching and emotional because you go from Mia\\'s trauma to her trying to cope. I love the way the story displays how there is no \"just bouncing back\" from being sexually assaulted. Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings. I found myself wishing I could give her some of my courage and strength or even just to be there for her. 
Thank you Lizzy for putting a great character\\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\\'s.'" 185 | }, 186 | "metadata": {}, 187 | "execution_count": 6 188 | } 189 | ], 190 | "source": [ 191 | "reviews[5].text" 192 | ] 193 | }, 194 | { 195 | "cell_type": "code", 196 | "execution_count": 8, 197 | "metadata": {}, 198 | "outputs": [], 199 | "source": [ 200 | "import random\n", 201 | "\n", 202 | "class Sentiment:\n", 203 | "    NEGATIVE = 'NEGATIVE'\n", 204 | "    NEUTRAL = 'NEUTRAL'\n", 205 | "    POSITIVE = 'POSITIVE'\n", 206 | "\n", 207 | "class Review:\n", 208 | "    def __init__(self, text, score):\n", 209 | "        self.text = text\n", 210 | "        self.score = score\n", 211 | "        # initialising sentiments: 4/5 stars means +ve and 1/2 stars means -ve\n", 212 | "        self.sentiments = self.get_sentiments()\n", 213 | "        \n", 214 | "    def get_sentiments(self):\n", 215 | "        if self.score <= 2:\n", 216 | "            return Sentiment.NEGATIVE\n", 217 | "        elif self.score >= 4:\n", 218 | "            return Sentiment.POSITIVE\n", 219 | "        elif self.score == 3:\n", 220 | "            return Sentiment.NEUTRAL\n", 221 | "        \n", 222 | "\n", 223 | "class ReviewContainer:\n", 224 | "    def __init__(self, reviews):\n", 225 | "        self.reviews = reviews\n", 226 | "        \n", 227 | "    def get_text(self):\n", 228 | "        return [x.text for x in self.reviews]\n", 229 | "        \n", 230 | "    def get_sentiment(self):\n", 231 | "        return [x.sentiments for x in self.reviews]\n", 232 | "        \n", 233 | "    def evenly_distribute(self):\n", 234 | "        negative = list(filter(lambda x: x.sentiments == Sentiment.NEGATIVE, self.reviews))\n", 235 | "# This looks at all the reviews and filters them based upon negative sentiment, keeping \n", 236 | "# track of them in the negative list\n", 237 | "        positive = list(filter(lambda x: x.sentiments == Sentiment.POSITIVE, self.reviews))\n", 238 | "# Used to distribute classes evenly in the prep-data cell below.\n", 239 | "        positive_shrunk = positive[:len(negative)]\n", 240 | "        self.reviews = negative + positive_shrunk\n", 241 | "        # Shuffle so you won't know what comes when\n", 242 | "        random.shuffle(self.reviews)" 243 | ] 244 | }, 245 | { 246 | "cell_type": "markdown", 247 | "metadata": {}, 248 | "source": [ 249 | "### Load Data" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 7, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "# Storing the Path of file in a variable\n", 259 | "file_name = project_url + 'data/books_small_10000.json'\n", 260 | "\n", 261 | "# Create an empty list to store a Review object for each record\n", 262 | "reviews = []\n", 263 | "\n", 264 | "# Opening JSON file and reading it line by line.\n", 265 | "with urllib.request.urlopen(file_name) as f:\n", 266 | "    for line in f:\n", 267 | "        review = json.loads(line)\n", 268 | "        reviews.append(Review(review['reviewText'], review['overall']))" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": 10, 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "data": { 278 | "text/plain": [ 279 | "'POSITIVE'" 280 | ] 281 | }, 282 | "execution_count": 10, 283 | "metadata": {}, 284 | "output_type": "execute_result" 285 | } 286 | ], 287 | "source": [ 288 | "reviews[5].sentiments" 289 | ] 290 | }, 291 | { 292 | "cell_type": "markdown", 293 | "metadata": {}, 294 | "source": [ 295 | "Machine learning algorithms love numerical data, and it is hard to work with raw strings. So what we are going to do here is use a count vectorizer to break each sentence down against a dictionary of words.\n", 
"\n", 297 | "like we have two sentences,\n", 298 | "1. This book is great !\n", 299 | "2. This book was so bad.\n", 300 | "\n", 301 | "So the dictionary of the words will include This, book, is, great, was, so, bad.\n", 302 | "so we will map these dict with the sentences itself to see what words does a sentence have so\n", 303 | "\n", 304 | " This book is great was so bad\n", 305 | " 1. This book is great ! 1 | 1 | 1 | 1 | 0 |0 | 0|\n", 306 | " 2. This book was so bad 1 | 1 | 0 | 0 | 1 |1 | 1|\n", 307 | " 3. Was a great book 0 | 1 | 0 | 1 | 1 |0 | 0|\n", 308 | "\n", 309 | "\n", 310 | "so 1 means sentence have that word and 0 means sentence doesnt have that word, 3 rd sentence is that sentence we have never seen before but we can also map that using the knowlegede of previous words in the dictionary but we cant handle 'a' here because that is not included in the dictionary during the training time." 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": 11, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "# Using train test split to split data into test and training\n", 320 | "# What list you are passing here you will get 2 times of that\n", 321 | "\n", 322 | "training, test = train_test_split(reviews, test_size = 0.33, random_state = 42)\n", 323 | "\n", 324 | "cont = ReviewContainer(training)\n", 325 | "# We will use evenly distribute method.\n", 326 | "\n", 327 | "cont.evenly_distribute()\n" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": 12, 333 | "metadata": {}, 334 | "outputs": [ 335 | { 336 | "data": { 337 | "text/plain": [ 338 | "6700" 339 | ] 340 | }, 341 | "execution_count": 12, 342 | "metadata": {}, 343 | "output_type": "execute_result" 344 | } 345 | ], 346 | "source": [ 347 | "len(training)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": 13, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "# Now we build a classifier on that training set and after that we will test everything on our test data\n", 357 | "# We will have to pass our tarining data into the vectorizer, as we have to take text and predict if it is +ve or -ve \n", 358 | "# So what we are gonna pass into vectorizer is X which is our sentence and y = sentiments corresponding to that." 
359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 14, 364 | "metadata": {}, 365 | "outputs": [], 366 | "source": [ 367 | "train_x = [x.text for x in training]\n", 368 | "train_y = [x.sentiments for x in training]\n", 369 | "\n", 370 | "\n", 371 | "test_x = [x.text for x in test]\n", 372 | "test_y = [x.sentiments for x in test]\n", 373 | "\n", 374 | "# We are getting the same text again\n", 375 | "# train_x[0]\n", 376 | "# train_y[0]\n" 377 | ] 378 | }, 379 | { 380 | "cell_type": "markdown", 381 | "metadata": {}, 382 | "source": [ 383 | "### Bag of Words Vectorization" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "In order to perform machine learning on text documents, we first need to turn the content into numerical feature vectors " 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": 15, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [ 399 | "vectorizer = CountVectorizer()\n", 400 | "\n", 401 | "# Transforming String Data to numerical data.\n", 402 | "# Now this is the main data we want to use while training.\n", 403 | "train_x_vectors = vectorizer.fit_transform(train_x) # These are 2 steps fit and transform, we can also do them individually.\n", 404 | "test_x_vectors = vectorizer.transform(test_x) # We just wanna transform the test data not to fit that.\n", 405 | "\n", 406 | "\n", 407 | "# So now our main data will be train_x_vectors and train_y and we wanna fit our data around these." 408 | ] 409 | }, 410 | { 411 | "cell_type": "markdown", 412 | "metadata": {}, 413 | "source": [ 414 | "### Classification" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "#### Linear SVM" 422 | ] 423 | }, 424 | { 425 | "cell_type": "code", 426 | "execution_count": 16, 427 | "metadata": {}, 428 | "outputs": [], 429 | "source": [ 430 | "from sklearn import svm\n", 431 | "\n", 432 | "clf_svm = svm.SVC(kernel = 'linear')" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 17, 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "data": { 442 | "text/plain": [ 443 | "array(['POSITIVE'], dtype='\n", 35 | "\n", 36 | "As already mentioned, naive bayes is a probabilistic model and depends on bayes theorem for prediction. Hence, understanding of conditional probability and bayes theorem is the key to understanding this algorithm.\n", 37 | "\n", 38 | "Conditional probability is defined as the probability of an event occuring given that another event has already occured. For example, suppose we rolled a dice and we know that the number that came out is an even number. Now if we want to find the probability of getting a 2 on the dice, it is expressed using conditional probability. Mathematically, conditional probability is defined as follows:-" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "$$ \\large \\large P(A|B) = \\frac{P(A \\bigcap B)}{P(B)} $$" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "Bayes theorem is a very elegant theroem based on conditional probability that describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is named after Thomas Bayes. 
It is mathematically defined as follows:-\n", 53 | "\n", 54 | "$$ \\large P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n", 55 | "\n", 56 | "where A and B are two events.\n", 57 | "\n", 58 | "Each term in the above equation has been given a special name:-\n", 59 | "\n", 60 | "P(A|B) is known as Posterior Probability
\n", 61 | "P(B|A) is known as Likelihood Probability
\n", 62 | "P(A) is known as Prior Probability, and
\n", 63 | "P(B) is known as Evidence Probability
" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## The Mathematical model of Naive Bayes Algorithm \n", 71 | "\n", 72 | "Suppose we have a point $x$ as follows:-\n", 73 | "\n", 74 | "$$ \\large x=[x_1, x_2, x_3,....,x_n]$$\n", 75 | "\n", 76 | "Our task is to assign a class or a label to this point 'k'. If we have 'k' classes, then we have to find the probability of the point $x$ belonging to class $C_k$. The class with highest probability will be assigned as the label of $x$. The probablity of a class $C_k$ given $x$ can be calculated using Bayes Theorem as follows:-\n", 77 | "\n", 78 | "$$\\large P(C_k|x) = \\frac{P(x|C_k)P(C_k)}{P(x)} \\;\\;\\;\\;\\;\\;\\; - \\;\\;(i)$$\n", 79 | "\n", 80 | "So, to summarize, if our dataset has 3 classes (setosa, virginica and versicolor for example, then we have to calculate P(setosa|x), P(virginica|x) and P(versicolor|x) and the highest probability will be assigned as the label x.\n", 81 | "\n", 82 | "Now, in our algorithm, we can omit the Evidence term, because is will remain constant for all the probabilities. This is done just to simplify the computations.\n", 83 | "\n", 84 | "Now, $P(C_k|x)$ can also be written as $P(C_k, x)$, and if we replace $x$ with its value, we get $P(C_k, x_1, x_2, x_3, ...., x_n)$. So, till now, we have basically transformed $P(C_k, x)$ into $$P(C_k, x_1, x_2, x_3, ...., x_n)\\;\\;\\;\\;\\;\\;\\; -\\;\\;\\;-(ii) $$. Things will start to get interesting now. \n", 85 | "\n", 86 | "In eq (ii), we can interchanging the terms inside the parenthesis won't change it's meaning. So, I am shifting the $C_k$ to the end. So, our equation will look like this - $P(x_1, x_2, x_3, ...., x_n, C_k)$. Now, if consider $x_1$ as event A and remaining terms as event B and apply bayes theorem, we will get:- \n", 87 | "\n", 88 | "$$\\large P(x_1, x_2, x_3, ...., x_n, C_k) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)\\;\\;\\;\\;\\;\\;-(iii)\\;\\;\\;$$\n", 89 | "\n", 90 | "(Omitting the deniminator term as discussed [here](#omitting))\n", 91 | "\n", 92 | "If we keep applying bayes theorem in equation (iii), we will get:-\n", 93 | "\n", 94 | "$ \\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(C_k, x_1, x_2, x_3, ...., x_n)$
\n", 95 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)$
\n", 96 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k)$
\n", 97 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = \\;\\;...$
\n", 98 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k) \\;\\;-(iv)$
\n", 99 | "\n", 100 | "Now, in naive bayes, we assume that the features are conditionally independent of each other. If features are independent then:-\n", 101 | "\n", 102 | "$$ \\large P(x_i|x_{i+1}, ..., x_n, C_k) = P(x_i|C_k)$$\n", 103 | "\n", 104 | "If we apply this rule in equation (iv) we get:-\n", 105 | "\n", 106 | "$\\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k)$\n", 107 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k).....$
\n", 108 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)\\pi_{i=0}^{n}P(x_i|C_k)$\n", 109 | "\n", 110 | "Hence,\n", 111 | "\n", 112 | "$$ \\large P(C_k|x_n) = P(C_k)\\prod_{i=0}^{n}P(x_i|C_k) $$\n", 113 | "\n", 114 | "This is how we predict in Naive Bayes Algorithm" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "## Naive Bayes with a simple example\n", 122 | "\n", 123 | "To understand the naive bayes with ta simple example, check out [this](http://shatterline.com/blog/2013/09/12/not-so-naive-classification-with-the-naive-bayes-classifier/) blog by [shatterline](http://shatterline.com/blog)." 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## Exercise\n", 131 | "\n", 132 | "The objective of this excercise is to solidify your understanding of Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. After implementing this class, you will run it on the same dataset shown in [this](#ex) example.\n", 133 | "\n", 134 | "You can refer the solution if you feel stuck somewhere. Also, one thing you need to be make sure is to use log of probabilities that you calculate. This is important because the probabilities you calculate will have very large decimal places but python will store only the 1st 16 places. This could lead to some discrepancies in the result so make sure to use log probabilities. If would also encourage you to comment your code as it is a good practice.\n", 135 | "\n", 136 | "Good luck" 137 | ] 138 | }, 139 | { 140 | "cell_type": "markdown", 141 | "metadata": {}, 142 | "source": [ 143 | "## Importing the libraries and loding the data" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": null, 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "import numpy as np\n", 153 | "import pandas as pd" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "df = pd.DataFrame(\n", 163 | " { \n", 164 | " 'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'], \n", 165 | " 'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', \"Hot\", 'Mild'],\n", 166 | " 'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],\n", 167 | " 'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],\n", 168 | " 'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']\n", 169 | " }\n", 170 | ")" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "df" 180 | ] 181 | }, 182 | { 183 | "cell_type": "code", 184 | "execution_count": null, 185 | "metadata": {}, 186 | "outputs": [], 187 | "source": [ 188 | "X = df.iloc[:, :-1]\n", 189 | "y = df.iloc[:, -1:]" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "# Start coding here..." 
197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "class NaiveBayes:\n", 206 | " def __init__(self, X, y):\n", 207 | " '''\n", 208 | " This method initializes all the required data fields of the NaiveBayes class\n", 209 | " \n", 210 | " Input -\n", 211 | " X: A pandas dataframe consisting of all the dependent variables\n", 212 | " y: A pandas dataframe consisting of labels\n", 213 | " '''\n", 214 | " # Initializing the Dependent and Independent Variables\n", 215 | " self.X = X\n", 216 | " self.y = y\n", 217 | " \n", 218 | " # Initializing the column name y. (came in handy for me. If you do not require it, then you can delete it)\n", 219 | " self.y_label = y.columns[0]\n", 220 | " \n", 221 | " # Initializing the variables to store class priors. Initiallt set to None. The will the assigned the correct values by\n", 222 | " # executing the calculate_prior method\n", 223 | " # p_pos is probability of positive class\n", 224 | " # p_neg is probability of negative class\n", 225 | " self.p_pos = None\n", 226 | " self.p_neg = None\n", 227 | " \n", 228 | " # A dictionary to store all likelihood probabilities\n", 229 | " self.likelihoods = {}\n", 230 | " \n", 231 | " # Executing calculate_prior and calculate_likelihood to calculate prior and likelihood probabilities\n", 232 | " self.calculate_prior()\n", 233 | " self.calculate_likelihood()\n", 234 | " \n", 235 | " \n", 236 | " def calculate_prior(self):\n", 237 | " '''\n", 238 | " Method for calculating the prior probabilities\n", 239 | " \n", 240 | " Input - None\n", 241 | " \n", 242 | " Expected output: Expected to assign p_pos and p_neg their correct log probability values. No need to return anything\n", 243 | " '''\n", 244 | " # write your code here ...\n", 245 | " \n", 246 | " def calculate_likelihood(self):\n", 247 | " '''\n", 248 | " Method for calculating the all the likelihood probabilities\n", 249 | " \n", 250 | " Input - None\n", 251 | " \n", 252 | " Expected output: Expected to create a dictionary of likelihood probabilities and assign it to likelihoods.\n", 253 | " '''\n", 254 | " # write your code here ...\n", 255 | " \n", 256 | " def predict(self, test_data):\n", 257 | " '''\n", 258 | " A method to predict the label for the input\n", 259 | " \n", 260 | " Input -\n", 261 | " test_data: A dataframe of dependent variables\n", 262 | " \n", 263 | " Expected output: Expected to return a dataframe of predictions. The column name of dataframe should match column name of y\n", 264 | " '''\n", 265 | " # write your code here ...\n" 266 | ] 267 | }, 268 | { 269 | "cell_type": "markdown", 270 | "metadata": {}, 271 | "source": [ 272 | "# Test if your Code is working as expected..." 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "# Create the object\n", 282 | "nb = NaiveBayes(X, y)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# Check if your code is predicting correctly\n", 292 | "assert nb.predict(X).equals(pd.DataFrame({'Play': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']})) == True, 'The prediction received is wrong. Kindly recheck your code. 
Refer the solution if you find yourself stuck somewhere'" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [] 299 | } 300 | ], 301 | "metadata": { 302 | "kernelspec": { 303 | "display_name": "Python 3", 304 | "language": "python", 305 | "name": "python3" 306 | }, 307 | "language_info": { 308 | "codemirror_mode": { 309 | "name": "ipython", 310 | "version": 3 311 | }, 312 | "file_extension": ".py", 313 | "mimetype": "text/x-python", 314 | "name": "python", 315 | "nbconvert_exporter": "python", 316 | "pygments_lexer": "ipython3", 317 | "version": "3.8.3" 318 | } 319 | }, 320 | "nbformat": 4, 321 | "nbformat_minor": 4 322 | } 323 | -------------------------------------------------------------------------------- /008/exercise/README.md: -------------------------------------------------------------------------------- 1 | # Naive Bayes Classifier from scratch 2 | 3 | ## Exercise 4 | 5 | The objective of this exercise is to solidify your understanding of the Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. The exercise notebook consists of detailed notes on Conditional Probability, Bayes Theorem and the Naive Bayes Algorithm. It also contains some starter code and instructions where you have to complete the code. 6 | 7 | You can refer to the solution if you feel stuck somewhere. Also, one thing you need to make sure of is to use the log of the probabilities that you calculate. This is important because the probabilities you calculate can have very many decimal places, but Python stores only about the first 16 significant digits. This could lead to some discrepancies in the result, so make sure to use log probabilities. I would also encourage you to comment your code, as it is a good practice. 8 | 9 | Good luck -------------------------------------------------------------------------------- /008/solution/NaiveBayes Solution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Naive Bayes Classifier from scratch" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Table of Contents\n", 15 | "\n", 16 | "- [Introduction](#Introduction)\n", 17 | "- [Conditional Probability and Bayes Theorem refresher](#cp-and-bt)\n", 18 | "- [The Mathematical model of Naive Bayes Algorithm](#mm)\n", 19 | "- [Naive Bayes with a simple example](#ex)\n", 20 | "- [Exercise](#exer)\n", 21 | "\n", 22 | "\n", 23 | "## Introduction\n", 24 | "\n", 25 | "Naive Bayes is a probability-based classification technique for classifying labelled data. It makes use of Bayes Theorem in order to predict the class of the given data. It is called Naive because it assumes that the features are conditionally independent of each other. Although it is named naive, it is a very efficient model, often used as a baseline for text classification and recommender systems.\n", 26 | "\n", 27 | "## Conditional Probability and Bayes Theorem Refresher \n", 28 | "\n", 29 | "As already mentioned, naive bayes is a probabilistic model and depends on bayes theorem for prediction. 
Hence, an understanding of conditional probability and Bayes theorem is the key to understanding this algorithm.\n", 30 | "\n", 31 | "Conditional probability is defined as the probability of an event occurring given that another event has already occurred. For example, suppose we rolled a die and we know that the number that came out is an even number. Now if we want to find the probability of getting a 2 on the die, it is expressed using conditional probability. Mathematically, conditional probability is defined as follows:-" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "$$ \\large P(A|B) = \\frac{P(A \\bigcap B)}{P(B)} $$" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "Bayes theorem is a very elegant theorem based on conditional probability that describes the probability of an event based on prior knowledge of conditions that might be related to the event. It is named after Thomas Bayes. It is mathematically defined as follows:-\n", 46 | "\n", 47 | "$$ \\large P(A|B) = \\frac{P(B|A)P(A)}{P(B)} $$\n", 48 | "\n", 49 | "where A and B are two events.\n", 50 | "\n", 51 | "Each term in the above equation has been given a special name:-\n", 52 | "\n", 53 | "P(A|B) is known as Posterior Probability
\n", 54 | "P(B|A) is known as Likelihood Probability
\n", 55 | "P(A) is known as Prior Probability, and
\n", 56 | "P(B) is known as Evidence Probability
" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## The Mathematical model of Naive Bayes Algorithm \n", 64 | "\n", 65 | "Suppose we have a point $x$ as follows:-\n", 66 | "\n", 67 | "$$ \\large x=[x_1, x_2, x_3,....,x_n]$$\n", 68 | "\n", 69 | "Our task is to assign a class or a label to this point 'k'. If we have 'k' classes, then we have to find the probability of the point $x$ belonging to class $C_k$. The class with highest probability will be assigned as the label of $x$. The probablity of a class $C_k$ given $x$ can be calculated using Bayes Theorem as follows:-\n", 70 | "\n", 71 | "$$\\large P(C_k|x) = \\frac{P(x|C_k)P(C_k)}{P(x)} \\;\\;\\;\\;\\;\\;\\; - \\;\\;(i)$$\n", 72 | "\n", 73 | "So, to summarize, if our dataset has 3 classes (setosa, virginica and versicolor for example, then we have to calculate P(setosa|x), P(virginica|x) and P(versicolor|x) and the highest probability will be assigned as the label x.\n", 74 | "\n", 75 | "Now, in our algorithm, we can omit the Evidence term, because is will remain constant for all the probabilities. This is done just to simplify the computations.\n", 76 | "\n", 77 | "Now, $P(C_k|x)$ can also be written as $P(C_k, x)$, and if we replace $x$ with its value, we get $P(C_k, x_1, x_2, x_3, ...., x_n)$. So, till now, we have basically transformed $P(C_k, x)$ into $$P(C_k, x_1, x_2, x_3, ...., x_n)\\;\\;\\;\\;\\;\\;\\; -\\;\\;\\;-(ii) $$. Things will start to get interesting now. \n", 78 | "\n", 79 | "In eq (ii), we can interchanging the terms inside the parenthesis won't change it's meaning. So, I am shifting the $C_k$ to the end. So, our equation will look like this - $P(x_1, x_2, x_3, ...., x_n, C_k)$. Now, if consider $x_1$ as event A and remaining terms as event B and apply bayes theorem, we will get:- \n", 80 | "\n", 81 | "$$\\large P(x_1, x_2, x_3, ...., x_n, C_k) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)\\;\\;\\;\\;\\;\\;-(iii)\\;\\;\\;$$\n", 82 | "\n", 83 | "(Omitting the deniminator term as discussed [here](#omitting))\n", 84 | "\n", 85 | "If we keep applying bayes theorem in equation (iii), we will get:-\n", 86 | "\n", 87 | "$ \\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(C_k, x_1, x_2, x_3, ...., x_n)$
\n", 88 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2, x_3, ...., x_n, C_k)$
\n", 89 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k)$
\n", 90 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = \\;\\;...$
\n", 91 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k) \\;\\;-(iv)$
\n", 92 | "\n", 93 | "Now, in naive bayes, we assume that the features are conditionally independent of each other. If features are independent then:-\n", 94 | "\n", 95 | "$$ \\large P(x_i|x_{i+1}, ..., x_n, C_k) = P(x_i|C_k)$$\n", 96 | "\n", 97 | "If we apply this rule in equation (iv) we get:-\n", 98 | "\n", 99 | "$\\quad \\quad P(C_k, x_1, x_2, x_3, ...., x_n) = P(x_1|x_2, x_3, ...., x_n, C_k)P(x_2| x_3, ...., x_n, C_k)P(x_3, ...., x_n, C_k).....P(x_{n-1}|x_n, C_k)P(x_n|C_k)P(C_k)$\n", 100 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)P(x_1|C_k)P(x_2|C_k)P(x_3|C_k).....$
\n", 101 | "$ \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad \\quad= P(C_k)\\pi_{i=0}^{n}P(x_i|C_k)$\n", 102 | "\n", 103 | "Hence,\n", 104 | "\n", 105 | "$$ \\large P(C_k|x_n) = P(C_k)\\prod_{i=0}^{n}P(x_i|C_k) $$\n", 106 | "\n", 107 | "This is how we predict in Naive Bayes Algorithm" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## Naive Bayes with a simple example\n", 115 | "\n", 116 | "To understand the naive bayes with ta simple example, check out [this](http://shatterline.com/blog/2013/09/12/not-so-naive-classification-with-the-naive-bayes-classifier/) blog by [shatterline](http://shatterline.com/blog)." 117 | ] 118 | }, 119 | { 120 | "cell_type": "markdown", 121 | "metadata": {}, 122 | "source": [ 123 | "## Exercise\n", 124 | "\n", 125 | "The objective of this excercise is to solidify your understanding of Naive Bayes algorithm by implementing it from scratch. You will be creating a class **NaiveBayes** and defining the methods in it that learn from data and predict. After implementing this class, you will run it on the same dataset shown in [this](#ex) example.\n", 126 | "\n", 127 | "You can refer the solution if you feel stuck somewhere. Also, one thing you need to be make sure is to use log of probabilities that you calculate. This is important because the probabilities you calculate will have very large decimal places but python will store only the 1st 16 places. This could lead to some discrepancies in the result so make sure to use log probabilities. If would also encourage you to comment your code as it is a good practice.\n", 128 | "\n", 129 | "Good luck" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "## Importing the libraries and loding the data" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 1, 142 | "metadata": {}, 143 | "outputs": [], 144 | "source": [ 145 | "import numpy as np\n", 146 | "import pandas as pd" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 2, 152 | "metadata": {}, 153 | "outputs": [], 154 | "source": [ 155 | "df = pd.DataFrame(\n", 156 | " { \n", 157 | " 'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'], \n", 158 | " 'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', \"Hot\", 'Mild'],\n", 159 | " 'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],\n", 160 | " 'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],\n", 161 | " 'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']\n", 162 | " }\n", 163 | ")" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 3, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/html": [ 174 | "
\n", 175 | "\n", 188 | "\n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | "
OutlookTemperatureHumidityWindPlay
0SunnyHotHighWeakNo
1SunnyHotHighStrongNo
2OvercastHotHighWeakYes
3RainMildHighWeakYes
4RainCoolNormalWeakYes
5RainCoolNormalStrongNo
6OvercastCoolNormalStrongYes
7SunnyMildHighWeakNo
8SunnyCoolNormalWeakYes
9RainMildNormalWeakYes
10SunnyMildNormalStrongYes
11OvercastMildHighStrongYes
12OvercastHotNormalWeakYes
13RainMildHighStrongNo
\n", 314 | "
" 315 | ], 316 | "text/plain": [ 317 | " Outlook Temperature Humidity Wind Play\n", 318 | "0 Sunny Hot High Weak No\n", 319 | "1 Sunny Hot High Strong No\n", 320 | "2 Overcast Hot High Weak Yes\n", 321 | "3 Rain Mild High Weak Yes\n", 322 | "4 Rain Cool Normal Weak Yes\n", 323 | "5 Rain Cool Normal Strong No\n", 324 | "6 Overcast Cool Normal Strong Yes\n", 325 | "7 Sunny Mild High Weak No\n", 326 | "8 Sunny Cool Normal Weak Yes\n", 327 | "9 Rain Mild Normal Weak Yes\n", 328 | "10 Sunny Mild Normal Strong Yes\n", 329 | "11 Overcast Mild High Strong Yes\n", 330 | "12 Overcast Hot Normal Weak Yes\n", 331 | "13 Rain Mild High Strong No" 332 | ] 333 | }, 334 | "execution_count": 3, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "df" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 4, 346 | "metadata": {}, 347 | "outputs": [], 348 | "source": [ 349 | "X = df.iloc[:, :-1]\n", 350 | "y = df.iloc[:, -1:]" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "# Start coding here..." 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 5, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "class NaiveBayes:\n", 367 | " def __init__(self, X, y):\n", 368 | " '''\n", 369 | " This method initializes all the required data fields of the NaiveBayes class\n", 370 | " \n", 371 | " Input -\n", 372 | " X: A pandas dataframe consisting of all the dependent variables\n", 373 | " y: A pandas dataframe consisting of labels\n", 374 | " '''\n", 375 | " # Initializing the Dependent and Independent Variables\n", 376 | " self.X = X\n", 377 | " self.y = y\n", 378 | " \n", 379 | " # Initializing the column name y. (came in handy for me. If you do not require it, then you can delete it)\n", 380 | " self.y_label = y.columns[0]\n", 381 | " \n", 382 | " # Initializing the variables to store class priors. Initiallt set to None. The will the assigned the correct values by\n", 383 | " # executing the calculate_prior method\n", 384 | " # p_pos is probability of positive class\n", 385 | " # p_neg is probability of negative class\n", 386 | " self.p_pos = None\n", 387 | " self.p_neg = None\n", 388 | " \n", 389 | " # A dictionary to store all likelihood probabilities\n", 390 | " self.likelihoods = {}\n", 391 | " \n", 392 | " # Executing calculate_prior and calculate_likelihood to calculate prior and likelihood probabilities\n", 393 | " self.calculate_prior()\n", 394 | " self.calculate_likelihood()\n", 395 | " \n", 396 | " \n", 397 | " def calculate_prior(self):\n", 398 | " '''\n", 399 | " Method for calculating the prior probabilities\n", 400 | " \n", 401 | " Input - None\n", 402 | " \n", 403 | " Expected output: Expected to assign p_pos and p_neg their correct log probability values. 
No need to return anything\n", 404 | "        '''\n", 405 | "        # Get the total number of positive points\n", 406 | "        total_positive = self.y[self.y_label][self.y[self.y_label] == 'Yes'].count()\n", 407 | "        \n", 408 | "        # Get the total number of negative points\n", 409 | "        total_negative = self.y[self.y_label][self.y[self.y_label] == 'No'].count()\n", 410 | "        \n", 411 | "        # Get the total number of points\n", 412 | "        total = self.y[self.y_label].count()\n", 413 | "        \n", 414 | "        # Calculate log probability of positive class\n", 415 | "        self.p_pos = np.log(total_positive / total)\n", 416 | "        # Calculate log probability of negative class\n", 417 | "        self.p_neg = np.log(total_negative / total)\n", 418 | "    \n", 419 | "    def calculate_likelihood(self):\n", 420 | "        '''\n", 421 | "        Method for calculating all the likelihood probabilities\n", 422 | "        \n", 423 | "        Input - None\n", 424 | "        \n", 425 | "        Expected output: Expected to create a dictionary of likelihood probabilities and assign it to likelihoods.\n", 426 | "        '''\n", 427 | "        # Concatenating X and y for easy access to features and labels\n", 428 | "        df = pd.concat([self.X, self.y], axis=1)\n", 429 | "        \n", 430 | "        # Getting all unique class labels (Yes and No)\n", 431 | "        labels = df[self.y_label].unique()\n", 432 | "        \n", 433 | "        # Get the count of all positive and negative points\n", 434 | "        total_positive = self.y[self.y[self.y_label] == 'Yes'][self.y_label].count()\n", 435 | "        total_negative = self.y[self.y[self.y_label] == 'No'][self.y_label].count()\n", 436 | "        \n", 437 | "        # Traversing through each column of the dataframe\n", 438 | "        for feature_name in self.X.columns:\n", 439 | "            # Storing likelihoods for each value in this column\n", 440 | "            self.likelihoods[feature_name] = {}\n", 441 | "            \n", 442 | "            # Traversing through each unique value in the column\n", 443 | "            for feature in df.loc[:, feature_name].unique():\n", 444 | "                # Calculate P(feature_name|'yes')\n", 445 | "                feature_given_yes = df[(df[feature_name] == feature) & (df[self.y_label] == 'Yes')][feature_name].count()\n", 446 | "                \n", 447 | "                # Get the log probability\n", 448 | "                feature_given_yes = 0 if feature_given_yes == 0 else np.log( feature_given_yes / total_positive )\n", 449 | "                \n", 450 | "                # Calculate P(feature_name|'no')\n", 451 | "                feature_given_no = df[(df[feature_name] == feature) & (df[self.y_label] == 'No')][feature_name].count()\n", 452 | "                \n", 453 | "                # Get the log probability\n", 454 | "                feature_given_no = 0 if feature_given_no == 0 else np.log( feature_given_no / total_negative )\n", 455 | "                \n", 456 | "                # Store the likelihoods in the dict\n", 457 | "                self.likelihoods[feature_name][f'{feature}|yes'] = feature_given_yes\n", 458 | "                self.likelihoods[feature_name][f'{feature}|no'] = feature_given_no\n", 459 | "    \n", 460 | "    def predict(self, test_data):\n", 461 | "        '''\n", 462 | "        A method to predict the label for the input\n", 463 | "        \n", 464 | "        Input -\n", 465 | "        test_data: A dataframe of dependent variables\n", 466 | "        \n", 467 | "        Expected output: Expected to return a dataframe of predictions. 
The column name of the dataframe should match the column name of y\n", 468 | "        '''\n", 469 | "        feature_names = test_data.columns\n", 470 | "        # List to store the predictions\n", 471 | "        prediction = []\n", 472 | "        \n", 473 | "        # Traversing through the dataframe\n", 474 | "        for row in test_data.itertuples():\n", 475 | "            # A list to store P(y=yes|X) and P(y=no|X)\n", 476 | "            p_yes_given_X = []\n", 477 | "            p_no_given_X = []\n", 478 | "            \n", 479 | "            # Traversing through each row of the dataframe to get the value of each column\n", 480 | "            for i in range(len(row) - 1):\n", 481 | "                \n", 482 | "                # Slicing off the 1st element as it is not needed (the index)\n", 483 | "                row = row[1:]\n", 484 | "                \n", 485 | "                # Getting the likelihood probabilities from the likelihood dict and storing them in the list\n", 486 | "                p_yes_given_X.append(self.likelihoods[feature_names[i]][f'{row[0]}|yes'])\n", 487 | "                p_no_given_X.append(self.likelihoods[feature_names[i]][f'{row[0]}|no'])\n", 488 | "            \n", 489 | "            # Adding probability of positive and negative class to the list\n", 490 | "            p_yes_given_X.append(self.p_pos)\n", 491 | "            p_no_given_X.append(self.p_neg)\n", 492 | "            \n", 493 | "            # Since we are using log probabilities, we can add them instead of multiplying, since log(a*b) = log(a) + log(b)\n", 494 | "            p_yes_given_X = np.sum(p_yes_given_X)\n", 495 | "            p_no_given_X = np.sum(p_no_given_X)\n", 496 | "\n", 497 | "            # If p_yes_given_X > p_no_given_X, then we assign the positive class, i.e. True, else False\n", 498 | "            # Add the prediction to the prediction list\n", 499 | "            prediction.append(p_yes_given_X > p_no_given_X)\n", 500 | "        \n", 501 | "        # Creating the prediction dataframe\n", 502 | "        prediction = pd.DataFrame({self.y_label: prediction})\n", 503 | "        \n", 504 | "        # Converting True to Yes and False to No\n", 505 | "        prediction[self.y_label] = prediction[self.y_label].map({True: 'Yes', False: 'No'})\n", 506 | "        \n", 507 | "        # return the prediction\n", 508 | "        return prediction" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "# Test if your Code is working as expected..." 516 | ] 517 | }, 518 | { 519 | "cell_type": "code", 520 | "execution_count": 6, 521 | "metadata": {}, 522 | "outputs": [], 523 | "source": [ 524 | "# Create the object\n", 525 | "nb = NaiveBayes(X, y)" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 7, 531 | "metadata": {}, 532 | "outputs": [], 533 | "source": [ 534 | "# Check if your code is predicting correctly\n", 535 | "assert nb.predict(X).equals(pd.DataFrame({'Play': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No']})) == True, 'The prediction received is wrong. Kindly recheck your code. 
Refer the solution if you find yourself stuck somewhere'" 536 | ] 537 | }, 538 | { 539 | "cell_type": "markdown", 540 | "metadata": {}, 541 | "source": [] 542 | } 543 | ], 544 | "metadata": { 545 | "kernelspec": { 546 | "display_name": "Python 3", 547 | "language": "python", 548 | "name": "python3" 549 | }, 550 | "language_info": { 551 | "codemirror_mode": { 552 | "name": "ipython", 553 | "version": 3 554 | }, 555 | "file_extension": ".py", 556 | "mimetype": "text/x-python", 557 | "name": "python", 558 | "nbconvert_exporter": "python", 559 | "pygments_lexer": "ipython3", 560 | "version": "3.8.3" 561 | } 562 | }, 563 | "nbformat": 4, 564 | "nbformat_minor": 4 565 | } 566 | -------------------------------------------------------------------------------- /009/exercise/readme.md: -------------------------------------------------------------------------------- 1 | ### Problem Statement 2 | The main idea is to predict whether a current health-insurance client would also be interested in vehicle insurance. 3 | 4 | ### Task Details 5 | 6 | Your client is an Insurance company that has provided health insurance to its customers. Now they need your help in building a model to predict whether the policyholders (customers) from the past year will also be interested in Vehicle Insurance provided by the company. 7 | 8 | Now, in order to predict whether a customer would be interested in Vehicle insurance, you have information about demographics (gender, age, region code type), Vehicles (Vehicle Age, Damage), Policy (Premium, sourcing channel) etc. 9 | 10 | ### Evaluation Metric 11 | The evaluation metric for this hackathon is ROC_AUC score. 12 | 13 | ### Suggested Model 14 | KNN classifier 15 | 16 | Source : https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction/tasks?taskId=2055 -------------------------------------------------------------------------------- /009/solution/readme.md: -------------------------------------------------------------------------------- 1 | # My Solution 2 | 3 | See the following jupyter notebook: 4 | 5 | `insurance_cross_sell.ipynb` 6 | 7 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/009/solution/insurance_cross_sell.ipynb) 8 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/009/solution/insurance_cross_sell.ipynb) 9 | 10 | It contains a kNN solution for the classification problem. 11 | -------------------------------------------------------------------------------- /010/exercise/knn_starter_exercise.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "
\n", 9 | " Run in Google Colab\n", 10 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## Importing libraries" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 2, 24 | "metadata": {}, 25 | "outputs": [], 26 | "source": [ 27 | "# importing necessary libraries\n", 28 | "import numpy as np\n", 29 | "import pandas as pd\n", 30 | "import matplotlib.pyplot as plt\n", 31 | "import math\n", 32 | "from collections import Counter\n", 33 | "from copy import deepcopy" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## Loading sample data" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "# Sample data with (xp,yp) as positive-class coordinates and (xn,yn) as negative-class coordinates\n", 50 | "#xp = [-1,0,4,6,7]\n", 51 | "#yp = [-2,0,4,2,9]\n", 52 | "#xn = [-3,1,2,4,6,4]\n", 53 | "#yn = [-2,5,4,2,4,6]\n", 54 | "\n", 55 | "xp=[]\n", 56 | "yp=[]\n", 57 | "xn=[]\n", 58 | "yn=[]\n", 59 | "\n", 60 | "data = [\n", 61 | " (-1,-2,'Positive'),\n", 62 | " (7,9,'Positive'),\n", 63 | " (0,0,'Positive'),\n", 64 | " (4,4,'Positive'),\n", 65 | " (6,2,'Positive'),\n", 66 | " (-3,-2,'Negative'),\n", 67 | " (1,5,'Negative'),\n", 68 | " (2,4,'Negative'),\n", 69 | " (4,2,'Negative'),\n", 70 | " (6,4,'Negative'),\n", 71 | " (4,6,'Negative'),\n", 72 | "]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 4, 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "name": "stdout", 82 | "output_type": "stream", 83 | "text": [ 84 | "[-1, 7, 0, 4, 6] [-2, 9, 0, 4, 2] [-3, 1, 2, 4, 6, 4] [-2, 5, 4, 2, 4, 6]\n" 85 | ] 86 | } 87 | ], 88 | "source": [ 89 | "# Append data points in respective arrays for data plotting\n", 90 | "for i in range(0,len(data)):\n", 91 | " if(data[i][2]=='Positive'):\n", 92 | " xp.append(data[i][0])\n", 93 | " yp.append(data[i][1])\n", 94 | " else: \n", 95 | " xn.append(data[i][0])\n", 96 | " yn.append(data[i][1])\n", 97 | " \n", 98 | "print(xp,yp,xn,yn)" 99 | ] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## Plotting graph for visualization" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXgAAAEGCAYAAABvtY4XAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAUpklEQVR4nO3df5Dcd33f8eeb8xHOvyQaO8I+qzVtmZsksvGhq1OiqeewSGQmDlEzGQ+uoW0mLf0D/4A6SlHS4jHTNJpqwEmmmU492AmdgD2qkdXAUJSMdRtKWoglS+hshApxoNbJ2HiIZMu5wkl5949dSSdzd9JJ+/nu3Wefj5md2/3eel+fr1d+af35fvf7icxEklSf1/V6AJKkMix4SaqUBS9JlbLgJalSFrwkVeqiXg9gtiuuuCKvvfbaXg9DZ/Hqq69yySWX9FV2v+X2Mtt9Xpw9e/a8lJlXzvnLzFwyt7Vr16aWvomJib7L7rfcXma7z4sD7M55OtUpGkmqlAUvSZWy4CWpUha8JFXKgpekShUt+Ii4JyKejohnIuKDJbMkabnZsXeKdVt2MTl1lHVbdrFj71RXX79YwUfEGuBfAjcCbwVujYi3lMqTpOVkx94pNm+fZOrINABTR6bZvH2yqyVf8hP8jwNfzsy/zszjwJ8C/7hgniQtG1t3HmR65sQZ26ZnTrB158GuZUQWuh58RPw48N+BtwPTwBO0T8i/6zXPez/wfoBVq1atffTRR4uMR91z7NgxLr300r7K7rfcXmb3yz5PTh09dX/VELwwffp31w2vOOfXecc73rEnM8fm+l2xggeIiF8BPgAcA74GTGfmh+Z7/tjYWO7evbvYeNQdrVaL8fHxvsrut9xeZvfLPq/bsuvU9My91x3nY5PtK8cMrxzizz588zm/TkTMW/BFD7Jm5kOZ+bbMvAn4HvCNknmStFxs2jDC0ODAGduGBgfYtGGkaxlFLzYWET+WmS9GxN8GfpH2dI0k9b2No8MAnTn3VxheOcSmDSOntndD6atJfiYifhSYAT6QmX9VOE+Slo2No8NsHB2m1Wpx1x3jXX/9ogWfmf+o5OtLkubnN1klqVIWvCRVyoKXpEpZ8JJUKQtekiplwUtSpSx4SaqUBS9JlbLgJalSpVd0+lBnNaenI+KRiHhDyTxJ0mklV3QaBu4GxjJzDTAAvKdUniTpTKWnaC4ChiLiIuBi4HDhPElSR+kFP+4BfpP2ik5/nJl3zPEcV3RaZvplxZ1+zu1ltvu8OAut6ERmFrkBbwR2AVcCg8AO4L0L/TNr165NLX0TExN9l91vub3Mdp8Xh/ZSqHN2askpmncCf5mZ383MGWA78NMF8yRJs5Qs+P8L/MOIuDgiAlgPHCiYJ0mapVjBZ+ZXgMeAp4DJTtaDpfIkSWcqvaLTfcB9JTMkSXPzm6ySVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklSpkgt+jETEvlm3lyPig6XyJElnKnapgsw8CNwAEBEDwBTweKk8SdKZmpqiWQ/8RWZ+u6E8Sep7RVd0OhUS8TDwVGb+pzl+54pOy4wr7tSf28ts93lxerKi08kb8HrgJWDV2Z7rik7Lgyvu1J/by2z3eXHo0YpOJ72L9qf3FxrIkiR1NFHwtwOPNJAjSZqlaMFHxMXAz9Bej1WS1KDSKzr9NfCjJTMkSXPzm6ySVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklSp0pcqWBkRj0XE1yPiQES8vWSepC7Yvw0eWAPP72v/3L+t1yPSeSp6qQLgd4AvZOYvRcTrgYsL50m6EPu3wWfvhplpeBNw9Ln2Y4Drb+vp0LR4JddkvRy4CXgIIDN/kJlHSuVJ6oInPtou99lmptvbtewUW9EpIm4AHgS+BrwV2APck5mvvuZ5rui0zLjiTsW5z+87nf0jV3Pp9w+f/t1VNzQyBP98Lc5CKzqVLPgx4MvAusz8SkT8DvByZv67+f6ZsbGx3L17d5HxqHtarRbj4+N9ld03uQ+saU/LAK2R+xk/eF97+4rV8KGnGxmCf74WJyLmLfiSB1kPAYcy8yudx48BbyuYJ+lCrf8IDA6duW1wqL1dy06xgs/M7wDPRcRIZ9N62tM1kpaq62+Dn//d9id2aP/8+d/1AOsyVfosmruAT3XOoHkW+OXCeZIu1PW3tW+tFtzezLSMyii9otM+YM65IUlSWX6TVZIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklQpC16SKlV6RadvRcRkROyLCC8TqfPnKkPSopW+Fg3AOzLzpQZyVCtXGZLOi1M0WvpcZUg6L8UW/ACIiL8E/gpI4L9k5oNzPMcVnZYZVxlqjqsb9Ud2qRWdyMxiN+Dqzs8fA74K3LTQ89euXZta+iYmJpoN/PhPZt53eeZ9l+fEpx84dT8//pONDaHxfe5xbi+z3efFAXbnPJ1adIomMw93fr4IPA7cWDJPlXKVIem8FCv4iLgkIi47eR/4WcDVA7R4rjIknZeSZ9GsAh6PiJM5n87MLxTMU81cZUhatGIFn5nPAm8t9fqSpIV5mqQkVcqCl6RKWfCSVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFWqeMFHxEBE7I2Iz5XOkrquD1eS2rF3inVbdjE5dZR1W3axY+9Ur4dUXqXvcxMrOt0DHAAubyBL6p4+XElqx94pNm+fZHrmBKyGqSPTbN4+CcDG0eEej66Qit/n0muyXgP8HPCJkjlSEX24ktTWnQfb5T7L9MwJtu482KMRNaDi97n0ik6PAb8FXAb8ambeOsdzXNFpmembFXf6cCWpyamjp+6vGoIXZvXedcMrGhmDK4YtzkIrOhWboomIW4EXM3NPRIzP97xsL+P3IMDY2FiOj8/7VC0RrVaLXr1PjWY/cGf7f9eB1sj9jB+8r719xerGLlnc9L/r39iyi6kj7Va/97rjfGyyXRHDK4e4645mxtH4n6+K3+eSUzTrgHdHxLeAR4GbI+IPC+ZJ3dWHK0lt2jDC0ODAGduGBgfYtGGkRyNqQMXvc8nrwW8GNgN0PsH/ama+t1Se1HUnD7CdnItdsbr9H/0yP/C2kJMHUttz7q8wvHKITRtG6j3AClW/z02cRSMtX324ktTG0WE2jg7TarUam5bpuUrf50YKPjNbQKuJLElSm99klaRKWfCSVKmzFnxE3BkRb2xiMJKk7jmXT/BvAp6MiG0RcUtEROlBSZIu3FkLPjP/LfAW4CHgnwPfiIj/EBF/r/DYJEkX4Jzm4LN9PYPvdG7HgTcCj0XEfyw4NknSBTjraZIRcTfwz4CXaF80bFNmzkTE64BvAL9WdoiSpPNxLufBXwH8YmZ+e/bGzPybzvVmJElL0FkLPjPnvSBDZh7o7nAkSd1S7Dz4iHhDRPx5RHw1Ip6JiPtLZUmSfljJSxV8H7g5M49FxCDwpYj4H5n55YKZkqSOkleTTOBY5+Fg51ZudRFJ0hlKr+
(base64-encoded PNG data omitted; the rendered output is the scatter plot of the positive- and negative-class points)\n", 116 | "text/plain": [ 117 | "<Figure size 432x288 with 1 Axes>
" 118 | ] 119 | }, 120 | "metadata": { 121 | "needs_background": "light" 122 | }, 123 | "output_type": "display_data" 124 | } 125 | ], 126 | "source": [ 127 | "# Plot coordinates\n", 128 | "fig = plt.figure()\n", 129 | "ax = fig.gca()\n", 130 | "ax.set_xticks(np.arange(0, 10, 1))\n", 131 | "ax.set_yticks(np.arange(0, 10, 1))\n", 132 | "plt.xlabel('x')\n", 133 | "plt.ylabel('y')\n", 134 | "\n", 135 | "plt.scatter(xp,yp)\n", 136 | "plt.scatter(xn,yn)\n", 137 | "plt.grid(True)\n", 138 | "plt.show()" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "metadata": {}, 144 | "source": [ 145 | "# Writing helper functions" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 6, 151 | "metadata": {}, 152 | "outputs": [], 153 | "source": [ 154 | "def e_distance(x,y):\n", 155 | " '''\n", 156 | " Method that calculates Euclidean distance from the point to be classified \n", 157 | " \n", 158 | " to every other point\n", 159 | " '''\n", 160 | " # write your logic here" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 7, 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "def max_Count(vector):\n", 170 | " '''\n", 171 | " Method for calculating mode of data\n", 172 | " '''\n", 173 | " # write your logic here" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "metadata": {}, 179 | "source": [ 180 | "## Take test input for classification" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 9, 186 | "metadata": {}, 187 | "outputs": [ 188 | { 189 | "name": "stdout", 190 | "output_type": "stream", 191 | "text": [ 192 | "Enter pts x,y and k: 1 2 3\n" 193 | ] 194 | } 195 | ], 196 | "source": [ 197 | "# Provide test input for classification in the format 1 2 3 by giving space in between\n", 198 | "(x1,y1,k) = map(int,input(\"Enter pts x,y and k: \").split(\" \"))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "# Code the KNN algorithm here" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": 8, 211 | "metadata": {}, 212 | "outputs": [], 213 | "source": [ 214 | "\n", 215 | "def KNN(x1,y1,k):\n", 216 | " '''\n", 217 | " Method that classifies the test point\n", 218 | " \n", 219 | " sorts the list and finds mode of the first k tuples to classify point x1,y1\n", 220 | " \n", 221 | " Call the necessary helper functions\n", 222 | " ''' \n", 223 | " # write your logic here" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 10, 229 | "metadata": {}, 230 | "outputs": [], 231 | "source": [ 232 | "# call the function with test points as parameters and value k\n", 233 | "KNN(x1,y1,k)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [] 242 | } 243 | ], 244 | "metadata": { 245 | "kernelspec": { 246 | "display_name": "Python 3", 247 | "language": "python", 248 | "name": "python3" 249 | }, 250 | "language_info": { 251 | "codemirror_mode": { 252 | "name": "ipython", 253 | "version": 3 254 | }, 255 | "file_extension": ".py", 256 | "mimetype": "text/x-python", 257 | "name": "python", 258 | "nbconvert_exporter": "python", 259 | "pygments_lexer": "ipython3", 260 | "version": "3.8.3" 261 | } 262 | }, 263 | "nbformat": 4, 264 | "nbformat_minor": 4 265 | } 266 | -------------------------------------------------------------------------------- /010/exercise/readme.md: 
-------------------------------------------------------------------------------- 1 | # Problem Statement 2 | 3 | The aim of the exercise is to implement k-NN from scratch. 4 | A basic implementation is all that is needed to understand how the algorithm works. 5 | 6 | # Objective 7 | 8 | To understand how kNN works internally. 9 | 10 | # Task 11 | 12 | - Extend the algorithm to distance-weighted kNN classification using an appropriate dataset. 13 | - Extend the algorithm to regression using an appropriate dataset. 14 | - Try the algorithm out on other appropriate datasets. 15 | - Implement KD-trees to understand information retrieval. Visit [this](https://www.analyticsvidhya.com/blog/2017/11/information-retrieval-using-kdtree/) site for a dataset and references. 16 | 17 | # k-NN Algorithm 18 | 19 | The k-nearest neighbors (KNN) algorithm is a supervised ML algorithm that can be used for both classification and regression problems. 20 | The following two properties characterize KNN well: 21 | - Lazy learning algorithm: KNN is a lazy learning algorithm because it has no specialized training phase; it uses all of the training data at classification time. 22 | - Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm because it assumes nothing about the underlying data. 23 | 24 | KNN uses 'feature similarity' to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set. 25 | 26 | 1. Load the data 27 | 2. Initialize K to your chosen number of neighbors 28 | 3. For each example in the data 29 | - Calculate the distance between the query example and the current example from the data. 30 | - Add the distance and the index of the example to an ordered collection 31 | 4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances 32 | 5. Pick the first K entries from the sorted collection 33 | 6. Get the labels of the selected K entries 34 | 7. If regression, return the mean of the K labels 35 | 8. If classification, return the mode of the K labels 36 | 37 | A minimal Python sketch of these steps is included at the end of this readme. Here is a template notebook to get you started: 38 | 39 | `knn_starter_exercise.ipynb` 40 | 41 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/010/exercise/knn_starter_exercise.ipynb) 42 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/010/exercise/knn_starter_exercise.ipynb) 43 | 44 | ### References 45 | - https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_knn_algorithm_finding_nearest_neighbors.htm 46 | 
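To make the steps above concrete, here is a minimal, self-contained Python sketch of plain kNN (classification and regression) together with the distance-weighted variant mentioned in the tasks. It is an illustrative sketch, not the repository's solution notebook; the names `euclidean` and `knn_predict` are invented for this example.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(train, query, k, task="classification", weighted=False):
    """train: list of (features, label) pairs; query: a feature vector.

    Steps 3-4: compute the distance to every training example and sort
    ascending. Step 5: keep the first k. Step 6: collect their labels.
    Steps 7-8: return the mean (regression, numeric labels) or the mode
    (classification) of those k labels.
    """
    neighbors = sorted(((euclidean(x, query), y) for x, y in train),
                       key=lambda pair: pair[0])[:k]
    labels = [label for _, label in neighbors]
    if task == "regression":
        return sum(labels) / k
    if weighted:
        # Distance-weighted voting: closer neighbors get larger votes.
        votes = Counter()
        for dist, label in neighbors:
            votes[label] += 1.0 / (dist + 1e-9)  # epsilon avoids division by zero
        return votes.most_common(1)[0][0]
    return Counter(labels).most_common(1)[0][0]

# Smoke test on the same toy points used in the notebooks for this exercise:
data = [((-1, -2), 'Positive'), ((7, 9), 'Positive'), ((0, 0), 'Positive'),
        ((4, 4), 'Positive'), ((6, 2), 'Positive'), ((-3, -2), 'Negative'),
        ((1, 5), 'Negative'), ((2, 4), 'Negative'), ((4, 2), 'Negative'),
        ((6, 4), 'Negative'), ((4, 6), 'Negative')]
print(knn_predict(data, (1, 1), k=3))  # -> Negative (matches the solution run)
```

For small datasets this brute-force scan is fine; the KD-tree task in the list above is about replacing the full scan with a spatial index so that neighbor lookups scale better.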
\n", 9 | " Run in Google Colab\n", 10 | "
" 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "# importing necessary libraries\n", 21 | "import numpy as np\n", 22 | "import pandas as pd\n", 23 | "import matplotlib.pyplot as plt\n", 24 | "import math\n", 25 | "from collections import Counter\n", 26 | "from copy import deepcopy" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "# Sample data with (xp,yp) as positive-class coordinates and (xn,yn) as negative-class coordinates\n", 36 | "#xp = [-1,0,4,6,7]\n", 37 | "#yp = [-2,0,4,2,9]\n", 38 | "#xn = [-3,1,2,4,6,4]\n", 39 | "#yn = [-2,5,4,2,4,6]\n", 40 | "\n", 41 | "xp=[]\n", 42 | "yp=[]\n", 43 | "xn=[]\n", 44 | "yn=[]\n", 45 | "\n", 46 | "data = [\n", 47 | " (-1,-2,'Positive'),\n", 48 | " (7,9,'Positive'),\n", 49 | " (0,0,'Positive'),\n", 50 | " (4,4,'Positive'),\n", 51 | " (6,2,'Positive'),\n", 52 | " (-3,-2,'Negative'),\n", 53 | " (1,5,'Negative'),\n", 54 | " (2,4,'Negative'),\n", 55 | " (4,2,'Negative'),\n", 56 | " (6,4,'Negative'),\n", 57 | " (4,6,'Negative'),\n", 58 | "]" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": 3, 64 | "metadata": {}, 65 | "outputs": [ 66 | { 67 | "name": "stdout", 68 | "output_type": "stream", 69 | "text": [ 70 | "[-1, 7, 0, 4, 6] [-2, 9, 0, 4, 2] [-3, 1, 2, 4, 6, 4] [-2, 5, 4, 2, 4, 6]\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "# Append data points in respective arrays for data plotting\n", 76 | "for i in range(0,len(data)):\n", 77 | " if(data[i][2]=='Positive'):\n", 78 | " xp.append(data[i][0])\n", 79 | " yp.append(data[i][1])\n", 80 | " else: \n", 81 | " xn.append(data[i][0])\n", 82 | " yn.append(data[i][1])\n", 83 | " \n", 84 | "print(xp,yp,xn,yn)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXgAAAEGCAYAAABvtY4XAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAUpklEQVR4nO3df5Dcd33f8eeb8xHOvyQaO8I+qzVtmZsksvGhq1OiqeewSGQmDlEzGQ+uoW0mLf0D/4A6SlHS4jHTNJpqwEmmmU492AmdgD2qkdXAUJSMdRtKWoglS+hshApxoNbJ2HiIZMu5wkl5949dSSdzd9JJ+/nu3Wefj5md2/3eel+fr1d+af35fvf7icxEklSf1/V6AJKkMix4SaqUBS9JlbLgJalSFrwkVeqiXg9gtiuuuCKvvfbaXg9DZ/Hqq69yySWX9FV2v+X2Mtt9Xpw9e/a8lJlXzvnLzFwyt7Vr16aWvomJib7L7rfcXma7z4sD7M55OtUpGkmqlAUvSZWy4CWpUha8JFXKgpekShUt+Ii4JyKejohnIuKDJbMkabnZsXeKdVt2MTl1lHVbdrFj71RXX79YwUfEGuBfAjcCbwVujYi3lMqTpOVkx94pNm+fZOrINABTR6bZvH2yqyVf8hP8jwNfzsy/zszjwJ8C/7hgniQtG1t3HmR65sQZ26ZnTrB158GuZUQWuh58RPw48N+BtwPTwBO0T8i/6zXPez/wfoBVq1atffTRR4uMR91z7NgxLr300r7K7rfcXmb3yz5PTh09dX/VELwwffp31w2vOOfXecc73rEnM8fm+l2xggeIiF8BPgAcA74GTGfmh+Z7/tjYWO7evbvYeNQdrVaL8fHxvsrut9xeZvfLPq/bsuvU9My91x3nY5PtK8cMrxzizz588zm/TkTMW/BFD7Jm5kOZ+bbMvAn4HvCNknmStFxs2jDC0ODAGduGBgfYtGGkaxlFLzYWET+WmS9GxN8GfpH2dI0k9b2No8MAnTn3VxheOcSmDSOntndD6atJfiYifhSYAT6QmX9VOE+Slo2No8NsHB2m1Wpx1x3jXX/9ogWfmf+o5OtLkubnN1klqVIWvCRVyoKXpEpZ8JJUKQtekiplwUtSpSx4SaqUBS9JlbLgJalSpVd0+lBnNaenI+KRiHhDyTxJ0mklV3QaBu4GxjJzDTAAvKdUniTpTKWnaC4ChiLiIuBi4HDhPElSR+kFP+4BfpP2ik5/nJl3zPEcV3RaZvplxZ1+zu1ltvu8OAut6ERmFrkBbwR2AVcCg8AO4L0L/TNr165NLX0TExN9l91vub3Mdp8Xh/ZSqHN2askpmncCf5mZ383MGWA78NMF8yRJs5Qs+P8L/MOIuDgiAlgPHCiYJ0mapVjBZ+ZXgMeAp4DJTtaDpfIkSWcqvaLTfcB9JTMkSXPzm6ySVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklSpkgt+jETEvlm3lyPig6XyJElnKnapgsw8CNwAEBEDwBTweKk8SdKZmpqiWQ/8RWZ+u6E8Sep7RVd0OhUS8TDwVGb+pzl+54pOy4wr7tSf28ts93lxerKi08kb8HrgJWDV2Z7rik7Lgyvu1J/by2z3eXHo0YpOJ72L9qf3FxrIkiR1NFHwtwOPNJAjSZqlaMFHxMXAz9Bej1WS1KDSKzr9NfCjJTMkSXPzm6ySVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklSp0pcqWBkRj0XE1yPiQES8vWSepC7Yvw0eWAPP72v/3L+t1yPSeSp6qQLgd4AvZOYvRcTrgYsL50m6EPu3wWfvhplpeBNw9Ln2Y4Drb+vp0LR4JddkvRy4CXgIIDN/kJlHSuVJ6oInPtou99lmptvbtewUW9EpIm4AHgS+BrwV2APck5mvvuZ5rui0zLjiTsW5z+87nf0jV3Pp9w+f/t1VNzQyBP98Lc5CKzqVLPgx4MvAusz8SkT8DvByZv67+f6ZsbGx3L17d5HxqHtarRbj4+N9ld03uQ+saU/LAK2R+xk/eF97+4rV8KGnGxmCf74WJyLmLfiSB1kPAYcy8yudx48BbyuYJ+lCrf8IDA6duW1wqL1dy06xgs/M7wDPRcRIZ9N62tM1kpaq62+Dn//d9id2aP/8+d/1AOsyVfosmruAT3XOoHkW+OXCeZIu1PW3tW+tFtzezLSMyii9otM+YM65IUlSWX6TVZIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklQpC16SKlV6RadvRcRkROyLCC8TqfPnKkPSopW+Fg3AOzLzpQZyVCtXGZLOi1M0WvpcZUg6L8UW/ACIiL8E/gpI4L9k5oNzPMcVnZYZVxlqjqsb9Ud2qRWdyMxiN+Dqzs8fA74K3LTQ89euXZta+iYmJpoN/PhPZt53eeZ9l+fEpx84dT8//pONDaHxfe5xbi+z3efFAXbnPJ1adIomMw93fr4IPA7cWDJPlXKVIem8FCv4iLgkIi47eR/4WcDVA7R4rjIknZeSZ9GsAh6PiJM5n87MLxTMU81cZUhatGIFn5nPAm8t9fqSpIV5mqQkVcqCl6RKWfCSVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFWqeMFHxEBE7I2Iz5XOkrquD1eS2rF3inVbdjE5dZR1W3axY+9Ur4dUXqXvcxMrOt0DHAAubyBL6p4+XElqx94pNm+fZHrmBKyGqSPTbN4+CcDG0eEej66Qit/n0muyXgP8HPCJkjlSEX24ktTWnQfb5T7L9MwJtu482KMRNaDi97n0ik6PAb8FXAb8ambeOsdzXNFpmembFXf6cCWpyamjp+6vGoIXZvXedcMrGhmDK4YtzkIrOhWboomIW4EXM3NPRIzP97xsL+P3IMDY2FiOj8/7VC0RrVaLXr1PjWY/cGf7f9eB1sj9jB+8r719xerGLlnc9L/r39iyi6kj7Va/97rjfGyyXRHDK4e4645mxtH4n6+K3+eSUzTrgHdHxLeAR4GbI+IPC+ZJ3dWHK0lt2jDC0ODAGduGBgfYtGGkRyNqQMXvc8nrwW8GNgN0PsH/ama+t1Se1HUnD7CdnItdsbr9H/0yP/C2kJMHUttz7q8wvHKITRtG6j3AClW/z02cRSMtX324ktTG0WE2jg7TarUam5bpuUrf50YKPjNbQKuJLElSm99klaRKWfCSVKmzFnxE3BkRb2xiMJKk7jmXT/BvAp6MiG0RcUtEROlBSZIu3FkLPjP/LfAW4CHgnwPfiIj/EBF/r/DYJEkX4Jzm4LN9PYPvdG7HgTcCj0XEfyw4NknSBTjraZIRcTfwz4CXaF80bFNmzkTE64BvAL9WdoiSpPNxLufBXwH8YmZ+e/bGzPybzvVmJElL0FkLPjPnvSBDZh7o7nAkSd1S7Dz4iHhDRPx5RHw1Ip6JiPtLZUmSfljJSxV8H7g5M49FxCDwpYj4H5n55YKZkqSOkleTTOBY5+Fg51ZudRFJ0hlKr+
g0AOwB/j7we5n5b+Z4jis6LTN9s6JTH+f2Mtt9XpyFVnQiM4vfgJXABLBmoeetXbs2tfRNTEz0XXa/5fYy231eHGB3ztOpjVxsLDOP0L5c8C1N5EmSyp5Fc2VErOzcHwLeCXy9VJ4k6Uwlz6K5CvhkZx7+dcC2zPxcwTxJ0iwlz6LZD4yWen1J0sJc8EOSKmXBS1KlLHhJqpQFL0mVsuAlqVIWvCRVyoKXpEpZ8JJUKQtekipV8lo0qyNiIiIOdFZ0uqdUlqQu2r8NHlgDz+9r/9y/rdcjKm7H3inWbdnF5NRR1m3ZxY69U70eUleUvBbNceDezHwqIi4D9kTEn2Tm1wpmSroQ+7fBZ++GmWl4E3D0ufZjgOtv6+nQStmxd4rN2yeZnjkBq2HqyDSbt08CsHF0uMejuzDFPsFn5vOZ+VTn/ivAAWB5/9uSavfER9vlPtvMdHt7pbbuPNgu91mmZ06wdefBHo2oe4qu6HQqJOJa4Iu0F/x4+TW/c0WnZcYVdyrOfX7f6ewfuZpLv3/49O+uuqGRITS9z5NTR0/dXzUEL8z6++264RWNjKHUik7FCz4iLgX+FPjNzNy+0HPHxsZy9+7dRcejC9dqtRgfH++r7L7JfWBNe1oGaI3cz/jB+9rbV6yGDz3dyBCa3ud1W3YxdaTd6vded5yPTbZnrodXDvFnH765kTFcyD5HxLwFX/QsmogYBD4DfOps5S5pCVj/ERgcOnPb4FB7e6U2bRhhaHDgjG1DgwNs2jDSoxF1T7GDrBERwEPAgcz8eKkcSV108kDqyTn3Favb5V7pAVY4fSC1Pef+CsMrh9i0YWTZH2CFsmfRrAPeB0xGxMmJvV/PzM8XzJR0oa6/rX1rteD2ZqZlem3j6DAbR4dptVrcdcd4r4fTNSVXdPoSEKVeX5K0ML/JKkmVsuAlqVIWvCRVyoKXpEpZ8JJUKQtekiplwUtSpSx4SaqUBS9JlSq5otPDEfFiRPTHd50laYkp+Qn+D4BbCr6+JGkBJVd0+iLwvVKvL0laWNEFPzorOX0uM9cs8BxXdFpmXNGp/txeZrvPi7PQik5kZrEbcC3w9Lk+f+3atamlb2Jiou+y+y23l9nu8+IAu3OeTvUsGkmqlAUvSZUqeZrkI8D/BkYi4lBE/EqpLEnSDyu5otPtpV5bknR2TtFIUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklQpC16SKmXBS1KlihZ8RNwSEQcj4psR8eGSWSpvx94p1m3ZxeTUUdZt2cWOvVO9HpKkBZS8Fs0A8HvAu4CfAG6PiJ8olaeyduydYvP2SaaOTAMwdWSazdsnLXlpCSv5Cf5G4JuZ+Wxm/gB4FPiFgnkqaOvOg0zPnDhj2/TMCbbuPNijEUk6m2IrOkXELwG3ZOa/6Dx+H/BTmXnna57nik7LwOTU0VP3Vw3BC9Onf3fd8IrGxrEcV9xZjrm9zHafF2ehFZ2KXU0SiDm2/dDfJpn5IPAgwNjYWI6Pjxccks7Xb2zZdWp65t7rjvOxyfYfneGVQ9x1x3hj42i1WvTiz0i/5fYy233unpJTNIeA1bMeXwMcLpingjZtGGFocOCMbUODA2zaMNKjEUk6m5Kf4J8E3hIRbwamgPcA/6RgngraODoM0Jlzf4XhlUNs2jByarukpafkgh/HI+JOYCcwADycmc+UylN5G0eH2Tg6TKvVanRaRtL5KfkJnsz8PPD5khmSpLn5TVZJqpQFL0mVsuAlqVIWvCRVyoKXpEpZ8JJUKQtekiplwUtSpSx4SaqUBS9JlbLgJalSFrwkVcqCl6RKWfCSVCkLXpIqZcFLUqUseEmqlAUvSZWy4CWpUha8JFXKgpekSlnwklQpC16SKmXBS1KlLHhJqpQFL0mVsuAlqVIX9XoAF2z/Nnjio3D0EKy4BtZ/BK6/rdejKmrH3im27jzI4SPTXL1yiE0bRtg4OtzrYUlaYpZ3we/fBp+9G2am24+PPtd+DNWW/I69U2zePsn0zAkApo5Ms3n7JIAlL+kMy3uK5omPni73k2am29srtXXnwVPlftL0zAm27jzYoxFJWqqWd8EfPbS47RU4fGR6Udsl9a/lXfArrlnc9gpcvXJoUdsl9a/lXfDrPwKDrym2waH29kpt2jDC0ODAGduGBgfYtGGkRyOStFQt74OsJw+k9tFZNCcPpHoWjaSzWd4FD+0yr7jQ57JxdNhCl3RWy3uKRpI0LwtekiplwUtSpSx4SaqUBS9JlYrM7PUYTomI7wLf7vU4dFZXAC/1WXa/5fYy231enL+TmVfO9YslVfBaHiJid2aO9VN2v+X2Mtt97h6naCSpUha8JFXKgtf5eLAPs/stt5fZ7nOXOAcvSZXyE7wkVcqCl6RKWfBalIi4JSIORsQ3I+LDDWU+HBEvRsTTTeS9Jnt1RExExIGIeCYi7mko9w0R8ecR8dVO7v1N5M7KH4iIvRHxuYZzvxURkxGxLyJ2N5i7MiIei4ivd97rtzeUO9LZ15O3lyPig117fefgda4iYgD4P8DPAIeAJ4HbM/NrhXNvAo4B/zUz15TMmiP7KuCqzHwqIi4D9gAbG9jnAC7JzGMRMQh8CbgnM79cMndW/r8GxoDLM/PWJjI7ud8CxjKz0S8bRcQngf+ZmZ+IiNcDF2fmkYbHMABMAT+VmV35wqef4LUYNwLfzMxnM/MHwKPAL5QOzcwvAt8rnTNP9vOZ+VTn/ivAAaD4xfiz7Vjn4WDn1sinsYi4Bvg54BNN5PVaRFwO3AQ8BJCZP2i63DvWA3/RrXIHC16LMww8N+vxIRoou6UiIq4FRoGvNJQ3EBH7gBeBP8nMRnKB3wZ+DfibhvJmS+CPI2JPRLy/ocy/C3wX+P3OtNQnIuKShrJnew/wSDdf0ILXYsQc2/piji8iLgU+A3wwM19uIjMzT2TmDcA1wI0RUXx6KiJuBV7MzD2ls+axLjPfBrwL+EBneq60i4C3Af85M0eBV4FGji+d1JkWejfw37r5uha8FuMQsHrW42uAwz0aS2M6c+CfAT6Vmdubzu9MF7SAWxqIWwe8uzMX/ihwc0T8YQO5AGTm4c7PF4HHaU8LlnYIODTr/5Aeo134TXoX8FRmvtDNF7XgtRhPAm+JiDd3PnG8B/ijHo+pqM7BzoeAA5n58QZzr4yIlZ37Q8A7ga+Xzs3MzZl5TWZeS/v93ZWZ7y2dCxARl3QOZNOZIvlZoPiZU5n5HeC5iBjpbFoPFD2IPofb6fL0DNSw6LYak5nHI+JOYCcwADycmc+Uzo2IR4Bx4IqIOATcl5kPlc7tWAe8D5jszIcD/Hpmfr5w7lXAJztnVrwO2JaZjZ6y2AOrgMfbf6dyEfDpzPxCQ9l3AZ/qfHB5FvjlhnKJiItpn5n2r7r+2p4mKUl1copGkiplwUtSpSx4SaqUBS9JlbLgJalSFrwkVcqCl6RKWfDSPCLiH0TE/s612S/pXJe90csVSxfCLzpJC4iIfw+8ARiifb2S3+rxkKRzZsFLC+h8df1J4P8BP52ZJ3o8JOmcOUUjLexvAZcCl
9H+JC8tG36ClxYQEX9E+7K5b6a9dN+dPR6SdM68mqQ0j4j4p8DxzPx056qO/ysibs7MXb0em3Qu/AQvSZVyDl6SKmXBS1KlLHhJqpQFL0mVsuAlqVIWvCRVyoKXpEr9fwoK1wccKUTVAAAAAElFTkSuQmCC\n", 95 | "text/plain": [ 96 | "
" 97 | ] 98 | }, 99 | "metadata": { 100 | "needs_background": "light" 101 | }, 102 | "output_type": "display_data" 103 | } 104 | ], 105 | "source": [ 106 | "# Plot coordinates\n", 107 | "fig = plt.figure()\n", 108 | "ax = fig.gca()\n", 109 | "ax.set_xticks(np.arange(0, 10, 1))\n", 110 | "ax.set_yticks(np.arange(0, 10, 1))\n", 111 | "plt.xlabel('x')\n", 112 | "plt.ylabel('y')\n", 113 | "\n", 114 | "plt.scatter(xp,yp)\n", 115 | "plt.scatter(xn,yn)\n", 116 | "plt.grid(True)\n", 117 | "plt.show()" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 5, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "# calculates Euclidean distance from the point to be classified \n", 127 | "# to every other point and stores in a tuple.\n", 128 | "def e_distance(x,y):\n", 129 | " obj_list=[]\n", 130 | " obj_list=deepcopy(data)\n", 131 | " for i in range(0,len(data)):\n", 132 | " obj_list[i] += (math.sqrt((x-data[i][0])**2 + (y-data[i][1])**2),)\n", 133 | " return obj_list" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [], 141 | "source": [ 142 | "# function for calculating mode of data\n", 143 | "def max_Count(vector):\n", 144 | " counts = Counter(x[2] for x in vector) \n", 145 | " return counts.most_common(1)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 7, 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "Enter pts x,y and k: 1 1 3\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "# Provide test input for classification in the format 1 2 3 by giving space in between\n", 163 | "(x1,y1,k) = map(int,input(\"Enter pts x,y and k: \").split(\" \"))" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 8, 169 | "metadata": {}, 170 | "outputs": [], 171 | "source": [ 172 | "#sorts the list and finds mode of the first k tuples to classify point x1,y1\n", 173 | "def KNN(x1,y1,k):\n", 174 | " print(\"\\nPoint to be classified: \",str(x1) +\",\"+ str(y1))\n", 175 | " print(\"\\nvalue of k =\",k)\n", 176 | " if(k>len(data)):\n", 177 | " print(\"\\nNo. of neighbors exceeding no. 
of samples\")\n", 178 | " vector = e_distance(x1,y1)\n", 179 | " vector = sorted(vector,key = lambda x:x[3])\n", 180 | " topk = vector[:k]\n", 181 | " print(\"\\nTop\",k,\"tuples\")\n", 182 | " print(topk)\n", 183 | " arr =[]\n", 184 | " arr = max_Count(topk)\n", 185 | " print(\"\\nClassification is: \",arr[0][0]) " 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 9, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "name": "stdout", 195 | "output_type": "stream", 196 | "text": [ 197 | "\n", 198 | "Point to be classified: 1,1\n", 199 | "\n", 200 | "value of k = 3\n", 201 | "\n", 202 | "Top 3 tuples\n", 203 | "[(0, 0, 'Positive', 1.4142135623730951), (2, 4, 'Negative', 3.1622776601683795), (4, 2, 'Negative', 3.1622776601683795)]\n", 204 | "\n", 205 | "Classification is: Negative\n" 206 | ] 207 | } 208 | ], 209 | "source": [ 210 | "KNN(x1,y1,k)" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": null, 216 | "metadata": {}, 217 | "outputs": [], 218 | "source": [] 219 | } 220 | ], 221 | "metadata": { 222 | "kernelspec": { 223 | "display_name": "Python 3", 224 | "language": "python", 225 | "name": "python3" 226 | }, 227 | "language_info": { 228 | "codemirror_mode": { 229 | "name": "ipython", 230 | "version": 3 231 | }, 232 | "file_extension": ".py", 233 | "mimetype": "text/x-python", 234 | "name": "python", 235 | "nbconvert_exporter": "python", 236 | "pygments_lexer": "ipython3", 237 | "version": "3.8.3" 238 | } 239 | }, 240 | "nbformat": 4, 241 | "nbformat_minor": 4 242 | } 243 | -------------------------------------------------------------------------------- /010/solution/readme.md: -------------------------------------------------------------------------------- 1 | Basic implementation of k-NN from scratch is provided for reference in the file: 2 | 3 | `knn_from_scratch.ipynb` 4 | 5 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/010/solution/knn_from_scratch.ipynb) 6 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/010/solution/knn_from_scratch.ipynb) 7 | 8 | 9 | Furthermore, an example notebook using scikit-learn's `KNeighborsClassifier` for the Iris dataset is given: 10 | 11 | `knn_using_sklearn.ipynb` 12 | 13 | [![Open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gimseng/99-ML-Learning-Projects/blob/master/010/solution/knn_using_sklearn.ipynb) 14 | [![View in nbviewer](https://github.com/jupyter/design/blob/master/logos/Badges/nbviewer_badge.svg)](https://nbviewer.jupyter.org/github/gimseng/99-ML-Learning-Projects/blob/master/010/solution/knn_using_sklearn.ipynb) 15 | 16 | For a more detailed kNN classification problem using scikit-learn, see the following [project](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/009/exercise). 17 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # Code of Conduct 2 | 3 | We are a community based on openness and friendly, didactic, discussions. 4 | 5 | We aspire to treat everybody equally, and value their contributions. 6 | 7 | Decisions are made based on technical merit and consensus. 
8 | 9 | Code is not the only way to help the project. Reviewing pull requests, 10 | answering questions to help others on mailing lists or issues, organizing and 11 | teaching tutorials, working on the website, improving the documentation, are 12 | all priceless contributions. 13 | 14 | We abide by the principles of openness, respect, and consideration of others of 15 | the Python Software Foundation: https://www.python.org/psf/codeofconduct/ 16 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Guidance on how to contribute 2 | 3 | There are two primary ways to help: 4 | 5 | - Propose an exercise using the issue tracker, and 6 | - Propose a solution using the pull request 7 | 8 | ## Propose an exercise using the issue tracker 9 | 10 | One generally uses the issue tracker to suggest feature requests, report bugs, and ask questions. 11 | For our project, you are welcome to propose new exercises to be included in this project by creating a new issue. 12 | Describe what you would like in this exercise and, if possible, propose what the solution should look like. 13 | 14 | If you can write up the exercise statement or the code for the solution, then follow the guide below. 15 | 16 | ## Propose a solution using the pull request 17 | 18 | ### General github workflow 19 | 20 | Generally speaking, you should fork this repository, make changes in your own fork, and then submit a pull request (PR). 21 | This is the primary way for you to submit solutions to the exercises in code. 22 | 23 | If you are not familiar with git, we follow the general ["fork-and-pull"](https://github.com/susam/gitpr) git flow (a consolidated example session is shown at the end of this guide): 24 | 25 | 1. Fork the repository to your own GitHub account. 26 | 2. Clone the project to your machine. Add upstream to the original repo. 27 | 28 | - `git clone https://github.com/your_username/99-ML-Learning-Projects.git` 29 | - `git remote add upstream https://github.com/gimseng/99-ML-Learning-Projects.git` so now `upstream` refers to the original repo and `origin` refers to your remote fork. 30 | 31 | 3. Create a branch locally with a succinct but descriptive name (something like `dev-fix-something`). It is best practice not to work on the master branch. 32 | - `git branch dev-fix-something` creates a new branch where your changes will live. 33 | - `git checkout dev-fix-something` switches to the newly created branch. 34 | 4. Commit changes to the branch. Commit small changes often, with succinct and clear messages. 35 | - Make sure you are on the branch specific to the changes you want to commit. 36 | - `git add` the files you changed, then `git commit -m "describe the change you made"`. 37 | 5. Follow any formatting and testing guidelines specific to this repo. 38 | 6. Push changes to your fork. Keep your fork's main development branch updated with upstream's. If there are conflicts, resolve them within your own forked version. 39 | - `git push origin dev-fix-something` pushes the branch that has our changes to the remote fork. 40 | - Now we need to make sure the fork's main development branch is updated with upstream's: 41 | - `git checkout master` to change back to the master branch 42 | - `git fetch upstream master` to get any changes from upstream to our local fork 43 | - `git merge upstream/master` to merge changes from upstream into our fork 44 | - `git push origin` to push the changes to our remote fork 45 | 7. 
Open a PR in our repository and follow the PR template so that we can efficiently review the changes. 46 | ![Imgur](https://i.imgur.com/Lrv6oOV.png) 47 | After you push changes to your fork, a button will appear on GitHub that lets you create a pull request asking for the branch's changes to be merged into the original repo. 48 | 8. After reviewers approve the PR, your branch (and changes) will be merged into the master branch. 49 | 9. (Optionally) Delete your branch. 50 | 51 | ## Structuring the exercise or solution 52 | 53 | First, create the folder using a placeholder name, roughly related to the model/project goal/data. For example, if you are working on linear regression, you could create the folder `linear regression`. Within that, the directory layout should be of the following format: 54 | 55 | . 56 | ├── exercise # Exercise folder 57 | │ └── readme.md # A clear and expansive markdown description of the exercise: data, goals, methods to use, etc. 58 | ├── solution # Solution Folder 59 | │ ├── readme.md # A short description of the solutions and what each file does 60 | │ └── random_forest.ipynb # Jupyter notebook solution 61 | └── data # Data folder 62 | ├── train.csv # Some data 63 | └── ... 64 | 65 | Please provide a description/summary in the `readme.md` in each of the `exercise` and `solution` folders. If it's appropriate, reference/credit sources. In the `data` folder, if relevant, provide a `readme.md` to describe the data and its source. 66 | 67 | When we eventually merge, the root folder of your project will be renamed numerically in chronological order. 68 | 
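For newcomers, here is the fork-and-pull flow from the "General github workflow" steps above as one consolidated, copy-pasteable session; `your_username` and `dev-fix-something` are placeholders to replace with your own account and branch name:

```sh
# Steps 1-2: clone your fork and point `upstream` at the original repo
git clone https://github.com/your_username/99-ML-Learning-Projects.git
cd 99-ML-Learning-Projects
git remote add upstream https://github.com/gimseng/99-ML-Learning-Projects.git

# Step 3: create and switch to a descriptive branch (not master)
git checkout -b dev-fix-something

# Step 4: stage and commit in small, clearly described chunks
git add path/to/changed/files
git commit -m "Short, clear description of the change"

# Step 6: push the branch, then keep your fork's master in sync with upstream
git push origin dev-fix-something
git checkout master
git fetch upstream master
git merge upstream/master
git push origin

# Step 7: open the pull request from the GitHub web interface
```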
-------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Gim Seng 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 99-ML-Learning-Projects 2 | A list of 99 machine learning projects for anyone interested in learning machine learning by coding and building projects. 3 | 4 | Our working philosophy is to provide a curated repo for anyone to contribute a cool/fun exercise and solution that is useful for anyone (including themselves) in their journey of learning machine learning. 5 | 6 | 7 | ## Getting Started 8 | 9 | The format is roughly the following: 10 | 11 | 1. Propose an exercise by creating an issue ticket and write what you think is a useful coding exercise for certain concepts. 12 | 13 | 2. If enough people are interested in that issue ticket, hopefully either you or someone else will write the exercise statement up properly, similar to the style of a lab exercise/homework question. 14 | 15 | 3. Then someone will fork the repo, write up their solution with a bit of polish and documentation, and submit a pull request. Please see [general contribution guidelines](CONTRIBUTING.md) for more details on how to contribute solutions. 16 | 17 | 4. Some of us will scrutinize the code, review it, make suggestions and eventually include (merge) it into the main project repo. 18 | 19 | 5. At any time, someone can suggest improvements/changes to steps 3-4 above for a particular exercise. This is done by creating an issue ticket for improvement/enhancement. One can then repeat steps 3-4. 20 | 21 | 6. Finally, repeat steps 1-5 indefinitely till we hit 99/99 projects. 22 | 23 | Please abide by the [code of conduct guidelines](CODE_OF_CONDUCT.md) to keep this an open and friendly open-source collaboration. 24 | 25 | ### Goal: 99 Projects 26 | ### Current: 10 Projects 27 | 28 | ## Table of Contents 29 | #### General-Purpose Machine Learning 30 | 31 | - [Linear Regression [Beginner]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/002/exercise) 32 | 33 | - [Titanic Survival Prediction [Beginner]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/001/exercise) 34 | 35 | - [kNN from Scratch [Beginner]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/010/exercise) 36 | 37 | - [kNN from Sklearn [Beginner]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/009/exercise) 38 | 39 | - [Bagging and boosting ensemble methods [Intermediate]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/006/exercise) 40 | 41 | 42 | #### Computer Vision 43 | - [MNIST Handwriting Digit Recognition [Intermediate]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/003/exercise) 44 | 45 | 46 | #### Natural Language Processing 47 | 48 | - [Sentiment analysis [Intermediate]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/005/exercise) 49 | 50 | - [Text-generation neural network model (with LSTM) [Advanced]](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/004/exercise) 51 | 52 | 56 | 57 | #### Bayesian 58 | 59 | - [Naive Bayes Classification](https://github.com/gimseng/99-ML-Learning-Projects/blob/master/008/exercise) 60 | 61 | #### Misc/Mix Models 62 | 63 | - [Employee Attrition & Performance](https://github.com/gimseng/99-ML-Learning-Projects/tree/master/007) 64 | 65 | ## Refreshers/Cheatsheets 66 | 67 | - [Numpy](https://github.com/gimseng/99-ML-Learning-Projects/blob/master/Resources/Numpy/NumPy%20Tutorial.ipynb) 68 | - [Pandas](https://github.com/gimseng/99-ML-Learning-Projects/blob/master/Resources/Pandas/Pandas%20Tutorial.ipynb) 69 | 70 | 71 | 72 | 73 | ## Dependencies 74 | 75 | Some of the libraries (and their versions) we are using: 76 | - Python (>= 3.6) 77 | - NumPy (>= 1.18.5) 78 | - Pandas (>= 1.0.5) 79 | - Matplotlib (>= 3.2.2) 80 | - Seaborn (>= 0.10.1) 81 | - Scikit-learn (>= 0.22.2) 82 | - Tensorflow (>= 2.2.0) 83 | - PyTorch (>= 1.5.1) 84 | 85 | 86 | ## Help and Support 87 | 88 | If you want to get in touch with us, say hi on our discord/gitter chatroom: 89 | 90 | - Discord: https://discord.gg/VVDg6P4 91 
| - Gitter: https://gitter.im/99-ML-Learning-Projects/community 92 | 93 | ## Recent Contributors 94 | [![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/0)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/0)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/1)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/1)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/2)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/2)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/3)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/3)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/4)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/4)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/5)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/5)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/6)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/6)[![](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/images/7)](https://sourcerer.io/fame/gimseng/gimseng/99-ML-Learning-Projects/links/7) 95 | 96 | ## Credit: 97 | 98 | This project is inspired by Unnit Metaliya’s answer on quora: https://qr.ae/pNK0FW 99 | 100 | For credits, these are the two repos (one for C and one for React) where I got the idea from: 101 | - https://github.com/truedl/c-for-beginners 102 | - https://github.com/UnnitMetaliya/99-reactjs-project-ideas 103 | 104 | ## License 105 | 106 | This repo is covered under [The MIT License](LICENSE). 107 | -------------------------------------------------------------------------------- /Resources/Pandas/pokemon_data.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/Resources/Pandas/pokemon_data.xlsx -------------------------------------------------------------------------------- /Resources/Pandas/user.txt: -------------------------------------------------------------------------------- 1 | 1|24|M|technician|85711 2 | 2|53|F|other|94043 3 | 3|23|M|writer|32067 4 | 4|24|M|technician|43537 5 | 5|33|F|other|15213 6 | 6|42|M|executive|98101 7 | 7|57|M|administrator|91344 8 | 8|36|M|administrator|05201 9 | 9|29|M|student|01002 10 | 10|53|M|lawyer|90703 11 | 11|39|F|other|30329 12 | 12|28|F|other|06405 13 | 13|47|M|educator|29206 14 | 14|45|M|scientist|55106 15 | 15|49|F|educator|97301 16 | 16|21|M|entertainment|10309 17 | 17|30|M|programmer|06355 18 | 18|35|F|other|37212 19 | 19|40|M|librarian|02138 20 | 20|42|F|homemaker|95660 21 | 21|26|M|writer|30068 22 | 22|25|M|writer|40206 23 | 23|30|F|artist|48197 24 | 24|21|F|artist|94533 25 | 25|39|M|engineer|55107 26 | 26|49|M|engineer|21044 27 | 27|40|F|librarian|30030 28 | 28|32|M|writer|55369 29 | 29|41|M|programmer|94043 30 | 30|7|M|student|55436 31 | 31|24|M|artist|10003 32 | 32|28|F|student|78741 33 | 33|23|M|student|27510 34 | 34|38|F|administrator|42141 35 | 35|20|F|homemaker|42459 36 | 36|19|F|student|93117 37 | 37|23|M|student|55105 38 | 38|28|F|other|54467 39 | 39|41|M|entertainment|01040 40 | 40|38|M|scientist|27514 41 | 41|33|M|engineer|80525 42 | 42|30|M|administrator|17870 43 | 43|29|F|librarian|20854 44 | 44|26|M|technician|46260 45 | 
45|29|M|programmer|50233 46 | 46|27|F|marketing|46538 47 | 47|53|M|marketing|07102 48 | 48|45|M|administrator|12550 49 | 49|23|F|student|76111 50 | 50|21|M|writer|52245 51 | 51|28|M|educator|16509 52 | 52|18|F|student|55105 53 | 53|26|M|programmer|55414 54 | 54|22|M|executive|66315 55 | 55|37|M|programmer|01331 56 | 56|25|M|librarian|46260 57 | 57|16|M|none|84010 58 | 58|27|M|programmer|52246 59 | 59|49|M|educator|08403 60 | 60|50|M|healthcare|06472 61 | 61|36|M|engineer|30040 62 | 62|27|F|administrator|97214 63 | 63|31|M|marketing|75240 64 | 64|32|M|educator|43202 65 | 65|51|F|educator|48118 66 | 66|23|M|student|80521 67 | 67|17|M|student|60402 68 | 68|19|M|student|22904 69 | 69|24|M|engineer|55337 70 | 70|27|M|engineer|60067 71 | 71|39|M|scientist|98034 72 | 72|48|F|administrator|73034 73 | 73|24|M|student|41850 74 | 74|39|M|scientist|T8H1N 75 | 75|24|M|entertainment|08816 76 | 76|20|M|student|02215 77 | 77|30|M|technician|29379 78 | 78|26|M|administrator|61801 79 | 79|39|F|administrator|03755 80 | 80|34|F|administrator|52241 81 | 81|21|M|student|21218 82 | 82|50|M|programmer|22902 83 | 83|40|M|other|44133 84 | 84|32|M|executive|55369 85 | 85|51|M|educator|20003 86 | 86|26|M|administrator|46005 87 | 87|47|M|administrator|89503 88 | 88|49|F|librarian|11701 89 | 89|43|F|administrator|68106 90 | 90|60|M|educator|78155 91 | 91|55|M|marketing|01913 92 | 92|32|M|entertainment|80525 93 | 93|48|M|executive|23112 94 | 94|26|M|student|71457 95 | 95|31|M|administrator|10707 96 | 96|25|F|artist|75206 97 | 97|43|M|artist|98006 98 | 98|49|F|executive|90291 99 | 99|20|M|student|63129 100 | 100|36|M|executive|90254 101 | 101|15|M|student|05146 102 | 102|38|M|programmer|30220 103 | 103|26|M|student|55108 104 | 104|27|M|student|55108 105 | 105|24|M|engineer|94043 106 | 106|61|M|retired|55125 107 | 107|39|M|scientist|60466 108 | 108|44|M|educator|63130 109 | 109|29|M|other|55423 110 | 110|19|M|student|77840 111 | 111|57|M|engineer|90630 112 | 112|30|M|salesman|60613 113 | 113|47|M|executive|95032 114 | 114|27|M|programmer|75013 115 | 115|31|M|engineer|17110 116 | 116|40|M|healthcare|97232 117 | 117|20|M|student|16125 118 | 118|21|M|administrator|90210 119 | 119|32|M|programmer|67401 120 | 120|47|F|other|06260 121 | 121|54|M|librarian|99603 122 | 122|32|F|writer|22206 123 | 123|48|F|artist|20008 124 | 124|34|M|student|60615 125 | 125|30|M|lawyer|22202 126 | 126|28|F|lawyer|20015 127 | 127|33|M|none|73439 128 | 128|24|F|marketing|20009 129 | 129|36|F|marketing|07039 130 | 130|20|M|none|60115 131 | 131|59|F|administrator|15237 132 | 132|24|M|other|94612 133 | 133|53|M|engineer|78602 134 | 134|31|M|programmer|80236 135 | 135|23|M|student|38401 136 | 136|51|M|other|97365 137 | 137|50|M|educator|84408 138 | 138|46|M|doctor|53211 139 | 139|20|M|student|08904 140 | 140|30|F|student|32250 141 | 141|49|M|programmer|36117 142 | 142|13|M|other|48118 143 | 143|42|M|technician|08832 144 | 144|53|M|programmer|20910 145 | 145|31|M|entertainment|V3N4P 146 | 146|45|M|artist|83814 147 | 147|40|F|librarian|02143 148 | 148|33|M|engineer|97006 149 | 149|35|F|marketing|17325 150 | 150|20|F|artist|02139 151 | 151|38|F|administrator|48103 152 | 152|33|F|educator|68767 153 | 153|25|M|student|60641 154 | 154|25|M|student|53703 155 | 155|32|F|other|11217 156 | 156|25|M|educator|08360 157 | 157|57|M|engineer|70808 158 | 158|50|M|educator|27606 159 | 159|23|F|student|55346 160 | 160|27|M|programmer|66215 161 | 161|50|M|lawyer|55104 162 | 162|25|M|artist|15610 163 | 163|49|M|administrator|97212 164 | 164|47|M|healthcare|80123 165 | 
165|20|F|other|53715 166 | 166|47|M|educator|55113 167 | 167|37|M|other|L9G2B 168 | 168|48|M|other|80127 169 | 169|52|F|other|53705 170 | 170|53|F|healthcare|30067 171 | 171|48|F|educator|78750 172 | 172|55|M|marketing|22207 173 | 173|56|M|other|22306 174 | 174|30|F|administrator|52302 175 | 175|26|F|scientist|21911 176 | 176|28|M|scientist|07030 177 | 177|20|M|programmer|19104 178 | 178|26|M|other|49512 179 | 179|15|M|entertainment|20755 180 | 180|22|F|administrator|60202 181 | 181|26|M|executive|21218 182 | 182|36|M|programmer|33884 183 | 183|33|M|scientist|27708 184 | 184|37|M|librarian|76013 185 | 185|53|F|librarian|97403 186 | 186|39|F|executive|00000 187 | 187|26|M|educator|16801 188 | 188|42|M|student|29440 189 | 189|32|M|artist|95014 190 | 190|30|M|administrator|95938 191 | 191|33|M|administrator|95161 192 | 192|42|M|educator|90840 193 | 193|29|M|student|49931 194 | 194|38|M|administrator|02154 195 | 195|42|M|scientist|93555 196 | 196|49|M|writer|55105 197 | 197|55|M|technician|75094 198 | 198|21|F|student|55414 199 | 199|30|M|writer|17604 200 | 200|40|M|programmer|93402 201 | 201|27|M|writer|E2A4H 202 | 202|41|F|educator|60201 203 | 203|25|F|student|32301 204 | 204|52|F|librarian|10960 205 | 205|47|M|lawyer|06371 206 | 206|14|F|student|53115 207 | 207|39|M|marketing|92037 208 | 208|43|M|engineer|01720 209 | 209|33|F|educator|85710 210 | 210|39|M|engineer|03060 211 | 211|66|M|salesman|32605 212 | 212|49|F|educator|61401 213 | 213|33|M|executive|55345 214 | 214|26|F|librarian|11231 215 | 215|35|M|programmer|63033 216 | 216|22|M|engineer|02215 217 | 217|22|M|other|11727 218 | 218|37|M|administrator|06513 219 | 219|32|M|programmer|43212 220 | 220|30|M|librarian|78205 221 | 221|19|M|student|20685 222 | 222|29|M|programmer|27502 223 | 223|19|F|student|47906 224 | 224|31|F|educator|43512 225 | 225|51|F|administrator|58202 226 | 226|28|M|student|92103 227 | 227|46|M|executive|60659 228 | 228|21|F|student|22003 229 | 229|29|F|librarian|22903 230 | 230|28|F|student|14476 231 | 231|48|M|librarian|01080 232 | 232|45|M|scientist|99709 233 | 233|38|M|engineer|98682 234 | 234|60|M|retired|94702 235 | 235|37|M|educator|22973 236 | 236|44|F|writer|53214 237 | 237|49|M|administrator|63146 238 | 238|42|F|administrator|44124 239 | 239|39|M|artist|95628 240 | 240|23|F|educator|20784 241 | 241|26|F|student|20001 242 | 242|33|M|educator|31404 243 | 243|33|M|educator|60201 244 | 244|28|M|technician|80525 245 | 245|22|M|student|55109 246 | 246|19|M|student|28734 247 | 247|28|M|engineer|20770 248 | 248|25|M|student|37235 249 | 249|25|M|student|84103 250 | 250|29|M|executive|95110 251 | 251|28|M|doctor|85032 252 | 252|42|M|engineer|07733 253 | 253|26|F|librarian|22903 254 | 254|44|M|educator|42647 255 | 255|23|M|entertainment|07029 256 | 256|35|F|none|39042 257 | 257|17|M|student|77005 258 | 258|19|F|student|77801 259 | 259|21|M|student|48823 260 | 260|40|F|artist|89801 261 | 261|28|M|administrator|85202 262 | 262|19|F|student|78264 263 | 263|41|M|programmer|55346 264 | 264|36|F|writer|90064 265 | 265|26|M|executive|84601 266 | 266|62|F|administrator|78756 267 | 267|23|M|engineer|83716 268 | 268|24|M|engineer|19422 269 | 269|31|F|librarian|43201 270 | 270|18|F|student|63119 271 | 271|51|M|engineer|22932 272 | 272|33|M|scientist|53706 273 | 273|50|F|other|10016 274 | 274|20|F|student|55414 275 | 275|38|M|engineer|92064 276 | 276|21|M|student|95064 277 | 277|35|F|administrator|55406 278 | 278|37|F|librarian|30033 279 | 279|33|M|programmer|85251 280 | 280|30|F|librarian|22903 281 | 281|15|F|student|06059 282 | 
282|22|M|administrator|20057 283 | 283|28|M|programmer|55305 284 | 284|40|M|executive|92629 285 | 285|25|M|programmer|53713 286 | 286|27|M|student|15217 287 | 287|21|M|salesman|31211 288 | 288|34|M|marketing|23226 289 | 289|11|M|none|94619 290 | 290|40|M|engineer|93550 291 | 291|19|M|student|44106 292 | 292|35|F|programmer|94703 293 | 293|24|M|writer|60804 294 | 294|34|M|technician|92110 295 | 295|31|M|educator|50325 296 | 296|43|F|administrator|16803 297 | 297|29|F|educator|98103 298 | 298|44|M|executive|01581 299 | 299|29|M|doctor|63108 300 | 300|26|F|programmer|55106 301 | 301|24|M|student|55439 302 | 302|42|M|educator|77904 303 | 303|19|M|student|14853 304 | 304|22|F|student|71701 305 | 305|23|M|programmer|94086 306 | 306|45|M|other|73132 307 | 307|25|M|student|55454 308 | 308|60|M|retired|95076 309 | 309|40|M|scientist|70802 310 | 310|37|M|educator|91711 311 | 311|32|M|technician|73071 312 | 312|48|M|other|02110 313 | 313|41|M|marketing|60035 314 | 314|20|F|student|08043 315 | 315|31|M|educator|18301 316 | 316|43|F|other|77009 317 | 317|22|M|administrator|13210 318 | 318|65|M|retired|06518 319 | 319|38|M|programmer|22030 320 | 320|19|M|student|24060 321 | 321|49|F|educator|55413 322 | 322|20|M|student|50613 323 | 323|21|M|student|19149 324 | 324|21|F|student|02176 325 | 325|48|M|technician|02139 326 | 326|41|M|administrator|15235 327 | 327|22|M|student|11101 328 | 328|51|M|administrator|06779 329 | 329|48|M|educator|01720 330 | 330|35|F|educator|33884 331 | 331|33|M|entertainment|91344 332 | 332|20|M|student|40504 333 | 333|47|M|other|V0R2M 334 | 334|32|M|librarian|30002 335 | 335|45|M|executive|33775 336 | 336|23|M|salesman|42101 337 | 337|37|M|scientist|10522 338 | 338|39|F|librarian|59717 339 | 339|35|M|lawyer|37901 340 | 340|46|M|engineer|80123 341 | 341|17|F|student|44405 342 | 342|25|F|other|98006 343 | 343|43|M|engineer|30093 344 | 344|30|F|librarian|94117 345 | 345|28|F|librarian|94143 346 | 346|34|M|other|76059 347 | 347|18|M|student|90210 348 | 348|24|F|student|45660 349 | 349|68|M|retired|61455 350 | 350|32|M|student|97301 351 | 351|61|M|educator|49938 352 | 352|37|F|programmer|55105 353 | 353|25|M|scientist|28480 354 | 354|29|F|librarian|48197 355 | 355|25|M|student|60135 356 | 356|32|F|homemaker|92688 357 | 357|26|M|executive|98133 358 | 358|40|M|educator|10022 359 | 359|22|M|student|61801 360 | 360|51|M|other|98027 361 | 361|22|M|student|44074 362 | 362|35|F|homemaker|85233 363 | 363|20|M|student|87501 364 | 364|63|M|engineer|01810 365 | 365|29|M|lawyer|20009 366 | 366|20|F|student|50670 367 | 367|17|M|student|37411 368 | 368|18|M|student|92113 369 | 369|24|M|student|91335 370 | 370|52|M|writer|08534 371 | 371|36|M|engineer|99206 372 | 372|25|F|student|66046 373 | 373|24|F|other|55116 374 | 374|36|M|executive|78746 375 | 375|17|M|entertainment|37777 376 | 376|28|F|other|10010 377 | 377|22|M|student|18015 378 | 378|35|M|student|02859 379 | 379|44|M|programmer|98117 380 | 380|32|M|engineer|55117 381 | 381|33|M|artist|94608 382 | 382|45|M|engineer|01824 383 | 383|42|M|administrator|75204 384 | 384|52|M|programmer|45218 385 | 385|36|M|writer|10003 386 | 386|36|M|salesman|43221 387 | 387|33|M|entertainment|37412 388 | 388|31|M|other|36106 389 | 389|44|F|writer|83702 390 | 390|42|F|writer|85016 391 | 391|23|M|student|84604 392 | 392|52|M|writer|59801 393 | 393|19|M|student|83686 394 | 394|25|M|administrator|96819 395 | 395|43|M|other|44092 396 | 396|57|M|engineer|94551 397 | 397|17|M|student|27514 398 | 398|40|M|other|60008 399 | 399|25|M|other|92374 400 | 
400|33|F|administrator|78213 401 | 401|46|F|healthcare|84107 402 | 402|30|M|engineer|95129 403 | 403|37|M|other|06811 404 | 404|29|F|programmer|55108 405 | 405|22|F|healthcare|10019 406 | 406|52|M|educator|93109 407 | 407|29|M|engineer|03261 408 | 408|23|M|student|61755 409 | 409|48|M|administrator|98225 410 | 410|30|F|artist|94025 411 | 411|34|M|educator|44691 412 | 412|25|M|educator|15222 413 | 413|55|M|educator|78212 414 | 414|24|M|programmer|38115 415 | 415|39|M|educator|85711 416 | 416|20|F|student|92626 417 | 417|27|F|other|48103 418 | 418|55|F|none|21206 419 | 419|37|M|lawyer|43215 420 | 420|53|M|educator|02140 421 | 421|38|F|programmer|55105 422 | 422|26|M|entertainment|94533 423 | 423|64|M|other|91606 424 | 424|36|F|marketing|55422 425 | 425|19|M|student|58644 426 | 426|55|M|educator|01602 427 | 427|51|M|doctor|85258 428 | 428|28|M|student|55414 429 | 429|27|M|student|29205 430 | 430|38|M|scientist|98199 431 | 431|24|M|marketing|92629 432 | 432|22|M|entertainment|50311 433 | 433|27|M|artist|11211 434 | 434|16|F|student|49705 435 | 435|24|M|engineer|60007 436 | 436|30|F|administrator|17345 437 | 437|27|F|other|20009 438 | 438|51|F|administrator|43204 439 | 439|23|F|administrator|20817 440 | 440|30|M|other|48076 441 | 441|50|M|technician|55013 442 | 442|22|M|student|85282 443 | 443|35|M|salesman|33308 444 | 444|51|F|lawyer|53202 445 | 445|21|M|writer|92653 446 | 446|57|M|educator|60201 447 | 447|30|M|administrator|55113 448 | 448|23|M|entertainment|10021 449 | 449|23|M|librarian|55021 450 | 450|35|F|educator|11758 451 | 451|16|M|student|48446 452 | 452|35|M|administrator|28018 453 | 453|18|M|student|06333 454 | 454|57|M|other|97330 455 | 455|48|M|administrator|83709 456 | 456|24|M|technician|31820 457 | 457|33|F|salesman|30011 458 | 458|47|M|technician|Y1A6B 459 | 459|22|M|student|29201 460 | 460|44|F|other|60630 461 | 461|15|M|student|98102 462 | 462|19|F|student|02918 463 | 463|48|F|healthcare|75218 464 | 464|60|M|writer|94583 465 | 465|32|M|other|05001 466 | 466|22|M|student|90804 467 | 467|29|M|engineer|91201 468 | 468|28|M|engineer|02341 469 | 469|60|M|educator|78628 470 | 470|24|M|programmer|10021 471 | 471|10|M|student|77459 472 | 472|24|M|student|87544 473 | 473|29|M|student|94708 474 | 474|51|M|executive|93711 475 | 475|30|M|programmer|75230 476 | 476|28|M|student|60440 477 | 477|23|F|student|02125 478 | 478|29|M|other|10019 479 | 479|30|M|educator|55409 480 | 480|57|M|retired|98257 481 | 481|73|M|retired|37771 482 | 482|18|F|student|40256 483 | 483|29|M|scientist|43212 484 | 484|27|M|student|21208 485 | 485|44|F|educator|95821 486 | 486|39|M|educator|93101 487 | 487|22|M|engineer|92121 488 | 488|48|M|technician|21012 489 | 489|55|M|other|45218 490 | 490|29|F|artist|V5A2B 491 | 491|43|F|writer|53711 492 | 492|57|M|educator|94618 493 | 493|22|M|engineer|60090 494 | 494|38|F|administrator|49428 495 | 495|29|M|engineer|03052 496 | 496|21|F|student|55414 497 | 497|20|M|student|50112 498 | 498|26|M|writer|55408 499 | 499|42|M|programmer|75006 500 | 500|28|M|administrator|94305 501 | 501|22|M|student|10025 502 | 502|22|M|student|23092 503 | 503|50|F|writer|27514 504 | 504|40|F|writer|92115 505 | 505|27|F|other|20657 506 | 506|46|M|programmer|03869 507 | 507|18|F|writer|28450 508 | 508|27|M|marketing|19382 509 | 509|23|M|administrator|10011 510 | 510|34|M|other|98038 511 | 511|22|M|student|21250 512 | 512|29|M|other|20090 513 | 513|43|M|administrator|26241 514 | 514|27|M|programmer|20707 515 | 515|53|M|marketing|49508 516 | 516|53|F|librarian|10021 517 | 517|24|M|student|55454 518 
| 518|49|F|writer|99709
519 | 519|22|M|other|55320
520 | 520|62|M|healthcare|12603
521 | 521|19|M|student|02146
522 | 522|36|M|engineer|55443
523 | 523|50|F|administrator|04102
524 | 524|56|M|educator|02159
525 | 525|27|F|administrator|19711
526 | 526|30|M|marketing|97124
527 | 527|33|M|librarian|12180
528 | 528|18|M|student|55104
529 | 529|47|F|administrator|44224
530 | 530|29|M|engineer|94040
531 | 531|30|F|salesman|97408
532 | 532|20|M|student|92705
533 | 533|43|M|librarian|02324
534 | 534|20|M|student|05464
535 | 535|45|F|educator|80302
536 | 536|38|M|engineer|30078
537 | 537|36|M|engineer|22902
538 | 538|31|M|scientist|21010
539 | 539|53|F|administrator|80303
540 | 540|28|M|engineer|91201
541 | 541|19|F|student|84302
542 | 542|21|M|student|60515
543 | 543|33|M|scientist|95123
544 | 544|44|F|other|29464
545 | 545|27|M|technician|08052
546 | 546|36|M|executive|22911
547 | 547|50|M|educator|14534
548 | 548|51|M|writer|95468
549 | 549|42|M|scientist|45680
550 | 550|16|F|student|95453
551 | 551|25|M|programmer|55414
552 | 552|45|M|other|68147
553 | 553|58|M|educator|62901
554 | 554|32|M|scientist|62901
555 | 555|29|F|educator|23227
556 | 556|35|F|educator|30606
557 | 557|30|F|writer|11217
558 | 558|56|F|writer|63132
559 | 559|69|M|executive|10022
560 | 560|32|M|student|10003
561 | 561|23|M|engineer|60005
562 | 562|54|F|administrator|20879
563 | 563|39|F|librarian|32707
564 | 564|65|M|retired|94591
565 | 565|40|M|student|55422
566 | 566|20|M|student|14627
567 | 567|24|M|entertainment|10003
568 | 568|39|M|educator|01915
569 | 569|34|M|educator|91903
570 | 570|26|M|educator|14627
571 | 571|34|M|artist|01945
572 | 572|51|M|educator|20003
573 | 573|68|M|retired|48911
574 | 574|56|M|educator|53188
575 | 575|33|M|marketing|46032
576 | 576|48|M|executive|98281
577 | 577|36|F|student|77845
578 | 578|31|M|administrator|M7A1A
579 | 579|32|M|educator|48103
580 | 580|16|M|student|17961
581 | 581|37|M|other|94131
582 | 582|17|M|student|93003
583 | 583|44|M|engineer|29631
584 | 584|25|M|student|27511
585 | 585|69|M|librarian|98501
586 | 586|20|M|student|79508
587 | 587|26|M|other|14216
588 | 588|18|F|student|93063
589 | 589|21|M|lawyer|90034
590 | 590|50|M|educator|82435
591 | 591|57|F|librarian|92093
592 | 592|18|M|student|97520
593 | 593|31|F|educator|68767
594 | 594|46|M|educator|M4J2K
595 | 595|25|M|programmer|31909
596 | 596|20|M|artist|77073
597 | 597|23|M|other|84116
598 | 598|40|F|marketing|43085
599 | 599|22|F|student|R3T5K
600 | 600|34|M|programmer|02320
601 | 601|19|F|artist|99687
602 | 602|47|F|other|34656
603 | 603|21|M|programmer|47905
604 | 604|39|M|educator|11787
605 | 605|33|M|engineer|33716
606 | 606|28|M|programmer|63044
607 | 607|49|F|healthcare|02154
608 | 608|22|M|other|10003
609 | 609|13|F|student|55106
610 | 610|22|M|student|21227
611 | 611|46|M|librarian|77008
612 | 612|36|M|educator|79070
613 | 613|37|F|marketing|29678
614 | 614|54|M|educator|80227
615 | 615|38|M|educator|27705
616 | 616|55|M|scientist|50613
617 | 617|27|F|writer|11201
618 | 618|15|F|student|44212
619 | 619|17|M|student|44134
620 | 620|18|F|writer|81648
621 | 621|17|M|student|60402
622 | 622|25|M|programmer|14850
623 | 623|50|F|educator|60187
624 | 624|19|M|student|30067
625 | 625|27|M|programmer|20723
626 | 626|23|M|scientist|19807
627 | 627|24|M|engineer|08034
628 | 628|13|M|none|94306
629 | 629|46|F|other|44224
630 | 630|26|F|healthcare|55408
631 | 631|18|F|student|38866
632 | 632|18|M|student|55454
633 | 633|35|M|programmer|55414
634 | 634|39|M|engineer|T8H1N
635 | 635|22|M|other|23237
636 | 636|47|M|educator|48043
637 | 637|30|M|other|74101
638 | 638|45|M|engineer|01940
639 | 639|42|F|librarian|12065
640 | 640|20|M|student|61801
641 | 641|24|M|student|60626
642 | 642|18|F|student|95521
643 | 643|39|M|scientist|55122
644 | 644|51|M|retired|63645
645 | 645|27|M|programmer|53211
646 | 646|17|F|student|51250
647 | 647|40|M|educator|45810
648 | 648|43|M|engineer|91351
649 | 649|20|M|student|39762
650 | 650|42|M|engineer|83814
651 | 651|65|M|retired|02903
652 | 652|35|M|other|22911
653 | 653|31|M|executive|55105
654 | 654|27|F|student|78739
655 | 655|50|F|healthcare|60657
656 | 656|48|M|educator|10314
657 | 657|26|F|none|78704
658 | 658|33|M|programmer|92626
659 | 659|31|M|educator|54248
660 | 660|26|M|student|77380
661 | 661|28|M|programmer|98121
662 | 662|55|M|librarian|19102
663 | 663|26|M|other|19341
664 | 664|30|M|engineer|94115
665 | 665|25|M|administrator|55412
666 | 666|44|M|administrator|61820
667 | 667|35|M|librarian|01970
668 | 668|29|F|writer|10016
669 | 669|37|M|other|20009
670 | 670|30|M|technician|21114
671 | 671|21|M|programmer|91919
672 | 672|54|F|administrator|90095
673 | 673|51|M|educator|22906
674 | 674|13|F|student|55337
675 | 675|34|M|other|28814
676 | 676|30|M|programmer|32712
677 | 677|20|M|other|99835
678 | 678|50|M|educator|61462
679 | 679|20|F|student|54302
680 | 680|33|M|lawyer|90405
681 | 681|44|F|marketing|97208
682 | 682|23|M|programmer|55128
683 | 683|42|M|librarian|23509
684 | 684|28|M|student|55414
685 | 685|32|F|librarian|55409
686 | 686|32|M|educator|26506
687 | 687|31|F|healthcare|27713
688 | 688|37|F|administrator|60476
689 | 689|25|M|other|45439
690 | 690|35|M|salesman|63304
691 | 691|34|M|educator|60089
692 | 692|34|M|engineer|18053
693 | 693|43|F|healthcare|85210
694 | 694|60|M|programmer|06365
695 | 695|26|M|writer|38115
696 | 696|55|M|other|94920
697 | 697|25|M|other|77042
698 | 698|28|F|programmer|06906
699 | 699|44|M|other|96754
700 | 700|17|M|student|76309
701 | 701|51|F|librarian|56321
702 | 702|37|M|other|89104
703 | 703|26|M|educator|49512
704 | 704|51|F|librarian|91105
705 | 705|21|F|student|54494
706 | 706|23|M|student|55454
707 | 707|56|F|librarian|19146
708 | 708|26|F|homemaker|96349
709 | 709|21|M|other|N4T1A
710 | 710|19|M|student|92020
711 | 711|22|F|student|15203
712 | 712|22|F|student|54901
713 | 713|42|F|other|07204
714 | 714|26|M|engineer|55343
715 | 715|21|M|technician|91206
716 | 716|36|F|administrator|44265
717 | 717|24|M|technician|84105
718 | 718|42|M|technician|64118
719 | 719|37|F|other|V0R2H
720 | 720|49|F|administrator|16506
721 | 721|24|F|entertainment|11238
722 | 722|50|F|homemaker|17331
723 | 723|26|M|executive|94403
724 | 724|31|M|executive|40243
725 | 725|21|M|student|91711
726 | 726|25|F|administrator|80538
727 | 727|25|M|student|78741
728 | 728|58|M|executive|94306
729 | 729|19|M|student|56567
730 | 730|31|F|scientist|32114
731 | 731|41|F|educator|70403
732 | 732|28|F|other|98405
733 | 733|44|F|other|60630
734 | 734|25|F|other|63108
735 | 735|29|F|healthcare|85719
736 | 736|48|F|writer|94618
737 | 737|30|M|programmer|98072
738 | 738|35|M|technician|95403
739 | 739|35|M|technician|73162
740 | 740|25|F|educator|22206
741 | 741|25|M|writer|63108
742 | 742|35|M|student|29210
743 | 743|31|M|programmer|92660
744 | 744|35|M|marketing|47024
745 | 745|42|M|writer|55113
746 | 746|25|M|engineer|19047
747 | 747|19|M|other|93612
748 | 748|28|M|administrator|94720
749 | 749|33|M|other|80919
750 | 750|28|M|administrator|32303
751 | 751|24|F|other|90034
752 | 752|60|M|retired|21201
753 | 753|56|M|salesman|91206
754 | 754|59|F|librarian|62901
755 | 755|44|F|educator|97007
756 | 756|30|F|none|90247
757 | 757|26|M|student|55104
758 | 758|27|M|student|53706
759 | 759|20|F|student|68503
760 | 760|35|F|other|14211
761 | 761|17|M|student|97302
762 | 762|32|M|administrator|95050
763 | 763|27|M|scientist|02113
764 | 764|27|F|educator|62903
765 | 765|31|M|student|33066
766 | 766|42|M|other|10960
767 | 767|70|M|engineer|00000
768 | 768|29|M|administrator|12866
769 | 769|39|M|executive|06927
770 | 770|28|M|student|14216
771 | 771|26|M|student|15232
772 | 772|50|M|writer|27105
773 | 773|20|M|student|55414
774 | 774|30|M|student|80027
775 | 775|46|M|executive|90036
776 | 776|30|M|librarian|51157
777 | 777|63|M|programmer|01810
778 | 778|34|M|student|01960
779 | 779|31|M|student|K7L5J
780 | 780|49|M|programmer|94560
781 | 781|20|M|student|48825
782 | 782|21|F|artist|33205
783 | 783|30|M|marketing|77081
784 | 784|47|M|administrator|91040
785 | 785|32|M|engineer|23322
786 | 786|36|F|engineer|01754
787 | 787|18|F|student|98620
788 | 788|51|M|administrator|05779
789 | 789|29|M|other|55420
790 | 790|27|M|technician|80913
791 | 791|31|M|educator|20064
792 | 792|40|M|programmer|12205
793 | 793|22|M|student|85281
794 | 794|32|M|educator|57197
795 | 795|30|M|programmer|08610
796 | 796|32|F|writer|33755
797 | 797|44|F|other|62522
798 | 798|40|F|writer|64131
799 | 799|49|F|administrator|19716
800 | 800|25|M|programmer|55337
801 | 801|22|M|writer|92154
802 | 802|35|M|administrator|34105
803 | 803|70|M|administrator|78212
804 | 804|39|M|educator|61820
805 | 805|27|F|other|20009
806 | 806|27|M|marketing|11217
807 | 807|41|F|healthcare|93555
808 | 808|45|M|salesman|90016
809 | 809|50|F|marketing|30803
810 | 810|55|F|other|80526
811 | 811|40|F|educator|73013
812 | 812|22|M|technician|76234
813 | 813|14|F|student|02136
814 | 814|30|M|other|12345
815 | 815|32|M|other|28806
816 | 816|34|M|other|20755
817 | 817|19|M|student|60152
818 | 818|28|M|librarian|27514
819 | 819|59|M|administrator|40205
820 | 820|22|M|student|37725
821 | 821|37|M|engineer|77845
822 | 822|29|F|librarian|53144
823 | 823|27|M|artist|50322
824 | 824|31|M|other|15017
825 | 825|44|M|engineer|05452
826 | 826|28|M|artist|77048
827 | 827|23|F|engineer|80228
828 | 828|28|M|librarian|85282
829 | 829|48|M|writer|80209
830 | 830|46|M|programmer|53066
831 | 831|21|M|other|33765
832 | 832|24|M|technician|77042
833 | 833|34|M|writer|90019
834 | 834|26|M|other|64153
835 | 835|44|F|executive|11577
836 | 836|44|M|artist|10018
837 | 837|36|F|artist|55409
838 | 838|23|M|student|01375
839 | 839|38|F|entertainment|90814
840 | 840|39|M|artist|55406
841 | 841|45|M|doctor|47401
842 | 842|40|M|writer|93055
843 | 843|35|M|librarian|44212
844 | 844|22|M|engineer|95662
845 | 845|64|M|doctor|97405
846 | 846|27|M|lawyer|47130
847 | 847|29|M|student|55417
848 | 848|46|M|engineer|02146
849 | 849|15|F|student|25652
850 | 850|34|M|technician|78390
851 | 851|18|M|other|29646
852 | 852|46|M|administrator|94086
853 | 853|49|M|writer|40515
854 | 854|29|F|student|55408
855 | 855|53|M|librarian|04988
856 | 856|43|F|marketing|97215
857 | 857|35|F|administrator|V1G4L
858 | 858|63|M|educator|09645
859 | 859|18|F|other|06492
860 | 860|70|F|retired|48322
861 | 861|38|F|student|14085
862 | 862|25|M|executive|13820
863 | 863|17|M|student|60089
864 | 864|27|M|programmer|63021
865 | 865|25|M|artist|11231
866 | 866|45|M|other|60302
867 | 867|24|M|scientist|92507
868 | 868|21|M|programmer|55303
869 | 869|30|M|student|10025
870 | 870|22|M|student|65203
871 | 871|31|M|executive|44648
872 | 872|19|F|student|74078
873 | 873|48|F|administrator|33763
874 | 874|36|M|scientist|37076
875 | 875|24|F|student|35802
876 | 876|41|M|other|20902
877 | 877|30|M|other|77504
878 | 878|50|F|educator|98027
879 | 879|33|F|administrator|55337
880 | 880|13|M|student|83702
881 | 881|39|M|marketing|43017
882 | 882|35|M|engineer|40503
883 | 883|49|M|librarian|50266
884 | 884|44|M|engineer|55337
885 | 885|30|F|other|95316
886 | 886|20|M|student|61820
887 | 887|14|F|student|27249
888 | 888|41|M|scientist|17036
889 | 889|24|M|technician|78704
890 | 890|32|M|student|97301
891 | 891|51|F|administrator|03062
892 | 892|36|M|other|45243
893 | 893|25|M|student|95823
894 | 894|47|M|educator|74075
895 | 895|31|F|librarian|32301
896 | 896|28|M|writer|91505
897 | 897|30|M|other|33484
898 | 898|23|M|homemaker|61755
899 | 899|32|M|other|55116
900 | 900|60|M|retired|18505
901 | 901|38|M|executive|L1V3W
902 | 902|45|F|artist|97203
903 | 903|28|M|educator|20850
904 | 904|17|F|student|61073
905 | 905|27|M|other|30350
906 | 906|45|M|librarian|70124
907 | 907|25|F|other|80526
908 | 908|44|F|librarian|68504
909 | 909|50|F|educator|53171
910 | 910|28|M|healthcare|29301
911 | 911|37|F|writer|53210
912 | 912|51|M|other|06512
913 | 913|27|M|student|76201
914 | 914|44|F|other|08105
915 | 915|50|M|entertainment|60614
916 | 916|27|M|engineer|N2L5N
917 | 917|22|F|student|20006
918 | 918|40|M|scientist|70116
919 | 919|25|M|other|14216
920 | 920|30|F|artist|90008
921 | 921|20|F|student|98801
922 | 922|29|F|administrator|21114
923 | 923|21|M|student|E2E3R
924 | 924|29|M|other|11753
925 | 925|18|F|salesman|49036
926 | 926|49|M|entertainment|01701
927 | 927|23|M|programmer|55428
928 | 928|21|M|student|55408
929 | 929|44|M|scientist|53711
930 | 930|28|F|scientist|07310
931 | 931|60|M|educator|33556
932 | 932|58|M|educator|06437
933 | 933|28|M|student|48105
934 | 934|61|M|engineer|22902
935 | 935|42|M|doctor|66221
936 | 936|24|M|other|32789
937 | 937|48|M|educator|98072
938 | 938|38|F|technician|55038
939 | 939|26|F|student|33319
940 | 940|32|M|administrator|02215
941 | 941|20|M|student|97229
942 | 942|48|F|librarian|78209
943 | 943|22|M|student|77841
944 | 
--------------------------------------------------------------------------------
/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/gimseng/99-ML-Learning-Projects/e18c0a34e6a22b05659d31a7155384257d3b02ae/__init__.py
--------------------------------------------------------------------------------
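For readers who want to work with the `Resources/Pandas/user.txt` records above, here is a minimal loading sketch in Python. The records are pipe-delimited with one user per line and no header row; the column names below are assumptions inferred from the record shape (id | age | gender | occupation | zip code), not names defined anywhere in the repository, and the path assumes the repository root as the working directory.

```python
import pandas as pd

# Assumed column names -- user.txt ships without a header row.
columns = ["user_id", "age", "gender", "occupation", "zip_code"]

users = pd.read_csv(
    "Resources/Pandas/user.txt",  # assumed path, relative to the repo root
    sep="|",                      # records are pipe-delimited
    names=columns,                # supply headers since the file has none
    dtype={"zip_code": str},      # keep leading zeros (e.g. 02146) and
                                  # non-numeric codes (e.g. M7A1A) intact
)

print(users.shape)                             # expect (943, 5) for this file
print(users["occupation"].value_counts().head())
```

Reading `zip_code` as a string rather than letting pandas infer a numeric type matters here: several codes carry leading zeros and a handful (such as `M7A1A` or `V0R2H`) are not numbers at all, so numeric inference would either drop digits or fail.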