├── .gitignore ├── ANALYSIS OF DEEP NEURAL NETWORK MODELS └── Deep NN Models.pdf ├── Exploratory Data Analysis ├── .ipynb_checkpoints │ └── 3. Let's Code-checkpoint.ipynb ├── 1. Credit Risk Modeling (Read First).docx ├── 2. Columns Description.csv ├── 3. Let's Code.ipynb └── 4. The Analysis in Detail.pptx ├── Logistic Regression (Classification) ├── .ipynb_checkpoints │ └── 3. Lead Scoring -Suraaj Hasija-checkpoint.ipynb ├── 1. Problem Statement & Steps.docx ├── 2. Lead Scoring PPT - Suraaj.pdf ├── 3. Lead Scoring -Suraaj Hasija.ipynb └── Datasets │ ├── Leads Data Dictionary.xlsx │ └── Leads.csv ├── Multiple Linear Regression ├── 1. Steps to be performed (Read First).docx ├── 2. Multiple Linear Regression.ipynb ├── 3. Multiple Linear Case Inferences.pdf ├── YuluDulu Columns Description └── YuluDulu Dataset.csv ├── Natural Language Processing ├── Module 1- Regex.ipynb ├── Module 2- Lexical Processing.ipynb ├── Module 3- Model Building.ipynb ├── Module 4- Word Embeddings ├── Module 5- Advanced NLP using LSTM.ipynb ├── Module 6- Comparing Different NLP Models.ipynb └── dataset.csv ├── Pictures ├── 5-Step-Workflow-for-Multiple-Linear-Regression.png ├── Exploratory-Data-Analysis.png └── log_reg.png └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /ANALYSIS OF DEEP NEURAL NETWORK MODELS/Deep NN Models.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/ANALYSIS OF DEEP NEURAL NETWORK MODELS/Deep NN Models.pdf -------------------------------------------------------------------------------- /Exploratory Data Analysis/1. 
Credit Risk Modeling (Read First).docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Exploratory Data Analysis/1. Credit Risk Modeling (Read First).docx -------------------------------------------------------------------------------- /Exploratory Data Analysis/2. Columns Description.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Exploratory Data Analysis/2. Columns Description.csv -------------------------------------------------------------------------------- /Exploratory Data Analysis/3. Let's Code.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "#importing required libraries for this case study\n", 10 | "import pandas as pd\n", 11 | "import seaborn as sbn\n", 12 | "import numpy as np\n", 13 | "import matplotlib.pyplot as plt\n", 14 | "pd.set_option('display.max_rows',500)\n", 15 | "pd.set_option('display.max_columns',500)\n", 16 | "pd.set_option('display.width',500)\n", 17 | "pd.set_option('max_info_columns',500)" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "ename": "FileNotFoundError", 27 | "evalue": "[Errno 2] File b'/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv' does not exist: b'/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv'", 28 | "output_type": "error", 29 | "traceback": [ 30 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 31 | "\u001b[0;31mFileNotFoundError\u001b[0m Traceback (most recent call last)", 32 | 
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;31m#Importing the applications data file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mapplications\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 33 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36mparser_f\u001b[0;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)\u001b[0m\n\u001b[1;32m 700\u001b[0m skip_blank_lines=skip_blank_lines)\n\u001b[1;32m 701\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 702\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 703\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 704\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 34 | 
"\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_read\u001b[0;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[1;32m 427\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 428\u001b[0m \u001b[0;31m# Create the parser.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 429\u001b[0;31m \u001b[0mparser\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mTextFileReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 430\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 431\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mchunksize\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0miterator\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 35 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, f, engine, **kwds)\u001b[0m\n\u001b[1;32m 893\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'has_index_names'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 894\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 895\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mengine\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 896\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 897\u001b[0m \u001b[0;32mdef\u001b[0m 
\u001b[0mclose\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 36 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m_make_engine\u001b[0;34m(self, engine)\u001b[0m\n\u001b[1;32m 1120\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_make_engine\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mengine\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'c'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1121\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'c'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1122\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mCParserWrapper\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0moptions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1123\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1124\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mengine\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0;34m'python'\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 37 | "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, src, **kwds)\u001b[0m\n\u001b[1;32m 1851\u001b[0m \u001b[0mkwds\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'usecols'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0musecols\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1852\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1853\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mparsers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTextReader\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msrc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1854\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munnamed_cols\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_reader\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munnamed_cols\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1855\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", 38 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.__cinit__\u001b[0;34m()\u001b[0m\n", 39 | "\u001b[0;32mpandas/_libs/parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._setup_parser_source\u001b[0;34m()\u001b[0m\n", 40 | "\u001b[0;31mFileNotFoundError\u001b[0m: [Errno 2] File b'/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv' does not exist: b'/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv'" 41 | ] 42 | } 43 | ], 44 | "source": [ 45 | "#Importing the applications data file \n", 46 | "applications=pd.read_csv('/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/application_data.csv')" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": {}, 53 | "outputs": [], 54 | "source": [ 55 | "#basic data checks \n", 56 | "applications.head()" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | 
"#checking the shape of the dataframe\n", 66 | "applications.shape" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "#to check null percentage\n", 76 | "xx=applications.isnull().sum()/len(applications.index)*100\n", 77 | "xx" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "#dropping columns with null % greater than 50%\n", 87 | "cols_to_drop=xx[xx>50].index" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "app=applications.drop(columns=cols_to_drop)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "scrolled": true 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "#Now again checking the percentage of missing values\n", 108 | "app.isnull().sum()/len(applications.index)*100" 109 | ] 110 | }, 111 | { 112 | "cell_type": "code", 113 | "execution_count": null, 114 | "metadata": { 115 | "scrolled": true 116 | }, 117 | "outputs": [], 118 | "source": [ 119 | "#checking for categorical and continuous data\n", 120 | "app.nunique()" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": { 127 | "scrolled": true 128 | }, 129 | "outputs": [], 130 | "source": [ 131 | "#Checking the categorical colums before imputation of columns with less % of null values (~13%)\n", 132 | "app['AMT_REQ_CREDIT_BUREAU_DAY'].value_counts()" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": null, 138 | "metadata": {}, 139 | "outputs": [], 140 | "source": [ 141 | "#Calculating the mode\n", 142 | "app_AMT_REQ_CREDIT_BUREAU_HOUR=app['AMT_REQ_CREDIT_BUREAU_HOUR'].mode()[0]\n", 143 | "app_AMT_REQ_CREDIT_BUREAU_DAY=app['AMT_REQ_CREDIT_BUREAU_DAY'].mode()[0]\n", 144 | 
"app_AMT_REQ_CREDIT_BUREAU_WEEK=app['AMT_REQ_CREDIT_BUREAU_WEEK'].mode()[0]\n", 145 | "app_AMT_REQ_CREDIT_BUREAU_MON=app['AMT_REQ_CREDIT_BUREAU_MON'].mode()[0]\n", 146 | "app_AMT_REQ_CREDIT_BUREAU_QRT=app['AMT_REQ_CREDIT_BUREAU_QRT'].mode()[0]\n", 147 | "app_AMT_REQ_CREDIT_BUREAU_YEAR=app['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()[0]\n" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "#Reporting the mode\n", 157 | "print(app_AMT_REQ_CREDIT_BUREAU_HOUR)\n", 158 | "print(app_AMT_REQ_CREDIT_BUREAU_DAY)\n", 159 | "print(app_AMT_REQ_CREDIT_BUREAU_WEEK)\n", 160 | "print(app_AMT_REQ_CREDIT_BUREAU_MON)\n", 161 | "print(app_AMT_REQ_CREDIT_BUREAU_QRT)\n", 162 | "print(app_AMT_REQ_CREDIT_BUREAU_YEAR)\n", 163 | "\n", 164 | "#only reporting and no imputing\n", 165 | "#app.loc[app['AMT_REQ_CREDIT_BUREAU_DAY'].isnull(),'AMT_REQ_CREDIT_BUREAU_DAY']=app_AMT_REQ_CREDIT_BUREAU_DAY" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": {}, 172 | "outputs": [], 173 | "source": [ 174 | "#treating others columns as well (for mode imputation) \n", 175 | "app_name_type_suite=app['NAME_TYPE_SUITE'].mode()\n", 176 | "print(app_name_type_suite)\n", 177 | "\n", 178 | "##### do for remaining as well?" 
179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "scrolled": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "# checking the data types of the columns\n", 190 | "app.dtypes \n" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | "metadata": {}, 197 | "outputs": [], 198 | "source": [ 199 | "#outlier in AMT_INCOME_TOTAL\n", 200 | "app.AMT_INCOME_TOTAL.quantile([0.25,0.5,.75,.95,.99,1])\n", 201 | "# Outlier Value= 117Mn " 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "#Visualization (BoxPlot) to check outliers in AMT_INCOME_TOTAL\n", 211 | "sbn.boxplot(app.AMT_INCOME_TOTAL)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "# to check outliers in AMT_CREDIT\n", 221 | "app.AMT_CREDIT.quantile([0.25,0.5,.75,.95,.99,1])" 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": null, 227 | "metadata": {}, 228 | "outputs": [], 229 | "source": [ 230 | "# Visualization (BoxPlot) to check outliers in AMT_CREDIT\n", 231 | "plt.figure(figsize=[10,2])\n", 232 | "sbn.boxplot(app.AMT_CREDIT)\n", 233 | "plt.show()" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "plt.figure(figsize=[10,2])\n", 243 | "sbn.boxplot(app['AMT_GOODS_PRICE'])" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": null, 249 | "metadata": {}, 250 | "outputs": [], 251 | "source": [ 252 | "app.AMT_GOODS_PRICE.quantile([0.25,0.5,.75,.95,.99,1])" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "#Treating the outliers\n", 262 | "appl=app[(app.AMT_CREDIT<=1854000) 
& (app.AMT_GOODS_PRICE<4050000) & (app.AMT_INCOME_TOTAL<=472500)].copy()" 263 | ] 264 | }, 265 | { 266 | "cell_type": "code", 267 | "execution_count": null, 268 | "metadata": {}, 269 | "outputs": [], 270 | "source": [ 271 | "#Analysing AMT_GOODS_PRICE after removing the outliers\n", 272 | "appl.AMT_GOODS_PRICE.quantile([0.25,0.5,.75,.95,.99,1])" 273 | ] 274 | }, 275 | { 276 | "cell_type": "code", 277 | "execution_count": null, 278 | "metadata": {}, 279 | "outputs": [], 280 | "source": [ 281 | "#Analysing AMT_CREDIT after removing the outliers\n", 282 | "appl.AMT_CREDIT.quantile([0.25,0.5,.75,.95,.99,1])" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "#Analysing AMT_INCOME_TOTAL after removing the outliers\n", 292 | "appl.AMT_INCOME_TOTAL.quantile([0.25,0.5,.75,.95,.99,1])" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "# dropping unnecessary columns\n", 302 | "appl.drop(columns=appl.columns[33:39],inplace=True)\n", 303 | "appl.drop(columns=appl.columns[34:69],inplace=True)" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": {}, 310 | "outputs": [], 311 | "source": [ 312 | "appl.head()\n" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "#Binning- Dividing income into groups\n", 322 | "plt.figure(figsize=[10,8])\n", 323 | "sbn.distplot(appl.AMT_INCOME_TOTAL)" 324 | ] 325 | }, 326 | { 327 | "cell_type": "markdown", 328 | "metadata": {}, 329 | "source": [ 330 | "1) if the distribution is approximately symmetrical (mainly if the mean is close to the mode) and \n", 331 | "there are no outliers, the cut-off values are given by: mean - 0.25 * standard deviation; mean + 0.25 * standard deviation;\n", 332 | "\n", 333 | "2) if
the distribution is clearly asymmetric or if there are outliers, the cut-off values are given by: median - 0.25 * interquartile range; median + 0.25 * interquartile range.\n" 334 | ] 335 | }, 336 | { 337 | "cell_type": "code", 338 | "execution_count": null, 339 | "metadata": {}, 340 | "outputs": [], 341 | "source": [ 342 | "income_med=appl.AMT_INCOME_TOTAL.median()\n", 343 | "income_quant=appl.AMT_INCOME_TOTAL.quantile(.25) #first quartile is used as the offset here (not 0.25*IQR)\n", 344 | "low_income=income_med-income_quant\n", 345 | "high_income=income_med+income_quant\n", 346 | "print(low_income)\n", 347 | "print(high_income)" 348 | ] 349 | }, 350 | { 351 | "cell_type": "code", 352 | "execution_count": null, 353 | "metadata": {}, 354 | "outputs": [], 355 | "source": [ 356 | "#Binning of continuous variables \n", 357 | "\n", 358 | "appl.loc[appl['AMT_INCOME_TOTAL']<low_income,'INCOME_GROUP']='low income' #label assumed, lost in the source\n", 359 | "appl.loc[(appl['AMT_INCOME_TOTAL']>=low_income) & (appl['AMT_INCOME_TOTAL']<high_income),'INCOME_GROUP']='medium income' #label assumed, lost in the source\n", 360 | "appl.loc[appl['AMT_INCOME_TOTAL']>=high_income,'INCOME_GROUP']='high income'" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": null, 366 | "metadata": {}, 367 | "outputs": [], 368 | "source": [ 369 | "#Checking for Imbalance percentage\n", 370 | "appl['TARGET'].value_counts(normalize=True).plot.bar()\n" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": null, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "#Checking for Imbalance percentage\n", 380 | "appl['TARGET'].value_counts(normalize=True)*100" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": {}, 387 | "outputs": [], 388 | "source": [ 389 | "appl['TARGET'].value_counts()" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": null, 395 | "metadata": {}, 396 | "outputs": [], 397 | "source": [ 398 | "#Dividing the data into two subsets based on targets\n", 399 | "applT1=appl[appl['TARGET']==1].copy()\n", 400 | "applT0=appl[appl['TARGET']==0].copy()" 401 | ] 402 | }, 403 | { 404 | "cell_type": "code", 405 | "execution_count": null, 406 | "metadata": { 407 | "scrolled":
true 408 | }, 409 | "outputs": [], 410 | "source": [ 411 | "#to check categorical and non categorical data\n", 412 | "appl.nunique()" 413 | ] 414 | }, 415 | { 416 | "cell_type": "code", 417 | "execution_count": null, 418 | "metadata": {}, 419 | "outputs": [], 420 | "source": [ 421 | "#Plot for Occupation Type\n", 422 | "#Univariate analysis of categorical data\n", 423 | "fig = plt.figure(figsize=[15,7])\n", 424 | "plt.subplot(1, 2, 1)\n", 425 | "\n", 426 | "\n", 427 | "applT1['OCCUPATION_TYPE'].value_counts().sort_values(ascending=False).plot.bar(color='r')\n", 428 | "plt.title('Target 1')\n", 429 | "\n", 430 | "plt.subplot(1, 2, 2)\n", 431 | "applT0['OCCUPATION_TYPE'].value_counts().sort_values(ascending=False).plot.bar()\n", 432 | "plt.title('Target 0')\n", 433 | "plt.tight_layout()\n", 434 | "plt.show()" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": null, 440 | "metadata": {}, 441 | "outputs": [], 442 | "source": [ 443 | "#Plot for Income Type\n", 444 | "#univariate - categorical analysis\n", 445 | "fig = plt.figure(figsize=[15,7])\n", 446 | "plt.subplot(1, 2, 1)\n", 447 | "\n", 448 | "\n", 449 | "applT1['NAME_INCOME_TYPE'].value_counts().sort_values(ascending=False).plot.bar(color='r')\n", 450 | "plt.title('Target 1')\n", 451 | "\n", 452 | "plt.subplot(1, 2, 2)\n", 453 | "applT0['NAME_INCOME_TYPE'].value_counts().sort_values(ascending=False).plot.bar()\n", 454 | "plt.title('Target 0')\n", 455 | "plt.tight_layout()\n", 456 | "plt.show()" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [ 465 | "#Plot for Education Type\n", 466 | "#univariate - categorical analysis\n", 467 | "fig = plt.figure(figsize=[15,7])\n", 468 | "plt.subplot(1, 2, 1)\n", 469 | "\n", 470 | "\n", 471 | "applT1['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending=False).plot.bar(color='r')\n", 472 | "plt.title('Target 1')\n", 473 | "\n", 474 | "plt.subplot(1,
2, 2)\n", 475 | "applT0['NAME_EDUCATION_TYPE'].value_counts().sort_values(ascending=False).plot.bar()\n", 476 | "plt.title('Target 0')\n", 477 | "plt.tight_layout()\n", 478 | "plt.show()" 479 | ] 480 | }, 481 | { 482 | "cell_type": "code", 483 | "execution_count": null, 484 | "metadata": {}, 485 | "outputs": [], 486 | "source": [ 487 | "#Plot for family status\n", 488 | "#univariate - categorical analysis\n", 489 | "fig = plt.figure(figsize=[15,7])\n", 490 | "plt.subplot(1, 2, 1)\n", 491 | "\n", 492 | "\n", 493 | "applT1['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending=False).plot.bar(color='r')\n", 494 | "plt.title('Target 1')\n", 495 | "\n", 496 | "plt.subplot(1, 2, 2)\n", 497 | "applT0['NAME_FAMILY_STATUS'].value_counts().sort_values(ascending=False).plot.bar()\n", 498 | "plt.title('Target 0')\n", 499 | "plt.tight_layout()\n", 500 | "plt.show()" 501 | ] 502 | }, 503 | { 504 | "cell_type": "code", 505 | "execution_count": null, 506 | "metadata": {}, 507 | "outputs": [], 508 | "source": [ 509 | "#Population composition (Gender - univariate analysis)\n", 510 | "fig = plt.figure(figsize=[15,7])\n", 511 | "plt.subplot(1, 2, 1)\n", 512 | "\n", 513 | "\n", 514 | "#applT1['CODE_GENDER'].value_counts().sort_values(ascending=False).plot.bar(color='r')\n", 515 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 516 | "plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True), autopct = '%1.1f%%', labels= applT1['CODE_GENDER'].value_counts().index )\n", 517 | "plt.title('Target 1')\n", 518 | "\n", 519 | "plt.subplot(1, 2, 2)\n", 520 | "plt.pie(applT0['CODE_GENDER'].value_counts(normalize=True), autopct = '%1.1f%%', labels= applT0['CODE_GENDER'].value_counts().index )\n", 521 | "\n", 522 | "plt.title('Target 0')\n", 523 | "plt.tight_layout()\n", 524 | "plt.show()" 525 | ] 526 | }, 527 | { 528 | "cell_type": "code", 529 | "execution_count": null, 530 | "metadata": {}, 531 | "outputs": [], 532 | "source": [ 533 | "#univariate analysis of
Annuity (distplot or boxplots)\n", 534 | "\n", 535 | "\n", 536 | "fig = plt.figure(figsize=[15,7])\n", 537 | "plt.subplot(1, 2, 1)\n", 538 | "\n", 539 | "\n", 540 | "sbn.distplot(applT1['AMT_ANNUITY'], color='r')\n", 541 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 542 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 543 | "plt.title('Target 1')\n", 544 | "\n", 545 | "plt.subplot(1, 2, 2)\n", 546 | "sbn.distplot(applT0['AMT_ANNUITY'])\n", 547 | "plt.title('Target 0')\n", 548 | "plt.tight_layout()\n", 549 | "plt.show()\n", 550 | "\n", 551 | "\n", 552 | "fig = plt.figure(figsize=[15,7])\n", 553 | "plt.subplot(1, 2, 1)\n", 554 | "\n", 555 | "\n", 556 | "sbn.boxplot(applT1['AMT_ANNUITY'], color='r')\n", 557 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 558 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 559 | "plt.title('Target 1')\n", 560 | "\n", 561 | "plt.subplot(1, 2, 2)\n", 562 | "sbn.boxplot(applT0['AMT_ANNUITY'])\n", 563 | "plt.title('Target 0')\n", 564 | "plt.tight_layout()\n", 565 | "plt.show()\n", 566 | "\n", 567 | "\n" 568 | ] 569 | }, 570 | { 571 | "cell_type": "code", 572 | "execution_count": null, 573 | "metadata": {}, 574 | "outputs": [], 575 | "source": [ 576 | "#density plot for income total by gender in current application\n", 577 | "#univariate - continuous analysis\n", 578 | "\n", 579 | "fig = plt.figure(figsize=[15,7])\n", 580 | "plt.subplot(1, 2, 1)\n", 581 | "\n", 582 | "\n", 583 | "for i in applT1['CODE_GENDER'].unique():\n", 584 | "    subset=applT1[applT1['CODE_GENDER']==i]\n", 585 | "    sbn.distplot(subset['AMT_INCOME_TOTAL'],hist=False, label=i)\n", 586 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 587 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 588 | "plt.title('Target 1')\n", 589 | "\n", 590 | "\n", 591 | "\n", 592 | "\n", 593 | "plt.subplot(1, 2, 2)\n", 594 | "\n", 595 | "for i in
applT0['CODE_GENDER'].unique():\n", 596 | "    subset=applT0[applT0['CODE_GENDER']==i]\n", 597 | "    sbn.distplot(subset['AMT_INCOME_TOTAL'],hist=False, label=i)\n", 598 | "\n", 599 | "\n", 600 | "\n", 601 | "plt.title('Target 0')\n", 602 | "plt.tight_layout()\n", 603 | "plt.show()\n", 604 | "\n", 605 | "\n" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "#density plot for credit amount in current application\n", 615 | "#univariate - continuous analysis\n", 616 | "\n", 617 | "fig = plt.figure(figsize=[15,7])\n", 618 | "plt.subplot(1, 2, 1)\n", 619 | "\n", 620 | "\n", 621 | "sbn.distplot(applT1['AMT_CREDIT'], color='r')\n", 622 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 623 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 624 | "plt.title('Target 1')\n", 625 | "\n", 626 | "plt.subplot(1, 2, 2)\n", 627 | "sbn.distplot(applT0['AMT_CREDIT'])\n", 628 | "plt.title('Target 0')\n", 629 | "plt.tight_layout()\n", 630 | "plt.show()\n", 631 | "\n" 632 | ] 633 | }, 634 | { 635 | "cell_type": "code", 636 | "execution_count": null, 637 | "metadata": {}, 638 | "outputs": [], 639 | "source": [ 640 | "#density plot for goods price in current application\n", 641 | "#univariate - continuous analysis\n", 642 | "fig = plt.figure(figsize=[15,7])\n", 643 | "plt.subplot(1, 2, 1)\n", 644 | "sbn.distplot(applT1['AMT_GOODS_PRICE'], color='r')\n", 645 | "plt.title('Target 1')\n", 646 | "\n", 647 | "plt.subplot(1, 2, 2)\n", 648 | "sbn.distplot(applT0['AMT_GOODS_PRICE'])\n", 649 | "plt.title('Target 0')\n", 650 | "plt.tight_layout()\n", 651 | "plt.show()\n", 652 | "\n" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": null, 658 | "metadata": {}, 659 | "outputs": [], 660 | "source": [ 661 | "#Filling NA occupation type with unknown\n", 662 |
"applT1['OCCUPATION_TYPE']=applT1['OCCUPATION_TYPE'].fillna('Unknown')\n", 663 | "applT0['OCCUPATION_TYPE']=applT0['OCCUPATION_TYPE'].fillna('Unknown')" 664 | ] 665 | }, 666 | { 667 | "cell_type": "code", 668 | "execution_count": null, 669 | "metadata": {}, 670 | "outputs": [], 671 | "source": [ 672 | "#bivariate analysis of AMT_CREDIT and Occupation Type\n", 673 | "#categorical- continuous\n", 674 | "\n", 675 | "plt.figure(figsize=[15,15])\n", 676 | "sbn.set(style=\"whitegrid\", color_codes=True)\n", 677 | "plt.subplot(2, 2, 1)\n", 678 | "\n", 679 | "\n", 680 | "#chart= sbn.barplot(y=applT1['AMT_ANNUITY'],x=applT1['OCCUPATION_TYPE'],palette=\"Blues\", )\n", 681 | "#chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 682 | "\n", 683 | "m=applT1.groupby('OCCUPATION_TYPE').sum()['AMT_CREDIT'].sort_values(ascending=False).reset_index()\n", 684 | "chart=sbn.barplot(y=m['AMT_CREDIT'],x=m['OCCUPATION_TYPE'], palette='Greens_d')\n", 685 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 686 | "\n", 687 | "plt.yscale('log')\n", 688 | "plt.title('Target 1')\n", 689 | "\n", 690 | "\n", 691 | "plt.subplot(2, 2, 2)\n", 692 | "\n", 693 | "\n", 694 | "\n", 695 | "n=applT0.groupby('OCCUPATION_TYPE').sum()['AMT_CREDIT'].sort_values(ascending=False).reset_index()\n", 696 | "chart=sbn.barplot(y=n['AMT_CREDIT'],x=n['OCCUPATION_TYPE'], palette='GnBu_d')\n", 697 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 698 | "\n", 699 | "plt.yscale('log')\n", 700 | "plt.title('Target 0')\n" 701 | ] 702 | }, 703 | { 704 | "cell_type": "code", 705 | "execution_count": null, 706 | "metadata": {}, 707 | "outputs": [], 708 | "source": [ 709 | "#bivariate analysis of Annuity and Occupation Type\n", 710 | "#categorical- continuous\n", 711 | "\n", 712 | "plt.figure(figsize=[15,15])\n", 713 | "sbn.set(style=\"whitegrid\", color_codes=True)\n", 714 | "\n", 715 | 
"plt.subplot(2, 2, 1)\n", 716 | "\n", 717 | "\n", 718 | "#chart= sbn.barplot(y=applT1['AMT_ANNUITY'],x=applT1['OCCUPATION_TYPE'],palette=\"Blues\", )\n", 719 | "#chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 720 | "\n", 721 | "m=applT1.groupby('OCCUPATION_TYPE').sum()['AMT_ANNUITY'].sort_values(ascending=False).reset_index()\n", 722 | "chart=sbn.barplot(y=m['AMT_ANNUITY'],x=m['OCCUPATION_TYPE'], palette='Greens_d')\n", 723 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 724 | "plt.yscale('log')\n", 725 | "\n", 726 | "plt.title('Target 1')\n", 727 | "\n", 728 | "\n", 729 | "plt.subplot(2, 2, 2)\n", 730 | "\n", 731 | "\n", 732 | "\n", 733 | "n=applT0.groupby('OCCUPATION_TYPE').sum()['AMT_ANNUITY'].sort_values(ascending=False).reset_index()\n", 734 | "chart=sbn.barplot(y=n['AMT_ANNUITY'],x=n['OCCUPATION_TYPE'], palette='GnBu_d')\n", 735 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 736 | "\n", 737 | "plt.yscale('log')\n", 738 | "plt.title('Target 0')" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "#bivariate analysis of INCOME and GENDER\n", 748 | "#categorical- continuous\n", 749 | "\n", 750 | "plt.figure(figsize=[15,15])\n", 751 | "sbn.set(style=\"whitegrid\", color_codes=True)\n", 752 | "\n", 753 | "plt.subplot(2, 2, 1)\n", 754 | "\n", 755 | "\n", 756 | "#chart= sbn.barplot(y=applT1['AMT_ANNUITY'],x=applT1['OCCUPATION_TYPE'],palette=\"Blues\", )\n", 757 | "#chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 758 | "\n", 759 | "m=applT1.groupby('NAME_FAMILY_STATUS').sum()['AMT_INCOME_TOTAL'].sort_values(ascending=False).reset_index()\n", 760 | "chart=sbn.barplot(y=m['AMT_INCOME_TOTAL'],x=m['NAME_FAMILY_STATUS'], palette='Greens_d')\n", 761 | 
"chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 762 | "plt.yscale('log')\n", 763 | "\n", 764 | "plt.title('Target 1')\n", 765 | "\n", 766 | "\n", 767 | "plt.subplot(2, 2, 2)\n", 768 | "\n", 769 | "\n", 770 | "\n", 771 | "n=applT0.groupby('NAME_FAMILY_STATUS').sum()['AMT_INCOME_TOTAL'].sort_values(ascending=False).reset_index()\n", 772 | "chart=sbn.barplot(y=n['AMT_INCOME_TOTAL'],x=n['NAME_FAMILY_STATUS'], palette='GnBu_d')\n", 773 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 774 | "\n", 775 | "plt.yscale('log')\n", 776 | "plt.title('Target 0')" 777 | ] 778 | }, 779 | { 780 | "cell_type": "code", 781 | "execution_count": null, 782 | "metadata": {}, 783 | "outputs": [], 784 | "source": [ 785 | "#Bivariate Numerical analysis\n", 786 | "fig = plt.figure(figsize=[15,7])\n", 787 | "plt.subplot(1, 2, 1)\n", 788 | "\n", 789 | "\n", 790 | "sbn.scatterplot(x='AMT_INCOME_TOTAL', y='AMT_ANNUITY', data=applT1, color='r')\n", 791 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 792 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 793 | "plt.title('Target 1')\n", 794 | "\n", 795 | "plt.subplot(1, 2, 2)\n", 796 | "sbn.scatterplot(x='AMT_INCOME_TOTAL', y='AMT_ANNUITY', data=applT0, color='b')\n", 797 | "plt.title('Target 0')\n", 798 | "plt.show()\n", 799 | "plt.tight_layout(fig)\n", 800 | "\n", 801 | "\n" 802 | ] 803 | }, 804 | { 805 | "cell_type": "code", 806 | "execution_count": null, 807 | "metadata": {}, 808 | "outputs": [], 809 | "source": [ 810 | "fig = plt.figure(figsize=[15,7])\n", 811 | "plt.subplot(1, 2, 1)\n", 812 | "\n", 813 | "\n", 814 | "sbn.scatterplot(x='AMT_INCOME_TOTAL', y='AMT_CREDIT', data=applT1, color='r')\n", 815 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 816 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 817 | "plt.title('Target 1')\n", 818 | "\n", 819 | "plt.subplot(1, 2, 2)\n", 
820 | "sbn.scatterplot(x='AMT_INCOME_TOTAL', y='AMT_CREDIT', data=applT0, color='b')\n", 821 | "plt.title('Target 0')\n", 822 | "plt.tight_layout()\n", 823 | "plt.show()\n", 824 | "\n" 825 | ] 826 | }, 827 | { 828 | "cell_type": "code", 829 | "execution_count": null, 830 | "metadata": {}, 831 | "outputs": [], 832 | "source": [ 833 | "\n", 834 | "fig = plt.figure(figsize=[15,7])\n", 835 | "plt.subplot(1, 2, 1)\n", 836 | "\n", 837 | "\n", 838 | "sbn.scatterplot(x='AMT_ANNUITY', y='AMT_CREDIT', data=applT1, color='r')\n", 839 | "#print(applT1['CODE_GENDER'].value_counts(normalize=True))\n", 840 | "#plt.pie(applT1['CODE_GENDER'].value_counts(normalize=True) )\n", 841 | "plt.title('Target 1')\n", 842 | "\n", 843 | "plt.subplot(1, 2, 2)\n", 844 | "sbn.scatterplot(x='AMT_ANNUITY', y='AMT_CREDIT', data=applT0, color='b')\n", 845 | "plt.title('Target 0')\n", 846 | "plt.tight_layout()\n", 847 | "plt.show()\n", 848 | "\n" 849 | ] 850 | }, 851 | { 852 | "cell_type": "code", 853 | "execution_count": null, 854 | "metadata": {}, 855 | "outputs": [], 856 | "source": [ 857 | "#correlation for continuous - continuous " 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": null, 863 | "metadata": {}, 864 | "outputs": [], 865 | "source": [ 866 | "#changing the data type of columns\n", 867 | "cols=['SK_ID_CURR','TARGET','FLAG_MOBIL','FLAG_PHONE','FLAG_EMAIL','AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',\n", 868 | " 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR','REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',\n", 869 | " 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'FLAG_EMP_PHONE',\n", 870 | " 'FLAG_WORK_PHONE','FLAG_CONT_MOBILE','REGION_POPULATION_RELATIVE']\n", 871 | "\n", 872 | "for i in cols:\n", 873 | " applT1[i]=applT1[i].astype('str')\n", 874 | " applT0[i]=applT0[i].astype('str')" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | 
"execution_count": null, 880 | "metadata": {}, 881 | "outputs": [], 882 | "source": [ 883 | "corr_matrixT1=abs(applT1.corr())\n", 884 | "corr_matrixT0=abs(applT0.corr())" 885 | ] 886 | }, 887 | { 888 | "cell_type": "code", 889 | "execution_count": null, 890 | "metadata": {}, 891 | "outputs": [], 892 | "source": [ 893 | "plt.figure(figsize=(20,20))\n", 894 | "\n", 895 | "plt.subplot(2,1,1)\n", 896 | "chart= sbn.heatmap(corr_matrixT1,annot=True)\n", 897 | "plt.title('Target1')\n", 898 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 899 | "\n", 900 | "plt.subplot(2,1,2)\n", 901 | "chart=sbn.heatmap(corr_matrixT0,annot=True)\n", 902 | "chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 903 | "plt.title ('Target0')" 904 | ] 905 | }, 906 | { 907 | "cell_type": "code", 908 | "execution_count": null, 909 | "metadata": {}, 910 | "outputs": [], 911 | "source": [ 912 | "corr_matrixT1" 913 | ] 914 | }, 915 | { 916 | "cell_type": "code", 917 | "execution_count": null, 918 | "metadata": { 919 | "scrolled": true 920 | }, 921 | "outputs": [], 922 | "source": [ 923 | "corrT1=corr_matrixT1.where(np.triu(np.ones(corr_matrixT1.shape),k=1).astype(bool))\n", 924 | "corrT1=corrT1.unstack().reset_index()\n", 925 | "corrT1.columns= ['VAR1','VAR2','CORRELATION']\n", 926 | "corrT1=corrT1.dropna(subset=['CORRELATION'])\n", 927 | "corrT1.sort_values('CORRELATION', ascending=False,inplace=True)\n", 928 | "corrT1.head(10)" 929 | ] 930 | }, 931 | { 932 | "cell_type": "code", 933 | "execution_count": null, 934 | "metadata": {}, 935 | "outputs": [], 936 | "source": [ 937 | "corrT0=corr_matrixT0.where(np.triu(np.ones(corr_matrixT0.shape),k=1).astype(bool))\n", 938 | "corrT0=corrT0.unstack().reset_index()\n", 939 | "corrT0.columns= ['VAR1','VAR2','CORRELATION']\n", 940 | "corrT0=corrT0.dropna(subset=['CORRELATION'])\n", 941 | "corrT0.sort_values('CORRELATION', ascending=False,inplace=True)\n", 942 | 
"corrT0.head(10)\n" 943 | ] 944 | }, 945 | { 946 | "cell_type": "markdown", 947 | "metadata": {}, 948 | "source": [ 949 | "`comments on correlation`\n", 950 | "\n", 951 | "After comparing the two data sets for Target = 1 and Target = 0, we find that the sets of correlated columns are same for both the datasets.\n", 952 | "AMT_GOODS_PRICE is highly correlated to AMT_CREDIT in both the datasets. Similarly CNT_FAM_MEMBERS is highly correlated to CNT_CHILDREN and so on.\n" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": null, 958 | "metadata": {}, 959 | "outputs": [], 960 | "source": [ 961 | "# previous application\n", 962 | "\n", 963 | "previous=pd.read_csv('/Users/suraajhasija/Desktop/Python-Basics/EDA Case Study/previous_application.csv')" 964 | ] 965 | }, 966 | { 967 | "cell_type": "code", 968 | "execution_count": null, 969 | "metadata": {}, 970 | "outputs": [], 971 | "source": [ 972 | "previous.head()" 973 | ] 974 | }, 975 | { 976 | "cell_type": "code", 977 | "execution_count": null, 978 | "metadata": {}, 979 | "outputs": [], 980 | "source": [ 981 | "previous.shape" 982 | ] 983 | }, 984 | { 985 | "cell_type": "code", 986 | "execution_count": null, 987 | "metadata": { 988 | "scrolled": true 989 | }, 990 | "outputs": [], 991 | "source": [ 992 | "previous.dtypes" 993 | ] 994 | }, 995 | { 996 | "cell_type": "code", 997 | "execution_count": null, 998 | "metadata": { 999 | "scrolled": true 1000 | }, 1001 | "outputs": [], 1002 | "source": [ 1003 | "null_check=previous.isnull().sum()/len(previous.index)*100\n", 1004 | "null_check" 1005 | ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": null, 1010 | "metadata": {}, 1011 | "outputs": [], 1012 | "source": [ 1013 | "#columns to drop here\n", 1014 | "col_to_drop=null_check[null_check >50].index" 1015 | ] 1016 | }, 1017 | { 1018 | "cell_type": "code", 1019 | "execution_count": null, 1020 | "metadata": {}, 1021 | "outputs": [], 1022 | "source": [ 1023 | 
"previous.drop(columns=col_to_drop,inplace=True)" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "code", 1028 | "execution_count": null, 1029 | "metadata": {}, 1030 | "outputs": [], 1031 | "source": [ 1032 | "#merging of current application with previous application keeping data of current applicants only \n", 1033 | "previous['SK_ID_CURR']=previous['SK_ID_CURR'].astype('str')\n", 1034 | "appT1previous=applT1.merge(previous,on='SK_ID_CURR', how='left' )\n", 1035 | "appT0previous=applT0.merge(previous,on='SK_ID_CURR', how='left' )" 1036 | ] 1037 | }, 1038 | { 1039 | "cell_type": "code", 1040 | "execution_count": null, 1041 | "metadata": {}, 1042 | "outputs": [], 1043 | "source": [ 1044 | "appT1previous.head()" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [ 1053 | "#since no null in previous application NAME_CONTRACT_STATUS. We can assume that where ever we have na,\n", 1054 | "#means no previous record found\n", 1055 | "appT1previous.NAME_CONTRACT_STATUS=appT1previous.NAME_CONTRACT_STATUS.fillna('No History')\n", 1056 | "appT0previous.NAME_CONTRACT_STATUS=appT0previous.NAME_CONTRACT_STATUS.fillna('No History')" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": null, 1062 | "metadata": {}, 1063 | "outputs": [], 1064 | "source": [ 1065 | "#Customers in current application with previous application contract/loan status \n", 1066 | "fig=plt.figure(figsize=(12,12))\n", 1067 | "\n", 1068 | "plt.subplot(1,2,1)\n", 1069 | "bbc=sbn.countplot(x='NAME_CONTRACT_STATUS', data=appT1previous)\n", 1070 | "bbc.set_xticklabels(bbc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1071 | "plt.title('Target 1')\n", 1072 | "\n", 1073 | "plt.subplot(1,2,2)\n", 1074 | "ddc=sbn.countplot(x='NAME_CONTRACT_STATUS', data=appT0previous)\n", 1075 | "ddc.set_xticklabels(ddc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1076 | 
"\n", 1077 | "plt.title('Target 0')\n", 1078 | "plt.show()\n", 1079 | "\n" 1080 | ] 1081 | }, 1082 | { 1083 | "cell_type": "code", 1084 | "execution_count": null, 1085 | "metadata": {}, 1086 | "outputs": [], 1087 | "source": [ 1088 | "#previous application contract status and median credit amount by gender\n", 1089 | "\n", 1090 | "fig=plt.figure(figsize=(13,13))\n", 1091 | "\n", 1092 | "plt.subplot(1,2,1)\n", 1093 | "bbc=sbn.barplot(x='NAME_CONTRACT_STATUS',y='AMT_CREDIT_y', estimator=np.median,hue='CODE_GENDER', data=appT1previous)\n", 1094 | "bbc.set_xticklabels(bbc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1095 | "plt.title('Target 1')\n", 1096 | "\n", 1097 | "plt.subplot(1,2,2)\n", 1098 | "ddc=sbn.barplot(x='NAME_CONTRACT_STATUS',y='AMT_CREDIT_y', estimator=np.median, hue='CODE_GENDER', data=appT0previous)\n", 1099 | "ddc.set_xticklabels(ddc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1100 | "\n", 1101 | "plt.title('Target 0')\n", 1102 | "plt.show()\n" 1103 | ] 1104 | }, 1105 | { 1106 | "cell_type": "code", 1107 | "execution_count": null, 1108 | "metadata": {}, 1109 | "outputs": [], 1110 | "source": [ 1111 | "#validating the chart data\n", 1112 | "appT1previous.groupby('NAME_CONTRACT_STATUS').aggregate(np.median)['AMT_CREDIT_y']" 1113 | ] 1114 | }, 1115 | { 1116 | "cell_type": "code", 1117 | "execution_count": null, 1118 | "metadata": {}, 1119 | "outputs": [], 1120 | "source": [ 1121 | "#checking the cause of rejection in previous applications\n", 1122 | "fig=plt.figure(figsize=(6,5))\n", 1123 | "\n", 1124 | "plt.subplot(1,2,1)\n", 1125 | "subset=appT1previous[appT1previous['NAME_CONTRACT_STATUS']=='Refused']\n", 1126 | "appT1previous.CODE_REJECT_REASON.value_counts(normalize=True).plot.bar(color='blue')\n", 1127 | "plt.title('Target 1')\n", 1128 | "\n", 1129 | "plt.subplot(1,2,2)\n", 1130 | "subset=appT0previous[appT0previous['NAME_CONTRACT_STATUS']=='Refused']\n", 1131 | "\n", 1132 | 
"appT0previous.CODE_REJECT_REASON.value_counts(normalize=True).plot.bar(color='red')\n", 1133 | "plt.title('Target 0')\n" 1134 | ] 1135 | }, 1136 | { 1137 | "cell_type": "code", 1138 | "execution_count": null, 1139 | "metadata": {}, 1140 | "outputs": [], 1141 | "source": [ 1142 | "#bivariate analysis - checking the variation between credit amount and the amount for which the client initially applied in the previous application\n", 1143 | "\n", 1144 | "plt.figure(figsize=(12,10))\n", 1145 | "plt.subplot(1,2,1)\n", 1146 | "bbc=sbn.scatterplot(x='AMT_APPLICATION',y='AMT_CREDIT_y', data=appT1previous)\n", 1147 | "bbc.set_xticklabels(bbc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1148 | "plt.title('Target 1')\n", 1149 | "\n", 1150 | "plt.subplot(1,2,2)\n", 1151 | "ddc=sbn.scatterplot(x='AMT_APPLICATION',y='AMT_CREDIT_y', data=appT0previous)\n", 1152 | "ddc.set_xticklabels(ddc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1153 | "plt.title('Target 0')\n", 1154 | "plt.show()" 1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "code", 1159 | "execution_count": null, 1160 | "metadata": {}, 1161 | "outputs": [], 1162 | "source": [ 1163 | "#Shift in contract type (from previous application to current)\n", 1164 | "#categorical-categorical bivariate analysis\n", 1165 | "\n", 1166 | "\n", 1167 | "fig=plt.figure(figsize=(12,12))\n", 1168 | "\n", 1169 | "plt.subplot(1,2,1)\n", 1170 | "bbc=sbn.countplot(x='NAME_CONTRACT_TYPE_x',hue='NAME_CONTRACT_TYPE_y', data=appT1previous)\n", 1171 | "bbc.set_xticklabels(bbc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1172 | "plt.title('Target 1')\n", 1173 | "\n", 1174 | "plt.subplot(1,2,2)\n", 1175 | "ddc=sbn.countplot(x='NAME_CONTRACT_TYPE_x',hue='NAME_CONTRACT_TYPE_y', data=appT0previous)\n", 1176 | "ddc.set_xticklabels(ddc.get_xticklabels(), rotation=45, horizontalalignment='right')\n", 1177 | "\n", 1178 | "plt.title('Target 0')\n", 1179 | "plt.show()" 1180 | ] 1181 | }, 1182 | { 1183 
| "cell_type": "markdown", 1184 | "metadata": {}, 1185 | "source": [ 1186 | "Remarks and explanations are in the PPT attached with this notebook." 1187 | ] 1188 | }, 1189 | { 1190 | "cell_type": "markdown", 1191 | "metadata": {}, 1192 | "source": [ 1193 | "### Author: `Suraaj Hasija`\n", 1194 | "### Please share your feedback on: `mailbox.suraaj@gmail.com`" 1195 | ] 1196 | }, 1197 | { 1198 | "cell_type": "code", 1199 | "execution_count": null, 1200 | "metadata": {}, 1201 | "outputs": [], 1202 | "source": [] 1203 | } 1204 | ], 1205 | "metadata": { 1206 | "kernelspec": { 1207 | "display_name": "Python 3", 1208 | "language": "python", 1209 | "name": "python3" 1210 | }, 1211 | "language_info": { 1212 | "codemirror_mode": { 1213 | "name": "ipython", 1214 | "version": 3 1215 | }, 1216 | "file_extension": ".py", 1217 | "mimetype": "text/x-python", 1218 | "name": "python", 1219 | "nbconvert_exporter": "python", 1220 | "pygments_lexer": "ipython3", 1221 | "version": "3.7.3" 1222 | } 1223 | }, 1224 | "nbformat": 4, 1225 | "nbformat_minor": 4 1226 | } 1227 | -------------------------------------------------------------------------------- /Exploratory Data Analysis/4. The Analysis in Detail.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Exploratory Data Analysis/4. The Analysis in Detail.pptx -------------------------------------------------------------------------------- /Logistic Regression (Classification)/1. Problem Statement & Steps.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Logistic Regression (Classification)/1. Problem Statement & Steps.docx -------------------------------------------------------------------------------- /Logistic Regression (Classification)/2. 
Lead Scoring PPT - Suraaj.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Logistic Regression (Classification)/2. Lead Scoring PPT - Suraaj.pdf -------------------------------------------------------------------------------- /Logistic Regression (Classification)/Datasets/Leads Data Dictionary.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Logistic Regression (Classification)/Datasets/Leads Data Dictionary.xlsx -------------------------------------------------------------------------------- /Multiple Linear Regression/1. Steps to be performed (Read First).docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Multiple Linear Regression/1. Steps to be performed (Read First).docx -------------------------------------------------------------------------------- /Multiple Linear Regression/3. Multiple Linear Case Inferences.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Multiple Linear Regression/3. 
Multiple Linear Case Inferences.pdf -------------------------------------------------------------------------------- /Multiple Linear Regression/YuluDulu Columns Description: -------------------------------------------------------------------------------- 1 | ========================================= 2 | Dataset characteristics 3 | ========================================= 4 | YuluDulu Dataset.csv has the following fields: 5 | 6 | - instant: record index 7 | - dteday : date (DD-MM-YYYY) 8 | - season : season (1:spring, 2:summer, 3:fall, 4:winter) 9 | - yr : year (0: 2018, 1:2019) 10 | - mnth : month (1 to 12) 11 | - holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule) 12 | - weekday : day of the week 13 | - workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0. 14 | - weathersit : 15 | - 1: Clear, Few clouds, Partly cloudy 16 | - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist 17 | - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds 18 | - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog 19 | - temp : temperature in Celsius 20 | - atemp: feels-like temperature in Celsius 21 | - hum: humidity 22 | - windspeed: wind speed 23 | - casual: count of casual users 24 | - registered: count of registered users 25 | - cnt: count of total rental bikes including both casual and registered 26 | 27 | -------------------------------------------------------------------------------- /Multiple Linear Regression/YuluDulu Dataset.csv: -------------------------------------------------------------------------------- 1 | instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt 2 | 1,01-01-2018,1,0,1,0,6,0,2,14.110847,18.18125,80.5833,10.749882,331,654,985 3 | 2,02-01-2018,1,0,1,0,0,0,2,14.902598,17.68695,69.6087,16.652113,131,670,801 4 | 
3,03-01-2018,1,0,1,0,1,1,1,8.050924,9.47025,43.7273,16.636703,120,1229,1349 5 | 4,04-01-2018,1,0,1,0,2,1,1,8.2,10.6061,59.0435,10.739832,108,1454,1562 6 | 5,05-01-2018,1,0,1,0,3,1,1,9.305237,11.4635,43.6957,12.5223,82,1518,1600 7 | 6,06-01-2018,1,0,1,0,4,1,1,8.378268,11.66045,51.8261,6.0008684,88,1518,1606 8 | 7,07-01-2018,1,0,1,0,5,1,2,8.057402,10.44195,49.8696,11.304642,148,1362,1510 9 | 8,08-01-2018,1,0,1,0,6,0,2,6.765,8.1127,53.5833,17.875868,68,891,959 10 | 9,09-01-2018,1,0,1,0,0,0,1,5.671653,5.80875,43.4167,24.25065,54,768,822 11 | 10,10-01-2018,1,0,1,0,1,1,1,6.184153,7.5444,48.2917,14.958889,41,1280,1321 12 | 11,11-01-2018,1,0,1,0,2,1,2,6.932731,9.5732,68.6364,8.182844,43,1220,1263 13 | 12,12-01-2018,1,0,1,0,3,1,1,7.081807,8.02365,59.9545,20.410009,25,1137,1162 14 | 13,13-01-2018,1,0,1,0,4,1,1,6.765,7.54415,47.0417,20.167,38,1368,1406 15 | 14,14-01-2018,1,0,1,0,5,1,1,6.59567,9.42065,53.7826,8.478716,54,1367,1421 16 | 15,15-01-2018,1,0,1,0,6,0,2,9.566653,12.4056,49.875,10.583521,222,1026,1248 17 | 16,16-01-2018,1,0,1,0,0,0,1,9.498347,11.71085,48.375,12.625011,251,953,1204 18 | 17,17-01-2018,1,0,1,1,1,0,2,7.209153,8.83855,53.75,12.999139,117,883,1000 19 | 18,18-01-2018,1,0,1,0,2,1,2,8.883347,11.61665,86.1667,9.833925,9,674,683 20 | 19,19-01-2018,1,0,1,0,3,1,2,11.979134,14.9211,74.1739,13.957239,78,1572,1650 21 | 20,20-01-2018,1,0,1,0,4,1,2,10.728347,12.7525,53.8333,13.125568,83,1844,1927 22 | 21,21-01-2018,1,0,1,0,5,1,1,7.2775,7.89165,45.7083,23.667214,75,1468,1543 23 | 22,22-01-2018,1,0,1,0,6,0,1,2.4243464,3.95348,40,11.52199,93,888,981 24 | 23,23-01-2018,1,0,1,0,0,0,1,3.9573897,4.941955,43.6522,16.5222,150,836,986 25 | 24,24-01-2018,1,0,1,0,1,1,1,3.9930433,5.8965,49.1739,10.60811,86,1330,1416 26 | 25,25-01-2018,1,0,1,0,2,1,2,9.162598,11.7263,61.6957,8.696332,186,1799,1985 27 | 26,26-01-2018,1,0,1,0,3,1,3,8.9175,10.18,86.25,19.68795,34,472,506 28 | 27,27-01-2018,1,0,1,0,4,1,1,7.995,10.985,68.75,7.627079,15,416,431 29 | 
28,28-01-2018,1,0,1,0,5,1,2,8.342598,11.16585,79.3043,8.2611,38,1129,1167 30 | 29,29-01-2018,1,0,1,0,6,0,1,8.057402,10.6063,65.1739,9.739455,123,975,1098 31 | 30,30-01-2018,1,0,1,0,0,0,1,8.877402,12.5161,72.2174,4.9568342,140,956,1096 32 | 31,31-01-2018,1,0,1,0,1,1,2,7.414153,9.3125,60.375,12.541864,42,1459,1501 33 | 32,01-02-2018,1,0,2,0,2,1,2,7.879134,11.7265,82.9565,3.565271,47,1313,1360 34 | 33,02-02-2018,1,0,2,0,3,1,2,10.66,12.72085,77.5417,17.708636,72,1454,1526 35 | 34,03-02-2018,1,0,2,0,4,1,1,7.665237,8.8939,43.7826,18.609384,61,1489,1550 36 | 35,04-02-2018,1,0,2,0,5,1,2,8.663464,11.42935,58.5217,8.565213,88,1620,1708 37 | 36,05-02-2018,1,0,2,0,6,0,2,9.566653,12.1529,92.9167,10.792293,100,905,1005 38 | 37,06-02-2018,1,0,2,0,0,0,1,11.719153,14.58355,56.8333,9.5006,354,1269,1623 39 | 38,07-02-2018,1,0,2,0,1,1,1,11.138347,15.1829,73.8333,3.0423561,120,1592,1712 40 | 39,08-02-2018,1,0,2,0,2,1,1,9.054153,9.9123,53.7917,24.25065,64,1466,1530 41 | 40,09-02-2018,1,0,2,0,3,1,2,5.526103,7.21415,49.4783,12.652213,53,1552,1605 42 | 41,10-02-2018,1,0,2,0,4,1,1,5.918268,7.4774,43.7391,14.869645,47,1491,1538 43 | 42,11-02-2018,1,0,2,0,5,1,1,7.752731,10.67545,50.6364,7.27285,149,1597,1746 44 | 43,12-02-2018,1,0,2,0,6,0,1,9.1225,11.6477,54.4167,13.625589,288,1184,1472 45 | 44,13-02-2018,1,0,2,0,0,0,1,12.977402,16.20565,45.7391,17.479161,397,1192,1589 46 | 45,14-02-2018,1,0,2,0,1,1,1,17.015,19.9175,37.5833,27.999836,208,1705,1913 47 | 46,15-02-2018,1,0,2,0,2,1,1,10.909567,12.7137,31.4348,19.522058,140,1675,1815 48 | 47,16-02-2018,1,0,2,0,3,1,1,13.048701,15.81,42.3478,16.869997,218,1897,2115 49 | 48,17-02-2018,1,0,2,0,4,1,1,17.869153,21.4329,50.5,15.416968,259,2216,2475 50 | 49,18-02-2018,1,0,2,0,5,1,1,21.388347,25.59915,51.6667,17.749975,579,2348,2927 51 | 50,19-02-2018,1,0,2,0,6,0,1,16.365847,19.5702,18.7917,34.000021,532,1103,1635 52 | 51,20-02-2018,1,0,2,0,0,0,1,11.693897,13.8665,40.7826,14.956745,639,1173,1812 53 | 
52,21-02-2018,1,0,2,1,1,0,2,12.436653,14.20375,60.5,20.625682,195,912,1107 54 | 53,22-02-2018,1,0,2,0,2,1,1,7.471102,9.30165,57.7778,13.110761,74,1376,1450 55 | 54,23-02-2018,1,0,2,0,3,1,1,9.091299,12.28585,42.3043,6.305571,139,1778,1917 56 | 55,24-02-2018,1,0,2,0,4,1,2,12.121732,14.45955,69.7391,16.783232,100,1707,1807 57 | 56,25-02-2018,1,0,2,0,5,1,2,14.938268,17.52305,71.2174,23.218113,120,1341,1461 58 | 57,26-02-2018,1,0,2,0,6,0,1,11.5825,14.1096,53.7917,12.500257,424,1545,1969 59 | 58,27-02-2018,1,0,2,0,0,0,1,14.082598,17.55545,68,8.391616,694,1708,2402 60 | 59,28-02-2018,1,0,2,0,1,1,2,16.698193,20.0059,87.6364,19.408962,81,1365,1446 61 | 60,01-03-2018,1,0,3,0,2,1,1,10.933347,13.19395,53.5,14.500475,137,1714,1851 62 | 61,02-03-2018,1,0,3,0,3,1,1,13.735,16.00355,44.9583,20.624811,231,1903,2134 63 | 62,03-03-2018,1,0,3,0,4,1,1,8.131653,10.00665,31.8333,15.125518,123,1562,1685 64 | 63,04-03-2018,1,0,3,0,5,1,2,10.728347,12.78395,61.0417,13.624182,214,1730,1944 65 | 64,05-03-2018,1,0,3,0,6,0,2,15.750847,18.93895,78.9167,16.875357,640,1437,2077 66 | 65,06-03-2018,1,0,3,0,0,0,2,15.437402,18.3126,94.8261,23.000229,114,491,605 67 | 66,07-03-2018,1,0,3,0,1,1,1,10.731299,11.92305,55.1304,22.870584,244,1628,1872 68 | 67,08-03-2018,1,0,3,0,2,1,1,11.9925,15.12,42.0833,8.08355,316,1817,2133 69 | 68,09-03-2018,1,0,3,0,3,1,2,12.129153,14.3304,77.5417,14.75005,191,1700,1891 70 | 69,10-03-2018,1,0,3,0,4,1,3,15.952731,19.2834,0,17.545759,46,577,623 71 | 70,11-03-2018,1,0,3,0,5,1,2,12.977402,15.25,64.9565,15.60899,247,1730,1977 72 | 71,12-03-2018,1,0,3,0,6,0,1,13.495847,16.2875,59.4583,14.791925,724,1408,2132 73 | 72,13-03-2018,1,0,3,0,0,0,1,15.758268,19.00455,52.7391,18.130468,982,1435,2417 74 | 73,14-03-2018,1,0,3,0,1,1,1,13.333897,16.6,49.6957,9.174042,359,1687,2046 75 | 74,15-03-2018,1,0,3,0,2,1,2,13.013031,15.9089,65.5652,12.348703,289,1767,2056 76 | 75,16-03-2018,1,0,3,0,3,1,2,14.973897,18.3465,77.6522,13.608839,321,1871,2192 77 | 
76,17-03-2018,1,0,3,0,4,1,1,17.015,20.51665,60.2917,14.041793,424,2320,2744 78 | 77,18-03-2018,1,0,3,0,5,1,1,22.14,26.35045,52.5217,15.478139,884,2355,3239 79 | 78,19-03-2018,1,0,3,0,6,0,1,19.3725,23.32625,37.9167,24.667189,1424,1693,3117 80 | 79,20-03-2018,1,0,3,0,0,0,1,13.6325,16.2875,47.375,13.917307,1047,1424,2471 81 | 80,21-03-2018,2,0,3,0,1,1,2,17.647835,20.48675,73.7391,19.348461,401,1676,2077 82 | 81,22-03-2018,2,0,3,0,2,1,1,18.108347,22.0321,62.4583,15.12525,460,2243,2703 83 | 82,23-03-2018,2,0,3,0,3,1,2,14.225237,16.89695,83.9565,15.695487,203,1918,2121 84 | 83,24-03-2018,2,0,3,0,4,1,2,11.685,13.54165,80.5833,16.333729,166,1699,1865 85 | 84,25-03-2018,2,0,3,0,5,1,1,10.830847,12.8156,49.5,15.458575,300,1910,2210 86 | 85,26-03-2018,2,0,3,0,6,0,1,10.899153,12.87855,39.4167,14.041257,981,1515,2496 87 | 86,27-03-2018,2,0,3,0,0,0,2,10.374763,12.51695,49.3913,12.3481,472,1221,1693 88 | 87,28-03-2018,2,0,3,0,1,1,1,10.838268,12.8787,30.2174,14.217668,222,1806,2028 89 | 88,29-03-2018,2,0,3,0,2,1,1,12.4025,14.6454,31.4167,15.208732,317,2108,2425 90 | 89,30-03-2018,2,0,3,0,3,1,2,12.3,14.8675,64.6667,11.583496,168,1368,1536 91 | 90,31-03-2018,2,0,3,0,4,1,3,11.001653,12.87875,91.8333,14.582282,179,1506,1685 92 | 91,01-04-2018,2,0,4,0,5,1,2,12.3,14.1727,68.625,17.333436,307,1920,2227 93 | 92,02-04-2018,2,0,4,0,6,0,2,12.915,15.78185,65.375,13.208782,898,1354,2252 94 | 93,03-04-2018,2,0,4,0,0,0,1,15.511653,18.93835,48,12.208271,1651,1598,3249 95 | 94,04-04-2018,2,0,4,0,1,1,1,23.506653,27.14645,42.625,25.833257,734,2381,3115 96 | 95,05-04-2018,2,0,4,0,2,1,2,16.980847,19.9175,64.2083,26.000489,167,1628,1795 97 | 96,06-04-2018,2,0,4,0,3,1,1,16.024153,19.3804,47.0833,17.625221,413,2395,2808 98 | 97,07-04-2018,2,0,4,0,4,1,1,17.9375,21.6848,60.2917,10.874904,571,2570,3141 99 | 98,08-04-2018,2,0,4,0,5,1,2,13.769153,16.22395,83.625,15.208464,172,1299,1471 100 | 99,09-04-2018,2,0,4,0,6,0,2,14.0425,17.07645,87.75,8.916561,879,1576,2455 101 | 
100,10-04-2018,2,0,4,0,0,0,2,17.493347,21.33685,85.75,9.833389,1188,1707,2895 102 | 101,11-04-2018,2,0,4,0,1,1,2,24.421732,28.26085,71.6956,21.739758,855,2493,3348 103 | 102,12-04-2018,2,0,4,0,2,1,2,20.6025,24.6527,73.9167,18.416893,257,1777,2034 104 | 103,13-04-2018,2,0,4,0,3,1,2,16.9125,20.86415,81.9167,16.791339,209,1953,2162 105 | 104,14-04-2018,2,0,4,0,4,1,1,19.1675,23.1371,54.0417,7.4169,529,2738,3267 106 | 105,15-04-2018,2,0,4,1,5,0,1,18.313347,22.09565,67.125,15.167125,642,2484,3126 107 | 106,16-04-2018,2,0,4,0,6,0,3,17.664153,21.2746,88.8333,22.834136,121,674,795 108 | 107,17-04-2018,2,0,4,0,0,0,1,18.723347,22.2848,47.9583,20.334232,1558,2186,3744 109 | 108,18-04-2018,2,0,4,0,1,1,1,21.0125,25.1573,54.25,10.958989,669,2760,3429 110 | 109,19-04-2018,2,0,4,0,2,1,2,20.739153,24.4629,66.5833,10.584057,409,2795,3204 111 | 110,20-04-2018,2,0,4,0,3,1,1,24.395,28.2196,61.4167,16.208975,613,3331,3944 112 | 111,21-04-2018,2,0,4,0,4,1,1,18.825847,22.6946,40.7083,21.792286,745,3444,4189 113 | 112,22-04-2018,2,0,4,0,5,1,2,13.803347,16.0977,72.9583,14.707907,177,1506,1683 114 | 113,23-04-2018,2,0,4,0,6,0,2,18.86,22.50605,88.7917,15.458575,1462,2574,4036 115 | 114,24-04-2018,2,0,4,0,0,0,2,23.848347,27.58815,81.0833,12.875725,1710,2481,4191 116 | 115,25-04-2018,2,0,4,0,1,1,1,24.873347,28.725,77.6667,12.417311,773,3300,4073 117 | 116,26-04-2018,2,0,4,0,2,1,1,25.898347,29.70415,72.9167,21.8755,678,3722,4400 118 | 117,27-04-2018,2,0,4,0,3,1,2,25.42,28.7571,83.5417,20.9174,547,3325,3872 119 | 118,28-04-2018,2,0,4,0,4,1,2,25.3175,28.94645,70.0833,21.500836,569,3489,4058 120 | 119,29-04-2018,2,0,4,0,5,1,1,20.91,24.87315,45.7083,16.084221,878,3717,4595 121 | 120,30-04-2018,2,0,4,0,6,0,1,19.3725,23.20105,50.3333,15.750025,1965,3347,5312 122 | 121,01-05-2018,2,0,5,0,0,0,2,18.518347,22.4102,76.2083,7.125718,1138,2213,3351 123 | 122,02-05-2018,2,0,5,0,1,1,2,22.515847,26.64165,73,12.291418,847,3554,4401 124 | 
123,03-05-2018,2,0,5,0,2,1,2,25.283347,29.10395,69.7083,22.958689,603,3848,4451 125 | 124,04-05-2018,2,0,5,0,3,1,2,16.980847,20.2325,73.7083,22.042732,255,2378,2633 126 | 125,05-05-2018,2,0,5,0,4,1,1,18.825847,22.09585,44.4167,19.791264,614,3819,4433 127 | 126,06-05-2018,2,0,5,0,5,1,1,19.645847,23.70585,59,15.292482,894,3714,4608 128 | 127,07-05-2018,2,0,5,0,6,0,1,21.32,25.63105,54.125,10.75015,1612,3102,4714 129 | 128,08-05-2018,2,0,5,0,0,0,1,21.661653,25.94665,63.1667,5.0007125,1401,2932,4333 130 | 129,09-05-2018,2,0,5,0,1,1,1,21.8325,26.2623,58.875,11.792,664,3698,4362 131 | 130,10-05-2018,2,0,5,0,2,1,1,21.8325,26.13605,48.9167,7.749957,694,4109,4803 132 | 131,11-05-2018,2,0,5,0,3,1,1,22.2425,26.42,63.2917,8.083014,550,3632,4182 133 | 132,12-05-2018,2,0,5,0,4,1,1,21.935,26.16815,74.75,12.707689,695,4169,4864 134 | 133,13-05-2018,2,0,5,0,5,1,2,21.0125,24.715,86.3333,12.041575,692,3413,4105 135 | 134,14-05-2018,2,0,5,0,6,0,2,21.354153,25.03145,92.25,9.04165,902,2507,3409 136 | 135,15-05-2018,2,0,5,0,0,0,2,23.0625,26.8,86.7083,10.249593,1582,2971,4553 137 | 136,16-05-2018,2,0,5,0,1,1,1,23.6775,27.5256,78.7917,8.500357,773,3185,3958 138 | 137,17-05-2018,2,0,5,0,2,1,2,23.028347,26.92645,83.7917,18.582718,678,3445,4123 139 | 138,18-05-2018,2,0,5,0,3,1,2,22.55,26.3579,87,13.499964,536,3319,3855 140 | 139,19-05-2018,2,0,5,0,4,1,2,21.764153,25.5371,82.9583,7.250271,735,3840,4575 141 | 140,20-05-2018,2,0,5,0,5,1,1,22.003347,26.4521,71.9583,8.375871,909,4008,4917 142 | 141,21-05-2018,2,0,5,0,6,0,1,24.7025,28.59875,62.6667,8.08355,2258,3547,5805 143 | 142,22-05-2018,2,0,5,0,0,0,1,24.770847,28.725,74.9583,9.916536,1576,3084,4660 144 | 143,23-05-2018,2,0,5,0,1,1,2,25.898347,29.5148,81,15.667414,836,3438,4274 145 | 144,24-05-2018,2,0,5,0,2,1,2,27.06,30.24065,74.0833,13.875164,659,3833,4492 146 | 145,25-05-2018,2,0,5,0,3,1,1,27.094153,30.7771,69.625,10.333611,740,4238,4978 147 | 146,26-05-2018,2,0,5,0,4,1,1,29.041653,32.7344,67.75,13.376014,758,3919,4677 148 | 
147,27-05-2018,2,0,5,0,5,1,1,27.948347,31.8504,65.375,16.125493,871,3808,4679 149 | 148,28-05-2018,2,0,5,0,6,0,1,26.889153,30.61895,72.9583,15.416164,2001,2757,4758 150 | 149,29-05-2018,2,0,5,0,0,0,1,27.3675,30.7775,81.875,14.333846,2355,2433,4788 151 | 150,30-05-2018,2,0,5,1,1,0,1,30.066653,33.5546,68.5,8.792075,1549,2549,4098 152 | 151,31-05-2018,2,0,5,0,2,1,1,31.775,36.26915,63.6667,7.459043,673,3309,3982 153 | 152,01-06-2018,2,0,6,0,3,1,2,31.330847,36.04835,67.7083,13.875164,513,3461,3974 154 | 153,02-06-2018,2,0,6,0,4,1,1,29.315,32.1971,30.5,19.583229,736,4232,4968 155 | 154,03-06-2018,2,0,6,0,5,1,1,25.42,29.35665,35.4167,16.959107,898,4414,5312 156 | 155,04-06-2018,2,0,6,0,6,0,1,26.035,29.7348,45.625,8.250514,1869,3473,5342 157 | 156,05-06-2018,2,0,6,0,0,0,2,26.581653,30.8402,65.25,9.292364,1685,3221,4906 158 | 157,06-06-2018,2,0,6,0,1,1,1,27.811653,31.0929,60,8.167032,673,3875,4548 159 | 158,07-06-2018,2,0,6,0,2,1,1,29.0075,32.7975,59.7917,12.583136,763,4070,4833 160 | 159,08-06-2018,2,0,6,0,3,1,1,31.809153,36.36395,62.2083,9.166739,676,3725,4401 161 | 160,09-06-2018,2,0,6,0,4,1,2,33.141653,37.87895,56.8333,10.042161,563,3352,3915 162 | 161,10-06-2018,2,0,6,0,5,1,1,30.955,35.1646,60.5,9.417118,815,3771,4586 163 | 162,11-06-2018,2,0,6,0,6,0,1,29.725,33.9019,65.4583,10.37495,1729,3237,4966 164 | 163,12-06-2018,2,0,6,0,0,0,1,28.3925,32.16625,74.7917,10.958989,1467,2993,4460 165 | 164,13-06-2018,2,0,6,0,1,1,1,26.035,30.0827,49.4583,20.45845,863,4157,5020 166 | 165,14-06-2018,2,0,6,0,2,1,1,24.770847,29.5773,50.7083,18.041961,727,4164,4891 167 | 166,15-06-2018,2,0,6,0,3,1,1,25.693347,29.3877,47.1667,11.250104,769,4411,5180 168 | 167,16-06-2018,2,0,6,0,4,1,2,25.761653,29.7673,68.8333,13.833557,545,3222,3767 169 | 168,17-06-2018,2,0,6,0,5,1,1,26.615847,30.01915,73.5833,9.582943,863,3981,4844 170 | 169,18-06-2018,2,0,6,0,6,0,1,28.563347,32.1977,67.0417,8.000336,1807,3312,5119 171 | 170,19-06-2018,2,0,6,0,0,0,2,28.665847,32.2923,66.6667,6.834,1639,3105,4744 172 | 
171,20-06-2018,2,0,6,0,1,1,2,26.035,29.7673,74.625,10.416825,699,3311,4010 173 | 172,21-06-2018,3,0,6,0,2,1,2,27.914153,31.8823,77.0417,11.458675,774,4061,4835 174 | 173,22-06-2018,3,0,6,0,3,1,1,30.066653,34.69145,70.75,11.541554,661,3846,4507 175 | 174,23-06-2018,3,0,6,0,4,1,2,29.861653,34.69165,70.3333,15.999868,746,4044,4790 176 | 175,24-06-2018,3,0,6,0,5,1,1,29.690847,32.82915,57.3333,14.875675,969,4022,4991 177 | 176,25-06-2018,3,0,6,0,6,0,1,28.495,32.16565,48.3333,14.041257,1782,3420,5202 178 | 177,26-06-2018,3,0,6,0,0,0,1,27.88,31.88145,51.3333,6.3337311,1920,3385,5305 179 | 178,27-06-2018,3,0,6,0,1,1,2,27.9825,31.8502,65.8333,7.208396,854,3854,4708 180 | 179,28-06-2018,3,0,6,0,2,1,1,30.510847,34.6279,63.4167,9.666961,732,3916,4648 181 | 180,29-06-2018,3,0,6,0,3,1,1,29.861653,32.7344,49.7917,17.542007,848,4377,5225 182 | 181,30-06-2018,3,0,6,0,4,1,1,28.563347,31.8504,43.4167,12.415904,1027,4488,5515 183 | 182,01-07-2018,3,0,7,0,5,1,1,29.6225,32.6081,39.625,6.874736,1246,4116,5362 184 | 183,02-07-2018,3,0,7,0,6,0,1,30.271653,33.3654,44.4583,7.709154,2204,2915,5119 185 | 184,03-07-2018,3,0,7,0,0,0,2,29.383347,33.42875,68.25,15.333486,2282,2367,4649 186 | 185,04-07-2018,3,0,7,1,1,0,2,29.793347,33.27085,63.7917,5.4591064,3065,2978,6043 187 | 186,05-07-2018,3,0,7,0,2,1,1,30.613347,34.8169,59.0417,8.459286,1031,3634,4665 188 | 187,06-07-2018,3,0,7,0,3,1,1,29.52,34.28165,74.3333,10.042161,784,3845,4629 189 | 188,07-07-2018,3,0,7,0,4,1,1,30.75,34.34355,65.125,10.6664,754,3838,4592 190 | 189,08-07-2018,3,0,7,0,5,1,2,29.075847,33.52415,75.7917,15.083643,692,3348,4040 191 | 190,09-07-2018,3,0,7,0,6,0,1,30.066653,33.2079,60.9167,11.250104,1988,3348,5336 192 | 191,10-07-2018,3,0,7,0,0,0,1,30.6475,34.50125,57.8333,12.292557,1743,3138,4881 193 | 192,11-07-2018,3,0,7,0,1,1,1,31.2625,36.4902,63.5833,18.916579,723,3363,4086 194 | 193,12-07-2018,3,0,7,0,2,1,1,32.560847,36.96375,55.9167,13.417018,662,3596,4258 195 | 
194,13-07-2018,3,0,7,0,3,1,1,30.613347,34.4702,63.1667,9.790911,748,3594,4342 196 | 195,14-07-2018,3,0,7,0,4,1,1,27.914153,31.7552,47.625,16.124689,888,4196,5084 197 | 196,15-07-2018,3,0,7,0,5,1,1,27.196653,31.21855,59.125,12.249811,1318,4220,5538 198 | 197,16-07-2018,3,0,7,0,6,0,1,28.153347,31.91315,58.5,13.958914,2418,3505,5923 199 | 198,17-07-2018,3,0,7,0,0,0,1,29.485847,33.49165,60.4167,16.417211,2006,3296,5302 200 | 199,18-07-2018,3,0,7,0,1,1,1,30.613347,35.19625,65.125,14.458868,841,3617,4458 201 | 200,19-07-2018,3,0,7,0,2,1,1,31.843347,37.37395,65.0417,8.7502,752,3789,4541 202 | 201,20-07-2018,3,0,7,0,3,1,1,31.501653,37.3425,70.7083,7.625739,644,3688,4332 203 | 202,21-07-2018,3,0,7,0,4,1,2,33.415,41.31855,69.125,14.875407,632,3152,3784 204 | 203,22-07-2018,3,0,7,0,5,1,1,34.781653,42.0448,58.0417,8.9177,562,2825,3387 205 | 204,23-07-2018,3,0,7,0,6,0,1,34.815847,40.21435,50,8.791807,987,2298,3285 206 | 205,24-07-2018,3,0,7,0,0,0,1,34.03,39.74145,55.0833,11.334457,1050,2556,3606 207 | 206,25-07-2018,3,0,7,0,1,1,1,30.476653,36.0479,75.7083,6.0841561,568,3272,3840 208 | 207,26-07-2018,3,0,7,0,2,1,1,31.638347,34.84895,54.0833,13.417286,750,3840,4590 209 | 208,27-07-2018,3,0,7,0,3,1,1,31.775,34.53335,40.2917,12.292021,755,3901,4656 210 | 209,28-07-2018,3,0,7,0,4,1,1,31.945847,36.995,58.3333,11.958093,606,3784,4390 211 | 210,29-07-2018,3,0,7,0,5,1,1,34.371653,39.29835,54.25,11.667246,670,3176,3846 212 | 211,30-07-2018,3,0,7,0,6,0,1,32.970847,36.42685,46.5833,11.291979,1559,2916,4475 213 | 212,31-07-2018,3,0,7,0,0,0,1,33.039153,36.4898,48.0833,11.042471,1524,2778,4302 214 | 213,01-08-2018,3,0,8,0,1,1,1,31.638347,35.1646,55.0833,10.500039,729,3537,4266 215 | 214,02-08-2018,3,0,8,0,2,1,1,32.116653,35.35355,49.125,13.79195,801,4044,4845 216 | 215,03-08-2018,3,0,8,0,3,1,2,29.998347,33.99685,65.75,9.084061,467,3107,3574 217 | 216,04-08-2018,3,0,8,0,4,1,2,29.11,33.2394,75.75,13.20905,799,3777,4576 218 | 
217,05-08-2018,3,0,8,0,5,1,1,29.144153,32.82835,63.0833,12.374632,1023,3843,4866 219 | 218,06-08-2018,3,0,8,0,6,0,2,29.383347,33.8077,75.5,15.29275,1521,2773,4294 220 | 219,07-08-2018,3,0,8,0,0,0,1,30.4425,35.7646,75.2917,13.499629,1298,2487,3785 221 | 220,08-08-2018,3,0,8,0,1,1,1,31.365,35.16415,59.2083,12.875725,846,3480,4326 222 | 221,09-08-2018,3,0,8,0,2,1,1,31.775,36.20605,57.0417,10.125107,907,3695,4602 223 | 222,10-08-2018,3,0,8,0,3,1,1,31.433347,34.24915,42.4167,13.417286,884,3896,4780 224 | 223,11-08-2018,3,0,8,0,4,1,1,29.4175,32.57605,42.375,11.041332,812,3980,4792 225 | 224,12-08-2018,3,0,8,0,5,1,1,29.041653,32.7021,41.5,8.416607,1051,3854,4905 226 | 225,13-08-2018,3,0,8,0,6,0,2,28.119153,32.2929,72.9583,14.167418,1504,2646,4150 227 | 226,14-08-2018,3,0,8,0,0,0,2,27.743347,31.2194,81.75,14.916411,1338,2482,3820 228 | 227,15-08-2018,3,0,8,0,1,1,1,27.299153,30.80835,71.2083,13.999918,775,3563,4338 229 | 228,16-08-2018,3,0,8,0,2,1,1,28.734153,32.29185,57.8333,15.834043,721,4004,4725 230 | 229,17-08-2018,3,0,8,0,3,1,1,29.656653,33.33355,57.5417,9.625689,668,4026,4694 231 | 230,18-08-2018,3,0,8,0,4,1,1,29.178347,33.1129,65.4583,15.624936,639,3166,3805 232 | 231,19-08-2018,3,0,8,0,5,1,2,28.085,31.66105,72.2917,9.333636,797,3356,4153 233 | 232,20-08-2018,3,0,8,0,6,0,1,28.5975,32.4498,67.4167,6.999289,1914,3277,5191 234 | 233,21-08-2018,3,0,8,0,0,0,1,29.144153,33.77625,77,16.666518,1249,2624,3873 235 | 234,22-08-2018,3,0,8,0,1,1,1,28.358347,31.9127,47,18.54225,833,3925,4758 236 | 235,23-08-2018,3,0,8,0,2,1,1,26.274153,30.30335,45.5417,9.833121,1281,4614,5895 237 | 236,24-08-2018,3,0,8,0,3,1,1,27.606653,31.5346,60.5,16.958236,949,4181,5130 238 | 237,25-08-2018,3,0,8,0,4,1,2,28.050847,32.2927,77.1667,14.125811,435,3107,3542 239 | 238,26-08-2018,3,0,8,0,5,1,1,28.7,32.98665,76.125,5.6254875,768,3893,4661 240 | 239,27-08-2018,3,0,8,0,6,0,2,27.88,31.7778,85,25.166339,226,889,1115 241 | 240,28-08-2018,3,0,8,0,0,0,1,28.989419,32.39795,56.1765,20.412153,1415,2919,4334 
242 | 241,29-08-2018,3,0,8,0,1,1,1,26.103347,30.3979,55.4583,10.708275,729,3905,4634 243 | 242,30-08-2018,3,0,8,0,2,1,1,26.205847,29.7352,54.8333,8.375536,775,4429,5204 244 | 243,31-08-2018,3,0,8,0,3,1,1,26.923347,30.55605,59.7917,5.5833311,688,4370,5058 245 | 244,01-09-2018,3,0,9,0,4,1,1,26.855,30.74605,63.9167,9.500332,783,4332,5115 246 | 245,02-09-2018,3,0,9,0,5,1,2,26.376653,30.2404,72.7083,9.375243,875,3852,4727 247 | 246,03-09-2018,3,0,9,0,6,0,1,27.435847,31.66065,71.6667,12.416775,1935,2549,4484 248 | 247,04-09-2018,3,0,9,0,0,0,1,29.075847,33.27145,74.2083,13.833289,2521,2419,4940 249 | 248,05-09-2018,3,0,9,1,1,0,2,27.606653,31.2823,79.0417,14.250632,1236,2115,3351 250 | 249,06-09-2018,3,0,9,0,2,1,3,22.14,25.76,88.6957,23.044181,204,2506,2710 251 | 250,07-09-2018,3,0,9,0,3,1,3,24.565847,27.21145,91.7083,6.5003936,118,1878,1996 252 | 251,08-09-2018,3,0,9,0,4,1,3,25.990433,27.76805,93.9565,12.914116,153,1689,1842 253 | 252,09-09-2018,3,0,9,0,5,1,2,26.65,28.9473,89.7917,8.333393,417,3127,3544 254 | 253,10-09-2018,3,0,9,0,6,0,1,27.06,30.3981,75.375,10.291736,1750,3595,5345 255 | 254,11-09-2018,3,0,9,0,0,0,1,26.786653,30.46145,71.375,7.708618,1633,3413,5046 256 | 255,12-09-2018,3,0,9,0,1,1,1,26.418268,30.1065,69.2174,5.957171,690,4023,4713 257 | 256,13-09-2018,3,0,9,0,2,1,1,26.684153,30.1777,71.25,9.500868,701,4062,4763 258 | 257,14-09-2018,3,0,9,0,3,1,1,27.606653,31.345,69.7083,11.2091,647,4138,4785 259 | 258,15-09-2018,3,0,9,0,4,1,2,23.6775,27.68355,70.9167,18.166782,428,3231,3659 260 | 259,16-09-2018,3,0,9,0,5,1,2,19.235847,23.07375,59.0417,11.000261,742,4018,4760 261 | 260,17-09-2018,3,0,9,0,6,0,2,20.158347,23.9256,71.8333,12.708225,1434,3077,4511 262 | 261,18-09-2018,3,0,9,0,0,0,1,20.8075,24.52685,69.5,11.958361,1353,2921,4274 263 | 262,19-09-2018,3,0,9,0,1,1,2,22.515847,26.48375,69,10.166714,691,3848,4539 264 | 263,20-09-2018,3,0,9,0,2,1,2,23.028347,26.61085,88.125,9.041918,438,3203,3641 265 | 
264,21-09-2018,3,0,9,0,3,1,2,24.395,27.52665,90,6.4590814,539,3813,4352 266 | 265,22-09-2018,3,0,9,0,4,1,2,25.761653,27.74815,90.2083,8.584375,555,4240,4795 267 | 266,23-09-2018,4,0,9,0,5,1,2,24.975847,26.10625,97.25,5.2505689,258,2137,2395 268 | 267,24-09-2018,4,0,9,0,6,0,2,24.873347,28.2206,86.25,5.2516811,1776,3647,5423 269 | 268,25-09-2018,4,0,9,0,0,0,2,26.000847,28.63185,84.5,3.3754064,1544,3466,5010 270 | 269,26-09-2018,4,0,9,0,1,1,2,26.615847,29.4521,84.8333,7.4169,684,3946,4630 271 | 270,27-09-2018,4,0,9,0,2,1,2,26.103347,28.72625,88.5417,7.917457,477,3643,4120 272 | 271,28-09-2018,4,0,9,0,3,1,2,26.035,28.7579,84.875,9.958143,480,3427,3907 273 | 272,29-09-2018,4,0,9,0,4,1,1,25.283347,28.7256,69.9167,11.583161,653,4186,4839 274 | 273,30-09-2018,4,0,9,0,5,1,1,23.130847,27.24145,64.75,13.833825,830,4372,5202 275 | 274,01-10-2018,4,0,10,0,6,0,2,16.81,20.64315,75.375,19.583832,480,1949,2429 276 | 275,02-10-2018,4,0,10,0,0,0,2,14.623347,17.26585,79.1667,14.874871,616,2302,2918 277 | 276,03-10-2018,4,0,10,0,1,1,2,15.750847,19.6023,76.0833,5.5841686,330,3240,3570 278 | 277,04-10-2018,4,0,10,0,2,1,1,19.850847,23.6429,71,13.792218,486,3970,4456 279 | 278,05-10-2018,4,0,10,0,3,1,1,22.071653,26.3569,64.7917,11.87575,559,4267,4826 280 | 279,06-10-2018,4,0,10,0,4,1,1,20.260847,24.02125,62.0833,9.041918,639,4126,4765 281 | 280,07-10-2018,4,0,10,0,5,1,1,20.944153,25.2202,68.4167,1.5002439,949,4036,4985 282 | 281,08-10-2018,4,0,10,0,6,0,1,21.388347,25.6621,70.125,3.0420814,2235,3174,5409 283 | 282,09-10-2018,4,0,10,0,0,0,1,22.174153,26.19915,72.75,4.25115,2397,3114,5511 284 | 283,10-10-2018,4,0,10,1,1,0,1,23.404153,27.14625,73.375,2.8343814,1514,3603,5117 285 | 284,11-10-2018,4,0,10,0,2,1,2,23.233347,27.3048,80.875,9.583814,667,3896,4563 286 | 285,12-10-2018,4,0,10,0,3,1,3,22.276653,25.88585,90.625,16.62605,217,2199,2416 287 | 286,13-10-2018,4,0,10,0,4,1,2,24.155847,27.5902,89.6667,9.499729,290,2623,2913 288 | 
287,14-10-2018,4,0,10,0,5,1,2,22.584153,26.48375,71.625,15.000161,529,3115,3644 289 | 288,15-10-2018,4,0,10,0,6,0,1,20.773347,24.93625,48.3333,17.291561,1899,3318,5217 290 | 289,16-10-2018,4,0,10,0,0,0,1,20.978347,25.1577,48.6667,18.875039,1748,3293,5041 291 | 290,17-10-2018,4,0,10,0,1,1,1,21.900847,25.53625,57.9583,11.750393,713,3857,4570 292 | 291,18-10-2018,4,0,10,0,2,1,2,21.8325,26.13605,70.1667,7.375829,637,4111,4748 293 | 292,19-10-2018,4,0,10,0,3,1,3,22.211299,25.6924,89.5217,16.303713,254,2170,2424 294 | 293,20-10-2018,4,0,10,0,4,1,1,19.509153,23.32625,63.625,28.292425,471,3724,4195 295 | 294,21-10-2018,4,0,10,0,5,1,1,17.5275,21.1798,57.4167,14.833532,676,3628,4304 296 | 295,22-10-2018,4,0,10,0,6,0,1,17.3225,21.2746,62.9167,6.2086689,1499,2809,4308 297 | 296,23-10-2018,4,0,10,0,0,0,1,17.288347,21.11665,74.125,6.6673375,1619,2762,4381 298 | 297,24-10-2018,4,0,10,0,1,1,1,18.996653,22.85335,77.2083,7.959064,699,3488,4187 299 | 298,25-10-2018,4,0,10,0,2,1,1,19.338347,23.16875,62.2917,11.166086,695,3992,4687 300 | 299,26-10-2018,4,0,10,0,3,1,2,19.850847,23.6423,72.0417,9.959014,404,3490,3894 301 | 300,27-10-2018,4,0,10,0,4,1,2,19.27,22.8523,81.2917,13.250121,240,2419,2659 302 | 301,28-10-2018,4,0,10,0,5,1,2,13.564153,15.9406,58.5833,15.375093,456,3291,3747 303 | 302,29-10-2018,4,0,10,0,6,0,3,10.420847,11.39565,88.25,23.541857,57,570,627 304 | 303,30-10-2018,4,0,10,0,0,0,1,13.085847,16.06645,62.375,11.833339,885,2446,3331 305 | 304,31-10-2018,4,0,10,0,1,1,1,13.94,17.80315,70.3333,7.12545,362,3307,3669 306 | 305,01-11-2018,4,0,11,0,2,1,1,16.434153,19.8544,68.375,9.083257,410,3658,4068 307 | 306,02-11-2018,4,0,11,0,3,1,1,15.4775,19.50665,71.875,5.5001439,370,3816,4186 308 | 307,03-11-2018,4,0,11,0,4,1,1,16.741653,20.29605,70.2083,9.166739,318,3656,3974 309 | 308,04-11-2018,4,0,11,0,5,1,2,16.536653,20.1696,62.25,18.209193,470,3576,4046 310 | 309,05-11-2018,4,0,11,0,6,0,1,13.393347,16.1927,51.9167,12.667154,1156,2770,3926 311 | 
310,06-11-2018,4,0,11,0,0,0,1,14.281653,18.1179,73.4583,6.1676314,952,2697,3649 312 | 311,07-11-2018,4,0,11,0,1,1,1,16.195,20.04355,75.875,3.834075,373,3662,4035 313 | 312,08-11-2018,4,0,11,0,2,1,1,16.741653,20.6123,72.1667,4.6255125,376,3829,4205 314 | 313,09-11-2018,4,0,11,0,3,1,1,16.4,20.45395,75.8333,4.1671186,305,3804,4109 315 | 314,10-11-2018,4,0,11,0,4,1,2,15.58,18.68605,81.3333,12.667489,190,2743,2933 316 | 315,11-11-2018,4,0,11,1,5,0,1,13.290847,15.34085,44.625,21.083225,440,2928,3368 317 | 316,12-11-2018,4,0,11,0,6,0,1,14.623347,17.8971,55.2917,14.208154,1275,2792,4067 318 | 317,13-11-2018,4,0,11,0,0,0,1,18.074153,21.5275,45.8333,18.875307,1004,2713,3717 319 | 318,14-11-2018,4,0,11,0,1,1,1,21.73,26.2306,58.7083,20.541932,595,3891,4486 320 | 319,15-11-2018,4,0,11,0,2,1,2,21.73,25.37895,68.875,13.375411,449,3746,4195 321 | 320,16-11-2018,4,0,11,0,3,1,3,18.723347,22.5994,93,9.167543,145,1672,1817 322 | 321,17-11-2018,4,0,11,0,4,1,2,14.008347,16.16105,57.5833,20.459254,139,2914,3053 323 | 322,18-11-2018,4,0,11,0,5,1,1,11.240847,13.63605,41,11.291711,245,3147,3392 324 | 323,19-11-2018,4,0,11,0,6,0,1,13.495847,16.22415,50.2083,15.041232,943,2720,3663 325 | 324,20-11-2018,4,0,11,0,0,0,2,18.996653,22.8529,68.4583,12.45865,787,2733,3520 326 | 325,21-11-2018,4,0,11,0,1,1,3,18.3475,22.2531,91,9.249618,220,2545,2765 327 | 326,22-11-2018,4,0,11,0,2,1,3,17.083347,21.0848,96.25,7.959064,69,1538,1607 328 | 327,23-11-2018,4,0,11,0,3,1,2,18.074153,21.52685,75.7917,22.500275,112,2454,2566 329 | 328,24-11-2018,4,0,11,1,4,0,1,15.306653,18.62355,54.9167,11.209368,560,935,1495 330 | 329,25-11-2018,4,0,11,0,5,1,1,15.375,19.03355,64.375,6.6260186,1095,1697,2792 331 | 330,26-11-2018,4,0,11,0,6,0,1,15.409153,19.25435,68.1667,4.5841936,1249,1819,3068 332 | 331,27-11-2018,4,0,11,0,0,0,1,18.825847,22.79,69.8333,13.999918,810,2261,3071 333 | 332,28-11-2018,4,0,11,0,1,1,1,20.642598,24.5061,74.3043,9.522174,253,3614,3867 334 | 
333,29-11-2018,4,0,11,0,2,1,2,18.791653,22.56875,83.0833,17.292164,96,2818,2914 335 | 334,30-11-2018,4,0,11,0,3,1,1,13.325,15.56105,61.3333,18.167586,188,3425,3613 336 | 335,01-12-2018,4,0,12,0,4,1,1,12.8125,15.2777,52.4583,14.750586,182,3545,3727 337 | 336,02-12-2018,4,0,12,0,5,1,1,12.880847,16.57165,62.5833,6.750518,268,3672,3940 338 | 337,03-12-2018,4,0,12,0,6,0,1,12.265847,15.5302,61.2917,6.4174811,706,2908,3614 339 | 338,04-12-2018,4,0,12,0,0,0,1,13.564153,17.455,77.5833,5.6252061,634,2851,3485 340 | 339,05-12-2018,4,0,12,0,1,1,2,15.819153,19.69625,82.7083,4.1679561,233,3578,3811 341 | 340,06-12-2018,4,0,12,0,2,1,3,18.9625,22.82,94.9583,15.583061,126,2468,2594 342 | 341,07-12-2018,4,0,12,0,3,1,3,16.81,20.0123,97.0417,17.833725,50,655,705 343 | 342,08-12-2018,4,0,12,0,4,1,1,10.899153,12.8469,58,16.083886,150,3172,3322 344 | 343,09-12-2018,4,0,12,0,5,1,1,11.924153,15.8771,69.5833,5.5420189,261,3359,3620 345 | 344,10-12-2018,4,0,12,0,6,0,1,11.275,13.3206,50.75,15.625807,502,2688,3190 346 | 345,11-12-2018,4,0,12,0,0,0,1,9.054153,12.6577,49,4.4582939,377,2366,2743 347 | 346,12-12-2018,4,0,12,0,1,1,1,9.771653,13.5098,67.0833,4.25115,143,3167,3310 348 | 347,13-12-2018,4,0,12,0,2,1,1,11.5825,15.0569,59,9.41685,155,3368,3523 349 | 348,14-12-2018,4,0,12,0,3,1,2,13.0175,16.9181,66.375,4.0842061,178,3562,3740 350 | 349,15-12-2018,4,0,12,0,4,1,2,17.3225,20.61185,63.4167,17.958814,181,3528,3709 351 | 350,16-12-2018,4,0,12,0,5,1,2,15.375,17.99125,50.0417,17.458525,178,3399,3577 352 | 351,17-12-2018,4,0,12,0,6,0,2,10.591653,12.46855,56.0833,16.292189,275,2464,2739 353 | 352,18-12-2018,4,0,12,0,0,0,1,9.771653,12.27895,58.625,11.375193,220,2211,2431 354 | 353,19-12-2018,4,0,12,0,1,1,1,11.343347,14.04665,63.75,11.584032,260,3143,3403 355 | 354,20-12-2018,4,0,12,0,2,1,2,15.819153,19.8227,59.5417,4.1252436,216,3534,3750 356 | 355,21-12-2018,1,0,12,0,3,1,2,17.561653,21.40085,85.8333,14.8338,107,2553,2660 357 | 
356,22-12-2018,1,0,12,0,4,1,2,17.356653,21.30605,75.75,3.167425,227,2841,3068 358 | 357,23-12-2018,1,0,12,0,5,1,1,15.306653,18.87565,68.625,18.374482,163,2046,2209 359 | 358,24-12-2018,1,0,12,0,6,0,1,12.4025,14.9621,54.25,12.750368,155,856,1011 360 | 359,25-12-2018,1,0,12,0,0,0,1,11.266103,13.99805,68.1304,10.391097,303,451,754 361 | 360,26-12-2018,1,0,12,1,1,0,1,13.191299,15.77675,50.6957,16.044155,430,887,1317 362 | 361,27-12-2018,1,0,12,0,2,1,2,13.325,16.38165,76.25,12.62615,103,1059,1162 363 | 362,28-12-2018,1,0,12,0,3,1,1,12.26433,13.9987,50.3913,19.695387,255,2047,2302 364 | 363,29-12-2018,1,0,12,0,4,1,1,10.181653,13.1946,57.4167,8.000604,254,2169,2423 365 | 364,30-12-2018,1,0,12,0,5,1,1,12.778347,15.9406,63.6667,9.000579,491,2508,2999 366 | 365,31-12-2018,1,0,12,0,6,0,1,16.81,20.70605,61.5833,14.750318,665,1820,2485 367 | 366,01-01-2019,1,1,1,0,0,0,1,15.17,18.78105,69.25,12.875189,686,1608,2294 368 | 367,02-01-2019,1,1,1,1,1,0,1,11.194763,12.6152,38.1304,22.087555,244,1707,1951 369 | 368,03-01-2019,1,1,1,0,2,1,1,6.15,6.31375,44.125,24.499957,89,2147,2236 370 | 369,04-01-2019,1,1,1,0,3,1,2,4.4075,5.96685,41.4583,12.3749,95,2273,2368 371 | 370,05-01-2019,1,1,1,0,4,1,1,10.899153,13.9206,52.4167,8.709129,140,3132,3272 372 | 371,06-01-2019,1,1,1,0,5,1,1,13.700847,17.01335,54.2083,11.249836,307,3791,4098 373 | 372,07-01-2019,1,1,1,0,6,0,1,16.126653,19.53895,53.1667,11.708786,1070,3451,4521 374 | 373,08-01-2019,1,1,1,0,0,0,1,13.8375,17.0129,46.5,12.833314,599,2826,3425 375 | 374,09-01-2019,1,1,1,0,1,1,2,9.190847,12.37395,70.1667,6.6263,106,2270,2376 376 | 375,10-01-2019,1,1,1,0,2,1,1,12.656536,15.9413,64.6522,12.565984,173,3425,3598 377 | 376,11-01-2019,1,1,1,0,3,1,2,11.240847,14.14105,84.75,8.791807,92,2085,2177 378 | 377,12-01-2019,1,1,1,0,4,1,2,15.6825,19.0969,80.2917,12.124789,269,3828,4097 379 | 378,13-01-2019,1,1,1,0,5,1,1,11.240847,12.4681,50.75,25.333236,174,3040,3214 380 | 379,14-01-2019,1,1,1,0,6,0,1,7.38,9.15435,45.75,12.541261,333,2160,2493 381 | 
380,15-01-2019,1,1,1,0,0,0,1,6.833347,8.08125,41.9167,16.834286,284,2027,2311 382 | 381,16-01-2019,1,1,1,1,1,0,1,7.79,9.53315,52.25,15.500986,217,2081,2298 383 | 382,17-01-2019,1,1,1,0,2,1,2,15.294763,18.2139,71.6087,23.39171,127,2808,2935 384 | 383,18-01-2019,1,1,1,0,3,1,1,12.436653,13.7627,44.3333,27.833743,109,3267,3376 385 | 384,19-01-2019,1,1,1,0,4,1,1,7.79,9.5019,49.75,14.750586,130,3162,3292 386 | 385,20-01-2019,1,1,1,0,5,1,2,8.9175,11.0479,45,13.58425,115,3048,3163 387 | 386,21-01-2019,1,1,1,0,6,0,2,7.106653,8.74375,83.125,14.917014,67,1234,1301 388 | 387,22-01-2019,1,1,1,0,0,0,2,6.6625,8.1125,79.625,13.375746,196,1781,1977 389 | 388,23-01-2019,1,1,1,0,1,1,2,8.951653,12.1529,91.125,7.417436,145,2287,2432 390 | 389,24-01-2019,1,1,1,0,2,1,1,14.0425,17.4554,83.5833,8.292389,439,3900,4339 391 | 390,25-01-2019,1,1,1,0,3,1,1,12.060847,14.74105,64.375,10.791757,467,3803,4270 392 | 391,26-01-2019,1,1,1,0,4,1,2,14.008347,17.8025,76.9583,4.9175186,244,3831,4075 393 | 392,27-01-2019,1,1,1,0,5,1,2,17.425,20.76915,74.125,22.958689,269,3187,3456 394 | 393,28-01-2019,1,1,1,0,6,0,1,12.949153,16.31895,54.3333,14.125543,775,3248,4023 395 | 394,29-01-2019,1,1,1,0,0,0,1,11.5825,13.63605,31.125,16.08335,558,2685,3243 396 | 395,30-01-2019,1,1,1,0,1,1,1,11.035847,13.13125,40.0833,14.458064,126,3498,3624 397 | 396,31-01-2019,1,1,1,0,2,1,1,15.99,19.06585,41.6667,17.541739,324,4185,4509 398 | 397,01-02-2019,1,1,2,0,3,1,1,19.235847,23.3269,50.7917,12.667489,304,4275,4579 399 | 398,02-02-2019,1,1,2,0,4,1,2,16.365847,19.94855,67.2917,12.541529,190,3571,3761 400 | 399,03-02-2019,1,1,2,0,5,1,1,12.846653,15.4673,52.6667,11.959232,310,3841,4151 401 | 400,04-02-2019,1,1,2,0,6,0,2,10.830847,13.63625,77.9583,8.167032,384,2448,2832 402 | 401,05-02-2019,1,1,2,0,0,0,2,10.899153,13.22605,68.7917,11.791732,318,2629,2947 403 | 402,06-02-2019,1,1,2,0,1,1,1,11.586969,14.8213,62.2174,10.3046,206,3578,3784 404 | 403,07-02-2019,1,1,2,0,2,1,1,14.520847,18.0552,49.625,9.874393,199,4176,4375 405 | 
404,08-02-2019,1,1,2,0,3,1,2,10.523347,13.32105,72.2917,8.959307,109,2693,2802 406 | 405,09-02-2019,1,1,2,0,4,1,1,10.865,13.0994,56.2083,13.000479,163,3667,3830 407 | 406,10-02-2019,1,1,2,0,5,1,2,11.514153,14.6779,54,7.834243,227,3604,3831 408 | 407,11-02-2019,1,1,2,0,6,0,3,9.190847,10.54335,73.125,19.416332,192,1977,2169 409 | 408,12-02-2019,1,1,2,0,0,0,1,5.2275,5.0829,46.4583,27.417204,73,1456,1529 410 | 409,13-02-2019,1,1,2,0,1,1,1,9.1225,11.39565,41.125,11.207961,94,3328,3422 411 | 410,14-02-2019,1,1,2,0,2,1,2,13.085847,16.6973,50.875,9.458993,135,3787,3922 412 | 411,15-02-2019,1,1,2,0,3,1,1,14.281653,17.58145,53.125,12.1672,141,4028,4169 413 | 412,16-02-2019,1,1,2,0,4,1,2,12.983347,16.5081,75.2917,6.125475,74,2931,3005 414 | 413,17-02-2019,1,1,2,0,5,1,1,14.076653,17.58145,63.4583,13.791682,349,3805,4154 415 | 414,18-02-2019,1,1,2,0,6,0,1,14.213347,17.77125,53.4583,12.792243,1435,2883,4318 416 | 415,19-02-2019,1,1,2,0,0,0,2,11.48,13.2894,51.5833,16.958504,618,2071,2689 417 | 416,20-02-2019,1,1,2,1,1,0,1,11.48,13.66955,50.7826,15.348561,502,2627,3129 418 | 417,21-02-2019,1,1,2,0,2,1,1,11.800866,14.75565,59.4348,13.783039,163,3614,3777 419 | 418,22-02-2019,1,1,2,0,3,1,1,16.229153,19.63335,56.7917,15.709557,394,4379,4773 420 | 419,23-02-2019,1,1,2,0,4,1,1,18.620847,22.2223,55.4583,12.791171,516,4546,5062 421 | 420,24-02-2019,1,1,2,0,5,1,2,16.7075,20.54855,73.75,15.916989,246,3241,3487 422 | 421,25-02-2019,1,1,2,0,6,0,1,11.924153,12.78375,39.5833,28.250014,317,2415,2732 423 | 422,26-02-2019,1,1,2,0,0,0,1,11.445847,13.4154,41,13.750343,515,2874,3389 424 | 423,27-02-2019,1,1,2,0,1,1,1,15.033347,17.8977,49.0833,17.958211,253,4069,4322 425 | 424,28-02-2019,1,1,2,0,2,1,1,14.725847,17.67625,39.5833,12.958939,229,4134,4363 426 | 425,01-03-2019,1,1,3,0,4,1,1,19.919153,23.76855,61.5417,15.208129,325,4665,4990 427 | 426,02-03-2019,1,1,3,0,5,1,2,14.486653,17.9921,65.7083,9.708568,246,2948,3194 428 | 427,03-03-2019,1,1,3,0,6,0,2,16.980847,20.6746,62.125,10.792293,956,3110,4066 
429 | 428,04-03-2019,1,1,3,0,0,0,1,13.359153,15.15105,40.3333,22.416257,710,2713,3423 430 | 429,05-03-2019,1,1,3,0,1,1,1,9.976653,12.05855,50.625,15.333486,203,3130,3333 431 | 430,06-03-2019,1,1,3,0,2,1,1,10.591653,12.7521,45.6667,13.458625,221,3735,3956 432 | 431,07-03-2019,1,1,3,0,3,1,1,16.570847,19.255,51.3333,23.167193,432,4484,4916 433 | 432,08-03-2019,1,1,3,0,4,1,1,21.6275,26.2302,56.75,29.584721,486,4896,5382 434 | 433,09-03-2019,1,1,3,0,5,1,2,16.844153,19.85415,40.7083,27.7916,447,4122,4569 435 | 434,10-03-2019,1,1,3,0,6,0,1,11.7875,13.88835,35.0417,15.12525,968,3150,4118 436 | 435,11-03-2019,1,1,3,0,0,0,1,14.831299,17.9835,47.6957,14.913329,1658,3253,4911 437 | 436,12-03-2019,1,1,3,0,1,1,1,19.133347,22.9796,48.9167,13.916771,838,4460,5298 438 | 437,13-03-2019,1,1,3,0,2,1,1,23.165,27.14645,61.75,15.87565,762,5085,5847 439 | 438,14-03-2019,1,1,3,0,3,1,1,23.4725,27.43085,50.7083,7.709154,997,5315,6312 440 | 439,15-03-2019,1,1,3,0,4,1,1,22.8575,26.64125,57.9583,10.042161,1005,5187,6192 441 | 440,16-03-2019,1,1,3,0,5,1,2,17.869153,21.81145,84.2083,7.583864,548,3830,4378 442 | 441,17-03-2019,1,1,3,0,6,0,2,21.080847,25.2523,75.5833,7.417168,3155,4681,7836 443 | 442,18-03-2019,1,1,3,0,0,0,2,19.3725,23.2,81,8.501161,2207,3685,5892 444 | 443,19-03-2019,1,1,3,0,1,1,1,22.345,26.64105,72.875,10.875239,982,5171,6153 445 | 444,20-03-2019,1,1,3,0,2,1,1,22.994153,26.92665,80.7917,8.125157,1051,5042,6093 446 | 445,21-03-2019,2,1,3,0,3,1,2,21.798347,25.6629,82.125,6.0004061,1122,5108,6230 447 | 446,22-03-2019,2,1,3,0,4,1,1,22.720847,26.57835,83.125,7.876654,1334,5537,6871 448 | 447,23-03-2019,2,1,3,0,5,1,2,24.668347,28.50335,69.4167,7.7921,2469,5893,8362 449 | 448,24-03-2019,2,1,3,0,6,0,2,20.6025,24.33665,88.5417,12.916461,1033,2339,3372 450 | 449,25-03-2019,2,1,3,0,0,0,2,17.9375,21.8744,88.0833,14.791925,1532,3464,4996 451 | 450,26-03-2019,2,1,3,0,1,1,1,18.279153,21.9375,47.7917,25.917007,795,4763,5558 452 | 
451,27-03-2019,2,1,3,0,2,1,1,13.256653,15.7827,29,12.541864,531,4571,5102 453 | 452,28-03-2019,2,1,3,0,3,1,1,19.850847,23.5475,48.125,19.541957,674,5024,5698 454 | 453,29-03-2019,2,1,3,0,4,1,1,20.260847,24.1152,43.9167,21.41655,834,5299,6133 455 | 454,30-03-2019,2,1,3,0,5,1,2,15.17,18.78105,58.0833,9.250489,796,4663,5459 456 | 455,31-03-2019,2,1,3,0,6,0,2,17.390847,21.0854,73.8333,16.791339,2301,3934,6235 457 | 456,01-04-2019,2,1,4,0,0,0,2,17.459153,20.86435,67.625,11.541889,2347,3694,6041 458 | 457,02-04-2019,2,1,4,0,1,1,1,17.790433,21.37565,50.4348,20.913313,1208,4728,5936 459 | 458,03-04-2019,2,1,4,0,2,1,1,19.133347,23.07415,39.6667,6.708911,1348,5424,6772 460 | 459,04-04-2019,2,1,4,0,3,1,1,22.208347,26.6725,46.9583,12.125325,1058,5378,6436 461 | 460,05-04-2019,2,1,4,0,4,1,1,17.835,21.55815,37.4167,14.708443,1192,5265,6457 462 | 461,06-04-2019,2,1,4,0,5,1,1,16.536653,19.53835,37.7083,20.125996,1807,4653,6460 463 | 462,07-04-2019,2,1,4,0,6,0,1,17.9375,21.30645,25.4167,18.416357,3252,3605,6857 464 | 463,08-04-2019,2,1,4,0,0,0,1,20.5,24.62125,27.5833,15.583932,2230,2939,5169 465 | 464,09-04-2019,2,1,4,0,1,1,1,20.055847,23.8319,31.75,23.999132,905,4680,5585 466 | 465,10-04-2019,2,1,4,0,2,1,1,18.313347,21.81165,43.5,16.708125,819,5099,5918 467 | 466,11-04-2019,2,1,4,0,3,1,1,14.296536,16.8637,46.9565,19.783358,482,4380,4862 468 | 467,12-04-2019,2,1,4,0,4,1,1,16.2975,19.3802,46.625,19.458743,663,4746,5409 469 | 468,13-04-2019,2,1,4,0,5,1,1,18.1425,21.5904,40.8333,10.416557,1252,5146,6398 470 | 469,14-04-2019,2,1,4,0,6,0,1,20.295,24.3998,50.2917,12.791439,2795,4665,7460 471 | 470,15-04-2019,2,1,4,0,0,0,1,24.873347,28.69375,50.7917,15.083643,2846,4286,7132 472 | 471,16-04-2019,2,1,4,1,1,0,1,27.230847,30.74625,56.1667,19.083543,1198,5172,6370 473 | 472,17-04-2019,2,1,4,0,2,1,1,24.941653,29.92435,39.0417,18.333143,989,5702,6691 474 | 473,18-04-2019,2,1,4,0,3,1,2,18.996653,22.8519,56.9167,11.250104,347,4020,4367 475 | 
474,19-04-2019,2,1,4,0,4,1,1,20.431653,24.6523,61.25,4.4172564,846,5719,6565 476 | 475,20-04-2019,2,1,4,0,5,1,1,21.593347,25.78875,69.4583,10.041357,1340,5950,7290 477 | 476,21-04-2019,2,1,4,0,6,0,1,23.37,27.14605,68.2917,19.000329,2541,4083,6624 478 | 477,22-04-2019,2,1,4,0,0,0,3,16.263347,19.4752,83.5417,23.084582,120,907,1027 479 | 478,23-04-2019,2,1,4,0,1,1,2,13.188347,15.05625,76.6667,20.334232,195,3019,3214 480 | 479,24-04-2019,2,1,4,0,2,1,1,16.946653,20.26415,45.4167,16.708661,518,5115,5633 481 | 480,25-04-2019,2,1,4,0,3,1,1,19.543347,23.51585,42.7917,7.959064,655,5541,6196 482 | 481,26-04-2019,2,1,4,0,4,1,2,20.431653,24.17915,75.6667,11.833875,475,4551,5026 483 | 482,27-04-2019,2,1,4,0,5,1,1,18.7575,22.63185,40.0833,23.291411,1014,5219,6233 484 | 483,28-04-2019,2,1,4,0,6,0,2,15.443347,18.8752,48.9583,8.708325,1120,3100,4220 485 | 484,29-04-2019,2,1,4,0,0,0,1,18.791653,22.50605,58.7083,7.832836,2229,4075,6304 486 | 485,30-04-2019,2,1,4,0,1,1,2,19.030847,22.8848,57,11.499746,665,4907,5572 487 | 486,01-05-2019,2,1,5,0,2,1,2,25.146653,28.85105,65.9583,10.458432,653,5087,5740 488 | 487,02-05-2019,2,1,5,0,3,1,1,23.130847,26.8948,79.7083,9.249886,667,5502,6169 489 | 488,03-05-2019,2,1,5,0,4,1,2,22.96,26.8621,76.8333,8.957632,764,5657,6421 490 | 489,04-05-2019,2,1,5,0,5,1,1,25.7275,29.54585,73.5417,10.916846,1069,5227,6296 491 | 490,05-05-2019,2,1,5,0,6,0,2,25.488347,29.2304,75.6667,10.250464,2496,4387,6883 492 | 491,06-05-2019,2,1,5,0,0,0,2,23.0625,27.33685,74,10.041893,2135,4224,6359 493 | 492,07-05-2019,2,1,5,0,1,1,2,22.0375,26.3571,66.4167,15.458307,1008,5265,6273 494 | 493,08-05-2019,2,1,5,0,2,1,2,23.848347,27.87355,68.5833,19.833943,738,4990,5728 495 | 494,09-05-2019,2,1,5,0,3,1,2,23.575,27.65125,74.4167,14.499604,620,4097,4717 496 | 495,10-05-2019,2,1,5,0,4,1,1,20.739153,24.58915,55.2083,21.042221,1026,5546,6572 497 | 496,11-05-2019,2,1,5,0,5,1,1,21.866653,26.04165,36.0417,15.874779,1319,5711,7030 498 | 
497,12-05-2019,2,1,5,0,6,0,1,23.130847,27.24085,48.0417,8.249911,2622,4807,7429 499 | 498,13-05-2019,2,1,5,0,0,0,1,25.1125,29.2619,57.625,15.082839,2172,3946,6118 500 | 499,14-05-2019,2,1,5,0,1,1,2,23.506653,27.495,78.9583,14.250364,342,2501,2843 501 | 500,15-05-2019,2,1,5,0,2,1,2,25.078347,28.8202,79.4583,9.875264,625,4490,5115 502 | 501,16-05-2019,2,1,5,0,3,1,1,26.103347,29.79875,69.7917,8.208304,991,6433,7424 503 | 502,17-05-2019,2,1,5,0,4,1,1,24.326653,28.63065,52,15.374825,1242,6142,7384 504 | 503,18-05-2019,2,1,5,0,5,1,1,23.130847,27.55605,52.3333,9.166739,1521,6118,7639 505 | 504,19-05-2019,2,1,5,0,6,0,1,24.6,28.3454,45.625,5.626325,3410,4884,8294 506 | 505,20-05-2019,2,1,5,0,0,0,1,25.454153,29.19835,53.0417,17.042589,2704,4425,7129 507 | 506,21-05-2019,2,1,5,0,1,1,2,24.531653,28.28335,81.125,15.624668,630,3729,4359 508 | 507,22-05-2019,2,1,5,0,2,1,2,25.215,29.04125,76.5833,7.917189,819,5254,6073 509 | 508,23-05-2019,2,1,5,0,3,1,2,25.488347,29.2306,77.4583,6.834,766,4494,5260 510 | 509,24-05-2019,2,1,5,0,4,1,1,26.855,30.335,71.6667,11.584032,1059,5711,6770 511 | 510,25-05-2019,2,1,5,0,5,1,1,27.88,31.37645,74.7083,9.41685,1417,5317,6734 512 | 511,26-05-2019,2,1,5,0,6,0,1,28.3925,32.1348,73.25,13.332464,2855,3681,6536 513 | 512,27-05-2019,2,1,5,0,0,0,1,28.29,32.07125,69.7083,14.416457,3283,3308,6591 514 | 513,28-05-2019,2,1,5,1,1,0,1,29.2125,33.965,67.625,13.166907,2557,3486,6043 515 | 514,29-05-2019,2,1,5,0,2,1,1,29.6225,33.6496,68.4583,19.7918,880,4863,5743 516 | 515,30-05-2019,2,1,5,0,3,1,2,26.923347,30.55645,67,9.000043,745,6110,6855 517 | 516,31-05-2019,2,1,5,0,4,1,1,27.88,31.56645,49.2917,13.083693,1100,6238,7338 518 | 517,01-06-2019,2,1,6,0,5,1,2,26.820847,30.3981,75.5417,15.916721,533,3594,4127 519 | 518,02-06-2019,2,1,6,0,6,0,1,23.916653,28.3144,54.9167,12.499654,2795,5325,8120 520 | 519,03-06-2019,2,1,6,0,0,0,1,24.7025,28.75665,49.3333,12.333829,2494,5147,7641 521 | 520,04-06-2019,2,1,6,0,1,1,1,24.4975,28.91415,48.7083,19.083811,1071,5927,6998 522 | 
521,05-06-2019,2,1,6,0,2,1,2,22.174153,26.2946,61.3333,14.041525,968,6033,7001 523 | 522,06-06-2019,2,1,6,0,3,1,1,22.720847,27.1146,61.125,5.167375,1027,6028,7055 524 | 523,07-06-2019,2,1,6,0,4,1,1,24.7025,28.4721,56.7083,10.54245,1038,6456,7494 525 | 524,08-06-2019,2,1,6,0,5,1,1,26.615847,29.8931,46.7917,11.750661,1488,6248,7736 526 | 525,09-06-2019,2,1,6,0,6,0,1,29.144153,32.41835,43.7083,9.667229,2708,4790,7498 527 | 526,10-06-2019,2,1,6,0,0,0,1,29.793347,33.17585,53.8333,8.959307,2224,4374,6598 528 | 527,11-06-2019,2,1,6,0,1,1,2,29.554153,32.98605,58.7917,13.916771,1017,5647,6664 529 | 528,12-06-2019,2,1,6,0,2,1,2,26.786653,29.89375,83.3333,14.374582,477,4495,4972 530 | 529,13-06-2019,2,1,6,0,3,1,1,26.889153,30.55585,58.2083,22.999693,1173,6248,7421 531 | 530,14-06-2019,2,1,6,0,4,1,1,26.581653,31.21915,56.9583,17.000111,1180,6183,7363 532 | 531,15-06-2019,2,1,6,0,5,1,1,26.205847,29.9877,58.9583,11.833339,1563,6102,7665 533 | 532,16-06-2019,2,1,6,0,6,0,1,25.898347,29.7354,50.4167,11.166689,2963,4739,7702 534 | 533,17-06-2019,2,1,6,0,0,0,1,24.2925,28.59875,59.875,9.708568,2634,4344,6978 535 | 534,18-06-2019,2,1,6,0,1,1,2,23.301653,27.2421,77.7917,11.707982,653,4446,5099 536 | 535,19-06-2019,2,1,6,0,2,1,1,28.221653,32.7346,69,9.917139,968,5857,6825 537 | 536,20-06-2019,2,1,6,0,3,1,1,32.0825,36.04875,59.2083,7.625404,872,5339,6211 538 | 537,21-06-2019,3,1,6,0,4,1,1,33.039153,37.6271,56.7917,7.958729,778,5127,5905 539 | 538,22-06-2019,3,1,6,0,5,1,1,31.8775,36.20605,57.375,12.250414,964,4859,5823 540 | 539,23-06-2019,3,1,6,0,6,0,1,29.998347,32.6396,53.4583,12.041307,2657,4801,7458 541 | 540,24-06-2019,3,1,6,0,0,0,1,30.476653,33.7127,47.9167,9.750175,2551,4340,6891 542 | 541,25-06-2019,3,1,6,0,1,1,1,29.349153,32.7021,50.4167,20.125661,1139,5640,6779 543 | 542,26-06-2019,3,1,6,0,2,1,1,25.864153,29.7352,37.3333,23.292014,1077,6365,7442 544 | 543,27-06-2019,3,1,6,0,3,1,1,28.5975,32.0396,36,18.208925,1077,6258,7335 545 | 
544,28-06-2019,3,1,6,0,4,1,1,30.715847,33.7756,42.25,11.50055,921,5958,6879 546 | 545,29-06-2019,3,1,6,0,5,1,1,34.200847,39.33065,48.875,11.082939,829,4634,5463 547 | 546,30-06-2019,3,1,6,0,6,0,1,31.365,34.3754,60.125,10.791757,1455,4232,5687 548 | 547,01-07-2019,3,1,7,0,0,0,1,33.449153,37.53145,51.875,11.291443,1421,4110,5531 549 | 548,02-07-2019,3,1,7,0,1,1,1,32.048347,35.1019,44.7083,13.082889,904,5323,6227 550 | 549,03-07-2019,3,1,7,0,2,1,1,32.014153,35.1325,49.2083,8.457879,1052,5608,6660 551 | 550,04-07-2019,3,1,7,1,3,0,1,32.355847,36.61685,53.875,9.04165,2562,4841,7403 552 | 551,05-07-2019,3,1,7,0,4,1,1,33.9275,38.06835,45.7917,12.999943,1405,4836,6241 553 | 552,06-07-2019,3,1,7,0,5,1,1,33.961653,37.62665,45.0833,9.791514,1366,4841,6207 554 | 553,07-07-2019,3,1,7,0,6,0,1,35.328347,40.24565,49.2083,10.958118,1448,3392,4840 555 | 554,08-07-2019,3,1,7,0,0,0,1,33.7225,39.5198,57.375,8.417143,1203,3469,4672 556 | 555,09-07-2019,3,1,7,0,1,1,2,29.144153,32.7027,68.3333,12.125325,998,5571,6569 557 | 556,10-07-2019,3,1,7,0,2,1,2,29.554153,33.2398,66.75,10.166379,954,5336,6290 558 | 557,11-07-2019,3,1,7,0,3,1,1,29.383347,32.51355,63.3333,10.166111,975,6289,7264 559 | 558,12-07-2019,3,1,7,0,4,1,1,29.349153,32.73415,52.9583,9.833925,1032,6414,7446 560 | 559,13-07-2019,3,1,7,0,5,1,2,29.998347,33.39665,48.5833,5.41695,1511,5988,7499 561 | 560,14-07-2019,3,1,7,0,6,0,2,28.836653,33.3021,69.9167,9.626493,2355,4614,6969 562 | 561,15-07-2019,3,1,7,0,0,0,1,30.579153,35.2598,71.7917,11.166689,1920,4111,6031 563 | 562,16-07-2019,3,1,7,0,1,1,1,31.296653,36.20625,64.5,11.000529,1088,5742,6830 564 | 563,17-07-2019,3,1,7,0,2,1,1,33.551653,37.78415,50.5833,7.666743,921,5865,6786 565 | 564,18-07-2019,3,1,7,0,3,1,1,32.526653,37.27915,57.7083,9.208614,799,4914,5713 566 | 565,19-07-2019,3,1,7,0,4,1,1,31.57,35.7321,60.0417,11.083743,888,5703,6591 567 | 566,20-07-2019,3,1,7,0,5,1,2,27.299153,30.65125,84.4167,14.000789,747,5123,5870 568 | 
567,21-07-2019,3,1,7,0,6,0,3,24.429153,27.4956,86.5417,14.2911,1264,3195,4459 569 | 568,22-07-2019,3,1,7,0,0,0,2,27.3675,31.15625,76.25,6.2926936,2544,4866,7410 570 | 569,23-07-2019,3,1,7,0,1,1,1,30.408347,34.50085,69.4167,9.291761,1135,5831,6966 571 | 570,24-07-2019,3,1,7,0,2,1,1,30.784153,35.3225,65.5,14.167418,1140,6452,7592 572 | 571,25-07-2019,3,1,7,0,3,1,1,29.690847,32.7027,45,11.0416,1383,6790,8173 573 | 572,26-07-2019,3,1,7,0,4,1,1,31.843347,36.96315,59.6667,19.082471,1036,5825,6861 574 | 573,27-07-2019,3,1,7,0,5,1,1,32.048347,36.71085,59.4583,10.250464,1259,5645,6904 575 | 574,28-07-2019,3,1,7,0,6,0,1,30.989153,34.8802,61.3333,10.54245,2234,4451,6685 576 | 575,29-07-2019,3,1,7,0,0,0,1,29.588347,33.39665,62.375,11.416532,2153,4444,6597 577 | 576,30-07-2019,3,1,7,0,1,1,1,29.964153,34.24935,66.875,10.292339,1040,6065,7105 578 | 577,31-07-2019,3,1,7,0,2,1,1,29.246653,33.1448,70.4167,11.083475,968,6248,7216 579 | 578,01-08-2019,3,1,8,0,3,1,1,29.4175,33.3654,67.75,9.458993,1074,6506,7580 580 | 579,02-08-2019,3,1,8,0,4,1,1,30.8525,35.3544,65.9583,8.666718,983,6278,7261 581 | 580,03-08-2019,3,1,8,0,5,1,2,31.399153,36.14335,64.25,14.458064,1328,5847,7175 582 | 581,04-08-2019,3,1,8,0,6,0,1,32.526653,37.56335,61.3333,17.249686,2345,4479,6824 583 | 582,05-08-2019,3,1,8,0,0,0,1,31.535847,36.55395,65.25,19.458207,1707,3757,5464 584 | 583,06-08-2019,3,1,8,0,1,1,2,30.8525,35.5123,65.4167,8.666718,1233,5780,7013 585 | 584,07-08-2019,3,1,8,0,2,1,2,30.169153,34.88105,70.375,7.832836,1278,5995,7273 586 | 585,08-08-2019,3,1,8,0,3,1,2,30.75,35.38585,67.2917,7.4169,1263,6271,7534 587 | 586,09-08-2019,3,1,8,0,4,1,1,30.989153,34.9754,62.0417,10.4587,1196,6090,7286 588 | 587,10-08-2019,3,1,8,0,5,1,2,29.349153,33.3971,71.5833,16.000471,1065,4721,5786 589 | 588,11-08-2019,3,1,8,0,6,0,2,28.3925,31.91335,73.2917,13.834093,2247,4052,6299 590 | 589,12-08-2019,3,1,8,0,0,0,1,28.734153,32.22895,53.0417,8.208304,2182,4362,6544 591 | 
590,13-08-2019,3,1,8,0,1,1,1,29.554153,33.1127,54.5417,9.126204,1207,5676,6883 592 | 591,14-08-2019,3,1,8,0,2,1,1,29.793347,33.83895,68.6667,11.333586,1128,5656,6784 593 | 592,15-08-2019,3,1,8,0,3,1,1,28.973347,32.70185,61.9583,11.374657,1198,6149,7347 594 | 593,16-08-2019,3,1,8,0,4,1,1,29.485847,32.7344,51.9167,9.500332,1338,6267,7605 595 | 594,17-08-2019,3,1,8,0,5,1,1,29.656653,12.12,57.0833,15.500718,1483,5665,7148 596 | 595,18-08-2019,3,1,8,0,6,0,1,27.811653,30.90355,60.3333,11.917089,2827,5038,7865 597 | 596,19-08-2019,3,1,8,0,0,0,2,26.069153,30.1777,71.1667,5.79215,1208,3341,4549 598 | 597,20-08-2019,3,1,8,0,1,1,2,26.069153,29.79835,73.4167,8.708593,1026,5504,6530 599 | 598,21-08-2019,3,1,8,0,2,1,1,26.615847,30.05125,67.375,4.8756436,1081,5925,7006 600 | 599,22-08-2019,3,1,8,0,3,1,1,27.3675,31.0927,67.7083,4.7089811,1094,6281,7375 601 | 600,23-08-2019,3,1,8,0,4,1,1,28.529153,31.8504,63.5833,5.6679186,1363,6402,7765 602 | 601,24-08-2019,3,1,8,0,5,1,2,28.8025,32.355,61.5,4.8337686,1325,6257,7582 603 | 602,25-08-2019,3,1,8,0,6,0,2,27.128347,30.9348,71.2917,16.375336,1829,4224,6053 604 | 603,26-08-2019,3,1,8,0,0,0,2,26.786653,29.7998,84.5833,15.333486,1483,3772,5255 605 | 604,27-08-2019,3,1,8,0,1,1,1,28.836653,32.7344,73.0417,8.625111,989,5928,6917 606 | 605,28-08-2019,3,1,8,0,2,1,1,29.861653,33.3025,62,12.791975,935,6105,7040 607 | 606,29-08-2019,3,1,8,0,3,1,1,28.085,31.78665,55.2083,7.541654,1177,6520,7697 608 | 607,30-08-2019,3,1,8,0,4,1,1,28.973347,32.63895,59.0417,5.1668189,1172,6541,7713 609 | 608,31-08-2019,3,1,8,0,5,1,1,31.330847,34.47,58.75,11.291711,1433,5917,7350 610 | 609,01-09-2019,3,1,9,0,6,0,2,30.886653,35.1327,63.8333,7.583529,2352,3788,6140 611 | 610,02-09-2019,3,1,9,0,0,0,2,28.563347,32.45,81.5,4.2927436,2613,3197,5810 612 | 611,03-09-2019,3,1,9,1,1,0,1,29.0075,33.08145,79.0833,10.125107,1965,4069,6034 613 | 612,04-09-2019,3,1,9,0,2,1,1,29.759153,34.3444,75.5,15.833507,867,5997,6864 614 | 
613,05-09-2019,3,1,9,0,3,1,1,30.203347,35.44915,74.125,12.583136,832,6280,7112 615 | 614,06-09-2019,3,1,9,0,4,1,2,28.563347,32.76645,81.0417,9.542207,611,5592,6203 616 | 615,07-09-2019,3,1,9,0,5,1,1,28.836653,32.8602,73.625,11.500282,1045,6459,7504 617 | 616,08-09-2019,3,1,9,0,6,0,2,27.025847,30.55605,79.9167,18.833968,1557,4419,5976 618 | 617,09-09-2019,3,1,9,0,0,0,1,25.01,28.94625,54.75,15.041232,2570,5657,8227 619 | 618,10-09-2019,3,1,9,0,1,1,1,23.916653,28.2827,50.375,17.333771,1118,6407,7525 620 | 619,11-09-2019,3,1,9,0,2,1,1,23.6775,27.7146,52,6.1676314,1070,6697,7767 621 | 620,12-09-2019,3,1,9,0,3,1,1,24.565847,28.50375,57.7083,8.833682,1050,6820,7870 622 | 621,13-09-2019,3,1,9,0,4,1,1,25.1125,28.9779,63.7083,5.5422936,1054,6750,7804 623 | 622,14-09-2019,3,1,9,0,5,1,1,25.966653,29.70415,67.25,6.958821,1379,6630,8009 624 | 623,15-09-2019,3,1,9,0,6,0,1,24.941653,29.29335,50.1667,16.583907,3160,5554,8714 625 | 624,16-09-2019,3,1,9,0,0,0,1,23.78,28.15625,57,6.0422811,2166,5167,7333 626 | 625,17-09-2019,3,1,9,0,1,1,2,23.814153,27.6525,73.4583,10.166714,1022,5847,6869 627 | 626,18-09-2019,3,1,9,0,2,1,2,25.556653,28.25335,87.25,23.958329,371,3702,4073 628 | 627,19-09-2019,3,1,9,0,3,1,1,22.6525,27.0202,53.6667,14.416725,788,6803,7591 629 | 628,20-09-2019,3,1,9,0,4,1,1,22.413347,26.6096,61.8333,7.917189,939,6781,7720 630 | 629,21-09-2019,3,1,9,0,5,1,1,24.565847,28.59855,66.875,10.333343,1250,6917,8167 631 | 630,22-09-2019,3,1,9,0,6,0,1,26.65,30.5244,64.6667,19.000061,2512,5883,8395 632 | 631,23-09-2019,4,1,9,0,0,0,1,21.695847,25.94665,46.7083,14.958286,2454,5453,7907 633 | 632,24-09-2019,4,1,9,0,1,1,1,21.080847,25.12565,49.2917,9.541068,1001,6435,7436 634 | 633,25-09-2019,4,1,9,0,2,1,1,22.55,27.20895,57,15.833507,845,6693,7538 635 | 634,26-09-2019,4,1,9,0,3,1,1,26.035,29.83065,63.0833,16.3748,787,6946,7733 636 | 635,27-09-2019,4,1,9,0,4,1,2,26.65,30.39875,69.0833,9.000914,751,6642,7393 637 | 636,28-09-2019,4,1,9,0,5,1,2,25.385847,29.29315,69,10.999993,1045,6370,7415 
638 | 637,29-09-2019,4,1,9,0,6,0,1,22.2425,26.5148,54.2917,15.249468,2589,5966,8555 639 | 638,30-09-2019,4,1,9,0,0,0,1,21.593347,25.88315,58.3333,9.042186,2015,4874,6889 640 | 639,01-10-2019,4,1,10,0,1,1,2,21.354153,25.6,64.9167,6.0838814,763,6015,6778 641 | 640,02-10-2019,4,1,10,0,2,1,3,24.224153,27.11665,87.1667,6.999825,315,4324,4639 642 | 641,03-10-2019,4,1,10,0,3,1,2,26.9575,29.95665,79.375,4.4585686,728,6844,7572 643 | 642,04-10-2019,4,1,10,0,4,1,2,26.9575,30.39875,72.2917,7.875582,891,6437,7328 644 | 643,05-10-2019,4,1,10,0,5,1,1,25.215,29.00935,62.75,7.12545,1516,6640,8156 645 | 644,06-10-2019,4,1,10,0,6,0,1,22.720847,26.92605,66.4167,17.957675,3031,4934,7965 646 | 645,07-10-2019,4,1,10,0,0,0,2,17.049153,20.99065,70.8333,9.457854,781,2729,3510 647 | 646,08-10-2019,4,1,10,1,1,0,2,15.716653,19.3804,70.9583,12.708493,874,4604,5478 648 | 647,09-10-2019,4,1,10,0,2,1,2,18.313347,21.9056,76.1667,12.7501,601,5791,6392 649 | 648,10-10-2019,4,1,10,0,3,1,1,21.080847,25.1571,63.0833,12.584007,780,6911,7691 650 | 649,11-10-2019,4,1,10,0,4,1,1,17.835,21.55835,46.3333,12.166932,834,6736,7570 651 | 650,12-10-2019,4,1,10,0,5,1,1,17.9375,21.65355,53.9167,15.751164,1060,6222,7282 652 | 651,13-10-2019,4,1,10,0,6,0,1,16.126653,19.5698,49.4583,9.791514,2252,4857,7109 653 | 652,14-10-2019,4,1,10,0,0,0,1,21.388347,25.4102,64.0417,18.667004,2080,4559,6639 654 | 653,15-10-2019,4,1,10,0,1,1,2,23.028347,26.9575,70.75,19.834479,760,5115,5875 655 | 654,16-10-2019,4,1,10,0,2,1,1,19.201653,23.0423,55.8333,12.208807,922,6612,7534 656 | 655,17-10-2019,4,1,10,0,3,1,1,18.689153,22.5054,69.2917,6.791857,979,6482,7461 657 | 656,18-10-2019,4,1,10,0,4,1,2,21.4225,25.63125,72.8333,15.874779,1008,6501,7509 658 | 657,19-10-2019,4,1,10,0,5,1,2,23.096653,26.8948,81.5,9.041918,753,4671,5424 659 | 658,20-10-2019,4,1,10,0,6,0,1,19.850847,23.6421,57.2917,7.874979,2806,5284,8090 660 | 659,21-10-2019,4,1,10,0,0,0,1,19.030847,22.82145,51,11.125618,2132,4692,6824 661 | 
660,22-10-2019,4,1,10,0,1,1,1,19.9875,24.1471,56.8333,5.4593811,830,6228,7058 662 | 661,23-10-2019,4,1,10,0,2,1,1,22.310847,26.5152,64.1667,6.3345686,841,6625,7466 663 | 662,24-10-2019,4,1,10,0,3,1,1,24.0875,27.93605,63.625,4.8762064,795,6898,7693 664 | 663,25-10-2019,4,1,10,0,4,1,2,22.55,26.4844,80.0417,8.333125,875,6484,7359 665 | 664,26-10-2019,4,1,10,0,5,1,2,22.379153,26.1375,80.7083,8.875289,1182,6262,7444 666 | 665,27-10-2019,4,1,10,0,6,0,2,21.73,25.75665,72,15.791364,2643,5209,7852 667 | 666,28-10-2019,4,1,10,0,0,0,2,19.5775,23.38855,69.4583,26.666536,998,3461,4459 668 | 667,29-10-2019,4,1,10,0,1,1,3,18.04,21.97,88,23.9994,2,20,22 669 | 668,30-10-2019,4,1,10,0,2,1,2,13.045462,15.49545,82.5455,14.271603,87,1009,1096 670 | 669,31-10-2019,4,1,10,0,3,1,2,14.6575,18.055,66.6667,11.166689,419,5147,5566 671 | 670,01-11-2019,4,1,11,0,4,1,2,14.999153,18.4971,58.1667,10.542182,466,5520,5986 672 | 671,02-11-2019,4,1,11,0,5,1,1,14.555,17.8021,52.2083,17.833725,618,5229,5847 673 | 672,03-11-2019,4,1,11,0,6,0,2,14.076653,16.1923,49.125,18.125443,1029,4109,5138 674 | 673,04-11-2019,4,1,11,0,0,0,1,13.359153,16.4769,53.2917,12.000236,1201,3906,5107 675 | 674,05-11-2019,4,1,11,0,1,1,1,13.085847,15.40375,49.4167,15.833775,378,4881,5259 676 | 675,06-11-2019,4,1,11,0,2,1,1,11.514153,14.07835,56.7083,11.625371,466,5220,5686 677 | 676,07-11-2019,4,1,11,0,3,1,2,12.129153,13.73105,54.75,20.375236,326,4709,5035 678 | 677,08-11-2019,4,1,11,0,4,1,1,14.439134,17.09455,33.3478,23.304945,340,4975,5315 679 | 678,09-11-2019,4,1,11,0,5,1,1,14.828347,17.77065,54.0833,14.375386,709,5283,5992 680 | 679,10-11-2019,4,1,11,0,6,0,1,15.955847,19.69685,64.5417,3.8756686,2090,4446,6536 681 | 680,11-11-2019,4,1,11,0,0,0,1,17.254153,21.08565,65.9167,8.5425,2290,4562,6852 682 | 681,12-11-2019,4,1,11,1,1,0,1,19.885,23.76915,74.1667,11.625639,1097,5172,6269 683 | 682,13-11-2019,4,1,11,0,2,1,2,14.076653,16.16125,66.2917,22.917082,327,3767,4094 684 | 
683,14-11-2019,4,1,11,0,3,1,1,11.855847,14.07815,55.2083,13.374875,373,5122,5495 685 | 684,15-11-2019,4,1,11,0,4,1,2,13.188347,16.2246,62.0417,10.250129,320,5125,5445 686 | 685,16-11-2019,4,1,11,0,5,1,1,14.145,17.3602,52.4583,11.458675,484,5214,5698 687 | 686,17-11-2019,4,1,11,0,6,0,1,13.325,16.31915,54.5417,12.041843,1313,4316,5629 688 | 687,18-11-2019,4,1,11,0,0,0,1,14.0425,16.8873,69.2917,15.250004,922,3747,4669 689 | 688,19-11-2019,4,1,11,0,1,1,2,15.614153,18.78105,62.3333,15.749489,449,5050,5499 690 | 689,20-11-2019,4,1,11,0,2,1,2,15.340847,19.03335,68.5,5.542575,534,5100,5634 691 | 690,21-11-2019,4,1,11,0,3,1,1,14.486653,18.2446,61.375,6.917482,615,4531,5146 692 | 691,22-11-2019,4,1,11,1,4,0,1,13.94,17.51855,58.0417,3.5423436,955,1470,2425 693 | 692,23-11-2019,4,1,11,0,5,1,1,15.101653,18.93895,56.875,9.917407,1603,2307,3910 694 | 693,24-11-2019,4,1,11,0,6,0,1,11.411653,12.4371,40.4583,25.250357,532,1745,2277 695 | 694,25-11-2019,4,1,11,0,0,0,1,10.079153,12.87915,46.8333,10.0835,309,2115,2424 696 | 695,26-11-2019,4,1,11,0,1,1,1,12.846653,16.9502,53.5417,3.12555,337,4750,5087 697 | 696,27-11-2019,4,1,11,0,2,1,2,11.958347,14.0779,78.6667,15.916654,123,3836,3959 698 | 697,28-11-2019,4,1,11,0,3,1,1,12.163347,14.4881,50.625,14.125007,198,5062,5260 699 | 698,29-11-2019,4,1,11,0,4,1,1,11.51567,14.9211,55.5652,7.739974,243,5080,5323 700 | 699,30-11-2019,4,1,11,0,5,1,1,12.231653,16.19335,64.9583,3.9175436,362,5306,5668 701 | 700,01-12-2019,4,1,12,0,6,0,2,12.231653,15.8452,80.6667,4.0001814,951,4240,5191 702 | 701,02-12-2019,4,1,12,0,0,0,2,14.2475,17.9604,82.3333,8.333393,892,3757,4649 703 | 702,03-12-2019,4,1,12,0,1,1,1,18.5525,22.7898,76.75,5.5422936,555,5679,6234 704 | 703,04-12-2019,4,1,12,0,2,1,1,19.509153,23.4527,73.375,11.666643,551,6055,6606 705 | 704,05-12-2019,4,1,12,0,3,1,1,17.971653,21.4006,48.5,21.709407,331,5398,5729 706 | 705,06-12-2019,4,1,12,0,4,1,1,10.489153,12.9102,50.875,11.708518,340,5035,5375 707 | 
706,07-12-2019,4,1,12,0,5,1,2,13.154153,16.0979,76.4167,8.7502,349,4659,5008 708 | 707,08-12-2019,4,1,12,0,6,0,2,15.648347,19.4754,91.125,6.792393,1153,4429,5582 709 | 708,09-12-2019,4,1,12,0,0,0,2,15.750847,19.5073,90.5417,10.584325,441,2787,3228 710 | 709,10-12-2019,4,1,12,0,1,1,2,17.869153,21.77875,92.5,12.750636,329,4841,5170 711 | 710,11-12-2019,4,1,12,0,2,1,2,14.486653,16.91815,59.6667,19.834479,282,5219,5501 712 | 711,12-12-2019,4,1,12,0,3,1,2,12.1975,14.8669,53.8333,10.916779,310,5009,5319 713 | 712,13-12-2019,4,1,12,0,4,1,1,12.129153,14.7094,48.5833,11.666643,425,5107,5532 714 | 713,14-12-2019,4,1,12,0,5,1,1,11.548347,14.7096,64.2917,8.792343,429,5182,5611 715 | 714,15-12-2019,4,1,12,0,6,0,1,13.290847,16.91915,65.0417,7.12545,767,4280,5047 716 | 715,16-12-2019,4,1,12,0,0,0,2,14.8625,18.4969,83.875,6.749714,538,3248,3786 717 | 716,17-12-2019,4,1,12,0,1,1,2,16.126653,20.075,90.7083,6.5833061,212,4373,4585 718 | 717,18-12-2019,4,1,12,0,2,1,1,16.844153,20.4854,66.625,14.834068,433,5124,5557 719 | 718,19-12-2019,4,1,12,0,3,1,1,13.6325,17.1081,62.5417,12.334164,333,4934,5267 720 | 719,20-12-2019,4,1,12,0,4,1,2,13.53,16.76085,66.7917,8.875021,314,3814,4128 721 | 720,21-12-2019,1,1,12,0,5,1,2,13.393347,15.08835,55.6667,25.083661,221,3402,3623 722 | 721,22-12-2019,1,1,12,0,6,0,1,10.899153,11.80565,44.125,27.292182,205,1544,1749 723 | 722,23-12-2019,1,1,12,0,0,0,1,10.079153,12.97355,51.5417,8.916561,408,1379,1787 724 | 723,24-12-2019,1,1,12,0,1,1,2,9.483464,12.945,79.1304,5.1744368,174,746,920 725 | 724,25-12-2019,1,1,12,1,2,0,2,11.943464,14.72325,73.4783,11.304642,440,573,1013 726 | 725,26-12-2019,1,1,12,0,3,1,3,9.976653,11.01665,82.3333,21.208582,9,432,441 727 | 726,27-12-2019,1,1,12,0,4,1,2,10.420847,11.3321,65.2917,23.458911,247,1867,2114 728 | 727,28-12-2019,1,1,12,0,5,1,2,10.386653,12.7523,59,10.416557,644,2451,3095 729 | 728,29-12-2019,1,1,12,0,6,0,2,10.386653,12.12,75.2917,8.333661,159,1182,1341 730 | 
729,30-12-2019,1,1,12,0,0,0,1,10.489153,11.585,48.3333,23.500518,364,1432,1796 731 | 730,31-12-2019,1,1,12,0,1,1,2,8.849153,11.17435,57.75,10.374682,439,2290,2729 732 | -------------------------------------------------------------------------------- /Natural Language Processing/Module 1- Regex.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"name":"python","version":"3.7.10","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"## Regular Expressions\nA regular expression is a sequence of characters, called the pattern, that helps you find substrings in a given string.\n\nFor example, suppose you have a dataset of customer reviews about your restaurant and you want to extract the emojis from the reviews because they are a good predictor of the sentiment of a review.\n\nAs another example, virtual assistants such as Siri and Google Now use information retrieval to give you better results. When you ask them a query or ask them to search for something interesting on the screen, they look for common patterns such as emails, phone numbers, place names, dates and times, and so on. The assistant can then automatically make a booking or ask you to call the restaurant to make one.\n\nRegular expressions are a very powerful tool in text processing. 
They will help you clean and handle your text much more effectively.","metadata":{}},{"cell_type":"markdown","source":"### Let's import the regular expression library in Python.","metadata":{}},{"cell_type":"code","source":"import re","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.546550Z","iopub.execute_input":"2021-05-30T01:14:58.547080Z","iopub.status.idle":"2021-05-30T01:14:58.550261Z","shell.execute_reply.started":"2021-05-30T01:14:58.547047Z","shell.execute_reply":"2021-05-30T01:14:58.549574Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"Let's do a quick search using a pattern.","metadata":{}},{"cell_type":"code","source":"re.search('Suraaj', 'Suraaj is an exceptional student!')","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.558399Z","iopub.execute_input":"2021-05-30T01:14:58.558823Z","iopub.status.idle":"2021-05-30T01:14:58.567655Z","shell.execute_reply.started":"2021-05-30T01:14:58.558794Z","shell.execute_reply":"2021-05-30T01:14:58.566814Z"},"trusted":true},"execution_count":4,"outputs":[{"execution_count":4,"output_type":"execute_result","data":{"text/plain":""},"metadata":{}}]},{"cell_type":"code","source":"# print output of re.search()\nmatch = re.search('Suraaj', 'Suraaj is an exceptional student!')\nprint(match.group())","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.572011Z","iopub.execute_input":"2021-05-30T01:14:58.572264Z","iopub.status.idle":"2021-05-30T01:14:58.579384Z","shell.execute_reply.started":"2021-05-30T01:14:58.572240Z","shell.execute_reply":"2021-05-30T01:14:58.578583Z"},"trusted":true},"execution_count":5,"outputs":[{"name":"stdout","text":"Suraaj\n","output_type":"stream"}]},{"cell_type":"markdown","source":"Let's define a function to match regular expression patterns","metadata":{}},
{"cell_type":"code","source":"def find_pattern(text, patterns):\n    # Search once, using regex flags to ignore case and to handle multi-line text\n    match = re.search(patterns, text, flags=re.I | re.M)\n    if match:\n        print(\"starting point: \", match.start())\n        print(\"ending point: \", match.end())\n        # Return the same match object that was inspected above\n        return match\n    else:\n        return 'Not Found!'","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.582721Z","iopub.execute_input":"2021-05-30T01:14:58.583044Z","iopub.status.idle":"2021-05-30T01:14:58.591034Z","shell.execute_reply.started":"2021-05-30T01:14:58.583016Z","shell.execute_reply":"2021-05-30T01:14:58.589991Z"},"trusted":true},"execution_count":6,"outputs":[]},{"cell_type":"code","source":"find_pattern('My name is Suraaj hasija','suraaj')","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.594607Z","iopub.execute_input":"2021-05-30T01:14:58.595032Z","iopub.status.idle":"2021-05-30T01:14:58.605421Z","shell.execute_reply.started":"2021-05-30T01:14:58.595002Z","shell.execute_reply":"2021-05-30T01:14:58.604547Z"},"trusted":true},"execution_count":7,"outputs":[{"name":"stdout","text":"starting point: 11\nending point: 17\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Quantifiers","metadata":{}},{"cell_type":"code","source":"# '*': Zero or more \nprint(find_pattern(\"ac\", \"ab*\"))\nprint(find_pattern(\"abc\", \"ab*\"))\nprint(find_pattern(\"abbc\", \"ab*\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.613424Z","iopub.execute_input":"2021-05-30T01:14:58.613709Z","iopub.status.idle":"2021-05-30T01:14:58.620370Z","shell.execute_reply.started":"2021-05-30T01:14:58.613683Z","shell.execute_reply":"2021-05-30T01:14:58.619642Z"},"trusted":true},"execution_count":8,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\nstarting point: 0\nending point: 2\n\nstarting point: 0\nending point: 3\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# '?': Zero or one (tells whether a pattern is absent or present)\nprint(find_pattern(\"ac\", 
\"ab?\"))\nprint(find_pattern(\"abc\", \"ab?\"))\nprint(find_pattern(\"abbc\", \"ab?\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.633394Z","iopub.execute_input":"2021-05-30T01:14:58.633836Z","iopub.status.idle":"2021-05-30T01:14:58.641241Z","shell.execute_reply.started":"2021-05-30T01:14:58.633802Z","shell.execute_reply":"2021-05-30T01:14:58.640317Z"},"trusted":true},"execution_count":9,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\nstarting point: 0\nending point: 2\n\nstarting point: 0\nending point: 2\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# '+': One or more\nprint(find_pattern(\"ac\", \"ab+\"))\nprint(find_pattern(\"abc\", \"ab+\"))\nprint(find_pattern(\"abbc\", \"ab+\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.653532Z","iopub.execute_input":"2021-05-30T01:14:58.654030Z","iopub.status.idle":"2021-05-30T01:14:58.661760Z","shell.execute_reply.started":"2021-05-30T01:14:58.653988Z","shell.execute_reply":"2021-05-30T01:14:58.660770Z"},"trusted":true},"execution_count":10,"outputs":[{"name":"stdout","text":"Not Found!\nstarting point: 0\nending point: 2\n\nstarting point: 0\nending point: 3\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# {n}: Matches if a character is present exactly n number of times\nprint(find_pattern(\"abbc\", \"ab{2}\"))\n","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.670426Z","iopub.execute_input":"2021-05-30T01:14:58.670759Z","iopub.status.idle":"2021-05-30T01:14:58.676897Z","shell.execute_reply.started":"2021-05-30T01:14:58.670726Z","shell.execute_reply":"2021-05-30T01:14:58.675676Z"},"trusted":true},"execution_count":11,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 3\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# {m,n}: Matches if a character is present from m to n number of times\nprint(find_pattern(\"aabbbbbbc\", \"ab{3,5}\")) # return true if 'b' is present 3-5 
times\nprint(find_pattern(\"aabbbbbbc\", \"ab{7,10}\")) # return true if 'b' is present 7-10 times\nprint(find_pattern(\"aabbbbbbc\", \"ab{,10}\")) # return true if 'b' is present atmost 10 times\nprint(find_pattern(\"aabbbbbbc\", \"ab{10,}\")) # return true if 'b' is present from at least 10 times","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.729239Z","iopub.execute_input":"2021-05-30T01:14:58.729576Z","iopub.status.idle":"2021-05-30T01:14:58.737252Z","shell.execute_reply.started":"2021-05-30T01:14:58.729548Z","shell.execute_reply":"2021-05-30T01:14:58.736300Z"},"trusted":true},"execution_count":12,"outputs":[{"name":"stdout","text":"starting point: 1\nending point: 7\n\nNot Found!\nstarting point: 0\nending point: 1\n\nNot Found!\n","output_type":"stream"}]},{"cell_type":"code","source":"#grouping, OR concepts","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.748492Z","iopub.execute_input":"2021-05-30T01:14:58.748834Z","iopub.status.idle":"2021-05-30T01:14:58.754123Z","shell.execute_reply.started":"2021-05-30T01:14:58.748766Z","shell.execute_reply":"2021-05-30T01:14:58.753230Z"},"trusted":true},"execution_count":13,"outputs":[]},{"cell_type":"code","source":"# without re.compile() function\nresult = re.search(\"a+\", \"abc\")\n\nresult","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.768436Z","iopub.execute_input":"2021-05-30T01:14:58.769025Z","iopub.status.idle":"2021-05-30T01:14:58.773998Z","shell.execute_reply.started":"2021-05-30T01:14:58.768992Z","shell.execute_reply":"2021-05-30T01:14:58.773057Z"},"trusted":true},"execution_count":14,"outputs":[{"execution_count":14,"output_type":"execute_result","data":{"text/plain":""},"metadata":{}}]},{"cell_type":"code","source":"# using the re.compile() function\npattern = re.compile(\"a+\")\nresult = 
pattern.search(\"abc\")\nresult","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.775091Z","iopub.execute_input":"2021-05-30T01:14:58.775537Z","iopub.status.idle":"2021-05-30T01:14:58.790238Z","shell.execute_reply.started":"2021-05-30T01:14:58.775431Z","shell.execute_reply":"2021-05-30T01:14:58.789271Z"},"trusted":true},"execution_count":15,"outputs":[{"execution_count":15,"output_type":"execute_result","data":{"text/plain":""},"metadata":{}}]},{"cell_type":"markdown","source":"#### Q: Write a regular expression that matches any string that starts with one or more ‘1’s, followed by three or more ‘0’s, followed by any number of ones (zero or more), followed by ‘0’s (from one to seven), and then ends with either two or three ‘1’s.","metadata":{}},{"cell_type":"code","source":"text='11000011100111'\n# '$' anchors the final run of 1s to the end of the string\npattern = '^1+0{3,}1*0{1,7}1{2,3}$' # write your regex here\n\n# check whether pattern is present in string or not\nresult = re.search(pattern, text)\nresult","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.791844Z","iopub.execute_input":"2021-05-30T01:14:58.792365Z","iopub.status.idle":"2021-05-30T01:14:58.802385Z","shell.execute_reply.started":"2021-05-30T01:14:58.792332Z","shell.execute_reply":"2021-05-30T01:14:58.801630Z"},"trusted":true},"execution_count":16,"outputs":[{"execution_count":16,"output_type":"execute_result","data":{"text/plain":""},"metadata":{}}]},{"cell_type":"markdown","source":"### Anchors","metadata":{}},{"cell_type":"code","source":"# '^': Indicates start of a string\n# '$': Indicates end of string\n\nprint(find_pattern(\"James\", \"^J\")) # matches only if the string starts with 'J'\nprint(find_pattern(\"Pramod\", \"^J\")) # matches only if the string starts with 'J'\nprint(find_pattern(\"India\", \"a$\")) # matches only if the string ends with 'a'\nprint(find_pattern(\"Japan\", \"a$\")) # matches only if the string ends with 'a'\n","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.843299Z","iopub.execute_input":"2021-05-30T01:14:58.843818Z","iopub.status.idle":"2021-05-30T01:14:58.851585Z","shell.execute_reply.started":"2021-05-30T01:14:58.843770Z","shell.execute_reply":"2021-05-30T01:14:58.850664Z"},"trusted":true},"execution_count":17,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\nNot Found!\nstarting point: 4\nending point: 5\n\nNot Found!\n","output_type":"stream"}]},
{"cell_type":"markdown","source":"##### Note: suppose you’re asked to write a regex pattern that matches a string that starts with four characters, followed by three 0s and two 1s, followed by any two characters. Valid strings include abcd00011ft, jkds00011hf, etc. A pattern that satisfies this condition is \n**‘.{4}0{3}1{2}.{2}’**. \n\nYou can also use ‘....00011..’, where each dot acts as a placeholder for any single character. Both are correct regex patterns.\n\n\n","metadata":{}},{"cell_type":"markdown","source":"### Wildcard","metadata":{}},{"cell_type":"code","source":"# '.': Matches any character (except a newline, by default)\nprint(find_pattern(\"a\", \".\"))\nprint(find_pattern(\"#\", \".\"))\n","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.853218Z","iopub.execute_input":"2021-05-30T01:14:58.853492Z","iopub.status.idle":"2021-05-30T01:14:58.862714Z","shell.execute_reply.started":"2021-05-30T01:14:58.853464Z","shell.execute_reply":"2021-05-30T01:14:58.861933Z"},"trusted":true},"execution_count":18,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\nstarting point: 0\nending point: 1\n\n","output_type":"stream"}]},{"cell_type":"markdown","source":"**Q. Write a regular expression to match first names (consider only first names, i.e. 
there are no spaces in a name) that have length between three and fifteen characters**","metadata":{}},{"cell_type":"code","source":"text='Balasubrahmanyam'\npattern = '^.{3,15}$'# write your regex here\n\n# check whether pattern is present in string or not\nresult = re.search(pattern, text)\nif result != None:\n print(True)\nelse:\n print(False)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.864579Z","iopub.execute_input":"2021-05-30T01:14:58.865090Z","iopub.status.idle":"2021-05-30T01:14:58.875495Z","shell.execute_reply.started":"2021-05-30T01:14:58.865059Z","shell.execute_reply":"2021-05-30T01:14:58.874538Z"},"trusted":true},"execution_count":19,"outputs":[{"name":"stdout","text":"False\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Character sets","metadata":{}},{"cell_type":"code","source":"# Now we will look at '[' and ']'.\n# They're used for specifying a character class, which is a set of characters that you wish to match.\n# Characters can be listed individually as follows\nprint(find_pattern(\"a\", \"[abc]\"))\n\n# Or a range of characters can be indicated by giving two characters and separating them by a '-'.\nprint(find_pattern(\"c\", \"[a-c]\")) # same as above","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.883068Z","iopub.execute_input":"2021-05-30T01:14:58.883523Z","iopub.status.idle":"2021-05-30T01:14:58.891391Z","shell.execute_reply.started":"2021-05-30T01:14:58.883480Z","shell.execute_reply":"2021-05-30T01:14:58.890405Z"},"trusted":true},"execution_count":20,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\nstarting point: 0\nending point: 1\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# '^' is used inside character set to indicate complementary set\nprint(find_pattern(\"jjj\", \"[^abc]\")) # return true if neither of these is present - a,b or 
c","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.900298Z","iopub.execute_input":"2021-05-30T01:14:58.900989Z","iopub.status.idle":"2021-05-30T01:14:58.906383Z","shell.execute_reply.started":"2021-05-30T01:14:58.900921Z","shell.execute_reply":"2021-05-30T01:14:58.905296Z"},"trusted":true},"execution_count":21,"outputs":[{"name":"stdout","text":"starting point: 0\nending point: 1\n\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Character sets\n| Pattern | Matches |\n|----------|--------------------------------------------------------------------------------------------|\n| [abc] | Matches either an a, b or c character |\n| [abcABC] | Matches either an a, A, b, B, c or C character |\n| [a-z] | Matches any characters between a and z, including a and z |\n| [A-Z] | Matches any characters between A and Z, including A and Z |\n| [a-zA-Z] | Matches any characters between a and z, including a and z ignoring cases of the characters |\n| [0-9] | Matches any character which is a number between 0 and 9 |","metadata":{}},{"cell_type":"markdown","source":"### Meta sequences\n\n| Pattern | Equivalent to |\n|----------|------------------|\n| \\s | [ \\t\\n\\r\\f\\v] |\n| \\S | [^ \\t\\n\\r\\f\\v] |\n| \\d | [0-9] |\n| \\D | [^0-9] |\n| \\w | [a-zA-Z0-9_] |\n| \\W | [^a-zA-Z0-9_] |","metadata":{}},{"cell_type":"markdown","source":"**Write a regular expression with the help of meta-sequences that matches usernames of the users of a database. 
The username starts with one to ten letters, followed by a four-digit number**","metadata":{}},{"cell_type":"code","source":"string='suraaj199'\npattern = '^[a-z]{1,10}[0-9]{4}' # write your regex here\n\n# check whether pattern is present in string or not\nresult = re.search(pattern, string, re.I)\nresult\n\n# re.search() returns None when there is no match\nif result is not None:\n    print(\"Found\")\nelse:\n    print('Not Found')","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.907922Z","iopub.execute_input":"2021-05-30T01:14:58.908253Z","iopub.status.idle":"2021-05-30T01:14:58.920823Z","shell.execute_reply.started":"2021-05-30T01:14:58.908204Z","shell.execute_reply":"2021-05-30T01:14:58.919842Z"},"trusted":true},"execution_count":22,"outputs":[{"name":"stdout","text":"Not Found\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Greedy vs non-greedy regex","metadata":{}},{"cell_type":"code","source":"print(find_pattern(\"aabbbbbb\", \"ab{3,5}\")) # GREEDY: 'a' followed by as many 'b's as possible (up to 5)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.958112Z","iopub.execute_input":"2021-05-30T01:14:58.958438Z","iopub.status.idle":"2021-05-30T01:14:58.963675Z","shell.execute_reply.started":"2021-05-30T01:14:58.958406Z","shell.execute_reply":"2021-05-30T01:14:58.962803Z"},"trusted":true},"execution_count":23,"outputs":[{"name":"stdout","text":"starting point: 1\nending point: 7\n\n","output_type":"stream"}]},{"cell_type":"code","source":"print(find_pattern(\"aabbbbbb\", \"ab{3,5}?\")) # NON-GREEDY: 'a' followed by as few 'b's as possible (at least 3)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.965401Z","iopub.execute_input":"2021-05-30T01:14:58.965757Z","iopub.status.idle":"2021-05-30T01:14:58.974902Z","shell.execute_reply.started":"2021-05-30T01:14:58.965721Z","shell.execute_reply":"2021-05-30T01:14:58.974224Z"},"trusted":true},"execution_count":24,"outputs":[{"name":"stdout","text":"starting point: 1\nending point: 
5\n\n","output_type":"stream"}]},{"cell_type":"code","source":"# Example of HTML code (greedy)\nprint(re.search(\"<.*>\",\"My Page\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.976380Z","iopub.execute_input":"2021-05-30T01:14:58.976654Z","iopub.status.idle":"2021-05-30T01:14:58.987673Z","shell.execute_reply.started":"2021-05-30T01:14:58.976628Z","shell.execute_reply":"2021-05-30T01:14:58.986922Z"},"trusted":true},"execution_count":25,"outputs":[{"name":"stdout","text":"\n","output_type":"stream"}]},{"cell_type":"code","source":"# Example of HTML code (non-greedy)\nprint(re.search(\"<.*?>\",\"My Page\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.989058Z","iopub.execute_input":"2021-05-30T01:14:58.989304Z","iopub.status.idle":"2021-05-30T01:14:58.998529Z","shell.execute_reply.started":"2021-05-30T01:14:58.989280Z","shell.execute_reply":"2021-05-30T01:14:58.997559Z"},"trusted":true},"execution_count":26,"outputs":[{"name":"stdout","text":"\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### The five re functions you will use most often\n\nmatch() Determine if the RE matches at the beginning of the string\n\nsearch() Scan through a string, looking for any location where this RE matches\n\nfindall() Find all the substrings where the RE matches, and return them as a list\n\nfinditer() Find all substrings where the RE matches and return them as an iterator\n\nsub() Find all substrings where the RE matches and substitute them with the given string","metadata":{}},{"cell_type":"code","source":"# This function uses re.match(); the next cells show how it differs from re.search()\ndef match_pattern(text, patterns):\n    if re.match(patterns, text):\n        return re.match(patterns, text)\n    else:\n        return ('Not 
found!')","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:58.999928Z","iopub.execute_input":"2021-05-30T01:14:59.000317Z","iopub.status.idle":"2021-05-30T01:14:59.010477Z","shell.execute_reply.started":"2021-05-30T01:14:59.000275Z","shell.execute_reply":"2021-05-30T01:14:59.009552Z"},"trusted":true},"execution_count":27,"outputs":[]},{"cell_type":"code","source":"print(find_pattern(\"abbc\", \"b+\"))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.012156Z","iopub.execute_input":"2021-05-30T01:14:59.012611Z","iopub.status.idle":"2021-05-30T01:14:59.021623Z","shell.execute_reply.started":"2021-05-30T01:14:59.012566Z","shell.execute_reply":"2021-05-30T01:14:59.020738Z"},"trusted":true},"execution_count":28,"outputs":[{"name":"stdout","text":"starting point: 1\nending point: 3\n\n","output_type":"stream"}]},{"cell_type":"code","source":"print(match_pattern(\"abbc\", \"b+\")) # not found, because re.match() only matches at the start of the string, which is 'a' here","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.022979Z","iopub.execute_input":"2021-05-30T01:14:59.023350Z","iopub.status.idle":"2021-05-30T01:14:59.031095Z","shell.execute_reply.started":"2021-05-30T01:14:59.023309Z","shell.execute_reply":"2021-05-30T01:14:59.030150Z"},"trusted":true},"execution_count":29,"outputs":[{"name":"stdout","text":"Not found!\n","output_type":"stream"}]},{"cell_type":"code","source":"## Example usage of the sub() function. 
Replace Road with Rd.\n\nstreet = '21 Ramakrishna Road'\nprint(re.sub('Road', 'Rd', street))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.032292Z","iopub.execute_input":"2021-05-30T01:14:59.032657Z","iopub.status.idle":"2021-05-30T01:14:59.045048Z","shell.execute_reply.started":"2021-05-30T01:14:59.032620Z","shell.execute_reply":"2021-05-30T01:14:59.044281Z"},"trusted":true},"execution_count":30,"outputs":[{"name":"stdout","text":"21 Ramakrishna Rd\n","output_type":"stream"}]},{"cell_type":"code","source":"# R\w+ matches every word that starts with R, so 'Ramakrishna' is replaced as well\nprint(re.sub(r'R\w+', 'Rd', street))","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.047073Z","iopub.execute_input":"2021-05-30T01:14:59.047392Z","iopub.status.idle":"2021-05-30T01:14:59.056798Z","shell.execute_reply.started":"2021-05-30T01:14:59.047363Z","shell.execute_reply":"2021-05-30T01:14:59.056048Z"},"trusted":true},"execution_count":31,"outputs":[{"name":"stdout","text":"21 Rd Rd\n","output_type":"stream"}]},{"cell_type":"code","source":"# replace every digit with X\npattern = r\"\d\"\nreplacement = \"X\"\nstring = \"My address is 13B, Baker Street\"\n\nre.sub(pattern, replacement, string)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.058102Z","iopub.execute_input":"2021-05-30T01:14:59.058347Z","iopub.status.idle":"2021-05-30T01:14:59.069326Z","shell.execute_reply.started":"2021-05-30T01:14:59.058323Z","shell.execute_reply":"2021-05-30T01:14:59.068274Z"},"trusted":true},"execution_count":32,"outputs":[{"execution_count":32,"output_type":"execute_result","data":{"text/plain":"'My address is XXB, Baker Street'"},"metadata":{}}]},{"cell_type":"code","source":"## Example usage of finditer(). 
Find all occurrences of the word 'festival' in the given sentence\n\ntext = 'Diwali is a festival of lights, Holi is a festival of colors!'\npattern = 'festival'\nfor match in re.finditer(pattern, text):\n    print('START -', match.start(), end=\" \")\n    print('END -', match.end())","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.070508Z","iopub.execute_input":"2021-05-30T01:14:59.070796Z","iopub.status.idle":"2021-05-30T01:14:59.082221Z","shell.execute_reply.started":"2021-05-30T01:14:59.070768Z","shell.execute_reply":"2021-05-30T01:14:59.081544Z"},"trusted":true},"execution_count":33,"outputs":[{"name":"stdout","text":"START - 12 END - 20\nSTART - 42 END - 50\n","output_type":"stream"}]},{"cell_type":"code","source":"# Example usage of findall(). Find all dates in the given URL\nurl = \"http://www.telegraph.co.uk/formula-1/2017/10/28/mexican-grand-prix-2017-time-does-start-tv-channel-odds-lewisl/2017/05/12\"\ndate_regex = r'/(\d{4})/(\d{1,2})/(\d{1,2})'\nfinal=re.findall(date_regex, url)\nprint(final)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:17:45.394492Z","iopub.execute_input":"2021-05-30T01:17:45.394825Z","iopub.status.idle":"2021-05-30T01:17:45.400163Z","shell.execute_reply.started":"2021-05-30T01:17:45.394796Z","shell.execute_reply":"2021-05-30T01:17:45.399248Z"},"trusted":true},"execution_count":46,"outputs":[{"name":"stdout","text":"[('2017', '10', '28'), ('2017', '05', '12')]\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### Q. Write a regular expression to extract all the words from a given sentence. Then use the re.finditer() function and store all matched words that are at least 5 letters long in a separate list called result.\n\n","metadata":{}},{"cell_type":"code","source":"string=\"Do not compare apples with oranges. 
Compare apples with apples\"\n# regex pattern\npattern = r'\w+' # regex to extract all the words from a given piece of text\n\n# store results in the list 'result'\nresult = []\n\n# iterate over the matches\nfor match in re.finditer(pattern,string): # finditer yields one match object per occurrence of 'pattern' in 'string'\n    if len(match.group()) >= 5:\n        result.append(match.group()) # store the matched word, not the match object\n","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.094601Z","iopub.execute_input":"2021-05-30T01:14:59.094875Z","iopub.status.idle":"2021-05-30T01:14:59.104595Z","shell.execute_reply.started":"2021-05-30T01:14:59.094851Z","shell.execute_reply":"2021-05-30T01:14:59.103569Z"},"trusted":true},"execution_count":35,"outputs":[]},{"cell_type":"code","source":"result","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.106180Z","iopub.execute_input":"2021-05-30T01:14:59.106939Z","iopub.status.idle":"2021-05-30T01:14:59.121109Z","shell.execute_reply.started":"2021-05-30T01:14:59.106895Z","shell.execute_reply":"2021-05-30T01:14:59.120321Z"},"trusted":true},"execution_count":36,"outputs":[{"execution_count":36,"output_type":"execute_result","data":{"text/plain":"['compare', 'apples', 'oranges', 'Compare', 'apples', 'apples']"},"metadata":{}}]},{"cell_type":"code","source":"## Exploring Groups\nm1 = re.search(date_regex, url)\nprint(m1.group()) ## print the entire match","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.122239Z","iopub.execute_input":"2021-05-30T01:14:59.122499Z","iopub.status.idle":"2021-05-30T01:14:59.131738Z","shell.execute_reply.started":"2021-05-30T01:14:59.122475Z","shell.execute_reply":"2021-05-30T01:14:59.130735Z"},"trusted":true},"execution_count":37,"outputs":[{"name":"stdout","text":"/2017/10/28\n","output_type":"stream"}]},{"cell_type":"code","source":"print(m1.group(1)) # - Print first 
group","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.132906Z","iopub.execute_input":"2021-05-30T01:14:59.133378Z","iopub.status.idle":"2021-05-30T01:14:59.141096Z","shell.execute_reply.started":"2021-05-30T01:14:59.133343Z","shell.execute_reply":"2021-05-30T01:14:59.140239Z"},"trusted":true},"execution_count":38,"outputs":[{"name":"stdout","text":"2017\n","output_type":"stream"}]},{"cell_type":"code","source":"print(m1.group(2)) # - Print second group","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.142541Z","iopub.execute_input":"2021-05-30T01:14:59.142834Z","iopub.status.idle":"2021-05-30T01:14:59.152405Z","shell.execute_reply.started":"2021-05-30T01:14:59.142810Z","shell.execute_reply":"2021-05-30T01:14:59.151318Z"},"trusted":true},"execution_count":39,"outputs":[{"name":"stdout","text":"10\n","output_type":"stream"}]},{"cell_type":"code","source":"print(m1.group(3)) # - Print third group","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.153716Z","iopub.execute_input":"2021-05-30T01:14:59.154331Z","iopub.status.idle":"2021-05-30T01:14:59.163939Z","shell.execute_reply.started":"2021-05-30T01:14:59.154298Z","shell.execute_reply":"2021-05-30T01:14:59.162869Z"},"trusted":true},"execution_count":40,"outputs":[{"name":"stdout","text":"28\n","output_type":"stream"}]},{"cell_type":"code","source":"print(m1.group(0)) # - Print zero or the default group","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.165608Z","iopub.execute_input":"2021-05-30T01:14:59.166117Z","iopub.status.idle":"2021-05-30T01:14:59.176700Z","shell.execute_reply.started":"2021-05-30T01:14:59.166086Z","shell.execute_reply":"2021-05-30T01:14:59.175748Z"},"trusted":true},"execution_count":41,"outputs":[{"name":"stdout","text":"/2017/10/28\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### Q: Write a regular expression to extract the domain name from an email address. 
The format of the email is simple - the part before the ‘@’ symbol contains letters, numbers and underscores. The part after the ‘@’ symbol contains only letters followed by a dot followed by ‘com’ \n\n\nSample input: \nuser_name_123@gmail.com \n \nExpected output: \ngmail.com","metadata":{}},{"cell_type":"code","source":"string='suraaj.hasija@mastercard.com'\n# regex pattern\n# [A-Za-z] is used instead of [A-z], because [A-z] also matches the punctuation characters between Z and a\npattern = r\"\w+@([A-Za-z]+\.com)\"\n\n# store result\nresult = re.search(pattern, string)\n\n# extract domain using group command\nif result is not None:\n    domain = result.group(1)\nelse:\n    domain = \"NA\"\n\n# evaluate result - don't change the following piece of code, it is used to evaluate your regex\nprint(domain)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.178137Z","iopub.execute_input":"2021-05-30T01:14:59.178603Z","iopub.status.idle":"2021-05-30T01:14:59.191053Z","shell.execute_reply.started":"2021-05-30T01:14:59.178572Z","shell.execute_reply":"2021-05-30T01:14:59.190076Z"},"trusted":true},"execution_count":42,"outputs":[{"name":"stdout","text":"mastercard.com\n","output_type":"stream"}]},{"cell_type":"markdown","source":"#### Q: Find all file names that end with '.jpg'","metadata":{}},{"cell_type":"code","source":"# items contains all the files and folders of current directory\nitems = ['photos', 'documents', 'videos', 'image001.jpg','image002.jpg','image005.jpg', 'wallpaper.jpg',\n 'flower.jpg', 'earth.jpg', 'monkey.jpg', 'image002.png']\n\n# create an empty list to store resultant files\nimages = []\n\n# regex pattern to extract files that end with '.jpg'\npattern = r\".*\.jpg$\"\n\nfor item in items:\n    if re.search(pattern, item):\n        images.append(item)\n\n# print 
result\nprint(images)","metadata":{"execution":{"iopub.status.busy":"2021-05-30T01:14:59.192361Z","iopub.execute_input":"2021-05-30T01:14:59.193038Z","iopub.status.idle":"2021-05-30T01:14:59.202320Z","shell.execute_reply.started":"2021-05-30T01:14:59.192928Z","shell.execute_reply":"2021-05-30T01:14:59.201586Z"},"trusted":true},"execution_count":43,"outputs":[{"name":"stdout","text":"['image001.jpg', 'image002.jpg', 'image005.jpg', 'wallpaper.jpg', 'flower.jpg', 'earth.jpg', 'monkey.jpg']\n","output_type":"stream"}]},{"cell_type":"code","source":"","metadata":{},"execution_count":null,"outputs":[]}]} -------------------------------------------------------------------------------- /Natural Language Processing/Module 2- Lexical Processing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "execution": { 7 | "iopub.execute_input": "2021-07-05T20:38:54.194331Z", 8 | "iopub.status.busy": "2021-07-05T20:38:54.193698Z", 9 | "iopub.status.idle": "2021-07-05T20:38:56.031747Z", 10 | "shell.execute_reply": "2021-07-05T20:38:56.03012Z", 11 | "shell.execute_reply.started": "2021-07-05T20:38:54.194202Z" 12 | } 13 | }, 14 | "source": [ 15 | "## Step 1: Reading data from a URL and Processing it using NLTK Library and Regex" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": {}, 22 | "outputs": [], 23 | "source": [ 24 | "import requests\n", 25 | "from nltk import FreqDist\n", 26 | "import seaborn as sbn\n", 27 | "from nltk.corpus import stopwords\n", 28 | "from matplotlib import pyplot" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": null, 34 | "metadata": { 35 | "execution": { 36 | "iopub.execute_input": "2021-07-05T20:38:56.034104Z", 37 | "iopub.status.busy": "2021-07-05T20:38:56.033739Z", 38 | "iopub.status.idle": "2021-07-05T20:39:00.860009Z", 39 | "shell.execute_reply": 
"2021-07-05T20:39:00.858719Z", 40 | "shell.execute_reply.started": "2021-07-05T20:38:56.03407Z" 41 | } 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "url='https://www.gutenberg.org/files/11/11-0.txt'\n", 46 | "alice=requests.get(url)" 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": null, 52 | "metadata": { 53 | "execution": { 54 | "iopub.execute_input": "2021-07-05T20:39:00.863736Z", 55 | "iopub.status.busy": "2021-07-05T20:39:00.863248Z", 56 | "iopub.status.idle": "2021-07-05T20:39:00.882281Z", 57 | "shell.execute_reply": "2021-07-05T20:39:00.88111Z", 58 | "shell.execute_reply.started": "2021-07-05T20:39:00.863683Z" 59 | }, 60 | "scrolled": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "print(alice.content.decode(\"utf8\"))" 65 | ] 66 | }, 67 | { 68 | "cell_type": "code", 69 | "execution_count": null, 70 | "metadata": { 71 | "execution": { 72 | "iopub.execute_input": "2021-07-05T20:39:00.8849Z", 73 | "iopub.status.busy": "2021-07-05T20:39:00.8844Z", 74 | "iopub.status.idle": "2021-07-05T20:39:00.897859Z", 75 | "shell.execute_reply": "2021-07-05T20:39:00.896772Z", 76 | "shell.execute_reply.started": "2021-07-05T20:39:00.884843Z" 77 | } 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "#words=[i for i in alice.content.decode(\"utf8\").split()]\n", 82 | "#or \n", 83 | "words=alice.content.decode(\"utf8\").split()\n" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "execution": { 91 | "iopub.execute_input": "2021-07-05T20:39:00.899663Z", 92 | "iopub.status.busy": "2021-07-05T20:39:00.899357Z", 93 | "iopub.status.idle": "2021-07-05T20:39:00.938379Z", 94 | "shell.execute_reply": "2021-07-05T20:39:00.936561Z", 95 | "shell.execute_reply.started": "2021-07-05T20:39:00.899634Z" 96 | }, 97 | "scrolled": true 98 | }, 99 | "outputs": [], 100 | "source": [ 101 | "words" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": null, 107 | "metadata": { 108 | 
"execution": { 109 | "iopub.execute_input": "2021-07-05T20:39:00.940332Z", 110 | "iopub.status.busy": "2021-07-05T20:39:00.939926Z", 111 | "iopub.status.idle": "2021-07-05T20:39:01.027598Z", 112 | "shell.execute_reply": "2021-07-05T20:39:01.026764Z", 113 | "shell.execute_reply.started": "2021-07-05T20:39:00.940296Z" 114 | } 115 | }, 116 | "outputs": [], 117 | "source": [ 118 | "import re \n", 119 | "new_words=[]\n", 120 | "for i in words:\n", 121 | " re.sub('\\â\\x80\\x99','\\'', i)\n", 122 | " new_words.append(re.sub('\\â\\x80\\x99','\\'', i))\n", 123 | " " 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": null, 129 | "metadata": { 130 | "execution": { 131 | "iopub.execute_input": "2021-07-05T20:39:01.029376Z", 132 | "iopub.status.busy": "2021-07-05T20:39:01.02876Z", 133 | "iopub.status.idle": "2021-07-05T20:39:01.097021Z", 134 | "shell.execute_reply": "2021-07-05T20:39:01.095808Z", 135 | "shell.execute_reply.started": "2021-07-05T20:39:01.02933Z" 136 | }, 137 | "scrolled": true 138 | }, 139 | "outputs": [], 140 | "source": [ 141 | "import re \n", 142 | "final_words=[]\n", 143 | "for j,i in enumerate(new_words):\n", 144 | " result=re.search('[a-zA-Z0-9].*[^, .]',i)\n", 145 | " try:\n", 146 | " match=result.group(0)\n", 147 | " except AttributeError:\n", 148 | " match=result\n", 149 | " final_words.append(match)\n", 150 | " \n", 151 | " " 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "execution": { 159 | "iopub.execute_input": "2021-07-05T20:39:01.100328Z", 160 | "iopub.status.busy": "2021-07-05T20:39:01.099964Z", 161 | "iopub.status.idle": "2021-07-05T20:39:01.125478Z", 162 | "shell.execute_reply": "2021-07-05T20:39:01.124277Z", 163 | "shell.execute_reply.started": "2021-07-05T20:39:01.100294Z" 164 | }, 165 | "scrolled": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "final_words" 170 | ] 171 | }, 172 | { 173 | "cell_type": "code", 174 | "execution_count": null, 
175 | "metadata": { 176 | "execution": { 177 | "iopub.execute_input": "2021-07-05T20:39:01.127596Z", 178 | "iopub.status.busy": "2021-07-05T20:39:01.127286Z", 179 | "iopub.status.idle": "2021-07-05T20:39:01.142627Z", 180 | "shell.execute_reply": "2021-07-05T20:39:01.141708Z", 181 | "shell.execute_reply.started": "2021-07-05T20:39:01.127568Z" 182 | } 183 | }, 184 | "outputs": [], 185 | "source": [ 186 | "final_words=list(filter(None, (final_words)))\n", 187 | "final_words=list(map(lambda x: x.lower(),final_words))" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": { 194 | "execution": { 195 | "iopub.execute_input": "2021-07-05T20:39:01.144194Z", 196 | "iopub.status.busy": "2021-07-05T20:39:01.143758Z", 197 | "iopub.status.idle": "2021-07-05T20:39:01.1552Z", 198 | "shell.execute_reply": "2021-07-05T20:39:01.154242Z", 199 | "shell.execute_reply.started": "2021-07-05T20:39:01.144157Z" 200 | } 201 | }, 202 | "outputs": [], 203 | "source": [ 204 | "#final_text=' '.join(final_words)\n", 205 | "final_text=final_words" 206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "metadata": { 212 | "execution": { 213 | "iopub.execute_input": "2021-07-05T20:39:01.156859Z", 214 | "iopub.status.busy": "2021-07-05T20:39:01.156334Z", 215 | "iopub.status.idle": "2021-07-05T20:39:01.188353Z", 216 | "shell.execute_reply": "2021-07-05T20:39:01.187578Z", 217 | "shell.execute_reply.started": "2021-07-05T20:39:01.156762Z" 218 | }, 219 | "scrolled": true 220 | }, 221 | "outputs": [], 222 | "source": [ 223 | "final_text" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": null, 229 | "metadata": { 230 | "execution": { 231 | "iopub.execute_input": "2021-07-05T20:39:01.189981Z", 232 | "iopub.status.busy": "2021-07-05T20:39:01.189535Z", 233 | "iopub.status.idle": "2021-07-05T20:39:01.200424Z", 234 | "shell.execute_reply": "2021-07-05T20:39:01.199453Z", 235 | "shell.execute_reply.started": 
"2021-07-05T20:39:01.189948Z" 236 | }, 237 | "scrolled": true 238 | }, 239 | "outputs": [], 240 | "source": [ 241 | "#function for getting the most common words \n", 242 | "def barplot_most_common(text, top_n):\n", 243 | " word_freq=FreqDist(text)\n", 244 | " pyplot.figure(figsize=(10,8))\n", 245 | " \n", 246 | " labels=[i[0] for i in word_freq.most_common(top_n)]\n", 247 | " counts=[i[1] for i in word_freq.most_common(top_n)]\n", 248 | " plot=sbn.barplot(labels,counts)\n", 249 | " \n", 250 | " return plot\n", 251 | " " 252 | ] 253 | }, 254 | { 255 | "cell_type": "code", 256 | "execution_count": null, 257 | "metadata": { 258 | "execution": { 259 | "iopub.execute_input": "2021-07-05T20:39:01.202128Z", 260 | "iopub.status.busy": "2021-07-05T20:39:01.201658Z", 261 | "iopub.status.idle": "2021-07-05T20:39:01.660235Z", 262 | "shell.execute_reply": "2021-07-05T20:39:01.659042Z", 263 | "shell.execute_reply.started": "2021-07-05T20:39:01.202083Z" 264 | } 265 | }, 266 | "outputs": [], 267 | "source": [ 268 | "barplot_most_common(final_text,15)" 269 | ] 270 | }, 271 | { 272 | "cell_type": "code", 273 | "execution_count": null, 274 | "metadata": { 275 | "execution": { 276 | "iopub.execute_input": "2021-07-05T20:39:01.661898Z", 277 | "iopub.status.busy": "2021-07-05T20:39:01.661588Z", 278 | "iopub.status.idle": "2021-07-05T20:39:01.683303Z", 279 | "shell.execute_reply": "2021-07-05T20:39:01.682279Z", 280 | "shell.execute_reply.started": "2021-07-05T20:39:01.661868Z" 281 | } 282 | }, 283 | "outputs": [], 284 | "source": [ 285 | "print(stopwords.words('english'))\n", 286 | "stop_words=stopwords.words('english')" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": null, 292 | "metadata": { 293 | "execution": { 294 | "iopub.execute_input": "2021-07-05T20:39:01.684855Z", 295 | "iopub.status.busy": "2021-07-05T20:39:01.684565Z", 296 | "iopub.status.idle": "2021-07-05T20:39:01.735036Z", 297 | "shell.execute_reply": "2021-07-05T20:39:01.733749Z", 298 | 
"shell.execute_reply.started": "2021-07-05T20:39:01.684826Z" 299 | } 300 | }, 301 | "outputs": [], 302 | "source": [ 303 | "words_filtered=[i for i in final_text if i not in stop_words]" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": null, 309 | "metadata": { 310 | "execution": { 311 | "iopub.execute_input": "2021-07-05T20:39:01.736605Z", 312 | "iopub.status.busy": "2021-07-05T20:39:01.736298Z", 313 | "iopub.status.idle": "2021-07-05T20:39:02.022405Z", 314 | "shell.execute_reply": "2021-07-05T20:39:02.021189Z", 315 | "shell.execute_reply.started": "2021-07-05T20:39:01.736577Z" 316 | } 317 | }, 318 | "outputs": [], 319 | "source": [ 320 | "barplot_most_common(words_filtered,15)" 321 | ] 322 | }, 323 | { 324 | "cell_type": "markdown", 325 | "metadata": {}, 326 | "source": [ 327 | "## Step 2: Tokenization" 328 | ] 329 | }, 330 | { 331 | "cell_type": "code", 332 | "execution_count": null, 333 | "metadata": { 334 | "execution": { 335 | "iopub.execute_input": "2021-07-05T20:39:02.024748Z", 336 | "iopub.status.busy": "2021-07-05T20:39:02.024178Z", 337 | "iopub.status.idle": "2021-07-05T20:39:02.030661Z", 338 | "shell.execute_reply": "2021-07-05T20:39:02.029198Z", 339 | "shell.execute_reply.started": "2021-07-05T20:39:02.024698Z" 340 | }, 341 | "scrolled": true 342 | }, 343 | "outputs": [], 344 | "source": [ 345 | "document = \"At nine o'clock I visited him myself. 
It looks like religious mania, and he'll soon think that he himself is God.\"" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": null, 351 | "metadata": { 352 | "execution": { 353 | "iopub.execute_input": "2021-07-05T20:39:02.033314Z", 354 | "iopub.status.busy": "2021-07-05T20:39:02.032875Z", 355 | "iopub.status.idle": "2021-07-05T20:39:02.047186Z", 356 | "shell.execute_reply": "2021-07-05T20:39:02.046042Z", 357 | "shell.execute_reply.started": "2021-07-05T20:39:02.033278Z" 358 | } 359 | }, 360 | "outputs": [], 361 | "source": [ 362 | "document.split()" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": { 369 | "execution": { 370 | "iopub.execute_input": "2021-07-05T20:39:02.049249Z", 371 | "iopub.status.busy": "2021-07-05T20:39:02.048758Z", 372 | "iopub.status.idle": "2021-07-05T20:39:02.075877Z", 373 | "shell.execute_reply": "2021-07-05T20:39:02.074782Z", 374 | "shell.execute_reply.started": "2021-07-05T20:39:02.049213Z" 375 | } 376 | }, 377 | "outputs": [], 378 | "source": [ 379 | "from nltk.tokenize import word_tokenize\n", 380 | "print(word_tokenize(document))" 381 | ] 382 | }, 383 | { 384 | "cell_type": "code", 385 | "execution_count": null, 386 | "metadata": { 387 | "execution": { 388 | "iopub.execute_input": "2021-07-05T20:39:02.07749Z", 389 | "iopub.status.busy": "2021-07-05T20:39:02.07713Z", 390 | "iopub.status.idle": "2021-07-05T20:39:02.083724Z", 391 | "shell.execute_reply": "2021-07-05T20:39:02.082646Z", 392 | "shell.execute_reply.started": "2021-07-05T20:39:02.077457Z" 393 | } 394 | }, 395 | "outputs": [], 396 | "source": [ 397 | "from nltk.tokenize import sent_tokenize\n", 398 | "print(sent_tokenize(document))" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "execution": { 406 | "iopub.execute_input": "2021-07-05T20:39:02.085667Z", 407 | "iopub.status.busy": "2021-07-05T20:39:02.085293Z", 408 | "iopub.status.idle": 
"2021-07-05T20:39:02.099046Z", 409 | "shell.execute_reply": "2021-07-05T20:39:02.097918Z", 410 | "shell.execute_reply.started": "2021-07-05T20:39:02.085629Z" 411 | } 412 | }, 413 | "outputs": [], 414 | "source": [ 415 | "message = \"i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎\"" 416 | ] 417 | }, 418 | { 419 | "cell_type": "code", 420 | "execution_count": null, 421 | "metadata": { 422 | "execution": { 423 | "iopub.execute_input": "2021-07-05T20:39:02.100911Z", 424 | "iopub.status.busy": "2021-07-05T20:39:02.100411Z", 425 | "iopub.status.idle": "2021-07-05T20:39:02.118251Z", 426 | "shell.execute_reply": "2021-07-05T20:39:02.117182Z", 427 | "shell.execute_reply.started": "2021-07-05T20:39:02.100877Z" 428 | } 429 | }, 430 | "outputs": [], 431 | "source": [ 432 | "word_tokenize(message)" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": null, 438 | "metadata": { 439 | "execution": { 440 | "iopub.execute_input": "2021-07-05T20:39:02.123974Z", 441 | "iopub.status.busy": "2021-07-05T20:39:02.123575Z", 442 | "iopub.status.idle": "2021-07-05T20:39:02.132103Z", 443 | "shell.execute_reply": "2021-07-05T20:39:02.130628Z", 444 | "shell.execute_reply.started": "2021-07-05T20:39:02.123939Z" 445 | } 446 | }, 447 | "outputs": [], 448 | "source": [ 449 | "from nltk.tokenize import TweetTokenizer\n", 450 | "tweet_tokenizer=TweetTokenizer()" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": null, 456 | "metadata": { 457 | "execution": { 458 | "iopub.execute_input": "2021-07-05T20:39:02.134825Z", 459 | "iopub.status.busy": "2021-07-05T20:39:02.134399Z", 460 | "iopub.status.idle": "2021-07-05T20:39:02.148273Z", 461 | "shell.execute_reply": "2021-07-05T20:39:02.147243Z", 462 | "shell.execute_reply.started": "2021-07-05T20:39:02.134787Z" 463 | } 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "tweet_tokenizer.tokenize(message)\n", 468 | "#hashtags and smiles 
are managed well using Tweet Tokenizer()" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": null, 474 | "metadata": { 475 | "execution": { 476 | "iopub.execute_input": "2021-07-05T20:39:02.149715Z", 477 | "iopub.status.busy": "2021-07-05T20:39:02.149424Z", 478 | "iopub.status.idle": "2021-07-05T20:39:02.164554Z", 479 | "shell.execute_reply": "2021-07-05T20:39:02.163507Z", 480 | "shell.execute_reply.started": "2021-07-05T20:39:02.149687Z" 481 | } 482 | }, 483 | "outputs": [], 484 | "source": [ 485 | "from nltk.tokenize import regexp_tokenize\n", 486 | "message = \"i recently watched this show called mindhunters:). i totally loved it 😍. it was gr8 <3. #bingewatching #nothingtodo 😎\"\n", 487 | "pattern = \"#[\\w]+\"" 488 | ] 489 | }, 490 | { 491 | "cell_type": "code", 492 | "execution_count": null, 493 | "metadata": { 494 | "execution": { 495 | "iopub.execute_input": "2021-07-05T20:39:02.166163Z", 496 | "iopub.status.busy": "2021-07-05T20:39:02.165827Z", 497 | "iopub.status.idle": "2021-07-05T20:39:02.182151Z", 498 | "shell.execute_reply": "2021-07-05T20:39:02.180894Z", 499 | "shell.execute_reply.started": "2021-07-05T20:39:02.166114Z" 500 | } 501 | }, 502 | "outputs": [], 503 | "source": [ 504 | "#re.findall(pattern,message)\n", 505 | "regexp_tokenize(message,pattern)" 506 | ] 507 | }, 508 | { 509 | "cell_type": "markdown", 510 | "metadata": {}, 511 | "source": [ 512 | "## Step 3: Consolidating Pre-processing Steps & using it on a Spam vs Ham usecase" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "execution": { 520 | "iopub.execute_input": "2021-07-05T20:39:02.184273Z", 521 | "iopub.status.busy": "2021-07-05T20:39:02.183773Z", 522 | "iopub.status.idle": "2021-07-05T20:39:02.213438Z", 523 | "shell.execute_reply": "2021-07-05T20:39:02.212356Z", 524 | "shell.execute_reply.started": "2021-07-05T20:39:02.184222Z" 525 | } 526 | }, 527 | "outputs": [], 528 | "source": [ 529 | "import 
pandas as pd\n", 530 | "import os\n", 531 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n", 532 | " for filename in filenames:\n", 533 | " path=os.path.join(dirname, filename)\n", 534 | " print(os.path.join(dirname, filename))\n" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": null, 540 | "metadata": { 541 | "execution": { 542 | "iopub.execute_input": "2021-07-05T20:39:02.215051Z", 543 | "iopub.status.busy": "2021-07-05T20:39:02.21473Z", 544 | "iopub.status.idle": "2021-07-05T20:39:02.250752Z", 545 | "shell.execute_reply": "2021-07-05T20:39:02.249513Z", 546 | "shell.execute_reply.started": "2021-07-05T20:39:02.215018Z" 547 | } 548 | }, 549 | "outputs": [], 550 | "source": [ 551 | "df=pd.read_csv(path, sep='\\t', names=['label','message'])" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "execution": { 559 | "iopub.execute_input": "2021-07-05T20:39:02.252831Z", 560 | "iopub.status.busy": "2021-07-05T20:39:02.252376Z", 561 | "iopub.status.idle": "2021-07-05T20:39:02.27439Z", 562 | "shell.execute_reply": "2021-07-05T20:39:02.272624Z", 563 | "shell.execute_reply.started": "2021-07-05T20:39:02.252784Z" 564 | } 565 | }, 566 | "outputs": [], 567 | "source": [ 568 | "df.head()" 569 | ] 570 | }, 571 | { 572 | "cell_type": "code", 573 | "execution_count": null, 574 | "metadata": { 575 | "execution": { 576 | "iopub.execute_input": "2021-07-05T20:39:02.27663Z", 577 | "iopub.status.busy": "2021-07-05T20:39:02.276067Z", 578 | "iopub.status.idle": "2021-07-05T20:39:02.287802Z", 579 | "shell.execute_reply": "2021-07-05T20:39:02.286716Z", 580 | "shell.execute_reply.started": "2021-07-05T20:39:02.276588Z" 581 | } 582 | }, 583 | "outputs": [], 584 | "source": [ 585 | "#Now new preprocessing\n", 586 | "from nltk.stem.snowball import SnowballStemmer\n", 587 | "stemmer=SnowballStemmer('english')\n", 588 | " \n", 589 | "from nltk.stem import WordNetLemmatizer\n", 590 | "lemmatizer = 
WordNetLemmatizer() \n", 591 | "\n", 592 | "def preprocess(document, stem=True):\n", 593 | " 'lowercases the document, removes stopwords, then stems (stem=True) or lemmatizes (stem=False)'\n", 594 | "\n", 595 | " # change sentence to lower case\n", 596 | " document = document.lower()\n", 597 | "\n", 598 | " # tokenize into words\n", 599 | " words = word_tokenize(document)\n", 600 | " \n", 601 | " final_words=[]\n", 602 | " for i in words:\n", 603 | " result=re.search('([\w]+)',i)\n", 604 | " try:\n", 605 | " match=result.group(1)\n", 606 | " except AttributeError:\n", 607 | " match=result\n", 608 | " final_words.append(match)\n", 609 | " final_words=list(filter(None, (final_words)))\n", 610 | " final_words=list(map(lambda x: x.lower(),final_words))\n", 611 | " words=final_words \n", 612 | "\n", 613 | " # remove stop words\n", 614 | " words = [word for word in words if word not in stopwords.words(\"english\")]\n", 615 | " \n", 616 | " #stemming and lemmatization:\n", 617 | " \n", 618 | "# Stemming removes the last few characters from a word, \n", 619 | "#often leading to incorrect meanings and spelling. Lemmatization considers the context \n", 620 | "#and converts the word to its meaningful base form, which is called Lemma. 
\n", 621 | "#For instance, stemming the word 'Caring' would return 'Car'\n", 622 | " if stem:\n", 623 | " words=[stemmer.stem(i) for i in words]\n", 624 | " else:\n", 625 | " words=[lemmatizer.lemmatize(i,pos='v') for i in words]\n", 626 | " # join words to make sentence\n", 627 | " document = \" \".join(map( str, words))\n", 628 | " \n", 629 | " return document\n" 630 | ] 631 | }, 632 | { 633 | "cell_type": "code", 634 | "execution_count": null, 635 | "metadata": { 636 | "execution": { 637 | "iopub.execute_input": "2021-07-05T20:39:02.289442Z", 638 | "iopub.status.busy": "2021-07-05T20:39:02.289101Z", 639 | "iopub.status.idle": "2021-07-05T20:39:02.464844Z", 640 | "shell.execute_reply": "2021-07-05T20:39:02.463886Z", 641 | "shell.execute_reply.started": "2021-07-05T20:39:02.289411Z" 642 | } 643 | }, 644 | "outputs": [], 645 | "source": [ 646 | "#stemmed message\n", 647 | "messages=[preprocess(i,stem=True) for i in df.iloc[0:50].message]\n", 648 | "#bag of words\n", 649 | "from sklearn.feature_extraction.text import CountVectorizer\n", 650 | "vectorizer=CountVectorizer()\n", 651 | "model_stem=vectorizer.fit_transform(messages)" 652 | ] 653 | }, 654 | { 655 | "cell_type": "code", 656 | "execution_count": null, 657 | "metadata": { 658 | "execution": { 659 | "iopub.execute_input": "2021-07-05T20:39:02.466812Z", 660 | "iopub.status.busy": "2021-07-05T20:39:02.466381Z", 661 | "iopub.status.idle": "2021-07-05T20:39:02.475761Z", 662 | "shell.execute_reply": "2021-07-05T20:39:02.474548Z", 663 | "shell.execute_reply.started": "2021-07-05T20:39:02.466768Z" 664 | } 665 | }, 666 | "outputs": [], 667 | "source": [ 668 | "messages" 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": null, 674 | "metadata": { 675 | "execution": { 676 | "iopub.execute_input": "2021-07-05T20:39:02.477874Z", 677 | "iopub.status.busy": "2021-07-05T20:39:02.477514Z", 678 | "iopub.status.idle": "2021-07-05T20:39:02.489108Z", 679 | "shell.execute_reply": 
"2021-07-05T20:39:02.487906Z", 680 | "shell.execute_reply.started": "2021-07-05T20:39:02.47784Z" 681 | } 682 | }, 683 | "outputs": [], 684 | "source": [ 685 | "model_stem.shape" 686 | ] 687 | }, 688 | { 689 | "cell_type": "code", 690 | "execution_count": null, 691 | "metadata": { 692 | "execution": { 693 | "iopub.execute_input": "2021-07-05T20:39:02.49077Z", 694 | "iopub.status.busy": "2021-07-05T20:39:02.490358Z", 695 | "iopub.status.idle": "2021-07-05T20:39:02.503389Z", 696 | "shell.execute_reply": "2021-07-05T20:39:02.502179Z", 697 | "shell.execute_reply.started": "2021-07-05T20:39:02.490726Z" 698 | }, 699 | "scrolled": true 700 | }, 701 | "outputs": [], 702 | "source": [ 703 | "len(vectorizer.get_feature_names())" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "347 features with stemmer" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": null, 716 | "metadata": { 717 | "execution": { 718 | "iopub.execute_input": "2021-07-05T20:39:02.505508Z", 719 | "iopub.status.busy": "2021-07-05T20:39:02.50519Z", 720 | "iopub.status.idle": "2021-07-05T20:39:04.812005Z", 721 | "shell.execute_reply": "2021-07-05T20:39:04.811Z", 722 | "shell.execute_reply.started": "2021-07-05T20:39:02.505478Z" 723 | } 724 | }, 725 | "outputs": [], 726 | "source": [ 727 | "messages=[preprocess(i,stem=False) for i in df.iloc[0:50].message]\n", 728 | "#bag of words\n", 729 | "from sklearn.feature_extraction.text import CountVectorizer\n", 730 | "vectorizer=CountVectorizer()\n", 731 | "model_lem=vectorizer.fit_transform(messages)" 732 | ] 733 | }, 734 | { 735 | "cell_type": "code", 736 | "execution_count": null, 737 | "metadata": { 738 | "execution": { 739 | "iopub.execute_input": "2021-07-05T20:39:04.813922Z", 740 | "iopub.status.busy": "2021-07-05T20:39:04.813426Z", 741 | "iopub.status.idle": "2021-07-05T20:39:04.820197Z", 742 | "shell.execute_reply": "2021-07-05T20:39:04.81905Z", 743 | "shell.execute_reply.started": 
"2021-07-05T20:39:04.813878Z" 744 | } 745 | }, 746 | "outputs": [], 747 | "source": [ 748 | "model_lem.shape" 749 | ] 750 | }, 751 | { 752 | "cell_type": "markdown", 753 | "metadata": {}, 754 | "source": [ 755 | "349 features with lemmatization" 756 | ] 757 | }, 758 | { 759 | "cell_type": "code", 760 | "execution_count": null, 761 | "metadata": { 762 | "execution": { 763 | "iopub.execute_input": "2021-07-05T20:39:04.821869Z", 764 | "iopub.status.busy": "2021-07-05T20:39:04.821572Z", 765 | "iopub.status.idle": "2021-07-05T20:39:04.871538Z", 766 | "shell.execute_reply": "2021-07-05T20:39:04.870509Z", 767 | "shell.execute_reply.started": "2021-07-05T20:39:04.821841Z" 768 | } 769 | }, 770 | "outputs": [], 771 | "source": [ 772 | "pd.DataFrame(model_lem.toarray(), columns=vectorizer.get_feature_names())" 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": null, 778 | "metadata": { 779 | "execution": { 780 | "iopub.execute_input": "2021-07-05T20:39:04.87356Z", 781 | "iopub.status.busy": "2021-07-05T20:39:04.872977Z", 782 | "iopub.status.idle": "2021-07-05T20:39:04.877904Z", 783 | "shell.execute_reply": "2021-07-05T20:39:04.876849Z", 784 | "shell.execute_reply.started": "2021-07-05T20:39:04.873515Z" 785 | } 786 | }, 787 | "outputs": [], 788 | "source": [ 789 | "#Stemming doesn't apply any background knowledge whereas Lemmatization is a different process where understanding is built." 
790 | ] 791 | }, 792 | { 793 | "cell_type": "code", 794 | "execution_count": null, 795 | "metadata": { 796 | "execution": { 797 | "iopub.execute_input": "2021-07-05T20:39:04.879329Z", 798 | "iopub.status.busy": "2021-07-05T20:39:04.879026Z", 799 | "iopub.status.idle": "2021-07-05T20:39:04.894216Z", 800 | "shell.execute_reply": "2021-07-05T20:39:04.89301Z", 801 | "shell.execute_reply.started": "2021-07-05T20:39:04.8793Z" 802 | } 803 | }, 804 | "outputs": [], 805 | "source": [ 806 | "messages" 807 | ] 808 | }, 809 | { 810 | "cell_type": "markdown", 811 | "metadata": {}, 812 | "source": [ 813 | "## Step 4: Using TFIDF (short for term frequency–inverse document frequency)" 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": {}, 819 | "source": [ 820 | "### TFIDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus." 821 | ] 822 | }, 823 | { 824 | "cell_type": "code", 825 | "execution_count": null, 826 | "metadata": { 827 | "execution": { 828 | "iopub.execute_input": "2021-07-05T20:39:04.895869Z", 829 | "iopub.status.busy": "2021-07-05T20:39:04.895573Z", 830 | "iopub.status.idle": "2021-07-05T20:39:04.917456Z", 831 | "shell.execute_reply": "2021-07-05T20:39:04.916233Z", 832 | "shell.execute_reply.started": "2021-07-05T20:39:04.89584Z" 833 | } 834 | }, 835 | "outputs": [], 836 | "source": [ 837 | "#TF-IDF\n", 838 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 839 | "vectorize= TfidfVectorizer()\n", 840 | "model_tf_idf= vectorize.fit_transform(messages)" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": null, 846 | "metadata": { 847 | "execution": { 848 | "iopub.execute_input": "2021-07-05T20:39:04.919454Z", 849 | "iopub.status.busy": "2021-07-05T20:39:04.919104Z", 850 | "iopub.status.idle": "2021-07-05T20:39:04.927267Z", 851 | "shell.execute_reply": "2021-07-05T20:39:04.926252Z", 852 | "shell.execute_reply.started": 
"2021-07-05T20:39:04.919421Z" 853 | } 854 | }, 855 | "outputs": [], 856 | "source": [ 857 | "model_tf_idf.shape" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": null, 863 | "metadata": { 864 | "execution": { 865 | "iopub.execute_input": "2021-07-05T20:39:04.928798Z", 866 | "iopub.status.busy": "2021-07-05T20:39:04.928493Z", 867 | "iopub.status.idle": "2021-07-05T20:39:05.013661Z", 868 | "shell.execute_reply": "2021-07-05T20:39:05.012362Z", 869 | "shell.execute_reply.started": "2021-07-05T20:39:04.928768Z" 870 | } 871 | }, 872 | "outputs": [], 873 | "source": [ 874 | "pd.DataFrame(model_tf_idf.toarray(),columns=vectorizer.get_feature_names())" 875 | ] 876 | } 877 | ], 878 | "metadata": { 879 | "kernelspec": { 880 | "display_name": "Python 3", 881 | "language": "python", 882 | "name": "python3" 883 | }, 884 | "language_info": { 885 | "codemirror_mode": { 886 | "name": "ipython", 887 | "version": 3 888 | }, 889 | "file_extension": ".py", 890 | "mimetype": "text/x-python", 891 | "name": "python", 892 | "nbconvert_exporter": "python", 893 | "pygments_lexer": "ipython3", 894 | "version": "3.9.12" 895 | } 896 | }, 897 | "nbformat": 4, 898 | "nbformat_minor": 4 899 | } 900 | -------------------------------------------------------------------------------- /Natural Language Processing/Module 3- Model Building.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19", 8 | "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5", 9 | "execution": { 10 | "iopub.execute_input": "2021-07-05T21:13:27.815233Z", 11 | "iopub.status.busy": "2021-07-05T21:13:27.814644Z", 12 | "iopub.status.idle": "2021-07-05T21:13:27.831406Z", 13 | "shell.execute_reply": "2021-07-05T21:13:27.830205Z", 14 | "shell.execute_reply.started": "2021-07-05T21:13:27.815179Z" 15 | } 16 | }, 17 | "outputs": 
[], 18 | "source": [ 19 | "# This Python 3 environment comes with many helpful analytics libraries installed\n", 20 | "# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python\n", 21 | "# For example, here's several helpful packages to load\n", 22 | "\n", 23 | "import numpy as np # linear algebra\n", 24 | "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n", 25 | "\n", 26 | "# Input data files are available in the read-only \"../input/\" directory\n", 27 | "# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n", 28 | "\n", 29 | "import os\n", 30 | "for dirname, _, filenames in os.walk('/kaggle/input'):\n", 31 | " for filename in filenames:\n", 32 | " file1=os.path.join(dirname, filename)\n", 33 | " print(os.path.join(dirname, filename))\n", 34 | "\n", 35 | "# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using \"Save & Run All\" \n", 36 | "# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": { 43 | "execution": { 44 | "iopub.execute_input": "2021-07-05T21:13:27.984361Z", 45 | "iopub.status.busy": "2021-07-05T21:13:27.984008Z", 46 | "iopub.status.idle": "2021-07-05T21:13:28.030431Z", 47 | "shell.execute_reply": "2021-07-05T21:13:28.029675Z", 48 | "shell.execute_reply.started": "2021-07-05T21:13:27.984331Z" 49 | } 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "email=pd.read_csv(file1,encoding = \"latin-1\")\n", 54 | "email = email[['v1', 'v2']]\n", 55 | "email = email.rename(columns = {'v1': 'label', 'v2': 'message'})" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": null, 61 | "metadata": { 62 | "execution": { 63 | "iopub.execute_input": "2021-07-05T21:13:28.127205Z", 64 | "iopub.status.busy": 
"2021-07-05T21:13:28.126811Z", 65 | "iopub.status.idle": "2021-07-05T21:13:28.145341Z", 66 | "shell.execute_reply": "2021-07-05T21:13:28.144171Z", 67 | "shell.execute_reply.started": "2021-07-05T21:13:28.12717Z" 68 | } 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "email.head()" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "execution": { 80 | "iopub.execute_input": "2021-07-05T21:13:28.258649Z", 81 | "iopub.status.busy": "2021-07-05T21:13:28.258154Z", 82 | "iopub.status.idle": "2021-07-05T21:13:28.285508Z", 83 | "shell.execute_reply": "2021-07-05T21:13:28.284766Z", 84 | "shell.execute_reply.started": "2021-07-05T21:13:28.258617Z" 85 | } 86 | }, 87 | "outputs": [], 88 | "source": [ 89 | "email.label.value_counts(normalize=True)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": null, 95 | "metadata": { 96 | "execution": { 97 | "iopub.execute_input": "2021-07-05T21:13:28.34511Z", 98 | "iopub.status.busy": "2021-07-05T21:13:28.343662Z", 99 | "iopub.status.idle": "2021-07-05T21:13:28.353876Z", 100 | "shell.execute_reply": "2021-07-05T21:13:28.353071Z", 101 | "shell.execute_reply.started": "2021-07-05T21:13:28.345062Z" 102 | } 103 | }, 104 | "outputs": [], 105 | "source": [ 106 | "email['label']=email['label'].apply(lambda x: 0 if x=='ham' else 1 )" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": { 113 | "execution": { 114 | "iopub.execute_input": "2021-07-05T21:13:28.469091Z", 115 | "iopub.status.busy": "2021-07-05T21:13:28.468549Z", 116 | "iopub.status.idle": "2021-07-05T21:13:28.478945Z", 117 | "shell.execute_reply": "2021-07-05T21:13:28.477902Z", 118 | "shell.execute_reply.started": "2021-07-05T21:13:28.469055Z" 119 | } 120 | }, 121 | "outputs": [], 122 | "source": [ 123 | "email.label.value_counts(normalize=True)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "markdown", 128 | "metadata": {}, 129 | "source": [ 130 | "## Data 
Preprocessing" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "#### Saving as Message & Label as a tuple" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": null, 143 | "metadata": { 144 | "execution": { 145 | "iopub.execute_input": "2021-07-05T21:13:29.148261Z", 146 | "iopub.status.busy": "2021-07-05T21:13:29.147805Z", 147 | "iopub.status.idle": "2021-07-05T21:13:29.656339Z", 148 | "shell.execute_reply": "2021-07-05T21:13:29.65531Z", 149 | "shell.execute_reply.started": "2021-07-05T21:13:29.148225Z" 150 | } 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "data=[]\n", 155 | "for i,j in email.iterrows():\n", 156 | " data.append((j['message'],j['label']))" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": null, 162 | "metadata": { 163 | "execution": { 164 | "iopub.execute_input": "2021-07-05T21:13:29.657907Z", 165 | "iopub.status.busy": "2021-07-05T21:13:29.657616Z", 166 | "iopub.status.idle": "2021-07-05T21:13:29.663779Z", 167 | "shell.execute_reply": "2021-07-05T21:13:29.662661Z", 168 | "shell.execute_reply.started": "2021-07-05T21:13:29.657876Z" 169 | } 170 | }, 171 | "outputs": [], 172 | "source": [ 173 | "data[0:5]" 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": null, 179 | "metadata": { 180 | "execution": { 181 | "iopub.execute_input": "2021-07-05T21:13:29.665479Z", 182 | "iopub.status.busy": "2021-07-05T21:13:29.665178Z", 183 | "iopub.status.idle": "2021-07-05T21:13:29.679852Z", 184 | "shell.execute_reply": "2021-07-05T21:13:29.67869Z", 185 | "shell.execute_reply.started": "2021-07-05T21:13:29.66545Z" 186 | } 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "len(data)" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "metadata": {}, 196 | "source": [ 197 | "#### Creating a preprocessing function" 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": null, 203 | "metadata": { 204 
| "execution": { 205 | "iopub.execute_input": "2021-07-05T21:13:29.698896Z", 206 | "iopub.status.busy": "2021-07-05T21:13:29.698376Z", 207 | "iopub.status.idle": "2021-07-05T21:13:30.50071Z", 208 | "shell.execute_reply": "2021-07-05T21:13:30.499684Z", 209 | "shell.execute_reply.started": "2021-07-05T21:13:29.698863Z" 210 | } 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "#Preprocessing Libraries \n", 215 | "from nltk.tokenize import word_tokenize\n", 216 | "from nltk.corpus import stopwords\n", 217 | "from nltk.stem.snowball import SnowballStemmer\n", 218 | "from nltk.stem import WordNetLemmatizer\n", 219 | "from nltk import FreqDist" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": { 226 | "execution": { 227 | "iopub.execute_input": "2021-07-05T21:13:30.502402Z", 228 | "iopub.status.busy": "2021-07-05T21:13:30.502124Z", 229 | "iopub.status.idle": "2021-07-05T21:13:30.507122Z", 230 | "shell.execute_reply": "2021-07-05T21:13:30.50596Z", 231 | "shell.execute_reply.started": "2021-07-05T21:13:30.502374Z" 232 | } 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "#initialize\n", 237 | "stemmer=SnowballStemmer('english')\n", 238 | "lemmatizer=WordNetLemmatizer()" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": null, 244 | "metadata": { 245 | "execution": { 246 | "iopub.execute_input": "2021-07-05T21:13:30.50986Z", 247 | "iopub.status.busy": "2021-07-05T21:13:30.509381Z", 248 | "iopub.status.idle": "2021-07-05T21:13:30.569321Z", 249 | "shell.execute_reply": "2021-07-05T21:13:30.568112Z", 250 | "shell.execute_reply.started": "2021-07-05T21:13:30.509789Z" 251 | } 252 | }, 253 | "outputs": [], 254 | "source": [ 255 | "import re\n", 256 | "#cleaning data\n", 257 | "cleaned_data=[]\n", 258 | "\n", 259 | "for (i,j) in data:\n", 260 | " result=re.findall('[\\w]+',i)\n", 261 | " message=' '.join(result)\n", 262 | " cleaned_data.append((message,j))\n", 263 | " " 264 | ] 265 | }, 266 | { 
267 | "cell_type": "code", 268 | "execution_count": null, 269 | "metadata": { 270 | "execution": { 271 | "iopub.execute_input": "2021-07-05T21:13:30.571462Z", 272 | "iopub.status.busy": "2021-07-05T21:13:30.571077Z", 273 | "iopub.status.idle": "2021-07-05T21:13:30.578485Z", 274 | "shell.execute_reply": "2021-07-05T21:13:30.577351Z", 275 | "shell.execute_reply.started": "2021-07-05T21:13:30.571425Z" 276 | } 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "cleaned_data[0:5]" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": { 287 | "execution": { 288 | "iopub.execute_input": "2021-07-05T21:13:31.064592Z", 289 | "iopub.status.busy": "2021-07-05T21:13:31.064226Z", 290 | "iopub.status.idle": "2021-07-05T21:13:31.07112Z", 291 | "shell.execute_reply": "2021-07-05T21:13:31.069963Z", 292 | "shell.execute_reply.started": "2021-07-05T21:13:31.06456Z" 293 | } 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "def preprocessing(document,stem=True):\n", 298 | " \n", 299 | " words=document.lower() \n", 300 | " \n", 301 | " words=word_tokenize(words)\n", 302 | " \n", 303 | " words=[i for i in words if i not in stopwords.words('english')]\n", 304 | " \n", 305 | " if stem:\n", 306 | " words=[stemmer.stem(i) for i in words]\n", 307 | " else:\n", 308 | " words=[lemmatizer.lemmatize(i) for i in words]\n", 309 | " \n", 310 | " new_document=' '.join(words)\n", 311 | " \n", 312 | " return new_document\n", 313 | " \n", 314 | " " 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": { 321 | "execution": { 322 | "iopub.execute_input": "2021-07-05T21:13:31.858163Z", 323 | "iopub.status.busy": "2021-07-05T21:13:31.857797Z", 324 | "iopub.status.idle": "2021-07-05T21:13:47.254414Z", 325 | "shell.execute_reply": "2021-07-05T21:13:47.253265Z", 326 | "shell.execute_reply.started": "2021-07-05T21:13:31.858131Z" 327 | } 328 | }, 329 | "outputs": [], 330 | "source": [ 331 | "dataset=[]\n", 332 | 
"for (i,j) in cleaned_data:\n", 333 | " x=preprocessing(i,stem=False)\n", 334 | " dataset.append((x,j))" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": null, 340 | "metadata": { 341 | "execution": { 342 | "iopub.execute_input": "2021-07-05T21:13:47.256219Z", 343 | "iopub.status.busy": "2021-07-05T21:13:47.255936Z", 344 | "iopub.status.idle": "2021-07-05T21:13:47.264202Z", 345 | "shell.execute_reply": "2021-07-05T21:13:47.263377Z", 346 | "shell.execute_reply.started": "2021-07-05T21:13:47.256191Z" 347 | } 348 | }, 349 | "outputs": [], 350 | "source": [ 351 | "df = pd.DataFrame(dataset, columns =['message','label'])" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": { 358 | "execution": { 359 | "iopub.execute_input": "2021-07-05T21:13:47.265979Z", 360 | "iopub.status.busy": "2021-07-05T21:13:47.265556Z", 361 | "iopub.status.idle": "2021-07-05T21:13:47.284449Z", 362 | "shell.execute_reply": "2021-07-05T21:13:47.283405Z", 363 | "shell.execute_reply.started": "2021-07-05T21:13:47.265947Z" 364 | } 365 | }, 366 | "outputs": [], 367 | "source": [ 368 | "df.head()" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "### Creating Train and Test Data" 376 | ] 377 | }, 378 | { 379 | "cell_type": "code", 380 | "execution_count": null, 381 | "metadata": { 382 | "execution": { 383 | "iopub.execute_input": "2021-07-05T21:13:47.286439Z", 384 | "iopub.status.busy": "2021-07-05T21:13:47.286005Z", 385 | "iopub.status.idle": "2021-07-05T21:13:47.299587Z", 386 | "shell.execute_reply": "2021-07-05T21:13:47.298563Z", 387 | "shell.execute_reply.started": "2021-07-05T21:13:47.286395Z" 388 | } 389 | }, 390 | "outputs": [], 391 | "source": [ 392 | "from sklearn.model_selection import train_test_split\n", 393 | "X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size = 0.2, random_state = 1)" 394 | ] 395 | }, 396 | { 397 | 
"cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "execution": { 401 | "iopub.execute_input": "2021-07-05T21:13:47.301406Z", 402 | "iopub.status.busy": "2021-07-05T21:13:47.300983Z", 403 | "iopub.status.idle": "2021-07-05T21:13:47.315593Z", 404 | "shell.execute_reply": "2021-07-05T21:13:47.314328Z", 405 | "shell.execute_reply.started": "2021-07-05T21:13:47.301362Z" 406 | } 407 | }, 408 | "outputs": [], 409 | "source": [ 410 | "print(len(X_train)), print(len(X_test)), print(len(y_train)), print(len(y_test))" 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": null, 416 | "metadata": { 417 | "execution": { 418 | "iopub.execute_input": "2021-07-05T21:13:47.319078Z", 419 | "iopub.status.busy": "2021-07-05T21:13:47.318736Z", 420 | "iopub.status.idle": "2021-07-05T21:13:47.329832Z", 421 | "shell.execute_reply": "2021-07-05T21:13:47.328859Z", 422 | "shell.execute_reply.started": "2021-07-05T21:13:47.319045Z" 423 | } 424 | }, 425 | "outputs": [], 426 | "source": [ 427 | "X_train" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": null, 433 | "metadata": { 434 | "execution": { 435 | "iopub.execute_input": "2021-07-05T21:14:15.213978Z", 436 | "iopub.status.busy": "2021-07-05T21:14:15.213584Z", 437 | "iopub.status.idle": "2021-07-05T21:14:15.316889Z", 438 | "shell.execute_reply": "2021-07-05T21:14:15.31608Z", 439 | "shell.execute_reply.started": "2021-07-05T21:14:15.21394Z" 440 | } 441 | }, 442 | "outputs": [], 443 | "source": [ 444 | "from sklearn.feature_extraction.text import TfidfVectorizer\n", 445 | "vectorize= TfidfVectorizer()\n", 446 | "X_train_trans=vectorize.fit_transform(X_train)" 447 | ] 448 | }, 449 | { 450 | "cell_type": "markdown", 451 | "metadata": {}, 452 | "source": [ 453 | "### Buiding the model" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": { 460 | "execution": { 461 | "iopub.execute_input": "2021-07-05T21:14:23.705724Z", 
462 | "iopub.status.busy": "2021-07-05T21:14:23.705171Z", 463 | "iopub.status.idle": "2021-07-05T21:14:25.676392Z", 464 | "shell.execute_reply": "2021-07-05T21:14:25.675628Z", 465 | "shell.execute_reply.started": "2021-07-05T21:14:23.705676Z" 466 | } 467 | }, 468 | "outputs": [], 469 | "source": [ 470 | "#svm\n", 471 | "from sklearn.svm import SVC\n", 472 | "svm = SVC(C=1000)\n", 473 | "svm=svm.fit(X_train_trans, y_train)" 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": null, 479 | "metadata": { 480 | "execution": { 481 | "iopub.execute_input": "2021-07-05T21:14:26.113156Z", 482 | "iopub.status.busy": "2021-07-05T21:14:26.112589Z", 483 | "iopub.status.idle": "2021-07-05T21:14:27.273511Z", 484 | "shell.execute_reply": "2021-07-05T21:14:27.272437Z", 485 | "shell.execute_reply.started": "2021-07-05T21:14:26.113101Z" 486 | } 487 | }, 488 | "outputs": [], 489 | "source": [ 490 | "#random forest model\n", 491 | "from sklearn.ensemble import RandomForestClassifier\n", 492 | "ran_tree = RandomForestClassifier().fit(X_train_trans, y_train)\n" 493 | ] 494 | }, 495 | { 496 | "cell_type": "code", 497 | "execution_count": null, 498 | "metadata": { 499 | "execution": { 500 | "iopub.execute_input": "2021-07-05T21:14:28.050042Z", 501 | "iopub.status.busy": "2021-07-05T21:14:28.049312Z", 502 | "iopub.status.idle": "2021-07-05T21:14:28.099184Z", 503 | "shell.execute_reply": "2021-07-05T21:14:28.097881Z", 504 | "shell.execute_reply.started": "2021-07-05T21:14:28.049977Z" 505 | } 506 | }, 507 | "outputs": [], 508 | "source": [ 509 | "pd.DataFrame(zip(vectorize.get_feature_names(),ran_tree.feature_importances_,)).sort_values(by=1, ascending=False).head(10)" 510 | ] 511 | }, 512 | { 513 | "cell_type": "code", 514 | "execution_count": null, 515 | "metadata": { 516 | "execution": { 517 | "iopub.execute_input": "2021-07-05T21:14:28.27444Z", 518 | "iopub.status.busy": "2021-07-05T21:14:28.273868Z", 519 | "iopub.status.idle": "2021-07-05T21:14:28.35688Z", 
520 | "shell.execute_reply": "2021-07-05T21:14:28.355783Z", 521 | "shell.execute_reply.started": "2021-07-05T21:14:28.274389Z" 522 | } 523 | }, 524 | "outputs": [], 525 | "source": [ 526 | "#logistic regression model \n", 527 | "from sklearn.linear_model import LogisticRegression\n", 528 | "log = LogisticRegression().fit(X_train_trans, y_train)" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "### Evaluation" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": { 542 | "execution": { 543 | "iopub.execute_input": "2021-07-05T21:19:47.503525Z", 544 | "iopub.status.busy": "2021-07-05T21:19:47.503138Z", 545 | "iopub.status.idle": "2021-07-05T21:19:47.528965Z", 546 | "shell.execute_reply": "2021-07-05T21:19:47.528018Z", 547 | "shell.execute_reply.started": "2021-07-05T21:19:47.503489Z" 548 | } 549 | }, 550 | "outputs": [], 551 | "source": [ 552 | "X_test_trans=vectorize.transform(X_test)" 553 | ] 554 | }, 555 | { 556 | "cell_type": "code", 557 | "execution_count": null, 558 | "metadata": { 559 | "execution": { 560 | "iopub.execute_input": "2021-07-05T21:19:47.692601Z", 561 | "iopub.status.busy": "2021-07-05T21:19:47.6922Z", 562 | "iopub.status.idle": "2021-07-05T21:19:47.765624Z", 563 | "shell.execute_reply": "2021-07-05T21:19:47.764614Z", 564 | "shell.execute_reply.started": "2021-07-05T21:19:47.692566Z" 565 | } 566 | }, 567 | "outputs": [], 568 | "source": [ 569 | "from sklearn.metrics import classification_report\n", 570 | "print(classification_report(y_test,ran_tree.predict(X_test_trans)))" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": null, 576 | "metadata": { 577 | "execution": { 578 | "iopub.execute_input": "2021-07-05T21:19:48.453493Z", 579 | "iopub.status.busy": "2021-07-05T21:19:48.453139Z", 580 | "iopub.status.idle": "2021-07-05T21:19:48.465723Z", 581 | "shell.execute_reply": "2021-07-05T21:19:48.464645Z", 582 | 
"shell.execute_reply.started": "2021-07-05T21:19:48.453464Z" 583 | } 584 | }, 585 | "outputs": [], 586 | "source": [ 587 | "from sklearn.metrics import classification_report\n", 588 | "print(classification_report(y_test,log.predict(X_test_trans)))" 589 | ] 590 | } 591 | ], 592 | "metadata": { 593 | "kernelspec": { 594 | "display_name": "Python 3", 595 | "language": "python", 596 | "name": "python3" 597 | }, 598 | "language_info": { 599 | "codemirror_mode": { 600 | "name": "ipython", 601 | "version": 3 602 | }, 603 | "file_extension": ".py", 604 | "mimetype": "text/x-python", 605 | "name": "python", 606 | "nbconvert_exporter": "python", 607 | "pygments_lexer": "ipython3", 608 | "version": "3.9.12" 609 | } 610 | }, 611 | "nbformat": 4, 612 | "nbformat_minor": 4 613 | } 614 | -------------------------------------------------------------------------------- /Natural Language Processing/Module 5- Advanced NLP using LSTM.ipynb: -------------------------------------------------------------------------------- 1 | {"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"code","source":"#importing important libraries fixing dataset\nimport gensim\nimport pandas as pd\nimport numpy as np\nfrom sklearn.model_selection import train_test_split\n\nmessages=pd.read_csv('../input/spamcsv/spam.csv',encoding='latin1' )\nmessages=messages[['v1','v2']]\nmessages.columns=['label', 
'text']\n\n","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:31:22.883921Z","iopub.execute_input":"2022-10-04T23:31:22.884416Z","iopub.status.idle":"2022-10-04T23:31:24.435166Z","shell.execute_reply.started":"2022-10-04T23:31:22.884324Z","shell.execute_reply":"2022-10-04T23:31:24.434248Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#target variable encoding\nlabels=np.where(messages['label']=='spam',1,0)","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:32:06.041462Z","iopub.execute_input":"2022-10-04T23:32:06.041878Z","iopub.status.idle":"2022-10-04T23:32:06.050677Z","shell.execute_reply.started":"2022-10-04T23:32:06.041847Z","shell.execute_reply":"2022-10-04T23:32:06.049720Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#train and test split\nX_train, X_test, y_train, y_test= train_test_split(messages['text'], labels, test_size=0.2)\n","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:32:36.431675Z","iopub.execute_input":"2022-10-04T23:32:36.432075Z","iopub.status.idle":"2022-10-04T23:32:36.439372Z","shell.execute_reply.started":"2022-10-04T23:32:36.432039Z","shell.execute_reply":"2022-10-04T23:32:36.438302Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#importing tensorflow and model building libraries\nimport tensorflow as tf\nfrom tensorflow import keras\nfrom tensorflow.keras import layers","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:33:17.571462Z","iopub.execute_input":"2022-10-04T23:33:17.571853Z","iopub.status.idle":"2022-10-04T23:33:23.788551Z","shell.execute_reply.started":"2022-10-04T23:33:17.571822Z","shell.execute_reply":"2022-10-04T23:33:23.787422Z"}},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Using Tokenizer\n\nTokenizer: Allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in 
a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...\n\nBy default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens. They will then be indexed or vectorized.\n\n## Using Pad Sequences\nThis function transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps). num_timesteps is either the maxlen argument if provided, or the length of the longest sequence in the list.\n\n","metadata":{"execution":{"iopub.status.busy":"2022-10-05T18:21:42.781134Z","iopub.execute_input":"2022-10-05T18:21:42.782029Z","iopub.status.idle":"2022-10-05T18:21:42.817555Z","shell.execute_reply.started":"2022-10-05T18:21:42.781895Z","shell.execute_reply":"2022-10-05T18:21:42.815961Z"}}},{"cell_type":"code","source":"\nfrom tensorflow.keras.preprocessing.text import Tokenizer\nfrom tensorflow.keras.preprocessing.sequence import 
pad_sequences","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:34:50.361302Z","iopub.execute_input":"2022-10-04T23:34:50.362513Z","iopub.status.idle":"2022-10-04T23:34:50.368283Z","shell.execute_reply.started":"2022-10-04T23:34:50.362471Z","shell.execute_reply":"2022-10-04T23:34:50.367204Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"tokenizer=Tokenizer(oov_token=\"\")\ntokenizer.fit_on_texts(X_train)","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:36:05.740275Z","iopub.execute_input":"2022-10-04T23:36:05.740728Z","iopub.status.idle":"2022-10-04T23:36:05.857285Z","shell.execute_reply.started":"2022-10-04T23:36:05.740691Z","shell.execute_reply":"2022-10-04T23:36:05.856163Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"print(tokenizer.index_word[1])\nprint(tokenizer.index_word[2])","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:45:55.276218Z","iopub.execute_input":"2022-10-04T23:45:55.276640Z","iopub.status.idle":"2022-10-04T23:45:55.282584Z","shell.execute_reply.started":"2022-10-04T23:45:55.276604Z","shell.execute_reply":"2022-10-04T23:45:55.281240Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_train_seq=tokenizer.texts_to_sequences(X_train)\nX_test_seq=tokenizer.texts_to_sequences(X_test)","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:37:24.261086Z","iopub.execute_input":"2022-10-04T23:37:24.261426Z","iopub.status.idle":"2022-10-04T23:37:24.370750Z","shell.execute_reply.started":"2022-10-04T23:37:24.261400Z","shell.execute_reply":"2022-10-04T23:37:24.369894Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_train_seq[0]","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:37:37.725114Z","iopub.execute_input":"2022-10-04T23:37:37.726253Z","iopub.status.idle":"2022-10-04T23:37:37.735607Z","shell.execute_reply.started":"2022-10-04T23:37:37.726202Z","shell
.execute_reply":"2022-10-04T23:37:37.734792Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_train_seq_padded = pad_sequences(X_train_seq,maxlen=50, padding='post')\nX_test_seq_padded = pad_sequences(X_test_seq,maxlen=50, padding='post')","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:39:51.200914Z","iopub.execute_input":"2022-10-04T23:39:51.201338Z","iopub.status.idle":"2022-10-04T23:39:51.224739Z","shell.execute_reply.started":"2022-10-04T23:39:51.201298Z","shell.execute_reply":"2022-10-04T23:39:51.223943Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"X_test_seq_padded[0]","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:40:03.502776Z","iopub.execute_input":"2022-10-04T23:40:03.503182Z","iopub.status.idle":"2022-10-04T23:40:03.512577Z","shell.execute_reply.started":"2022-10-04T23:40:03.503148Z","shell.execute_reply":"2022-10-04T23:40:03.511314Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#Building the model\nmodel = keras.Sequential()\n# Add an Embedding layer sized to the tokenizer vocabulary (+1 because index 0 is reserved for padding), and\n# output embedding dimension of size 32.\nmodel.add(layers.Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=32)) #you can experiment with output_dim\n\n# Add an LSTM layer with 32 internal units.\nmodel.add(layers.LSTM(32, dropout=0, recurrent_dropout=0)) #32 units, matching the embedding output size\n\n# Add a Dense layer with 32 units.\nmodel.add(layers.Dense(32, activation='relu'))\n#final layer which will tell whether it's a spam or ham\nmodel.add(layers.Dense(1, 
activation='sigmoid'))\n\nmodel.summary()","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:49:37.271874Z","iopub.execute_input":"2022-10-04T23:49:37.272307Z","iopub.status.idle":"2022-10-04T23:49:37.644026Z","shell.execute_reply.started":"2022-10-04T23:49:37.272275Z","shell.execute_reply":"2022-10-04T23:49:37.642813Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import tensorflow.keras.backend as K\ndef recall_m(y_true, y_pred):\n true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))\n possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))\n recall = true_positives / (possible_positives + K.epsilon())\n return recall\n\ndef precision_m(y_true, y_pred):\n true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))\n predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))\n precision = true_positives / (predicted_positives + K.epsilon())\n return precision\n\n#compile the model\nmodel.compile(\n loss='binary_crossentropy',\n optimizer=\"adam\",\n metrics=['accuracy',recall_m,precision_m]\n)\n","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:59:02.195076Z","iopub.execute_input":"2022-10-04T23:59:02.195473Z","iopub.status.idle":"2022-10-04T23:59:02.209633Z","shell.execute_reply.started":"2022-10-04T23:59:02.195443Z","shell.execute_reply":"2022-10-04T23:59:02.208295Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"history=model.fit(\n X_train_seq_padded, y_train, validation_data=(X_test_seq_padded, y_test), batch_size=32, epochs=10\n)","metadata":{"execution":{"iopub.status.busy":"2022-10-04T23:59:53.456960Z","iopub.execute_input":"2022-10-04T23:59:53.457359Z","iopub.status.idle":"2022-10-05T00:00:57.837047Z","shell.execute_reply.started":"2022-10-04T23:59:53.457329Z","shell.execute_reply":"2022-10-05T00:00:57.835538Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Plot the evaluation metrics by each epoch for the 
model to see if we are over or underfitting\nimport matplotlib.pyplot as plt\n\nfor i in ['accuracy', 'precision_m', 'recall_m']:\n acc = history.history[i]\n val_acc = history.history['val_{}'.format(i)]\n epochs = range(1, len(acc) + 1)\n\n plt.figure()\n plt.plot(epochs, acc, label='Training {}'.format(i))\n plt.plot(epochs, val_acc, label='Validation {}'.format(i))\n plt.title('Results for {}'.format(i))\n plt.legend()\n plt.show()","metadata":{"execution":{"iopub.status.busy":"2022-10-05T00:00:57.839250Z","iopub.execute_input":"2022-10-05T00:00:57.839689Z","iopub.status.idle":"2022-10-05T00:00:58.377962Z","shell.execute_reply.started":"2022-10-05T00:00:57.839656Z","shell.execute_reply":"2022-10-05T00:00:58.376779Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"","metadata":{},"execution_count":null,"outputs":[]}]} -------------------------------------------------------------------------------- /Natural Language Processing/dataset.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Natural Language Processing/dataset.csv -------------------------------------------------------------------------------- /Pictures/5-Step-Workflow-for-Multiple-Linear-Regression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Pictures/5-Step-Workflow-for-Multiple-Linear-Regression.png -------------------------------------------------------------------------------- /Pictures/Exploratory-Data-Analysis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Pictures/Exploratory-Data-Analysis.png 
-------------------------------------------------------------------------------- /Pictures/log_reg.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/gitsuraaj/Data-Science-Series/90a7e493c20f70e720c813bb60d3901facaacfe8/Pictures/log_reg.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data-Science-Series 2 | For anyone struggling to find a good hands-on resource (with case studies) to master Data Science skills, here's all you need! 3 | 4 | 5 | ## CONCEPT 1: Exploratory Data Analysis 6 | ![EDA](https://github.com/gitsuraaj/Data-Science-Series/blob/master/Pictures/Exploratory-Data-Analysis.png) 7 | 8 | ### Credit Risk Modeling Case Study 9 | 10 | #### Problem Statement: 11 | ABC Finance Company finds it hard to give loans to people with insufficient or non-existent credit history. Some consumers use this to their advantage and default. Suppose you work for ABC Finance Company, which specializes in lending various types of loans to urban customers. You have to use EDA to analyze the patterns present in the data. This will ensure that applicants who are capable of repaying the loan are not rejected. 12 | When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risks are associated with the company’s decision: 13 | • If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company. 14 | • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company. 15 | The company wants to understand the driving factors (or driver variables) behind loan default, i.e. 
the variables which are strong indicators of default. 16 | 17 | #### Now, what should you do? 18 | 19 | Identify patterns which indicate whether a client has difficulty paying their installments. These patterns may be used for taking actions such as denying the loan, reducing the amount of the loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study. 20 | In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment. 21 | To develop your understanding of the domain, you are advised to independently research a little about risk analytics (understanding the types of variables and their significance should be enough). 22 | 23 | #### Datasets: 24 | 25 | 1. 'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties. 26 | 2. 'previous_application.csv' contains information about the client’s previous loan data. It records whether the previous application was Approved, Cancelled, Refused or an Unused offer. 27 | 28 | Datasets uploaded here: https://drive.google.com/drive/folders/199BenSvrJUhf6N5iU7nSNewduW61AwYy 29 | 30 | Note: You can refer to the Columns Description file to learn about the columns available in the datasets. 31 | 32 | 33 | #### How to begin? 34 | 35 | 1. Start by importing 'application_data.csv'. 36 | 37 | 2. Check the structure of the data (shape, info, describe). 38 | 39 | 3. Data Quality Check and Missing values 40 | 41 | a. Find the percentage of missing values for all the columns. 42 | 43 | b. Remove columns with high missing percentage. 44 | 45 | c. 
For columns with a low missing percentage (around 13% or so), decide on the best value to impute. If the column is categorical, check which category can be used to fill the nulls. For the others, check whether the mean or median can be imputed. In some cases, imputing with 0 may be appropriate. Do this for a few variables (say 5), not for all. 46 | 47 | 4. Check the datatypes of all the columns and change the datatype if required. 48 | 49 | 5. For numerical columns, check for outliers and report them for at least 3 variables. Treat them and analyze the results. 50 | 51 | 6. Binning of continuous variables. Check if you need to bin any variable in different 52 | categories. Do this for one or two columns. 53 | 54 | 55 | #### Analysis Steps 56 | 57 | 1. Check the imbalance percentage. No balancing technique required. 58 | 59 | 2. Divide the data into two sets, i.e. Target=1 and Target=0. 60 | 61 | 3. Perform univariate analysis for categorical variables for both 0 and 1. Compare the target variable across categories of categorical variables. 62 | 63 | 4. Find correlation for numerical columns for both the cases, i.e. 0 and 1. 64 | 65 | 5. Check whether the variables with the highest correlation are the same in both cases. 66 | 67 | 6. Perform univariate analysis for numerical variables for both 0 and 1. Compare the target variable across categories of continuous variables. 68 | 69 | 7. Perform bivariate analysis for numerical variables for both 0 and 1. 70 | 71 | 8. Read “Previous Application” data. 72 | 73 | 9. Merge the files. 74 | 75 | 10. Perform univariate and bivariate analysis to find some pattern. 
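The data quality and analysis steps above can be sketched in pandas as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code: the column names (`AMT_INCOME`, `OCCUPATION`, `MOSTLY_EMPTY`), the 50% drop threshold, and the 1.5 × IQR outlier rule are all assumptions chosen for the example; only `TARGET` as a 0/1 flag follows the steps listed here.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical stand-in for application_data.csv (column names are illustrative).
df = pd.DataFrame({
    "TARGET": [0, 0, 0, 1, 0, 1, 0, 0],
    "AMT_INCOME": [100.0, 120.0, np.nan, 90.0, 110.0, 95.0, 105.0, 5000.0],
    "OCCUPATION": ["Clerk", None, "Clerk", "Driver", "Clerk", "Driver", "Clerk", "Clerk"],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan, np.nan, np.nan],
})

# Percentage of missing values for every column.
missing_pct = df.isnull().mean() * 100

# Drop columns above an assumed threshold of 50% missing.
high_missing = missing_pct[missing_pct > 50].index
df = df.drop(columns=high_missing)

# Impute the rest: median for a numeric column, most frequent category for a categorical one.
df["AMT_INCOME"] = df["AMT_INCOME"].fillna(df["AMT_INCOME"].median())
df["OCCUPATION"] = df["OCCUPATION"].fillna(df["OCCUPATION"].mode()[0])

# Flag outliers in a numeric column with the 1.5 * IQR rule (one common choice).
q1, q3 = df["AMT_INCOME"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["AMT_INCOME"] < q1 - 1.5 * iqr) | (df["AMT_INCOME"] > q3 + 1.5 * iqr)]

# Imbalance percentage of the target, then the two subsets used in the analysis steps.
imbalance_pct = df["TARGET"].value_counts(normalize=True) * 100
target_0 = df[df["TARGET"] == 0]
target_1 = df[df["TARGET"] == 1]
```

On the real data, the threshold and the imputation value should be decided per column after inspecting `missing_pct` and the column's distribution.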
76 | 77 | 78 | ## CONCEPT 2: Multiple Regression 79 | 80 | ![LinearRegression](https://github.com/gitsuraaj/Data-Science-Series/blob/master/Pictures/5-Step-Workflow-for-Multiple-Linear-Regression.png) 81 | 82 | ### YuluDulu Bicycle Subscription Case Study 83 | 84 | #### Problem Statement 85 | A bike-sharing provider, YuluDulu, has recently suffered considerable dips in its revenues. It has contracted a consulting company to understand the factors on which the demand for these shared bikes depends. Specifically, it wants to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know: 86 | • Which variables are significant in predicting the demand for shared bikes? 87 | • How well do those variables describe the bike demand? 88 | 89 | #### What you need to do? 90 | • Create a linear model that describes the effect of various features on demand. 91 | • The model should be interpretable so that the management can understand it. 92 | 93 | 94 | #### Case Study Steps to be performed 95 | 96 | #### Data Preparation 97 | 98 | 1. Identify the categorical and continuous features. 99 | 2. Drop the unnecessary variables: ‘instant’, ‘dteday’, ‘casual’ and ‘registered’. 100 | 3. Check the data-type of all the columns and make necessary changes if required. 101 | 102 | #### Data Visualization 103 | 104 | 4. Perform EDA to understand various variables. 105 | 5. Check the correlation between the variables. 106 | 107 | #### Data Preparation 108 | 109 | 6. Create dummy variables for all the categorical features. 110 | 7. Divide the data into train and test sets. 111 | 8. Perform scaling. 112 | 9. Divide the data into X and y. 113 | 114 | #### Data Modelling and Evaluation 115 | 116 | 10. Create a Linear Regression model using a mixed approach. 117 | 11. Check the various assumptions. 118 | 12. Check the Adjusted R-Square for both test and train data. 119 | 13. Report the final model. 
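As a rough illustration of steps 6 to 12, the sketch below builds dummy variables, splits and scales the data, fits an ordinary least squares model, and reports Adjusted R-Square on train and test. It runs on synthetic data and uses plain NumPy least squares rather than the notebook's modelling library, so everything here is an assumption for the example: the columns `temp`, `season` and `cnt`, the 70/30 split, and min-max scaling. Feature selection (the "mixed approach") is not shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the bike data; column names are hypothetical.
data = pd.DataFrame({
    "temp": rng.uniform(0, 35, n),
    "season": rng.choice(["spring", "summer", "fall", "winter"], n),
})
data["cnt"] = 50 + 10 * data["temp"] + 40 * (data["season"] == "summer") + rng.normal(0, 5, n)

# Step 6: dummy variables; drop_first avoids a redundant (perfectly collinear) column.
X = pd.get_dummies(data[["temp", "season"]], columns=["season"], drop_first=True).astype(float)
y = data["cnt"].to_numpy()

# Step 7: 70/30 train-test split on a shuffled index.
idx = rng.permutation(n)
train, test = idx[:140], idx[140:]

# Step 8: min-max scaling, with the limits learned from the training rows only.
lo, hi = X.iloc[train].min(), X.iloc[train].max()
X_scaled = (X - lo) / (hi - lo)

def add_intercept(frame):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])

# Step 10 (simplified): ordinary least squares on all features.
coef, *_ = np.linalg.lstsq(add_intercept(X_scaled.iloc[train]), y[train], rcond=None)

def adjusted_r2(features, target):
    """Adjusted R-Square: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    pred = add_intercept(features) @ coef
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n_obs, n_feat = features.shape
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_feat - 1)

# Step 12: compare Adjusted R-Square on train and test data.
adj_r2_train = adjusted_r2(X_scaled.iloc[train], y[train])
adj_r2_test = adjusted_r2(X_scaled.iloc[test], y[test])
```

A large gap between `adj_r2_train` and `adj_r2_test` would suggest overfitting; on this synthetic data the two should be close.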
120 | 121 | #### The dataset “YuluDulu Dataset.csv” and its description “YuluDulu Columns Description” are uploaded in the same folder. 122 | 123 | 124 | ## CONCEPT 3: Supervised Learning through Classification 125 | 126 | ![Logistic Regression](https://github.com/gitsuraaj/Data-Science-Series/blob/master/Pictures/log_reg.png) 127 | 128 | ### Potential Customers for X Education 129 | An education company named X Education sells online courses to industry professionals. The company markets its courses on several websites and search engines like Google. Once these people land on the website, they might browse the courses, fill out a form for the course, or watch some videos. When these people fill out a form providing their email address or phone number, they are classified as a lead. Now, although X Education gets a lot of leads, its lead conversion rate is very poor. For example, if they acquire 100 leads in a day, only about 30 of them are converted. 130 | 131 | #### What you need to do? 132 | • X Education has appointed you to help them select the most promising leads, i.e. the leads that are most likely to convert into paying customers. 133 | 134 | • The company requires you to build a model that assigns a lead score to each lead, such that customers with a higher lead score have a higher conversion chance and customers with a lower lead score have a lower conversion chance. 135 | Steps to follow: 136 | 137 | #### Data Cleaning 138 | • Handle the “Select” level that is present in many of the categorical variables. 139 | 140 | • Drop columns that have a high percentage of missing values. Check all the columns before dropping them. 141 | 142 | • Check the number of unique categories in each categorical column. You may need to consolidate categories where there are too many. 143 | 144 | • For columns with a low percentage of missing values, use an imputation technique. 145 | 146 | • Finally, check the percentage of rows retained after the data cleaning process. 
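The data cleaning steps above might look like this in pandas. It is a minimal sketch on a made-up five-row frame: the column names, the 50% drop threshold, and the choice of mode imputation are assumptions for illustration, and treating “Select” as a missing value is one reasonable reading of "handle the Select level" (it typically means no option was chosen in a dropdown).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the leads data; columns and values are illustrative.
leads = pd.DataFrame({
    "Specialization": ["Select", "Finance", "Select", "Marketing", "Finance"],
    "City": ["Mumbai", "Select", "Mumbai", None, "Mumbai"],
    "Lead Quality": ["Select", None, "Select", "Select", "High"],
    "Converted": [0, 1, 0, 1, 1],
})
rows_before = len(leads)

# Treat the "Select" placeholder as a missing value.
leads = leads.replace("Select", np.nan)

# Drop columns above an assumed threshold of 50% missing.
missing_pct = leads.isnull().mean() * 100
leads = leads.drop(columns=missing_pct[missing_pct > 50].index)

# Impute one low-missing categorical column with its most frequent category.
leads["City"] = leads["City"].fillna(leads["City"].mode()[0])

# Drop the rows that still contain missing values, then report how many rows survived.
leads = leads.dropna()
rows_retained_pct = 100 * len(leads) / rows_before
```

Checking `missing_pct` before dropping anything makes it easy to review each column, as the step list asks.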
147 | 148 | #### Data Preparation 149 | • Create dummies for all categorical columns. 150 | 151 | • Perform train-test split. 152 | 153 | • Perform scaling. 154 | 155 | #### Modelling 156 | • Use techniques like RFE to perform variable selection. 157 | 158 | • Build a Logistic Regression model with good sensitivity. 159 | 160 | • Check p-value and VIF. 161 | 162 | • Find the optimal probability cutoff. 163 | 164 | • Check the model performance over the test data. 165 | 166 | • Generate the score variable. 167 | 168 | 169 | 170 | --------------------------------------------------------------------------------
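For the last two modelling steps, one common way to pick a probability cutoff is to scan candidate values and take the one where sensitivity and specificity are closest, then scale the probability to a 0-100 lead score. The sketch below assumes the predicted probabilities already exist; `y_true` and `y_prob` are invented values, and both the balancing criterion and the 0-100 scaling are assumptions rather than the notebook's confirmed method.

```python
import numpy as np

# Hypothetical true outcomes and predicted conversion probabilities from a fitted model.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.85, 0.95])

def sensitivity_specificity(y_true, y_prob, cutoff):
    """Return (sensitivity, specificity) when predicting 1 for prob >= cutoff."""
    y_pred = (y_prob >= cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Scan cutoffs 0.1 .. 0.9 and keep the one where the two rates are closest.
cutoffs = np.round(np.linspace(0.1, 0.9, 9), 1)
gaps = [abs(sens - spec)
        for sens, spec in (sensitivity_specificity(y_true, y_prob, c) for c in cutoffs)]
best_cutoff = float(cutoffs[int(np.argmin(gaps))])

# Lead score: the predicted probability expressed on a 0-100 scale.
lead_score = np.round(y_prob * 100).astype(int)
```

If the business goal favours catching more potential converters, a lower cutoff than the balanced one trades specificity for sensitivity.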