├── README.md
└── NLP text classification model Github.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # NLP-text-classification-model
2 |
3 |
4 | Unstructured data in the form of text (chats, emails, social media posts, survey responses) is present everywhere today. Text can be a rich source of information, but due to its unstructured nature it can be hard to extract insights from.
5 | Text classification is one of the important tasks in supervised machine learning (ML). It is the process of assigning tags/categories to documents, helping us automatically and quickly structure and analyze text in a cost-effective manner. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labeling, intent detection, etc.
6 |
7 | **Let’s divide the classification problem into the below steps:**
8 |
9 | 1. Setup: Importing Libraries
10 | 2. Loading the data set & Exploratory Data Analysis
11 | 3. Text pre-processing
12 | 4. Extracting vectors from text (Vectorization)
13 | 5. Running ML algorithms
14 | 6. Conclusion
15 |
16 |
17 |
18 | **Step 2: Loading the data set & EDA**
19 |
20 | The data set that we will be using for this article is the famous “Natural Language Processing with Disaster Tweets” data set, where we’ll be predicting whether a given tweet is about a real disaster (target=1) or not (target=0).
21 | In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.
22 | We have 7,613 tweets in the training (labelled) dataset and 3,263 in the test (unlabelled) dataset.
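
Here’s a minimal loading sketch, assuming the competition files `train.csv` and `test.csv` have been downloaded from Kaggle into the working directory:

```python
import pandas as pd

# Assumes train.csv and test.csv from the Kaggle competition are in the working directory
df_train = pd.read_csv('train.csv')   # labelled tweets: id, keyword, location, text, target
df_test = pd.read_csv('test.csv')     # unlabelled tweets: no target column

print(df_train.shape)  # (7613, 5)
print(df_test.shape)   # (3263, 4)
df_train.head()
```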
23 |
24 | Exploratory Data Analysis (EDA): the key observations are listed below, with a short code sketch after the list
25 | 1. Class distribution: There are more tweets with class 0 (no disaster) than class 1 (disaster tweets). We can say that the dataset is relatively balanced, with 4,342 non-disaster tweets (57%) and 3,271 disaster tweets (43%). Since the data is balanced, we won’t be applying data-balancing techniques like SMOTE while building the model
26 |
27 | 2. Missing values: We have ~2.5k missing values in the location column and 61 missing values in the keyword column
28 |
29 | 3. Number of words in a tweet: Disaster tweets are wordier than non-disaster tweets
30 | The average number of words in a disaster tweet is 15.17, compared to an average of 14.7 words in a non-disaster tweet
31 |
32 | 4. Number of characters in a tweet: Disaster tweets are longer than non-disaster tweets
33 | The average number of characters in a disaster tweet is 108.1, compared to an average of 95.7 characters in a non-disaster tweet
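
Here’s a minimal sketch of these checks, assuming `df_train` is the training DataFrame loaded in Step 2:

```python
# 1. Class distribution
print(df_train['target'].value_counts())        # 0: 4342, 1: 3271

# 2. Missing values per column
print(df_train.isnull().sum())                  # location and keyword have gaps

# 3. Average number of words per tweet, by class
word_counts = df_train['text'].str.split().str.len()
print(word_counts.groupby(df_train['target']).mean())

# 4. Average number of characters per tweet, by class
char_counts = df_train['text'].str.len()
print(char_counts.groupby(df_train['target']).mean())
```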
34 |
35 | **Step 3: Text Pre-Processing**
36 |
37 | Before we move to model building, we need to preprocess our dataset by removing punctuation & special characters, cleaning the text, removing stop words, and applying lemmatization.
38 | Simple text cleaning: some of the common text cleaning steps (sketched in code after this list) are:
39 | - Removing punctuations, special characters, URLs & hashtags
40 | - Removing leading, trailing & extra white spaces/tabs
41 | - Correcting typos and slang, and writing abbreviations out in their long forms
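
Here’s a minimal cleaning sketch; the `preprocess` helper name is illustrative, and typo/slang correction is left out since it usually needs a lookup dictionary:

```python
import re
import string

def preprocess(text: str) -> str:
    """Basic cleaning: lowercase, strip URLs, HTML tags, punctuation, digits and extra whitespace."""
    text = text.lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                   # remove URLs
    text = re.sub(r'<.*?>', '', text)                                   # remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)   # punctuation (also drops the # of hashtags)
    text = re.sub(r'\d+', '', text)                                     # remove digits
    text = re.sub(r'\s+', ' ', text).strip()                            # collapse whitespace/tabs
    return text

df_train['clean_text'] = df_train['text'].apply(preprocess)
```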
42 |
43 | 1. Stop-word removal: We can remove a list of generic stop words from the English vocabulary using nltk. A few such words are ‘i’, ‘you’, ‘a’, ‘the’, ‘he’, ‘which’, etc.
44 | 2. Stemming: Refers to the process of slicing off the end or the beginning of words with the intention of removing affixes (prefixes/suffixes)
45 | 3. Lemmatization: The process of reducing a word to its base form (lemma), using the vocabulary of the language rather than just chopping off affixes. A sketch of stop-word removal and lemmatization follows this list
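
Here’s a minimal sketch of stop-word removal and lemmatization with nltk, applied on top of the cleaned text; the helper name and the `clean_text` column are illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords_and_lemmatize(text: str) -> str:
    tokens = word_tokenize(text)
    # Drop generic stop words ('i', 'you', 'a', 'the', ...), then reduce each word to its lemma
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return ' '.join(tokens)

df_train['clean_text'] = df_train['clean_text'].apply(remove_stopwords_and_lemmatize)
```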
46 |
47 |
48 | **Step 4: Extracting vectors from text (Vectorization)**
49 |
50 | It’s difficult to work with text data while building machine learning models, since these models need well-defined numerical data. The process of converting text data into numerical vectors is called vectorization or, in the NLP world, word embedding. Bag-of-Words (BoW) and Word Embedding (with Word2Vec) are two well-known methods for converting text data to numerical data.
51 | There are a few versions of Bag of Words, corresponding to different word-scoring methods. We use the scikit-learn library to calculate the BoW numerical values using the following approaches (a short sketch follows the list):
52 | 1. Count vectors: It builds a vocabulary from a corpus of documents and counts how many times the words appear in each document
53 | 2. Term Frequency-Inverse Document Frequency (Tf-Idf): Count vectors might not be the best representation for converting text data to numerical data. So, instead of simple counting, we can use an advanced variant of the Bag-of-Words that uses term frequency-inverse document frequency (Tf-Idf). Basically, the value of a word increases proportionally with its count in the document, but it is offset by how frequently the word appears across the corpus
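
Here’s a minimal scikit-learn sketch of both scoring approaches; the variable names are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df_train['clean_text']

# 1. Count vectors: vocabulary built from the corpus, raw term counts per document
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(corpus)

# 2. Tf-Idf: term frequency weighted down by how common the term is across the corpus
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(X_counts.shape, X_tfidf.shape)  # (n_documents, vocabulary_size)
```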
54 |
55 | Word2Vec: One of the major drawbacks of Bag-of-Words techniques is that they can’t capture the meaning of, or relationships between, words. Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network; it is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
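
Here’s a minimal gensim (v4+) sketch; each tweet is represented by the average of its word vectors, which is one common pooling choice rather than the only one, and the variable names are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Assumes 'clean_text' holds the preprocessed tweets
tokenized = [text.split() for text in df_train['clean_text']]

# Train a small Word2Vec model on the tweet corpus (vector_size is the gensim 4.x parameter name)
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=4)

def tweet_vector(tokens):
    """Average the vectors of the words the model knows; zero vector if none are known."""
    vectors = [w2v.wv[tok] for tok in tokens if tok in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

X_w2v = np.vstack([tweet_vector(tokens) for tokens in tokenized])
```

Averaging word vectors loses word order, but it gives a fixed-length representation per tweet that the same classifiers can consume.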
56 |
57 | We can use any of these approaches to convert our text data to numerical form, which will be used to build the classification model. With this in mind, I am going to first partition the dataset into a training set (80%) and a test set (20%).
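
Here’s a minimal sketch of the 80/20 split; the variable names mirror those used in Step 5 and are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df_train['clean_text'], df_train['target'], test_size=0.2, random_state=42
)

# Fit the vectorizer on the training text only, then transform both splits
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
```

Fitting the vectorizer only on the training split avoids leaking test-set vocabulary statistics into training.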
58 |
59 | **Step 5: Running ML algorithms**
60 |
61 | It’s time to train a machine learning model on the vectorized dataset and test it. Now that we have converted the text data to numerical data, we can run ML models on X_train_vectors_tfidf & y_train. We’ll test each model on X_test_vectors_tfidf to get y_predict and further evaluate its performance (a short sketch follows the list below).
62 | 1. Logistic Regression
63 | 2. Naive Bayes: It’s a probabilistic classifier that makes use of Bayes’ Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related to the event
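
Here’s a minimal sketch of fitting and evaluating both models on the Tf-Idf vectors from the split above; it’s an illustrative outline rather than the exact notebook code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# 1. Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vectors_tfidf, y_train)
y_predict = lr.predict(X_test_vectors_tfidf)
y_prob = lr.predict_proba(X_test_vectors_tfidf)[:, 1]
print(classification_report(y_test, y_predict))
print('ROC AUC:', roc_auc_score(y_test, y_prob))

# 2. Multinomial Naive Bayes (works well with non-negative count/Tf-Idf features)
nb = MultinomialNB()
nb.fit(X_train_vectors_tfidf, y_train)
y_predict = nb.predict(X_test_vectors_tfidf)
print(classification_report(y_test, y_predict))
```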
64 |
**Step 6: Conclusion**

65 | In this article, I demonstrated the basics of building a text classification model, comparing Bag-of-Words (with Tf-Idf) and word embeddings (with Word2Vec). You can further enhance the performance of your model using this code by:
66 | - using other classification algorithms like Support Vector Machines (SVM), XGBoost, ensemble models, neural networks, etc.
67 | - using grid search (e.g. scikit-learn's GridSearchCV) to tune the hyperparameters of your model
68 | - using advanced word-embedding methods like GloVe and BERT
69 |
--------------------------------------------------------------------------------
/NLP text classification model Github.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### IMPORTING PACKAGES"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "name": "stderr",
17 | "output_type": "stream",
18 | "text": [
19 | "[nltk_data] Downloading package punkt to /Users/ranivija/nltk_data...\n",
20 | "[nltk_data] Package punkt is already up-to-date!\n",
21 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
22 | "[nltk_data] /Users/ranivija/nltk_data...\n",
23 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
24 | "[nltk_data] date!\n",
25 | "[nltk_data] Downloading package wordnet to\n",
26 | "[nltk_data] /Users/ranivija/nltk_data...\n",
27 | "[nltk_data] Package wordnet is already up-to-date!\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "import pandas as pd\n",
33 | "import numpy as np\n",
34 | "\n",
35 | "import seaborn as sns\n",
36 | "import matplotlib.pyplot as plt\n",
37 | "\n",
38 | "#for text pre-processing\n",
39 | "import re, string\n",
40 | "import nltk\n",
41 |     "from nltk.tokenize import word_tokenize\n",
42 |     "from nltk.corpus import stopwords\n",
44 | "from nltk.stem import SnowballStemmer\n",
45 | "from nltk.corpus import wordnet\n",
46 | "from nltk.stem import WordNetLemmatizer\n",
47 | "\n",
48 | "nltk.download('punkt')\n",
49 | "nltk.download('averaged_perceptron_tagger')\n",
50 |     "nltk.download('wordnet')\n",
     "nltk.download('stopwords')\n",
51 | "\n",
52 | "#for model-building\n",
53 | "from sklearn.model_selection import train_test_split\n",
54 | "from sklearn.linear_model import LogisticRegression\n",
55 | "from sklearn.linear_model import SGDClassifier\n",
56 | "from sklearn.naive_bayes import MultinomialNB\n",
57 | "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix\n",
58 | "from sklearn.metrics import roc_curve, auc, roc_auc_score\n",
59 | "\n",
60 | "# bag of words\n",
61 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
62 | "from sklearn.feature_extraction.text import CountVectorizer\n",
63 | "\n",
64 | "#for word embedding\n",
65 | "import gensim\n",
66 | "from gensim.models import Word2Vec #Word2Vec is mostly used for huge datasets"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Loading Data"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "(7613, 5)\n"
86 | ]
87 | },
88 | {
89 | "data": {
161 | "text/plain": [
162 | " id keyword location text \\\n",
163 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n",
164 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n",
165 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n",
166 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n",
167 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n",
168 | "\n",
169 | " target \n",
170 | "0 1 \n",
171 | "1 1 \n",
172 | "2 1 \n",
173 | "3 1 \n",
174 | "4 1 "
175 | ]
176 | },
177 | "execution_count": 2,
178 | "metadata": {},
179 | "output_type": "execute_result"
180 | }
181 | ],
182 | "source": [
183 | "#you can download the data from https://www.kaggle.com/c/nlp-getting-started/data\n",
184 | "import os\n",
185 | "os.chdir('/Users/ranivija/Desktop/')\n",
186 | "df_train=pd.read_csv('train.csv')\n",
187 | "print(df_train.shape)\n",
188 | "df_train.head()"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "## EDA"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 3,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "name": "stdout",
205 | "output_type": "stream",
206 | "text": [
207 | "0 4342\n",
208 | "1 3271\n",
209 | "Name: target, dtype: int64\n"
210 | ]
211 | },
212 | {
213 | "data": {
214 | "text/plain": [
215 | ""
216 | ]
217 | },
218 | "execution_count": 3,
219 | "metadata": {},
220 | "output_type": "execute_result"
221 | },
222 | {
223 | "data": {
225 | "text/plain": [
226 | "