├── README.md
└── NLP text classification model Github.ipynb

/README.md:
--------------------------------------------------------------------------------
1 | # NLP-text-classification-model
2 |
3 |
4 | Unstructured data in the form of text (chats, emails, social media posts, survey responses) is present everywhere today. Text can be a rich source of information, but because of its unstructured nature it can be hard to extract insights from it.
5 | Text classification is one of the important tasks in supervised machine learning (ML). It is the process of assigning tags/categories to documents, helping us structure and analyze text automatically, quickly, and cost-effectively. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labeling, intent detection, etc.
6 |
7 | **Let’s divide the classification problem into the below steps:**
8 |
9 | 1. Setup: Importing Libraries
10 | 2. Loading the data set & Exploratory Data Analysis
11 | 3. Text pre-processing
12 | 4. Extracting vectors from text (Vectorization)
13 | 5. Running ML algorithms
14 | 6. Conclusion
15 |
16 |
17 | (Step 1, setting up and importing the libraries, is covered by the first cell of the notebook in this repository.)
18 | **Step 2: Loading the data set & EDA**
19 |
20 | The data set that we will be using for this article is the well-known “Natural Language Processing with Disaster Tweets” data set, where we predict whether a given tweet is about a real disaster (target=1) or not (target=0).
21 | In this competition, you’re challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of roughly 10,000 tweets that were hand classified.
22 | We have 7,613 tweets in the training (labelled) dataset and 3,263 in the test (unlabelled) dataset.
23 |
24 | Exploratory Data Analysis (EDA)
25 | 1. Class distribution: There are more tweets of class 0 (no disaster) than class 1 (disaster tweets). The dataset is relatively balanced, with 4,342 non-disaster tweets (57%) and 3,271 disaster tweets (43%). Since the data is balanced, we won’t apply data-balancing techniques like SMOTE while building the model.
26 |
27 | 2. Missing values: We have ~2.5k missing values in the location field and 61 missing values in the keyword column.
28 |
29 | 3. Number of words in a tweet: Disaster tweets are wordier than non-disaster tweets.
30 | The average number of words in a disaster tweet is 15.17, compared to an average of 14.7 words in a non-disaster tweet.
31 |
32 | 4. Number of characters in a tweet: Disaster tweets are longer than non-disaster tweets.
33 | The average number of characters in a disaster tweet is 108.1, compared to an average of 95.7 characters in a non-disaster tweet.
34 |
35 | **Step 3: Text Pre-Processing**
36 |
37 | Before we move to model building, we need to preprocess the dataset by removing punctuation & special characters, cleaning the texts, removing stop words, and applying lemmatization.
38 | Some of the common text cleaning steps are:
39 | - Removing punctuation, special characters, URLs & hashtags
40 | - Removing leading, trailing & extra white spaces/tabs
41 | - Correcting typos and slang, and writing abbreviations out in their long forms
42 |
43 | 1. Stop-word removal: We can remove a list of generic stop words from the English vocabulary using nltk. A few such words are ‘i’, ‘you’, ‘a’, ‘the’, ‘he’, ‘which’, etc.
44 | 2. Stemming: The process of slicing the end or the beginning of words with the intention of removing affixes (prefixes/suffixes).
45 | 3. Lemmatization: The process of reducing a word to its base (dictionary) form.
46 |
47 |
48 | **Step 4: Extracting vectors from text (Vectorization)**
49 |
50 | It’s difficult to work with text data when building machine learning models, since these models need well-defined numerical data. The process of converting text data into numerical data/vectors is called vectorization or, in the NLP world, word embedding. Bag-of-Words (BoW) and Word Embedding (with Word2Vec) are two well-known methods for converting text data to numerical data.
51 | There are a few versions of Bag-of-Words, corresponding to different word-scoring methods. We use the sklearn library to calculate the BoW numerical values using these approaches (a small, self-contained comparison of the two vectorizers is sketched at the end of this README):
52 | 1. Count vectors: build a vocabulary from the corpus of documents and count how many times each word appears in each document.
53 | 2. Term Frequency-Inverse Document Frequency (Tf-Idf): Count vectors might not be the best representation for converting text data to numerical data, so instead of simple counting we can use this more advanced variant of Bag-of-Words. Basically, the value of a word increases proportionally to its count in the document, but is inversely proportional to the frequency of the word across the corpus.
54 |
55 | Word2Vec: One of the major drawbacks of Bag-of-Words techniques is that they can’t capture the meaning of, or relations between, words. Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network; it can capture the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
56 |
57 | We can use any of these approaches to convert our text data to numerical form, which will then be used to build the classification model. With this in mind, I am going to first partition the labelled dataset into a training set (80%) and a validation set (20%).
58 |
59 | **Step 5: Running ML algorithms**
60 |
61 | It’s time to train a machine learning model on the vectorized dataset and test it. Now that we have converted the text data to numerical data, we can run ML models on X_train_vectors_tfidf & y_train. We’ll test each model on X_val_vectors_tfidf to get y_predict and further evaluate its performance.
62 | 1. Logistic Regression
63 | 2. Naive Bayes: A probabilistic classifier that makes use of Bayes’ Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related.
64 |
65 | In this article, I demonstrated the basics of building a text classification model, comparing Bag-of-Words (with Tf-Idf) and Word Embedding with Word2Vec. You can further enhance the performance of your model using this code by (a short illustrative sketch for each of these suggestions follows this list):
66 | - using other classification algorithms like Support Vector Machines (SVM), XGBoost, ensemble models, neural networks, etc.
67 | - using GridSearch to tune the hyperparameters of your model
68 | - using advanced word-embedding methods like GloVe and BERT
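For the first suggestion (trying other classifiers), a linear Support Vector Machine is a natural candidate on sparse TF-IDF features. Below is a rough sketch, assuming the X_train_vectors_tfidf / X_val_vectors_tfidf matrices and the y_train / y_val labels produced by the notebook's train/validation split are already in memory; the C value is an untuned default, not a recommendation.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, confusion_matrix

# Linear SVMs usually cope well with high-dimensional, sparse TF-IDF matrices.
svm_tfidf = LinearSVC(C=1.0)                   # untuned default
svm_tfidf.fit(X_train_vectors_tfidf, y_train)  # variables come from the notebook's split

y_predict = svm_tfidf.predict(X_val_vectors_tfidf)
print(classification_report(y_val, y_predict))
print(confusion_matrix(y_val, y_predict))
```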
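For the hyperparameter-tuning suggestion, one reasonable pattern is to wrap the vectorizer and the classifier in a Pipeline and let GridSearchCV cross-validate over a small grid. A minimal sketch, assuming df_train with the clean_text and target columns created in the notebook; the grid values are illustrative, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Keeping the vectorizer inside the pipeline means every cross-validation fold
# is vectorized using only that fold's training portion.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Illustrative grid; widen or shrink it depending on your runtime budget.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5, n_jobs=-1)
search.fit(df_train["clean_text"], df_train["target"])  # columns created earlier in the notebook
print(search.best_params_)
print(search.best_score_)
```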
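For the embeddings suggestion, pre-trained GloVe vectors can be dropped into the same mean-embedding scheme the notebook uses for Word2Vec. A sketch using gensim's downloader is below; it fetches a small pre-trained model over the network the first time it runs, it reuses the tokenized X_train_tok / X_val_tok lists from the notebook, and attribute names can differ slightly between gensim 3.x and 4.x.

```python
import numpy as np
import gensim.downloader as api

# Small pre-trained GloVe model trained on tweets (25-dimensional vectors)
glove = api.load("glove-twitter-25")

def mean_glove(tokens):
    # Average the vectors of the tokens that exist in the GloVe vocabulary
    vectors = [glove[w] for w in tokens if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(glove.vector_size)

# X_train_tok / X_val_tok are the tokenized tweets from the notebook's split
X_train_vectors_glove = np.array([mean_glove(tokens) for tokens in X_train_tok])
X_val_vectors_glove = np.array([mean_glove(tokens) for tokens in X_val_tok])
```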
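Finally, to make the difference between the two Bag-of-Words scoring schemes from Step 4 concrete, here is a minimal, self-contained sketch on a made-up three-sentence corpus (illustration only, not part of the competition data): CountVectorizer stores raw term counts, while TfidfVectorizer re-weights them so that words occurring in many documents get relatively low weight.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# A tiny made-up corpus, just to show the two scoring schemes side by side
corpus = [
    "forest fire near the lake",
    "the weather near the lake is nice",
    "fire evacuation ordered for the town",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)   # raw counts per document
print(sorted(count_vec.vocabulary_.keys()))
print(counts.toarray())

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus)    # counts re-weighted by inverse document frequency
print(tfidf.toarray().round(2))            # a word like "the" ends up with relatively low weight
```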
69 |
--------------------------------------------------------------------------------
/NLP text classification model Github.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### IMPORTING PACKAGES"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "name": "stderr",
17 | "output_type": "stream",
18 | "text": [
19 | "[nltk_data] Downloading package punkt to /Users/ranivija/nltk_data...\n",
20 | "[nltk_data] Package punkt is already up-to-date!\n",
21 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
22 | "[nltk_data] /Users/ranivija/nltk_data...\n",
23 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
24 | "[nltk_data] date!\n",
25 | "[nltk_data] Downloading package wordnet to\n",
26 | "[nltk_data] /Users/ranivija/nltk_data...\n",
27 | "[nltk_data] Package wordnet is already up-to-date!\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "import pandas as pd\n",
33 | "import numpy as np\n",
34 | "\n",
35 | "import seaborn as sns\n",
36 | "import matplotlib.pyplot as plt\n",
37 | "\n",
38 | "#for text pre-processing\n",
39 | "import re, string\n",
40 | "import nltk\n",
41 | "from nltk.tokenize import word_tokenize\n",
42 | "from nltk.corpus import stopwords\n",
43 | "from nltk.stem import SnowballStemmer\n",
44 | "from nltk.corpus import wordnet\n",
45 | "from nltk.stem import WordNetLemmatizer\n",
46 | "\n",
47 | "nltk.download('punkt')\n",
48 | "nltk.download('averaged_perceptron_tagger')\n",
49 | "nltk.download('wordnet')\n",
50 | "nltk.download('stopwords') #the stop-word list used in pre-processing also needs to be downloaded\n",
51 | "\n",
52 | "#for model-building\n",
53 | "from sklearn.model_selection import train_test_split\n",
54 | "from sklearn.linear_model import LogisticRegression\n",
55 | "from sklearn.linear_model import SGDClassifier\n",
56 | "from sklearn.naive_bayes import MultinomialNB\n",
57 | "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix\n",
58 | "from sklearn.metrics import roc_curve, auc, roc_auc_score\n",
59 | "\n",
60 | "# bag of words\n",
61 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
62 | "from sklearn.feature_extraction.text import CountVectorizer\n",
63 | "\n",
64 | "#for word embedding\n",
65 | "import gensim\n",
66 | "from gensim.models import Word2Vec #Word2Vec is mostly used for huge datasets"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Loading Data"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "(7613, 5)\n"
86 | ]
87 | },
88 | {
89 | "data": {
90 | "text/html": [
91 | "
" 160 | ], 161 | "text/plain": [ 162 | " id keyword location text \\\n", 163 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n", 164 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n", 165 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n", 166 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n", 167 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n", 168 | "\n", 169 | " target \n", 170 | "0 1 \n", 171 | "1 1 \n", 172 | "2 1 \n", 173 | "3 1 \n", 174 | "4 1 " 175 | ] 176 | }, 177 | "execution_count": 2, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "#you can download the data from https://www.kaggle.com/c/nlp-getting-started/data\n", 184 | "import os\n", 185 | "os.chdir('/Users/ranivija/Desktop/')\n", 186 | "df_train=pd.read_csv('train.csv')\n", 187 | "print(df_train.shape)\n", 188 | "df_train.head()" 189 | ] 190 | }, 191 | { 192 | "cell_type": "markdown", 193 | "metadata": {}, 194 | "source": [ 195 | "## EDA" 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": 3, 201 | "metadata": {}, 202 | "outputs": [ 203 | { 204 | "name": "stdout", 205 | "output_type": "stream", 206 | "text": [ 207 | "0 4342\n", 208 | "1 3271\n", 209 | "Name: target, dtype: int64\n" 210 | ] 211 | }, 212 | { 213 | "data": { 214 | "text/plain": [ 215 | "" 216 | ] 217 | }, 218 | "execution_count": 3, 219 | "metadata": {}, 220 | "output_type": "execute_result" 221 | }, 222 | { 223 | "data": { 224 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYsAAAD4CAYAAAAdIcpQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAANyElEQVR4nO3df6hf9X3H8edLrXVbqdqaiiS6yBoqSrvWZtYijE03ja5tZNhicWsoYVmHg+4H63SMybRCu425FtZCOoOx2Fqxg0jnKMHalUGrJq31J847nTPB1mjU6bo6o+/98f3EfY333s9Vcu73m9znAy73nM853/P9BAJPzvece76pKiRJms8hk56AJGn6GQtJUpexkCR1GQtJUpexkCR1HTbpCQzhmGOOqZUrV056GpJ0QNm+ffsTVbVstm0HZSxWrlzJtm3bJj0NSTqgJHlkrm1+DCVJ6jIWkqQuYyFJ6jIWkqQuYyFJ6jIWkqQuYyFJ6jIWkqQuYyFJ6joo/4J7f3jvn1w76SloCm3/649NegrSRHhmIUnqMhaSpC5jIUnqMhaSpC5jIUnqMhaSpC5jIUnqMhaSpC5jIUnqMhaSpC5jIUnqMhaSpK7BY5Hk0CQ/SPKNtn5iktuSzCT5WpLD2/gb2/pM275y7BiXtvEHkpwz9JwlSa+0GGcWnwTuH1v/LHBVVb0deApY38bXA0+18avafiQ5GbgQOAVYA3whyaGLMG9JUjNoLJKsAH4D+Ie2HuBM4Ma2y2bg/La8tq3Ttp/V9l8LXF9Vz1fVw8AMcNqQ85YkvdLQZxZ/B3wKeKmtvxV4uqr2tPUdwPK2vBx4FKBtf6bt//L4LK95WZINSbYl2bZr1679/M+QpKVtsFgk+QDweFVtH+o9xlXVxqpaXVWrly1bthhvKUlLxpDflHcG8KEk5wFHAG8GPgccleSwdvawAtjZ9t8JHA/sSHIYcCTw5Nj4XuOvkSQtgsHOLKrq0qpaUVUrGV2g/lZVXQTcClzQdlsHbGnLN7V12vZvVVW18Qvb3VInAquA24eatyTp1SbxHdx/Clyf5NPAD4Cr2/jVwJeTzAC7GQWGqro3yQ3AfcAe4OKqenHxpy1JS9eixKKqvg18uy0/xCx3M1XVT4EPz/H6K4Erh5uhJGk+/gW3JKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnLWEiSuoyFJKnrsElPQNJr85+Xv3PSU9AUOuEv7h70+J5ZSJK6jIUkqctYSJK6jIUkqctYSJK6jIUkqctYSJK6jIUkqctYSJK6BotFkiOS3J7kh0nuTfKXbfzEJLclmUnytSSHt/E3tvWZtn3l2LEubeMPJDlnqDlLkmY35JnF88CZVfWLwLuBNUlOBz4LXFVVbweeAta3/dcDT7Xxq9p+JDkZuBA4BVgDfCHJoQPOW5K0j8FiUSPPtdU3tJ8CzgRubOObgfPb8tq2Ttt+VpK08eur6vmqehiYAU4bat6SpFcb9JpFkkOT3Ak8DmwF/h14uqr2tF12AMvb8nLgUYC2/RngrePjs7xm/L02JNmWZNuuXbsG+NdI0tI1aCyq6sWqejewgtHZwEkDvtfGqlpdVauXLVs21NtI0pK0KHdDVdXTwK3A+4Gjkux9NPoKYGdb3gkcD9C2Hwk8OT4+y2skSYtgyLuhliU5qi3/DPDrwP2MonFB220dsKUt39TWadu/VVXVxi9sd0udCKwCbh9q3pKkVxvyy4+OAza3O5cOAW6oqm8kuQ+4PsmngR8AV7f9rwa+nGQG2M3oDiiq6t4kNwD3AXuAi6vqxQHnLUnax2CxqKq7gPfMMv4Qs9zNVFU/BT48x7GuBK7c33OUJC2Mf8EtSeoyFpKkLmMhSeoy
FpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSerqxqJ9O113TJJ08FrImcXXZxm7cX9PRJI0veb8prwkJwGnAEcm+c2xTW8Gjhh6YpKk6THf16q+A/gAcBTwwbHxZ4HfGXBOkqQpM2csqmoLsCXJ+6vqu4s4J0nSlFnINYsnk9yS5B6AJO9K8ucDz0uSNEUWEosvAZcCLwBU1V3AhUNOSpI0XRYSi5+tqtv3GdszxGQkSdNpIbF4IskvAAWQ5ALgsUFnJUmaKvPdDbXXxcBG4KQkO4GHgd8adFaSpKnSjUVVPQT8WpKfAw6pqmeHn5YkaZp0Y5Hkj/ZZB3gG2F5Vdw4zLUnSNFnINYvVwCeA5e3nd4E1wJeSfGrAuUmSpsRCrlmsAE6tqucAklwG/BPwy8B24K+Gm54kaRos5MzibcDzY+svAMdW1f/sMy5JOkgt5MziOuC2JFva+geBr7QL3vcNNjNJ0tSYNxYZXc2+Bvhn4Iw2/Imq2taWLxpuapKkaTFvLKqqktxcVe8Ets23ryTp4LWQaxbfT/JLg89EkjS1FnLN4n3ARUkeAf4bCKOTjncNOjNJ0tRYSCzOGXwWkqSptpDHfTwCkORt+HWqkrQkda9ZJPlQkgcZPUDwX4D/YHR3lCRpiVjIBe4rgNOBf6uqE4GzgO/1XpTk+CS3Jrkvyb1JPtnG35Jka5IH2++j23iSfD7JTJK7kpw6dqx1bf8Hk6x7Xf9SSdLrtpBYvFBVTwKHJDmkqm5l9Lyonj3AH1fVyYxic3GSk4FLgFuqahVwS1sHOBdY1X42AF+EUVyAyxhdaD8NuGxvYCRJi2MhsXg6yZuA7wDXJfkc8FzvRVX1WFV9vy0/C9zP6EGEa4HNbbfNwPlteS1wbY18DzgqyXGMLrBvrardVfUUsJXRgwwlSYtkIXdD/RD4CfCHjP5i+0jgTa/lTZKsBN4D3MbouVJ7v2nvR8CxbXk58OjYy3bw/0+6nW183/fYwOiMhBNOOOG1TE+S1LGQWPxqVb0EvEQ7I0hy10LfoJ2VfB34g6r6r/Z9GMDLfyFer23Ks6uqjYy+0Y/Vq1fvl2NKkkbm/Bgqye8luZvR16neNfbzMLCgWCR5A6NQXFdV/9iGf9w+XqL9fryN7wSOH3v5ijY217gkaZHMd83iK4yeMLul/d77896q6n4Hd3sI4dXA/VX1t2ObbgL23tG0rh1/7/jH2l1RpwPPtI+rvgmcneTodmH77DYmSVokc34MVVXPMPr61I++zmOfAfw2cHeSO9vYnwGfAW5Ish54BPhI23YzcB4ww+gaycfbPHYnuQK4o+13eVXtfp1zkiS9Dgu5ZvG6VNW/MnqO1GzOmmX/Ai6e41ibgE37b3aSpNdiIbfOSpKWOGMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkrsFikWRTkseT3DM29pYkW5M82H4f3caT5PNJZpLcleTUsdesa/s/mGTdUPOVJM1tyDOLa4A1+4xdAtxSVauAW9o6wLnAqvazAfgijOICXAa8DzgNuGxvYCRJi2ewWFTVd4Dd+wyvBTa35c3A+WPj19bI94CjkhwHnANsrardVfUUsJVXB0iSNLDFvmZxbFU91pZ/BBzblpcDj47tt6ONzTX+Kkk2JNmWZNuuXbv276wlaYmb2AXuqiqg9uPxNlbV6qpavWzZsv11WEkSix+LH7ePl2i/H2/jO4Hjx/Zb0cbmGpckLaLFjsVNwN47mtYBW8bGP9buijodeKZ9XPVN4OwkR7cL22e3MUnSIjpsqAMn+SrwK8AxSXYwuqvpM8ANSdYDjwAfabvfDJwHzAA/AT4OUFW7k1wB3NH2u7yq9r1oLkka2GCxqKqPzrHprFn2LeDiOY6zCdi0H6cmSXqN/AtuSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVKXsZAkdRkLSVLXAROLJGuSPJBkJsklk56PJC0lB0QskhwK/D1wLnAy8NEkJ092VpK0dBwQsQBOA2aq6qGq+l/gemDthOckSUvGYZOewAItBx4dW98BvG98hyQbgA1t9bkkDyzS3JaCY4AnJj2JaZC/WTfpKeiV/L+512XZH0f5+bk2HCix6KqqjcDGSc/jYJRkW1WtnvQ8pH35f3PxHCgfQ+0Ejh9bX9HGJEmL4ECJxR3AqiQnJjkcuBC4acJzkqQl44D4GKqq9iT5feCbwKHApqq6d8LTWkr8eE/Tyv+biyRVNek5SJKm3IHyMZQkaYKMhSSpy1hoXj5mRdMoyaYkjye5Z9JzWSqMhebkY1Y0xa4B1kx6EkuJsdB8fMyKplJVfQfYPel5LCXGQvOZ7TEryyc0F0kTZCwkSV3GQvPxMSuSAGOh+fmYFUmAsdA8qmoPsPcxK/cDN/iYFU2DJF8Fvgu8I8mOJOsnPaeDnY/7kCR1eWYhSeoyFpKkLmMhSeoyFpKkLmMhSeoyFpKkLmMhSer6P5LNNm9bNfkXAAAAAElFTkSuQmCC\n", 225 | "text/plain": [ 226 | "
" 227 | ] 228 | }, 229 | "metadata": { 230 | "needs_background": "light" 231 | }, 232 | "output_type": "display_data" 233 | } 234 | ], 235 | "source": [ 236 | "# CLASS DISTRIBUTION\n", 237 | "#if dataset is balanced or not\n", 238 | "x=df_train['target'].value_counts()\n", 239 | "print(x)\n", 240 | "sns.barplot(x.index,x)" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": 4, 246 | "metadata": {}, 247 | "outputs": [ 248 | { 249 | "data": { 250 | "text/plain": [ 251 | "id 0\n", 252 | "keyword 61\n", 253 | "location 2533\n", 254 | "text 0\n", 255 | "target 0\n", 256 | "dtype: int64" 257 | ] 258 | }, 259 | "execution_count": 4, 260 | "metadata": {}, 261 | "output_type": "execute_result" 262 | } 263 | ], 264 | "source": [ 265 | "#Missing values\n", 266 | "df_train.isna().sum()" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 5, 272 | "metadata": {}, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "15.167532864567411\n", 279 | "14.704744357438969\n", 280 | "108.11342097217977\n", 281 | "95.70681713496084\n", 282 | "14.664934270865178\n", 283 | "14.09649930907416\n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "#1. WORD-COUNT\n", 289 | "df_train['word_count'] = df_train['text'].apply(lambda x: len(str(x).split()))\n", 290 | "print(df_train[df_train['target']==1]['word_count'].mean()) #Disaster tweets\n", 291 | "print(df_train[df_train['target']==0]['word_count'].mean()) #Non-Disaster tweets\n", 292 | "#Disaster tweets are more wordy than the non-disaster tweets\n", 293 | "\n", 294 | "#2. CHARACTER-COUNT\n", 295 | "df_train['char_count'] = df_train['text'].apply(lambda x: len(str(x)))\n", 296 | "print(df_train[df_train['target']==1]['char_count'].mean()) #Disaster tweets\n", 297 | "print(df_train[df_train['target']==0]['char_count'].mean()) #Non-Disaster tweets\n", 298 | "#Disaster tweets are longer than the non-disaster tweets\n", 299 | "\n", 300 | "#3. 
UNIQUE WORD-COUNT\n", 301 | "df_train['unique_word_count'] = df_train['text'].apply(lambda x: len(set(str(x).split())))\n", 302 | "print(df_train[df_train['target']==1]['unique_word_count'].mean()) #Disaster tweets\n", 303 | "print(df_train[df_train['target']==0]['unique_word_count'].mean()) #Non-Disaster tweets" 304 | ] 305 | }, 306 | { 307 | "cell_type": "code", 308 | "execution_count": 6, 309 | "metadata": {}, 310 | "outputs": [ 311 | { 312 | "data": { 313 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAlYAAAEVCAYAAAAigatAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAibklEQVR4nO3de5glVXnv8e9PQOUgcUAmE27DqBANMRHIqBiJGomJIMmQcxQlJoyEZDQHDT56TkRzQYzX5CRekhMURR28AUEJxHCiBAXxAjoookIMI0KYYWBG7ki8oO/5o1bDpunp3j1T3b275/t5nv3sqlW1a7+7hlq8tdbqWqkqJEmStPUeMtcBSJIkLRQmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSTMuyeuSfGiu45CkmWZiJW2Dkrwmyf8bV3bNZspeOLvRzZ4klWTfWf7ODyR5w2x+p6TZY2IlbZs+C/xyku0AkuwO7AAcOK5s37bv0JJs33OsW20UY5K0MJlYSdumL9MlUge09V8BPgN8a1zZt6vqxiR7JDkvya1J1ib5w7EDtW6+s5N8KMmdwIuTPDrJxUnuSnIBsNvA/g9v+96S5PYkX06yZKIgk1zXWteuSnJbkvcnefjA9iOSXNGO84Ukvzjus69OciXwvfHJVZKxhPFrSe5O8oIW8/9o25/WWrSe29YPTXLFwOd/P8nVLa5PJtlnYNvjk1zQzte3khzVylcBLwL+pH3nP0/x7yRpnjGxkrZBVfVD4DLg6a3o6cAlwOfGlY0lH2cA64A9gOcBb0ryrIFDrgDOBhYBHwY+AlxOl1D9JbByYN+VwCOBvYFHAS8F/muScF8E/AbwWOBngT8DSHIg8D7gJe047wbOS/Kwgc8eDTwXWFRV9447B2O/84lV9YiqOhO4GHhmK38GcO3A+XhG206SFcBrgf8OLKY7dx9t23YCLmjn4KeBFwL/kGT/qjq1nZ+/at/5m5P8bknzkImVtO26mPuThl+hSw4uGVd2cZK9gacBr66q71fVFcB7gWMGjvXFqvqnqvoJXaLxJODPq+oHVfVZYLBl5kd0idC+VfXjqrq8qu6cJM6/r6obqupW4I10yRLAKuDdVXVZO85q4AfAwQOffWf77GSJ2/hz8oy2/HTgzQPr9yVWdMngm6vq6pawvQk4oLVaHQFcV1Xvr6p7q+qrwMeA5w8Zg6R5zMRK2nZ9Fjgkya7A4qq6BvgC3dirXYEntH32AG6tqrsGPns9sOfA+g0Dy3sAt1XV98btP+aDwCeBM5LcmOSvkuwwSZyDx76+HR9gH+BVrRvw9iS307WC7bGZzw7ji8DPtq7JA4DTgb2T7AY8mftb8PYB3jHwvbcCoTsn+wBPGRfXi4CfmWYskuYhB3RK264v0nXJ/SHweYCqujPJja3sxqr6TpJ7gV2T7DyQXC0F1g8cqwaWNwC7JNlpILlaOrZPVf0IOBk4Ocky4Hy6sV2nbSbOvQeWlwI3tuUbgDdW1Rsn+Y01ybYH71x1T5LLgROAb1TVD5N8AXgl3Xiz74777g+PP0Zrtbq4qp7dR0yS5hdbrKRtVOseW0OXNFwysOlzreyzbb8b6Fqy3twGnv8icBww4XOpqur6dtyTkzw0ySHAfWOJkvxqkl9of314J13X4E8mCfX4JHu1VrQ/Bc5s5e8BXprkKenslOS5SXaexmm4GXjMuLKLgZdxf7ffRePWAd4FvCbJz7ff9MgkY119n6Br9fq9JDu015OS/Nwk3ylpgTCxkrZtF9MNsP7cQNklrWzwMQtHA8voWovOAU6qqn+b5Li/AzyFrovsJLoutTE/QzfQ/U7g6hbDByc51keAT9ENJP828AaAqlpD17L298BtwFrgxZMcZyKvA1a3LrujWtnFwM7c//vHr1NV5wBvpevOvBP4BnBY23YX8Ot0g9ZvBG5q+44Nqj8N2L995z9NM15JIy5VtkpLGk1JrgP+YIokTpJGhi1WkiRJPTGxkiRJ6omJ1TYkybuS/PlcxyENq6qW2Q2o6cjAhN9JlrYn3G8313Fp22FitUC06Tv+q00hMja9x0uT3PdvXFUvraq/nMEYep/QdiaOOcR3OkmutIVaXbSxPYF+rOwPklw027FU1X+2J9z/eCaOP5jEjfIxh/jOZa2u9RFMPTCxWlh+s6p2pntA4VuAV7P5ZwONFC9oaUHZju5ZYJqE9d7CZGK1AFXVHVV1HvACYGWSJ8ADW2KS7JbkE61169Ykl4y1biU5Mcm3W+vXVUl+e+zYSfZNN1HtHUm+m+TMVv6gCW1b+YKYJLfFub6dk28lOXTr/pWkBe2vgf+VZNFEG5P8crrJt+9o7788sO2iJH+Z5PPtevtUuiffTyiTT/j9gJaYJC9Ocm3b9ztJXtTKH5vk0+kmBv9ukg8Pxj7R9Z/kOXTzRb6g1RVfa/s+MslpSTa0z7whrSuyff/nk7wtyS10j/sY/C0POma65759fWCfC5J8eWD9kiRHtuU9knwsyab2+/54YL+HDNTttyQ5K92z4eD+R4nc3r73qZur6zWEqvK1AF7AdcCvTVD+n8AfteUPAG9oy2+me8jhDu31K9z/+I3n000L8hC65Ox7wO5t20fpHtL4EODhwCED31V087+NrR8IbKR7ntF2dJPvXgc8bCDmK+ierL3jZn7X+GO+Hvi7tvxauucavXVg2zva8gq65xr9HN0MA38GfKFt24nuydnHtm0HAt8F9h9/ntr649r+e7T1ZcBj5/rf3JevUXyN1UXAxwfqmz8ALmrLu9I9d+z32vV3dFt/VNt+UbuufxbYsa2/ZZLv+yLwt3TPCXs6cBfwobZtWatDtm/X/Z3A49q23YGfb8v7As9ux1hMl2i8vW3b7PVPlxh9aFw859BNCL4T3fPgvgS8pG17MXAv8PIW04PqvfHHbOfg+3QJ4w50D5hdT/dstR3pJjB/FF2dfDnwF8BD6R5Cey3wG+04JwCXAnu13/lu4KPjz9PA9262rvc1+csWq4XvRrqKbLwf0VUs+1TVj6rqkmpXU1X9Y1XdWFU/qaozgWvo5kkb+9w+dJX
M96vqcxMce8xCmST3x3QV0f5Jdqiq66rq20PGK22r/gJ4eZLF48qfC1xTVR9s199HgX9n4On8wPur6j9avXAW3byND5JkKZNP+D3eT4AnJNmxqjZU1TcBqmptVV3QjrGJLlEbq1eGvv7TzTF5OPCKqvpeVW0E3kb3sNgxN1bV37XfPmW91/b5Ml1990vA1+imoHoaXV16TVXd0s7D4qp6fVX9sKqupZudYOy7Xwr8aVWtq6of0CVwzxvfUzBgOnW9BphYLXx70j39ery/pmvR+VRrGj9xbEOSY3J/993tdJPxjjWv/wndZLNfSvLNJL8/yXcviElyq2ot8Aq6imhjkjOS7DHRvpI6VfUNuul9Thy3aQ8eOCk3PHhS75sGlu8BHgH3/WXz3e31Wqae8Hswnu/RtcC/FNiQ5F+SPL4dd0m7rtene5L+h2h13jSv/33oWpU2DNQt76ZruRoz3ToPuhvGZ9IlVxfTteI9gwfeTO4D7DGuXnstsGRg+zkD266mSxrHto83nbpeA0ysFrAkT6KrrB50p1FVd1XVq6rqMcBvAa9s4wb2obvLeRld0/wiuuk60j53U1X9YVXtAbwE+Ids/q/2xiaqXTTw+m/tDvW+UKbzm6rqHrrm7vsmyaWbx26iSXJfMu67d6yqL7RtF4/b9oiq+qPNxVRVH6mqQ+gqp6KbokTS5E6im3ZoMGm6ke46GjR+Uu8JVfeXzY9orzcxMOH3uGNt7vOfrG5y7N3pWsne0za9ie66/oWq+ingd2l1Xvvc5q7/8XXFDXSt8rsN1C0/VVU/PxjGVD9zgrLxidVYy/1gYnUD8J1x9drOVXX4wPbDxm1/eFWtn+g7p1nXa4CJ1QKU5KeSHAGcQddX//UJ9jmiDU4McAfdnctP6MYFFLCp7XcsXYvV2Oeen2Svtnpb23dsAt3xk8suiElykzwuybOSPIxurMN/MfmkwZK4r7XnTOCPB4rPp7v+fifJ9un+0GV/uutyusefdMLvQa1VakVLwn4A3M391/HObf2OJHsC/3vgc5Nd/zcDy9L+8KeqNtDNa/k3rR5+SLqB8WPdisN4wDGbL9CN9Xoy8KXWhbkP3fjVsVb6LwF3pRtov2OS7ZI8od1gQ1cnvrHdPJNkcZIVbdum9psG673J6npNwsRqYfnnJHfR3Zn8Kd04gWM3s+9+wL/RVSZfBP6hqj5TVVcBf9PKbgZ+ga4/f8yTgMuS3A2cB5zQ+vJh3IS2tXAmyX0Y3eMrvtv2/WngNdP8HdK26vV0N2wAtPFARwCvAm6h63I6YqC1ebomm/B70EPoWrZvbPs+AxhrpT4ZOIjuJvNf6Abej5ns+v/H9n5Lkq+05WPoBo9fRVfvnU3XQjasBx2zdWN+Bfhma6WHro6+vo3jorpndR1BN0TiOy3e9wKPbPu/g67O/lT7/8SldOdtrCfgjcDnW713MJPX9ZqEkzBLkiT1xBYrSZKknphYSZIk9cTESpIkqScmVpIkST0ZiQkgd9ttt1q2bNlchyFpFl1++eXfrarxT+Wed6y/pG3PZPXXSCRWy5YtY82aNXMdhqRZlGTCJ2TPN9Zf0rZnsvrLrkBJkqSemFhJkiT1xMRKkiSpJyZWkiRJPTGxkiRJ6omJlSRJUk9MrCRJknpiYiVJktQTEytJkqSeTPnk9SSPA84cKHoM8BfA6a18GXAdcFRV3ZYkwDuAw4F7gBdX1Vf6DVsLRjJzx66auWNL2ubl5Jmrv+ok66/5asoWq6r6VlUdUFUHAL9ElyydA5wIXFhV+wEXtnWAw4D92msVcMoMxC1JkjRyptsVeCjw7aq6HlgBrG7lq4Ej2/IK4PTqXAosSrJ7H8FKkiSNsukmVi8EPtqWl1TVhrZ8E7CkLe8J3DDwmXWt7AGSrEqyJsmaTZs2TTMMSZKk0TN0YpXkocBvAf84fltVFTCtDuGqOrWqllfV8sWLF0/no5I0lCSPS3LFwOvOJK9IsmuSC5Jc0953afsnyTuTrE1yZZKD5vo3SJpfptNidRjwlaq6ua3fPNbF1943tvL1wN4Dn9urlUnSrHKMqKTZNp3E6mju7wYEOA9Y2ZZXAucOlB/T7vwOBu4Y6DKUpLniGFFJM27Kxy0AJNkJeDbwkoHitwBnJTkOuB44qpWfT/eohbV0d4fH9hatJG25rRkj6s2hpKEMlVhV1feAR40ru4XuDnD8vgUc30t0ktSDgTGirxm/raoqybTGiCZZRddVyNKlS3uJUdLC4JPXJW0Leh0j6h/fSNocEytJ2wLHiEqaFUN1BUrSfOUY0dHn1DBaSEysJC1ojhHdts1k0iZNxK5ASZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST1xEuaFIjM40Wg5O7wkScOwxUqSJKknJlaSJEk9MbGSJEnqyVCJVZJFSc5O8u9Jrk7y1CS7JrkgyTXtfZe2b5K8M8naJFcmOWhmf4IkSdJoGLbF6h3Av1bV44EnAlcDJwIXVtV+wIVtHeAwYL/2WgWc0mvEkiRJI2rKxCrJI4GnA6cBVNUPq+p2YAWwuu22GjiyLa8ATq/OpcCiJLv3HLckSdLIGabF6tHAJuD9Sb6a5L1JdgKWVNWGts9NwJK2vCdww8Dn17WyB0iyKsmaJGs2bdq05b9AkiRpRAyTWG0PHAScUlUHAt/j/m4/AKqqgGk97KiqTq2q5VW1fPHixdP5qCQNzTGikmbTMInVOmBdVV3W1s+mS7RuHuvia+8b2/b1wN4Dn9+rlUnSXHCMqKRZM2ViVVU3ATckeVwrOhS4CjgPWNnKVgLntuXzgGPand/BwB0DXYaSNGscIypptg07pc3LgQ8neShwLXAsXVJ2VpLjgOuBo9q+5wOHA2uBe9q+kjQXBseIPhG4HDiB6Y8RfcDNYZJVdC1aLF26dMaClzT/DJVYVdUVwPIJNh06wb4FHL91YUlSL8bGiL68qi5L8g4mGCOaZNpjRIFTAZYvX+5kmpLu45PXJS1kjhGVNKtMrCQtWI4RlTTbhh1jJUnzlWNENe/k5MzYseske69nkomVpAXNMaKSZpNdgZIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPnCtQkjSlmZwUWFpIbLGSJEnqiYmVJElST0ysJEmSemJiJUmS1JOhEqsk1yX5epIrkqxpZbsmuSDJNe19l1aeJO9MsjbJlUkOmskfIG1WMnMvSZImMJ0Wq1+tqgOqanlbPxG4sKr2Ay5s6wCHAfu11yrglL6ClSRJGmVb0xW4AljdllcDRw6Un16dS4FFSXbfiu+RJEmaF4ZNrAr4VJLLk6xqZUuqakNbvglY0pb3BG4Y+Oy6VvYASVYlWZNkzaZNm7YgdEmamkMZJM2mYROrQ6rqILpuvuOTPH1wY1UVXfI1tKo6taqWV9XyxYsXT+ejkjRdDmWQNCuGSqyqan173w
icAzwZuHmsi6+9b2y7rwf2Hvj4Xq1MkkaFQxkkzYgpE6skOyXZeWwZ+HXgG8B5wMq220rg3LZ8HnBMa1I/GLhjoMtQkmZb70MZJGlzhpkrcAlwTro/Md8e+EhV/WuSLwNnJTkOuB44qu1/PnA4sBa4Bzi296glaXiHVNX6JD8NXJDk3wc3VlUlmdZQhpagrQJYunRpf5FKmvemTKyq6lrgiROU3wIcOkF5Acf3Ep0kbaXBoQxJHjCUoao2bMlQhqo6FTgVYPny5dNKyiQtbD55XdKC5VAGSbNtmK5ASZqvHMogaVaZWElasBzKIGm22RUoSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST3Zfq4D2KYkcx2BJEmaQbZYSZIk9cQWK03NljZJkoZii5UkSVJPhk6skmyX5KtJPtHWH53ksiRrk5yZ5KGt/GFtfW3bvmyGYpckSRop0+kKPAG4Gviptv5W4G1VdUaSdwHHAae099uqat8kL2z7vaDHmCVJ0hbKyTM3vKNOqhk79nwxVItVkr2A5wLvbesBngWc3XZZDRzZlle0ddr2Q9v+kiRJC9qwXYFvB/4E+ElbfxRwe1Xd29bXAXu25T2BGwDa9jva/g+QZFWSNUnWbNq0acuil6QhOJRB0myZMrFKcgSwsaou7/OLq+rUqlpeVcsXL17c56ElabyxoQxjxoYy7AvcRjeEAQaGMgBva/tJ0tCGabF6GvBbSa4DzqDrAnwHsCjJ2BitvYD1bXk9sDdA2/5I4JYeY5akoTmUQdJsmjKxqqrXVNVeVbUMeCHw6ap6EfAZ4Hltt5XAuW35vLZO2/7pqnI0m6S58nYcyiBplmzNc6xeDbwyyVq6iue0Vn4a8KhW/krgxK0LUZK2jEMZJM22aT15vaouAi5qy9cCT55gn+8Dz+8hNknaWmNDGQ4HHk73uJj7hjK0VqmJhjKscyiDpC3hk9clLVgOZZA020ysJG2LHMogaUY4CbOkbYJDGSTNBlusJEmSemJiJUmS1BMTK0mSpJ6YWEmSJPXExEqSJKknJlaSJEk9MbGSJEnqiYmVJElST0ysJEmSemJiJUmS1BOntJGkBSInZ65DkLZ5tlhJkiT1xMRKkiSpJyZWkiRJPTGxkiRJ6omJlSRJUk9MrCRJknoyZWKV5OFJvpTka0m+meTkVv7oJJclWZvkzCQPbeUPa+tr2/ZlM/wbJEmSRsIwLVY/AJ5VVU8EDgCek+Rg4K3A26pqX+A24Li2/3HAba38bW0/SZKkBW/KxKo6d7fVHdqrgGcBZ7fy1cCRbXlFW6dtPzSJT62TNOtscZc024YaY5VkuyRXABuBC4BvA7dX1b1tl3XAnm15T+AGgLb9DuBRExxzVZI1SdZs2rRpq36EJG2GLe6SZtVQiVVV/biqDgD2Ap4MPH5rv7iqTq2q5VW1fPHixVt7OEl6EFvcJc22af1VYFXdDnwGeCqwKMnYXIN7Aevb8npgb4C2/ZHALX0EK0nTZYu7pNk0zF8FLk6yqC3vCDwbuJouwXpe220lcG5bPq+t07Z/uqqqx5glaWi2uEuaTdtPvQu7A6uTbEeXiJ1VVZ9IchVwRpI3AF8FTmv7nwZ8MMla4FbghTMQtyRNS1XdnuQBLe6tVWqiFvd1trhL2hJTJlZVdSVw4ATl19Ld/Y0v/z7w/F6ik6StkGQx8KOWVI21uL+V+1vcz2DiFvcvYou7pC0wTIuVJM1XtrhLmlUmVpIWLFvcJc025wqUJEnqiS1W0paYyUcbOaRHkuYtW6wkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPpkyskuyd5DNJrkryzSQntPJdk1yQ5Jr2vksrT5J3Jlmb5MokB830j5AkSRoFw7RY3Qu8qqr2Bw4Gjk+yP3AicGFV7Qdc2NYBDgP2a69VwCm9Ry1JkjSCpkysqmpDVX2lLd8FXA3sCawAVrfdVgNHtuUVwOnVuRRYlGT3vgOXpKnY4i5ptk1rjFWSZcCBwGXAkqra0DbdBCxpy3sCNwx8bF0rk6TZZou7pFm1/bA7JnkE8DHgFVV1Z5L7tlVVJanpfHGSVXQVF0uXLp3ORyVpKO3mb0NbvivJYIv7M9tuq4GLgFcz0OIOXJpkUZLdB24iJU0iJ2fqnbZQnTStNGPODNVilWQHuqTqw1X18VZ881gXX3vf2MrXA3sPfHyvVvYAVXVqVS2vquWLFy/e0vglaSh9trgnWZVkTZI1mzZtmrmgJc07w/xVYIDTgKur6m8HNp0HrGzLK4FzB8qPaWMVDgbu8G5P0lwa3+I+uK21Tk3rVtgbQ0mbM0xX4NOA3wO+nuSKVvZa4C3AWUmOA64HjmrbzgcOB9YC9wDH9hmwJE3HZC3uVbVhS1rcJWlzpkysqupzwOY6TQ+dYP8Cjt/KuCRpqw3R4v4WHtzi/rIkZwBPwRZ3SdM09OB1SZqHbHGXNKtMrCQtWLa4S5ptJlbjZeb+VFSSJC1sTsIsSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUExMrSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUkykTqyTvS7IxyTcGynZNckGSa9r7Lq08Sd6ZZG2SK5McNJPBS5IkjZJhWqw+ADxnXNmJwIVVtR9wYVsHOAzYr71WAaf0E6YkbRlvDiXNpikTq6r6LHDruOIVwOq2vBo4cqD89OpcCixKsntPsUrSlvgA3hxKmiVbOsZqSVVtaMs3AUva8p7ADQP7rWtlkjQnvDmUNJu239oDVFUlqel+LskqujtCli5durVhSNJ0TPfmcMNA2VbVXzk5WxCupPliS1usbh67i2vvG1v5emDvgf32amUPUlWnVtXyqlq+ePHiLQxDkrZOVRUwrZtD6y9Jm7OlidV5wMq2vBI4d6D8mDYA9GDgjoG7QkkaFVt9cyhJExnmcQsfBb4IPC7JuiTHAW8Bnp3kGuDX2jrA+cC1wFrgPcD/nJGoJWnreHMoaUZMOcaqqo7ezKZDJ9i3gOO3NihJ6ku7OXwmsFuSdcBJdDeDZ7UbxeuBo9ru5wOH090c3gMcO+sBS5rXtnrwuqSeZQYHN9e0/85k3vPmUNJsckobSZKknphYSZIk9cTESpIkqScmVpIkST0xsZIkSeqJiZUkSVJPTKwkSZJ6YmIlSZLUEx8QKkmSRl5OnsGHJwN1Uj8PULbFSpIkqScmVpIkST2Zn12BMzmXmiRJ0hayxUqSJKknJlaSJEk9MbGSJEnqiYmVJElST0ysJEmSemJiJUmS1BMTK0mSpJ7MSGKV5DlJvpVkbZITZ+I7JG2BZOZeC4h1m
KQt1XtilWQ74P8ChwH7A0cn2b/v75GkmWAdJmlrzESL1ZOBtVV1bVX9EDgDWDED3yNJM8E6TNIWm4nEak/ghoH1da1MkuYD6zBJW2zO5gpMsgpY1VbvTvKtuYplArsB353rIKYwH2KE+RGnMfYhmW6M+8xUKDNtivprlP+tRjW2UY0LRjc245q+SWPL66Y1VnSz9ddMJFbrgb0H1vdqZQ9QVacCp87A92+1JGuqavlcxzGZ+RAjzI84jbEf8yHGIU1Zh01Wf43yeRjV2EY1Lhjd2Ixr+mYrtpnoCvwysF+SRyd5KPBC4LwZ+B5JmgnWYZK2WO8tVlV1b5KXAZ8EtgPeV1Xf7Pt7JGkmWIdJ2hozMsaqqs4Hzp+JY8+SkeyiHGc+xAjzI05j7Md8iHEoW1mHjfJ5GNXYRjUuGN3YjGv6ZiW2VNVsfI8kSdKC55Q2kiRJPTGxGifJdUm+nuSKJGvmOh6AJO9LsjHJNwbKdk1yQZJr2vsuIxjj65Ksb+fyiiSHz3GMeyf5TJKrknwzyQmtfGTO5SQxjtq5fHiSLyX5Wovz5Fb+6CSXtalgzmyDv7cZozoVzijVa6Nan41qHTbK9dao1ldzXT/ZFThOkuuA5VU1Ms/hSPJ04G7g9Kp6Qiv7K+DWqnpLq8B3qapXj1iMrwPurqr/M1dxDUqyO7B7VX0lyc7A5cCRwIsZkXM5SYxHMVrnMsBOVXV3kh2AzwEnAK8EPl5VZyR5F/C1qjplLmOdLemmwvkP4Nl0DxX9MnB0VV01p4ExWvXaqNZno1qHjXK9Nar11VzXT7ZYzQNV9Vng1nHFK4DVbXk13X/Mc2YzMY6UqtpQVV9py3cBV9M9UXtkzuUkMY6U6tzdVndorwKeBZzdyuf8v8tZ5lQ4QxjV+mxU67BRrrdGtb6a6/rJxOrBCvhUksvTPV15VC2pqg1t+SZgyVwGM4mXJbmyNbPPaXfloCTLgAOByxjRczkuRhixc5lkuyRXABuBC4BvA7dX1b1tl21tKphRngpn1Ou1kbwGm5G57ka53hq1+mou6ycTqwc7pKoOopvZ/vjWPDzSquvPHcU+3VOAxwIHABuAv5nTaJokjwA+Bryiqu4c3DYq53KCGEfuXFbVj6vqALonkz8ZePzcRqRJzJt6bVSuwWZkrrtRrrdGsb6ay/rJxGqcqlrf3jcC59D9g4yim1v/9lg/98Y5judBqurm9h/3T4D3MALnsvW3fwz4cFV9vBWP1LmcKMZRPJdjqup24DPAU4FFScaejzfhdFYL2FDTec2FeVCvjdQ1OGZUrrtRrrdGvb6ai/rJxGpAkp3aADyS7AT8OvCNyT81Z84DVrbllcC5cxjLhMYu+ua3meNz2QY0ngZcXVV/O7BpZM7l5mIcwXO5OMmitrwj3YDtq+kqsOe13Ubyv8sZNJJT4cyTem1krsFBo3DdjXK9Nar11VzXT/5V4IAkj6G7m4PuqfQfqao3zmFIACT5KPBMupm5bwZOAv4JOAtYClwPHFVVczbwcjMxPpOuKbiA64CXDIwJmHVJDgEuAb4O/KQVv5ZuTMBInMtJYjya0TqXv0g3+HM7uhu0s6rq9e0aOgPYFfgq8LtV9YO5inO2tT8rfzv3T4UzCvXHSNVro1qfjWodNsr11qjWV3NdP5lYSZIk9cSuQEmSpJ6YWEmSJPXExEqSJKknJlaSJEk9MbGSJEnqiYmVJElST0ysJEmSemJiJUmS1JP/D2neI6WqpBOzAAAAAElFTkSuQmCC\n", 314 | "text/plain": [ 315 | "
" 316 | ] 317 | }, 318 | "metadata": { 319 | "needs_background": "light" 320 | }, 321 | "output_type": "display_data" 322 | } 323 | ], 324 | "source": [ 325 | "#Plotting word-count per tweet\n", 326 | "fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,4))\n", 327 | "train_words=df_train[df_train['target']==1]['word_count']\n", 328 | "ax1.hist(train_words,color='red')\n", 329 | "ax1.set_title('Disaster tweets')\n", 330 | "train_words=df_train[df_train['target']==0]['word_count']\n", 331 | "ax2.hist(train_words,color='green')\n", 332 | "ax2.set_title('Non-disaster tweets')\n", 333 | "fig.suptitle('Words per tweet')\n", 334 | "plt.show()" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "## PRE-PROCESSING" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 7, 347 | "metadata": {}, 348 | "outputs": [ 349 | { 350 | "name": "stdout", 351 | "output_type": "stream", 352 | "text": [ 353 | "this is a message to be cleaned it may involve some things like adjacent spaces and tabs\n" 354 | ] 355 | } 356 | ], 357 | "source": [ 358 | "#1. Common text preprocessing\n", 359 | "text = \" This is a message to be cleaned. It may involve some things like:
, ?, :, '' adjacent spaces and tabs . \"\n", 360 | "\n", 361 | "#convert to lowercase and remove punctuations and characters and then strip\n", 362 | "def preprocess(text):\n", 363 | " text = text.lower() #lowercase text\n", 364 | " text=text.strip() #get rid of leading/trailing whitespace \n", 365 | " text=re.compile('<.*?>').sub('', text) #Remove HTML tags/markups\n", 366 | " text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text) #Replace punctuation with space. Careful since punctuation can sometime be useful\n", 367 | " text = re.sub('\\s+', ' ', text) #Remove extra space and tabs\n", 368 | " text = re.sub(r'\\[[0-9]*\\]',' ',text) #[0-9] matches any digit (0 to 10000...)\n", 369 | " text=re.sub(r'[^\\w\\s]', '', str(text).lower().strip())\n", 370 | " text = re.sub(r'\\d',' ',text) #matches any digit from 0 to 100000..., \\D matches non-digits\n", 371 | " text = re.sub(r'\\s+',' ',text) #\\s matches any whitespace, \\s+ matches multiple whitespace, \\S matches non-whitespace \n", 372 | " \n", 373 | " return text\n", 374 | "\n", 375 | "text=preprocess(text)\n", 376 | "print(text) #text is a string" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 8, 382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "name": "stdout", 386 | "output_type": "stream", 387 | "text": [ 388 | "message cleaned may involve things like adjacent spaces tabs\n", 389 | "messag clean may involv thing like adjac space tab\n", 390 | "messag clean may involv thing like adjac space tab\n" 391 | ] 392 | } 393 | ], 394 | "source": [ 395 | "#3. LEXICON-BASED TEXT PROCESSING EXAMPLES\n", 396 | " \n", 397 | "#1. STOPWORD REMOVAL\n", 398 | "def stopword(string):\n", 399 | " a= [i for i in string.split() if i not in stopwords.words('english')]\n", 400 | " return ' '.join(a)\n", 401 | "\n", 402 | "text=stopword(text)\n", 403 | "print(text)\n", 404 | "\n", 405 | "#2. STEMMING\n", 406 | " \n", 407 | "# Initialize the stemmer\n", 408 | "snow = SnowballStemmer('english')\n", 409 | "def stemming(string):\n", 410 | " a=[snow.stem(i) for i in word_tokenize(string) ]\n", 411 | " return \" \".join(a)\n", 412 | "text=stemming(text)\n", 413 | "print(text)\n", 414 | "\n", 415 | "#3. LEMMATIZATION\n", 416 | "# Initialize the lemmatizer\n", 417 | "wl = WordNetLemmatizer()\n", 418 | " \n", 419 | "# This is a helper function to map NTLK position tags\n", 420 | "# Full list is available here: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html\n", 421 | "def get_wordnet_pos(tag):\n", 422 | " if tag.startswith('J'):\n", 423 | " return wordnet.ADJ\n", 424 | " elif tag.startswith('V'):\n", 425 | " return wordnet.VERB\n", 426 | " elif tag.startswith('N'):\n", 427 | " return wordnet.NOUN\n", 428 | " elif tag.startswith('R'):\n", 429 | " return wordnet.ADV\n", 430 | " else:\n", 431 | " return wordnet.NOUN\n", 432 | "\n", 433 | "# Tokenize the sentence\n", 434 | "def lemmatizer(string):\n", 435 | " word_pos_tags = nltk.pos_tag(word_tokenize(string)) # Get position tags\n", 436 | " a=[wl.lemmatize(tag[0], get_wordnet_pos(tag[1])) for idx, tag in enumerate(word_pos_tags)] # Map the position tag and lemmatize the word/token\n", 437 | " return \" \".join(a)\n", 438 | "\n", 439 | "text = lemmatizer(text)\n", 440 | "print(text)" 441 | ] 442 | }, 443 | { 444 | "cell_type": "code", 445 | "execution_count": 9, 446 | "metadata": {}, 447 | "outputs": [ 448 | { 449 | "data": { 450 | "text/html": [ 451 | "
" 526 | ], 527 | "text/plain": [ 528 | " id keyword location text \\\n", 529 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n", 530 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n", 531 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n", 532 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n", 533 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n", 534 | "\n", 535 | " target clean_text \n", 536 | "0 1 deed reason earthquake may allah forgive u \n", 537 | "1 1 forest fire near la ronge sask canada \n", 538 | "2 1 resident ask shelter place notify officer evac... \n", 539 | "3 1 people receive wildfire evacuation order calif... \n", 540 | "4 1 get sent photo ruby alaska smoke wildfires pou... " 541 | ] 542 | }, 543 | "execution_count": 9, 544 | "metadata": {}, 545 | "output_type": "execute_result" 546 | } 547 | ], 548 | "source": [ 549 | "#FINAL PREPROCESSING\n", 550 | "def finalpreprocess(string):\n", 551 | " return lemmatizer(stopword(preprocess(string)))\n", 552 | "\n", 553 | "df_train['clean_text'] = df_train['text'].apply(lambda x: finalpreprocess(x))\n", 554 | "df_train=df_train.drop(columns=['word_count','char_count','unique_word_count'])\n", 555 | "df_train.head()" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "### Word2Vec model" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": 10, 568 | "metadata": {}, 569 | "outputs": [ 570 | { 571 | "name": "stderr", 572 | "output_type": "stream", 573 | "text": [ 574 | "/Users/ranivija/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:9: DeprecationWarning: Call to deprecated `syn0` (Attribute will be removed in 4.0.0, use self.vectors instead).\n", 575 | " if __name__ == '__main__':\n" 576 | ] 577 | } 578 | ], 579 | "source": [ 580 | "# create Word2vec model\n", 581 | "#here words_f should be a list containing words from each document. 
say 1st row of the list is words from the 1st document/sentence\n", 582 | "#length of words_f is number of documents/sentences in your dataset\n", 583 | "df_train['clean_text_tok']=[nltk.word_tokenize(i) for i in df_train['clean_text']] #convert preprocessed sentence to tokenized sentence\n", 584 | "model = Word2Vec(df_train['clean_text_tok'],min_count=1) #min_count=1 means word should be present at least across all documents,\n", 585 | "#if min_count=2 means if the word is present less than 2 times across all the documents then we shouldn't consider it\n", 586 | "\n", 587 | "\n", 588 | "w2v = dict(zip(model.wv.index2word, model.wv.syn0)) #combination of word and its vector\n", 589 | "\n", 590 | "#for converting sentence to vectors/numbers from word vectors result by Word2Vec\n", 591 | "class MeanEmbeddingVectorizer(object):\n", 592 | " def __init__(self, word2vec):\n", 593 | " self.word2vec = word2vec\n", 594 | " # if a text is empty we should return a vector of zeros\n", 595 | " # with the same dimensionality as all the other vectors\n", 596 | " self.dim = len(next(iter(word2vec.values())))\n", 597 | "\n", 598 | " def fit(self, X, y):\n", 599 | " return self\n", 600 | "\n", 601 | " def transform(self, X):\n", 602 | " return np.array([\n", 603 | " np.mean([self.word2vec[w] for w in words if w in self.word2vec]\n", 604 | " or [np.zeros(self.dim)], axis=0)\n", 605 | " for words in X\n", 606 | " ])\n" 607 | ] 608 | }, 609 | { 610 | "cell_type": "markdown", 611 | "metadata": {}, 612 | "source": [ 613 | "### TRAIN TEST SPLITTING OF LABELLED DATASET" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 11, 619 | "metadata": {}, 620 | "outputs": [], 621 | "source": [ 622 | "#SPLITTING THE TRAINING DATASET INTO TRAINING AND VALIDATION\n", 623 | " \n", 624 | "# Input: \"reviewText\", \"rating\" and \"time\"\n", 625 | "# Target: \"log_votes\"\n", 626 | "X_train, X_val, y_train, y_val = train_test_split(df_train[\"clean_text\"],\n", 627 | " df_train[\"target\"],\n", 628 | " test_size=0.2,\n", 629 | " shuffle=True)\n", 630 | "X_train_tok= [nltk.word_tokenize(i) for i in X_train] #for word2vec\n", 631 | "X_val_tok= [nltk.word_tokenize(i) for i in X_val] #for word2vec\n", 632 | "\n", 633 | "#TF-IDF\n", 634 | "# Convert x_train to vector since model can only run on numbers and not words- Fit and transform\n", 635 | "tfidf_vectorizer = TfidfVectorizer(use_idf=True)\n", 636 | "X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train) #tfidf runs on non-tokenized sentences unlike word2vec\n", 637 | "# Only transform x_test (not fit and transform)\n", 638 | "X_val_vectors_tfidf = tfidf_vectorizer.transform(X_val) #Don't fit() your TfidfVectorizer to your test data: it will \n", 639 | "#change the word-indexes & weights to match test data. 
Rather, fit on the training data, then use the same train-data-\n", 640 | "#fit model on the test data, to reflect the fact you're analyzing the test data only based on what was learned without \n", 641 | "#it, and the have compatible\n", 642 | "\n", 643 | "\n", 644 | "#Word2vec\n", 645 | "# Fit and transform\n", 646 | "modelw = MeanEmbeddingVectorizer(w2v)\n", 647 | "X_train_vectors_w2v = modelw.transform(X_train_tok)\n", 648 | "X_val_vectors_w2v = modelw.transform(X_val_tok)" 649 | ] 650 | }, 651 | { 652 | "cell_type": "markdown", 653 | "metadata": {}, 654 | "source": [ 655 | "### Building ML models (Text-classification)" 656 | ] 657 | }, 658 | { 659 | "cell_type": "markdown", 660 | "metadata": {}, 661 | "source": [ 662 | "#### LR (tf-idf)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 12, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "name": "stdout", 672 | "output_type": "stream", 673 | "text": [ 674 | " precision recall f1-score support\n", 675 | "\n", 676 | " 0 0.82 0.84 0.83 857\n", 677 | " 1 0.79 0.76 0.77 666\n", 678 | "\n", 679 | " accuracy 0.81 1523\n", 680 | " macro avg 0.80 0.80 0.80 1523\n", 681 | "weighted avg 0.81 0.81 0.81 1523\n", 682 | "\n", 683 | "Confusion Matrix: [[723 134]\n", 684 | " [160 506]]\n", 685 | "AUC: 0.8646537435919\n" 686 | ] 687 | } 688 | ], 689 | "source": [ 690 | "#FITTING THE CLASSIFICATION MODEL using Logistic Regression(tf-idf)\n", 691 | "\n", 692 | "lr_tfidf=LogisticRegression(solver = 'liblinear', C=10, penalty = 'l2')\n", 693 | "lr_tfidf.fit(X_train_vectors_tfidf, y_train) #model\n", 694 | "\n", 695 | "#Predict y value for test dataset\n", 696 | "y_predict = lr_tfidf.predict(X_val_vectors_tfidf)\n", 697 | "y_prob = lr_tfidf.predict_proba(X_val_vectors_tfidf)[:,1]\n", 698 | " \n", 699 | "\n", 700 | "print(classification_report(y_val,y_predict))\n", 701 | "print('Confusion Matrix:',confusion_matrix(y_val, y_predict))\n", 702 | " \n", 703 | "fpr, tpr, thresholds = roc_curve(y_val, y_prob)\n", 704 | "roc_auc = auc(fpr, tpr)\n", 705 | "print('AUC:', roc_auc) " 706 | ] 707 | }, 708 | { 709 | "cell_type": "markdown", 710 | "metadata": {}, 711 | "source": [ 712 | "#### NB (tf-idf)" 713 | ] 714 | }, 715 | { 716 | "cell_type": "code", 717 | "execution_count": 14, 718 | "metadata": {}, 719 | "outputs": [ 720 | { 721 | "name": "stdout", 722 | "output_type": "stream", 723 | "text": [ 724 | " precision recall f1-score support\n", 725 | "\n", 726 | " 0 0.78 0.90 0.83 871\n", 727 | " 1 0.82 0.66 0.73 652\n", 728 | "\n", 729 | " accuracy 0.79 1523\n", 730 | " macro avg 0.80 0.78 0.78 1523\n", 731 | "weighted avg 0.80 0.79 0.79 1523\n", 732 | "\n", 733 | "Confusion Matrix: [[780 91]\n", 734 | " [224 428]]\n", 735 | "AUC: 0.8445135694815211\n" 736 | ] 737 | } 738 | ], 739 | "source": [ 740 | "#FITTING THE CLASSIFICATION MODEL using Naive Bayes(tf-idf)\n", 741 | "#It's a probabilistic classifier that makes use of Bayes' Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related. 
This algorithm is the most suitable for such large dataset as it considers each feature independently, calculates the probability of each category, and then predicts the category with the highest probability.\n", 742 | "\n", 743 | "nb_tfidf = MultinomialNB()\n", 744 | "nb_tfidf.fit(X_train_vectors_tfidf, y_train) #model\n", 745 | "\n", 746 | "#Predict y value for test dataset\n", 747 | "y_predict = nb_tfidf.predict(X_val_vectors_tfidf)\n", 748 | "y_prob = nb_tfidf.predict_proba(X_val_vectors_tfidf)[:,1]\n", 749 | " \n", 750 | "\n", 751 | "print(classification_report(y_val,y_predict))\n", 752 | "print('Confusion Matrix:',confusion_matrix(y_val, y_predict))\n", 753 | " \n", 754 | "fpr, tpr, thresholds = roc_curve(y_val, y_prob)\n", 755 | "roc_auc = auc(fpr, tpr)\n", 756 | "print('AUC:', roc_auc) \n", 757 | "\n", 758 | "\n" 759 | ] 760 | }, 761 | { 762 | "cell_type": "markdown", 763 | "metadata": {}, 764 | "source": [ 765 | "#### LR (w2v)" 766 | ] 767 | }, 768 | { 769 | "cell_type": "code", 770 | "execution_count": 20, 771 | "metadata": {}, 772 | "outputs": [ 773 | { 774 | "name": "stdout", 775 | "output_type": "stream", 776 | "text": [ 777 | " precision recall f1-score support\n", 778 | "\n", 779 | " 0 0.63 0.82 0.71 871\n", 780 | " 1 0.60 0.37 0.46 652\n", 781 | "\n", 782 | " accuracy 0.63 1523\n", 783 | " macro avg 0.62 0.59 0.59 1523\n", 784 | "weighted avg 0.62 0.63 0.60 1523\n", 785 | "\n", 786 | "Confusion Matrix: [[712 159]\n", 787 | " [412 240]]\n", 788 | "AUC: 0.6552953730638924\n" 789 | ] 790 | } 791 | ], 792 | "source": [ 793 | "#FITTING THE CLASSIFICATION MODEL using Logistic Regression (W2v)\n", 794 | "lr_w2v=LogisticRegression(solver = 'liblinear', C=10, penalty = 'l2')\n", 795 | "lr_w2v.fit(X_train_vectors_w2v, y_train) #model\n", 796 | "\n", 797 | "#Predict y value for test dataset\n", 798 | "y_predict = lr_w2v.predict(X_val_vectors_w2v)\n", 799 | "y_prob = lr_w2v.predict_proba(X_val_vectors_w2v)[:,1]\n", 800 | " \n", 801 | "\n", 802 | "print(classification_report(y_val,y_predict))\n", 803 | "print('Confusion Matrix:',confusion_matrix(y_val, y_predict))\n", 804 | " \n", 805 | "fpr, tpr, thresholds = roc_curve(y_val, y_prob)\n", 806 | "roc_auc = auc(fpr, tpr)\n", 807 | "print('AUC:', roc_auc) " 808 | ] 809 | }, 810 | { 811 | "cell_type": "markdown", 812 | "metadata": {}, 813 | "source": [ 814 | "### TESTING THE MODEL ON UNLABELLED DATASET" 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 26, 820 | "metadata": {}, 821 | "outputs": [ 822 | { 823 | "name": "stdout", 824 | "output_type": "stream", 825 | "text": [ 826 | " id keyword location text \\\n", 827 | "0 0 NaN NaN Just happened a terrible car crash \n", 828 | "1 2 NaN NaN Heard about #earthquake is different cities, s... \n", 829 | "2 3 NaN NaN there is a forest fire at spot pond, geese are... \n", 830 | "3 9 NaN NaN Apocalypse lighting. #Spokane #wildfires \n", 831 | "4 11 NaN NaN Typhoon Soudelor kills 28 in China and Taiwan \n", 832 | "\n", 833 | " clean_text predict_prob target \n", 834 | "0 happen terrible car crash 0.703070 1 \n", 835 | "1 heard earthquake different city stay safe ever... 0.901061 1 \n", 836 | "2 forest fire spot pond geese flee across street... 
0.870295 1 \n", 837 | "3 apocalypse light spokane wildfire 0.634877 1 \n", 838 | "4 typhoon soudelor kill china taiwan 0.995811 1 \n" 839 | ] 840 | } 841 | ], 842 | "source": [ 843 | "#Testing it on new dataset with the best model\n", 844 | "df_test=pd.read_csv('test.csv') #reading the data\n", 845 | "df_test['clean_text'] = df_test['text'].apply(lambda x: finalpreprocess(x)) #preprocess the data\n", 846 | "X_test=df_test['clean_text'] \n", 847 | "X_vector=tfidf_vectorizer.transform(X_test) #converting X_test to vector\n", 848 | "y_predict = lr_tfidf.predict(X_vector) #use the trained model on X_vector\n", 849 | "y_prob = lr_tfidf.predict_proba(X_vector)[:,1]\n", 850 | "df_test['predict_prob']= y_prob\n", 851 | "df_test['target']= y_predict\n", 852 | "print(df_test.head())\n", 853 | "final=df_test[['id','target']].reset_index(drop=True)\n", 854 | "final.to_csv('submission.csv')" 855 | ] 856 | } 857 | ], 858 | "metadata": { 859 | "kernelspec": { 860 | "display_name": "Python 3", 861 | "language": "python", 862 | "name": "python3" 863 | }, 864 | "language_info": { 865 | "codemirror_mode": { 866 | "name": "ipython", 867 | "version": 3 868 | }, 869 | "file_extension": ".py", 870 | "mimetype": "text/x-python", 871 | "name": "python", 872 | "nbconvert_exporter": "python", 873 | "pygments_lexer": "ipython3", 874 | "version": "3.7.3" 875 | } 876 | }, 877 | "nbformat": 4, 878 | "nbformat_minor": 2 879 | } 880 | --------------------------------------------------------------------------------