├── README.md
└── NLP text classification model Github.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # NLP-text-classification-model
2 |
3 |
4 | Unstructured data in the form of text (chats, emails, social media posts, survey responses) is present everywhere today. Text can be a rich source of information, but due to its unstructured nature it can be hard to extract insights from.
5 | Text classification is one of the important tasks in supervised machine learning (ML). It is the process of assigning tags/categories to documents, helping us automatically and quickly structure and analyze text in a cost-effective manner. It is one of the fundamental tasks in Natural Language Processing, with broad applications such as sentiment analysis, spam detection, topic labeling, intent detection, etc.
6 |
7 | **Let’s divide the classification problem into the below steps:**
8 |
9 | 1. Setup: Importing Libraries
10 | 2. Loading the data set & Exploratory Data Analysis
11 | 3. Text pre-processing
12 | 4. Extracting vectors from text (Vectorization)
13 | 5. Running ML algorithms
14 | 6. Conclusion
15 |
16 |
17 |
18 | **Step 2: Loading the data set & EDA**
19 |
20 | The data set that we will be using for this article is the famous “Natural Language Processing with Disaster Tweets” data set, where we’ll be predicting whether a given tweet is about a real disaster (target=1) or not (target=0).
21 | In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.
22 | We have 7,613 tweets in the training (labelled) dataset and 3,263 in the test (unlabelled) dataset.
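
Here’s a minimal loading sketch, assuming the competition files `train.csv` and `test.csv` have been downloaded from Kaggle into the working directory:

```python
import pandas as pd

# Assumes train.csv and test.csv from the Kaggle competition are in the working directory
df_train = pd.read_csv('train.csv')   # labelled tweets: id, keyword, location, text, target
df_test = pd.read_csv('test.csv')     # unlabelled tweets: no target column

print(df_train.shape)  # (7613, 5)
print(df_test.shape)   # (3263, 4)
df_train.head()
```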
23 |
24 | Exploratory Data Analysis (EDA): the key observations are listed below, with a short code sketch after the list
25 | 1. Class distribution: There are more tweets with class 0 (no disaster) than class 1 (disaster tweets). We can say that the dataset is relatively balanced, with 4,342 non-disaster tweets (57%) and 3,271 disaster tweets (43%). Since the data is balanced, we won’t be applying data-balancing techniques like SMOTE while building the model
26 |
27 | 2. Missing values: We have ~2.5k missing values in the location column and 61 missing values in the keyword column
28 |
29 | 3. Number of words in a tweet: Disaster tweets are wordier than non-disaster tweets
30 | The average number of words in a disaster tweet is 15.17, compared to an average of 14.7 words in a non-disaster tweet
31 |
32 | 4. Number of characters in a tweet: Disaster tweets are longer than non-disaster tweets
33 | The average number of characters in a disaster tweet is 108.1, compared to an average of 95.7 characters in a non-disaster tweet
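
Here’s a minimal sketch of these checks, assuming `df_train` is the training DataFrame loaded in Step 2:

```python
# 1. Class distribution
print(df_train['target'].value_counts())        # 0: 4342, 1: 3271

# 2. Missing values per column
print(df_train.isnull().sum())                  # location and keyword have gaps

# 3. Average number of words per tweet, by class
word_counts = df_train['text'].str.split().str.len()
print(word_counts.groupby(df_train['target']).mean())

# 4. Average number of characters per tweet, by class
char_counts = df_train['text'].str.len()
print(char_counts.groupby(df_train['target']).mean())
```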
34 |
35 | **Step 3: Text Pre-Processing**
36 |
37 | Before we move to model building, we need to preprocess our dataset by removing punctuation & special characters, cleaning the text, removing stop words, and applying lemmatization.
38 | Simple text cleaning: some of the common text cleaning steps (sketched in code after this list) are:
39 | - Removing punctuations, special characters, URLs & hashtags
40 | - Removing leading, trailing & extra white spaces/tabs
41 | - Correcting typos and slang, and writing abbreviations out in their long forms
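
Here’s a minimal cleaning sketch; the `preprocess` helper name is illustrative, and typo/slang correction is left out since it usually needs a lookup dictionary:

```python
import re
import string

def preprocess(text: str) -> str:
    """Basic cleaning: lowercase, strip URLs, HTML tags, punctuation, digits and extra whitespace."""
    text = text.lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                   # remove URLs
    text = re.sub(r'<.*?>', '', text)                                   # remove HTML tags
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)   # punctuation (also drops the # of hashtags)
    text = re.sub(r'\d+', '', text)                                     # remove digits
    text = re.sub(r'\s+', ' ', text).strip()                            # collapse whitespace/tabs
    return text

df_train['clean_text'] = df_train['text'].apply(preprocess)
```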
42 |
43 | 1. Stop-word removal: We can remove a list of generic stop words from the English vocabulary using nltk. A few such words are ‘i’, ‘you’, ‘a’, ‘the’, ‘he’, ‘which’, etc.
44 | 2. Stemming: Refers to the process of slicing off the end or the beginning of words with the intention of removing affixes (prefixes/suffixes)
45 | 3. Lemmatization: The process of reducing a word to its base form (lemma), using the vocabulary of the language rather than just chopping off affixes. A sketch of stop-word removal and lemmatization follows this list
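
Here’s a minimal sketch of stop-word removal and lemmatization with nltk, applied on top of the cleaned text; the helper name and the `clean_text` column are illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def remove_stopwords_and_lemmatize(text: str) -> str:
    tokens = word_tokenize(text)
    # Drop generic stop words ('i', 'you', 'a', 'the', ...), then reduce each word to its lemma
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return ' '.join(tokens)

df_train['clean_text'] = df_train['clean_text'].apply(remove_stopwords_and_lemmatize)
```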
46 |
47 |
48 | **Step 4: Extracting vectors from text (Vectorization)**
49 |
50 | It’s difficult to work with text data while building machine learning models, since these models need well-defined numerical data. The process of converting text data into numerical vectors is called vectorization or, in the NLP world, word embedding. Bag-of-Words (BoW) and Word Embedding (with Word2Vec) are two well-known methods for converting text data to numerical data.
51 | There are a few versions of Bag of Words, corresponding to different word-scoring methods. We use the scikit-learn library to calculate the BoW numerical values using the following approaches (a short sketch follows the list):
52 | 1. Count vectors: It builds a vocabulary from a corpus of documents and counts how many times the words appear in each document
53 | 2. Term Frequency-Inverse Document Frequency (Tf-Idf): Count vectors might not be the best representation for converting text data to numerical data. So, instead of simple counting, we can use an advanced variant of the Bag-of-Words that uses term frequency-inverse document frequency (Tf-Idf). Basically, the value of a word increases proportionally with its count in the document, but it is offset by how frequently the word appears across the corpus
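
Here’s a minimal scikit-learn sketch of both scoring approaches; the variable names are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = df_train['clean_text']

# 1. Count vectors: vocabulary built from the corpus, raw term counts per document
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(corpus)

# 2. Tf-Idf: term frequency weighted down by how common the term is across the corpus
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(X_counts.shape, X_tfidf.shape)  # (n_documents, vocabulary_size)
```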
54 |
55 | Word2Vec: One of the major drawbacks of Bag-of-Words techniques is that they can’t capture the meaning of, or relationships between, words. Word2Vec is one of the most popular techniques for learning word embeddings using a shallow neural network; it is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc.
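
Here’s a minimal gensim (v4+) sketch; each tweet is represented by the average of its word vectors, which is one common pooling choice rather than the only one, and the variable names are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Assumes 'clean_text' holds the preprocessed tweets
tokenized = [text.split() for text in df_train['clean_text']]

# Train a small Word2Vec model on the tweet corpus (vector_size is the gensim 4.x parameter name)
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, workers=4)

def tweet_vector(tokens):
    """Average the vectors of the words the model knows; zero vector if none are known."""
    vectors = [w2v.wv[tok] for tok in tokens if tok in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

X_w2v = np.vstack([tweet_vector(tokens) for tokens in tokenized])
```

Averaging word vectors loses word order, but it gives a fixed-length representation per tweet that the same classifiers can consume.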
56 |
57 | We can use any of these approaches to convert our text data to numerical form, which will be used to build the classification model. With this in mind, I am going to first partition the dataset into a training set (80%) and a test set (20%).
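
Here’s a minimal sketch of the 80/20 split; the variable names mirror those used in Step 5 and are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(
    df_train['clean_text'], df_train['target'], test_size=0.2, random_state=42
)

# Fit the vectorizer on the training text only, then transform both splits
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)
```

Fitting the vectorizer only on the training split avoids leaking test-set vocabulary statistics into training.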
58 |
59 | **Step 5: Running ML algorithms**
60 |
61 | It’s time to train a machine learning model on the vectorized dataset and test it. Now that we have converted the text data to numerical data, we can run ML models on X_train_vectors_tfidf & y_train. We’ll test each model on X_test_vectors_tfidf to get y_predict and further evaluate its performance (a short sketch follows the list below).
62 | 1. Logistic Regression
63 | 2. Naive Bayes: It’s a probabilistic classifier that makes use of Bayes’ Theorem, a rule that uses probability to make predictions based on prior knowledge of conditions that might be related to the event
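
Here’s a minimal sketch of fitting and evaluating both models on the Tf-Idf vectors from the split above; it’s an illustrative outline rather than the exact notebook code:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score

# 1. Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vectors_tfidf, y_train)
y_predict = lr.predict(X_test_vectors_tfidf)
y_prob = lr.predict_proba(X_test_vectors_tfidf)[:, 1]
print(classification_report(y_test, y_predict))
print('ROC AUC:', roc_auc_score(y_test, y_prob))

# 2. Multinomial Naive Bayes (works well with non-negative count/Tf-Idf features)
nb = MultinomialNB()
nb.fit(X_train_vectors_tfidf, y_train)
y_predict = nb.predict(X_test_vectors_tfidf)
print(classification_report(y_test, y_predict))
```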
64 |
**Step 6: Conclusion**

65 | In this article, I demonstrated the basics of building a text classification model, comparing Bag-of-Words (with Tf-Idf) and word embeddings (with Word2Vec). You can further enhance the performance of your model using this code by:
66 | - using other classification algorithms like Support Vector Machines (SVM), XGBoost, ensemble models, neural networks, etc.
67 | - using grid search (e.g. scikit-learn's GridSearchCV) to tune the hyperparameters of your model
68 | - using advanced word-embedding methods like GloVe and BERT
69 |
--------------------------------------------------------------------------------
/NLP text classification model Github.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "### IMPORTING PACKAGES"
8 | ]
9 | },
10 | {
11 | "cell_type": "code",
12 | "execution_count": 1,
13 | "metadata": {},
14 | "outputs": [
15 | {
16 | "name": "stderr",
17 | "output_type": "stream",
18 | "text": [
19 | "[nltk_data] Downloading package punkt to /Users/ranivija/nltk_data...\n",
20 | "[nltk_data] Package punkt is already up-to-date!\n",
21 | "[nltk_data] Downloading package averaged_perceptron_tagger to\n",
22 | "[nltk_data] /Users/ranivija/nltk_data...\n",
23 | "[nltk_data] Package averaged_perceptron_tagger is already up-to-\n",
24 | "[nltk_data] date!\n",
25 | "[nltk_data] Downloading package wordnet to\n",
26 | "[nltk_data] /Users/ranivija/nltk_data...\n",
27 | "[nltk_data] Package wordnet is already up-to-date!\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "import pandas as pd\n",
33 | "import numpy as np\n",
34 | "\n",
35 | "import seaborn as sns\n",
36 | "import matplotlib.pyplot as plt\n",
37 | "\n",
38 | "#for text pre-processing\n",
39 | "import re, string\n",
40 | "import nltk\n",
41 |     "from nltk.tokenize import word_tokenize\n",
42 |     "from nltk.corpus import stopwords\n",
44 | "from nltk.stem import SnowballStemmer\n",
45 | "from nltk.corpus import wordnet\n",
46 | "from nltk.stem import WordNetLemmatizer\n",
47 | "\n",
48 | "nltk.download('punkt')\n",
49 | "nltk.download('averaged_perceptron_tagger')\n",
50 |     "nltk.download('wordnet')\n",
     "nltk.download('stopwords')\n",
51 | "\n",
52 | "#for model-building\n",
53 | "from sklearn.model_selection import train_test_split\n",
54 | "from sklearn.linear_model import LogisticRegression\n",
55 | "from sklearn.linear_model import SGDClassifier\n",
56 | "from sklearn.naive_bayes import MultinomialNB\n",
57 | "from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix\n",
58 | "from sklearn.metrics import roc_curve, auc, roc_auc_score\n",
59 | "\n",
60 | "# bag of words\n",
61 | "from sklearn.feature_extraction.text import TfidfVectorizer\n",
62 | "from sklearn.feature_extraction.text import CountVectorizer\n",
63 | "\n",
64 | "#for word embedding\n",
65 | "import gensim\n",
66 | "from gensim.models import Word2Vec #Word2Vec is mostly used for huge datasets"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "metadata": {},
72 | "source": [
73 | "## Loading Data"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 2,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "(7613, 5)\n"
86 | ]
87 | },
88 | {
89 | "data": {
161 | "text/plain": [
162 | " id keyword location text \\\n",
163 | "0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... \n",
164 | "1 4 NaN NaN Forest fire near La Ronge Sask. Canada \n",
165 | "2 5 NaN NaN All residents asked to 'shelter in place' are ... \n",
166 | "3 6 NaN NaN 13,000 people receive #wildfires evacuation or... \n",
167 | "4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... \n",
168 | "\n",
169 | " target \n",
170 | "0 1 \n",
171 | "1 1 \n",
172 | "2 1 \n",
173 | "3 1 \n",
174 | "4 1 "
175 | ]
176 | },
177 | "execution_count": 2,
178 | "metadata": {},
179 | "output_type": "execute_result"
180 | }
181 | ],
182 | "source": [
183 | "#you can download the data from https://www.kaggle.com/c/nlp-getting-started/data\n",
184 | "import os\n",
185 | "os.chdir('/Users/ranivija/Desktop/')\n",
186 | "df_train=pd.read_csv('train.csv')\n",
187 | "print(df_train.shape)\n",
188 | "df_train.head()"
189 | ]
190 | },
191 | {
192 | "cell_type": "markdown",
193 | "metadata": {},
194 | "source": [
195 | "## EDA"
196 | ]
197 | },
198 | {
199 | "cell_type": "code",
200 | "execution_count": 3,
201 | "metadata": {},
202 | "outputs": [
203 | {
204 | "name": "stdout",
205 | "output_type": "stream",
206 | "text": [
207 | "0 4342\n",
208 | "1 3271\n",
209 | "Name: target, dtype: int64\n"
210 | ]
211 | },
212 | {
213 | "data": {
214 | "text/plain": [
215 | ""
216 | ]
217 | },
218 | "execution_count": 3,
219 | "metadata": {},
220 | "output_type": "execute_result"
221 | },
222 | {
223 | "data": {
225 | "text/plain": [
226 | "