├── README.md
├── SubmissionAV.py
└── code.py


/README.md:
--------------------------------------------------------------------------------
 1 | # Twitter-Sentiment-Analysis---Analytics-Vidhya
 2 | ## Problem Statement
 3 | The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets. Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.
 4 | 
 5 | ## Motivation
 6 | Hate speech is an unfortunately common occurrence on the Internet. Often social media sites like Facebook and Twitter face the problem of identifying and censoring problematic posts while weighing the right to freedom of speech. The importance of detecting and moderating hate speech is evident from the strong connection between hate speech and actual hate crimes. Early identification of users promoting hate speech could enable outreach programs that attempt to prevent an escalation from speech to action. Sites such as Twitter and Facebook have been seeking to actively combat hate speech. In spite of these reasons, NLP research on hate speech has been very limited, primarily due to the lack of a general definition of hate speech, an analysis of its demographic influences, and an investigation of the most effective features.
 7 | 
 8 | ## Data
 9 | Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.
10 | 
11 | ### Data Files
12 | - train.csv - For training the models, we provide a labelled dataset of 31,962 tweets. The dataset is provided in the form of a csv file with each line storing a tweet id, its label and the tweet.
13 | - test_tweets.csv - There is 1 test file (public). The test data file contains only tweet ids and the tweet text with each tweet in a new line.
14 | 
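As a rough illustration of the file layout described in the Data section, the sketch below parses a few rows in `id,label,tweet` form with Python's standard `csv` module. The column names are the ones the scripts in this repository index (`id`, `label`, `tweet`); the sample rows and the helper `load_rows` are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the train.csv layout (id, label, tweet); not real data.
SAMPLE_CSV = """id,label,tweet
1,0,@user thanks for the lovely day #grateful
2,0,"factory reset done, phone feels new"
3,1,placeholder text standing in for a racist/sexist tweet
"""


def load_rows(text):
    """Parse CSV text into a list of dicts and convert the label to int."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["label"] = int(row["label"])
    return rows


rows = load_rows(SAMPLE_CSV)
# Count how many rows carry each label (0 = not hate speech, 1 = hate speech).
label_counts = Counter(row["label"] for row in rows)
print(label_counts)
```

The test file would parse the same way, minus the `label` column.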


--------------------------------------------------------------------------------
/SubmissionAV.py:
--------------------------------------------------------------------------------
 1 | # Practice Problem : Twitter Sentiment Analysis by Analytics Vidhya 
 2 | #Author - Sachin Kumar
 3 | 
 4 | #Importing Libraries 
 5 | import re #cleaning the text 
 6 | import pandas as pd 
 7 | import numpy as np 
 8 | import string
 9 | import nltk
10 | import warnings 
11 | 
12 | #Importing dataset
13 | dataset = pd.read_csv('train.csv')
14 | testdata = pd.read_csv('test.csv')
15 | 
16 | #combine train and test set
17 | combi = pd.concat([dataset, testdata], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
18 | 
19 | # remove every substring of input_txt that matches the regex pattern (used below to strip @handles)
20 | def remove_pattern(input_txt, pattern):
21 |     r = re.findall(pattern, input_txt)
22 |     for i in r:
23 |         input_txt = re.sub(re.escape(i), '', input_txt)  # escape the match so it is removed literally
24 | 
25 |     return input_txt
26 | 
27 | # remove twitter handles (@user)
28 | combi['tweet'] = np.vectorize(remove_pattern)(combi['tweet'], r"@[\w]*")
29 | 
30 | # remove special characters, numbers, punctuations
31 | combi['tweet'] = combi['tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True is required since pandas 2.0
32 |        
33 | #Removing Short Words       
34 | combi['tweet'] = combi['tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))       
35 | 
36 | tokenized_tweet = combi['tweet'].apply(lambda x: x.split())
37 | 
38 | 
39 | from nltk.stem.porter import PorterStemmer
40 | stemmer = PorterStemmer()
41 | 
42 | tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming
43 | 
44 | #Now let’s stitch these tokens
45 | for i in range(len(tokenized_tweet)):
46 |     tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
47 | 
48 | combi['tweet'] = tokenized_tweet
49 | 
50 | #Bag-of-Words Features
51 | 
52 | from sklearn.feature_extraction.text import CountVectorizer
53 | bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
54 | # bag-of-words feature matrix
55 | bow = bow_vectorizer.fit_transform(combi['tweet'])
56 | 
57 | 
58 | from sklearn.model_selection import train_test_split
59 | from sklearn.metrics import f1_score
60 | 
61 | train_bow = bow[:31962,:]
62 | test_bow = bow[31962:,:]
63 | 
64 | # splitting data into training and validation set
65 | xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, dataset['label'], random_state=42, test_size=0.3)
66 | 
67 | from sklearn.svm import SVC
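68 | # RBF-kernel SVM with the hyperparameters chosen by the grid search in code.py
69 | svm = SVC(kernel = 'rbf', random_state = 0, gamma = 0.14, C =11)
70 | svm.fit(xtrain_bow, ytrain)
71 | y_pred5 = svm.predict(xvalid_bow)
72 | prediction_int7 = y_pred5.astype(int)
73 | print(f1_score(yvalid, prediction_int7))  # print the validation F1, otherwise the value is discarded in a script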
74 | 
75 | #prediction on test set
76 | test_pred = svm.predict(test_bow)
77 | test_pred_int = test_pred.astype(int)
78 | testdata['label'] = test_pred_int
79 | submission = testdata[['id','label']]
80 | submission.to_csv('svmrbfbow.csv', index=False) # writing data to a CSV file


--------------------------------------------------------------------------------
/code.py:
--------------------------------------------------------------------------------
  1 | # Practice Problem : Twitter Sentiment Analysis by Analytics Vidhya 
  2 | 
  3 | #The objective of this task is to detect hate speech in tweets. Formally, given a training sample of tweets and labels, 
  4 | #where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective
  5 | # is to predict the labels on the given test dataset.
  6 | #Note: The evaluation metric from this practice problem is F1-Score.
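For reference, the F1-score is the harmonic mean of precision and recall for the positive class (label 1). Here is a minimal sketch in plain Python; the helper name `f1_binary` and the labels are made up for illustration and are not model output.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = f1_binary(y_true, y_pred)
print(score)
```

Because true negatives do not enter the formula, a model that labels every tweet as 0 scores an F1 of 0 however many negatives it gets right.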
  7 | 
  8 | 
  9 | #Importing Libraries 
 10 | import re #cleaning the text 
 11 | import pandas as pd 
 12 | import numpy as np 
 13 | import matplotlib.pyplot as plt 
 14 | import seaborn as sns
 15 | import string
 16 | import nltk
 17 | import warnings 
 18 | 
 19 | #NLTK is a leading platform for building Python programs to work with human language data 
 20 | 
 21 | #Importing dataset
 22 | dataset = pd.read_csv('train.csv')
 23 | testdata = pd.read_csv('test.csv')
 24 | 
 25 | #To see the first few rows of the train dataset
 26 | dataset.head()
 27 | 
 28 | dataset.info()
 29 | 
 30 | #breakdown of how many tweets are ‘0’s and how many tweets are ‘1’s.
 31 | dataset['label'].value_counts()
 32 | 
 33 | #Initial data cleaning requirements that we can think of after looking at the top 5 records:
 34 | #The Twitter handles are already masked as @user due to privacy concerns. So, these Twitter handles are hardly giving any information about the nature of the tweet.
 35 | #We can also think of getting rid of the punctuations, numbers and even special characters since they wouldn’t help in differentiating different kinds of tweets.
 36 | #Most of the smaller words do not add much value. For example, ‘pdx’, ‘his’, ‘all’. So, we will try to remove them as well from our data.
 37 | #Once we have executed the above three steps, we can split every tweet into individual words or tokens which is an essential step in any NLP task.
 38 | #In the 4th tweet, there is a word ‘love’. We might also have terms like loves, loving, lovable, etc. in the rest of the data. These terms are often used in the same
 39 | #context. If we can reduce them to their root word, which is ‘love’, then we can reduce the total number of unique words in our data without losing a significant amount of information.
 40 | 
 41 | 
 42 | #Tweets Preprocessing and Cleaning
 43 | #The objective of this step is to remove noise that is less relevant to the sentiment of tweets, such as punctuation,
 44 | #special characters, numbers, and terms which don’t carry much weight in the context of the text.
 45 | 
 46 | #a user-defined function to remove unwanted text patterns from the tweets. 
 47 | # The function returns the same input string but without the given pattern
 48 | 
 49 | #combine train and test set
 50 | combi = pd.concat([dataset, testdata], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
 51 | 
 52 | 
 53 | # remove every substring of input_txt that matches the regex pattern (used below to strip @handles)
 54 | def remove_pattern(input_txt, pattern):
 55 |     r = re.findall(pattern, input_txt)
 56 |     for i in r:
 57 |         input_txt = re.sub(re.escape(i), '', input_txt)  # escape the match so it is removed literally
 58 | 
 59 |     return input_txt
 60 | 
 61 | # remove twitter handles (@user)
 62 | combi['tweet'] = np.vectorize(remove_pattern)(combi['tweet'], r"@[\w]*")
 63 | 
 64 | # remove special characters, numbers, punctuations
 65 | combi['tweet'] = combi['tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True is required since pandas 2.0
 66 | 
 67 |        
 68 | #Removing Short Words       
 69 | combi['tweet'] = combi['tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))       
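The three cleaning steps above (dropping handles, replacing non-letter characters, removing short words) can be tried on a single string without pandas. This is a small sketch only; the helper name `clean_tweet` and the sample text are illustrative and not part of the original script.

```python
import re


def clean_tweet(text):
    """Apply the same three cleaning steps to one tweet string."""
    text = re.sub(r"@\w*", "", text)         # drop handles such as @user
    text = re.sub(r"[^a-zA-Z#]", " ", text)  # keep only letters and '#'
    # keep words longer than 3 characters ('#' counts towards the length)
    return " ".join(w for w in text.split() if len(w) > 3)


sample = "@user when a father is dysfunctional #run!!"
cleaned = clean_tweet(sample)
print(cleaned)
```

Note that the length filter counts the `#` character, so a short hashtag such as `#run` survives while the bare word `run` would be dropped.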
 70 | 
 71 | #Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. 
 72 | #Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded
 73 | tokenized_tweet = combi['tweet'].apply(lambda x: x.split())
 74 | 
 75 | #Stemming is a rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc) from a word. 
 76 | #For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.
 77 | 
 78 | from nltk.stem.porter import PorterStemmer
 79 | stemmer = PorterStemmer()
 80 | 
 81 | tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x]) # stemming
 82 | 
 83 | #Now let’s stitch these tokens
 84 | for i in range(len(tokenized_tweet)):
 85 |     tokenized_tweet[i] = ' '.join(tokenized_tweet[i])
 86 | 
 87 | combi['tweet'] = tokenized_tweet
 88 | 
 89 | 
 90 | #---------------------------------------------------------------------------------------------------------------
 91 | 
 92 | ##Story Generation and Visualization from Tweets
 93 | #A wordcloud is a visualization wherein the most frequent words appear in 
 94 | #large size and the less frequent words appear in smaller sizes.
 95 | 
 96 | all_words = ' '.join([text for text in combi['tweet']])
 97 | from wordcloud import WordCloud
 98 | wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
 99 | 
100 | plt.figure(figsize=(10, 7))
101 | plt.imshow(wordcloud, interpolation="bilinear")
102 | 
103 | #Words in non racist/sexist tweets
104 | normal_words =' '.join([text for text in combi['tweet'][combi['label'] == 0]])
105 | 
106 | wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(normal_words)
107 | plt.figure(figsize=(10, 7))
108 | plt.imshow(wordcloud, interpolation="bilinear")
109 | 
110 | #Racist/Sexist Tweets
111 | negative_words = ' '.join([text for text in combi['tweet'][combi['label'] == 1]])
112 | wordcloud = WordCloud(width=800, height=500,random_state=21, max_font_size=110).generate(negative_words)
113 | plt.figure(figsize=(10, 7))
114 | plt.imshow(wordcloud, interpolation="bilinear")
115 | 
116 | #Understanding the impact of Hashtags on tweets sentiment
117 | #Hashtags in twitter are synonymous with the ongoing trends on twitter at any particular point in time.
118 | # function to collect hashtags
119 | def hashtag_extract(x):
120 |     hashtags = []
121 |     # Loop over the words in the tweet
122 |     for i in x:
123 |         ht = re.findall(r"#(\w+)", i)
124 |         hashtags.append(ht)
125 | 
126 |     return hashtags
127 | 
128 | # extracting hashtags from non racist/sexist tweets
129 | HT_regular = hashtag_extract(combi['tweet'][combi['label'] == 0])
130 | 
131 | # extracting hashtags from racist/sexist tweets
132 | HT_negative = hashtag_extract(combi['tweet'][combi['label'] == 1])
133 | 
134 | # unnesting list
135 | HT_regular = sum(HT_regular,[])
136 | HT_negative = sum(HT_negative,[])
137 | 
138 | #Non-Racist/Sexist Tweets
139 | 
140 | a = nltk.FreqDist(HT_regular)
141 | d = pd.DataFrame({'Hashtag': list(a.keys()),
142 |                   'Count': list(a.values())})
143 | # selecting top 10 most frequent hashtags     
144 | d = d.nlargest(columns="Count", n = 10) 
145 | plt.figure(figsize=(13,7))
146 | ax = sns.barplot(data=d, x= "Hashtag", y = "Count")
147 | 
148 | #Racist/Sexist Tweets
149 | 
150 | b = nltk.FreqDist(HT_negative)
151 | e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())})
152 | # selecting top 10 most frequent hashtags
153 | e = e.nlargest(columns="Count", n = 10)   
154 | plt.figure(figsize=(13,7))
155 | ax = sns.barplot(data=e, x= "Hashtag", y = "Count")
156 | 
157 | #As expected, most of the terms are negative with a few neutral terms as well.
158 | #So, it’s not a bad idea to keep these hashtags in our data as they contain useful information.
159 | 
160 | #-------------------------------------------------------------------------------------------------
161 | 
162 | 
163 | #Extracting Features from Cleaned Tweets
164 | 
165 | #Bag-of-Words Features
166 | #Bag-of-Words features can be easily created using sklearn’s CountVectorizer function. 
167 | #We will set the parameter max_features = 1000 to select only top 1000 terms ordered by term frequency across the corpus.
168 | 
169 | from sklearn.feature_extraction.text import CountVectorizer
170 | bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
171 | # bag-of-words feature matrix
172 | bow = bow_vectorizer.fit_transform(combi['tweet'])
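To make the idea concrete, here is a hand-rolled bag-of-words in plain Python. It is only a sketch of what `CountVectorizer` does conceptually (a vocabulary capped at the most frequent terms, one count vector per document). It ignores `max_df`, `min_df` and stop words, breaks frequency ties alphabetically as an assumption of this sketch, and uses a made-up toy corpus; `bag_of_words` is a hypothetical helper.

```python
from collections import Counter


def bag_of_words(docs, max_features):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    # Total frequency of every term across the whole corpus.
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Keep the max_features most frequent terms (ties broken alphabetically).
    top_terms = sorted(totals, key=lambda t: (-totals[t], t))[:max_features]
    vocab = sorted(top_terms)  # columns in alphabetical order
    matrix = [[toks.count(term) for term in vocab] for toks in tokenized]
    return vocab, matrix


docs = ["love love life", "hate life", "love thi day"]  # toy, already-stemmed text
vocab, matrix = bag_of_words(docs, max_features=2)
print(vocab)
print(matrix)
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term, which is the same shape convention as the `bow` matrix above (stored there as a sparse matrix).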
173 | 
174 | #TF-IDF Features
175 | from sklearn.feature_extraction.text import TfidfVectorizer
176 | tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
177 | # TF-IDF feature matrix
178 | tfidf = tfidf_vectorizer.fit_transform(combi['tweet'])
179 | 
180 | 
181 | 
182 | 
183 | #Building log model using Bag-of-Words features
184 | from sklearn.linear_model import LogisticRegression
185 | from sklearn.model_selection import train_test_split
186 | from sklearn.metrics import f1_score
187 | 
188 | train_bow = bow[:31962,:]
189 | test_bow = bow[31962:,:]
190 | 
191 | # splitting data into training and validation set
192 | xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, dataset['label'], random_state=42, test_size=0.3)
193 | 
194 | lreg = LogisticRegression()
195 | lreg.fit(xtrain_bow, ytrain) # training the model
196 | 
197 | prediction1 = lreg.predict_proba(xvalid_bow) # predicting on the validation set
198 | prediction_int1 = prediction1[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
199 | prediction_int1 = prediction_int1.astype(int) # np.int was removed in NumPy 1.24
200 | 
201 | f1_score(yvalid, prediction_int1) # calculating f1 score
202 | #f1 = 0.53078
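Lowering the cut-off below 0.5, as done above with 0.3, labels more tweets as positive, which generally raises recall at the cost of precision. The sketch below shows the mechanics on invented probabilities; the helper names and numbers are illustrative only and say nothing about the real model.

```python
def apply_threshold(probs, threshold):
    """Turn positive-class probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probs]


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(y_pred)
    actual_pos = sum(y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical probabilities and true labels, not model output.
probs = [0.05, 0.35, 0.45, 0.80, 0.20, 0.60]
y_true = [0, 1, 0, 1, 0, 1]

pred_05 = apply_threshold(probs, 0.5)
pred_03 = apply_threshold(probs, 0.3)
print(precision_recall(y_true, pred_05))
print(precision_recall(y_true, pred_03))
```

In this toy case the 0.3 cut-off recovers the positive that 0.5 missed but also admits one false positive. A reasonable way to pick the cut-off is to compare F1 on the validation split for a few candidate values.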
203 | 
204 | # Making the Confusion Matrix
205 | from sklearn.metrics import confusion_matrix
206 | cm = confusion_matrix(yvalid, prediction_int1)
207 | cm
208 | prediction_test = lreg.predict_proba(test_bow) # predicting on the testset
209 | prediction_test = prediction_test[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
210 | prediction_test = prediction_test.astype(int)
211 | 
212 | #Export submission file
213 | 
214 | testdata['label'] = prediction_test
215 | submission = testdata[['id','label']]
216 | submission.to_csv('sub_log_bow.csv', index=False) # writing data to a CSV file
217 | 
218 | 
219 | #Building log model using TF-IDF features
220 | train_tfidf = tfidf[:31962,:]
221 | test_tfidf = tfidf[31962:,:]
222 | 
223 | xtrain_tfidf = train_tfidf[ytrain.index]
224 | xvalid_tfidf = train_tfidf[yvalid.index]
225 | 
226 | lreg.fit(xtrain_tfidf, ytrain)
227 | 
228 | prediction2 = lreg.predict_proba(xvalid_tfidf)
229 | prediction_int2 = prediction2[:,1] >= 0.3
230 | prediction_int2 = prediction_int2.astype(int)
231 | 
232 | f1_score(yvalid, prediction_int2)
233 | #f1 = 0.54465
234 | prediction_test = lreg.predict_proba(test_tfidf) # predicting on the test set with the TF-IDF features the model was just fitted on
235 | prediction_test = prediction_test[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
236 | prediction_test = prediction_test.astype(int)
237 | 
238 | #Export submission file
239 | 
240 | testdata['label'] = prediction_test
241 | submission = testdata[['id','label']]
242 | submission.to_csv('sub_log_tfidf.csv', index=False) # writing data to a CSV file
243 | 
244 | 
245 | # K-Nearest Neighbors (K-NN)
246 | #Building K-NN model using Bag-of-Words features
247 | # Fitting K-NN to the Training set
248 | from sklearn.neighbors import KNeighborsClassifier
249 | knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
250 | knn.fit(xtrain_bow, ytrain)
251 | # Predicting the validation set results
252 | y_pred = knn.predict_proba(xvalid_bow)
253 | prediction_int3 = y_pred[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
254 | prediction_int3 = prediction_int3.astype(int)
255 | f1_score(yvalid, prediction_int3)
256 | #f1 = 0.43578
257 | #Building model using TF-IDF features
258 | 
259 | knn.fit(xtrain_tfidf, ytrain)
260 | y_pred2 = knn.predict_proba(xvalid_tfidf)
261 | prediction_int4 = y_pred2[:,1] >= 0.3
262 | prediction_int4 = prediction_int4.astype(int)
263 | 
264 | f1_score(yvalid, prediction_int4)
265 | #f1 = 0.45124
266 | 
267 | 
268 | 
269 | # Naive Bayes
270 | # GaussianNB rejects sparse input ("A sparse matrix was passed, but dense data is required"),
271 | # so the feature matrices are converted to dense arrays with .toarray() first.
272 | # Fitting Naive Bayes to the Training set
273 | from sklearn.naive_bayes import GaussianNB
274 | nb = GaussianNB()
275 | nb.fit(xtrain_bow.toarray(), ytrain)
276 | # Predicting the validation set results
277 | y_pred3 = nb.predict_proba(xvalid_bow.toarray())
278 | prediction_int5 = y_pred3[:,1] >= 0.3
279 | prediction_int5 = prediction_int5.astype(int)
280 | f1_score(yvalid, prediction_int5)
281 | 
282 | 
283 | #Building model using TF-IDF features
284 | nb.fit(xtrain_tfidf.toarray(), ytrain)
285 | y_pred4 = nb.predict_proba(xvalid_tfidf.toarray())
286 | prediction_int6 = y_pred4[:,1] >= 0.3
287 | prediction_int6 = prediction_int6.astype(int)
288 | 
289 | f1_score(yvalid, prediction_int6)
290 | 
291 | 
292 | # Fitting SVM to the Training set
293 | from sklearn.svm import SVC
294 | svm = SVC()
295 | svm.fit(xtrain_bow, ytrain)
296 | 
297 | 
298 | #Building model using TF-IDF features
299 | svm.fit(xtrain_tfidf, ytrain)
300 | 
301 | 
302 | # Applying k-Fold Cross Validation
303 | from sklearn.model_selection import cross_val_score
304 | accuracies = cross_val_score(estimator = svm, X = xtrain_bow, y = ytrain, cv = 10)
305 | accuracies.mean()
306 | accuracies.std()
307 | 
308 | # Applying Grid Search to find the best model and the best parameters from bag of words
309 | from sklearn.model_selection import GridSearchCV
310 | parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
311 |               {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
312 | grid_search = GridSearchCV(estimator = svm,
313 |                            param_grid = parameters,
314 |                            scoring = 'f1',
315 |                            cv = 10,
316 |                            n_jobs = -1)
317 | grid_search = grid_search.fit(xtrain_bow, ytrain)
318 | best_f1 = grid_search.best_score_
319 | best_parameters = grid_search.best_params_
320 | #best f1 = 0.5953
321 | #best parameters = c = 10, gamma = 0.2, kernel = rbf 
322 | #We will again do Grid Search with parameters close to the above result
323 | # Applying Grid Search to find the best model and the best parameters
324 | from sklearn.model_selection import GridSearchCV
325 | parameters = [
326 |               {'C': [5, 10, 15, 20], 'kernel': ['rbf'], 'gamma': [0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25,]}]
327 | grid_search = GridSearchCV(estimator = svm,
328 |                            param_grid = parameters,
329 |                            scoring = 'f1',
330 |                            cv = 10,
331 |                            n_jobs = -1)
332 | grid_search = grid_search.fit(xtrain_bow, ytrain)
333 | best_f1 = grid_search.best_score_
334 | best_parameters = grid_search.best_params_
335 | #best f1 = 0.6013
336 | #best parameters = c = 10, gamma = 0.15, kernel = rbf 
337 | 
338 | from sklearn.model_selection import GridSearchCV
339 | parameters = [
340 |               {'C': [8, 9, 10, 11, 12], 'kernel': ['rbf'], 'gamma': [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18]}]
341 | grid_search = GridSearchCV(estimator = svm,
342 |                            param_grid = parameters,
343 |                            scoring = 'f1',
344 |                            cv = 10,
345 |                            n_jobs = -1)
346 | grid_search = grid_search.fit(xtrain_bow, ytrain)
347 | best_f1 = grid_search.best_score_
348 | best_parameters = grid_search.best_params_
349 | #best f1 = 0.6213
350 | #best parameters = c = 11, gamma = 0.14, kernel = rbf 
351 | #We will go with these parameters
352 | 
353 | svm = SVC(kernel = 'rbf', random_state = 0, gamma = 0.14, C =11)
354 | svm.fit(xtrain_bow, ytrain)
355 | y_pred5 = svm.predict(xvalid_bow)
356 | prediction_int7 = y_pred5.astype(int)
357 | f1_score(yvalid, prediction_int7)
358 | 
359 | #prediction on test set
360 | test_pred = svm.predict(test_bow)
361 | test_pred_int = test_pred.astype(int)
362 | testdata['label'] = test_pred_int
363 | submission = testdata[['id','label']]
364 | submission.to_csv('svmrbfbow.csv', index=False) # writing data to a CSV file
365 | 
366 | 
367 | 
368 | # Applying Grid Search to find the best model and the best parameters from tfidf
369 | from sklearn.model_selection import GridSearchCV
370 | parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
371 |               {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
372 | grid_search = GridSearchCV(estimator = svm,
373 |                            param_grid = parameters,
374 |                            scoring = 'f1',
375 |                            cv = 10,
376 |                            n_jobs = -1)
377 | grid_search = grid_search.fit(xtrain_tfidf, ytrain)
378 | best_f1 = grid_search.best_score_
379 | best_parameters = grid_search.best_params_
380 | #best f1 = 0.6146
381 | #best parameters = c = 10, gamma = 0.5, kernel = rbf 
382 | #We will again do Grid Search with parameters close to the above result
383 | # Applying Grid Search to find the best model and the best parameters
384 | from sklearn.model_selection import GridSearchCV
385 | parameters = [
386 |               {'C': [5, 10, 15, 20], 'kernel': ['rbf'], 'gamma': [0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55,]}]
387 | grid_search = GridSearchCV(estimator = svm,
388 |                            param_grid = parameters,
389 |                            scoring = 'f1',
390 |                            cv = 10,
391 |                            n_jobs = -1)
392 | grid_search = grid_search.fit(xtrain_tfidf, ytrain)
393 | best_f1 = grid_search.best_score_
394 | best_parameters = grid_search.best_params_
395 | #best f1 = 0.6166
396 | #best parameters = c = 10, gamma = 0.51, kernel = rbf 
397 | 
398 | svm = SVC(kernel = 'rbf', random_state = 0, gamma = 0.51, C =10)
399 | #Building model using TF-IDF features
400 | svm.fit(xtrain_tfidf, ytrain)
401 | y_pred6 = svm.predict(xvalid_tfidf)
402 | prediction_int8 = y_pred6.astype(int)
403 | f1_score(yvalid, prediction_int8)
404 | 
405 | #prediction on test set
406 | test_pred = svm.predict(test_tfidf)
407 | test_pred_int = test_pred.astype(int)
408 | testdata['label'] = test_pred_int
409 | submission = testdata[['id','label']]
410 | submission.to_csv('svmrbftfidf.csv', index=False) # writing data to a CSV file
411 | 
412 | 
413 | 
414 | 
415 | # Fitting Decision Tree Classification to the Training set
416 | from sklearn.tree import DecisionTreeClassifier
417 | tree = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
418 | tree.fit(xtrain_bow, ytrain)
419 | y_pred7 = tree.predict(xvalid_bow)
420 | prediction_int9 = y_pred7.astype(int)
421 | f1_score(yvalid, prediction_int9)
422 | #f1 linear = 0.50385
423 | 
424 | 
425 | #Building model using TF-IDF features
426 | tree.fit(xtrain_tfidf, ytrain)
427 | y_pred8 = tree.predict(xvalid_tfidf) # predict with the tree fitted above, not the SVM
428 | prediction_int10 = y_pred8.astype(int)
429 | f1_score(yvalid, prediction_int10)
430 | #f1 linear = 0.3919
431 | 
432 | 
433 | 
434 | 
435 | #Building random forest model using Bag-of-Words features
436 | from sklearn.ensemble import RandomForestClassifier
437 | 
438 | rf=RandomForestClassifier(n_estimators=1024,criterion='entropy',random_state=0)
439 | rf.fit(xtrain_bow, ytrain) # training the model
440 | 
441 | predict_valid = rf.predict_proba(xvalid_bow) # predicting on the validation set
442 | valid_predict_int = predict_valid[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
443 | valid_predict_int = valid_predict_int.astype(int)
444 | f1_score(yvalid,valid_predict_int) # calculating f1 score
445 | #f1 (n =64) = 0.5427
446 | 
447 | 
448 | #prediction on test set
449 | test_pred = rf.predict_proba(test_bow)
450 | test_pred_int = test_pred[:,1] >= 0.3
451 | test_pred_int = test_pred_int.astype(int)
452 | testdata['label'] = test_pred_int
453 | submission = testdata[['id','label']]
454 | submission.to_csv('rfbow64.csv', index=False) # writing data to a CSV file
455 | 
456 | 
457 | #Building model using TF-IDF features
458 | rf=RandomForestClassifier(n_estimators=128,criterion='entropy',random_state=0)
459 | rf.fit(xtrain_tfidf, ytrain) # training the model
460 | 
461 | predict_valid = rf.predict_proba(xvalid_tfidf) # predicting on the validation set
462 | valid_predict_int = predict_valid[:,1] >= 0.3 # if the predicted probability is greater than or equal to 0.3 then 1 else 0
463 | valid_predict_int = valid_predict_int.astype(int)
464 | f1_score(yvalid,valid_predict_int) # calculating f1 score
465 | #f1 128 = 0.5777
466 | 
467 | #prediction on test set
468 | test_pred = rf.predict_proba(test_tfidf)
469 | test_pred_int = test_pred[:,1] >= 0.3
470 | test_pred_int = test_pred_int.astype(int)
471 | testdata['label'] = test_pred_int
472 | submission = testdata[['id','label']]
473 | submission.to_csv('rf1024.csv', index=False) # writing data to a CSV file
474 | 
475 | 
476 | 
477 | 
478 | 
479 | 
480 | 
481 | 
482 | 
483 | 
484 | 
485 | 
486 | #Emoticons: use of emoticons is very prevalent throughout the web, more so on micro-blogging sites.
487 | 
488 | # Repeating Characters
489 | #People often use repeating characters while using colloquial language, 
490 | #like "I’m in a hurrryyyyy", "We won, yaaayyyyy!" As our final pre-processing step,
491 | # we replace characters repeating more than twice as two characters.
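The script does not implement this step. One possible way to do it, sketched here with a regular-expression backreference, is shown below; the helper name `squeeze_repeats` is hypothetical and this is an illustration rather than the author's method.

```python
import re

# Any single character followed by two or more copies of itself.
_REPEATS = re.compile(r"(.)\1{2,}")


def squeeze_repeats(text):
    """Collapse runs of 3+ identical characters down to exactly two."""
    return _REPEATS.sub(r"\1\1", text)


print(squeeze_repeats("I'm in a hurrryyyyy"))
print(squeeze_repeats("We won, yaaayyyyy!"))
```

Runs of exactly two characters (as in "good") are left untouched, so ordinary double letters survive. If added to the pipeline, this would have to run before stemming so the stemmer sees the normalized spelling.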
492 | 
493 | 
494 | 
495 | 
496 | 
497 | 
498 | 
499 | 
500 | 
501 | 
502 | 
503 | 
504 | 
505 | 


--------------------------------------------------------------------------------