├── README.md
├── SubmissionAV.py
└── code.py


/README.md:
--------------------------------------------------------------------------------
# Twitter-Sentiment-Analysis---Analytics-Vidhya
Problem Statement
The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets. Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

Motivation
Hate speech is an unfortunately common occurrence on the Internet. Often social media sites like Facebook and Twitter face the problem of identifying and censoring problematic posts while weighing the right to freedom of speech. The importance of detecting and moderating hate speech is evident from the strong connection between hate speech and actual hate crimes. Early identification of users promoting hate speech could enable outreach programs that attempt to prevent an escalation from speech to action. Sites such as Twitter and Facebook have been seeking to actively combat hate speech. In spite of these reasons, NLP research on hate speech has been very limited, primarily due to the lack of a general definition of hate speech, an analysis of its demographic influences, and an investigation of the most effective features.

Data
Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.

Data Files
train.csv - For training the models, we provide a labelled dataset of 31,962 tweets. The dataset is provided in the form of a csv file with each line storing a tweet id, its label and the tweet.
There is 1 test file (public)
test_tweets.csv - The test data file contains only tweet ids and the tweet text, with each tweet on a new line.

--------------------------------------------------------------------------------
/SubmissionAV.py:
--------------------------------------------------------------------------------
# Practice Problem : Twitter Sentiment Analysis by Analytics Vidhya
# Author - Sachin Kumar

# Importing libraries
import re  # cleaning the text
import pandas as pd
import numpy as np
import string
import nltk
import warnings

# Importing dataset
# (the README calls the test file test_tweets.csv; adjust the path if needed)
dataset = pd.read_csv('train.csv')
testdata = pd.read_csv('test.csv')

# combine train and test set so both get the same cleaning
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
combi = pd.concat([dataset, testdata], ignore_index=True)

# helper that removes every substring matching `pattern` from a tweet
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        # escape the match so it is removed literally
        input_txt = re.sub(re.escape(i), '', input_txt)

    return input_txt

# remove twitter handles (@user)
combi['tweet'] = np.vectorize(remove_pattern)(combi['tweet'], r"@[\w]*")

# remove special characters, numbers, punctuations
# (regex=True is required explicitly in pandas 2.0 and later)
combi['tweet'] = combi['tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)

# Removing short words (3 characters or fewer)
combi['tweet'] = combi['tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

tokenized_tweet = combi['tweet'].apply(lambda x: x.split())


from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])  # stemming

# Now let's stitch these tokens back together
tokenized_tweet = tokenized_tweet.apply(' '.join)

combi['tweet'] = tokenized_tweet

# Bag-of-Words Features

from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(combi['tweet'])


from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# the first 31,962 rows of combi are the training tweets
train_bow = bow[:31962, :]
test_bow = bow[31962:, :]

# splitting data into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, dataset['label'], random_state=42, test_size=0.3)

from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=0, gamma=0.14, C=11)
svm.fit(xtrain_bow, ytrain)
y_pred5 = svm.predict(xvalid_bow)
prediction_int7 = y_pred5.astype(int)  # np.int was removed in NumPy 1.24
print(f1_score(yvalid, prediction_int7))

# prediction on test set
test_pred = svm.predict(test_bow)
test_pred_int = test_pred.astype(int)
testdata['label'] = test_pred_int
submission = testdata[['id', 'label']]
submission.to_csv('svmrbfbow.csv', index=False)  # writing data to a CSV file

--------------------------------------------------------------------------------
/code.py:
--------------------------------------------------------------------------------
# Practice Problem : Twitter Sentiment Analysis by Analytics Vidhya

# The objective of this task is to detect hate speech in tweets. Formally, given a training sample of tweets and labels,
# where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective
# is to predict the labels on the given test dataset.
# Note: The evaluation metric for this practice problem is the F1-score.


# Importing libraries
import re  # cleaning the text
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import nltk
import warnings

# NLTK is a leading platform for building Python programs to work with human language data

# Importing dataset
# (the README calls the test file test_tweets.csv; adjust the path if needed)
dataset = pd.read_csv('train.csv')
testdata = pd.read_csv('test.csv')

# To see the first few rows of the train dataset
dataset.head()

dataset.info()

# breakdown of how many tweets are ‘0’s and how many tweets are ‘1’s
dataset['label'].value_counts()

# Initial data cleaning requirements that we can think of after looking at the top 5 records:
# - The Twitter handles are already masked as @user due to privacy concerns, so they give hardly any information about the nature of the tweet.
# - We can also get rid of punctuation, numbers and special characters, since they do not help in differentiating kinds of tweets.
# - Most of the shorter words do not add much value, for example ‘pdx’, ‘his’, ‘all’. So, we will remove them from our data as well.
# - Once we have executed the above three steps, we can split every tweet into individual words or tokens, which is an essential step in any NLP task.
# - In the 4th tweet, there is a word ‘love’. We might also have terms like loves, loving, lovable, etc. in the rest of the data. These terms are often
#   used in the same context. If we reduce them to their root word, ‘love’, we reduce the number of unique words in our data without losing a
#   significant amount of information.


# Tweets Preprocessing and Cleaning
# The objective of this step is to remove noise that is less relevant to the sentiment of a tweet, such as punctuation,
# special characters, numbers, and terms that carry little weight in the context of the text.

# combine train and test set so both get the same cleaning
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
combi = pd.concat([dataset, testdata], ignore_index=True)


# A user-defined function to remove unwanted text patterns from the tweets.
# It returns the input string with every match of the given pattern removed.
def remove_pattern(input_txt, pattern):
    r = re.findall(pattern, input_txt)
    for i in r:
        # escape the match so it is removed literally
        input_txt = re.sub(re.escape(i), '', input_txt)

    return input_txt

# remove twitter handles (@user)
combi['tweet'] = np.vectorize(remove_pattern)(combi['tweet'], r"@[\w]*")

# remove special characters, numbers, punctuations
# (regex=True is required explicitly in pandas 2.0 and later)
combi['tweet'] = combi['tweet'].str.replace("[^a-zA-Z#]", " ", regex=True)


# Removing short words (3 characters or fewer)
combi['tweet'] = combi['tweet'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))

# Tokenization is the act of breaking up a string into pieces such as words, keywords, phrases, symbols and other elements called tokens.
# Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded.
tokenized_tweet = combi['tweet'].apply(lambda x: x.split())

# Stemming is a rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc.) from a word.
# For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

tokenized_tweet = tokenized_tweet.apply(lambda x: [stemmer.stem(i) for i in x])  # stemming

# Now let's stitch these tokens back together
tokenized_tweet = tokenized_tweet.apply(' '.join)

combi['tweet'] = tokenized_tweet


#---------------------------------------------------------------------------------------------------------------

## Story Generation and Visualization from Tweets
# A wordcloud is a visualization wherein the most frequent words appear in
# large size and the less frequent words appear in smaller sizes.

all_words = ' '.join(combi['tweet'])
from wordcloud import WordCloud
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)

plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

# Words in non racist/sexist tweets
normal_words = ' '.join(combi['tweet'][combi['label'] == 0])

wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(normal_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

# Words in racist/sexist tweets
negative_words = ' '.join(combi['tweet'][combi['label'] == 1])
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(negative_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

# Understanding the impact of hashtags on tweet sentiment
# Hashtags on Twitter are synonymous with the ongoing trends on Twitter at any particular point in time.
# function to collect hashtags
def hashtag_extract(x):
    hashtags = []
    # Loop over the tweets and collect the hashtags found in each one
    for i in x:
        ht = re.findall(r"#(\w+)", i)
        hashtags.append(ht)

    return hashtags

# extracting hashtags from non racist/sexist tweets
HT_regular = hashtag_extract(combi['tweet'][combi['label'] == 0])

# extracting hashtags from racist/sexist tweets
HT_negative = hashtag_extract(combi['tweet'][combi['label'] == 1])

# unnesting the lists of lists
HT_regular = sum(HT_regular, [])
HT_negative = sum(HT_negative, [])

# Non-Racist/Sexist Tweets

a = nltk.FreqDist(HT_regular)
d = pd.DataFrame({'Hashtag': list(a.keys()),
                  'Count': list(a.values())})
# selecting top 10 most frequent hashtags
d = d.nlargest(columns="Count", n=10)
plt.figure(figsize=(13, 7))
ax = sns.barplot(data=d, x="Hashtag", y="Count")
plt.show()

# Racist/Sexist Tweets

b = nltk.FreqDist(HT_negative)
e = pd.DataFrame({'Hashtag': list(b.keys()), 'Count': list(b.values())})
# selecting top 10 most frequent hashtags
e = e.nlargest(columns="Count", n=10)
plt.figure(figsize=(13, 7))
ax = sns.barplot(data=e, x="Hashtag", y="Count")
plt.show()

# As expected, most of the terms are negative, with a few neutral terms as well.
# So, it’s not a bad idea to keep these hashtags in our data, as they contain useful information.

#-------------------------------------------------------------------------------------------------


# Extracting Features from Cleaned Tweets

# Bag-of-Words Features
# Bag-of-Words features can be easily created using sklearn’s CountVectorizer class.
# We will set the parameter max_features = 1000 to select only the top 1000 terms ordered by term frequency across the corpus.

from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(combi['tweet'])

# TF-IDF Features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
# TF-IDF feature matrix
tfidf = tfidf_vectorizer.fit_transform(combi['tweet'])


# Building logistic regression model using Bag-of-Words features
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# the first 31,962 rows of combi are the training tweets
train_bow = bow[:31962, :]
test_bow = bow[31962:, :]

# splitting data into training and validation set
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, dataset['label'], random_state=42, test_size=0.3)

lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain)  # training the model

prediction1 = lreg.predict_proba(xvalid_bow)  # predicting on the validation set
prediction_int1 = prediction1[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
prediction_int1 = prediction_int1.astype(int)  # np.int was removed in NumPy 1.24

print(f1_score(yvalid, prediction_int1))  # calculating f1 score
# f1 = 0.53078

# Making the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(yvalid, prediction_int1)
print(cm)
prediction_test = lreg.predict_proba(test_bow)  # predicting on the test set
prediction_test = prediction_test[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
prediction_test = prediction_test.astype(int)

# Export submission file

testdata['label'] = prediction_test
submission = testdata[['id', 'label']]
submission.to_csv('sub_log_bow.csv', index=False)  # writing data to a CSV file


# Building logistic regression model using TF-IDF features
train_tfidf = tfidf[:31962, :]
test_tfidf = tfidf[31962:, :]

# reuse the train/validation split made for the Bag-of-Words features
xtrain_tfidf = train_tfidf[ytrain.index]
xvalid_tfidf = train_tfidf[yvalid.index]

lreg.fit(xtrain_tfidf, ytrain)

prediction2 = lreg.predict_proba(xvalid_tfidf)
prediction_int2 = prediction2[:, 1] >= 0.3
prediction_int2 = prediction_int2.astype(int)  # np.int was removed in NumPy 1.24

print(f1_score(yvalid, prediction_int2))
# f1 = 0.54465

# predicting on the test set; the original passed test_bow here, which does not
# match the TF-IDF features the model was just fitted on
prediction_test = lreg.predict_proba(test_tfidf)
prediction_test = prediction_test[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
prediction_test = prediction_test.astype(int)

# Export submission file

testdata['label'] = prediction_test
submission = testdata[['id', 'label']]
submission.to_csv('sub_log_tfidf.csv', index=False)  # writing data to a CSV file


# K-Nearest Neighbors (K-NN)
# Building K-NN model using Bag-of-Words features
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(xtrain_bow, ytrain)
# Predicting on the validation set
y_pred = knn.predict_proba(xvalid_bow)
prediction_int3 = y_pred[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
prediction_int3 = prediction_int3.astype(int)
print(f1_score(yvalid, prediction_int3))
# f1 = 0.43578

# Building K-NN model using TF-IDF features
knn.fit(xtrain_tfidf, ytrain)
y_pred2 = knn.predict_proba(xvalid_tfidf)
prediction_int4 = y_pred2[:, 1] >= 0.3
prediction_int4 = prediction_int4.astype(int)

print(f1_score(yvalid, prediction_int4))
# f1 = 0.45124


# Naive Bayes
# GaussianNB rejects sparse input with the error:
#   A sparse matrix was passed, but dense data is required.
#   Use X.toarray() to convert to a dense numpy array.
# The block is left disabled, as in the original; it now densifies the
# features with .toarray(), as the error message suggests.
'''from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(xtrain_bow.toarray(), ytrain)
# Predicting on the validation set
y_pred3 = nb.predict_proba(xvalid_bow.toarray())
prediction_int5 = y_pred3[:, 1] >= 0.3
prediction_int5 = prediction_int5.astype(int)
f1_score(yvalid, prediction_int5)


# Building model using TF-IDF features
nb.fit(xtrain_tfidf.toarray(), ytrain)
y_pred4 = nb.predict_proba(xvalid_tfidf.toarray())
prediction_int6 = y_pred4[:, 1] >= 0.3
prediction_int6 = prediction_int6.astype(int)

f1_score(yvalid, prediction_int6)
'''

# Fitting SVM to the training set
from sklearn.svm import SVC
svm = SVC()
svm.fit(xtrain_bow, ytrain)


# Building model using TF-IDF features
svm.fit(xtrain_tfidf, ytrain)


# Applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=svm, X=xtrain_bow, y=ytrain, cv=10)
print(accuracies.mean())
print(accuracies.std())

# Applying grid search to find the best model and the best parameters for Bag-of-Words
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator=svm,
                           param_grid=parameters,
                           scoring='f1',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(xtrain_bow, ytrain)
best_f1 = grid_search.best_score_
best_parameters = grid_search.best_params_
# best f1 = 0.5953
# best parameters: C = 10, gamma = 0.2, kernel = rbf

# Grid search again with parameters close to the above result
parameters = [
    {'C': [5, 10, 15, 20], 'kernel': ['rbf'], 'gamma': [0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25]}]
grid_search = GridSearchCV(estimator=svm,
                           param_grid=parameters,
                           scoring='f1',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(xtrain_bow, ytrain)
best_f1 = grid_search.best_score_
best_parameters = grid_search.best_params_
# best f1 = 0.6013
# best parameters: C = 10, gamma = 0.15, kernel = rbf

parameters = [
    {'C': [8, 9, 10, 11, 12], 'kernel': ['rbf'], 'gamma': [0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18]}]
grid_search = GridSearchCV(estimator=svm,
                           param_grid=parameters,
                           scoring='f1',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(xtrain_bow, ytrain)
best_f1 = grid_search.best_score_
best_parameters = grid_search.best_params_
# best f1 = 0.6213
# best parameters: C = 11, gamma = 0.14, kernel = rbf
# We will go with these parameters

svm = SVC(kernel='rbf', random_state=0, gamma=0.14, C=11)
svm.fit(xtrain_bow, ytrain)
y_pred5 = svm.predict(xvalid_bow)
prediction_int7 = y_pred5.astype(int)
print(f1_score(yvalid, prediction_int7))

# prediction on test set
test_pred = svm.predict(test_bow)
test_pred_int = test_pred.astype(int)
testdata['label'] = test_pred_int
submission = testdata[['id', 'label']]
submission.to_csv('svmrbfbow.csv', index=False)  # writing data to a CSV file


# Applying grid search to find the best model and the best parameters for TF-IDF
parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']},
              {'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator=svm,
                           param_grid=parameters,
                           scoring='f1',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(xtrain_tfidf, ytrain)
best_f1 = grid_search.best_score_
best_parameters = grid_search.best_params_
# best f1 = 0.6146
# best parameters: C = 10, gamma = 0.5, kernel = rbf

# Grid search again with parameters close to the above result
parameters = [
    {'C': [5, 10, 15, 20], 'kernel': ['rbf'], 'gamma': [0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55]}]
grid_search = GridSearchCV(estimator=svm,
                           param_grid=parameters,
                           scoring='f1',
                           cv=10,
                           n_jobs=-1)
grid_search = grid_search.fit(xtrain_tfidf, ytrain)
best_f1 = grid_search.best_score_
best_parameters = grid_search.best_params_
# best f1 = 0.6166
# best parameters: C = 10, gamma = 0.51, kernel = rbf

svm = SVC(kernel='rbf', random_state=0, gamma=0.51, C=10)
# Building model using TF-IDF features
svm.fit(xtrain_tfidf, ytrain)
y_pred6 = svm.predict(xvalid_tfidf)
prediction_int8 = y_pred6.astype(int)
print(f1_score(yvalid, prediction_int8))

# prediction on test set
test_pred = svm.predict(test_tfidf)
test_pred_int = test_pred.astype(int)
testdata['label'] = test_pred_int
submission = testdata[['id', 'label']]
submission.to_csv('svmrbftfidf.csv', index=False)  # writing data to a CSV file


# Fitting Decision Tree Classification to the training set
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(xtrain_bow, ytrain)
y_pred7 = tree.predict(xvalid_bow)
prediction_int9 = y_pred7.astype(int)
print(f1_score(yvalid, prediction_int9))
# f1 = 0.50385


# Building model using TF-IDF features
tree.fit(xtrain_tfidf, ytrain)
# the original called svm.predict here, so the score recorded below
# may not reflect the decision tree
y_pred8 = tree.predict(xvalid_tfidf)
prediction_int10 = y_pred8.astype(int)
print(f1_score(yvalid, prediction_int10))
# f1 = 0.3919


# Building random forest model using Bag-of-Words features
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1024, criterion='entropy', random_state=0)
rf.fit(xtrain_bow, ytrain)  # training the model

predict_valid = rf.predict_proba(xvalid_bow)  # predicting on the validation set
valid_predict_int = predict_valid[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
valid_predict_int = valid_predict_int.astype(int)
print(f1_score(yvalid, valid_predict_int))  # calculating f1 score
# f1 (n = 64) = 0.5427 (recorded with 64 trees; the code above uses 1024)


# prediction on test set
test_pred = rf.predict_proba(test_bow)
test_pred_int = test_pred[:, 1] >= 0.3
test_pred_int = test_pred_int.astype(int)
testdata['label'] = test_pred_int
submission = testdata[['id', 'label']]
submission.to_csv('rfbow64.csv', index=False)  # writing data to a CSV file


# Building model using TF-IDF features
rf = RandomForestClassifier(n_estimators=128, criterion='entropy', random_state=0)
rf.fit(xtrain_tfidf, ytrain)  # training the model

predict_valid = rf.predict_proba(xvalid_tfidf)  # predicting on the validation set
valid_predict_int = predict_valid[:, 1] >= 0.3  # if the probability is greater than or equal to 0.3 then 1, else 0
valid_predict_int = valid_predict_int.astype(int)
print(f1_score(yvalid, valid_predict_int))  # calculating f1 score
# f1 (n = 128) = 0.5777

# prediction on test set
test_pred = rf.predict_proba(test_tfidf)
test_pred_int = test_pred[:, 1] >= 0.3
test_pred_int = test_pred_int.astype(int)  # np.int was removed in NumPy 1.24
testdata['label'] = test_pred_int
submission = testdata[['id', 'label']]
submission.to_csv('rf1024.csv', index=False)  # writing data to a CSV file


# Emoticons
# Use of emoticons is very prevalent throughout the web, more so on micro-blogging sites.

# Repeating Characters
# People often use repeating characters in colloquial language,
# like "I’m in a hurrryyyyy", "We won, yaaayyyyy!" As our final pre-processing step,
# we replace characters repeating more than twice with two characters.

--------------------------------------------------------------------------------
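The repeating-characters step described in the closing comments of code.py has no code in the repository. The following is a minimal sketch of one way it could be done, assuming a regular expression with a backreference is acceptable; the helper name `squeeze_repeats` is hypothetical and is not part of the original files.

```python
import re

# Hypothetical helper (not in the original repository).
# (.)\1{2,} matches any character followed by two or more copies of itself,
# i.e. a run of three or more identical characters.
_REPEAT_RE = re.compile(r"(.)\1{2,}")


def squeeze_repeats(text: str) -> str:
    """Collapse every run of 3+ identical characters down to exactly two."""
    return _REPEAT_RE.sub(r"\1\1", text)


if __name__ == "__main__":
    # The two example phrases come from the comments in code.py.
    print(squeeze_repeats("I'm in a hurrryyyyy"))  # I'm in a hurryy
    print(squeeze_repeats("We won, yaaayyyyy!"))   # We won, yaayy!
```

If this were used in the pipeline, it could presumably be applied with `combi['tweet'].apply(squeeze_repeats)` before the short-word filter, but where the author intended to place it is not stated.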