├── .gitignore
├── LICENSE
├── README.md
├── depression.py
├── depressive_tweets_processed.csv
├── model accuracy.png
└── model loss.png

/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 peyman iravani

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
This repo is dedicated to detecting depression from users' tweets.

Two kinds of tweets are required for this project: random tweets that do not indicate depression, and tweets that suggest the user may be depressed.

The random-tweets dataset can be downloaded from Kaggle: https://www.kaggle.com/ywang311/twitter-sentiment/data.

Since no public dataset of depressive tweets exists, the dataset for this project was collected with the TWINT web scraper, using the keyword "depression" and scraping all tweets posted within a one-day span. The scraped tweets may include tweets that do not actually indicate the author is depressed, such as tweets linking to articles about depression. Hence, the scraped tweets were checked manually to get better testing results. The cleaned result is saved in CSV format as "depressive_tweets_processed.csv".

We also need the pretrained vectors for the word2vec model provided by Google, available at the following link: https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download

In this project I used an LSTM+CNN model. The model takes a tweet as input and outputs a single number representing the probability that the tweet indicates depression. Each word of the input sentence is replaced with its embedding, and the resulting embedding sequence is run through a convolutional layer. CNNs are excellent at learning spatial structure, so the convolutional layer extracts local structure from the sequential data before passing it to a standard LSTM layer. Last but not least, the output of the LSTM layer is fed into a standard Dense layer for the prediction.
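The core of that pipeline, as implemented in depression.py, boils down to the following Keras model. It is reproduced here only as a quick reference; the embedding matrix below is a zero placeholder, while in the real script it is filled from the Google News vectors:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, Dropout, LSTM, Dense

# Placeholder so the snippet is self-contained; depression.py builds this
# matrix from the pretrained Google News word2vec vectors.
embedding_matrix = np.zeros((20001, 300))

model = Sequential()
# frozen embedding layer initialised with the pretrained word vectors
model.add(Embedding(len(embedding_matrix), 300, weights=[embedding_matrix],
                    input_length=140, trainable=False))
# 1D convolution + pooling learn local structure over the embedded sequence
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
# LSTM layer models longer-range dependencies in the tweet
model.add(LSTM(300))
model.add(Dropout(0.2))
# sigmoid output = probability that the tweet indicates depression
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['acc'])
```

For the data-collection step described earlier, a TWINT scrape along the following lines should produce the raw depressive tweets. The exact options and date range used for "depressive_tweets_processed.csv" are not recorded in this repo, so the values below are only an illustrative assumption:

```python
import twint

c = twint.Config()
c.Search = "depression"          # keyword used for the scrape
c.Lang = "en"
c.Since = "2019-01-01"           # example one-day span
c.Until = "2019-01-02"
c.Store_csv = True
c.Output = "depressive_tweets_raw.csv"
twint.run.Search(c)
```

The resulting raw CSV still needs the manual filtering step described above before it can serve as depressive_tweets_processed.csv.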
The following figures show the accuracy and loss of the model; I obtained 99% on class 0 (the non-depressed tweets) and 97% on class 1 (the depressive tweets):

![screenshot from 2019-02-10 17-41-50](https://user-images.githubusercontent.com/23243761/52536500-ad0b8d00-2d5b-11e9-8740-c607502b759d.png)
![screenshot from 2019-02-10 17-43-23](https://user-images.githubusercontent.com/23243761/52536501-ada42380-2d5b-11e9-9324-df1425073cfc.png)
![model accuracy](https://user-images.githubusercontent.com/23243761/52536100-5a7ba200-2d56-11e9-8f4c-d6d14c96b9ac.png)
![model loss](https://user-images.githubusercontent.com/23243761/52536102-5a7ba200-2d56-11e9-96e9-b00107c1c633.png)
--------------------------------------------------------------------------------
/depression.py:
--------------------------------------------------------------------------------
import warnings
warnings.filterwarnings("ignore")
import ftfy
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import re

from math import exp
from numpy import sign

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk import PorterStemmer

from keras.models import Model, Sequential
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.layers import Conv1D, Dense, Input, LSTM, Embedding, Dropout, Activation, MaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import plot_model

# Reproducibility
np.random.seed(1234)

DEPRES_NROWS = 3200        # number of rows to read from DEPRESSIVE_TWEETS_CSV
RANDOM_NROWS = 12000       # number of rows to read from RANDOM_TWEETS_CSV
MAX_SEQUENCE_LENGTH = 140  # max tweet length
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 300
TRAIN_SPLIT = 0.6
TEST_SPLIT = 0.2
LEARNING_RATE = 0.1
EPOCHS = 10

DEPRESSIVE_TWEETS_CSV = 'depressive_tweets_processed.csv'
RANDOM_TWEETS_CSV = 'Sentiment Analysis Dataset 2.csv'
depressive_tweets_df = pd.read_csv(DEPRESSIVE_TWEETS_CSV, sep='|', header=None, usecols=range(0, 9), nrows=DEPRES_NROWS)
random_tweets_df = pd.read_csv(RANDOM_TWEETS_CSV, encoding="ISO-8859-1", usecols=range(0, 4), nrows=RANDOM_NROWS)
# EMBEDDING_FILE is the pretrained Google News word2vec model taken from
# https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&export=download
EMBEDDING_FILE = 'GoogleNews-vectors-negative300.bin.gz'

print(depressive_tweets_df.head())

word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

# Contraction expansion map
cList = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
"how's": "how is", 75 | "I'd": "I would", 76 | "I'd've": "I would have", 77 | "I'll": "I will", 78 | "I'll've": "I will have", 79 | "I'm": "I am", 80 | "I've": "I have", 81 | "isn't": "is not", 82 | "it'd": "it had", 83 | "it'd've": "it would have", 84 | "it'll": "it will", 85 | "it'll've": "it will have", 86 | "it's": "it is", 87 | "let's": "let us", 88 | "ma'am": "madam", 89 | "mayn't": "may not", 90 | "might've": "might have", 91 | "mightn't": "might not", 92 | "mightn't've": "might not have", 93 | "must've": "must have", 94 | "mustn't": "must not", 95 | "mustn't've": "must not have", 96 | "needn't": "need not", 97 | "needn't've": "need not have", 98 | "o'clock": "of the clock", 99 | "oughtn't": "ought not", 100 | "oughtn't've": "ought not have", 101 | "shan't": "shall not", 102 | "sha'n't": "shall not", 103 | "shan't've": "shall not have", 104 | "she'd": "she would", 105 | "she'd've": "she would have", 106 | "she'll": "she will", 107 | "she'll've": "she will have", 108 | "she's": "she is", 109 | "should've": "should have", 110 | "shouldn't": "should not", 111 | "shouldn't've": "should not have", 112 | "so've": "so have", 113 | "so's": "so is", 114 | "that'd": "that would", 115 | "that'd've": "that would have", 116 | "that's": "that is", 117 | "there'd": "there had", 118 | "there'd've": "there would have", 119 | "there's": "there is", 120 | "they'd": "they would", 121 | "they'd've": "they would have", 122 | "they'll": "they will", 123 | "they'll've": "they will have", 124 | "they're": "they are", 125 | "they've": "they have", 126 | "to've": "to have", 127 | "wasn't": "was not", 128 | "we'd": "we had", 129 | "we'd've": "we would have", 130 | "we'll": "we will", 131 | "we'll've": "we will have", 132 | "we're": "we are", 133 | "we've": "we have", 134 | "weren't": "were not", 135 | "what'll": "what will", 136 | "what'll've": "what will have", 137 | "what're": "what are", 138 | "what's": "what is", 139 | "what've": "what have", 140 | "when's": "when is", 141 | "when've": "when have", 142 | "where'd": "where did", 143 | "where's": "where is", 144 | "where've": "where have", 145 | "who'll": "who will", 146 | "who'll've": "who will have", 147 | "who's": "who is", 148 | "who've": "who have", 149 | "why's": "why is", 150 | "why've": "why have", 151 | "will've": "will have", 152 | "won't": "will not", 153 | "won't've": "will not have", 154 | "would've": "would have", 155 | "wouldn't": "would not", 156 | "wouldn't've": "would not have", 157 | "y'all": "you all", 158 | "y'alls": "you alls", 159 | "y'all'd": "you all would", 160 | "y'all'd've": "you all would have", 161 | "y'all're": "you all are", 162 | "y'all've": "you all have", 163 | "you'd": "you had", 164 | "you'd've": "you would have", 165 | "you'll": "you you will", 166 | "you'll've": "you you will have", 167 | "you're": "you are", 168 | "you've": "you have" 169 | } 170 | 171 | c_re = re.compile('(%s)' % '|'.join(cList.keys())) 172 | 173 | def expandContractions(text, c_re=c_re): 174 | def replace(match): 175 | return cList[match.group(0)] 176 | return c_re.sub(replace, text) 177 | 178 | 179 | def clean_tweets(tweets): 180 | cleaned_tweets = [] 181 | for tweet in tweets: 182 | tweet = str(tweet) 183 | # if url links then dont append to avoid news articles 184 | # also check tweet length, save those > 10 (length of word "depression") 185 | if re.match("(\w+:\/\/\S+)", tweet) == None and len(tweet) > 10: 186 | # remove hashtag, @mention, emoji and image URLs 187 | tweet = ' '.join( 188 | 
re.sub("(@[A-Za-z0-9]+)|(\#[A-Za-z0-9]+)|()|(pic\.twitter\.com\/.*)", " ", tweet).split()) 189 | 190 | # fix weirdly encoded texts 191 | tweet = ftfy.fix_text(tweet) 192 | 193 | # expand contraction 194 | tweet = expandContractions(tweet) 195 | 196 | # remove punctuation 197 | tweet = ' '.join(re.sub("([^0-9A-Za-z \t])", " ", tweet).split()) 198 | 199 | # stop words 200 | stop_words = set(stopwords.words('english')) 201 | word_tokens = nltk.word_tokenize(tweet) 202 | filtered_sentence = [w for w in word_tokens if not w in stop_words] 203 | tweet = ' '.join(filtered_sentence) 204 | 205 | # stemming words 206 | tweet = PorterStemmer().stem(tweet) 207 | 208 | cleaned_tweets.append(tweet) 209 | 210 | return cleaned_tweets 211 | depressive_tweets_arr = [x for x in depressive_tweets_df[5]] 212 | random_tweets_arr = [x for x in random_tweets_df['SentimentText']] 213 | X_d = clean_tweets(depressive_tweets_arr) 214 | X_r = clean_tweets(random_tweets_arr) 215 | 216 | tokenizer = Tokenizer(num_words=MAX_NB_WORDS) 217 | tokenizer.fit_on_texts(X_d + X_r) 218 | 219 | sequences_d = tokenizer.texts_to_sequences(X_d) 220 | sequences_r = tokenizer.texts_to_sequences(X_r) 221 | 222 | 223 | word_index = tokenizer.word_index 224 | print('Found %s unique tokens' % len(word_index)) 225 | 226 | 227 | data_d = pad_sequences(sequences_d, maxlen=MAX_SEQUENCE_LENGTH) 228 | data_r = pad_sequences(sequences_r, maxlen=MAX_SEQUENCE_LENGTH) 229 | print('Shape of data_d tensor:', data_d.shape) 230 | print('Shape of data_r tensor:', data_r.shape) 231 | 232 | nb_words = min(MAX_NB_WORDS, len(word_index)) 233 | 234 | embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM)) 235 | 236 | for (word, idx) in word_index.items(): 237 | if word in word2vec.vocab and idx < MAX_NB_WORDS: 238 | embedding_matrix[idx] = word2vec.word_vec(word) 239 | 240 | # Assigning labels to the depressive tweets and random tweets data 241 | labels_d = np.array([1] * DEPRES_NROWS) 242 | labels_r = np.array([0] * RANDOM_NROWS) 243 | 244 | # Splitting the arrays into test (60%), validation (20%), and train data (20%) 245 | perm_d = np.random.permutation(len(data_d)) 246 | idx_train_d = perm_d[:int(len(data_d)*(TRAIN_SPLIT))] 247 | idx_test_d = perm_d[int(len(data_d)*(TRAIN_SPLIT)):int(len(data_d)*(TRAIN_SPLIT+TEST_SPLIT))] 248 | idx_val_d = perm_d[int(len(data_d)*(TRAIN_SPLIT+TEST_SPLIT)):] 249 | 250 | perm_r = np.random.permutation(len(data_r)) 251 | idx_train_r = perm_r[:int(len(data_r)*(TRAIN_SPLIT))] 252 | idx_test_r = perm_r[int(len(data_r)*(TRAIN_SPLIT)):int(len(data_r)*(TRAIN_SPLIT+TEST_SPLIT))] 253 | idx_val_r = perm_r[int(len(data_r)*(TRAIN_SPLIT+TEST_SPLIT)):] 254 | 255 | # Combine depressive tweets and random tweets arrays 256 | data_train = np.concatenate((data_d[idx_train_d], data_r[idx_train_r])) 257 | labels_train = np.concatenate((labels_d[idx_train_d], labels_r[idx_train_r])) 258 | data_test = np.concatenate((data_d[idx_test_d], data_r[idx_test_r])) 259 | labels_test = np.concatenate((labels_d[idx_test_d], labels_r[idx_test_r])) 260 | data_val = np.concatenate((data_d[idx_val_d], data_r[idx_val_r])) 261 | labels_val = np.concatenate((labels_d[idx_val_d], labels_r[idx_val_r])) 262 | 263 | # Shuffling 264 | perm_train = np.random.permutation(len(data_train)) 265 | data_train = data_train[perm_train] 266 | labels_train = labels_train[perm_train] 267 | perm_test = np.random.permutation(len(data_test)) 268 | data_test = data_test[perm_test] 269 | labels_test = labels_test[perm_test] 270 | perm_val = np.random.permutation(len(data_val)) 271 | 
data_val = data_val[perm_val]
labels_val = labels_val[perm_val]

model = Sequential()
# Embedding layer (frozen, initialised with the pretrained word2vec vectors)
model.add(Embedding(len(embedding_matrix), EMBEDDING_DIM, weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LENGTH, trainable=False))
# Convolutional layer
model.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.2))
# LSTM layer
model.add(LSTM(300))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['acc'])
print(model.summary())


early_stop = EarlyStopping(monitor='val_loss', patience=3)

hist = model.fit(data_train, labels_train, validation_data=(data_val, labels_val),
                 epochs=EPOCHS, batch_size=40, shuffle=True, callbacks=[early_stop])
# plot_model(model, to_file='model.png')
plt.plot(hist.history['acc'])
plt.plot(hist.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

labels_pred = model.predict(data_test)
labels_pred = np.round(labels_pred.flatten())
accuracy = accuracy_score(labels_test, labels_pred)
print("Accuracy: %.2f%%" % (accuracy*100))

print(classification_report(labels_test, labels_pred))


class LogReg:
    """
    Class to represent a logistic regression model.
    """

    def __init__(self, l_rate, epochs, n_features):
        """
        Create a new model with certain parameters.

        :param l_rate: Initial learning rate for model.
        :param epochs: Number of epochs to train for.
        :param n_features: Number of features.
        """
        self.l_rate = l_rate
        self.epochs = epochs
        self.coef = [0.0] * n_features
        self.bias = 0.0

    def sigmoid(self, score, threshold=20.0):
        """
        Logistic function; caps the score at +/- threshold to prevent overflow of exp.

        :param score: A real valued number to convert into a number between 0 and 1.
        """
        if abs(score) > threshold:
            score = threshold * sign(score)
        activation = exp(score)
        return activation / (1.0 + activation)

    def predict(self, features):
        """
        Given an example's features and the coefficients, predicts the class probability.

        :param features: List of real valued features for a single training example.

        :return: Returns the predicted probability of class 1 (between 0 and 1).
        """
        value = sum([features[i] * self.coef[i] for i in range(len(features))]) + self.bias
        return self.sigmoid(value)

    def sg_update(self, features, label):
        """
        Computes the update to the weights based on a predicted example.

        :param features: Features to train on.
        :param label: Corresponding label for features.
        """
        yhat = self.predict(features)
        e = label - yhat
        self.bias = self.bias + self.l_rate * e * yhat * (1 - yhat)
        for i in range(len(features)):
            self.coef[i] = self.coef[i] + self.l_rate * e * yhat * (1 - yhat) * features[i]
        return

    def train(self, X, y):
        """
        Computes logistic regression coefficients using stochastic gradient descent.

        :param X: Features to train on.
        :param y: Corresponding label for each set of features.

        :return: Returns the trained bias and the list of model weight coefficients.
        """
        for epoch in range(self.epochs):
            for features, label in zip(X, y):
                self.sg_update(features, label)
        return self.bias, self.coef


def get_accuracy(y_bar, y_pred):
    """
    Computes what percent of the total testing data the model classified correctly.

    :param y_bar: List of ground truth classes for each example.
    :param y_pred: List of model predicted class for each example.

    :return: Returns the model accuracy as a percentage (between 0 and 100).
    """
    correct = 0
    for i in range(len(y_bar)):
        if y_bar[i] == y_pred[i]:
            correct += 1
    accuracy = (correct / len(y_bar)) * 100.0
    return accuracy


# Baseline logistic regression model for comparison
logreg = LogReg(LEARNING_RATE, EPOCHS, len(data_train[0]))
bias_logreg, weights_logreg = logreg.train(data_train, labels_train)
y_logistic = [round(logreg.predict(example)) for example in data_test]


# Compare accuracies
accuracy_logistic = get_accuracy(labels_test, y_logistic)
print('Logistic Regression Accuracy: {:0.3f}'.format(accuracy_logistic))
--------------------------------------------------------------------------------
/model accuracy.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eddieir/Depression_detection_using_Twitter_post/1779047407b01bb66d1918827544495b3b7098d2/model accuracy.png
--------------------------------------------------------------------------------
/model loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/eddieir/Depression_detection_using_Twitter_post/1779047407b01bb66d1918827544495b3b7098d2/model loss.png
--------------------------------------------------------------------------------