├── .gitignore ├── history_main_model_attr_0_w2v.txt ├── perf_output_main_model_0_w2v.txt ├── essays.csv ├── run.sh ├── LICENSE ├── README.md ├── process_data.py ├── process_data_python3.py ├── conv_net_train_keras.py ├── conv_net_classes.py ├── conv_net_classes_gpu.py ├── conv_net_train.py └── conv_net_train_gpu.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.org 3 | -------------------------------------------------------------------------------- /history_main_model_attr_0_w2v.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /perf_output_main_model_0_w2v.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /essays.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amirmohammadkz/personality-detection/HEAD/essays.csv -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | cd /home/ubuntu/personality-detection/ 2 | 3 | export CUDA_HOME=/usr/local/cuda 4 | export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 5 | PATH=${CUDA_HOME}/bin:${PATH} 6 | 7 | export PATH 8 | 9 | mkdir t0w2v 10 | python conv_net_train_gpu.py -static -word2vec 0 11 | mv cvt* t0w2v 12 | mv perf_output_* t0w2v 13 | 14 | mkdir t1w2v 15 | python conv_net_train_gpu.py -static -word2vec 1 16 | mv cvt* t1w2v 17 | mv perf_output_* t1w2v 18 | 19 | mkdir t2w2v 20 | python conv_net_train_gpu.py -static -word2vec 2 21 | mv cvt* t2w2v 22 | mv perf_output_* t2w2v 23 | 24 | mkdir t3w2v 25 | python conv_net_train_gpu.py -static -word2vec 3 26 | mv cvt* t3w2v 27 | mv perf_output_* t3w2v 28 | 29 | 30 | mkdir t4w2v 31 | python conv_net_train_gpu.py -static -word2vec 4 32 | mv cvt* t4w2v 33 | mv perf_output_* t4w2v 34 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 MLURG 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning-Based Document Modeling for Personality Detection from Text 2 | 3 | This code implements the model described in [Deep Learning-Based Document Modeling for Personality Detection from Text](http://sentic.net/deep-learning-based-personality-detection.pdf) for detecting the Big Five personality traits, namely: 4 | 5 | - Extroversion 6 | - Neuroticism 7 | - Agreeableness 8 | - Conscientiousness 9 | - Openness 10 | 11 |
12 | ## Requirements 13 | 14 | - Ubuntu 16.04 64-bit (Tested) 15 | - Python 2.7 16 | - Theano 1.0.4 (Tested) 17 | - Pandas 0.24.2 (Tested) 18 | - Pre-trained [GoogleNews word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) vectors (if you are downloading over SSH, try [this direct link](https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz)) 19 | 20 |
21 | ## Preprocessing 22 | 23 | `process_data.py` prepares the data for training. It requires three command-line arguments: 24 | 25 | 1. Path to the [Google word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) file (`GoogleNews-vectors-negative300.bin`) 26 | 2. Path to the `essays.csv` file containing the annotated dataset 27 | 3. Path to the `mairesse.csv` file containing the [Mairesse features](http://farm2.user.srcf.net/research/personality/recognizer.html) for each sample/essay 28 | 29 | The script generates a pickle file, `essays_mairesse.p`. 30 | 31 | Example: 32 | 33 | ```sh 34 | python process_data.py ./GoogleNews-vectors-negative300.bin ./essays.csv ./mairesse.csv 35 | ``` 36 |
37 | ## Configuration for training the model 38 | 39 | A. **Running using CPU** 40 | 1. Configure `~/.theanorc`: 41 | ```sh 42 | [global] 43 | floatX=float64 44 | OMP_NUM_THREADS=20 45 | openmp=True 46 | ``` 47 | 48 | B. **Running using GPU** 49 | 1. Install [libgpuarray](http://deeplearning.net/software/libgpuarray/installation.html) 50 | 2. Install [cuDNN](http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html) for faster training 51 | 3. Add the CUDA path to `.bashrc`: 52 | ```sh 53 | export CUDA_HOME=/usr/local/cuda 54 | export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 55 | PATH=${CUDA_HOME}/bin:${PATH} 56 | 57 | export PATH 58 | ``` 59 | 4. Configure `~/.theanorc`: 60 | ```sh 61 | [cuda] 62 | root=/usr/local/cuda 63 | [global] 64 | device=cuda 65 | floatX = float32 66 | OMP_NUM_THREADS=20 67 | openmp=True 68 | 69 | [nvcc] 70 | fastmath=True 71 | ``` 72 |
73 | ## Training 74 | 75 | Note: before the changes in this repository, every epoch took about 5 hours to complete; with them, an epoch takes less than an hour on CPU and about 45 seconds on GPU (the exact speed-up depends on your system specs). 76 | 77 | A. **Running on GPU** 78 | 79 | `conv_net_train_gpu.py` trains and tests the model on the GPU. (Alternatively, you can run `run.sh` to train all five traits with word2vec embeddings in one go.) 80 | 81 | B. **Running on CPU** 82 | 83 | `conv_net_train.py` trains and tests the model on the CPU. 84 | 85 | Both scripts require three command-line arguments: 86 | 87 | 1. **Mode:** 88 | - `-static`: word embeddings will remain fixed 89 | - `-nonstatic`: word embeddings will be trained 90 | 2. **Word Embedding Type:** 91 | - `-rand`: randomly initialized word embeddings (the dimension is hardcoded to 300 by default; it can be changed via the default value of `k` in `process_data.py`) 92 | - `-word2vec`: 300-dimensional Google pre-trained word2vec embeddings 93 | 3.
**Personality Trait:** 94 | - `0`: Extroversion 95 | - `1`: Neuroticism 96 | - `2`: Agreeableness 97 | - `3`: Conscientiousness 98 | - `4`: Openness 99 | 100 | Example: 101 | 102 | ```sh 103 | python conv_net_train.py -static -word2vec 2 104 | ``` 105 | 106 | 107 | ## Citation 108 | 109 | If you use this code in your work then please cite the paper - [Deep Learning-Based Document Modeling for Personality Detection from Text](http://sentic.net/deep-learning-based-personality-detection.pdf) with the following: 110 | 111 | ``` 112 | @ARTICLE{7887639, 113 | author={N. Majumder and S. Poria and A. Gelbukh and E. Cambria}, 114 | journal={IEEE Intelligent Systems}, 115 | title={{Deep} Learning-Based Document Modeling for Personality Detection from Text}, 116 | year={2017}, 117 | volume={32}, 118 | number={2}, 119 | pages={74-79}, 120 | keywords={feedforward neural nets;information filtering;learning (artificial intelligence);pattern classification;text analysis;Big Five traits;author personality type;author psychological profile;binary classifier training;deep convolutional neural network;deep learning based method;deep learning-based document modeling;document vector;document-level Mairesse features;emotionally neutral input sentence filtering;identical architecture;personality detection;text;Artificial intelligence;Computational modeling;Emotion recognition;Feature extraction;Neural networks;Pragmatics;Semantics;artificial intelligence;convolutional neural network;distributional semantics;intelligent systems;natural language processing;neural-based document modeling;personality}, 121 | doi={10.1109/MIS.2017.23}, 122 | ISSN={1541-1672}, 123 | month={Mar},} 124 | ``` 125 | -------------------------------------------------------------------------------- /process_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import theano 3 | import cPickle 4 | from collections import defaultdict 5 | import sys, re 6 | import pandas as pd 7 | import csv 8 | from gensim.models import KeyedVectors 9 | 10 | 11 | def build_data_cv(datafile, cv=10, clean_string=True): 12 | """ 13 | Loads data and split into 10 folds. 
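    Returns a tuple (revs, vocab): revs is a list of per-essay dicts with keys
    "y0".."y4" (binary trait labels), "text" (the cleaned sentence chunks),
    "user" (author id), "num_words" (length of the longest chunk) and "split"
    (the assigned cross-validation fold); vocab maps each word to a rough
    frequency count used later when building the embedding matrices.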
14 | """ 15 | revs = [] 16 | vocab = defaultdict(float) 17 | 18 | with open(datafile, "rb") as csvf: 19 | csvreader=csv.reader(csvf,delimiter=',',quotechar='"') 20 | first_line=True 21 | for line in csvreader: 22 | if first_line: 23 | first_line=False 24 | continue 25 | status=[] 26 | sentences=re.split(r'[.?]', line[1].strip()) 27 | try: 28 | sentences.remove('') 29 | except ValueError: 30 | None 31 | 32 | for sent in sentences: 33 | if clean_string: 34 | orig_rev = clean_str(sent.strip()) 35 | if orig_rev=='': 36 | continue 37 | words = set(orig_rev.split()) 38 | splitted = orig_rev.split() 39 | if len(splitted)>150: 40 | orig_rev=[] 41 | splits=int(np.floor(len(splitted)/20)) 42 | for index in range(splits): 43 | orig_rev.append(' '.join(splitted[index*20:(index+1)*20])) 44 | if len(splitted)>splits*20: 45 | orig_rev.append(' '.join(splitted[splits*20:])) 46 | status.extend(orig_rev) 47 | else: 48 | status.append(orig_rev) 49 | else: 50 | orig_rev = sent.strip().lower() 51 | words = set(orig_rev.split()) 52 | status.append(orig_rev) 53 | 54 | for word in words: 55 | vocab[word] += 1 56 | 57 | 58 | datum = {"y0":1 if line[2].lower()=='y' else 0, 59 | "y1":1 if line[3].lower()=='y' else 0, 60 | "y2":1 if line[4].lower()=='y' else 0, 61 | "y3":1 if line[5].lower()=='y' else 0, 62 | "y4":1 if line[6].lower()=='y' else 0, 63 | "text": status, 64 | "user": line[0], 65 | "num_words": np.max([len(sent.split()) for sent in status]), 66 | "split": np.random.randint(0,cv)} 67 | revs.append(datum) 68 | 69 | 70 | return revs, vocab 71 | 72 | def get_W(word_vecs, k=300): 73 | """ 74 | Get word matrix. W[i] is the vector for word indexed by i 75 | """ 76 | vocab_size = len(word_vecs) 77 | word_idx_map = dict() 78 | W = np.zeros(shape=(vocab_size+1, k), dtype=theano.config.floatX) 79 | W[0] = np.zeros(k, dtype=theano.config.floatX) 80 | i = 1 81 | for word in word_vecs: 82 | W[i] = word_vecs[word] 83 | word_idx_map[word] = i 84 | i += 1 85 | return W, word_idx_map 86 | 87 | def load_bin_vec(fname, vocab): 88 | """ 89 | Loads 300x1 word vecs from Google (Mikolov) word2vec 90 | """ 91 | word_vecs = {} 92 | model = KeyedVectors.load_word2vec_format(fname, binary=True) 93 | for word in vocab: 94 | try: 95 | word_vecs[word] = model.get_vector(word) 96 | except KeyError: 97 | # Word not in the vocabulary 98 | pass 99 | return word_vecs 100 | def add_unknown_words(word_vecs, vocab, min_df=1, k=300): 101 | """ 102 | For words that occur in at least min_df documents, create a separate word vector. 103 | 0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones 104 | """ 105 | for word in vocab: 106 | if word not in word_vecs and vocab[word] >= min_df: 107 | word_vecs[word] = np.random.uniform(-0.25,0.25,k) 108 | print word 109 | 110 | def clean_str(string, TREC=False): 111 | """ 112 | Tokenization/string cleaning for all datasets except for SST. 113 | Every dataset is lower cased except for TREC 114 | """ 115 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 116 | string = re.sub(r"\'s", " \'s ", string) 117 | string = re.sub(r"\'ve", " have ", string) 118 | string = re.sub(r"n\'t", " not ", string) 119 | string = re.sub(r"\'re", " are ", string) 120 | string = re.sub(r"\'d" , " would ", string) 121 | string = re.sub(r"\'ll", " will ", string) 122 | string = re.sub(r",", " , ", string) 123 | string = re.sub(r"!", " ! ", string) 124 | string = re.sub(r"\(", " ( ", string) 125 | string = re.sub(r"\)", " ) ", string) 126 | string = re.sub(r"\?", " \? 
", string) 127 | # string = re.sub(r"[a-zA-Z]{4,}", "", string) 128 | string = re.sub(r"\s{2,}", " ", string) 129 | return string.strip() if TREC else string.strip().lower() 130 | 131 | def clean_str_sst(string): 132 | """ 133 | Tokenization/string cleaning for the SST dataset 134 | """ 135 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 136 | string = re.sub(r"\s{2,}", " ", string) 137 | return string.strip().lower() 138 | 139 | def get_mairesse_features(file_name): 140 | feats={} 141 | with open(file_name, "rb") as csvf: 142 | csvreader=csv.reader(csvf,delimiter=',',quotechar='"') 143 | for line in csvreader: 144 | feats[line[0]]=[float(f) for f in line[1:]] 145 | return feats 146 | 147 | if __name__=="__main__": 148 | w2v_file = sys.argv[1] 149 | data_folder = sys.argv[2] 150 | mairesse_file = sys.argv[3] 151 | print "loading data...", 152 | revs, vocab = build_data_cv(data_folder, cv=10, clean_string=True) 153 | num_words=pd.DataFrame(revs)["num_words"] 154 | max_l = np.max(num_words) 155 | print "data loaded!" 156 | print "number of status: " + str(len(revs)) 157 | print "vocab size: " + str(len(vocab)) 158 | print "max sentence length: " + str(max_l) 159 | print "loading word2vec vectors...", 160 | w2v = load_bin_vec(w2v_file, vocab) 161 | print "word2vec loaded!" 162 | print "num words already in word2vec: " + str(len(w2v)) 163 | add_unknown_words(w2v, vocab) 164 | W, word_idx_map = get_W(w2v) 165 | rand_vecs = {} 166 | add_unknown_words(rand_vecs, vocab) 167 | W2, _ = get_W(rand_vecs) 168 | mairesse = get_mairesse_features(mairesse_file) 169 | cPickle.dump([revs, W, W2, word_idx_map, vocab, mairesse], open("essays_mairesse.p", "wb")) 170 | print "dataset created!" 171 | 172 | -------------------------------------------------------------------------------- /process_data_python3.py: -------------------------------------------------------------------------------- 1 | import joblib 2 | import numpy as np 3 | # import cPickle 4 | from collections import defaultdict 5 | import sys, re 6 | import pandas as pd 7 | import csv 8 | import getpass 9 | from gensim.models import KeyedVectors 10 | 11 | 12 | def build_data_cv(datafile, cv=10, clean_string=True): 13 | """ 14 | Loads data and split into 10 folds. 
15 | """ 16 | revs = [] 17 | vocab = defaultdict(float) 18 | with open(datafile, "r", errors='ignore') as csvf: 19 | csvreader = csv.reader(csvf, delimiter=',', quotechar='"') 20 | first_line = True 21 | for line in csvreader: 22 | if first_line: 23 | first_line = False 24 | continue 25 | status = [] 26 | sentences = re.split(r'[.?]', line[1].strip()) 27 | try: 28 | sentences.remove('') 29 | except ValueError: 30 | None 31 | 32 | for sent in sentences: 33 | if clean_string: 34 | orig_rev = clean_str(sent.strip()) 35 | if orig_rev == '': 36 | continue 37 | words = set(orig_rev.split()) 38 | splitted = orig_rev.split() 39 | if len(splitted) > 150: 40 | orig_rev = [] 41 | splits = int(np.floor(len(splitted) / 20)) 42 | for index in range(splits): 43 | orig_rev.append(' '.join(splitted[index * 20:(index + 1) * 20])) 44 | if len(splitted) > splits * 20: 45 | orig_rev.append(' '.join(splitted[splits * 20:])) 46 | status.extend(orig_rev) 47 | else: 48 | status.append(orig_rev) 49 | else: 50 | orig_rev = sent.strip().lower() 51 | words = set(orig_rev.split()) 52 | status.append(orig_rev) 53 | 54 | for word in words: 55 | vocab[word] += 1 56 | 57 | datum = {"y0": 1 if line[2].lower() == 'y' else 0, 58 | "y1": 1 if line[3].lower() == 'y' else 0, 59 | "y2": 1 if line[4].lower() == 'y' else 0, 60 | "y3": 1 if line[5].lower() == 'y' else 0, 61 | "y4": 1 if line[6].lower() == 'y' else 0, 62 | "text": status, 63 | "user": line[0], 64 | "num_words": np.max([len(sent.split()) for sent in status]), 65 | "split": np.random.randint(0, cv)} 66 | revs.append(datum) 67 | 68 | return revs, vocab 69 | 70 | 71 | def get_W(word_vecs, k=300): 72 | """ 73 | Get word matrix. W[i] is the vector for word indexed by i 74 | """ 75 | vocab_size = len(word_vecs) 76 | word_idx_map = dict() 77 | W = np.zeros(shape=(vocab_size + 1, k), dtype="float64") 78 | W[0] = np.zeros(k, dtype="float64") 79 | i = 1 80 | for word in word_vecs: 81 | W[i] = word_vecs[word] 82 | word_idx_map[word] = i 83 | i += 1 84 | return W, word_idx_map 85 | 86 | 87 | def load_bin_vec(fname, vocab): 88 | """ 89 | Loads 300x1 word vecs from Google (Mikolov) word2vec 90 | """ 91 | word_vecs = {} 92 | model = KeyedVectors.load_word2vec_format(fname, binary=True) 93 | for word in vocab: 94 | try: 95 | word_vecs[word] = model.get_vector(word) 96 | except KeyError: 97 | # Word not in the vocabulary 98 | pass 99 | return word_vecs 100 | 101 | 102 | def add_unknown_words(word_vecs, vocab, min_df=1, k=300): 103 | """ 104 | For words that occur in at least min_df documents, create a separate word vector. 105 | 0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones 106 | """ 107 | i = 0.0 108 | for word in vocab: 109 | if word not in word_vecs and vocab[word] >= min_df: 110 | i += 1 111 | word_vecs[word] = np.random.uniform(-0.25, 0.25, k) 112 | print(word) 113 | print("##########################") 114 | print(i * 100 / len(vocab)) 115 | print("##########################") 116 | 117 | 118 | def clean_str(string, TREC=False): 119 | """ 120 | Tokenization/string cleaning for all datasets except for SST. 
121 | Every dataset is lower cased except for TREC 122 | """ 123 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 124 | string = re.sub(r"\'s", " \'s ", string) 125 | string = re.sub(r"\'ve", " have ", string) 126 | string = re.sub(r"n\'t", " not ", string) 127 | string = re.sub(r"\'re", " are ", string) 128 | string = re.sub(r"\'d", " would ", string) 129 | string = re.sub(r"\'ll", " will ", string) 130 | string = re.sub(r",", " , ", string) 131 | string = re.sub(r"!", " ! ", string) 132 | string = re.sub(r"\(", " ( ", string) 133 | string = re.sub(r"\)", " ) ", string) 134 | string = re.sub(r"\?", " \? ", string) 135 | # string = re.sub(r"[a-zA-Z]{4,}", "", string) 136 | string = re.sub(r"\s{2,}", " ", string) 137 | return string.strip() if TREC else string.strip().lower() 138 | 139 | 140 | def clean_str_sst(string): 141 | """ 142 | Tokenization/string cleaning for the SST dataset 143 | """ 144 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 145 | string = re.sub(r"\s{2,}", " ", string) 146 | return string.strip().lower() 147 | 148 | 149 | def get_mairesse_features(file_name): 150 | feats = {} 151 | with open(file_name, "r") as csvf: 152 | csvreader = csv.reader(csvf, delimiter=',', quotechar='"') 153 | for line in csvreader: 154 | feats[line[0]] = [float(f) for f in line[1:]] 155 | return feats 156 | 157 | 158 | if __name__ == "__main__": 159 | w2v_file = sys.argv[1] 160 | data_folder = sys.argv[2] 161 | mairesse_file = sys.argv[3] 162 | print("loading data...") 163 | revs, vocab = build_data_cv(data_folder, cv=10, clean_string=True) 164 | num_words = pd.DataFrame(revs)["num_words"] 165 | max_l = np.max(num_words) 166 | print("data loaded!") 167 | print("number of status: " + str(len(revs))) 168 | print("vocab size: " + str(len(vocab))) 169 | print("max sentence length: " + str(max_l)) 170 | print("loading word2vec vectors...", ) 171 | w2v = load_bin_vec(w2v_file, vocab) 172 | print("word2vec loaded!") 173 | print("num words already in word2vec: " + str(len(w2v))) 174 | add_unknown_words(w2v, vocab) 175 | W, word_idx_map = get_W(w2v) 176 | rand_vecs = {} 177 | add_unknown_words(rand_vecs, vocab) 178 | W2, _ = get_W(rand_vecs) 179 | mairesse = get_mairesse_features(mairesse_file) 180 | filename = 'essays_mairesse.p' 181 | joblib.dump([revs, W, W2, word_idx_map, vocab, mairesse], filename, protocol=2) 182 | print("dataset created!") 183 | 184 | -------------------------------------------------------------------------------- /conv_net_train_keras.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import csv 3 | import joblib 4 | import pickle 5 | import sys 6 | import keras 7 | from keras.layers import Input, concatenate, Dropout, Masking, Bidirectional, TimeDistributed 8 | from keras.layers import Conv3D, MaxPooling3D, Dense, Activation, Reshape, GRU, SimpleRNN, LSTM 9 | from keras.models import Model, Sequential 10 | from keras.activations import softmax 11 | from keras.utils import to_categorical, Sequence 12 | from keras.callbacks import CSVLogger 13 | from keras.callbacks import History, BaseLogger, ModelCheckpoint 14 | import pickle 15 | from pathlib import Path 16 | from keras.callbacks import ModelCheckpoint 17 | import os 18 | 19 | import logging 20 | 21 | model_name = "main_model" 22 | logging.basicConfig(filename='logger_' + model_name + '.log', level=logging.DEBUG, format='%(asctime)s %(message)s') 23 | 24 | 25 | class TestCallback(keras.callbacks.Callback): 26 | def __init__(self, test_data): 27 | 
self.test_data = test_data 28 | 29 | def on_epoch_end(self, epoch, logs={}): 30 | if epoch % 5 == 0: 31 | test_data, step_size = self.test_data 32 | loss, acc = self.model.evaluate_generator(test_data, steps=step_size) 33 | logging.info('\nTesting loss: {}, acc: {}\n'.format(loss, acc)) 34 | 35 | 36 | class MyLogger(keras.callbacks.Callback): 37 | def __init__(self, n): 38 | self.n = n # logging.info loss & acc every n epochs 39 | 40 | def on_epoch_end(self, epoch, logs={}): 41 | if epoch % self.n == 0: 42 | curr_loss = logs.get('loss') 43 | curr_acc = logs.get('acc') * 100 44 | val_loss = logs.get('val_loss') 45 | val_acc = logs.get('val_acc') 46 | logging.info("epoch = %4d loss = %0.6f acc = %0.2f%%" % (epoch, curr_loss, curr_acc)) 47 | logging.info("epoch = %4d val_loss = %0.6f val_acc = %0.2f%%" % (epoch, val_loss, val_acc)) 48 | 49 | 50 | class Generator(Sequence): 51 | 52 | def __init__(self, x_set, x_set_mairesse, y_set, batch_size, W, sent_max_count, word_max_count, embbeding_size): 53 | self.x, self.mairesse, self.y = x_set, x_set_mairesse, y_set 54 | self.batch_size = batch_size 55 | self.W = W 56 | self.sent_max_count = sent_max_count 57 | self.word_max_count = word_max_count 58 | self.embbeding_size = embbeding_size 59 | 60 | def __len__(self): 61 | return int(np.ceil(len(self.x) / float(self.batch_size))) 62 | 63 | def __getitem__(self, idx): 64 | batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size] 65 | batch_m = self.mairesse[idx * self.batch_size:(idx + 1) * self.batch_size] 66 | batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size] 67 | 68 | return [make_input_batch(batch_x, W, self.sent_max_count, self.word_max_count, self.embbeding_size), batch_m], \ 69 | to_categorical(batch_y, num_classes=2) 70 | 71 | 72 | # def get_checkpoints(model_dir): 73 | # saved_checkpoints = [f for f in os.listdir(model_dir) if f.startswith('model-' + model_name)] 74 | # saved_checkpoints.sort(reverse=True) 75 | # return saved_checkpoints 76 | 77 | 78 | def train_conv_net(datasets, W, historyfile, iteration, 79 | embbeding_size = 300, 80 | n_epochs = 50, 81 | batch_size = 50): 82 | word_max_count = len(datasets[0][0][0]) 83 | sent_max_count = len(datasets[0][0]) 84 | 85 | 86 | # define model architecture 87 | 88 | model_input = Input(shape=(sent_max_count, word_max_count, embbeding_size, 1), name='main_input') 89 | 90 | # unigrams 91 | model_1 = Sequential() 92 | model_1.add(Conv3D(200, (1, 1, embbeding_size), activation='relu', 93 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 94 | model_1.add(MaxPooling3D((1, word_max_count, 1))) 95 | 96 | model_output_1 = model_1(model_input) 97 | 98 | # bigrams 99 | model_2 = Sequential() 100 | model_2.add(Conv3D(200, (1, 2, embbeding_size), activation='relu', 101 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 102 | model_2.add(MaxPooling3D((1, word_max_count - 1, 1))) 103 | 104 | model_output_2 = model_2(model_input) 105 | 106 | # trigrams 107 | model_3 = Sequential() 108 | model_3.add(Conv3D(200, (1, 3, embbeding_size), activation='relu', 109 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 110 | model_3.add(MaxPooling3D((1, word_max_count - 2, 1))) 111 | 112 | model_output_3 = model_3(model_input) 113 | 114 | 115 | model = concatenate([model_output_1, model_output_2, model_output_3], axis=-1) 116 | 117 | after_MaxPooling = MaxPooling3D((sent_max_count, 1, 1))(model) 118 | 119 | mairesse_input = Input(shape=(84,), name='mairesse') 120 | model = 
Reshape((600,))(after_MaxPooling) 121 | concatenated_with_mairsse = concatenate([model, mairesse_input], axis=-1) 122 | 123 | model = Dense(200, activation='sigmoid')(concatenated_with_mairsse) 124 | model = Dropout(0.5)(model) 125 | output = Dense(2, activation='softmax')(model) 126 | 127 | final_model = Model(inputs=[model_input, mairesse_input], outputs=output) 128 | final_model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy']) 129 | 130 | validation_size = int(np.round(0.1 * len(datasets[0]))) 131 | 132 | X_train = datasets[0][validation_size:] 133 | y_train = datasets[1][validation_size:] 134 | X_validation = datasets[0][:validation_size] 135 | y_validation = datasets[1][:validation_size] 136 | X_test = datasets[2] 137 | y_test = datasets[3] 138 | 139 | mairesse_train = datasets[4][validation_size:] 140 | mairesse_test = datasets[5] 141 | mairesse_validation = datasets[4][:validation_size] 142 | 143 | train_data_G = Generator(X_train, mairesse_train, y_train, batch_size, W, sent_max_count, word_max_count, 144 | embbeding_size) 145 | val_data_G = Generator(X_validation, mairesse_validation, y_validation, batch_size, W, sent_max_count, 146 | word_max_count, 147 | embbeding_size) 148 | test_data_G = Generator(X_test, mairesse_test, y_test, batch_size, W, sent_max_count, word_max_count, 149 | embbeding_size) 150 | 151 | # model_dir = 'models/results/' + model_name + '/' + str(iteration) 152 | # checkpoint_path = model_dir + "/model-" + model_name + '-' + str(iteration) + "-{acc:02f}.hdf5" 153 | # # Keep only a single checkpoint, the best over test accuracy. 154 | # checkpoint = ModelCheckpoint(str(checkpoint_path), 155 | # monitor='acc', 156 | # verbose=1) 157 | # saved_checkpoints = get_checkpoints(model_dir) 158 | # if len(saved_checkpoints) > 0: 159 | # last_checkpoint = saved_checkpoints[0] 160 | # logging.info("Resume training from " + last_checkpoint) 161 | # final_model.load_weights(model_dir + '/' + last_checkpoint) 162 | # else: 163 | # logging.info("Traning from scratch!") 164 | # logging.info(len(X_train) / batch_size) 165 | 166 | history = History() 167 | 168 | final_model.fit_generator(train_data_G, validation_data=val_data_G, steps_per_epoch=len(X_train) / batch_size, 169 | validation_steps=len(X_validation) / batch_size, epochs=n_epochs, 170 | callbacks=[my_logger, history]) 171 | 172 | # final_model.fit_generator(train_data_G, validation_data=val_data_G, steps_per_epoch=len(X_train) / batch_size, 173 | # validation_steps=len(X_validation) / batch_size, epochs=n_epochs, 174 | # callbacks=[my_logger, history, checkpoint]) 175 | # logging.info("loading best model weights") 176 | # saved_checkpoints = get_checkpoints(model_dir) 177 | # last_checkpoint = saved_checkpoints[0] 178 | # logging.info("Resume weights from " + last_checkpoint) 179 | # final_model.load_weights(model_dir + '/' + last_checkpoint) 180 | logging.info("evaluating model...") 181 | loss, acc = final_model.evaluate_generator(test_data_G, steps=len(datasets[0]) / batch_size) 182 | hist = str(history.history) 183 | pickle.dump(hist, historyfile) 184 | 185 | logging.info('score = ' + str(loss) + "," + str(acc)) 186 | return loss, acc 187 | 188 | def make_input_batch(X_train, W, sent_max_count, word_max_count, embbeding_size): 189 | size = (len(X_train), sent_max_count, word_max_count, embbeding_size) 190 | input_train = np.zeros(size) 191 | for rev_dx, review in enumerate(X_train): 192 | for sent_idx, sentence in enumerate(review): 193 | sentence = np.array(sentence) 194 | 
indexes = np.where(sentence != 0)[0] 195 | for idx in indexes: 196 | input_train[rev_dx][sent_idx][idx] = W[sentence[idx]] 197 | input_train = input_train.reshape([len(X_train), sent_max_count, word_max_count, embbeding_size, 1]) 198 | return input_train 199 | 200 | 201 | def make_idx_data_cv(revs, word_idx_map, mairesse, charged_words, cv, per_attr=0, max_l=51, max_s=200, k=300, 202 | filter_h=5): 203 | """ 204 | Transforms sentences into a 2-d matrix. 205 | """ 206 | trainX, testX, trainY, testY, mTrain, mTest = [], [], [], [], [], [] 207 | for idx, rev in enumerate(revs): 208 | sent = get_idx_from_sent(rev["text"], word_idx_map, 209 | charged_words, 210 | max_l, max_s, k, filter_h) 211 | 212 | if rev["split"] == cv: 213 | testX.append(sent) 214 | testY.append(rev['y' + str(per_attr)]) 215 | mTest.append(mairesse[rev["user"]]) 216 | else: 217 | trainX.append(sent) 218 | trainY.append(rev['y' + str(per_attr)]) 219 | mTrain.append(mairesse[rev["user"]]) 220 | trainX = np.array(trainX) 221 | testX = np.array(testX) 222 | trainY = np.array(trainY) 223 | testY = np.array(testY) 224 | mTrain = np.array(mTrain) 225 | mTest = np.array(mTest) 226 | return [trainX, trainY, testX, testY, mTrain, mTest] 227 | 228 | def get_idx_from_sent(status, word_idx_map, charged_words, max_l=51, max_s=200, k=300, filter_h=5): 229 | """ 230 | Transforms sentence into a list of indices. Pad with zeroes. 231 | """ 232 | x = [] 233 | pad = filter_h - 1 234 | length = len(status) 235 | 236 | pass_one = True 237 | while len(x) == 0: 238 | charged_counter = 0 239 | not_charged_counter = 0 240 | for i in range(length): 241 | words = status[i].split() 242 | if pass_one: 243 | words_set = set(words) 244 | if len(charged_words.intersection(words_set)) == 0: 245 | not_charged_counter += 1 246 | continue 247 | else: 248 | if np.random.randint(0, 2) == 0: 249 | continue 250 | charged_counter += 1 251 | y = [] 252 | for i in range(pad): 253 | y.append(0) 254 | for word in words: 255 | if word in word_idx_map: 256 | y.append(word_idx_map[word]) 257 | 258 | while len(y) < max_l + 2 * pad: 259 | y.append(0) 260 | x.append(y) 261 | pass_one = False 262 | 263 | if len(x) < max_s: 264 | x.extend([[0] * (max_l + 2 * pad)] * (max_s - len(x))) 265 | 266 | return x 267 | 268 | if __name__ == "__main__": 269 | logging.info("loading data...: floatx:") 270 | my_logger = MyLogger(n=1) 271 | x = joblib.load("essays_mairesse.p") 272 | 273 | revs, W, W2, word_idx_map, vocab, mairesse = x[0], x[1], x[2], x[3], x[4], x[5] 274 | logging.info("data loaded!") 275 | try: 276 | attr = int(sys.argv[1]) 277 | except IndexError: 278 | attr = 4 279 | 280 | r = range(0, 10) 281 | 282 | ofile = open('perf_output_' + model_name + "_" + str(attr) + '_w2v.txt', 'w') 283 | 284 | charged_words = [] 285 | 286 | emof = open("Emotion_Lexicon.csv", "rt") 287 | history_file_name = 'history_' + model_name + '_attr_' + str(attr) + '_w2v.txt' 288 | historyfile = open(history_file_name, 'wb') 289 | csvf = csv.reader(emof, delimiter=',', quotechar='"') 290 | first_line = True 291 | 292 | for line in csvf: 293 | if first_line: 294 | first_line = False 295 | continue 296 | if line[11] == "1": 297 | charged_words.append(line[0]) 298 | 299 | emof.close() 300 | 301 | charged_words = set(charged_words) 302 | 303 | results = [] 304 | for i in r: 305 | logging.info("iteration = %4d from %4d " % (i, len(r))) 306 | datasets = make_idx_data_cv(revs, word_idx_map, mairesse, charged_words, i, attr, max_l=149, 307 | max_s=312, k=300, 308 | filter_h=3) 309 | 310 | results = 
train_conv_net(datasets, W, historyfile, i) 311 | ofile.write(str(results) + "\n") 312 | ofile.flush() 313 | 314 | ofile.write(str(results)) 315 | historyfile.close() 316 | 317 | -------------------------------------------------------------------------------- /conv_net_classes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sample code for 3 | Convolutional Neural Networks for Sentence Classification 4 | http://arxiv.org/pdf/1408.5882v2.pdf 5 | 6 | Much of the code is modified from 7 | - deeplearning.net (for ConvNet classes) 8 | - https://github.com/mdenil/dropout (for dropout) 9 | - https://groups.google.com/forum/#!topic/pylearn-dev/3QbKtCumAW4 (for Adadelta) 10 | """ 11 | 12 | import numpy 13 | import theano.tensor.shared_randomstreams 14 | import theano 15 | import theano.tensor as T 16 | # from theano.tensor.signal import downsample 17 | from theano.tensor.signal import pool 18 | from theano.tensor.nnet import conv 19 | 20 | def ReLU(x): 21 | y = T.maximum(0.0, x) 22 | return(y) 23 | def Sigmoid(x): 24 | y = T.nnet.sigmoid(x) 25 | return(y) 26 | def Tanh(x): 27 | y = T.tanh(x) 28 | return(y) 29 | def Iden(x): 30 | y = x 31 | return(y) 32 | 33 | class HiddenLayer(object): 34 | """ 35 | Class for HiddenLayer 36 | """ 37 | def __init__(self, rng, input, n_in, n_out, activation, W=None, b=None, 38 | use_bias=False): 39 | 40 | self.input = input 41 | self.activation = activation 42 | 43 | if W is None: 44 | if activation.func_name == "ReLU": 45 | W_values = numpy.asarray(0.01 * rng.standard_normal(size=(n_in, n_out)), dtype=theano.config.floatX) 46 | else: 47 | W_values = numpy.asarray(rng.uniform(low=-numpy.sqrt(6. / (n_in + n_out)), high=numpy.sqrt(6. / (n_in + n_out)), 48 | size=(n_in, n_out)), dtype=theano.config.floatX) 49 | W = theano.shared(value=W_values, name='W') 50 | if b is None: 51 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 52 | b = theano.shared(value=b_values, name='b') 53 | 54 | self.W = W 55 | self.b = b 56 | 57 | if use_bias: 58 | lin_output = T.dot(input, self.W) + self.b 59 | else: 60 | lin_output = T.dot(input, self.W) 61 | 62 | self.output = (lin_output if activation is None else activation(lin_output)) 63 | 64 | # parameters of the model 65 | if use_bias: 66 | self.params = [self.W, self.b] 67 | else: 68 | self.params = [self.W] 69 | 70 | def _dropout_from_layer(rng, layer, p): 71 | """p is the probablity of dropping a unit 72 | """ 73 | srng = theano.tensor.shared_randomstreams.RandomStreams(rng.randint(999999)) 74 | # p=1-p because 1's indicate keep and p is prob of dropping 75 | mask = srng.binomial(n=1, p=1-p, size=layer.shape) 76 | # The cast is important because 77 | # int * float32 = float64 which pulls things off the gpu 78 | output = layer * T.cast(mask, theano.config.floatX) 79 | return output 80 | 81 | class DropoutHiddenLayer(HiddenLayer): 82 | def __init__(self, rng, input, n_in, n_out, 83 | activation, dropout_rate, use_bias, W=None, b=None): 84 | super(DropoutHiddenLayer, self).__init__( 85 | rng=rng, input=input, n_in=n_in, n_out=n_out, W=W, b=b, 86 | activation=activation, use_bias=use_bias) 87 | 88 | self.output = _dropout_from_layer(rng, self.output, p=dropout_rate) 89 | 90 | class MLPDropout(object): 91 | """A multilayer perceptron with dropout""" 92 | def __init__(self,rng,input,layer_sizes,dropout_rates,activations,use_bias=True): 93 | 94 | #rectified_linear_activation = lambda x: T.maximum(0.0, x) 95 | 96 | # Set up all the hidden layers 97 | self.weight_matrix_sizes = 
zip(layer_sizes, layer_sizes[1:]) 98 | self.layers = [] 99 | self.dropout_layers = [] 100 | self.activations = activations 101 | next_layer_input = input 102 | #first_layer = True 103 | # dropout the input 104 | next_dropout_layer_input = _dropout_from_layer(rng, input, p=dropout_rates[0]) 105 | layer_counter = 0 106 | for n_in, n_out in self.weight_matrix_sizes[:-1]: 107 | next_dropout_layer = DropoutHiddenLayer(rng=rng, 108 | input=next_dropout_layer_input, 109 | activation=activations[layer_counter], 110 | n_in=n_in, n_out=n_out, use_bias=use_bias, 111 | dropout_rate=dropout_rates[layer_counter]) 112 | self.dropout_layers.append(next_dropout_layer) 113 | next_dropout_layer_input = next_dropout_layer.output 114 | 115 | # Reuse the parameters from the dropout layer here, in a different 116 | # path through the graph. 117 | next_layer = HiddenLayer(rng=rng, 118 | input=next_layer_input, 119 | activation=activations[layer_counter], 120 | # scale the weight matrix W with (1-p) 121 | W=next_dropout_layer.W * (1 - dropout_rates[layer_counter]), 122 | b=next_dropout_layer.b, 123 | n_in=n_in, n_out=n_out, 124 | use_bias=use_bias) 125 | self.layers.append(next_layer) 126 | next_layer_input = next_layer.output 127 | #first_layer = False 128 | layer_counter += 1 129 | 130 | # Set up the output layer 131 | n_in, n_out = self.weight_matrix_sizes[-1] 132 | dropout_output_layer = LogisticRegression( 133 | input=next_dropout_layer_input, 134 | n_in=n_in, n_out=n_out) 135 | self.dropout_layers.append(dropout_output_layer) 136 | 137 | # Again, reuse paramters in the dropout output. 138 | output_layer = LogisticRegression( 139 | input=next_layer_input, 140 | # scale the weight matrix W with (1-p) 141 | W=dropout_output_layer.W * (1 - dropout_rates[-1]), 142 | b=dropout_output_layer.b, 143 | n_in=n_in, n_out=n_out) 144 | self.layers.append(output_layer) 145 | 146 | # Use the negative log likelihood of the logistic regression layer as 147 | # the objective. 148 | self.dropout_negative_log_likelihood = self.dropout_layers[-1].negative_log_likelihood 149 | self.dropout_errors = self.dropout_layers[-1].errors 150 | 151 | self.negative_log_likelihood = self.layers[-1].negative_log_likelihood 152 | self.errors = self.layers[-1].errors 153 | 154 | # Grab all the parameters together. 155 | self.params = [ param for layer in self.dropout_layers for param in layer.params ] 156 | 157 | def predict(self, new_data): 158 | next_layer_input = new_data 159 | for i,layer in enumerate(self.layers): 160 | if i
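
For reference, the `essays_mairesse.p` file written by the preprocessing step bundles six objects in a fixed order. Below is a minimal sketch of loading and inspecting it, assuming it was produced by `process_data_python3.py` (the Python 2 script writes the same list with `cPickle` instead of `joblib`):

```python
import joblib

# essays_mairesse.p is written by process_data_python3.py via joblib.dump(..., protocol=2)
revs, W, W2, word_idx_map, vocab, mairesse = joblib.load("essays_mairesse.p")

print(len(revs))          # number of essays; each entry is a dict with y0..y4, text, user, num_words, split
print(W.shape, W2.shape)  # (vocab_size + 1, 300): word2vec-based and randomly initialized embedding matrices
print(len(word_idx_map))  # word -> row index into W / W2
print(len(mairesse))      # user id -> list of 84 Mairesse features (as consumed by the Keras model)
```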