├── .gitignore ├── history_main_model_attr_0_w2v.txt ├── perf_output_main_model_0_w2v.txt ├── essays.csv ├── run.sh ├── LICENSE ├── README.md ├── process_data.py ├── process_data_python3.py ├── conv_net_train_keras.py ├── conv_net_classes.py ├── conv_net_classes_gpu.py ├── conv_net_train.py └── conv_net_train_gpu.py /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | *.org 3 | -------------------------------------------------------------------------------- /history_main_model_attr_0_w2v.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /perf_output_main_model_0_w2v.txt: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /essays.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/amirmohammadkz/personality-detection/HEAD/essays.csv -------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | cd /home/ubuntu/personality-detection/ 2 | 3 | export CUDA_HOME=/usr/local/cuda 4 | export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 5 | PATH=${CUDA_HOME}/bin:${PATH} 6 | 7 | export PATH 8 | 9 | mkdir t0w2v 10 | python conv_net_train_gpu.py -static -word2vec 0 11 | mv cvt* t0w2v 12 | mv perf_output_* t0w2v 13 | 14 | mkdir t1w2v 15 | python conv_net_train_gpu.py -static -word2vec 1 16 | mv cvt* t1w2v 17 | mv perf_output_* t1w2v 18 | 19 | mkdir t2w2v 20 | python conv_net_train_gpu.py -static -word2vec 2 21 | mv cvt* t2w2v 22 | mv perf_output_* t2w2v 23 | 24 | mkdir t3w2v 25 | python conv_net_train_gpu.py -static -word2vec 3 26 | mv cvt* t3w2v 27 | mv perf_output_* t3w2v 28 | 29 | 30 | mkdir t4w2v 31 | python conv_net_train_gpu.py -static -word2vec 4 32 | mv cvt* t4w2v 33 | mv perf_output_* t4w2v 34 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 MLURG 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Deep Learning-Based Document Modeling for Personality Detection from Text 2 | 3 | This code implements the model described in [Deep Learning-Based Document Modeling for Personality Detection from Text](http://sentic.net/deep-learning-based-personality-detection.pdf) for detecting the Big Five personality traits, namely: 4 | 5 | - Extroversion 6 | - Neuroticism 7 | - Agreeableness 8 | - Conscientiousness 9 | - Openness 10 | 11 |
12 | ## Requirements 13 | 14 | - Ubuntu 16.04 64-bit (Tested) 15 | - Python 2.7 16 | - Theano 1.0.4 (Tested) 17 | - Pandas 0.24.2 (Tested) 18 | - Pre-trained [GoogleNews word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) vectors (if you are downloading over SSH, try [this direct link](https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz)) 19 | 20 |
21 | ## Preprocessing 22 | 23 | `process_data.py` prepares the data for training. It requires three command-line arguments: 24 | 25 | 1. Path to the [Google word2vec](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) file (`GoogleNews-vectors-negative300.bin`) 26 | 2. Path to the `essays.csv` file containing the annotated dataset 27 | 3. Path to the `mairesse.csv` file containing the [Mairesse features](http://farm2.user.srcf.net/research/personality/recognizer.html) for each sample/essay 28 | 29 | The script generates a pickle file, `essays_mairesse.p`. 30 | 31 | Example: 32 | 33 | ```sh 34 | python process_data.py ./GoogleNews-vectors-negative300.bin ./essays.csv ./mairesse.csv 35 | ``` 36 |
37 | ## Configuration for training the model 38 | 39 | A. **Running using CPU** 40 | 1. Configure `~/.theanorc`: 41 | ```sh 42 | [global] 43 | floatX=float64 44 | OMP_NUM_THREADS=20 45 | openmp=True 46 | ``` 47 | 48 | B. **Running using GPU** 49 | 1. Install [libgpuarray](http://deeplearning.net/software/libgpuarray/installation.html) 50 | 2. Install [cuDNN](http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html) for faster training 51 | 3. Add the CUDA path to `.bashrc`: 52 | ```sh 53 | export CUDA_HOME=/usr/local/cuda 54 | export LD_LIBRARY_PATH=${CUDA_HOME}/lib64 55 | PATH=${CUDA_HOME}/bin:${PATH} 56 | 57 | export PATH 58 | ``` 59 | 4. Configure `~/.theanorc`: 60 | ```sh 61 | [cuda] 62 | root=/usr/local/cuda 63 | [global] 64 | device=cuda 65 | floatX = float32 66 | OMP_NUM_THREADS=20 67 | openmp=True 68 | 69 | [nvcc] 70 | fastmath=True 71 | ``` 72 |
73 | ## Training 74 | 75 | Note: before the changes in this repository, every epoch took about 5 hours to complete; with them, an epoch takes less than an hour on CPU and about 45 seconds on GPU (the exact speed-up depends on your system specs). 76 | 77 | A. **Running on GPU** 78 | 79 | `conv_net_train_gpu.py` trains and tests the model on the GPU. (Alternatively, you can run `run.sh` to train all five traits with word2vec embeddings in one go.) 80 | 81 | B. **Running on CPU** 82 | 83 | `conv_net_train.py` trains and tests the model on the CPU. 84 | 85 | Both scripts require three command-line arguments: 86 | 87 | 1. **Mode:** 88 | - `-static`: word embeddings will remain fixed 89 | - `-nonstatic`: word embeddings will be trained 90 | 2. **Word Embedding Type:** 91 | - `-rand`: randomly initialized word embeddings (the dimension is hardcoded to 300 by default; it can be changed via the default value of `k` in `process_data.py`) 92 | - `-word2vec`: 300-dimensional Google pre-trained word2vec embeddings 93 | 3.
**Personality Trait:** 94 | - `0`: Extroversion 95 | - `1`: Neuroticism 96 | - `2`: Agreeableness 97 | - `3`: Conscientiousness 98 | - `4`: Openness 99 | 100 | Example: 101 | 102 | ```sh 103 | python conv_net_train.py -static -word2vec 2 104 | ``` 105 | 106 | 107 | ## Citation 108 | 109 | If you use this code in your work then please cite the paper - [Deep Learning-Based Document Modeling for Personality Detection from Text](http://sentic.net/deep-learning-based-personality-detection.pdf) with the following: 110 | 111 | ``` 112 | @ARTICLE{7887639, 113 | author={N. Majumder and S. Poria and A. Gelbukh and E. Cambria}, 114 | journal={IEEE Intelligent Systems}, 115 | title={{Deep} Learning-Based Document Modeling for Personality Detection from Text}, 116 | year={2017}, 117 | volume={32}, 118 | number={2}, 119 | pages={74-79}, 120 | keywords={feedforward neural nets;information filtering;learning (artificial intelligence);pattern classification;text analysis;Big Five traits;author personality type;author psychological profile;binary classifier training;deep convolutional neural network;deep learning based method;deep learning-based document modeling;document vector;document-level Mairesse features;emotionally neutral input sentence filtering;identical architecture;personality detection;text;Artificial intelligence;Computational modeling;Emotion recognition;Feature extraction;Neural networks;Pragmatics;Semantics;artificial intelligence;convolutional neural network;distributional semantics;intelligent systems;natural language processing;neural-based document modeling;personality}, 121 | doi={10.1109/MIS.2017.23}, 122 | ISSN={1541-1672}, 123 | month={Mar},} 124 | ``` 125 | -------------------------------------------------------------------------------- /process_data.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import theano 3 | import cPickle 4 | from collections import defaultdict 5 | import sys, re 6 | import pandas as pd 7 | import csv 8 | from gensim.models import KeyedVectors 9 | 10 | 11 | def build_data_cv(datafile, cv=10, clean_string=True): 12 | """ 13 | Loads data and split into 10 folds. 
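    Returns a tuple (revs, vocab): revs is a list of per-essay dicts with keys
    "y0".."y4" (binary trait labels), "text" (the cleaned sentence chunks),
    "user" (author id), "num_words" (length of the longest chunk) and "split"
    (the assigned cross-validation fold); vocab maps each word to a rough
    frequency count used later when building the embedding matrices.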
14 | """ 15 | revs = [] 16 | vocab = defaultdict(float) 17 | 18 | with open(datafile, "rb") as csvf: 19 | csvreader=csv.reader(csvf,delimiter=',',quotechar='"') 20 | first_line=True 21 | for line in csvreader: 22 | if first_line: 23 | first_line=False 24 | continue 25 | status=[] 26 | sentences=re.split(r'[.?]', line[1].strip()) 27 | try: 28 | sentences.remove('') 29 | except ValueError: 30 | None 31 | 32 | for sent in sentences: 33 | if clean_string: 34 | orig_rev = clean_str(sent.strip()) 35 | if orig_rev=='': 36 | continue 37 | words = set(orig_rev.split()) 38 | splitted = orig_rev.split() 39 | if len(splitted)>150: 40 | orig_rev=[] 41 | splits=int(np.floor(len(splitted)/20)) 42 | for index in range(splits): 43 | orig_rev.append(' '.join(splitted[index*20:(index+1)*20])) 44 | if len(splitted)>splits*20: 45 | orig_rev.append(' '.join(splitted[splits*20:])) 46 | status.extend(orig_rev) 47 | else: 48 | status.append(orig_rev) 49 | else: 50 | orig_rev = sent.strip().lower() 51 | words = set(orig_rev.split()) 52 | status.append(orig_rev) 53 | 54 | for word in words: 55 | vocab[word] += 1 56 | 57 | 58 | datum = {"y0":1 if line[2].lower()=='y' else 0, 59 | "y1":1 if line[3].lower()=='y' else 0, 60 | "y2":1 if line[4].lower()=='y' else 0, 61 | "y3":1 if line[5].lower()=='y' else 0, 62 | "y4":1 if line[6].lower()=='y' else 0, 63 | "text": status, 64 | "user": line[0], 65 | "num_words": np.max([len(sent.split()) for sent in status]), 66 | "split": np.random.randint(0,cv)} 67 | revs.append(datum) 68 | 69 | 70 | return revs, vocab 71 | 72 | def get_W(word_vecs, k=300): 73 | """ 74 | Get word matrix. W[i] is the vector for word indexed by i 75 | """ 76 | vocab_size = len(word_vecs) 77 | word_idx_map = dict() 78 | W = np.zeros(shape=(vocab_size+1, k), dtype=theano.config.floatX) 79 | W[0] = np.zeros(k, dtype=theano.config.floatX) 80 | i = 1 81 | for word in word_vecs: 82 | W[i] = word_vecs[word] 83 | word_idx_map[word] = i 84 | i += 1 85 | return W, word_idx_map 86 | 87 | def load_bin_vec(fname, vocab): 88 | """ 89 | Loads 300x1 word vecs from Google (Mikolov) word2vec 90 | """ 91 | word_vecs = {} 92 | model = KeyedVectors.load_word2vec_format(fname, binary=True) 93 | for word in vocab: 94 | try: 95 | word_vecs[word] = model.get_vector(word) 96 | except KeyError: 97 | # Word not in the vocabulary 98 | pass 99 | return word_vecs 100 | def add_unknown_words(word_vecs, vocab, min_df=1, k=300): 101 | """ 102 | For words that occur in at least min_df documents, create a separate word vector. 103 | 0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones 104 | """ 105 | for word in vocab: 106 | if word not in word_vecs and vocab[word] >= min_df: 107 | word_vecs[word] = np.random.uniform(-0.25,0.25,k) 108 | print word 109 | 110 | def clean_str(string, TREC=False): 111 | """ 112 | Tokenization/string cleaning for all datasets except for SST. 113 | Every dataset is lower cased except for TREC 114 | """ 115 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 116 | string = re.sub(r"\'s", " \'s ", string) 117 | string = re.sub(r"\'ve", " have ", string) 118 | string = re.sub(r"n\'t", " not ", string) 119 | string = re.sub(r"\'re", " are ", string) 120 | string = re.sub(r"\'d" , " would ", string) 121 | string = re.sub(r"\'ll", " will ", string) 122 | string = re.sub(r",", " , ", string) 123 | string = re.sub(r"!", " ! ", string) 124 | string = re.sub(r"\(", " ( ", string) 125 | string = re.sub(r"\)", " ) ", string) 126 | string = re.sub(r"\?", " \? 
", string) 127 | # string = re.sub(r"[a-zA-Z]{4,}", "", string) 128 | string = re.sub(r"\s{2,}", " ", string) 129 | return string.strip() if TREC else string.strip().lower() 130 | 131 | def clean_str_sst(string): 132 | """ 133 | Tokenization/string cleaning for the SST dataset 134 | """ 135 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 136 | string = re.sub(r"\s{2,}", " ", string) 137 | return string.strip().lower() 138 | 139 | def get_mairesse_features(file_name): 140 | feats={} 141 | with open(file_name, "rb") as csvf: 142 | csvreader=csv.reader(csvf,delimiter=',',quotechar='"') 143 | for line in csvreader: 144 | feats[line[0]]=[float(f) for f in line[1:]] 145 | return feats 146 | 147 | if __name__=="__main__": 148 | w2v_file = sys.argv[1] 149 | data_folder = sys.argv[2] 150 | mairesse_file = sys.argv[3] 151 | print "loading data...", 152 | revs, vocab = build_data_cv(data_folder, cv=10, clean_string=True) 153 | num_words=pd.DataFrame(revs)["num_words"] 154 | max_l = np.max(num_words) 155 | print "data loaded!" 156 | print "number of status: " + str(len(revs)) 157 | print "vocab size: " + str(len(vocab)) 158 | print "max sentence length: " + str(max_l) 159 | print "loading word2vec vectors...", 160 | w2v = load_bin_vec(w2v_file, vocab) 161 | print "word2vec loaded!" 162 | print "num words already in word2vec: " + str(len(w2v)) 163 | add_unknown_words(w2v, vocab) 164 | W, word_idx_map = get_W(w2v) 165 | rand_vecs = {} 166 | add_unknown_words(rand_vecs, vocab) 167 | W2, _ = get_W(rand_vecs) 168 | mairesse = get_mairesse_features(mairesse_file) 169 | cPickle.dump([revs, W, W2, word_idx_map, vocab, mairesse], open("essays_mairesse.p", "wb")) 170 | print "dataset created!" 171 | 172 | -------------------------------------------------------------------------------- /process_data_python3.py: -------------------------------------------------------------------------------- 1 | import joblib 2 | import numpy as np 3 | # import cPickle 4 | from collections import defaultdict 5 | import sys, re 6 | import pandas as pd 7 | import csv 8 | import getpass 9 | from gensim.models import KeyedVectors 10 | 11 | 12 | def build_data_cv(datafile, cv=10, clean_string=True): 13 | """ 14 | Loads data and split into 10 folds. 
15 | """ 16 | revs = [] 17 | vocab = defaultdict(float) 18 | with open(datafile, "r", errors='ignore') as csvf: 19 | csvreader = csv.reader(csvf, delimiter=',', quotechar='"') 20 | first_line = True 21 | for line in csvreader: 22 | if first_line: 23 | first_line = False 24 | continue 25 | status = [] 26 | sentences = re.split(r'[.?]', line[1].strip()) 27 | try: 28 | sentences.remove('') 29 | except ValueError: 30 | None 31 | 32 | for sent in sentences: 33 | if clean_string: 34 | orig_rev = clean_str(sent.strip()) 35 | if orig_rev == '': 36 | continue 37 | words = set(orig_rev.split()) 38 | splitted = orig_rev.split() 39 | if len(splitted) > 150: 40 | orig_rev = [] 41 | splits = int(np.floor(len(splitted) / 20)) 42 | for index in range(splits): 43 | orig_rev.append(' '.join(splitted[index * 20:(index + 1) * 20])) 44 | if len(splitted) > splits * 20: 45 | orig_rev.append(' '.join(splitted[splits * 20:])) 46 | status.extend(orig_rev) 47 | else: 48 | status.append(orig_rev) 49 | else: 50 | orig_rev = sent.strip().lower() 51 | words = set(orig_rev.split()) 52 | status.append(orig_rev) 53 | 54 | for word in words: 55 | vocab[word] += 1 56 | 57 | datum = {"y0": 1 if line[2].lower() == 'y' else 0, 58 | "y1": 1 if line[3].lower() == 'y' else 0, 59 | "y2": 1 if line[4].lower() == 'y' else 0, 60 | "y3": 1 if line[5].lower() == 'y' else 0, 61 | "y4": 1 if line[6].lower() == 'y' else 0, 62 | "text": status, 63 | "user": line[0], 64 | "num_words": np.max([len(sent.split()) for sent in status]), 65 | "split": np.random.randint(0, cv)} 66 | revs.append(datum) 67 | 68 | return revs, vocab 69 | 70 | 71 | def get_W(word_vecs, k=300): 72 | """ 73 | Get word matrix. W[i] is the vector for word indexed by i 74 | """ 75 | vocab_size = len(word_vecs) 76 | word_idx_map = dict() 77 | W = np.zeros(shape=(vocab_size + 1, k), dtype="float64") 78 | W[0] = np.zeros(k, dtype="float64") 79 | i = 1 80 | for word in word_vecs: 81 | W[i] = word_vecs[word] 82 | word_idx_map[word] = i 83 | i += 1 84 | return W, word_idx_map 85 | 86 | 87 | def load_bin_vec(fname, vocab): 88 | """ 89 | Loads 300x1 word vecs from Google (Mikolov) word2vec 90 | """ 91 | word_vecs = {} 92 | model = KeyedVectors.load_word2vec_format(fname, binary=True) 93 | for word in vocab: 94 | try: 95 | word_vecs[word] = model.get_vector(word) 96 | except KeyError: 97 | # Word not in the vocabulary 98 | pass 99 | return word_vecs 100 | 101 | 102 | def add_unknown_words(word_vecs, vocab, min_df=1, k=300): 103 | """ 104 | For words that occur in at least min_df documents, create a separate word vector. 105 | 0.25 is chosen so the unknown vectors have (approximately) same variance as pre-trained ones 106 | """ 107 | i = 0.0 108 | for word in vocab: 109 | if word not in word_vecs and vocab[word] >= min_df: 110 | i += 1 111 | word_vecs[word] = np.random.uniform(-0.25, 0.25, k) 112 | print(word) 113 | print("##########################") 114 | print(i * 100 / len(vocab)) 115 | print("##########################") 116 | 117 | 118 | def clean_str(string, TREC=False): 119 | """ 120 | Tokenization/string cleaning for all datasets except for SST. 
121 | Every dataset is lower cased except for TREC 122 | """ 123 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 124 | string = re.sub(r"\'s", " \'s ", string) 125 | string = re.sub(r"\'ve", " have ", string) 126 | string = re.sub(r"n\'t", " not ", string) 127 | string = re.sub(r"\'re", " are ", string) 128 | string = re.sub(r"\'d", " would ", string) 129 | string = re.sub(r"\'ll", " will ", string) 130 | string = re.sub(r",", " , ", string) 131 | string = re.sub(r"!", " ! ", string) 132 | string = re.sub(r"\(", " ( ", string) 133 | string = re.sub(r"\)", " ) ", string) 134 | string = re.sub(r"\?", " \? ", string) 135 | # string = re.sub(r"[a-zA-Z]{4,}", "", string) 136 | string = re.sub(r"\s{2,}", " ", string) 137 | return string.strip() if TREC else string.strip().lower() 138 | 139 | 140 | def clean_str_sst(string): 141 | """ 142 | Tokenization/string cleaning for the SST dataset 143 | """ 144 | string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string) 145 | string = re.sub(r"\s{2,}", " ", string) 146 | return string.strip().lower() 147 | 148 | 149 | def get_mairesse_features(file_name): 150 | feats = {} 151 | with open(file_name, "r") as csvf: 152 | csvreader = csv.reader(csvf, delimiter=',', quotechar='"') 153 | for line in csvreader: 154 | feats[line[0]] = [float(f) for f in line[1:]] 155 | return feats 156 | 157 | 158 | if __name__ == "__main__": 159 | w2v_file = sys.argv[1] 160 | data_folder = sys.argv[2] 161 | mairesse_file = sys.argv[3] 162 | print("loading data...") 163 | revs, vocab = build_data_cv(data_folder, cv=10, clean_string=True) 164 | num_words = pd.DataFrame(revs)["num_words"] 165 | max_l = np.max(num_words) 166 | print("data loaded!") 167 | print("number of status: " + str(len(revs))) 168 | print("vocab size: " + str(len(vocab))) 169 | print("max sentence length: " + str(max_l)) 170 | print("loading word2vec vectors...", ) 171 | w2v = load_bin_vec(w2v_file, vocab) 172 | print("word2vec loaded!") 173 | print("num words already in word2vec: " + str(len(w2v))) 174 | add_unknown_words(w2v, vocab) 175 | W, word_idx_map = get_W(w2v) 176 | rand_vecs = {} 177 | add_unknown_words(rand_vecs, vocab) 178 | W2, _ = get_W(rand_vecs) 179 | mairesse = get_mairesse_features(mairesse_file) 180 | filename = 'essays_mairesse.p' 181 | joblib.dump([revs, W, W2, word_idx_map, vocab, mairesse], filename, protocol=2) 182 | print("dataset created!") 183 | 184 | -------------------------------------------------------------------------------- /conv_net_train_keras.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import csv 3 | import joblib 4 | import pickle 5 | import sys 6 | import keras 7 | from keras.layers import Input, concatenate, Dropout, Masking, Bidirectional, TimeDistributed 8 | from keras.layers import Conv3D, MaxPooling3D, Dense, Activation, Reshape, GRU, SimpleRNN, LSTM 9 | from keras.models import Model, Sequential 10 | from keras.activations import softmax 11 | from keras.utils import to_categorical, Sequence 12 | from keras.callbacks import CSVLogger 13 | from keras.callbacks import History, BaseLogger, ModelCheckpoint 14 | import pickle 15 | from pathlib import Path 16 | from keras.callbacks import ModelCheckpoint 17 | import os 18 | 19 | import logging 20 | 21 | model_name = "main_model" 22 | logging.basicConfig(filename='logger_' + model_name + '.log', level=logging.DEBUG, format='%(asctime)s %(message)s') 23 | 24 | 25 | class TestCallback(keras.callbacks.Callback): 26 | def __init__(self, test_data): 27 | 
self.test_data = test_data 28 | 29 | def on_epoch_end(self, epoch, logs={}): 30 | if epoch % 5 == 0: 31 | test_data, step_size = self.test_data 32 | loss, acc = self.model.evaluate_generator(test_data, steps=step_size) 33 | logging.info('\nTesting loss: {}, acc: {}\n'.format(loss, acc)) 34 | 35 | 36 | class MyLogger(keras.callbacks.Callback): 37 | def __init__(self, n): 38 | self.n = n # logging.info loss & acc every n epochs 39 | 40 | def on_epoch_end(self, epoch, logs={}): 41 | if epoch % self.n == 0: 42 | curr_loss = logs.get('loss') 43 | curr_acc = logs.get('acc') * 100 44 | val_loss = logs.get('val_loss') 45 | val_acc = logs.get('val_acc') 46 | logging.info("epoch = %4d loss = %0.6f acc = %0.2f%%" % (epoch, curr_loss, curr_acc)) 47 | logging.info("epoch = %4d val_loss = %0.6f val_acc = %0.2f%%" % (epoch, val_loss, val_acc)) 48 | 49 | 50 | class Generator(Sequence): 51 | 52 | def __init__(self, x_set, x_set_mairesse, y_set, batch_size, W, sent_max_count, word_max_count, embbeding_size): 53 | self.x, self.mairesse, self.y = x_set, x_set_mairesse, y_set 54 | self.batch_size = batch_size 55 | self.W = W 56 | self.sent_max_count = sent_max_count 57 | self.word_max_count = word_max_count 58 | self.embbeding_size = embbeding_size 59 | 60 | def __len__(self): 61 | return int(np.ceil(len(self.x) / float(self.batch_size))) 62 | 63 | def __getitem__(self, idx): 64 | batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size] 65 | batch_m = self.mairesse[idx * self.batch_size:(idx + 1) * self.batch_size] 66 | batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size] 67 | 68 | return [make_input_batch(batch_x, W, self.sent_max_count, self.word_max_count, self.embbeding_size), batch_m], \ 69 | to_categorical(batch_y, num_classes=2) 70 | 71 | 72 | # def get_checkpoints(model_dir): 73 | # saved_checkpoints = [f for f in os.listdir(model_dir) if f.startswith('model-' + model_name)] 74 | # saved_checkpoints.sort(reverse=True) 75 | # return saved_checkpoints 76 | 77 | 78 | def train_conv_net(datasets, W, historyfile, iteration, 79 | embbeding_size = 300, 80 | n_epochs = 50, 81 | batch_size = 50): 82 | word_max_count = len(datasets[0][0][0]) 83 | sent_max_count = len(datasets[0][0]) 84 | 85 | 86 | # define model architecture 87 | 88 | model_input = Input(shape=(sent_max_count, word_max_count, embbeding_size, 1), name='main_input') 89 | 90 | # unigrams 91 | model_1 = Sequential() 92 | model_1.add(Conv3D(200, (1, 1, embbeding_size), activation='relu', 93 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 94 | model_1.add(MaxPooling3D((1, word_max_count, 1))) 95 | 96 | model_output_1 = model_1(model_input) 97 | 98 | # bigrams 99 | model_2 = Sequential() 100 | model_2.add(Conv3D(200, (1, 2, embbeding_size), activation='relu', 101 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 102 | model_2.add(MaxPooling3D((1, word_max_count - 1, 1))) 103 | 104 | model_output_2 = model_2(model_input) 105 | 106 | # trigrams 107 | model_3 = Sequential() 108 | model_3.add(Conv3D(200, (1, 3, embbeding_size), activation='relu', 109 | input_shape=(sent_max_count, word_max_count, embbeding_size, 1))) 110 | model_3.add(MaxPooling3D((1, word_max_count - 2, 1))) 111 | 112 | model_output_3 = model_3(model_input) 113 | 114 | 115 | model = concatenate([model_output_1, model_output_2, model_output_3], axis=-1) 116 | 117 | after_MaxPooling = MaxPooling3D((sent_max_count, 1, 1))(model) 118 | 119 | mairesse_input = Input(shape=(84,), name='mairesse') 120 | model = 
Reshape((600,))(after_MaxPooling) 121 | concatenated_with_mairsse = concatenate([model, mairesse_input], axis=-1) 122 | 123 | model = Dense(200, activation='sigmoid')(concatenated_with_mairsse) 124 | model = Dropout(0.5)(model) 125 | output = Dense(2, activation='softmax')(model) 126 | 127 | final_model = Model(inputs=[model_input, mairesse_input], outputs=output) 128 | final_model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy']) 129 | 130 | validation_size = int(np.round(0.1 * len(datasets[0]))) 131 | 132 | X_train = datasets[0][validation_size:] 133 | y_train = datasets[1][validation_size:] 134 | X_validation = datasets[0][:validation_size] 135 | y_validation = datasets[1][:validation_size] 136 | X_test = datasets[2] 137 | y_test = datasets[3] 138 | 139 | mairesse_train = datasets[4][validation_size:] 140 | mairesse_test = datasets[5] 141 | mairesse_validation = datasets[4][:validation_size] 142 | 143 | train_data_G = Generator(X_train, mairesse_train, y_train, batch_size, W, sent_max_count, word_max_count, 144 | embbeding_size) 145 | val_data_G = Generator(X_validation, mairesse_validation, y_validation, batch_size, W, sent_max_count, 146 | word_max_count, 147 | embbeding_size) 148 | test_data_G = Generator(X_test, mairesse_test, y_test, batch_size, W, sent_max_count, word_max_count, 149 | embbeding_size) 150 | 151 | # model_dir = 'models/results/' + model_name + '/' + str(iteration) 152 | # checkpoint_path = model_dir + "/model-" + model_name + '-' + str(iteration) + "-{acc:02f}.hdf5" 153 | # # Keep only a single checkpoint, the best over test accuracy. 154 | # checkpoint = ModelCheckpoint(str(checkpoint_path), 155 | # monitor='acc', 156 | # verbose=1) 157 | # saved_checkpoints = get_checkpoints(model_dir) 158 | # if len(saved_checkpoints) > 0: 159 | # last_checkpoint = saved_checkpoints[0] 160 | # logging.info("Resume training from " + last_checkpoint) 161 | # final_model.load_weights(model_dir + '/' + last_checkpoint) 162 | # else: 163 | # logging.info("Traning from scratch!") 164 | # logging.info(len(X_train) / batch_size) 165 | 166 | history = History() 167 | 168 | final_model.fit_generator(train_data_G, validation_data=val_data_G, steps_per_epoch=len(X_train) / batch_size, 169 | validation_steps=len(X_validation) / batch_size, epochs=n_epochs, 170 | callbacks=[my_logger, history]) 171 | 172 | # final_model.fit_generator(train_data_G, validation_data=val_data_G, steps_per_epoch=len(X_train) / batch_size, 173 | # validation_steps=len(X_validation) / batch_size, epochs=n_epochs, 174 | # callbacks=[my_logger, history, checkpoint]) 175 | # logging.info("loading best model weights") 176 | # saved_checkpoints = get_checkpoints(model_dir) 177 | # last_checkpoint = saved_checkpoints[0] 178 | # logging.info("Resume weights from " + last_checkpoint) 179 | # final_model.load_weights(model_dir + '/' + last_checkpoint) 180 | logging.info("evaluating model...") 181 | loss, acc = final_model.evaluate_generator(test_data_G, steps=len(datasets[0]) / batch_size) 182 | hist = str(history.history) 183 | pickle.dump(hist, historyfile) 184 | 185 | logging.info('score = ' + str(loss) + "," + str(acc)) 186 | return loss, acc 187 | 188 | def make_input_batch(X_train, W, sent_max_count, word_max_count, embbeding_size): 189 | size = (len(X_train), sent_max_count, word_max_count, embbeding_size) 190 | input_train = np.zeros(size) 191 | for rev_dx, review in enumerate(X_train): 192 | for sent_idx, sentence in enumerate(review): 193 | sentence = np.array(sentence) 194 | 
indexes = np.where(sentence != 0)[0] 195 | for idx in indexes: 196 | input_train[rev_dx][sent_idx][idx] = W[sentence[idx]] 197 | input_train = input_train.reshape([len(X_train), sent_max_count, word_max_count, embbeding_size, 1]) 198 | return input_train 199 | 200 | 201 | def make_idx_data_cv(revs, word_idx_map, mairesse, charged_words, cv, per_attr=0, max_l=51, max_s=200, k=300, 202 | filter_h=5): 203 | """ 204 | Transforms sentences into a 2-d matrix. 205 | """ 206 | trainX, testX, trainY, testY, mTrain, mTest = [], [], [], [], [], [] 207 | for idx, rev in enumerate(revs): 208 | sent = get_idx_from_sent(rev["text"], word_idx_map, 209 | charged_words, 210 | max_l, max_s, k, filter_h) 211 | 212 | if rev["split"] == cv: 213 | testX.append(sent) 214 | testY.append(rev['y' + str(per_attr)]) 215 | mTest.append(mairesse[rev["user"]]) 216 | else: 217 | trainX.append(sent) 218 | trainY.append(rev['y' + str(per_attr)]) 219 | mTrain.append(mairesse[rev["user"]]) 220 | trainX = np.array(trainX) 221 | testX = np.array(testX) 222 | trainY = np.array(trainY) 223 | testY = np.array(testY) 224 | mTrain = np.array(mTrain) 225 | mTest = np.array(mTest) 226 | return [trainX, trainY, testX, testY, mTrain, mTest] 227 | 228 | def get_idx_from_sent(status, word_idx_map, charged_words, max_l=51, max_s=200, k=300, filter_h=5): 229 | """ 230 | Transforms sentence into a list of indices. Pad with zeroes. 231 | """ 232 | x = [] 233 | pad = filter_h - 1 234 | length = len(status) 235 | 236 | pass_one = True 237 | while len(x) == 0: 238 | charged_counter = 0 239 | not_charged_counter = 0 240 | for i in range(length): 241 | words = status[i].split() 242 | if pass_one: 243 | words_set = set(words) 244 | if len(charged_words.intersection(words_set)) == 0: 245 | not_charged_counter += 1 246 | continue 247 | else: 248 | if np.random.randint(0, 2) == 0: 249 | continue 250 | charged_counter += 1 251 | y = [] 252 | for i in range(pad): 253 | y.append(0) 254 | for word in words: 255 | if word in word_idx_map: 256 | y.append(word_idx_map[word]) 257 | 258 | while len(y) < max_l + 2 * pad: 259 | y.append(0) 260 | x.append(y) 261 | pass_one = False 262 | 263 | if len(x) < max_s: 264 | x.extend([[0] * (max_l + 2 * pad)] * (max_s - len(x))) 265 | 266 | return x 267 | 268 | if __name__ == "__main__": 269 | logging.info("loading data...: floatx:") 270 | my_logger = MyLogger(n=1) 271 | x = joblib.load("essays_mairesse.p") 272 | 273 | revs, W, W2, word_idx_map, vocab, mairesse = x[0], x[1], x[2], x[3], x[4], x[5] 274 | logging.info("data loaded!") 275 | try: 276 | attr = int(sys.argv[1]) 277 | except IndexError: 278 | attr = 4 279 | 280 | r = range(0, 10) 281 | 282 | ofile = open('perf_output_' + model_name + "_" + str(attr) + '_w2v.txt', 'w') 283 | 284 | charged_words = [] 285 | 286 | emof = open("Emotion_Lexicon.csv", "rt") 287 | history_file_name = 'history_' + model_name + '_attr_' + str(attr) + '_w2v.txt' 288 | historyfile = open(history_file_name, 'wb') 289 | csvf = csv.reader(emof, delimiter=',', quotechar='"') 290 | first_line = True 291 | 292 | for line in csvf: 293 | if first_line: 294 | first_line = False 295 | continue 296 | if line[11] == "1": 297 | charged_words.append(line[0]) 298 | 299 | emof.close() 300 | 301 | charged_words = set(charged_words) 302 | 303 | results = [] 304 | for i in r: 305 | logging.info("iteration = %4d from %4d " % (i, len(r))) 306 | datasets = make_idx_data_cv(revs, word_idx_map, mairesse, charged_words, i, attr, max_l=149, 307 | max_s=312, k=300, 308 | filter_h=3) 309 | 310 | results = 
train_conv_net(datasets, W, historyfile, i) 311 | ofile.write(str(results) + "\n") 312 | ofile.flush() 313 | 314 | ofile.write(str(results)) 315 | historyfile.close() 316 | 317 | -------------------------------------------------------------------------------- /conv_net_classes.py: -------------------------------------------------------------------------------- 1 | """ 2 | Sample code for 3 | Convolutional Neural Networks for Sentence Classification 4 | http://arxiv.org/pdf/1408.5882v2.pdf 5 | 6 | Much of the code is modified from 7 | - deeplearning.net (for ConvNet classes) 8 | - https://github.com/mdenil/dropout (for dropout) 9 | - https://groups.google.com/forum/#!topic/pylearn-dev/3QbKtCumAW4 (for Adadelta) 10 | """ 11 | 12 | import numpy 13 | import theano.tensor.shared_randomstreams 14 | import theano 15 | import theano.tensor as T 16 | # from theano.tensor.signal import downsample 17 | from theano.tensor.signal import pool 18 | from theano.tensor.nnet import conv 19 | 20 | def ReLU(x): 21 | y = T.maximum(0.0, x) 22 | return(y) 23 | def Sigmoid(x): 24 | y = T.nnet.sigmoid(x) 25 | return(y) 26 | def Tanh(x): 27 | y = T.tanh(x) 28 | return(y) 29 | def Iden(x): 30 | y = x 31 | return(y) 32 | 33 | class HiddenLayer(object): 34 | """ 35 | Class for HiddenLayer 36 | """ 37 | def __init__(self, rng, input, n_in, n_out, activation, W=None, b=None, 38 | use_bias=False): 39 | 40 | self.input = input 41 | self.activation = activation 42 | 43 | if W is None: 44 | if activation.func_name == "ReLU": 45 | W_values = numpy.asarray(0.01 * rng.standard_normal(size=(n_in, n_out)), dtype=theano.config.floatX) 46 | else: 47 | W_values = numpy.asarray(rng.uniform(low=-numpy.sqrt(6. / (n_in + n_out)), high=numpy.sqrt(6. / (n_in + n_out)), 48 | size=(n_in, n_out)), dtype=theano.config.floatX) 49 | W = theano.shared(value=W_values, name='W') 50 | if b is None: 51 | b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) 52 | b = theano.shared(value=b_values, name='b') 53 | 54 | self.W = W 55 | self.b = b 56 | 57 | if use_bias: 58 | lin_output = T.dot(input, self.W) + self.b 59 | else: 60 | lin_output = T.dot(input, self.W) 61 | 62 | self.output = (lin_output if activation is None else activation(lin_output)) 63 | 64 | # parameters of the model 65 | if use_bias: 66 | self.params = [self.W, self.b] 67 | else: 68 | self.params = [self.W] 69 | 70 | def _dropout_from_layer(rng, layer, p): 71 | """p is the probablity of dropping a unit 72 | """ 73 | srng = theano.tensor.shared_randomstreams.RandomStreams(rng.randint(999999)) 74 | # p=1-p because 1's indicate keep and p is prob of dropping 75 | mask = srng.binomial(n=1, p=1-p, size=layer.shape) 76 | # The cast is important because 77 | # int * float32 = float64 which pulls things off the gpu 78 | output = layer * T.cast(mask, theano.config.floatX) 79 | return output 80 | 81 | class DropoutHiddenLayer(HiddenLayer): 82 | def __init__(self, rng, input, n_in, n_out, 83 | activation, dropout_rate, use_bias, W=None, b=None): 84 | super(DropoutHiddenLayer, self).__init__( 85 | rng=rng, input=input, n_in=n_in, n_out=n_out, W=W, b=b, 86 | activation=activation, use_bias=use_bias) 87 | 88 | self.output = _dropout_from_layer(rng, self.output, p=dropout_rate) 89 | 90 | class MLPDropout(object): 91 | """A multilayer perceptron with dropout""" 92 | def __init__(self,rng,input,layer_sizes,dropout_rates,activations,use_bias=True): 93 | 94 | #rectified_linear_activation = lambda x: T.maximum(0.0, x) 95 | 96 | # Set up all the hidden layers 97 | self.weight_matrix_sizes = 
zip(layer_sizes, layer_sizes[1:]) 98 | self.layers = [] 99 | self.dropout_layers = [] 100 | self.activations = activations 101 | next_layer_input = input 102 | #first_layer = True 103 | # dropout the input 104 | next_dropout_layer_input = _dropout_from_layer(rng, input, p=dropout_rates[0]) 105 | layer_counter = 0 106 | for n_in, n_out in self.weight_matrix_sizes[:-1]: 107 | next_dropout_layer = DropoutHiddenLayer(rng=rng, 108 | input=next_dropout_layer_input, 109 | activation=activations[layer_counter], 110 | n_in=n_in, n_out=n_out, use_bias=use_bias, 111 | dropout_rate=dropout_rates[layer_counter]) 112 | self.dropout_layers.append(next_dropout_layer) 113 | next_dropout_layer_input = next_dropout_layer.output 114 | 115 | # Reuse the parameters from the dropout layer here, in a different 116 | # path through the graph. 117 | next_layer = HiddenLayer(rng=rng, 118 | input=next_layer_input, 119 | activation=activations[layer_counter], 120 | # scale the weight matrix W with (1-p) 121 | W=next_dropout_layer.W * (1 - dropout_rates[layer_counter]), 122 | b=next_dropout_layer.b, 123 | n_in=n_in, n_out=n_out, 124 | use_bias=use_bias) 125 | self.layers.append(next_layer) 126 | next_layer_input = next_layer.output 127 | #first_layer = False 128 | layer_counter += 1 129 | 130 | # Set up the output layer 131 | n_in, n_out = self.weight_matrix_sizes[-1] 132 | dropout_output_layer = LogisticRegression( 133 | input=next_dropout_layer_input, 134 | n_in=n_in, n_out=n_out) 135 | self.dropout_layers.append(dropout_output_layer) 136 | 137 | # Again, reuse paramters in the dropout output. 138 | output_layer = LogisticRegression( 139 | input=next_layer_input, 140 | # scale the weight matrix W with (1-p) 141 | W=dropout_output_layer.W * (1 - dropout_rates[-1]), 142 | b=dropout_output_layer.b, 143 | n_in=n_in, n_out=n_out) 144 | self.layers.append(output_layer) 145 | 146 | # Use the negative log likelihood of the logistic regression layer as 147 | # the objective. 148 | self.dropout_negative_log_likelihood = self.dropout_layers[-1].negative_log_likelihood 149 | self.dropout_errors = self.dropout_layers[-1].errors 150 | 151 | self.negative_log_likelihood = self.layers[-1].negative_log_likelihood 152 | self.errors = self.layers[-1].errors 153 | 154 | # Grab all the parameters together. 155 | self.params = [ param for layer in self.dropout_layers for param in layer.params ] 156 | 157 | def predict(self, new_data): 158 | next_layer_input = new_data 159 | for i,layer in enumerate(self.layers): 160 | if i
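
For reference, the `essays_mairesse.p` file written by the preprocessing step bundles six objects in a fixed order. Below is a minimal sketch of loading and inspecting it, assuming it was produced by `process_data_python3.py` (the Python 2 script writes the same list with `cPickle` instead of `joblib`):

```python
import joblib

# essays_mairesse.p is written by process_data_python3.py via joblib.dump(..., protocol=2)
revs, W, W2, word_idx_map, vocab, mairesse = joblib.load("essays_mairesse.p")

print(len(revs))          # number of essays; each entry is a dict with y0..y4, text, user, num_words, split
print(W.shape, W2.shape)  # (vocab_size + 1, 300): word2vec-based and randomly initialized embedding matrices
print(len(word_idx_map))  # word -> row index into W / W2
print(len(mairesse))      # user id -> list of 84 Mairesse features (as consumed by the Keras model)
```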