├── .gitignore
├── Data_analysis.ipynb
├── LCA Shortest Path
│   ├── README.md
│   ├── modelv1.ipynb
│   ├── modelv2.ipynb
│   ├── modelv3.ipynb
│   ├── modelv4.ipynb
│   ├── modelv5.ipynb
│   ├── modelv6.ipynb
│   ├── modelv7.ipynb
│   ├── modelv8.ipynb
│   └── path_extractor.ipynb
├── LCA SubTree
│   ├── README.md
│   ├── model2v1.ipynb
│   ├── model2v2.ipynb
│   └── path_extractor.ipynb
├── LICENSE
├── LSTM Seq and Tree
│   ├── README.md
│   ├── model3v1.ipynb
│   ├── model3v2.ipynb
│   └── path_extractor.ipynb
├── README.md
├── data
│   ├── TEST_FILE.txt
│   ├── TEST_FILE_FULL.TXT
│   ├── TRAIN_FILE.TXT
│   ├── dependency_types.txt
│   ├── full_postags_types.txt
│   ├── model3v1.ipynb
│   ├── pos_tags.txt
│   ├── relation_types.txt
│   ├── relation_typesv3.txt
│   ├── test_data
│   ├── test_lca_paths
│   ├── test_pathsv1
│   ├── test_pathsv3
│   ├── test_relations.txt
│   ├── test_relationsv3.txt
│   ├── train_data
│   ├── train_lca_paths
│   ├── train_pathsv1
│   ├── train_pathsv3
│   ├── train_relations.txt
│   ├── train_relationsv3.txt
│   ├── train_text.txt
│   ├── vocab.pkl
│   ├── vocab_glove
│   └── vocab_wiki
├── img
│   ├── lca.jpg
│   ├── lstm_seq.jpg
│   ├── lstm_tree.jpg
│   └── lstm_tree_eq.jpg
└── preprocessing.py.ipynb

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | checkpoint/
2 | .ipynb_checkpoints/
3 | _pycache_/
4 | word_embedding_glove
5 | glove.6B.100d.txt
6 | glove.6B.200d.txt
7 | train/
8 | word_embd_wiki
9 | wikipedia200.bin

--------------------------------------------------------------------------------
/LCA Shortest Path/README.md:
--------------------------------------------------------------------------------
1 | ## Relation Classification using LSTM Networks along Shortest Dependency Paths
2 | 
3 | First, we implemented an architecture following the paper [Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths](https://pdfs.semanticscholar.org/0f44/366c1e1446cfd51258c68bd1da14fe9c7f10.pdf?_ga=2.136229944.807016038.1498203433-264083776.1497442258) by Yan Xu et al.
4 | This neural architecture utilizes the shortest dependency path between two entities in a sentence.
5 | The shortest dependency path retains the information most relevant to relation classification, while eliminating irrelevant words in the sentence.
6 | 
7 | ## SDP-LSTM Model
8 | 
9 | ![LCA Shortest Path](/img/lca.jpg)
10 | 
11 | First, the sentence is parsed into a dependency tree by the [Stanford parser](https://nlp.stanford.edu/software/stanford-dependencies.shtml), and the shortest dependency path (SDP) between the two entities is extracted as the input of our network.
12 | 
13 | Dependency trees are a kind of directed graph, so the direction of a relation matters. Hence we separate the SDP into two sub-paths, each running from an entity to the common ancestor node. Along the SDP, three types of information are used as channels: words, POS tags, and dependency types.
14 | In each channel, the inputs (e.g. the words) are mapped to real-valued vectors, called embeddings, which capture the underlying meanings of the inputs.
15 | 
16 | ### Channels
17 | 
18 | * Each word in a given sentence is mapped to a real-valued vector by looking it up in a pretrained GloVe word embedding table.
19 | * Since word embeddings are obtained on a large generic corpus, the information they contain may not agree with a specific sentence. We deal with this problem by pairing each input word with its POS tag, e.g. noun, verb, etc.
20 | * The dependency types between words provide grammatical relationships in a sentence that can easily be understood and effectively used by people
21 | without linguistic expertise.
22 | Two recurrent neural networks pick up information along the left and right sub-paths of the SDP.
23 | 
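A minimal sketch of these three channel look-ups, distilled from the `modelv*.ipynb` notebooks in this directory (TensorFlow 1.x; the vocabulary sizes and dimensions follow modelv1), is shown below. It covers only the embedding step; the complete graphs are in the notebooks.

```python
import tensorflow as tf  # TensorFlow 1.x, as used in the notebooks

word_vocab_size, pos_vocab_size, dep_vocab_size = 400001, 10, 21
word_embd_dim, pos_embd_dim, dep_embd_dim = 100, 25, 25
batch_size, max_len_path = 10, 10

# One id sequence per sub-path: shape [2 sub-paths, batch, max path length]
word_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="word_ids")
pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="pos_ids")
dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="dep_ids")

# One embedding table per channel; the word table is filled from pretrained GloVe
# vectors, while the POS and dependency tables are learned from scratch.
word_table = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name="word_W")
pos_table = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name="pos_W")
dep_table = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name="dep_W")

# Each look-up yields a [2, batch, max_len_path, dim] tensor; these are the
# per-channel inputs to the per-sub-path LSTMs described in the next section.
embedded_word = tf.nn.embedding_lookup(word_table, word_ids)
embedded_pos = tf.nn.embedding_lookup(pos_table, pos_ids)
embedded_dep = tf.nn.embedding_lookup(dep_table, dep_ids)
```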
24 | ### Recurrent Neural Networks
25 | 
26 | Plain recurrent neural networks suffer from the vanishing/exploding gradient problem. Long short-term memory (LSTM) units overcome this problem by introducing an adaptive gating mechanism, which decides how much of the previous state to keep and which features of the current input to memorize.
27 | An LSTM-based recurrent neural network comprises four components: an input gate, a forget gate, an output gate, and a memory cell.
28 | The two sub-path LSTMs propagate bottom-up from the entities to their common ancestor. This way, the model is direction-sensitive.
29 | 
30 | For each channel and each sub-path, a max-pooling layer packs the recurrent network's states into a fixed-size vector by taking the maximum value in each dimension.
31 | The pooled vectors from the different channels and sub-paths are concatenated and then connected to a hidden layer. Finally, we have a softmax output layer for
32 | classification.
33 | 
34 | ### Training
35 | 
36 | We update the model parameters, including weights, biases, and embeddings, by backpropagation through time (BPTT) and Adam gradient descent with L2 regularization (we regularize the weights W and U, not the bias terms b).
37 | 
38 | ### Data
39 | 
40 | SemEval-2010 Task 8 defines 9 relation types between nominals and a tenth type, Other, for when the two nominals have none of these relations. Direction is considered, so the model is trained over 19 relation classes (2 x 9 directed classes plus Other).
41 | ## Experiments
42 | 
43 | Model | Train-Accuracy | Test-Accuracy | Epochs
44 | --- | --- | --- | ---
45 | modelv1 | 99.45 | 61.4 | 10
46 | modelv2 | 100 | ? | 10
47 | modelv3 | 84.03 | 60.4 | 20
48 | modelv4 | 96.1 | 63.2 | 60
49 | modelv5 | 92.2 | 62.3 | 60
50 | modelv6 | 97.3 | 61.4 | 34
51 | modelv7 | 94.6 | 60.03 | 20
52 | modelv8 | 98.96 | 62.5 | 60
53 | 
54 | ### [modelv1](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv1.ipynb)
55 | * Learning rate = 0.001
56 | * other_state_size = 100
57 | * lambda_l2 = 0.0001
58 | 
59 | ### [modelv2](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv2.ipynb)
60 | * dropout over hidden layer - 0.3
61 | 
62 | ### [modelv3](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv3.ipynb)
63 | * dropout over word_embedding - 0.3
64 | 
65 | ### [modelv4](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv4.ipynb)
66 | * dropout over word_embedding - 0.3
67 | * other_state_size = 50
68 | 
69 | ### [modelv5](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv5.ipynb)
70 | * dropout over word_embedding and hidden_layer - 0.3
71 | * other_state_size = 50
72 | * lambda = 0.00001
73 | 
74 | ### [modelv5](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv5.ipynb)
75 | * dropout over word_embedding, pos_embedding, dep_embedding of 0.5
76 | * dropout on hidden_layer of 0.3
77 | 
78 | ``All models below use learning rate decay at a rate of 0.96 every 2000 steps``
79 | 
80 | ### [modelv6](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv6.ipynb)
81 | * learning rate decay
82 | 
83 | ### [modelv7](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv7.ipynb)
84 | * learning rate decay
85 | * dropout on word, pos tags, dep embedding of 0.5
86 | * dropout on hidden layer of 0.3
87 | 
88 | ### [modelv8](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv8.ipynb)
89 | * learning rate decay
90 | * [word embedding](http://tti-coin.jp/data/wikipedia200.bin) trained on Wikipedia
91 | * dropout over hidden layer of 0.3
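As an aside, the sketch below illustrates how the two sub-paths of an SDP can be obtained from a dependency parse: each entity is walked up towards the root until the lowest common ancestor is reached. It is a plain-Python toy, not the repository's extraction code; the real paths are produced from Stanford-parser output in `path_extractor.ipynb`, and the function name and head map here are purely illustrative.

```python
def lca_subpaths(heads, e1, e2):
    """Split the path between tokens e1 and e2 into two sub-paths, each running
    from an entity up to their lowest common ancestor (LCA) in the parse tree.

    `heads` maps a token index to its head index; the root points to itself.
    Illustrative only -- the repository extracts paths from Stanford-parser output.
    """
    def ancestors(tok):
        chain = [tok]
        while heads[tok] != tok:      # climb until the root is reached
            tok = heads[tok]
            chain.append(tok)
        return chain

    chain1, chain2 = ancestors(e1), ancestors(e2)
    lca = next(tok for tok in chain1 if tok in chain2)  # first shared ancestor
    path1 = chain1[:chain1.index(lca) + 1]              # entity 1 -> ... -> LCA
    path2 = chain2[:chain2.index(lca) + 1]              # entity 2 -> ... -> LCA
    return path1, path2

# Toy head map for "A misty ridge uprises from the surge"
# (0 A, 1 misty, 2 ridge, 3 uprises, 4 from, 5 the, 6 surge; 3 is the root).
heads = {0: 2, 1: 2, 2: 3, 3: 3, 4: 3, 5: 6, 6: 4}
print(lca_subpaths(heads, 2, 6))  # ([2, 3], [6, 4, 3]) -- both sub-paths end at "uprises"
```

The pickled path files loaded by the notebooks store exactly these per-sub-path word, POS-tag, and dependency-type sequences (`word_p1`, `word_p2`, `dep_p1`, `dep_p2`, `pos_p1`, `pos_p2`).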
--------------------------------------------------------------------------------
/LCA Shortest Path/modelv1.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "data_dir = '../data' # Directory for Data and Other files\n", 27 | "ckpt_dir = '../checkpoint' # Directory for Checkpoints \n", 28 | "word_embd_dir = '../checkpoint/word_embd' # Directory for Checkpoints of Word Embedding Layer\n", 29 | "model_dir = '../checkpoint/modelv1' # Directory for Checkpoints of Model" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "word_embd_dim = 100 # Dimension of embedding layer for words\n", 41 | "pos_embd_dim = 25 # Dimension of embedding layer for POS Tags\n", 42 | "dep_embd_dim = 25 # Dimension of embedding layer for Dependency Types\n", 43 | "\n", 44 | "word_vocab_size = 400001 # Vocab size for Words\n", 45 | "pos_vocab_size = 10 # Vocab size for POS Tags\n", 46 | "dep_vocab_size = 21 # Vocab size for Dependency Types\n", 47 | "\n", 48 | "relation_classes = 19 # No. of Relation Classes\n", 49 | "word_state_size = other_state_size = 100 # Dimension of States of LSTM-RNNs (word and other channels)\n", 50 | "batch_size = 10 # Batch Size for training\n", 51 | "\n", 52 | "channels = 3 # No. 
of types of features to feed in LSTM-RNN\n", 53 | "lambda_l2 = 0.0001\n", 54 | "max_len_path = 10 # Maximum length of sequence" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 4, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "with tf.name_scope(\"input\"):\n", 66 | " \n", 67 | " # Length of the sequence\n", 68 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\") \n", 69 | " \n", 70 | " # Words in the sequence\n", 71 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\") \n", 72 | " \n", 73 | " # POS Tags in the sequence\n", 74 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\") \n", 75 | " \n", 76 | " # Dependency Types in the sequence\n", 77 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\") \n", 78 | " \n", 79 | " # True Relation btw the entities\n", 80 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\") " 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 5, 86 | "metadata": { 87 | "collapsed": true 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "# Embedding Layer of Words \n", 92 | "with tf.name_scope(\"word_embedding\"):\n", 93 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 94 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 95 | " embedding_init = W.assign(embedding_placeholder)\n", 96 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 97 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 98 | "\n", 99 | "# Embedding Layer of POS Tags \n", 100 | "with tf.name_scope(\"pos_embedding\"):\n", 101 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 102 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 103 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 104 | "\n", 105 | "# Embedding Layer of Dependency Types \n", 106 | "with tf.name_scope(\"dep_embedding\"):\n", 107 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 108 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 109 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 6, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "hidden_states = tf.zeros([channels, batch_size, state_size], name=\"hidden_state\")\n", 121 | "cell_states = tf.zeros([channels, batch_size, state_size], name=\"cell_state\")\n", 122 | "\n", 123 | "init_states = [tf.contrib.rnn.LSTMStateTuple(hidden_states[i], cell_states[i]) for i in range(channels)]\n", 124 | "\n", 125 | "with tf.variable_scope(\"word_lstm1\"):\n", 126 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 127 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[0], sequence_length=path_length[0], initial_state=init_states[0])\n", 128 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 129 | "\n", 130 | "with tf.variable_scope(\"word_lstm2\"):\n", 131 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 132 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[1], sequence_length=path_length[1], initial_state=init_states[0])\n", 133 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 134 | "\n", 135 | "with 
tf.variable_scope(\"pos_lstm1\"):\n", 136 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 137 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=init_states[1])\n", 138 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 139 | "\n", 140 | "with tf.variable_scope(\"pos_lstm2\"):\n", 141 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 142 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=init_states[1])\n", 143 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 144 | "\n", 145 | "with tf.variable_scope(\"dep_lstm1\"):\n", 146 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 147 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=init_states[2])\n", 148 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 149 | "\n", 150 | "with tf.variable_scope(\"dep_lstm2\"):\n", 151 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 152 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=init_states[2])\n", 153 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 154 | "\n", 155 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 156 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 157 | "\n", 158 | "state_series = tf.concat([state_series1, state_series2], 1)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 7, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "with tf.name_scope(\"hidden_layer\"):\n", 170 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 171 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 172 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 173 | "\n", 174 | "with tf.name_scope(\"softmax_layer\"):\n", 175 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 176 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 177 | " logits = tf.matmul(y_hidden_layer, W) + b\n", 178 | " predictions = tf.argmax(logits, 1)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 8, 184 | "metadata": { 185 | "collapsed": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "tv_all = tf.trainable_variables()\n", 190 | "tv_regu = []\n", 191 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 192 | "for t in tv_all:\n", 193 | " if t.name not in non_reg:\n", 194 | " if(t.name.find('biases')==-1):\n", 195 | " tv_regu.append(t)\n", 196 | "\n", 197 | "with tf.name_scope(\"loss\"):\n", 198 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 199 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 200 | " total_loss = loss + l2_loss\n", 201 | "\n", 202 | "global_step = tf.Variable(0, name=\"global_step\")\n", 203 | "\n", 204 | "optimizer = tf.train.AdamOptimizer(0.001).minimize(total_loss, global_step=global_step)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 9, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "f 
= open(data_dir + '/vocab.pkl', 'rb')\n", 216 | "vocab = pickle.load(f)\n", 217 | "f.close()\n", 218 | "\n", 219 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 220 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 221 | "\n", 222 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 223 | "word2id[unknown_token] = word_vocab_size -1\n", 224 | "id2word[word_vocab_size-1] = unknown_token\n", 225 | "\n", 226 | "pos_tags_vocab = []\n", 227 | "for line in open(data_dir + '/pos_tags.txt'):\n", 228 | " pos_tags_vocab.append(line.strip())\n", 229 | "\n", 230 | "dep_vocab = []\n", 231 | "for line in open(data_dir + '/dependency_types.txt'):\n", 232 | " dep_vocab.append(line.strip())\n", 233 | "\n", 234 | "relation_vocab = []\n", 235 | "for line in open(data_dir + '/relation_types.txt'):\n", 236 | " relation_vocab.append(line.strip())\n", 237 | "\n", 238 | "\n", 239 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 240 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 241 | "\n", 242 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 243 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 244 | "\n", 245 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 246 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 247 | "\n", 248 | "pos_tag2id['OTH'] = 9\n", 249 | "id2pos_tag[9] = 'OTH'\n", 250 | "\n", 251 | "dep2id['OTH'] = 20\n", 252 | "id2dep[20] = 'OTH'\n", 253 | "\n", 254 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 255 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 256 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 257 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 258 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 259 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 260 | "\n", 261 | "def pos_tag(x):\n", 262 | " if x in JJ_pos_tags:\n", 263 | " return pos_tag2id['JJ']\n", 264 | " if x in NN_pos_tags:\n", 265 | " return pos_tag2id['NN']\n", 266 | " if x in RB_pos_tags:\n", 267 | " return pos_tag2id['RB']\n", 268 | " if x in PRP_pos_tags:\n", 269 | " return pos_tag2id['PRP']\n", 270 | " if x in VB_pos_tags:\n", 271 | " return pos_tag2id['VB']\n", 272 | " if x in _pos_tags:\n", 273 | " return pos_tag2id[x]\n", 274 | " else:\n", 275 | " return 9" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 16, 281 | "metadata": { 282 | "collapsed": true 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "f = open(data_dir + '/train_paths', 'rb')\n", 287 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 288 | "f.close()\n", 289 | "\n", 290 | "relations = []\n", 291 | "for line in open(data_dir + '/train_relations.txt'):\n", 292 | " relations.append(line.strip().split()[1])\n", 293 | "\n", 294 | "length = len(word_p1)\n", 295 | "num_batches = int(length/batch_size)\n", 296 | "\n", 297 | "for i in range(length):\n", 298 | " for j, word in enumerate(word_p1[i]):\n", 299 | " word = word.lower()\n", 300 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 301 | " for k, word in enumerate(word_p2[i]):\n", 302 | " word = word.lower()\n", 303 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 304 | " for l, d in enumerate(dep_p1[i]):\n", 305 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 306 | " for m, d in enumerate(dep_p2[i]):\n", 307 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 308 | "\n", 309 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 310 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 311 | 
"pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 312 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 313 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 314 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 315 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 316 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 317 | "path2_len = np.array([len(w) for w in word_p2])\n", 318 | "\n", 319 | "for i in range(length):\n", 320 | " for j, w in enumerate(word_p1[i]):\n", 321 | " word_p1_ids[i][j] = word2id[w]\n", 322 | " for j, w in enumerate(word_p2[i]):\n", 323 | " word_p2_ids[i][j] = word2id[w]\n", 324 | " for j, w in enumerate(pos_p1[i]):\n", 325 | " pos_p1_ids[i][j] = pos_tag(w)\n", 326 | " for j, w in enumerate(pos_p2[i]):\n", 327 | " pos_p2_ids[i][j] = pos_tag(w)\n", 328 | " for j, w in enumerate(dep_p1[i]):\n", 329 | " dep_p1_ids[i][j] = dep2id[w]\n", 330 | " for j, w in enumerate(dep_p2[i]):\n", 331 | " dep_p2_ids[i][j] = dep2id[w]" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 17, 337 | "metadata": { 338 | "collapsed": true 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "sess = tf.Session()\n", 343 | "sess.run(tf.global_variables_initializer())\n", 344 | "saver = tf.train.Saver()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 10, 350 | "metadata": { 351 | "collapsed": true 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "# f = open('data/word_embedding', 'rb')\n", 356 | "# word_embedding = pickle.load(f)\n", 357 | "# f.close()\n", 358 | "\n", 359 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 360 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 19, 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "INFO:tensorflow:Restoring parameters from checkpoint/modelv1/model\n" 373 | ] 374 | } 375 | ], 376 | "source": [ 377 | "model = tf.train.latest_checkpoint(model_dir)\n", 378 | "saver.restore(sess, model)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 13, 384 | "metadata": { 385 | "collapsed": true, 386 | "scrolled": true 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "# latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 391 | "# word_embedding_saver.restore(sess, latest_embd)" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 20, 397 | "metadata": { 398 | "collapsed": true, 399 | "scrolled": true 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "num_epochs = 10\n", 404 | "for i in range(num_epochs):\n", 405 | " for j in range(num_batches):\n", 406 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 407 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 408 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 409 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 410 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 411 | " \n", 412 | " feed_dict = {\n", 413 | " path_length:path_dict,\n", 414 | " word_ids:word_dict,\n", 415 | " pos_ids:pos_dict,\n", 416 | " dep_ids:dep_dict,\n", 417 | " y:y_dict}\n", 418 | " _, loss, step = 
sess.run([optimizer, total_loss, global_step], feed_dict)\n", 419 | " if step%10==0:\n", 420 | " print(\"Step:\", step, \"loss:\",loss)\n", 421 | " if step % 1000 == 0:\n", 422 | " saver.save(sess, model_dir + '/model')\n", 423 | " print(\"Saved Model\")" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 21, 429 | "metadata": { 430 | "scrolled": false 431 | }, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "training accuracy 99.2625\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "# training accuracy\n", 443 | "all_predictions = []\n", 444 | "for j in range(num_batches):\n", 445 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 446 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 447 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 448 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 449 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 450 | "\n", 451 | " feed_dict = {\n", 452 | " path_length:path_dict,\n", 453 | " word_ids:word_dict,\n", 454 | " pos_ids:pos_dict,\n", 455 | " dep_ids:dep_dict,\n", 456 | " y:y_dict}\n", 457 | " batch_predictions = sess.run(predictions, feed_dict)\n", 458 | " all_predictions.append(batch_predictions)\n", 459 | "\n", 460 | "y_pred = []\n", 461 | "for i in range(num_batches):\n", 462 | " for pred in all_predictions[i]:\n", 463 | " y_pred.append(pred)\n", 464 | "\n", 465 | "count = 0\n", 466 | "for i in range(batch_size*num_batches):\n", 467 | " count += y_pred[i]==rel_ids[i]\n", 468 | "accuracy = count/(batch_size*num_batches) * 100\n", 469 | "\n", 470 | "print(\"training accuracy\", accuracy)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 22, 476 | "metadata": { 477 | "collapsed": true 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "f = open(data_dir + '/test_paths', 'rb')\n", 482 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 483 | "f.close()\n", 484 | "\n", 485 | "relations = []\n", 486 | "for line in open(data_dir + '/test_relations.txt'):\n", 487 | " relations.append(line.strip().split()[0])\n", 488 | "\n", 489 | "length = len(word_p1)\n", 490 | "num_batches = int(length/batch_size)\n", 491 | "\n", 492 | "for i in range(length):\n", 493 | " for j, word in enumerate(word_p1[i]):\n", 494 | " word = word.lower()\n", 495 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 496 | " for k, word in enumerate(word_p2[i]):\n", 497 | " word = word.lower()\n", 498 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 499 | " for l, d in enumerate(dep_p1[i]):\n", 500 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 501 | " for m, d in enumerate(dep_p2[i]):\n", 502 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 503 | "\n", 504 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 505 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 506 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 507 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 508 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 509 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 510 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 511 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 512 | 
"path2_len = np.array([len(w) for w in word_p2])\n", 513 | "\n", 514 | "for i in range(length):\n", 515 | " for j, w in enumerate(word_p1[i]):\n", 516 | " word_p1_ids[i][j] = word2id[w]\n", 517 | " for j, w in enumerate(word_p2[i]):\n", 518 | " word_p2_ids[i][j] = word2id[w]\n", 519 | " for j, w in enumerate(pos_p1[i]):\n", 520 | " pos_p1_ids[i][j] = pos_tag(w)\n", 521 | " for j, w in enumerate(pos_p2[i]):\n", 522 | " pos_p2_ids[i][j] = pos_tag(w)\n", 523 | " for j, w in enumerate(dep_p1[i]):\n", 524 | " dep_p1_ids[i][j] = dep2id[w]\n", 525 | " for j, w in enumerate(dep_p2[i]):\n", 526 | " dep_p2_ids[i][j] = dep2id[w]" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 23, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "name": "stdout", 536 | "output_type": "stream", 537 | "text": [ 538 | "test accuracy 61.4022140221\n" 539 | ] 540 | } 541 | ], 542 | "source": [ 543 | "# test \n", 544 | "all_predictions = []\n", 545 | "for j in range(num_batches):\n", 546 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 547 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 548 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 549 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 550 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 551 | "\n", 552 | " feed_dict = {\n", 553 | " path_length:path_dict,\n", 554 | " word_ids:word_dict,\n", 555 | " pos_ids:pos_dict,\n", 556 | " dep_ids:dep_dict,\n", 557 | " y:y_dict}\n", 558 | " batch_predictions = sess.run(predictions, feed_dict)\n", 559 | " all_predictions.append(batch_predictions)\n", 560 | "\n", 561 | "y_pred = []\n", 562 | "for i in range(num_batches):\n", 563 | " for pred in all_predictions[i]:\n", 564 | " y_pred.append(pred)\n", 565 | "\n", 566 | "count = 0\n", 567 | "for i in range(batch_size*num_batches):\n", 568 | " count += y_pred[i]==rel_ids[i]\n", 569 | "accuracy = count/(batch_size*num_batches) * 100\n", 570 | "\n", 571 | "print(\"test accuracy\", accuracy)" 572 | ] 573 | } 574 | ], 575 | "metadata": { 576 | "kernelspec": { 577 | "display_name": "Python 3", 578 | "language": "python", 579 | "name": "python3" 580 | }, 581 | "language_info": { 582 | "codemirror_mode": { 583 | "name": "ipython", 584 | "version": 3 585 | }, 586 | "file_extension": ".py", 587 | "mimetype": "text/x-python", 588 | "name": "python", 589 | "nbconvert_exporter": "python", 590 | "pygments_lexer": "ipython3", 591 | "version": "3.5.4" 592 | } 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 2 596 | } 597 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv6.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/modelv6'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | 
"word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "word_state_size = 100\n", 30 | "other_state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 10\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 71 | "\n", 72 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 73 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 74 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 75 | "\n", 76 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 77 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 78 | "\n", 79 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 80 | "\n", 81 | "with tf.variable_scope(\"word_lstm1\"):\n", 82 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 83 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 84 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 85 | "\n", 86 | "with tf.variable_scope(\"word_lstm2\"):\n", 87 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 88 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 89 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 90 | "\n", 91 | "with tf.variable_scope(\"pos_lstm1\"):\n", 92 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 93 | " state_series, 
current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 94 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 95 | "\n", 96 | "with tf.variable_scope(\"pos_lstm2\"):\n", 97 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 98 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 99 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 100 | "\n", 101 | "with tf.variable_scope(\"dep_lstm1\"):\n", 102 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 103 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 104 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 105 | "\n", 106 | "with tf.variable_scope(\"dep_lstm2\"):\n", 107 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 108 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 109 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 110 | "\n", 111 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 112 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 113 | "\n", 114 | "state_series = tf.concat([state_series1, state_series2], 1)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 3, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "with tf.name_scope(\"hidden_layer\"):\n", 126 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 127 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 128 | " y_hidden_layer = tf.nn.relu(tf.matmul(state_series, W) + b)\n", 129 | "\n", 130 | "with tf.name_scope(\"softmax_layer\"):\n", 131 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 132 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 133 | " logits = tf.matmul(y_hidden_layer, W) + b\n", 134 | " predictions = tf.argmax(logits, 1)\n", 135 | "\n", 136 | "tv_all = tf.trainable_variables()\n", 137 | "tv_regu = []\n", 138 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 139 | "for t in tv_all:\n", 140 | " if t.name not in non_reg:\n", 141 | " if(t.name.find('biases')==-1):\n", 142 | " tv_regu.append(t)\n", 143 | "\n", 144 | "with tf.name_scope(\"loss\"):\n", 145 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 146 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 147 | " total_loss = loss + l2_loss\n", 148 | "\n", 149 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 150 | "\n", 151 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 152 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 4, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 164 | "vocab = pickle.load(f)\n", 165 | "f.close()\n", 
166 | "\n", 167 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 168 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 169 | "\n", 170 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 171 | "word2id[unknown_token] = word_vocab_size -1\n", 172 | "id2word[word_vocab_size-1] = unknown_token\n", 173 | "\n", 174 | "pos_tags_vocab = []\n", 175 | "for line in open(data_dir + '/pos_tags.txt'):\n", 176 | " pos_tags_vocab.append(line.strip())\n", 177 | "\n", 178 | "dep_vocab = []\n", 179 | "for line in open(data_dir + '/dependency_types.txt'):\n", 180 | " dep_vocab.append(line.strip())\n", 181 | "\n", 182 | "relation_vocab = []\n", 183 | "for line in open(data_dir + '/relation_types.txt'):\n", 184 | " relation_vocab.append(line.strip())\n", 185 | "\n", 186 | "\n", 187 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 188 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 189 | "\n", 190 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 191 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 192 | "\n", 193 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 194 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 195 | "\n", 196 | "pos_tag2id['OTH'] = 9\n", 197 | "id2pos_tag[9] = 'OTH'\n", 198 | "\n", 199 | "dep2id['OTH'] = 20\n", 200 | "id2dep[20] = 'OTH'\n", 201 | "\n", 202 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 203 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 204 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 205 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 206 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 207 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 208 | "\n", 209 | "def pos_tag(x):\n", 210 | " if x in JJ_pos_tags:\n", 211 | " return pos_tag2id['JJ']\n", 212 | " if x in NN_pos_tags:\n", 213 | " return pos_tag2id['NN']\n", 214 | " if x in RB_pos_tags:\n", 215 | " return pos_tag2id['RB']\n", 216 | " if x in PRP_pos_tags:\n", 217 | " return pos_tag2id['PRP']\n", 218 | " if x in VB_pos_tags:\n", 219 | " return pos_tag2id['VB']\n", 220 | " if x in _pos_tags:\n", 221 | " return pos_tag2id[x]\n", 222 | " else:\n", 223 | " return 9" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 5, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "sess = tf.Session()\n", 235 | "sess.run(tf.global_variables_initializer())\n", 236 | "saver = tf.train.Saver()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 6, 242 | "metadata": { 243 | "collapsed": true 244 | }, 245 | "outputs": [], 246 | "source": [ 247 | "# f = open('data/word_embedding', 'rb')\n", 248 | "# word_embedding = pickle.load(f)\n", 249 | "# f.close()\n", 250 | "\n", 251 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 252 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 7, 258 | "metadata": { 259 | "collapsed": true 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "# model = tf.train.latest_checkpoint(model_dir)\n", 264 | "# saver.restore(sess, model)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": { 271 | "scrolled": true 272 | }, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | 
"latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 284 | "word_embedding_saver.restore(sess, latest_embd)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "f = open(data_dir + '/train_paths', 'rb')\n", 296 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 297 | "f.close()\n", 298 | "\n", 299 | "relations = []\n", 300 | "for line in open(data_dir + '/train_relations.txt'):\n", 301 | " relations.append(line.strip().split()[1])\n", 302 | "\n", 303 | "length = len(word_p1)\n", 304 | "num_batches = int(length/batch_size)\n", 305 | "\n", 306 | "for i in range(length):\n", 307 | " for j, word in enumerate(word_p1[i]):\n", 308 | " word = word.lower()\n", 309 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 310 | " for k, word in enumerate(word_p2[i]):\n", 311 | " word = word.lower()\n", 312 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 313 | " for l, d in enumerate(dep_p1[i]):\n", 314 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 315 | " for m, d in enumerate(dep_p2[i]):\n", 316 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 317 | "\n", 318 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 319 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 320 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 321 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 322 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 323 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 324 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 325 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 326 | "path2_len = np.array([len(w) for w in word_p2])\n", 327 | "\n", 328 | "for i in range(length):\n", 329 | " for j, w in enumerate(word_p1[i]):\n", 330 | " word_p1_ids[i][j] = word2id[w]\n", 331 | " for j, w in enumerate(word_p2[i]):\n", 332 | " word_p2_ids[i][j] = word2id[w]\n", 333 | " for j, w in enumerate(pos_p1[i]):\n", 334 | " pos_p1_ids[i][j] = pos_tag(w)\n", 335 | " for j, w in enumerate(pos_p2[i]):\n", 336 | " pos_p2_ids[i][j] = pos_tag(w)\n", 337 | " for j, w in enumerate(dep_p1[i]):\n", 338 | " dep_p1_ids[i][j] = dep2id[w]\n", 339 | " for j, w in enumerate(dep_p2[i]):\n", 340 | " dep_p2_ids[i][j] = dep2id[w]" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "scrolled": true 348 | }, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "Epoch: 1 Step: 800 loss: 2.85308745444\n", 355 | "Saved Model\n", 356 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 357 | "Saved Model\n", 358 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 359 | "Saved Model\n", 360 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 361 | "Saved Model\n", 362 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 363 | "Saved Model\n", 364 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 365 | "Saved Model\n", 366 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 367 | "Saved Model\n", 368 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 369 | "Saved Model\n", 370 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 371 | "Saved Model\n", 372 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 373 | "Saved Model\n", 374 | "Epoch: 11 Step: 8800 loss: 2.67341704309\n", 375 | "Saved Model\n", 376 | "Epoch: 12 Step: 9600 loss: 2.67321884066\n", 377 | "Saved Model\n", 378 
| "Epoch: 13 Step: 10400 loss: 2.67310401961\n", 379 | "Saved Model\n", 380 | "Epoch: 14 Step: 11200 loss: 2.67295600712\n", 381 | "Saved Model\n", 382 | "Epoch: 15 Step: 12000 loss: 2.67288722694\n", 383 | "Saved Model\n", 384 | "Epoch: 16 Step: 12800 loss: 2.67282888472\n", 385 | "Saved Model\n", 386 | "Epoch: 17 Step: 13600 loss: 2.67277920395\n", 387 | "Saved Model\n", 388 | "Epoch: 18 Step: 14400 loss: 2.6727619794\n", 389 | "Saved Model\n", 390 | "Epoch: 19 Step: 15200 loss: 2.67268569678\n", 391 | "Saved Model\n", 392 | "Epoch: 20 Step: 16000 loss: 2.67266457796\n", 393 | "Saved Model\n", 394 | "Epoch: 21 Step: 16800 loss: 2.67263956338\n", 395 | "Saved Model\n", 396 | "Epoch: 22 Step: 17600 loss: 2.67261722207\n", 397 | "Saved Model\n", 398 | "Epoch: 23 Step: 18400 loss: 2.67261824235\n", 399 | "Saved Model\n", 400 | "Epoch: 24 Step: 19200 loss: 2.67256126881\n", 401 | "Saved Model\n", 402 | "Epoch: 25 Step: 20000 loss: 2.6725519672\n", 403 | "Saved Model\n", 404 | "Epoch: 26 Step: 20800 loss: 2.67253558069\n", 405 | "Saved Model\n", 406 | "Epoch: 27 Step: 21600 loss: 2.67252239197\n", 407 | "Saved Model\n", 408 | "Epoch: 28 Step: 22400 loss: 2.67252858594\n", 409 | "Saved Model\n", 410 | "Epoch: 29 Step: 23200 loss: 2.67248077154\n", 411 | "Saved Model\n", 412 | "Epoch: 30 Step: 24000 loss: 2.67247578681\n", 413 | "Saved Model\n", 414 | "Epoch: 31 Step: 24800 loss: 2.67246250227\n", 415 | "Saved Model\n", 416 | "Epoch: 32 Step: 25600 loss: 2.67245363146\n", 417 | "Saved Model\n", 418 | "Epoch: 33 Step: 26400 loss: 2.67246143714\n", 419 | "Saved Model\n", 420 | "Epoch: 34 Step: 27200 loss: 2.6724195759\n", 421 | "Saved Model\n", 422 | "Epoch: 35 Step: 28000 loss: 2.67241657913\n", 423 | "Saved Model\n", 424 | "Epoch: 36 Step: 28800 loss: 2.67240460932\n", 425 | "Saved Model\n", 426 | "Epoch: 37 Step: 29600 loss: 2.67239822775\n", 427 | "Saved Model\n", 428 | "Epoch: 38 Step: 30400 loss: 2.6724064\n", 429 | "Saved Model\n" 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "num_epochs = 60\n", 435 | "for i in range(num_epochs):\n", 436 | " loss_per_epoch = 0\n", 437 | " for j in range(num_batches):\n", 438 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 439 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 440 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 441 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 442 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 443 | " \n", 444 | " feed_dict = {\n", 445 | " path_length:path_dict,\n", 446 | " word_ids:word_dict,\n", 447 | " pos_ids:pos_dict,\n", 448 | " dep_ids:dep_dict,\n", 449 | " y:y_dict}\n", 450 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 451 | " loss_per_epoch +=_loss\n", 452 | " if (j+1)%num_batches==0:\n", 453 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 454 | " saver.save(sess, model_dir + '/model')\n", 455 | " print(\"Saved Model\")" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": true, 463 | "scrolled": false 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "# training accuracy\n", 468 | "all_predictions = []\n", 469 | "for j in range(num_batches):\n", 470 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], 
path2_len[j*batch_size:(j+1)*batch_size]]\n", 471 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 472 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 473 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 474 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 475 | "\n", 476 | " feed_dict = {\n", 477 | " path_length:path_dict,\n", 478 | " word_ids:word_dict,\n", 479 | " pos_ids:pos_dict,\n", 480 | " dep_ids:dep_dict,\n", 481 | " y:y_dict}\n", 482 | " batch_predictions = sess.run(predictions, feed_dict)\n", 483 | " all_predictions.append(batch_predictions)\n", 484 | "\n", 485 | "y_pred = []\n", 486 | "for i in range(num_batches):\n", 487 | " for pred in all_predictions[i]:\n", 488 | " y_pred.append(pred)\n", 489 | "\n", 490 | "count = 0\n", 491 | "for i in range(batch_size*num_batches):\n", 492 | " count += y_pred[i]==rel_ids[i]\n", 493 | "accuracy = count/(batch_size*num_batches) * 100\n", 494 | "\n", 495 | "print(\"training accuracy\", accuracy)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 23, 501 | "metadata": {}, 502 | "outputs": [ 503 | { 504 | "name": "stdout", 505 | "output_type": "stream", 506 | "text": [ 507 | "test accuracy 61.4022140221\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "f = open(data_dir + '/test_paths', 'rb')\n", 513 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 514 | "f.close()\n", 515 | "\n", 516 | "relations = []\n", 517 | "for line in open(data_dir + '/test_relations.txt'):\n", 518 | " relations.append(line.strip().split()[0])\n", 519 | "\n", 520 | "length = len(word_p1)\n", 521 | "num_batches = int(length/batch_size)\n", 522 | "\n", 523 | "for i in range(length):\n", 524 | " for j, word in enumerate(word_p1[i]):\n", 525 | " word = word.lower()\n", 526 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 527 | " for k, word in enumerate(word_p2[i]):\n", 528 | " word = word.lower()\n", 529 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 530 | " for l, d in enumerate(dep_p1[i]):\n", 531 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 532 | " for m, d in enumerate(dep_p2[i]):\n", 533 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 534 | "\n", 535 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 536 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 537 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 538 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 539 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 540 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 541 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 542 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 543 | "path2_len = np.array([len(w) for w in word_p2])\n", 544 | "\n", 545 | "for i in range(length):\n", 546 | " for j, w in enumerate(word_p1[i]):\n", 547 | " word_p1_ids[i][j] = word2id[w]\n", 548 | " for j, w in enumerate(word_p2[i]):\n", 549 | " word_p2_ids[i][j] = word2id[w]\n", 550 | " for j, w in enumerate(pos_p1[i]):\n", 551 | " pos_p1_ids[i][j] = pos_tag(w)\n", 552 | " for j, w in enumerate(pos_p2[i]):\n", 553 | " pos_p2_ids[i][j] = pos_tag(w)\n", 554 | " for j, w in enumerate(dep_p1[i]):\n", 555 | " dep_p1_ids[i][j] = dep2id[w]\n", 556 | " for j, w in enumerate(dep_p2[i]):\n", 557 | " dep_p2_ids[i][j] = 
dep2id[w]\n", 558 | "\n", 559 | "# test \n", 560 | "all_predictions = []\n", 561 | "for j in range(num_batches):\n", 562 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 563 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 564 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 565 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 566 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 567 | "\n", 568 | " feed_dict = {\n", 569 | " path_length:path_dict,\n", 570 | " word_ids:word_dict,\n", 571 | " pos_ids:pos_dict,\n", 572 | " dep_ids:dep_dict,\n", 573 | " y:y_dict}\n", 574 | " batch_predictions = sess.run(predictions, feed_dict)\n", 575 | " all_predictions.append(batch_predictions)\n", 576 | "\n", 577 | "y_pred = []\n", 578 | "for i in range(num_batches):\n", 579 | " for pred in all_predictions[i]:\n", 580 | " y_pred.append(pred)\n", 581 | "\n", 582 | "count = 0\n", 583 | "for i in range(batch_size*num_batches):\n", 584 | " count += y_pred[i]==rel_ids[i]\n", 585 | "accuracy = count/(batch_size*num_batches) * 100\n", 586 | "\n", 587 | "print(\"test accuracy\", accuracy)" 588 | ] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python 3", 594 | "language": "python", 595 | "name": "python3" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.5.2" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv7.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/modelv7'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | "word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "word_state_size = 100\n", 30 | "other_state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 10\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " 
dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 71 | "\n", 72 | "with tf.name_scope(\"word_dropout\"):\n", 73 | " embedded_word_drop = tf.nn.dropout(embedded_word, 0.5)\n", 74 | " \n", 75 | "with tf.name_scope(\"pos_dropout\"):\n", 76 | " embedded_pos_drop = tf.nn.dropout(embedded_word, 0.5)\n", 77 | " \n", 78 | "with tf.name_scope(\"dep_dropout\"):\n", 79 | " embedded_dep_drop = tf.nn.dropout(embedded_word, 0.5)\n", 80 | "\n", 81 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 82 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 83 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 84 | "\n", 85 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 86 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 87 | "\n", 88 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 89 | "\n", 90 | "with tf.variable_scope(\"word_lstm1\"):\n", 91 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 92 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 93 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 94 | "\n", 95 | "with tf.variable_scope(\"word_lstm2\"):\n", 96 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 97 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 98 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 99 | "\n", 100 | "with tf.variable_scope(\"pos_lstm1\"):\n", 101 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 102 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_drop[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 103 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 104 | "\n", 105 | "with tf.variable_scope(\"pos_lstm2\"):\n", 106 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 107 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_drop[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 108 | " 
state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 109 | "\n", 110 | "with tf.variable_scope(\"dep_lstm1\"):\n", 111 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 112 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_drop[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 113 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 114 | "\n", 115 | "with tf.variable_scope(\"dep_lstm2\"):\n", 116 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 117 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_drop[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 118 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 119 | "\n", 120 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 121 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 122 | "\n", 123 | "state_series = tf.concat([state_series1, state_series2], 1)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "" 135 | ] 136 | }, 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "state_series" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 4, 149 | "metadata": { 150 | "collapsed": true 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "with tf.name_scope(\"hidden_layer\"):\n", 155 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 156 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 157 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 158 | "\n", 159 | "with tf.name_scope(\"dropout\"):\n", 160 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 161 | "\n", 162 | "with tf.name_scope(\"softmax_layer\"):\n", 163 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 164 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 165 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 166 | " predictions = tf.argmax(logits, 1)\n", 167 | "\n", 168 | "tv_all = tf.trainable_variables()\n", 169 | "tv_regu = []\n", 170 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 171 | "for t in tv_all:\n", 172 | " if t.name not in non_reg:\n", 173 | " if(t.name.find('biases')==-1):\n", 174 | " tv_regu.append(t)\n", 175 | "\n", 176 | "with tf.name_scope(\"loss\"):\n", 177 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 178 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 179 | " total_loss = loss + l2_loss\n", 180 | "\n", 181 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 182 | "\n", 183 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 184 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 5, 190 | "metadata": { 191 | "collapsed": true 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 196 | "vocab = pickle.load(f)\n", 197 | "f.close()\n", 198 | "\n", 
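As a shape check on the graph above: each of the six LSTMs (word, POS and dependency channels for the two sub-paths) is max-pooled over time to a 100-dimensional vector, and the concatenation of all six gives the 600-dimensional input that the `[600, 100]` hidden layer expects. A small NumPy sketch of that pooling-and-concatenation step with random stand-in values:

```python
import numpy as np

batch_size, max_len_path, state_size = 10, 10, 100

# One [batch, time, state] array per LSTM; in the notebook, outputs past a path's
# length are zeroed via dynamic_rnn's sequence_length argument.
channel_states = [np.random.randn(batch_size, max_len_path, state_size) for _ in range(6)]

# Max-pool over the time axis, mirroring tf.reduce_max(state_series, axis=1).
pooled = [s.max(axis=1) for s in channel_states]        # each [batch, 100]

# Concatenate word/POS/dep features for both sub-paths.
features = np.concatenate(pooled, axis=1)
print(features.shape)                                   # (10, 600)
```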
199 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 200 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 201 | "\n", 202 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 203 | "word2id[unknown_token] = word_vocab_size -1\n", 204 | "id2word[word_vocab_size-1] = unknown_token\n", 205 | "\n", 206 | "pos_tags_vocab = []\n", 207 | "for line in open(data_dir + '/pos_tags.txt'):\n", 208 | " pos_tags_vocab.append(line.strip())\n", 209 | "\n", 210 | "dep_vocab = []\n", 211 | "for line in open(data_dir + '/dependency_types.txt'):\n", 212 | " dep_vocab.append(line.strip())\n", 213 | "\n", 214 | "relation_vocab = []\n", 215 | "for line in open(data_dir + '/relation_types.txt'):\n", 216 | " relation_vocab.append(line.strip())\n", 217 | "\n", 218 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 219 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 220 | "\n", 221 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 222 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 223 | "\n", 224 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 225 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 226 | "\n", 227 | "pos_tag2id['OTH'] = 9\n", 228 | "id2pos_tag[9] = 'OTH'\n", 229 | "\n", 230 | "dep2id['OTH'] = 20\n", 231 | "id2dep[20] = 'OTH'\n", 232 | "\n", 233 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 234 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 235 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 236 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 237 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 238 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 239 | "\n", 240 | "def pos_tag(x):\n", 241 | " if x in JJ_pos_tags:\n", 242 | " return pos_tag2id['JJ']\n", 243 | " if x in NN_pos_tags:\n", 244 | " return pos_tag2id['NN']\n", 245 | " if x in RB_pos_tags:\n", 246 | " return pos_tag2id['RB']\n", 247 | " if x in PRP_pos_tags:\n", 248 | " return pos_tag2id['PRP']\n", 249 | " if x in VB_pos_tags:\n", 250 | " return pos_tag2id['VB']\n", 251 | " if x in _pos_tags:\n", 252 | " return pos_tag2id[x]\n", 253 | " else:\n", 254 | " return 9" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 6, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "sess = tf.Session()\n", 266 | "sess.run(tf.global_variables_initializer())\n", 267 | "saver = tf.train.Saver()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 7, 273 | "metadata": { 274 | "collapsed": true 275 | }, 276 | "outputs": [], 277 | "source": [ 278 | "# f = open('data/word_embedding', 'rb')\n", 279 | "# word_embedding = pickle.load(f)\n", 280 | "# f.close()\n", 281 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 282 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 8, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "# model = tf.train.latest_checkpoint(model_dir)\n", 294 | "# saver.restore(sess, model)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 9, 300 | "metadata": { 301 | "scrolled": true 302 | }, 303 | "outputs": [ 304 | { 305 | "name": "stdout", 306 | "output_type": "stream", 307 | "text": [ 308 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 309 | ] 310 | } 311 | ], 312 | "source": [ 313 | "latest_embd = 
tf.train.latest_checkpoint(word_embd_dir)\n", 314 | "word_embedding_saver.restore(sess, latest_embd)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 8, 320 | "metadata": { 321 | "collapsed": true 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "f = open(data_dir + '/train_paths', 'rb')\n", 326 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 327 | "f.close()\n", 328 | "\n", 329 | "relations = []\n", 330 | "for line in open(data_dir + '/train_relations.txt'):\n", 331 | " relations.append(line.strip().split()[1])\n", 332 | "\n", 333 | "length = len(word_p1)\n", 334 | "num_batches = int(length/batch_size)\n", 335 | "\n", 336 | "for i in range(length):\n", 337 | " for j, word in enumerate(word_p1[i]):\n", 338 | " word = word.lower()\n", 339 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 340 | " for k, word in enumerate(word_p2[i]):\n", 341 | " word = word.lower()\n", 342 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 343 | " for l, d in enumerate(dep_p1[i]):\n", 344 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 345 | " for m, d in enumerate(dep_p2[i]):\n", 346 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 347 | "\n", 348 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 349 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 350 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 351 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 352 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 353 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 354 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 355 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 356 | "path2_len = np.array([len(w) for w in word_p2])\n", 357 | "\n", 358 | "for i in range(length):\n", 359 | " for j, w in enumerate(word_p1[i]):\n", 360 | " word_p1_ids[i][j] = word2id[w]\n", 361 | " for j, w in enumerate(word_p2[i]):\n", 362 | " word_p2_ids[i][j] = word2id[w]\n", 363 | " for j, w in enumerate(pos_p1[i]):\n", 364 | " pos_p1_ids[i][j] = pos_tag(w)\n", 365 | " for j, w in enumerate(pos_p2[i]):\n", 366 | " pos_p2_ids[i][j] = pos_tag(w)\n", 367 | " for j, w in enumerate(dep_p1[i]):\n", 368 | " dep_p1_ids[i][j] = dep2id[w]\n", 369 | " for j, w in enumerate(dep_p2[i]):\n", 370 | " dep_p2_ids[i][j] = dep2id[w]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 11, 376 | "metadata": { 377 | "scrolled": true 378 | }, 379 | "outputs": [ 380 | { 381 | "name": "stdout", 382 | "output_type": "stream", 383 | "text": [ 384 | "epoch: 0\n", 385 | "Step: 800 loss: 2.91281905636\n", 386 | "epoch: 1\n", 387 | "Saved Model\n", 388 | "Step: 1600 loss: 1.93908668235\n", 389 | "epoch: 2\n", 390 | "Saved Model\n", 391 | "Step: 2400 loss: 1.45170823216\n", 392 | "epoch: 3\n", 393 | "Saved Model\n", 394 | "Step: 3200 loss: 1.18255942896\n", 395 | "epoch: 4\n", 396 | "Step: 4000 loss: 1.00360578123\n", 397 | "Saved Model\n", 398 | "epoch: 5\n", 399 | "Step: 4800 loss: 0.854295852538\n", 400 | "epoch: 6\n", 401 | "Saved Model\n", 402 | "Step: 5600 loss: 0.748602524679\n", 403 | "epoch: 7\n", 404 | "Saved Model\n", 405 | "Step: 6400 loss: 0.661906255111\n", 406 | "epoch: 8\n", 407 | "Saved Model\n", 408 | "Step: 7200 loss: 0.587379012275\n", 409 | "epoch: 9\n", 410 | "Step: 8000 loss: 0.531537927147\n", 411 | "Saved Model\n", 412 | "epoch: 10\n", 413 | "Step: 8800 loss: 0.484521641694\n", 414 | "epoch: 11\n", 415 | "Saved 
Model\n", 416 | "Step: 9600 loss: 0.444365512617\n", 417 | "epoch: 12\n", 418 | "Saved Model\n", 419 | "Step: 10400 loss: 0.415288321041\n", 420 | "epoch: 13\n", 421 | "Saved Model\n", 422 | "Step: 11200 loss: 0.384827776505\n", 423 | "epoch: 14\n", 424 | "Step: 12000 loss: 0.361082672933\n", 425 | "Saved Model\n", 426 | "epoch: 15\n", 427 | "Step: 12800 loss: 0.338339183582\n", 428 | "epoch: 16\n", 429 | "Saved Model\n", 430 | "Step: 13600 loss: 0.319484538799\n", 431 | "epoch: 17\n", 432 | "Saved Model\n", 433 | "Step: 14400 loss: 0.297788869962\n", 434 | "epoch: 18\n", 435 | "Saved Model\n", 436 | "Step: 15200 loss: 0.28938733831\n", 437 | "epoch: 19\n", 438 | "Step: 16000 loss: 0.27373408448\n", 439 | "Saved Model\n" 440 | ] 441 | } 442 | ], 443 | "source": [ 444 | "num_epochs = 60\n", 445 | "for i in range(num_epochs):\n", 446 | " loss_per_epoch = 0\n", 447 | " for j in range(num_batches):\n", 448 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 449 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 450 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 451 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 452 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 453 | " \n", 454 | " feed_dict = {\n", 455 | " path_length:path_dict,\n", 456 | " word_ids:word_dict,\n", 457 | " pos_ids:pos_dict,\n", 458 | " dep_ids:dep_dict,\n", 459 | " y:y_dict}\n", 460 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 461 | " loss_per_epoch +=_loss\n", 462 | " if (j+1)%num_batches==0:\n", 463 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 464 | " saver.save(sess, model_dir + '/model')\n", 465 | " print(\"Saved Model\")" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 12, 471 | "metadata": { 472 | "scrolled": false 473 | }, 474 | "outputs": [ 475 | { 476 | "name": "stdout", 477 | "output_type": "stream", 478 | "text": [ 479 | "training accuracy 94.6875\n" 480 | ] 481 | } 482 | ], 483 | "source": [ 484 | "# training accuracy\n", 485 | "all_predictions = []\n", 486 | "for j in range(num_batches):\n", 487 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 488 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 489 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 490 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 491 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 492 | "\n", 493 | " feed_dict = {\n", 494 | " path_length:path_dict,\n", 495 | " word_ids:word_dict,\n", 496 | " pos_ids:pos_dict,\n", 497 | " dep_ids:dep_dict,\n", 498 | " y:y_dict}\n", 499 | " batch_predictions = sess.run(predictions, feed_dict)\n", 500 | " all_predictions.append(batch_predictions)\n", 501 | "\n", 502 | "y_pred = []\n", 503 | "for i in range(num_batches):\n", 504 | " for pred in all_predictions[i]:\n", 505 | " y_pred.append(pred)\n", 506 | "\n", 507 | "count = 0\n", 508 | "for i in range(batch_size*num_batches):\n", 509 | " count += y_pred[i]==rel_ids[i]\n", 510 | "accuracy = count/(batch_size*num_batches) * 100\n", 511 | "\n", 512 | "print(\"training accuracy\", accuracy)" 513 
| ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 13, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "test accuracy 60.036900369\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "f = open(data_dir + '/test_paths', 'rb')\n", 530 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 531 | "f.close()\n", 532 | "\n", 533 | "relations = []\n", 534 | "for line in open(data_dir + '/test_relations.txt'):\n", 535 | " relations.append(line.strip().split()[0])\n", 536 | "\n", 537 | "length = len(word_p1)\n", 538 | "num_batches = int(length/batch_size)\n", 539 | "\n", 540 | "for i in range(length):\n", 541 | " for j, word in enumerate(word_p1[i]):\n", 542 | " word = word.lower()\n", 543 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 544 | " for k, word in enumerate(word_p2[i]):\n", 545 | " word = word.lower()\n", 546 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 547 | " for l, d in enumerate(dep_p1[i]):\n", 548 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 549 | " for m, d in enumerate(dep_p2[i]):\n", 550 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 551 | "\n", 552 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 553 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 554 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 555 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 556 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 557 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 558 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 559 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 560 | "path2_len = np.array([len(w) for w in word_p2])\n", 561 | "\n", 562 | "for i in range(length):\n", 563 | " for j, w in enumerate(word_p1[i]):\n", 564 | " word_p1_ids[i][j] = word2id[w]\n", 565 | " for j, w in enumerate(word_p2[i]):\n", 566 | " word_p2_ids[i][j] = word2id[w]\n", 567 | " for j, w in enumerate(pos_p1[i]):\n", 568 | " pos_p1_ids[i][j] = pos_tag(w)\n", 569 | " for j, w in enumerate(pos_p2[i]):\n", 570 | " pos_p2_ids[i][j] = pos_tag(w)\n", 571 | " for j, w in enumerate(dep_p1[i]):\n", 572 | " dep_p1_ids[i][j] = dep2id[w]\n", 573 | " for j, w in enumerate(dep_p2[i]):\n", 574 | " dep_p2_ids[i][j] = dep2id[w]\n", 575 | "\n", 576 | "# test predictions\n", 577 | "all_predictions = []\n", 578 | "for j in range(num_batches):\n", 579 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 580 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 581 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 582 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 583 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 584 | "\n", 585 | " feed_dict = {\n", 586 | " path_length:path_dict,\n", 587 | " word_ids:word_dict,\n", 588 | " pos_ids:pos_dict,\n", 589 | " dep_ids:dep_dict,\n", 590 | " y:y_dict}\n", 591 | " batch_predictions = sess.run(predictions, feed_dict)\n", 592 | " all_predictions.append(batch_predictions)\n", 593 | "\n", 594 | "y_pred = []\n", 595 | "for i in range(num_batches):\n", 596 | " for pred in all_predictions[i]:\n", 597 | " y_pred.append(pred)\n", 598 | "\n", 599 | "count = 0\n", 600 | "for i in 
range(batch_size*num_batches):\n", 601 | " count += y_pred[i]==rel_ids[i]\n", 602 | "accuracy = count/(batch_size*num_batches) * 100\n", 603 | "\n", 604 | "print(\"test accuracy\", accuracy)" 605 | ] 606 | } 607 | ], 608 | "metadata": { 609 | "kernelspec": { 610 | "display_name": "Python 3", 611 | "language": "python", 612 | "name": "python3" 613 | }, 614 | "language_info": { 615 | "codemirror_mode": { 616 | "name": "ipython", 617 | "version": 3 618 | }, 619 | "file_extension": ".py", 620 | "mimetype": "text/x-python", 621 | "name": "python", 622 | "nbconvert_exporter": "python", 623 | "pygments_lexer": "ipython3", 624 | "version": "3.5.2" 625 | } 626 | }, 627 | "nbformat": 4, 628 | "nbformat_minor": 2 629 | } 630 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv8.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd_wiki'\n", 20 | "pos_embd_dir = '../checkpoint/pos_embd'\n", 21 | "dep_embd_dir = '../checkpoint/dep_embd'\n", 22 | "model_dir = '../checkpoint/modelv8'\n", 23 | "\n", 24 | "word_embd_dim = 200\n", 25 | "pos_embd_dim = 25\n", 26 | "dep_embd_dim = 25\n", 27 | "word_vocab_size = 306561\n", 28 | "pos_vocab_size = 10\n", 29 | "dep_vocab_size = 21\n", 30 | "relation_classes = 19\n", 31 | "word_state_size = 100\n", 32 | "other_state_size = 100\n", 33 | "batch_size = 10\n", 34 | "channels = 3\n", 35 | "lambda_l2 = 0.0001\n", 36 | "max_len_path = 10\n", 37 | "starter_learning_rate = 0.001\n", 38 | "decay_steps = 2000\n", 39 | "decay_rate = 0.96" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 7, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "with tf.name_scope(\"input\"):\n", 51 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 52 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 53 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 54 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 55 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 56 | "\n", 57 | "with tf.name_scope(\"word_embedding\"):\n", 58 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 59 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 60 | " embedding_init = W.assign(embedding_placeholder)\n", 61 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 62 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 63 | "\n", 64 | "with tf.name_scope(\"pos_embedding\"):\n", 65 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 66 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 67 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 68 | "\n", 69 | "with tf.name_scope(\"dep_embedding\"):\n", 70 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, 
dep_embd_dim]), name=\"W\")\n", 71 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 72 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 73 | "\n", 74 | "with tf.name_scope(\"wordout\"):\n", 75 | " embedded_word_drop = tf.nn.dropout(embedded_word, 0.5)\n", 76 | "\n", 77 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 78 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 79 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 80 | "\n", 81 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 82 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 83 | "\n", 84 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 85 | "\n", 86 | "with tf.variable_scope(\"word_lstm1\"):\n", 87 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 88 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 89 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 90 | "\n", 91 | "with tf.variable_scope(\"word_lstm2\"):\n", 92 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 93 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 94 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 95 | "\n", 96 | "with tf.variable_scope(\"pos_lstm1\"):\n", 97 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 98 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 99 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 100 | "\n", 101 | "with tf.variable_scope(\"pos_lstm2\"):\n", 102 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 103 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 104 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 105 | "\n", 106 | "with tf.variable_scope(\"dep_lstm1\"):\n", 107 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 108 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 109 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 110 | "\n", 111 | "with tf.variable_scope(\"dep_lstm2\"):\n", 112 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 113 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 114 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 115 | "\n", 116 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 117 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 118 | "\n", 119 | "state_series = tf.concat([state_series1, state_series2], 1)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 8, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "data": { 129 | "text/plain": [ 130 | "" 131 | ] 132 | }, 133 | "execution_count": 8, 134 | "metadata": {}, 
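modelv8 reuses the staircase learning-rate schedule of the earlier notebooks (starter rate 0.001, multiplied by 0.96 every 2000 steps). A tiny helper showing what `tf.train.exponential_decay(..., staircase=True)` works out to at a few steps; the step values are examples only:

```python
def decayed_lr(step, starter_lr=0.001, decay_rate=0.96, decay_steps=2000):
    """Staircase schedule equivalent to tf.train.exponential_decay(..., staircase=True)."""
    return starter_lr * decay_rate ** (step // decay_steps)

# With 800 batches per epoch, the rate first drops after 2.5 epochs.
for step in (0, 2000, 16000, 48000):
    print(step, decayed_lr(step))   # 0.001, 0.00096, ~0.00072, ~0.00038
```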
135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "state_series" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 9, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "with tf.name_scope(\"hidden_layer\"):\n", 151 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 152 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 153 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 154 | "\n", 155 | "with tf.name_scope(\"dropout\"):\n", 156 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 157 | "\n", 158 | "with tf.name_scope(\"softmax_layer\"):\n", 159 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 160 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 161 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 162 | " predictions = tf.argmax(logits, 1)\n", 163 | "\n", 164 | "tv_all = tf.trainable_variables()\n", 165 | "tv_regu = []\n", 166 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 167 | "for t in tv_all:\n", 168 | " if t.name not in non_reg:\n", 169 | " if(t.name.find('biases')==-1):\n", 170 | " tv_regu.append(t)\n", 171 | "\n", 172 | "with tf.name_scope(\"loss\"):\n", 173 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 174 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 175 | " total_loss = loss + l2_loss\n", 176 | "\n", 177 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 178 | "\n", 179 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 180 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 12, 186 | "metadata": { 187 | "collapsed": true 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "f = open(data_dir + '/word_embd_wiki', 'rb')\n", 192 | "vocab, word_embedding = pickle.load(f)\n", 193 | "f.close()\n", 194 | "\n", 195 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 196 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 197 | "\n", 198 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 199 | "\n", 200 | "pos_tags_vocab = []\n", 201 | "for line in open(data_dir + '/pos_tags.txt'):\n", 202 | " pos_tags_vocab.append(line.strip())\n", 203 | "\n", 204 | "dep_vocab = []\n", 205 | "for line in open(data_dir + '/dependency_types.txt'):\n", 206 | " dep_vocab.append(line.strip())\n", 207 | "\n", 208 | "relation_vocab = []\n", 209 | "for line in open(data_dir + '/relation_types.txt'):\n", 210 | " relation_vocab.append(line.strip())\n", 211 | "\n", 212 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 213 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 214 | "\n", 215 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 216 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 217 | "\n", 218 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 219 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 220 | "\n", 221 | "pos_tag2id['OTH'] = 9\n", 222 | "id2pos_tag[9] = 'OTH'\n", 223 | "\n", 224 | "dep2id['OTH'] = 20\n", 225 | "id2dep[20] = 'OTH'\n", 226 | "\n", 227 | "JJ_pos_tags = ['JJ', 
'JJR', 'JJS']\n", 228 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 229 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 230 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 231 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 232 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 233 | "\n", 234 | "def pos_tag(x):\n", 235 | " if x in JJ_pos_tags:\n", 236 | " return pos_tag2id['JJ']\n", 237 | " if x in NN_pos_tags:\n", 238 | " return pos_tag2id['NN']\n", 239 | " if x in RB_pos_tags:\n", 240 | " return pos_tag2id['RB']\n", 241 | " if x in PRP_pos_tags:\n", 242 | " return pos_tag2id['PRP']\n", 243 | " if x in VB_pos_tags:\n", 244 | " return pos_tag2id['VB']\n", 245 | " if x in _pos_tags:\n", 246 | " return pos_tag2id[x]\n", 247 | " else:\n", 248 | " return 9" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 13, 254 | "metadata": { 255 | "collapsed": true 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "sess = tf.Session()\n", 260 | "sess.run(tf.global_variables_initializer())\n", 261 | "saver = tf.train.Saver()" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 14, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 273 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 15, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "# pos_embedding_saver.save(sess, pos_embd_dir + '/pos_embd')\n", 285 | "# dep_embedding_saver.save(sess, dep_embd_dir + '/dep_embd')" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 16, 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "# model = tf.train.latest_checkpoint(model_dir)\n", 297 | "# saver.restore(sess, model)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 17, 303 | "metadata": { 304 | "scrolled": true 305 | }, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd_wiki/word_embd\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 317 | "word_embedding_saver.restore(sess, latest_embd)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 18, 323 | "metadata": { 324 | "collapsed": true 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "f = open(data_dir + '/train_paths', 'rb')\n", 329 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 330 | "f.close()\n", 331 | "\n", 332 | "relations = []\n", 333 | "for line in open(data_dir + '/train_relations.txt'):\n", 334 | " relations.append(line.strip().split()[1])\n", 335 | "\n", 336 | "length = len(word_p1)\n", 337 | "num_batches = int(length/batch_size)\n", 338 | "\n", 339 | "for i in range(length):\n", 340 | " for j, word in enumerate(word_p1[i]):\n", 341 | " word = word.lower()\n", 342 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 343 | " for k, word in enumerate(word_p2[i]):\n", 344 | " word = word.lower()\n", 345 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 346 | " for l, d in enumerate(dep_p1[i]):\n", 347 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 348 | " for m, d in enumerate(dep_p2[i]):\n", 349 | " dep_p2[i][m] = d 
if d in dep2id else 'OTH'\n", 350 | "\n", 351 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 352 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 353 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 354 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 355 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 356 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 357 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 358 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 359 | "path2_len = np.array([len(w) for w in word_p2])\n", 360 | "\n", 361 | "for i in range(length):\n", 362 | " for j, w in enumerate(word_p1[i]):\n", 363 | " word_p1_ids[i][j] = word2id[w]\n", 364 | " for j, w in enumerate(word_p2[i]):\n", 365 | " word_p2_ids[i][j] = word2id[w]\n", 366 | " for j, w in enumerate(pos_p1[i]):\n", 367 | " pos_p1_ids[i][j] = pos_tag(w)\n", 368 | " for j, w in enumerate(pos_p2[i]):\n", 369 | " pos_p2_ids[i][j] = pos_tag(w)\n", 370 | " for j, w in enumerate(dep_p1[i]):\n", 371 | " dep_p1_ids[i][j] = dep2id[w]\n", 372 | " for j, w in enumerate(dep_p2[i]):\n", 373 | " dep_p2_ids[i][j] = dep2id[w]" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 20, 379 | "metadata": { 380 | "scrolled": true 381 | }, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "Epoch: 1 Step: 800 loss: 1834.98579071\n", 388 | "Epoch: 2 Step: 1600 loss: 674.517935333\n", 389 | "Epoch: 3 Step: 2400 loss: 274.690037136\n", 390 | "Epoch: 4 Step: 3200 loss: 121.720370474\n", 391 | "Epoch: 5 Step: 4000 loss: 57.159888463\n", 392 | "Epoch: 6 Step: 4800 loss: 28.9982907772\n", 393 | "Epoch: 7 Step: 5600 loss: 16.1453694081\n", 394 | "Epoch: 8 Step: 6400 loss: 9.82831180751\n", 395 | "Epoch: 9 Step: 7200 loss: 6.59800129354\n", 396 | "Epoch: 10 Step: 8000 loss: 4.74144041985\n", 397 | "Epoch: 11 Step: 8800 loss: 3.62599194676\n", 398 | "Epoch: 12 Step: 9600 loss: 2.89340341777\n", 399 | "Epoch: 13 Step: 10400 loss: 2.38419453934\n", 400 | "Epoch: 14 Step: 11200 loss: 2.02302289039\n", 401 | "Epoch: 15 Step: 12000 loss: 1.72652905107\n", 402 | "Epoch: 16 Step: 12800 loss: 1.52037471995\n", 403 | "Epoch: 17 Step: 13600 loss: 1.32972317606\n", 404 | "Epoch: 18 Step: 14400 loss: 1.20203789197\n", 405 | "Epoch: 19 Step: 15200 loss: 1.08597725138\n", 406 | "Epoch: 20 Step: 16000 loss: 0.986360963807\n", 407 | "Epoch: 21 Step: 16800 loss: 0.898290062994\n", 408 | "Epoch: 22 Step: 17600 loss: 0.830409790054\n", 409 | "Epoch: 23 Step: 18400 loss: 0.782534971312\n", 410 | "Epoch: 24 Step: 19200 loss: 0.718714593202\n", 411 | "Epoch: 25 Step: 20000 loss: 0.668461259529\n", 412 | "Epoch: 26 Step: 20800 loss: 0.633814334124\n", 413 | "Epoch: 27 Step: 21600 loss: 0.594128802419\n", 414 | "Epoch: 28 Step: 22400 loss: 0.56548288703\n", 415 | "Epoch: 29 Step: 23200 loss: 0.526212990992\n", 416 | "Epoch: 30 Step: 24000 loss: 0.520392173678\n", 417 | "Epoch: 31 Step: 24800 loss: 0.487168050297\n", 418 | "Epoch: 32 Step: 25600 loss: 0.464592997283\n", 419 | "Epoch: 33 Step: 26400 loss: 0.445906150565\n", 420 | "Epoch: 34 Step: 27200 loss: 0.430318820551\n", 421 | "Epoch: 35 Step: 28000 loss: 0.415004718341\n", 422 | "Epoch: 36 Step: 28800 loss: 0.39048141662\n", 423 | "Epoch: 37 Step: 29600 loss: 0.378652221076\n", 424 | "Epoch: 38 Step: 30400 loss: 0.376885517202\n", 425 | "Epoch: 39 Step: 31200 loss: 0.361440741643\n", 426 | "Epoch: 40 Step: 32000 loss: 
0.345032765269\n", 427 | "Epoch: 41 Step: 32800 loss: 0.331929060183\n", 428 | "Epoch: 42 Step: 33600 loss: 0.322243774533\n", 429 | "Epoch: 43 Step: 34400 loss: 0.316909426395\n", 430 | "Epoch: 44 Step: 35200 loss: 0.307885918804\n", 431 | "Epoch: 45 Step: 36000 loss: 0.303443572205\n", 432 | "Epoch: 46 Step: 36800 loss: 0.284900524076\n", 433 | "Epoch: 47 Step: 37600 loss: 0.281887375377\n", 434 | "Epoch: 48 Step: 38400 loss: 0.279675952736\n", 435 | "Epoch: 49 Step: 39200 loss: 0.272306431141\n", 436 | "Epoch: 50 Step: 40000 loss: 0.267325288765\n", 437 | "Epoch: 51 Step: 40800 loss: 0.252997332923\n", 438 | "Epoch: 52 Step: 41600 loss: 0.257797217574\n", 439 | "Epoch: 53 Step: 42400 loss: 0.248855141364\n", 440 | "Epoch: 54 Step: 43200 loss: 0.241898285728\n", 441 | "Epoch: 55 Step: 44000 loss: 0.237289066594\n", 442 | "Epoch: 56 Step: 44800 loss: 0.241825930495\n", 443 | "Epoch: 57 Step: 45600 loss: 0.228385834955\n", 444 | "Epoch: 58 Step: 46400 loss: 0.225023462269\n", 445 | "Epoch: 59 Step: 47200 loss: 0.223864237741\n", 446 | "Epoch: 60 Step: 48000 loss: 0.216368767507\n", 447 | "Saved Model\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "num_epochs = 60\n", 453 | "for i in range(num_epochs):\n", 454 | " loss_per_epoch = 0\n", 455 | " for j in range(num_batches):\n", 456 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 457 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 458 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 459 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 460 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 461 | " \n", 462 | " feed_dict = {\n", 463 | " path_length:path_dict,\n", 464 | " word_ids:word_dict,\n", 465 | " pos_ids:pos_dict,\n", 466 | " dep_ids:dep_dict,\n", 467 | " y:y_dict}\n", 468 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 469 | " loss_per_epoch +=_loss\n", 470 | " if (j+1)%num_batches==0:\n", 471 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 472 | " \n", 473 | "saver.save(sess, model_dir + '/model')\n", 474 | "print(\"Saved Model\")" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 21, 480 | "metadata": { 481 | "scrolled": false 482 | }, 483 | "outputs": [ 484 | { 485 | "name": "stdout", 486 | "output_type": "stream", 487 | "text": [ 488 | "training accuracy 98.9625\n" 489 | ] 490 | } 491 | ], 492 | "source": [ 493 | "# training accuracy\n", 494 | "all_predictions = []\n", 495 | "for j in range(num_batches):\n", 496 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 497 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 498 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 499 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 500 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 501 | "\n", 502 | " feed_dict = {\n", 503 | " path_length:path_dict,\n", 504 | " word_ids:word_dict,\n", 505 | " pos_ids:pos_dict,\n", 506 | " dep_ids:dep_dict,\n", 507 | " y:y_dict}\n", 508 | " batch_predictions = sess.run(predictions, feed_dict)\n", 509 | " 
all_predictions.append(batch_predictions)\n", 510 | "\n", 511 | "y_pred = []\n", 512 | "for i in range(num_batches):\n", 513 | " for pred in all_predictions[i]:\n", 514 | " y_pred.append(pred)\n", 515 | "\n", 516 | "count = 0\n", 517 | "for i in range(batch_size*num_batches):\n", 518 | " count += y_pred[i]==rel_ids[i]\n", 519 | "accuracy = count/(batch_size*num_batches) * 100\n", 520 | "\n", 521 | "print(\"training accuracy\", accuracy)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 22, 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "test accuracy 61.8819188192\n" 534 | ] 535 | } 536 | ], 537 | "source": [ 538 | "f = open(data_dir + '/test_paths', 'rb')\n", 539 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 540 | "f.close()\n", 541 | "\n", 542 | "relations = []\n", 543 | "for line in open(data_dir + '/test_relations.txt'):\n", 544 | " relations.append(line.strip().split()[0])\n", 545 | "\n", 546 | "length = len(word_p1)\n", 547 | "num_batches = int(length/batch_size)\n", 548 | "\n", 549 | "for i in range(length):\n", 550 | " for j, word in enumerate(word_p1[i]):\n", 551 | " word = word.lower()\n", 552 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 553 | " for k, word in enumerate(word_p2[i]):\n", 554 | " word = word.lower()\n", 555 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 556 | " for l, d in enumerate(dep_p1[i]):\n", 557 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 558 | " for m, d in enumerate(dep_p2[i]):\n", 559 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 560 | "\n", 561 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 562 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 563 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 564 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 565 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 566 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 567 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 568 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 569 | "path2_len = np.array([len(w) for w in word_p2])\n", 570 | "\n", 571 | "for i in range(length):\n", 572 | " for j, w in enumerate(word_p1[i]):\n", 573 | " word_p1_ids[i][j] = word2id[w]\n", 574 | " for j, w in enumerate(word_p2[i]):\n", 575 | " word_p2_ids[i][j] = word2id[w]\n", 576 | " for j, w in enumerate(pos_p1[i]):\n", 577 | " pos_p1_ids[i][j] = pos_tag(w)\n", 578 | " for j, w in enumerate(pos_p2[i]):\n", 579 | " pos_p2_ids[i][j] = pos_tag(w)\n", 580 | " for j, w in enumerate(dep_p1[i]):\n", 581 | " dep_p1_ids[i][j] = dep2id[w]\n", 582 | " for j, w in enumerate(dep_p2[i]):\n", 583 | " dep_p2_ids[i][j] = dep2id[w]\n", 584 | "\n", 585 | "# test predictions\n", 586 | "all_predictions = []\n", 587 | "for j in range(num_batches):\n", 588 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 589 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 590 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 591 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 592 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 593 | "\n", 594 | " feed_dict = {\n", 595 | " path_length:path_dict,\n", 596 
| " word_ids:word_dict,\n", 597 | " pos_ids:pos_dict,\n", 598 | " dep_ids:dep_dict,\n", 599 | " y:y_dict}\n", 600 | " batch_predictions = sess.run(predictions, feed_dict)\n", 601 | " all_predictions.append(batch_predictions)\n", 602 | "\n", 603 | "y_pred = []\n", 604 | "for i in range(num_batches):\n", 605 | " for pred in all_predictions[i]:\n", 606 | " y_pred.append(pred)\n", 607 | "\n", 608 | "count = 0\n", 609 | "for i in range(batch_size*num_batches):\n", 610 | " count += y_pred[i]==rel_ids[i]\n", 611 | "accuracy = count/(batch_size*num_batches) * 100\n", 612 | "\n", 613 | "print(\"test accuracy\", accuracy)" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 15, 619 | "metadata": { 620 | "collapsed": true 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "f1 = f1_score(rel_ids[:batch_size*num_batches], y_pred, average='macro')" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 16, 630 | "metadata": {}, 631 | "outputs": [ 632 | { 633 | "data": { 634 | "text/plain": [ 635 | "0.62487150135880543" 636 | ] 637 | }, 638 | "execution_count": 16, 639 | "metadata": {}, 640 | "output_type": "execute_result" 641 | } 642 | ], 643 | "source": [ 644 | "f1" 645 | ] 646 | } 647 | ], 648 | "metadata": { 649 | "kernelspec": { 650 | "display_name": "Python 3", 651 | "language": "python", 652 | "name": "python3" 653 | }, 654 | "language_info": { 655 | "codemirror_mode": { 656 | "name": "ipython", 657 | "version": 3 658 | }, 659 | "file_extension": ".py", 660 | "mimetype": "text/x-python", 661 | "name": "python", 662 | "nbconvert_exporter": "python", 663 | "pygments_lexer": "ipython3", 664 | "version": "3.5.2" 665 | } 666 | }, 667 | "nbformat": 4, 668 | "nbformat_minor": 2 669 | } 670 | -------------------------------------------------------------------------------- /LCA Shortest Path/path_extractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 8, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import os\n", 12 | "from nltk.parse import stanford\n", 13 | "import nltk" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 10, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "# Dependency Tree\n", 25 | "from nltk.parse.stanford import StanfordDependencyParser\n", 26 | "dep_parser=StanfordDependencyParser(model_path=\"/home/shanu/nltk/jars/englishPCFG.ser.gz\")" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 11, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "def lca(tree, index1, index2):\n", 38 | " node = index1\n", 39 | " path1 = []\n", 40 | " path2 = []\n", 41 | " path1.append(index1)\n", 42 | " path2.append(index2)\n", 43 | " while(node != tree.root):\n", 44 | " node = tree.nodes[node['head']]\n", 45 | " path1.append(node)\n", 46 | " node = index2\n", 47 | " while(node != tree.root):\n", 48 | " node = tree.nodes[node['head']]\n", 49 | " path2.append(node)\n", 50 | " for l1, l2 in zip(path1[::-1],path2[::-1]):\n", 51 | " if(l1==l2):\n", 52 | " temp = l1\n", 53 | " return temp" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 12, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "def path_lca(tree, node, lca_node):\n", 65 | " path = []\n", 66 | " path.append(node)\n", 67 | " while(node != 
lca_node):\n", 68 | " node = tree.nodes[node['head']]\n", 69 | " path.append(node)\n", 70 | " return path" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 13, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "import _pickle \n", 82 | "f = open('../data/training_data', 'rb')\n", 83 | "sentences, e1, e2 = _pickle.load(f)\n", 84 | "f.close()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "sentences[7588] = 'The reaction mixture is kept in the dark at room temperature for 1.5 hours .'\n", 96 | "sentences[2608] = \"This strawberry sauce has about a million uses , is freezer-friendly , and is so much better than that jar of Smuckers strawberry sauce that you 've had sitting in your fridge since that time you made banana splits 1.5 years ago .\"" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": true 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "## Uncomment this for test set. \n", 108 | "# sentences[2590] = \"The pendant with the bail measure 1.25'' .\"\n", 109 | "# sentences[2664] = \"The cabinet encloses a 6.5 inch cone woofer , 4 inch cone midrange , and a 0.86 inch balanced dome tweeter .\"" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 20, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "word_path1 = []\n", 121 | "word_path2 = []\n", 122 | "rel_path1 = []\n", 123 | "rel_path2 = []\n", 124 | "pos_path1 = []\n", 125 | "pos_path2 = []\n", 126 | "for i in range(8000):\n", 127 | " word_path1.append(0)\n", 128 | " word_path2.append(0)\n", 129 | " rel_path1.append(0)\n", 130 | " rel_path2.append(0)\n", 131 | " pos_path1.append(0)\n", 132 | " pos_path2.append(0)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 38, 138 | "metadata": { 139 | "scrolled": true 140 | }, 141 | "outputs": [ 142 | { 143 | "name": "stdout", 144 | "output_type": "stream", 145 | "text": [ 146 | "7588 success\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "for i in range(8000):\n", 152 | " try:\n", 153 | " parse_tree = dep_parser.raw_parse(sentences[i])\n", 154 | " for trees in parse_tree:\n", 155 | " tree = trees\n", 156 | " node1 = tree.nodes[e1[i]+1]\n", 157 | " node2 = tree.nodes[e2[i]+1]\n", 158 | " if node1['address']!=None and node2['address']!=None:\n", 159 | " print(i, \"success\")\n", 160 | " lca_node = lca(tree, node1, node2)\n", 161 | " path1 = path_lca(tree, node1, lca_node)\n", 162 | " path2 = path_lca(tree, node2, lca_node)\n", 163 | "\n", 164 | " word_path1[i] = [p[\"word\"] for p in path1]\n", 165 | " word_path2[i] = [p[\"word\"] for p in path2]\n", 166 | " rel_path1[i] = [p[\"rel\"] for p in path1]\n", 167 | " rel_path2[i] = [p[\"rel\"] for p in path2]\n", 168 | " pos_path1[i] = [p[\"tag\"] for p in path1]\n", 169 | " pos_path2[i] = [p[\"tag\"] for p in path2]\n", 170 | " else:\n", 171 | " print(i, node1[\"address\"], node2[\"address\"])\n", 172 | " except AssertionError:\n", 173 | " print(i, \"error\")\n", 174 | " " 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 39, 180 | "metadata": { 181 | "collapsed": true 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "file = open('../data/train_paths', 'wb')\n", 186 | "_pickle.dump([word_path1, word_path2, rel_path1, rel_path2, pos_path1, pos_path2], file)" 
187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "kernelspec": { 192 | "display_name": "Python 3", 193 | "language": "python", 194 | "name": "python3" 195 | }, 196 | "language_info": { 197 | "codemirror_mode": { 198 | "name": "ipython", 199 | "version": 3 200 | }, 201 | "file_extension": ".py", 202 | "mimetype": "text/x-python", 203 | "name": "python", 204 | "nbconvert_exporter": "python", 205 | "pygments_lexer": "ipython3", 206 | "version": "3.5.2" 207 | } 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 2 211 | } 212 | -------------------------------------------------------------------------------- /LCA SubTree/README.md: -------------------------------------------------------------------------------- 1 | ## Relation Classification using LSTMs on LCA Sub Tree 2 | 3 | LSTMs are applied on the Sub Tree of Lowest Ancestor of two entities as a sequence when traversed. 4 | 5 | Model | Train-Accuracy | Test-Accuracy| Epochs 6 | --- | --- | ---| --- 7 | model2v1 | ? | 54.6 | 11 8 | model2v2 | ? | 55.2 | 10 9 | 10 | 11 | 12 | * dropout on hidden layer of 0.3 13 | * Learning rate = 0.001 14 | * Learning rate decay = 0.96 15 | * state size = 100 16 | * lambda_l2 = 0.0001 17 | 18 | 19 | ### [model2v1](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20SubTree/model2v1.ipynb) 20 | * Foward LSTM on the sequnence traversed on LCA Sub Tree. 21 | 22 | ### [model2v2](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20SubTree/model2v2.ipynb) 23 | * Bidirectional LSTM on the sequnence traversed on LCA Sub Tree. 24 | -------------------------------------------------------------------------------- /LCA SubTree/model2v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/model2v1'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | "word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "state_size = 100\n", 30 | "batch_size = 10\n", 31 | "channels = 3\n", 32 | "lambda_l2 = 0.0001\n", 33 | "max_len_path = 70\n", 34 | "starter_learning_rate = 0.001\n", 35 | "decay_steps = 2000\n", 36 | "decay_rate = 0.96" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "with tf.name_scope(\"input\"):\n", 48 | " path_length = tf.placeholder(tf.int32, shape=[batch_size], name=\"path1_length\")\n", 49 | " word_ids = tf.placeholder(tf.int32, shape=[batch_size, max_len_path], name=\"word_ids\")\n", 50 | " pos_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"pos_ids\")\n", 51 | " dep_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"dep_ids\")\n", 52 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 53 | "\n", 54 | "with tf.name_scope(\"word_embedding\"):\n", 55 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 56 | " 
embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 57 | " embedding_init = W.assign(embedding_placeholder)\n", 58 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 59 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 60 | "\n", 61 | "with tf.name_scope(\"pos_embedding\"):\n", 62 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 63 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 64 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 65 | "\n", 66 | "with tf.name_scope(\"dep_embedding\"):\n", 67 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 68 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 69 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "with tf.variable_scope(\"word_lstm\"):\n", 81 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 82 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word, sequence_length=path_length, dtype=tf.float32)\n", 83 | " state_series_word = tf.reduce_max(state_series, axis=1)\n", 84 | "\n", 85 | "with tf.variable_scope(\"pos_lstm\"):\n", 86 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 87 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos, sequence_length=path_length, dtype=tf.float32)\n", 88 | " state_series_pos = tf.reduce_max(state_series, axis=1)\n", 89 | "\n", 90 | "with tf.variable_scope(\"dep_lstm\"):\n", 91 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 92 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep, sequence_length=path_length, dtype=tf.float32)\n", 93 | " state_series_dep = tf.reduce_max(state_series, axis=1)\n", 94 | " \n", 95 | "state_series = tf.concat([state_series_word, state_series_pos, state_series_dep], 1)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 5, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "data": { 105 | "text/plain": [ 106 | "" 107 | ] 108 | }, 109 | "execution_count": 5, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "state_series" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 6, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "with tf.name_scope(\"hidden_layer\"):\n", 127 | " W = tf.Variable(tf.truncated_normal([300, 100], -0.1, 0.1), name=\"W\")\n", 128 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 129 | " y_hidden_layer = tf.nn.relu(tf.matmul(state_series, W) + b)\n", 130 | "\n", 131 | "with tf.name_scope(\"dropout\"):\n", 132 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 133 | "\n", 134 | "with tf.name_scope(\"softmax_layer\"):\n", 135 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 136 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 137 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 138 | " predictions = tf.argmax(logits, 1)\n", 139 | "\n", 140 | "tv_all = tf.trainable_variables()\n", 141 | "tv_regu = []\n", 142 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 143 | "for t in tv_all:\n", 144 | " if t.name not in 
non_reg:\n", 145 | " if(t.name.find('biases')==-1):\n", 146 | " tv_regu.append(t)\n", 147 | "\n", 148 | "with tf.name_scope(\"loss\"):\n", 149 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 150 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 151 | " total_loss = loss + l2_loss\n", 152 | "\n", 153 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 154 | "\n", 155 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 156 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 2, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 168 | "vocab = pickle.load(f)\n", 169 | "f.close()\n", 170 | "\n", 171 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 172 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 173 | "\n", 174 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 175 | "word2id[unknown_token] = word_vocab_size -1\n", 176 | "id2word[word_vocab_size-1] = unknown_token\n", 177 | "\n", 178 | "pos_tags_vocab = []\n", 179 | "for line in open(data_dir + '/pos_tags.txt'):\n", 180 | " pos_tags_vocab.append(line.strip())\n", 181 | "\n", 182 | "dep_vocab = []\n", 183 | "for line in open(data_dir + '/dependency_types.txt'):\n", 184 | " dep_vocab.append(line.strip())\n", 185 | "\n", 186 | "relation_vocab = []\n", 187 | "for line in open(data_dir + '/relation_types.txt'):\n", 188 | " relation_vocab.append(line.strip())\n", 189 | "\n", 190 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 191 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 192 | "\n", 193 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 194 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 195 | "\n", 196 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 197 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 198 | "\n", 199 | "pos_tag2id['OTH'] = 9\n", 200 | "id2pos_tag[9] = 'OTH'\n", 201 | "\n", 202 | "dep2id['OTH'] = 20\n", 203 | "id2dep[20] = 'OTH'\n", 204 | "\n", 205 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 206 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 207 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 208 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 209 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 210 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 211 | "\n", 212 | "def pos_tag(x):\n", 213 | " if x in JJ_pos_tags:\n", 214 | " return pos_tag2id['JJ']\n", 215 | " if x in NN_pos_tags:\n", 216 | " return pos_tag2id['NN']\n", 217 | " if x in RB_pos_tags:\n", 218 | " return pos_tag2id['RB']\n", 219 | " if x in PRP_pos_tags:\n", 220 | " return pos_tag2id['PRP']\n", 221 | " if x in VB_pos_tags:\n", 222 | " return pos_tag2id['VB']\n", 223 | " if x in _pos_tags:\n", 224 | " return pos_tag2id[x]\n", 225 | " else:\n", 226 | " return 9" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 9, 232 | "metadata": { 233 | "collapsed": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "sess = tf.Session()\n", 238 | "sess.run(tf.global_variables_initializer())\n", 239 | "saver = tf.train.Saver()" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": { 246 | 
"collapsed": true 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "# f = open('data/word_embedding', 'rb')\n", 251 | "# word_embedding = pickle.load(f)\n", 252 | "# f.close()\n", 253 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 254 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 12, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "# model = tf.train.latest_checkpoint(model_dir)\n", 266 | "# saver.restore(sess, model)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 9, 272 | "metadata": { 273 | "scrolled": true 274 | }, 275 | "outputs": [ 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 281 | ] 282 | } 283 | ], 284 | "source": [ 285 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 286 | "word_embedding_saver.restore(sess, latest_embd)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 13, 292 | "metadata": { 293 | "collapsed": true 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "f = open(data_dir + '/train_lca_paths', 'rb')\n", 298 | "word_p, dep_p, pos_p = pickle.load(f)\n", 299 | "f.close()\n", 300 | "relations = []\n", 301 | "for line in open(data_dir + '/train_relations.txt'):\n", 302 | " relations.append(line.strip().split()[1])\n", 303 | "\n", 304 | "length = len(word_p)\n", 305 | "num_batches = int(length/batch_size)\n", 306 | "\n", 307 | "for i in range(length):\n", 308 | " for j, word in enumerate(word_p[i]):\n", 309 | " word = word.lower()\n", 310 | " word_p[i][j] = word if word in word2id else unknown_token \n", 311 | " for l, d in enumerate(dep_p[i]):\n", 312 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 313 | " \n", 314 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 315 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 316 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 317 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 318 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 319 | "\n", 320 | "for i in range(length):\n", 321 | " for j, w in enumerate(word_p[i]):\n", 322 | " word_p_ids[i][j] = word2id[w]\n", 323 | " \n", 324 | " for j, w in enumerate(pos_p[i]):\n", 325 | " pos_p_ids[i][j] = pos_tag(w)\n", 326 | " \n", 327 | " for j, w in enumerate(dep_p[i]):\n", 328 | " dep_p_ids[i][j] = dep2id[w]" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "scrolled": true 336 | }, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "Epoch: 1 Step: 800 loss: 2.85355205297\n", 343 | "Saved Model\n", 344 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 345 | "Saved Model\n", 346 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 347 | "Saved Model\n", 348 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 349 | "Saved Model\n", 350 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 351 | "Saved Model\n", 352 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 353 | "Saved Model\n", 354 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 355 | "Saved Model\n", 356 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 357 | "Saved Model\n", 358 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 359 | "Saved Model\n", 360 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 361 | 
"Saved Model\n", 362 | "Epoch: 11 Step: 8800 loss: 2.67341704309\n", 363 | "Saved Model\n", 364 | "Epoch: 12 Step: 9600 loss: 2.67321884066\n", 365 | "Saved Model\n", 366 | "Epoch: 13 Step: 10400 loss: 2.67310401961\n", 367 | "Saved Model\n", 368 | "Epoch: 14 Step: 11200 loss: 2.67295600712\n", 369 | "Saved Model\n", 370 | "Epoch: 15 Step: 12000 loss: 2.67288722694\n", 371 | "Saved Model\n", 372 | "Epoch: 16 Step: 12800 loss: 2.67282888472\n", 373 | "Saved Model\n", 374 | "Epoch: 17 Step: 13600 loss: 2.67277920395\n", 375 | "Saved Model\n", 376 | "Epoch: 18 Step: 14400 loss: 2.6727619794\n", 377 | "Saved Model\n", 378 | "Epoch: 19 Step: 15200 loss: 2.67268569678\n", 379 | "Saved Model\n", 380 | "Epoch: 20 Step: 16000 loss: 2.67266457796\n", 381 | "Saved Model\n", 382 | "Epoch: 21 Step: 16800 loss: 2.67263956338\n", 383 | "Saved Model\n", 384 | "Epoch: 22 Step: 17600 loss: 2.67261722207\n", 385 | "Saved Model\n", 386 | "Epoch: 23 Step: 18400 loss: 2.67261824235\n", 387 | "Saved Model\n", 388 | "Epoch: 24 Step: 19200 loss: 2.67256126881\n", 389 | "Saved Model\n", 390 | "Epoch: 25 Step: 20000 loss: 2.6725519672\n", 391 | "Saved Model\n", 392 | "Epoch: 26 Step: 20800 loss: 2.67253558069\n", 393 | "Saved Model\n", 394 | "Epoch: 27 Step: 21600 loss: 2.67252239197\n", 395 | "Saved Model\n", 396 | "Epoch: 28 Step: 22400 loss: 2.67252858594\n", 397 | "Saved Model\n", 398 | "Epoch: 29 Step: 23200 loss: 2.67248077154\n", 399 | "Saved Model\n", 400 | "Epoch: 30 Step: 24000 loss: 2.67247578681\n", 401 | "Saved Model\n", 402 | "Epoch: 31 Step: 24800 loss: 2.67246250227\n", 403 | "Saved Model\n", 404 | "Epoch: 32 Step: 25600 loss: 2.67245363146\n", 405 | "Saved Model\n", 406 | "Epoch: 33 Step: 26400 loss: 2.67246143714\n", 407 | "Saved Model\n", 408 | "Epoch: 34 Step: 27200 loss: 2.6724195759\n", 409 | "Saved Model\n", 410 | "Epoch: 35 Step: 28000 loss: 2.67241657913\n", 411 | "Saved Model\n", 412 | "Epoch: 36 Step: 28800 loss: 2.67240460932\n", 413 | "Saved Model\n", 414 | "Epoch: 37 Step: 29600 loss: 2.67239822775\n", 415 | "Saved Model\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "num_epochs = 40\n", 421 | "for i in range(num_epochs):\n", 422 | " loss_per_epoch = 0\n", 423 | " for j in range(num_batches):\n", 424 | " feed_dict = {\n", 425 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 426 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 427 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 428 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 429 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 430 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 431 | " loss_per_epoch +=_loss\n", 432 | " if (j+1)%num_batches==0:\n", 433 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 434 | " saver.save(sess, model_dir + '/model')\n", 435 | " print(\"Saved Model\")" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": { 442 | "collapsed": true, 443 | "scrolled": false 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "# training accuracy\n", 448 | "all_predictions = []\n", 449 | "for j in range(num_batches):\n", 450 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 451 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 452 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 453 | " 
dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 454 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 455 | "\n", 456 | " feed_dict = {\n", 457 | " path_length:path_dict,\n", 458 | " word_ids:word_dict,\n", 459 | " pos_ids:pos_dict,\n", 460 | " dep_ids:dep_dict,\n", 461 | " y:y_dict}\n", 462 | " batch_predictions = sess.run(predictions, feed_dict)\n", 463 | " all_predictions.append(batch_predictions)\n", 464 | "\n", 465 | "y_pred = []\n", 466 | "for i in range(num_batches):\n", 467 | " for pred in all_predictions[i]:\n", 468 | " y_pred.append(pred)\n", 469 | "\n", 470 | "count = 0\n", 471 | "for i in range(batch_size*num_batches):\n", 472 | " count += y_pred[i]==rel_ids[i]\n", 473 | "accuracy = count/(batch_size*num_batches) * 100\n", 474 | "\n", 475 | "print(\"training accuracy\", accuracy)" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 11, 481 | "metadata": { 482 | "collapsed": true 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "f = open(data_dir + '/test_lca_paths', 'rb')\n", 487 | "word_p, dep_p, pos_p = pickle.load(f)\n", 488 | "f.close()\n", 489 | "\n", 490 | "relations = []\n", 491 | "for line in open(data_dir + '/test_relations.txt'):\n", 492 | " relations.append(line.strip().split()[0])\n", 493 | "\n", 494 | "length = len(word_p1)\n", 495 | "num_batches = int(length/batch_size)\n", 496 | "\n", 497 | "for i in range(length):\n", 498 | " for j, word in enumerate(word_p[i]):\n", 499 | " word = word.lower()\n", 500 | " word_p[i][j] = word if word in word2id else unknown_token \n", 501 | " for l, d in enumerate(dep_p[i]):\n", 502 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 503 | " \n", 504 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 505 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 506 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 507 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 508 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 509 | "\n", 510 | "for i in range(length):\n", 511 | " for j, w in enumerate(word_p[i]):\n", 512 | " word_p_ids[i][j] = word2id[w]\n", 513 | " \n", 514 | " for j, w in enumerate(pos_p[i]):\n", 515 | " pos_p_ids[i][j] = pos_tag(w)\n", 516 | " \n", 517 | " for j, w in enumerate(dep_p[i]):\n", 518 | " dep_p_ids[i][j] = dep2id[w]\n", 519 | "\n", 520 | "# test predictions\n", 521 | "all_predictions = []\n", 522 | "for j in range(num_batches):\n", 523 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 524 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 525 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 526 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 527 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 528 | "\n", 529 | " feed_dict = {\n", 530 | " path_length:path_dict,\n", 531 | " word_ids:word_dict,\n", 532 | " pos_ids:pos_dict,\n", 533 | " dep_ids:dep_dict,\n", 534 | " y:y_dict}\n", 535 | " batch_predictions = sess.run(predictions, feed_dict)\n", 536 | " all_predictions.append(batch_predictions)\n", 537 | "\n", 538 | "y_pred = []\n", 539 | "for i in range(num_batches):\n", 540 | " for pred in all_predictions[i]:\n", 541 | " y_pred.append(pred)\n", 542 | "\n", 543 | "count = 0\n", 544 | "for i in range(batch_size*num_batches):\n", 545 | " count 
+= y_pred[i]==rel_ids[i]\n", 546 | "accuracy = count/(batch_size*num_batches) * 100\n", 547 | "\n", 548 | "print(\"test accuracy\", accuracy)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": true 556 | }, 557 | "outputs": [], 558 | "source": [] 559 | } 560 | ], 561 | "metadata": { 562 | "kernelspec": { 563 | "display_name": "Python 3", 564 | "language": "python", 565 | "name": "python3" 566 | }, 567 | "language_info": { 568 | "codemirror_mode": { 569 | "name": "ipython", 570 | "version": 3 571 | }, 572 | "file_extension": ".py", 573 | "mimetype": "text/x-python", 574 | "name": "python", 575 | "nbconvert_exporter": "python", 576 | "pygments_lexer": "ipython3", 577 | "version": "3.5.2" 578 | } 579 | }, 580 | "nbformat": 4, 581 | "nbformat_minor": 2 582 | } 583 | -------------------------------------------------------------------------------- /LCA SubTree/model2v2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "\n", 18 | "data_dir = '../data'\n", 19 | "ckpt_dir = '../checkpoint'\n", 20 | "word_embd_dir = '../checkpoint/word_embd'\n", 21 | "model_dir = '../checkpoint/model2v2'\n", 22 | "\n", 23 | "word_embd_dim = 100\n", 24 | "pos_embd_dim = 25\n", 25 | "dep_embd_dim = 25\n", 26 | "word_vocab_size = 400001\n", 27 | "pos_vocab_size = 10\n", 28 | "dep_vocab_size = 21\n", 29 | "relation_classes = 19\n", 30 | "state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 70\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " dep_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " 
embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "embedded_word_rev = tf.reverse(embedded_word, [1])\n", 82 | "embedded_pos_rev = tf.reverse(embedded_pos, [1])\n", 83 | "embedded_dep_rev = tf.reverse(embedded_dep, [1])\n", 84 | "path_length_rev = tf.reverse(path_length, [0])" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "with tf.variable_scope(\"word_lstm_fw\"):\n", 96 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 97 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word, sequence_length=path_length, dtype=tf.float32)\n", 98 | " state_series_word_fw = tf.reduce_max(state_series, axis=1)\n", 99 | "\n", 100 | "with tf.variable_scope(\"pos_lstm_fw\"):\n", 101 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 102 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos, sequence_length=path_length, dtype=tf.float32)\n", 103 | " state_series_pos_fw = tf.reduce_max(state_series, axis=1)\n", 104 | "\n", 105 | "with tf.variable_scope(\"dep_lstm_fw\"):\n", 106 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 107 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep, sequence_length=path_length, dtype=tf.float32)\n", 108 | " state_series_dep_fw = tf.reduce_max(state_series, axis=1)\n", 109 | " \n", 110 | "with tf.variable_scope(\"word_lstm_bw\"):\n", 111 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 112 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 113 | " state_series_word_bw = tf.reduce_max(state_series, axis=1)\n", 114 | "\n", 115 | "with tf.variable_scope(\"pos_lstm_bw\"):\n", 116 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 117 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 118 | " state_series_pos_bw = tf.reduce_max(state_series, axis=1)\n", 119 | "\n", 120 | "with tf.variable_scope(\"dep_lstm_bw\"):\n", 121 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 122 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 123 | " state_series_dep_bw = tf.reduce_max(state_series, axis=1)\n", 124 | " \n", 125 | "state_series = tf.concat([state_series_word_fw, state_series_pos_fw, state_series_dep_fw, state_series_word_bw, state_series_pos_bw, state_series_dep_bw], 1)\n" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "" 137 | ] 138 | }, 139 | "execution_count": 5, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "state_series" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 6, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "with tf.name_scope(\"hidden_layer\"):\n", 157 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 158 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 159 | " y_hidden_layer = tf.matmul(state_series, W) 
+ b\n", 160 | "\n", 161 | "with tf.name_scope(\"dropout\"):\n", 162 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 163 | "\n", 164 | "with tf.name_scope(\"softmax_layer\"):\n", 165 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 166 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 167 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 168 | " predictions = tf.argmax(logits, 1)\n", 169 | "\n", 170 | "tv_all = tf.trainable_variables()\n", 171 | "tv_regu = []\n", 172 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 173 | "for t in tv_all:\n", 174 | " if t.name not in non_reg:\n", 175 | " if(t.name.find('biases')==-1):\n", 176 | " tv_regu.append(t)\n", 177 | "\n", 178 | "with tf.name_scope(\"loss\"):\n", 179 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 180 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 181 | " total_loss = loss + l2_loss\n", 182 | "\n", 183 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 184 | "\n", 185 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 186 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 7, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 198 | "vocab = pickle.load(f)\n", 199 | "f.close()\n", 200 | "\n", 201 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 202 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 203 | "\n", 204 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 205 | "word2id[unknown_token] = word_vocab_size -1\n", 206 | "id2word[word_vocab_size-1] = unknown_token\n", 207 | "\n", 208 | "pos_tags_vocab = []\n", 209 | "for line in open(data_dir + '/pos_tags.txt'):\n", 210 | " pos_tags_vocab.append(line.strip())\n", 211 | "\n", 212 | "dep_vocab = []\n", 213 | "for line in open(data_dir + '/dependency_types.txt'):\n", 214 | " dep_vocab.append(line.strip())\n", 215 | "\n", 216 | "relation_vocab = []\n", 217 | "for line in open(data_dir + '/relation_types.txt'):\n", 218 | " relation_vocab.append(line.strip())\n", 219 | "\n", 220 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 221 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 222 | "\n", 223 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 224 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 225 | "\n", 226 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 227 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 228 | "\n", 229 | "pos_tag2id['OTH'] = 9\n", 230 | "id2pos_tag[9] = 'OTH'\n", 231 | "\n", 232 | "dep2id['OTH'] = 20\n", 233 | "id2dep[20] = 'OTH'\n", 234 | "\n", 235 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 236 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 237 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 238 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 239 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 240 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 241 | "\n", 242 | "def pos_tag(x):\n", 243 | " if x in JJ_pos_tags:\n", 244 | " return pos_tag2id['JJ']\n", 245 | " if x in NN_pos_tags:\n", 246 | " 
return pos_tag2id['NN']\n", 247 | " if x in RB_pos_tags:\n", 248 | " return pos_tag2id['RB']\n", 249 | " if x in PRP_pos_tags:\n", 250 | " return pos_tag2id['PRP']\n", 251 | " if x in VB_pos_tags:\n", 252 | " return pos_tag2id['VB']\n", 253 | " if x in _pos_tags:\n", 254 | " return pos_tag2id[x]\n", 255 | " else:\n", 256 | " return 9" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 8, 262 | "metadata": { 263 | "collapsed": true 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "sess = tf.Session()\n", 268 | "sess.run(tf.global_variables_initializer())\n", 269 | "saver = tf.train.Saver()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 9, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# f = open('data/word_embedding', 'rb')\n", 281 | "# word_embedding = pickle.load(f)\n", 282 | "# f.close()\n", 283 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 284 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 10, 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "# model = tf.train.latest_checkpoint(model_dir)\n", 296 | "# saver.restore(sess, model)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 11, 302 | "metadata": { 303 | "scrolled": true 304 | }, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 316 | "word_embedding_saver.restore(sess, latest_embd)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 12, 322 | "metadata": { 323 | "collapsed": true 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "f = open(data_dir + '/train_lca_paths', 'rb')\n", 328 | "word_p, dep_p, pos_p = pickle.load(f)\n", 329 | "f.close()\n", 330 | "relations = []\n", 331 | "for line in open(data_dir + '/train_relations.txt'):\n", 332 | " relations.append(line.strip().split()[1])\n", 333 | "\n", 334 | "length = len(word_p)\n", 335 | "num_batches = int(length/batch_size)\n", 336 | "\n", 337 | "for i in range(length):\n", 338 | " for j, word in enumerate(word_p[i]):\n", 339 | " word = word.lower()\n", 340 | " word_p[i][j] = word if word in word2id else unknown_token \n", 341 | " for l, d in enumerate(dep_p[i]):\n", 342 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 343 | " \n", 344 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 345 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 346 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 347 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 348 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 349 | "\n", 350 | "for i in range(length):\n", 351 | " for j, w in enumerate(word_p[i]):\n", 352 | " word_p_ids[i][j] = word2id[w]\n", 353 | " \n", 354 | " for j, w in enumerate(pos_p[i]):\n", 355 | " pos_p_ids[i][j] = pos_tag(w)\n", 356 | " \n", 357 | " for j, w in enumerate(dep_p[i]):\n", 358 | " dep_p_ids[i][j] = dep2id[w]" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 13, 364 | "metadata": { 365 | "scrolled": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | 
"text": [ 372 | "Epoch: 1 Step: 800 loss: 2.85300489247\n", 373 | "Saved Model\n", 374 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 375 | "Saved Model\n", 376 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 377 | "Saved Model\n", 378 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 379 | "Saved Model\n", 380 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 381 | "Saved Model\n", 382 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 383 | "Saved Model\n", 384 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 385 | "Saved Model\n", 386 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 387 | "Saved Model\n", 388 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 389 | "Saved Model\n", 390 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 391 | "Saved Model\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "num_epochs = 10\n", 397 | "for i in range(num_epochs):\n", 398 | " loss_per_epoch = 0\n", 399 | " for j in range(num_batches):\n", 400 | " feed_dict = {\n", 401 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 402 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 403 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 404 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 405 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 406 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 407 | " loss_per_epoch +=_loss\n", 408 | " if (j+1)%num_batches==0:\n", 409 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 410 | " saver.save(sess, model_dir + '/model')\n", 411 | " print(\"Saved Model\")" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": true, 419 | "scrolled": false 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "# training accuracy\n", 424 | "all_predictions = []\n", 425 | "for j in range(num_batches):\n", 426 | " feed_dict = {\n", 427 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 428 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 429 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 430 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 431 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 432 | " batch_predictions = sess.run(predictions, feed_dict)\n", 433 | " all_predictions.append(batch_predictions)\n", 434 | "\n", 435 | "y_pred = []\n", 436 | "for i in range(num_batches):\n", 437 | " for pred in all_predictions[i]:\n", 438 | " y_pred.append(pred)\n", 439 | "\n", 440 | "count = 0\n", 441 | "for i in range(batch_size*num_batches):\n", 442 | " count += y_pred[i]==rel_ids[i]\n", 443 | "accuracy = count/(batch_size*num_batches) * 100\n", 444 | "\n", 445 | "print(\"training accuracy\", accuracy)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 11, 451 | "metadata": { 452 | "collapsed": true 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "f = open(data_dir + '/test_lca_paths', 'rb')\n", 457 | "word_p, dep_p, pos_p = pickle.load(f)\n", 458 | "f.close()\n", 459 | "\n", 460 | "relations = []\n", 461 | "for line in open(data_dir + '/test_relations.txt'):\n", 462 | " relations.append(line.strip().split()[0])\n", 463 | "\n", 464 | "length = len(word_p1)\n", 465 | "num_batches = int(length/batch_size)\n", 466 | "\n", 467 | "for i in range(length):\n", 468 | " for j, word in enumerate(word_p[i]):\n", 469 | " word = word.lower()\n", 470 | " word_p[i][j] = word if word in word2id else unknown_token \n", 471 | " for l, d in enumerate(dep_p[i]):\n", 472 | " 
dep_p[i][l] = d if d in dep2id else 'OTH'\n", 473 | " \n", 474 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 475 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 476 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 477 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 478 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 479 | "\n", 480 | "for i in range(length):\n", 481 | " for j, w in enumerate(word_p[i]):\n", 482 | " word_p_ids[i][j] = word2id[w]\n", 483 | " \n", 484 | " for j, w in enumerate(pos_p[i]):\n", 485 | " pos_p_ids[i][j] = pos_tag(w)\n", 486 | " \n", 487 | " for j, w in enumerate(dep_p[i]):\n", 488 | " dep_p_ids[i][j] = dep2id[w]\n", 489 | "\n", 490 | "# test predictions\n", 491 | "all_predictions = []\n", 492 | "for j in range(num_batches):\n", 493 | " feed_dict = {\n", 494 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 495 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 496 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 497 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 498 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 499 | " batch_predictions = sess.run(predictions, feed_dict)\n", 500 | " all_predictions.append(batch_predictions)\n", 501 | "\n", 502 | "y_pred = []\n", 503 | "for i in range(num_batches):\n", 504 | " for pred in all_predictions[i]:\n", 505 | " y_pred.append(pred)\n", 506 | "\n", 507 | "count = 0\n", 508 | "for i in range(batch_size*num_batches):\n", 509 | " count += y_pred[i]==rel_ids[i]\n", 510 | "accuracy = count/(batch_size*num_batches) * 100\n", 511 | "\n", 512 | "print(\"test accuracy\", accuracy)" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "collapsed": true 520 | }, 521 | "outputs": [], 522 | "source": [] 523 | } 524 | ], 525 | "metadata": { 526 | "kernelspec": { 527 | "display_name": "Python 3", 528 | "language": "python", 529 | "name": "python3" 530 | }, 531 | "language_info": { 532 | "codemirror_mode": { 533 | "name": "ipython", 534 | "version": 3 535 | }, 536 | "file_extension": ".py", 537 | "mimetype": "text/x-python", 538 | "name": "python", 539 | "nbconvert_exporter": "python", 540 | "pygments_lexer": "ipython3", 541 | "version": "3.5.2" 542 | } 543 | }, 544 | "nbformat": 4, 545 | "nbformat_minor": 2 546 | } 547 | -------------------------------------------------------------------------------- /LCA SubTree/path_extractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import os\n", 12 | "from nltk.parse import stanford\n", 13 | "import nltk\n", 14 | "os.environ['STANFORD_PARSER'] = '/home/shanu/nltk/jars/stanford-parser.jar'\n", 15 | "os.environ['STANFORD_MODELS'] = '/home/shanu/nltk/jars/stanford-parser-3.7.0-models.jar'" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "# Dependency Tree\n", 27 | "from nltk.parse.stanford import StanfordDependencyParser\n", 28 | "dep_parser=StanfordDependencyParser(model_path=\"/home/shanu/nltk/jars/englishPCFG.ser.gz\")" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": { 35 | "collapsed": true 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "def lca(tree, index1, 
index2):\n", 40 | " node = index1\n", 41 | " path1 = []\n", 42 | " path2 = []\n", 43 | " path1.append(index1)\n", 44 | " path2.append(index2)\n", 45 | " while(node != tree.root):\n", 46 | " node = tree.nodes[node['head']]\n", 47 | " path1.append(node)\n", 48 | " node = index2\n", 49 | " while(node != tree.root):\n", 50 | " node = tree.nodes[node['head']]\n", 51 | " path2.append(node)\n", 52 | " for l1, l2 in zip(path1[::-1],path2[::-1]):\n", 53 | " if(l1==l2):\n", 54 | " temp = l1\n", 55 | " return temp" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 4, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "def path_lca(tree, node, lca_node):\n", 67 | " path = []\n", 68 | " path.append(node)\n", 69 | " while(node != lca_node):\n", 70 | " node = tree.nodes[node['head']]\n", 71 | " path.append(node)\n", 72 | " return path" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": { 79 | "collapsed": true 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "def seq(lca):\n", 84 | " l=[lca]\n", 85 | " for key in tree.nodes[lca]['deps']:\n", 86 | " for i in tree.nodes[lca]['deps'][key]:\n", 87 | " l.extend(seq(i))\n", 88 | " return l" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "import _pickle \n", 109 | "f = open('../data/training_data', 'rb')\n", 110 | "sentences, e1, e2 = _pickle.load(f)\n", 111 | "f.close()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 9, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "sentences[7588] = 'The reaction mixture is kept in the dark at room temperature for 1.5 hours .'\n", 123 | "sentences[2608] = \"This strawberry sauce has about a million uses , is freezer-friendly , and is so much better than that jar of Smuckers strawberry sauce that you 've had sitting in your fridge since that time you made banana splits 1.5 years ago .\"" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 41, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "# sentences[2590] = \"The pendant with the bail measure 1.25'' .\"\n", 135 | "# sentences[2664] = \"The cabinet encloses a 6.5 inch cone woofer , 4 inch cone midrange , and a 0.86 inch balanced dome tweeter .\"" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 12, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "length = len(sentences)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 13, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "word_p = []\n", 158 | "rel_p = []\n", 159 | "pos_p = []\n", 160 | "for i in range(length):\n", 161 | " word_p.append(0)\n", 162 | " rel_p.append(0)\n", 163 | " pos_p.append(0)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 44, 169 | "metadata": { 170 | "scrolled": true 171 | }, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "2590 success [2, 1, 6, 4, 5, 3]\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "# for i in 
range(length):\n", 183 | "i = 2590\n", 184 | "try:\n", 185 | " parse_tree = dep_parser.raw_parse(sentences[i])\n", 186 | " for trees in parse_tree:\n", 187 | " tree = trees\n", 188 | " node1 = tree.nodes[e1[i]+1]\n", 189 | " node2 = tree.nodes[e2[i]+1]\n", 190 | " if node1['address']!=None and node2['address']!=None:\n", 191 | " lca_node = lca(tree, node1, node2)\n", 192 | " path = seq(lca_node['address'])\n", 193 | " print(i, \"success\", path)\n", 194 | "\n", 195 | " word_p[i] = [tree.nodes[p][\"word\"] for p in path]\n", 196 | " rel_p[i] = [tree.nodes[p][\"rel\"] for p in path]\n", 197 | " pos_p[i] = [tree.nodes[p][\"tag\"] for p in path]\n", 198 | " else:\n", 199 | "\n", 200 | " print(i, node1[\"address\"], node2[\"address\"])\n", 201 | "except AssertionError:\n", 202 | " print(i, \"error\")\n", 203 | " " 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 45, 209 | "metadata": { 210 | "collapsed": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "file = open('../data/train_lca_paths', 'wb')\n", 215 | "_pickle.dump([word_p, rel_p, pos_p], file)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "outputs": [], 225 | "source": [] 226 | } 227 | ], 228 | "metadata": { 229 | "kernelspec": { 230 | "display_name": "Python 3", 231 | "language": "python", 232 | "name": "python3" 233 | }, 234 | "language_info": { 235 | "codemirror_mode": { 236 | "name": "ipython", 237 | "version": 3 238 | }, 239 | "file_extension": ".py", 240 | "mimetype": "text/x-python", 241 | "name": "python", 242 | "nbconvert_exporter": "python", 243 | "pygments_lexer": "ipython3", 244 | "version": "3.5.2" 245 | } 246 | }, 247 | "nbformat": 4, 248 | "nbformat_minor": 2 249 | } 250 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Shanu Kumar 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /LSTM Seq and Tree/README.md: -------------------------------------------------------------------------------- 1 | ## Relation Classification using LSTMs on Sequences and Tree Structures 2 | 3 | We implemented a architecture based on the paper [End-to-End Relation Extraction using LSTMs 4 | on Sequences and Tree Structures](http://www.aclweb.org/anthology/P/P16/P16-1105.pdf). This recurrent neural network based model captures both word sequence and dependency tree substructure information by stacking bidirectional treestructured LSTM-RNNs on bidirectional sequential LSTM-RNNs. This allows our model to jointly represent both entities and relations with shared parameters in a single model. 5 | 6 | 7 | Our model allows 8 | joint modeling of entities and relations in a single 9 | model by using both bidirectional sequential 10 | (left-to-right and right-to-left) and bidirectional 11 | tree-structured (bottom-up and top-down) LSTMRNNs. 12 | 13 | 14 | ## Model 15 | The model mainly consists of three representation layers: 16 | a embeddings layer, a word sequence based LSTM-RNN layer (sequence layer), and finally a dependency subtree based LSTM-RNN layer (dependency layer). 17 | 18 | ![Relation Classification Network](/img/lstm_tree.jpg) 19 | 20 | ### Embedding Layer 21 | Embedding layer consists of words, part-of-speech (POS) tags, dependency relations. 22 | 23 | ### Sequence Layer 24 | The sequence layer represents words in a linear sequence 25 | using the representations from the embedding layer. We represent the word sequence in a sentence with bidirectional LSTM-RNNs. 26 | The LSTM unit at t-th word receives the concatenation of word and POS embeddings as its input vector. 27 | 28 |

![Sequence Layer](/img/lstm_seq.jpg) 29 | 30 |
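A minimal TensorFlow 1.x sketch of this sequence layer is given below (the same TF version the notebooks use). It only illustrates the idea — the placeholder names, dimensions, and the choice of `tf.nn.bidirectional_dynamic_rnn` are assumptions made for the sketch, not code taken from the repository.

```python
import tensorflow as tf  # TensorFlow 1.x, as used elsewhere in this repository

# Illustrative sizes (assumptions, not the repository's settings)
batch_size, max_len = 10, 70
word_embd_dim, pos_embd_dim, state_size = 100, 25, 100

# Per-token word and POS embeddings, already looked up
word_embd = tf.placeholder(tf.float32, [batch_size, max_len, word_embd_dim])
pos_embd = tf.placeholder(tf.float32, [batch_size, max_len, pos_embd_dim])
seq_len = tf.placeholder(tf.int32, [batch_size])

# Input to the sequence layer: concatenation of word and POS embeddings
x = tf.concat([word_embd, pos_embd], axis=2)

# Bidirectional LSTM over the whole sentence
cell_fw = tf.contrib.rnn.BasicLSTMCell(state_size)
cell_bw = tf.contrib.rnn.BasicLSTMCell(state_size)
(h_fw, h_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, x, sequence_length=seq_len, dtype=tf.float32)

# Output vector s_t for each word: the two directions' hidden states concatenated
s = tf.concat([h_fw, h_bw], axis=2)  # shape: [batch_size, max_len, 2 * state_size]
```

Some notebooks in this repository (for example the LCA SubTree models) instead build the backward direction by running a second `tf.nn.dynamic_rnn` over inputs reversed with `tf.reverse`.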

31 | 32 | We also concatenate the hidden state vectors of the two directions’ LSTM units corresponding to each word (denoted as ↑ht and ↓ht) as its output vector (st), and pass it to the subsequent layers. 33 | 34 | ### Entity Detection 35 | We perform entity detection on top of the sequence 36 | layer. We employ a two-layered NN with an hidden layer and a softmax output layer for entity detection. 37 | 38 | ### Dependency Layer 39 | The dependency layer represents a relation between a pair of two target words (corresponding to a relation candidate in relation classification) in 40 | the dependency tree. 41 | 42 | This layer mainly focuses on the shortest path between a pair of target words in the dependency tree (i.e., the path between the least common node and the two target words). 43 | 44 | We employ bidirectional tree-structured LSTMRNNs (i.e., bottom-up and top-down) to represent a relation candidate by capturing the dependency 45 | structure around the target word pair. This bidirectional structure propagates to each node not only the information from the leaves but also information from the root. This is especially important for relation classification, which makes use of argument nodes near the bottom of the tree, and our top-down LSTM-RNN sends information from the top of the tree to such near-leaf nodes (unlike in standard bottom-up LSTM-RNNs). 46 | 47 | Tree-structured LSTM-RNN's equations : 48 |

![Tree-structured LSTM-RNN equations](/img/lstm_tree_eq.jpg) 49 | 50 |
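To make the bottom-up computation concrete, below is a minimal NumPy sketch of a single child-sum Tree-LSTM node update (the generic formulation of Tai et al., 2015). It is an illustration under assumptions — the function name, parameter layout, and sizes are invented for the sketch, and the exact equations used by the paper are the ones in the figure above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x, child_h, child_c, W, U, b):
    """One bottom-up child-sum Tree-LSTM step.

    x       : input vector of the current node, shape (d_in,)
    child_h : hidden states of the node's children, shape (n_children, d_h)
    child_c : cell states of the node's children, shape (n_children, d_h)
    W, U, b : gate parameters, dicts keyed by 'i', 'o', 'u', 'f'
    """
    h_sum = child_h.sum(axis=0)  # children's hidden states are summed

    i = sigmoid(W['i'] @ x + U['i'] @ h_sum + b['i'])   # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_sum + b['o'])   # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_sum + b['u'])   # candidate cell state
    # one forget gate per child, computed from that child's own hidden state
    f = sigmoid(child_h @ U['f'].T + W['f'] @ x + b['f'])  # shape (n_children, d_h)

    c = i * u + (f * child_c).sum(axis=0)  # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

# Toy usage with random parameters
d_in, d_h, n_children = 4, 3, 2
rng = np.random.RandomState(0)
W = {g: rng.randn(d_h, d_in) for g in 'iouf'}
U = {g: rng.randn(d_h, d_h) for g in 'iouf'}
b = {g: np.zeros(d_h) for g in 'iouf'}
h, c = tree_lstm_node(rng.randn(d_in),
                      rng.randn(n_children, d_h),
                      rng.randn(n_children, d_h),
                      W, U, b)
```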

51 | 52 | While we use one node from Shortest Dependency path, then the hidden and current states of the children of this node in Dependency Tree are taken as previous state in LSTM. 53 | 54 | We stack the dependency layers (corresponding to relation candidates) on top of the sequence layer to incorporate both word sequence and dependency tree structure information into the output. 55 | The dependency-layer LSMT unit at the t-th word recives as input, the concatenation of its corresponding hidden state vectors st in the sequence layer, dependency type embedding. 56 | 57 | ### Relation Classification 58 | The relation candidate vector is constructed as 59 | the concatenation dp = [↑hpA; ↓hp1; ↓hp2], where ↑hpA is the hidden state vector of the top LSTM unit in the bottom-up LSTM-RNN (representing the lowest common ancestor of the target word pair p), and ↓hp1, ↓hp2 are the hidden state vectors of the two LSTM units representing the first and second target words in the top-down LSTMRNN. 60 | 61 | Similarly to the entity detection, we employ a two-layered NN with an hidden layer and a softmax output layer. 62 | 63 | ### Training 64 | 65 | We update the model parameters including weights, biases, and embeddings by BPTT and Adam gradient descent with gradient clipping, L2-regularization 66 | (we regularize weights W and U, not the bias terms b). We also apply dropout to the embedding layer and to the final hidden layers for entity detection and relation classification. We employ entity pretraining to improve the model. 67 | 68 | ### Data 69 | 70 | SemEval-2010 Task 8 defines 9 relation types between nominals and a tenth type Other when two nouns have none of these relations and no direction is considered. 71 | ## Experiments 72 | 73 | Model | Train-Accuracy | Test-Accuracy| Epochs 74 | --- | --- | ---| --- 75 | model3v1 | 97.54 | 66.5 | 11 76 | model3v2 | 99.9 | 70.69 | 19 77 | 78 | 79 | * Learning rate = 0.001 80 | * Learning rate decay = 0.96 81 | * state size = 100 82 | * lambda_l2 = 0.0001 83 | * Gradient Clipping = 10 84 | * Entity Detection Pretrained 85 | 86 | 87 | ### [model3v1](https://github.com/Sshanu/Relation-Classification/blob/master/LSTM%20Seq%20and%20Tree/model3v1.ipynb) 88 | * Bidirectional LSTM over whole sentence 89 | * Bottom-up and Top-down LSTM along Shortest Dependency Path with childrens from Dependency tree. 90 | 91 | ### [model3v2](https://github.com/Sshanu/Relation-Classification/blob/master/LSTM%20Seq%20and%20Tree/model3v2.ipynb) 92 | * Dropout on hidden layers of both entity detection and relation classifier of 0.3. 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Relation Classification 2 | 3 | [![MIT License](https://img.shields.io/badge/license-MIT-green.svg)](https://opensource.org/licenses/MIT) 4 | 5 | Relation classification aims to categorize into predefined classes the relations btw pairs of given entities in texts. There are two ways to represent relations between entities using deep neural networks: recurrent neural networks (RNNs) and convolutional neural networks (CNNs). 
We have implemented three LSTM-RNN architectures for solving the task of relation classification: 6 | * [Relation classification using LSTM Networks along Shortest Dependency Paths.](https://github.com/Sshanu/Relation-Classification/tree/master/LCA%20Shortest%20Path) 7 | * [Relation classification using bidirectional LSTM Networks on LCA Sub Tree.](https://github.com/Sshanu/Relation-Classification/tree/master/LCA%20SubTree) 8 | * [Relation classification using LSTMS on Sequences and Tree Structures.](https://github.com/Sshanu/Relation-Classification/tree/master/LSTM%20Seq%20and%20Tree) 9 | 10 | We achieve better performance for solving this task using the last approach "[Relation classification using LSTMS on Sequences and Tree Structures.](https://github.com/Sshanu/Relation-Classification/tree/master/LSTM%20Seq%20and%20Tree)". 11 | 12 | 13 | ### References: 14 | 15 | > **End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures**
16 | > Makoto Miwa, Mohit Bansal
17 | > [http://www.aclweb.org/anthology/P/P16/P16-1105.pdf](http://www.aclweb.org/anthology/P/P16/P16-1105.pdf) 18 | > 19 | > **Abstract:** *We present a novel end-to-end neural 20 | model to extract entities and relations between them. Our recurrent neural network based model captures both word sequence and dependency tree substructure 21 | information by stacking bidirectional treestructured LSTM-RNNs on bidirectional 22 | sequential LSTM-RNNs. This allows our 23 | model to jointly represent both entities and 24 | relations with shared parameters in a single model. We further encourage detection of entities during training and use of 25 | entity information in relation extraction 26 | via entity pretraining and scheduled sampling. Our model improves over the stateof-the-art feature-based model on end-toend relation extraction, achieving 12.1% 27 | and 5.7% relative error reductions in F1- 28 | score on ACE2005 and ACE2004, respectively. We also show that our LSTMRNN based model compares favorably to 29 | the state-of-the-art CNN based model (in 30 | F1-score) on nominal relation classification (SemEval-2010 Task 8). Finally, we 31 | present an extensive ablation analysis of 32 | several model components* 33 | 34 | > **Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths**
35 | > Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, Zhi Jin
36 | > [http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP206.pdf](http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP206.pdf) 37 | > 38 | > **Abstract:** *Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we 39 | present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture 40 | leverages the shortest dependency path 41 | (SDP) between two entities; multichannel recurrent neural networks, with long 42 | short term memory (LSTM) units, pick 43 | up heterogeneous information along the 44 | SDP. Our proposed model has several distinct features: (1) The shortest dependency 45 | paths retain most relevant information (to 46 | relation classification), while eliminating 47 | irrelevant words in the sentence. (2) The 48 | multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency 49 | paths. (3) A customized dropout strategy 50 | regularizes the neural network to alleviate overfitting. We test our model on the 51 | SemEval 2010 relation classification task, 52 | and achieve an F1-score of 83.7%, higher 53 | than competing methods in the literature.* 54 | -------------------------------------------------------------------------------- /data/dependency_types.txt: -------------------------------------------------------------------------------- 1 | root 2 | nmod 3 | nsubj 4 | dobj 5 | nsubjpass 6 | compound 7 | conj 8 | acl 9 | advcl 10 | ccomp 11 | amod 12 | acl:relcl 13 | xcomp 14 | dep 15 | appos 16 | nmod:poss 17 | advmod 18 | parataxis 19 | csubj 20 | iobj 21 | -------------------------------------------------------------------------------- /data/full_postags_types.txt: -------------------------------------------------------------------------------- 1 | CC 2 | CD 3 | DT 4 | EX 5 | FW 6 | IN 7 | JJ 8 | JJR 9 | JJS 10 | LS 11 | MD 12 | NN 13 | NNS 14 | NNP 15 | NNPS 16 | PDT 17 | POS 18 | PRP 19 | PRP$ 20 | RB 21 | RBR 22 | RBS 23 | RP 24 | SYM 25 | TO 26 | UH 27 | VB 28 | VBD 29 | VBG 30 | VBN 31 | VBP 32 | VBZ 33 | WDT 34 | WP 35 | WP$ 36 | WRB 37 | -------------------------------------------------------------------------------- /data/pos_tags.txt: -------------------------------------------------------------------------------- 1 | CC 2 | CD 3 | DT 4 | IN 5 | JJ 6 | NN 7 | PRP 8 | RB 9 | VB 10 | -------------------------------------------------------------------------------- /data/relation_types.txt: -------------------------------------------------------------------------------- 1 | Other 2 | Entity-Destination(e1,e2) 3 | Cause-Effect(e2,e1) 4 | Member-Collection(e2,e1) 5 | Entity-Origin(e1,e2) 6 | Message-Topic(e1,e2) 7 | Component-Whole(e2,e1) 8 | Component-Whole(e1,e2) 9 | Instrument-Agency(e2,e1) 10 | Product-Producer(e2,e1) 11 | Content-Container(e1,e2) 12 | Cause-Effect(e1,e2) 13 | Product-Producer(e1,e2) 14 | Content-Container(e2,e1) 15 | Entity-Origin(e2,e1) 16 | Message-Topic(e2,e1) 17 | Instrument-Agency(e1,e2) 18 | Member-Collection(e1,e2) 19 | Entity-Destination(e2,e1) -------------------------------------------------------------------------------- /data/relation_typesv3.txt: -------------------------------------------------------------------------------- 1 | Other 2 | Entity-Destination 3 | Cause-Effect 4 | Member-Collection 5 | Entity-Origin 6 | Message-Topic 7 | Component-Whole 8 | Instrument-Agency 9 | Product-Producer 10 | Content-Container 11 | 
-------------------------------------------------------------------------------- /data/test_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_data -------------------------------------------------------------------------------- /data/test_lca_paths: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_lca_paths -------------------------------------------------------------------------------- /data/test_pathsv1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_pathsv1 -------------------------------------------------------------------------------- /data/test_pathsv3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_pathsv3 -------------------------------------------------------------------------------- /data/train_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_data -------------------------------------------------------------------------------- /data/train_lca_paths: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_lca_paths -------------------------------------------------------------------------------- /data/train_pathsv1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_pathsv1 -------------------------------------------------------------------------------- /data/train_pathsv3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_pathsv3 -------------------------------------------------------------------------------- /data/vocab.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab.pkl -------------------------------------------------------------------------------- /data/vocab_glove: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab_glove -------------------------------------------------------------------------------- /data/vocab_wiki: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab_wiki -------------------------------------------------------------------------------- /img/lca.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lca.jpg -------------------------------------------------------------------------------- /img/lstm_seq.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_seq.jpg -------------------------------------------------------------------------------- /img/lstm_tree.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_tree.jpg -------------------------------------------------------------------------------- /img/lstm_tree_eq.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_tree_eq.jpg -------------------------------------------------------------------------------- /preprocessing.py.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import re, sys, nltk\n", 12 | "from nltk.tokenize.stanford import StanfordTokenizer\n", 13 | "path_to_jar = \"/home/shanu/nltk/jars/stanford-postagger.jar\"\n", 14 | "tokenizer = StanfordTokenizer(path_to_jar)\n", 15 | "import _pickle as pickle" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "# Extracting the Relations \n", 27 | "# Please comment this when preprocessing the sentences.\n", 28 | "# for training data open \"TRAIN_FILE.TXT\" and for test data open \"TEST_FILE_FULL.TXT\"\n", 29 | "\n", 30 | "lines = []\n", 31 | "for line in open(\"data/TRAIN_FILE.TXT\"):\n", 32 | " lines.append(line.strip())\n", 33 | "\n", 34 | "relations = []\n", 35 | "for i, w in enumerate(lines):\n", 36 | " if((i+3)%4==0):\n", 37 | " relations.append(w)\n", 38 | " \n", 39 | "f = open(\"data/train_relations.txt\", 'w')\n", 40 | "for rel in relations:\n", 41 | " f.write(rel+'\\n')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "# For preprocessing Training data open \"TRAIN_FILE.TXT and for Test data open \"TEST_FILE.txt\n", 53 | "\n", 54 | "lines = []\n", 55 | "for line in open(\"data/TRAIN_FILE.TXT\"): \n", 56 | " m = re.match(r'^([0-9]+)\\s\"(.+)\"$', line.strip())\n", 57 | " if(m is not None):\n", 58 | " lines.append(m.group(2))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "len(relations)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 
| "metadata": { 74 | "scrolled": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "sentences = []\n", 79 | "e1 = []\n", 80 | "e2 = []\n", 81 | "for j,line in enumerate(lines):\n", 82 | " text = []\n", 83 | " temp = []\n", 84 | " t = line.split(\"\")\n", 85 | " text.append(t[0])\n", 86 | " temp.append(t[0])\n", 87 | "\n", 88 | " t = t[1].split(\"\")\n", 89 | " e1_text = text\n", 90 | " e1_text = \" \".join(e1_text)\n", 91 | " e1_text = tokenizer.tokenize(e1_text)\n", 92 | " text.append(t[0])\n", 93 | " e11= t[0]\n", 94 | " y = tokenizer.tokenize(t[0])\n", 95 | " y[0] +=\"E11\"\n", 96 | " temp.append(\" \".join(y))\n", 97 | " t = t[1].split(\"\")\n", 98 | " text.append(t[0])\n", 99 | " temp.append(t[0])\n", 100 | " t = t[1].split(\"\")\n", 101 | " e22 = t[0]\n", 102 | " e2_text = text\n", 103 | " e2_text = \" \".join(e2_text)\n", 104 | " e2_text = tokenizer.tokenize(e2_text)\n", 105 | " text.append(t[0])\n", 106 | " text.append(t[1])\n", 107 | " y = tokenizer.tokenize(t[0])\n", 108 | " y[0] +=\"E22\"\n", 109 | " temp.append(\" \".join(y))\n", 110 | " temp.append(t[1])\n", 111 | "\n", 112 | " text = \" \".join(text)\n", 113 | " text = tokenizer.tokenize(text)\n", 114 | " temp = \" \".join(temp)\n", 115 | " temp = tokenizer.tokenize(temp)\n", 116 | "\n", 117 | " q1 = tokenizer.tokenize(e11)[0]\n", 118 | " q2 = tokenizer.tokenize(e22)[0]\n", 119 | " for i, word in enumerate(text):\n", 120 | " if(word.find(q1)!=-1):\n", 121 | " if(temp[i].find(\"E11\")!=-1):\n", 122 | " e1.append(i) \n", 123 | " break\n", 124 | " for i, word in enumerate(text):\n", 125 | " if(word.find(q2)!=-1):\n", 126 | " if(temp[i].find(\"E22\")!=-1):\n", 127 | " e2.append(i) \n", 128 | " text = \" \".join(text)\n", 129 | " sentences.append(text)\n", 130 | " print(j, text)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "len(sentences), len(e1), len(e2)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "# for saving training data open \"train_data\" and for test data open \"test_data\"\n", 151 | "\n", 152 | "with open('data/train_data', 'wb') as f:\n", 153 | " pickle.dump((sentences, e1, e2), f)\n", 154 | " f.close()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": true 162 | }, 163 | "outputs": [], 164 | "source": [] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.5.4" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | --------------------------------------------------------------------------------