├── .gitignore
├── Data_analysis.ipynb
├── LCA Shortest Path
│   ├── README.md
│   ├── modelv1.ipynb
│   ├── modelv2.ipynb
│   ├── modelv3.ipynb
│   ├── modelv4.ipynb
│   ├── modelv5.ipynb
│   ├── modelv6.ipynb
│   ├── modelv7.ipynb
│   ├── modelv8.ipynb
│   └── path_extractor.ipynb
├── LCA SubTree
│   ├── README.md
│   ├── model2v1.ipynb
│   ├── model2v2.ipynb
│   └── path_extractor.ipynb
├── LICENSE
├── LSTM Seq and Tree
│   ├── README.md
│   ├── model3v1.ipynb
│   ├── model3v2.ipynb
│   └── path_extractor.ipynb
├── README.md
├── data
│   ├── TEST_FILE.txt
│   ├── TEST_FILE_FULL.TXT
│   ├── TRAIN_FILE.TXT
│   ├── dependency_types.txt
│   ├── full_postags_types.txt
│   ├── model3v1.ipynb
│   ├── pos_tags.txt
│   ├── relation_types.txt
│   ├── relation_typesv3.txt
│   ├── test_data
│   ├── test_lca_paths
│   ├── test_pathsv1
│   ├── test_pathsv3
│   ├── test_relations.txt
│   ├── test_relationsv3.txt
│   ├── train_data
│   ├── train_lca_paths
│   ├── train_pathsv1
│   ├── train_pathsv3
│   ├── train_relations.txt
│   ├── train_relationsv3.txt
│   ├── train_text.txt
│   ├── vocab.pkl
│   ├── vocab_glove
│   └── vocab_wiki
├── img
│   ├── lca.jpg
│   ├── lstm_seq.jpg
│   ├── lstm_tree.jpg
│   └── lstm_tree_eq.jpg
└── preprocessing.py.ipynb

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | checkpoint/
2 | .ipynb_checkpoints/
3 | _pycache_/
4 | word_embedding_glove
5 | glove.6B.100d.txt
6 | glove.6B.200d.txt
7 | train/
8 | word_embd_wiki
9 | wikipedia200.bin

--------------------------------------------------------------------------------
/LCA Shortest Path/README.md:
--------------------------------------------------------------------------------
1 | ## Relation Classification using LSTM Networks along Shortest Dependency Paths
2 | 
3 | First, we implemented an architecture following the paper [Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths](https://pdfs.semanticscholar.org/0f44/366c1e1446cfd51258c68bd1da14fe9c7f10.pdf?_ga=2.136229944.807016038.1498203433-264083776.1497442258) by Yan Xu et al.
4 | This neural architecture utilizes the shortest dependency path between two entities in a sentence.
5 | The shortest dependency path retains the information most relevant to relation classification, while eliminating irrelevant words in the sentence.
6 | 
7 | ## SDP-LSTM Model
8 | 
9 | ![LCA Shortest Path](/img/lca.jpg)
10 | 
11 | First, the sentence is parsed into a dependency tree by the [Stanford parser](https://nlp.stanford.edu/software/stanford-dependencies.shtml), and the shortest dependency path (SDP) between the two entities is extracted as the input of our network.
12 | 
13 | Dependency trees are a kind of directed graph, so the direction of a relation matters. Hence we separate the SDP into two sub-paths, each running from an entity to the common ancestor node. Along the SDP, three types of information are used as channels: words, POS tags, and dependency types.
14 | In each channel, the inputs (e.g. the words) are mapped to real-valued vectors, called embeddings, which capture the underlying meanings of the inputs.
15 | 
16 | ### Channels
17 | 
18 | * Each word in a given sentence is mapped to a real-valued vector by looking it up in a pretrained GloVe word embedding table.
19 | * Since word embeddings are obtained on a large generic corpus, the information they contain may not agree with a specific sentence. We deal with this problem by pairing each input word with its POS tag, e.g. noun, verb, etc.
20 | * The dependency types between words provide grammatical relationships in a sentence that can easily be understood and effectively used by people
21 | without linguistic expertise.
22 | Two recurrent neural networks pick up information along the left and right sub-paths of the SDP.
23 | 
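A minimal sketch of these three channel look-ups, distilled from the `modelv*.ipynb` notebooks in this directory (TensorFlow 1.x; the vocabulary sizes and dimensions follow modelv1), is shown below. It covers only the embedding step; the complete graphs are in the notebooks.

```python
import tensorflow as tf  # TensorFlow 1.x, as used in the notebooks

word_vocab_size, pos_vocab_size, dep_vocab_size = 400001, 10, 21
word_embd_dim, pos_embd_dim, dep_embd_dim = 100, 25, 25
batch_size, max_len_path = 10, 10

# One id sequence per sub-path: shape [2 sub-paths, batch, max path length]
word_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="word_ids")
pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="pos_ids")
dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name="dep_ids")

# One embedding table per channel; the word table is filled from pretrained GloVe
# vectors, while the POS and dependency tables are learned from scratch.
word_table = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name="word_W")
pos_table = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name="pos_W")
dep_table = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name="dep_W")

# Each look-up yields a [2, batch, max_len_path, dim] tensor; these are the
# per-channel inputs to the per-sub-path LSTMs described in the next section.
embedded_word = tf.nn.embedding_lookup(word_table, word_ids)
embedded_pos = tf.nn.embedding_lookup(pos_table, pos_ids)
embedded_dep = tf.nn.embedding_lookup(dep_table, dep_ids)
```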
24 | ### Recurrent Neural Networks
25 | 
26 | Plain recurrent neural networks suffer from the vanishing/exploding gradient problem. Long short-term memory (LSTM) units overcome this problem by introducing an adaptive gating mechanism, which decides how much of the previous state to keep and which features of the current input to memorize.
27 | An LSTM-based recurrent neural network comprises four components: an input gate, a forget gate, an output gate, and a memory cell.
28 | The two sub-path LSTMs propagate bottom-up from the entities to their common ancestor. This way, the model is direction-sensitive.
29 | 
30 | For each channel and each sub-path, a max-pooling layer packs the recurrent network's states into a fixed-size vector by taking the maximum value in each dimension.
31 | The pooled vectors from the different channels and sub-paths are concatenated and then connected to a hidden layer. Finally, we have a softmax output layer for
32 | classification.
33 | 
34 | ### Training
35 | 
36 | We update the model parameters, including weights, biases, and embeddings, by backpropagation through time (BPTT) and Adam gradient descent with L2 regularization (we regularize the weights W and U, not the bias terms b).
37 | 
38 | ### Data
39 | 
40 | SemEval-2010 Task 8 defines 9 relation types between nominals and a tenth type, Other, for when the two nominals have none of these relations. Direction is considered, so the model is trained over 19 relation classes (2 x 9 directed classes plus Other).
41 | ## Experiments
42 | 
43 | Model | Train-Accuracy | Test-Accuracy | Epochs
44 | --- | --- | --- | ---
45 | modelv1 | 99.45 | 61.4 | 10
46 | modelv2 | 100 | ? | 10
47 | modelv3 | 84.03 | 60.4 | 20
48 | modelv4 | 96.1 | 63.2 | 60
49 | modelv5 | 92.2 | 62.3 | 60
50 | modelv6 | 97.3 | 61.4 | 34
51 | modelv7 | 94.6 | 60.03 | 20
52 | modelv8 | 98.96 | 62.5 | 60
53 | 
54 | ### [modelv1](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv1.ipynb)
55 | * Learning rate = 0.001
56 | * other_state_size = 100
57 | * lambda_l2 = 0.0001
58 | 
59 | ### [modelv2](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv2.ipynb)
60 | * dropout over hidden layer - 0.3
61 | 
62 | ### [modelv3](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv3.ipynb)
63 | * dropout over word_embedding - 0.3
64 | 
65 | ### [modelv4](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv4.ipynb)
66 | * dropout over word_embedding - 0.3
67 | * other_state_size = 50
68 | 
69 | ### [modelv5](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv5.ipynb)
70 | * dropout over word_embedding and hidden_layer - 0.3
71 | * other_state_size = 50
72 | * lambda = 0.00001
73 | 
74 | ### [modelv5](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv5.ipynb)
75 | * dropout over word_embedding, pos_embedding, dep_embedding of 0.5
76 | * dropout on hidden_layer of 0.3
77 | 
78 | ``All models below use learning rate decay at a rate of 0.96 every 2000 steps``
79 | 
80 | ### [modelv6](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv6.ipynb)
81 | * learning rate decay
82 | 
83 | ### [modelv7](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv7.ipynb)
84 | * learning rate decay
85 | * dropout on word, pos tags, dep embedding of 0.5
86 | * dropout on hidden layer of 0.3
87 | 
88 | ### [modelv8](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20Shortest%20Path/modelv8.ipynb)
89 | * learning rate decay
90 | * [word embedding](http://tti-coin.jp/data/wikipedia200.bin) trained on Wikipedia
91 | * dropout over hidden layer of 0.3
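As an aside, the sketch below illustrates how the two sub-paths of an SDP can be obtained from a dependency parse: each entity is walked up towards the root until the lowest common ancestor is reached. It is a plain-Python toy, not the repository's extraction code; the real paths are produced from Stanford-parser output in `path_extractor.ipynb`, and the function name and head map here are purely illustrative.

```python
def lca_subpaths(heads, e1, e2):
    """Split the path between tokens e1 and e2 into two sub-paths, each running
    from an entity up to their lowest common ancestor (LCA) in the parse tree.

    `heads` maps a token index to its head index; the root points to itself.
    Illustrative only -- the repository extracts paths from Stanford-parser output.
    """
    def ancestors(tok):
        chain = [tok]
        while heads[tok] != tok:      # climb until the root is reached
            tok = heads[tok]
            chain.append(tok)
        return chain

    chain1, chain2 = ancestors(e1), ancestors(e2)
    lca = next(tok for tok in chain1 if tok in chain2)  # first shared ancestor
    path1 = chain1[:chain1.index(lca) + 1]              # entity 1 -> ... -> LCA
    path2 = chain2[:chain2.index(lca) + 1]              # entity 2 -> ... -> LCA
    return path1, path2

# Toy head map for "A misty ridge uprises from the surge"
# (0 A, 1 misty, 2 ridge, 3 uprises, 4 from, 5 the, 6 surge; 3 is the root).
heads = {0: 2, 1: 2, 2: 3, 3: 3, 4: 3, 5: 6, 6: 4}
print(lca_subpaths(heads, 2, 6))  # ([2, 3], [6, 4, 3]) -- both sub-paths end at "uprises"
```

The pickled path files loaded by the notebooks store exactly these per-sub-path word, POS-tag, and dependency-type sequences (`word_p1`, `word_p2`, `dep_p1`, `dep_p2`, `pos_p1`, `pos_p2`).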
--------------------------------------------------------------------------------
/LCA Shortest Path/modelv1.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "data_dir = '../data' # Directory for Data and Other files\n", 27 | "ckpt_dir = '../checkpoint' # Directory for Checkpoints \n", 28 | "word_embd_dir = '../checkpoint/word_embd' # Directory for Checkpoints of Word Embedding Layer\n", 29 | "model_dir = '../checkpoint/modelv1' # Directory for Checkpoints of Model" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "metadata": { 36 | "collapsed": true 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "word_embd_dim = 100 # Dimension of embedding layer for words\n", 41 | "pos_embd_dim = 25 # Dimension of embedding layer for POS Tags\n", 42 | "dep_embd_dim = 25 # Dimension of embedding layer for Dependency Types\n", 43 | "\n", 44 | "word_vocab_size = 400001 # Vocab size for Words\n", 45 | "pos_vocab_size = 10 # Vocab size for POS Tags\n", 46 | "dep_vocab_size = 21 # Vocab size for Dependency Types\n", 47 | "\n", 48 | "relation_classes = 19 # No. of Relation Classes\n", 49 | "word_state_size = other_state_size = 100 # Dimension of States of LSTM-RNNs (word and other channels)\n", 50 | "batch_size = 10 # Batch Size for training\n", 51 | "\n", 52 | "channels = 3 # No. 
of types of features to feed in LSTM-RNN\n", 53 | "lambda_l2 = 0.0001\n", 54 | "max_len_path = 10 # Maximum length of sequence" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 4, 60 | "metadata": { 61 | "collapsed": true 62 | }, 63 | "outputs": [], 64 | "source": [ 65 | "with tf.name_scope(\"input\"):\n", 66 | " \n", 67 | " # Length of the sequence\n", 68 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\") \n", 69 | " \n", 70 | " # Words in the sequence\n", 71 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\") \n", 72 | " \n", 73 | " # POS Tags in the sequence\n", 74 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\") \n", 75 | " \n", 76 | " # Dependency Types in the sequence\n", 77 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\") \n", 78 | " \n", 79 | " # True Relation btw the entities\n", 80 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\") " 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 5, 86 | "metadata": { 87 | "collapsed": true 88 | }, 89 | "outputs": [], 90 | "source": [ 91 | "# Embedding Layer of Words \n", 92 | "with tf.name_scope(\"word_embedding\"):\n", 93 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 94 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 95 | " embedding_init = W.assign(embedding_placeholder)\n", 96 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 97 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 98 | "\n", 99 | "# Embedding Layer of POS Tags \n", 100 | "with tf.name_scope(\"pos_embedding\"):\n", 101 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 102 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 103 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 104 | "\n", 105 | "# Embedding Layer of Dependency Types \n", 106 | "with tf.name_scope(\"dep_embedding\"):\n", 107 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 108 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 109 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 6, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "hidden_states = tf.zeros([channels, batch_size, state_size], name=\"hidden_state\")\n", 121 | "cell_states = tf.zeros([channels, batch_size, state_size], name=\"cell_state\")\n", 122 | "\n", 123 | "init_states = [tf.contrib.rnn.LSTMStateTuple(hidden_states[i], cell_states[i]) for i in range(channels)]\n", 124 | "\n", 125 | "with tf.variable_scope(\"word_lstm1\"):\n", 126 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 127 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[0], sequence_length=path_length[0], initial_state=init_states[0])\n", 128 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 129 | "\n", 130 | "with tf.variable_scope(\"word_lstm2\"):\n", 131 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 132 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[1], sequence_length=path_length[1], initial_state=init_states[0])\n", 133 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 134 | "\n", 135 | "with 
tf.variable_scope(\"pos_lstm1\"):\n", 136 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 137 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=init_states[1])\n", 138 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 139 | "\n", 140 | "with tf.variable_scope(\"pos_lstm2\"):\n", 141 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 142 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=init_states[1])\n", 143 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 144 | "\n", 145 | "with tf.variable_scope(\"dep_lstm1\"):\n", 146 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 147 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=init_states[2])\n", 148 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 149 | "\n", 150 | "with tf.variable_scope(\"dep_lstm2\"):\n", 151 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 152 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=init_states[2])\n", 153 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 154 | "\n", 155 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 156 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 157 | "\n", 158 | "state_series = tf.concat([state_series1, state_series2], 1)" 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 7, 164 | "metadata": { 165 | "collapsed": true 166 | }, 167 | "outputs": [], 168 | "source": [ 169 | "with tf.name_scope(\"hidden_layer\"):\n", 170 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 171 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 172 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 173 | "\n", 174 | "with tf.name_scope(\"softmax_layer\"):\n", 175 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 176 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 177 | " logits = tf.matmul(y_hidden_layer, W) + b\n", 178 | " predictions = tf.argmax(logits, 1)" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 8, 184 | "metadata": { 185 | "collapsed": true 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "tv_all = tf.trainable_variables()\n", 190 | "tv_regu = []\n", 191 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 192 | "for t in tv_all:\n", 193 | " if t.name not in non_reg:\n", 194 | " if(t.name.find('biases')==-1):\n", 195 | " tv_regu.append(t)\n", 196 | "\n", 197 | "with tf.name_scope(\"loss\"):\n", 198 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 199 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 200 | " total_loss = loss + l2_loss\n", 201 | "\n", 202 | "global_step = tf.Variable(0, name=\"global_step\")\n", 203 | "\n", 204 | "optimizer = tf.train.AdamOptimizer(0.001).minimize(total_loss, global_step=global_step)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": 9, 210 | "metadata": { 211 | "collapsed": true 212 | }, 213 | "outputs": [], 214 | "source": [ 215 | "f 
= open(data_dir + '/vocab.pkl', 'rb')\n", 216 | "vocab = pickle.load(f)\n", 217 | "f.close()\n", 218 | "\n", 219 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 220 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 221 | "\n", 222 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 223 | "word2id[unknown_token] = word_vocab_size -1\n", 224 | "id2word[word_vocab_size-1] = unknown_token\n", 225 | "\n", 226 | "pos_tags_vocab = []\n", 227 | "for line in open(data_dir + '/pos_tags.txt'):\n", 228 | " pos_tags_vocab.append(line.strip())\n", 229 | "\n", 230 | "dep_vocab = []\n", 231 | "for line in open(data_dir + '/dependency_types.txt'):\n", 232 | " dep_vocab.append(line.strip())\n", 233 | "\n", 234 | "relation_vocab = []\n", 235 | "for line in open(data_dir + '/relation_types.txt'):\n", 236 | " relation_vocab.append(line.strip())\n", 237 | "\n", 238 | "\n", 239 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 240 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 241 | "\n", 242 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 243 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 244 | "\n", 245 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 246 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 247 | "\n", 248 | "pos_tag2id['OTH'] = 9\n", 249 | "id2pos_tag[9] = 'OTH'\n", 250 | "\n", 251 | "dep2id['OTH'] = 20\n", 252 | "id2dep[20] = 'OTH'\n", 253 | "\n", 254 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 255 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 256 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 257 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 258 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 259 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 260 | "\n", 261 | "def pos_tag(x):\n", 262 | " if x in JJ_pos_tags:\n", 263 | " return pos_tag2id['JJ']\n", 264 | " if x in NN_pos_tags:\n", 265 | " return pos_tag2id['NN']\n", 266 | " if x in RB_pos_tags:\n", 267 | " return pos_tag2id['RB']\n", 268 | " if x in PRP_pos_tags:\n", 269 | " return pos_tag2id['PRP']\n", 270 | " if x in VB_pos_tags:\n", 271 | " return pos_tag2id['VB']\n", 272 | " if x in _pos_tags:\n", 273 | " return pos_tag2id[x]\n", 274 | " else:\n", 275 | " return 9" 276 | ] 277 | }, 278 | { 279 | "cell_type": "code", 280 | "execution_count": 16, 281 | "metadata": { 282 | "collapsed": true 283 | }, 284 | "outputs": [], 285 | "source": [ 286 | "f = open(data_dir + '/train_paths', 'rb')\n", 287 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 288 | "f.close()\n", 289 | "\n", 290 | "relations = []\n", 291 | "for line in open(data_dir + '/train_relations.txt'):\n", 292 | " relations.append(line.strip().split()[1])\n", 293 | "\n", 294 | "length = len(word_p1)\n", 295 | "num_batches = int(length/batch_size)\n", 296 | "\n", 297 | "for i in range(length):\n", 298 | " for j, word in enumerate(word_p1[i]):\n", 299 | " word = word.lower()\n", 300 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 301 | " for k, word in enumerate(word_p2[i]):\n", 302 | " word = word.lower()\n", 303 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 304 | " for l, d in enumerate(dep_p1[i]):\n", 305 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 306 | " for m, d in enumerate(dep_p2[i]):\n", 307 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 308 | "\n", 309 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 310 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 311 | 
"pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 312 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 313 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 314 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 315 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 316 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 317 | "path2_len = np.array([len(w) for w in word_p2])\n", 318 | "\n", 319 | "for i in range(length):\n", 320 | " for j, w in enumerate(word_p1[i]):\n", 321 | " word_p1_ids[i][j] = word2id[w]\n", 322 | " for j, w in enumerate(word_p2[i]):\n", 323 | " word_p2_ids[i][j] = word2id[w]\n", 324 | " for j, w in enumerate(pos_p1[i]):\n", 325 | " pos_p1_ids[i][j] = pos_tag(w)\n", 326 | " for j, w in enumerate(pos_p2[i]):\n", 327 | " pos_p2_ids[i][j] = pos_tag(w)\n", 328 | " for j, w in enumerate(dep_p1[i]):\n", 329 | " dep_p1_ids[i][j] = dep2id[w]\n", 330 | " for j, w in enumerate(dep_p2[i]):\n", 331 | " dep_p2_ids[i][j] = dep2id[w]" 332 | ] 333 | }, 334 | { 335 | "cell_type": "code", 336 | "execution_count": 17, 337 | "metadata": { 338 | "collapsed": true 339 | }, 340 | "outputs": [], 341 | "source": [ 342 | "sess = tf.Session()\n", 343 | "sess.run(tf.global_variables_initializer())\n", 344 | "saver = tf.train.Saver()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": 10, 350 | "metadata": { 351 | "collapsed": true 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "# f = open('data/word_embedding', 'rb')\n", 356 | "# word_embedding = pickle.load(f)\n", 357 | "# f.close()\n", 358 | "\n", 359 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 360 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 361 | ] 362 | }, 363 | { 364 | "cell_type": "code", 365 | "execution_count": 19, 366 | "metadata": {}, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | "text": [ 372 | "INFO:tensorflow:Restoring parameters from checkpoint/modelv1/model\n" 373 | ] 374 | } 375 | ], 376 | "source": [ 377 | "model = tf.train.latest_checkpoint(model_dir)\n", 378 | "saver.restore(sess, model)" 379 | ] 380 | }, 381 | { 382 | "cell_type": "code", 383 | "execution_count": 13, 384 | "metadata": { 385 | "collapsed": true, 386 | "scrolled": true 387 | }, 388 | "outputs": [], 389 | "source": [ 390 | "# latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 391 | "# word_embedding_saver.restore(sess, latest_embd)" 392 | ] 393 | }, 394 | { 395 | "cell_type": "code", 396 | "execution_count": 20, 397 | "metadata": { 398 | "collapsed": true, 399 | "scrolled": true 400 | }, 401 | "outputs": [], 402 | "source": [ 403 | "num_epochs = 10\n", 404 | "for i in range(num_epochs):\n", 405 | " for j in range(num_batches):\n", 406 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 407 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 408 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 409 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 410 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 411 | " \n", 412 | " feed_dict = {\n", 413 | " path_length:path_dict,\n", 414 | " word_ids:word_dict,\n", 415 | " pos_ids:pos_dict,\n", 416 | " dep_ids:dep_dict,\n", 417 | " y:y_dict}\n", 418 | " _, loss, step = 
sess.run([optimizer, total_loss, global_step], feed_dict)\n", 419 | " if step%10==0:\n", 420 | " print(\"Step:\", step, \"loss:\",loss)\n", 421 | " if step % 1000 == 0:\n", 422 | " saver.save(sess, model_dir + '/model')\n", 423 | " print(\"Saved Model\")" 424 | ] 425 | }, 426 | { 427 | "cell_type": "code", 428 | "execution_count": 21, 429 | "metadata": { 430 | "scrolled": false 431 | }, 432 | "outputs": [ 433 | { 434 | "name": "stdout", 435 | "output_type": "stream", 436 | "text": [ 437 | "training accuracy 99.2625\n" 438 | ] 439 | } 440 | ], 441 | "source": [ 442 | "# training accuracy\n", 443 | "all_predictions = []\n", 444 | "for j in range(num_batches):\n", 445 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 446 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 447 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 448 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 449 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 450 | "\n", 451 | " feed_dict = {\n", 452 | " path_length:path_dict,\n", 453 | " word_ids:word_dict,\n", 454 | " pos_ids:pos_dict,\n", 455 | " dep_ids:dep_dict,\n", 456 | " y:y_dict}\n", 457 | " batch_predictions = sess.run(predictions, feed_dict)\n", 458 | " all_predictions.append(batch_predictions)\n", 459 | "\n", 460 | "y_pred = []\n", 461 | "for i in range(num_batches):\n", 462 | " for pred in all_predictions[i]:\n", 463 | " y_pred.append(pred)\n", 464 | "\n", 465 | "count = 0\n", 466 | "for i in range(batch_size*num_batches):\n", 467 | " count += y_pred[i]==rel_ids[i]\n", 468 | "accuracy = count/(batch_size*num_batches) * 100\n", 469 | "\n", 470 | "print(\"training accuracy\", accuracy)" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 22, 476 | "metadata": { 477 | "collapsed": true 478 | }, 479 | "outputs": [], 480 | "source": [ 481 | "f = open(data_dir + '/test_paths', 'rb')\n", 482 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 483 | "f.close()\n", 484 | "\n", 485 | "relations = []\n", 486 | "for line in open(data_dir + '/test_relations.txt'):\n", 487 | " relations.append(line.strip().split()[0])\n", 488 | "\n", 489 | "length = len(word_p1)\n", 490 | "num_batches = int(length/batch_size)\n", 491 | "\n", 492 | "for i in range(length):\n", 493 | " for j, word in enumerate(word_p1[i]):\n", 494 | " word = word.lower()\n", 495 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 496 | " for k, word in enumerate(word_p2[i]):\n", 497 | " word = word.lower()\n", 498 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 499 | " for l, d in enumerate(dep_p1[i]):\n", 500 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 501 | " for m, d in enumerate(dep_p2[i]):\n", 502 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 503 | "\n", 504 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 505 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 506 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 507 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 508 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 509 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 510 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 511 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 512 | 
"path2_len = np.array([len(w) for w in word_p2])\n", 513 | "\n", 514 | "for i in range(length):\n", 515 | " for j, w in enumerate(word_p1[i]):\n", 516 | " word_p1_ids[i][j] = word2id[w]\n", 517 | " for j, w in enumerate(word_p2[i]):\n", 518 | " word_p2_ids[i][j] = word2id[w]\n", 519 | " for j, w in enumerate(pos_p1[i]):\n", 520 | " pos_p1_ids[i][j] = pos_tag(w)\n", 521 | " for j, w in enumerate(pos_p2[i]):\n", 522 | " pos_p2_ids[i][j] = pos_tag(w)\n", 523 | " for j, w in enumerate(dep_p1[i]):\n", 524 | " dep_p1_ids[i][j] = dep2id[w]\n", 525 | " for j, w in enumerate(dep_p2[i]):\n", 526 | " dep_p2_ids[i][j] = dep2id[w]" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": 23, 532 | "metadata": {}, 533 | "outputs": [ 534 | { 535 | "name": "stdout", 536 | "output_type": "stream", 537 | "text": [ 538 | "test accuracy 61.4022140221\n" 539 | ] 540 | } 541 | ], 542 | "source": [ 543 | "# test \n", 544 | "all_predictions = []\n", 545 | "for j in range(num_batches):\n", 546 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 547 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 548 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 549 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 550 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 551 | "\n", 552 | " feed_dict = {\n", 553 | " path_length:path_dict,\n", 554 | " word_ids:word_dict,\n", 555 | " pos_ids:pos_dict,\n", 556 | " dep_ids:dep_dict,\n", 557 | " y:y_dict}\n", 558 | " batch_predictions = sess.run(predictions, feed_dict)\n", 559 | " all_predictions.append(batch_predictions)\n", 560 | "\n", 561 | "y_pred = []\n", 562 | "for i in range(num_batches):\n", 563 | " for pred in all_predictions[i]:\n", 564 | " y_pred.append(pred)\n", 565 | "\n", 566 | "count = 0\n", 567 | "for i in range(batch_size*num_batches):\n", 568 | " count += y_pred[i]==rel_ids[i]\n", 569 | "accuracy = count/(batch_size*num_batches) * 100\n", 570 | "\n", 571 | "print(\"test accuracy\", accuracy)" 572 | ] 573 | } 574 | ], 575 | "metadata": { 576 | "kernelspec": { 577 | "display_name": "Python 3", 578 | "language": "python", 579 | "name": "python3" 580 | }, 581 | "language_info": { 582 | "codemirror_mode": { 583 | "name": "ipython", 584 | "version": 3 585 | }, 586 | "file_extension": ".py", 587 | "mimetype": "text/x-python", 588 | "name": "python", 589 | "nbconvert_exporter": "python", 590 | "pygments_lexer": "ipython3", 591 | "version": "3.5.4" 592 | } 593 | }, 594 | "nbformat": 4, 595 | "nbformat_minor": 2 596 | } 597 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv6.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/modelv6'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | 
"word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "word_state_size = 100\n", 30 | "other_state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 10\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 71 | "\n", 72 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 73 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 74 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 75 | "\n", 76 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 77 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 78 | "\n", 79 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 80 | "\n", 81 | "with tf.variable_scope(\"word_lstm1\"):\n", 82 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 83 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 84 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 85 | "\n", 86 | "with tf.variable_scope(\"word_lstm2\"):\n", 87 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 88 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 89 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 90 | "\n", 91 | "with tf.variable_scope(\"pos_lstm1\"):\n", 92 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 93 | " state_series, 
current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 94 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 95 | "\n", 96 | "with tf.variable_scope(\"pos_lstm2\"):\n", 97 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 98 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 99 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 100 | "\n", 101 | "with tf.variable_scope(\"dep_lstm1\"):\n", 102 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 103 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 104 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 105 | "\n", 106 | "with tf.variable_scope(\"dep_lstm2\"):\n", 107 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 108 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 109 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 110 | "\n", 111 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 112 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 113 | "\n", 114 | "state_series = tf.concat([state_series1, state_series2], 1)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": 3, 120 | "metadata": { 121 | "collapsed": true 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "with tf.name_scope(\"hidden_layer\"):\n", 126 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 127 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 128 | " y_hidden_layer = tf.nn.relu(tf.matmul(state_series, W) + b)\n", 129 | "\n", 130 | "with tf.name_scope(\"softmax_layer\"):\n", 131 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 132 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 133 | " logits = tf.matmul(y_hidden_layer, W) + b\n", 134 | " predictions = tf.argmax(logits, 1)\n", 135 | "\n", 136 | "tv_all = tf.trainable_variables()\n", 137 | "tv_regu = []\n", 138 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 139 | "for t in tv_all:\n", 140 | " if t.name not in non_reg:\n", 141 | " if(t.name.find('biases')==-1):\n", 142 | " tv_regu.append(t)\n", 143 | "\n", 144 | "with tf.name_scope(\"loss\"):\n", 145 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 146 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 147 | " total_loss = loss + l2_loss\n", 148 | "\n", 149 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 150 | "\n", 151 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 152 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 153 | ] 154 | }, 155 | { 156 | "cell_type": "code", 157 | "execution_count": 4, 158 | "metadata": { 159 | "collapsed": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 164 | "vocab = pickle.load(f)\n", 165 | "f.close()\n", 
166 | "\n", 167 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 168 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 169 | "\n", 170 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 171 | "word2id[unknown_token] = word_vocab_size -1\n", 172 | "id2word[word_vocab_size-1] = unknown_token\n", 173 | "\n", 174 | "pos_tags_vocab = []\n", 175 | "for line in open(data_dir + '/pos_tags.txt'):\n", 176 | " pos_tags_vocab.append(line.strip())\n", 177 | "\n", 178 | "dep_vocab = []\n", 179 | "for line in open(data_dir + '/dependency_types.txt'):\n", 180 | " dep_vocab.append(line.strip())\n", 181 | "\n", 182 | "relation_vocab = []\n", 183 | "for line in open(data_dir + '/relation_types.txt'):\n", 184 | " relation_vocab.append(line.strip())\n", 185 | "\n", 186 | "\n", 187 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 188 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 189 | "\n", 190 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 191 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 192 | "\n", 193 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 194 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 195 | "\n", 196 | "pos_tag2id['OTH'] = 9\n", 197 | "id2pos_tag[9] = 'OTH'\n", 198 | "\n", 199 | "dep2id['OTH'] = 20\n", 200 | "id2dep[20] = 'OTH'\n", 201 | "\n", 202 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 203 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 204 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 205 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 206 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 207 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 208 | "\n", 209 | "def pos_tag(x):\n", 210 | " if x in JJ_pos_tags:\n", 211 | " return pos_tag2id['JJ']\n", 212 | " if x in NN_pos_tags:\n", 213 | " return pos_tag2id['NN']\n", 214 | " if x in RB_pos_tags:\n", 215 | " return pos_tag2id['RB']\n", 216 | " if x in PRP_pos_tags:\n", 217 | " return pos_tag2id['PRP']\n", 218 | " if x in VB_pos_tags:\n", 219 | " return pos_tag2id['VB']\n", 220 | " if x in _pos_tags:\n", 221 | " return pos_tag2id[x]\n", 222 | " else:\n", 223 | " return 9" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 5, 229 | "metadata": { 230 | "collapsed": true 231 | }, 232 | "outputs": [], 233 | "source": [ 234 | "sess = tf.Session()\n", 235 | "sess.run(tf.global_variables_initializer())\n", 236 | "saver = tf.train.Saver()" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 6, 242 | "metadata": { 243 | "collapsed": true 244 | }, 245 | "outputs": [], 246 | "source": [ 247 | "# f = open('data/word_embedding', 'rb')\n", 248 | "# word_embedding = pickle.load(f)\n", 249 | "# f.close()\n", 250 | "\n", 251 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 252 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": 7, 258 | "metadata": { 259 | "collapsed": true 260 | }, 261 | "outputs": [], 262 | "source": [ 263 | "# model = tf.train.latest_checkpoint(model_dir)\n", 264 | "# saver.restore(sess, model)" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 8, 270 | "metadata": { 271 | "scrolled": true 272 | }, 273 | "outputs": [ 274 | { 275 | "name": "stdout", 276 | "output_type": "stream", 277 | "text": [ 278 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | 
"latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 284 | "word_embedding_saver.restore(sess, latest_embd)" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": null, 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "f = open(data_dir + '/train_paths', 'rb')\n", 296 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 297 | "f.close()\n", 298 | "\n", 299 | "relations = []\n", 300 | "for line in open(data_dir + '/train_relations.txt'):\n", 301 | " relations.append(line.strip().split()[1])\n", 302 | "\n", 303 | "length = len(word_p1)\n", 304 | "num_batches = int(length/batch_size)\n", 305 | "\n", 306 | "for i in range(length):\n", 307 | " for j, word in enumerate(word_p1[i]):\n", 308 | " word = word.lower()\n", 309 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 310 | " for k, word in enumerate(word_p2[i]):\n", 311 | " word = word.lower()\n", 312 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 313 | " for l, d in enumerate(dep_p1[i]):\n", 314 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 315 | " for m, d in enumerate(dep_p2[i]):\n", 316 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 317 | "\n", 318 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 319 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 320 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 321 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 322 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 323 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 324 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 325 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 326 | "path2_len = np.array([len(w) for w in word_p2])\n", 327 | "\n", 328 | "for i in range(length):\n", 329 | " for j, w in enumerate(word_p1[i]):\n", 330 | " word_p1_ids[i][j] = word2id[w]\n", 331 | " for j, w in enumerate(word_p2[i]):\n", 332 | " word_p2_ids[i][j] = word2id[w]\n", 333 | " for j, w in enumerate(pos_p1[i]):\n", 334 | " pos_p1_ids[i][j] = pos_tag(w)\n", 335 | " for j, w in enumerate(pos_p2[i]):\n", 336 | " pos_p2_ids[i][j] = pos_tag(w)\n", 337 | " for j, w in enumerate(dep_p1[i]):\n", 338 | " dep_p1_ids[i][j] = dep2id[w]\n", 339 | " for j, w in enumerate(dep_p2[i]):\n", 340 | " dep_p2_ids[i][j] = dep2id[w]" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "scrolled": true 348 | }, 349 | "outputs": [ 350 | { 351 | "name": "stdout", 352 | "output_type": "stream", 353 | "text": [ 354 | "Epoch: 1 Step: 800 loss: 2.85308745444\n", 355 | "Saved Model\n", 356 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 357 | "Saved Model\n", 358 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 359 | "Saved Model\n", 360 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 361 | "Saved Model\n", 362 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 363 | "Saved Model\n", 364 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 365 | "Saved Model\n", 366 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 367 | "Saved Model\n", 368 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 369 | "Saved Model\n", 370 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 371 | "Saved Model\n", 372 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 373 | "Saved Model\n", 374 | "Epoch: 11 Step: 8800 loss: 2.67341704309\n", 375 | "Saved Model\n", 376 | "Epoch: 12 Step: 9600 loss: 2.67321884066\n", 377 | "Saved Model\n", 378 
| "Epoch: 13 Step: 10400 loss: 2.67310401961\n", 379 | "Saved Model\n", 380 | "Epoch: 14 Step: 11200 loss: 2.67295600712\n", 381 | "Saved Model\n", 382 | "Epoch: 15 Step: 12000 loss: 2.67288722694\n", 383 | "Saved Model\n", 384 | "Epoch: 16 Step: 12800 loss: 2.67282888472\n", 385 | "Saved Model\n", 386 | "Epoch: 17 Step: 13600 loss: 2.67277920395\n", 387 | "Saved Model\n", 388 | "Epoch: 18 Step: 14400 loss: 2.6727619794\n", 389 | "Saved Model\n", 390 | "Epoch: 19 Step: 15200 loss: 2.67268569678\n", 391 | "Saved Model\n", 392 | "Epoch: 20 Step: 16000 loss: 2.67266457796\n", 393 | "Saved Model\n", 394 | "Epoch: 21 Step: 16800 loss: 2.67263956338\n", 395 | "Saved Model\n", 396 | "Epoch: 22 Step: 17600 loss: 2.67261722207\n", 397 | "Saved Model\n", 398 | "Epoch: 23 Step: 18400 loss: 2.67261824235\n", 399 | "Saved Model\n", 400 | "Epoch: 24 Step: 19200 loss: 2.67256126881\n", 401 | "Saved Model\n", 402 | "Epoch: 25 Step: 20000 loss: 2.6725519672\n", 403 | "Saved Model\n", 404 | "Epoch: 26 Step: 20800 loss: 2.67253558069\n", 405 | "Saved Model\n", 406 | "Epoch: 27 Step: 21600 loss: 2.67252239197\n", 407 | "Saved Model\n", 408 | "Epoch: 28 Step: 22400 loss: 2.67252858594\n", 409 | "Saved Model\n", 410 | "Epoch: 29 Step: 23200 loss: 2.67248077154\n", 411 | "Saved Model\n", 412 | "Epoch: 30 Step: 24000 loss: 2.67247578681\n", 413 | "Saved Model\n", 414 | "Epoch: 31 Step: 24800 loss: 2.67246250227\n", 415 | "Saved Model\n", 416 | "Epoch: 32 Step: 25600 loss: 2.67245363146\n", 417 | "Saved Model\n", 418 | "Epoch: 33 Step: 26400 loss: 2.67246143714\n", 419 | "Saved Model\n", 420 | "Epoch: 34 Step: 27200 loss: 2.6724195759\n", 421 | "Saved Model\n", 422 | "Epoch: 35 Step: 28000 loss: 2.67241657913\n", 423 | "Saved Model\n", 424 | "Epoch: 36 Step: 28800 loss: 2.67240460932\n", 425 | "Saved Model\n", 426 | "Epoch: 37 Step: 29600 loss: 2.67239822775\n", 427 | "Saved Model\n", 428 | "Epoch: 38 Step: 30400 loss: 2.6724064\n", 429 | "Saved Model\n" 430 | ] 431 | } 432 | ], 433 | "source": [ 434 | "num_epochs = 60\n", 435 | "for i in range(num_epochs):\n", 436 | " loss_per_epoch = 0\n", 437 | " for j in range(num_batches):\n", 438 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 439 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 440 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 441 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 442 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 443 | " \n", 444 | " feed_dict = {\n", 445 | " path_length:path_dict,\n", 446 | " word_ids:word_dict,\n", 447 | " pos_ids:pos_dict,\n", 448 | " dep_ids:dep_dict,\n", 449 | " y:y_dict}\n", 450 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 451 | " loss_per_epoch +=_loss\n", 452 | " if (j+1)%num_batches==0:\n", 453 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 454 | " saver.save(sess, model_dir + '/model')\n", 455 | " print(\"Saved Model\")" 456 | ] 457 | }, 458 | { 459 | "cell_type": "code", 460 | "execution_count": null, 461 | "metadata": { 462 | "collapsed": true, 463 | "scrolled": false 464 | }, 465 | "outputs": [], 466 | "source": [ 467 | "# training accuracy\n", 468 | "all_predictions = []\n", 469 | "for j in range(num_batches):\n", 470 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], 
path2_len[j*batch_size:(j+1)*batch_size]]\n", 471 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 472 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 473 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 474 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 475 | "\n", 476 | " feed_dict = {\n", 477 | " path_length:path_dict,\n", 478 | " word_ids:word_dict,\n", 479 | " pos_ids:pos_dict,\n", 480 | " dep_ids:dep_dict,\n", 481 | " y:y_dict}\n", 482 | " batch_predictions = sess.run(predictions, feed_dict)\n", 483 | " all_predictions.append(batch_predictions)\n", 484 | "\n", 485 | "y_pred = []\n", 486 | "for i in range(num_batches):\n", 487 | " for pred in all_predictions[i]:\n", 488 | " y_pred.append(pred)\n", 489 | "\n", 490 | "count = 0\n", 491 | "for i in range(batch_size*num_batches):\n", 492 | " count += y_pred[i]==rel_ids[i]\n", 493 | "accuracy = count/(batch_size*num_batches) * 100\n", 494 | "\n", 495 | "print(\"training accuracy\", accuracy)" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 23, 501 | "metadata": {}, 502 | "outputs": [ 503 | { 504 | "name": "stdout", 505 | "output_type": "stream", 506 | "text": [ 507 | "test accuracy 61.4022140221\n" 508 | ] 509 | } 510 | ], 511 | "source": [ 512 | "f = open(data_dir + '/test_paths', 'rb')\n", 513 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 514 | "f.close()\n", 515 | "\n", 516 | "relations = []\n", 517 | "for line in open(data_dir + '/test_relations.txt'):\n", 518 | " relations.append(line.strip().split()[0])\n", 519 | "\n", 520 | "length = len(word_p1)\n", 521 | "num_batches = int(length/batch_size)\n", 522 | "\n", 523 | "for i in range(length):\n", 524 | " for j, word in enumerate(word_p1[i]):\n", 525 | " word = word.lower()\n", 526 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 527 | " for k, word in enumerate(word_p2[i]):\n", 528 | " word = word.lower()\n", 529 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 530 | " for l, d in enumerate(dep_p1[i]):\n", 531 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 532 | " for m, d in enumerate(dep_p2[i]):\n", 533 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 534 | "\n", 535 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 536 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 537 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 538 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 539 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 540 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 541 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 542 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 543 | "path2_len = np.array([len(w) for w in word_p2])\n", 544 | "\n", 545 | "for i in range(length):\n", 546 | " for j, w in enumerate(word_p1[i]):\n", 547 | " word_p1_ids[i][j] = word2id[w]\n", 548 | " for j, w in enumerate(word_p2[i]):\n", 549 | " word_p2_ids[i][j] = word2id[w]\n", 550 | " for j, w in enumerate(pos_p1[i]):\n", 551 | " pos_p1_ids[i][j] = pos_tag(w)\n", 552 | " for j, w in enumerate(pos_p2[i]):\n", 553 | " pos_p2_ids[i][j] = pos_tag(w)\n", 554 | " for j, w in enumerate(dep_p1[i]):\n", 555 | " dep_p1_ids[i][j] = dep2id[w]\n", 556 | " for j, w in enumerate(dep_p2[i]):\n", 557 | " dep_p2_ids[i][j] = 
dep2id[w]\n", 558 | "\n", 559 | "# test \n", 560 | "all_predictions = []\n", 561 | "for j in range(num_batches):\n", 562 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 563 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 564 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 565 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 566 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 567 | "\n", 568 | " feed_dict = {\n", 569 | " path_length:path_dict,\n", 570 | " word_ids:word_dict,\n", 571 | " pos_ids:pos_dict,\n", 572 | " dep_ids:dep_dict,\n", 573 | " y:y_dict}\n", 574 | " batch_predictions = sess.run(predictions, feed_dict)\n", 575 | " all_predictions.append(batch_predictions)\n", 576 | "\n", 577 | "y_pred = []\n", 578 | "for i in range(num_batches):\n", 579 | " for pred in all_predictions[i]:\n", 580 | " y_pred.append(pred)\n", 581 | "\n", 582 | "count = 0\n", 583 | "for i in range(batch_size*num_batches):\n", 584 | " count += y_pred[i]==rel_ids[i]\n", 585 | "accuracy = count/(batch_size*num_batches) * 100\n", 586 | "\n", 587 | "print(\"test accuracy\", accuracy)" 588 | ] 589 | } 590 | ], 591 | "metadata": { 592 | "kernelspec": { 593 | "display_name": "Python 3", 594 | "language": "python", 595 | "name": "python3" 596 | }, 597 | "language_info": { 598 | "codemirror_mode": { 599 | "name": "ipython", 600 | "version": 3 601 | }, 602 | "file_extension": ".py", 603 | "mimetype": "text/x-python", 604 | "name": "python", 605 | "nbconvert_exporter": "python", 606 | "pygments_lexer": "ipython3", 607 | "version": "3.5.2" 608 | } 609 | }, 610 | "nbformat": 4, 611 | "nbformat_minor": 2 612 | } 613 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv7.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/modelv7'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | "word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "word_state_size = 100\n", 30 | "other_state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 10\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " 
dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 71 | "\n", 72 | "with tf.name_scope(\"word_dropout\"):\n", 73 | " embedded_word_drop = tf.nn.dropout(embedded_word, 0.5)\n", 74 | " \n", 75 | "with tf.name_scope(\"pos_dropout\"):\n", 76 | " embedded_pos_drop = tf.nn.dropout(embedded_word, 0.5)\n", 77 | " \n", 78 | "with tf.name_scope(\"dep_dropout\"):\n", 79 | " embedded_dep_drop = tf.nn.dropout(embedded_word, 0.5)\n", 80 | "\n", 81 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 82 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 83 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 84 | "\n", 85 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 86 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 87 | "\n", 88 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 89 | "\n", 90 | "with tf.variable_scope(\"word_lstm1\"):\n", 91 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 92 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 93 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 94 | "\n", 95 | "with tf.variable_scope(\"word_lstm2\"):\n", 96 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 97 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 98 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 99 | "\n", 100 | "with tf.variable_scope(\"pos_lstm1\"):\n", 101 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 102 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_drop[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 103 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 104 | "\n", 105 | "with tf.variable_scope(\"pos_lstm2\"):\n", 106 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 107 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_drop[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 108 | " 
state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 109 | "\n", 110 | "with tf.variable_scope(\"dep_lstm1\"):\n", 111 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 112 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_drop[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 113 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 114 | "\n", 115 | "with tf.variable_scope(\"dep_lstm2\"):\n", 116 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 117 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_drop[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 118 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 119 | "\n", 120 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 121 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 122 | "\n", 123 | "state_series = tf.concat([state_series1, state_series2], 1)" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 3, 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "data": { 133 | "text/plain": [ 134 | "" 135 | ] 136 | }, 137 | "execution_count": 3, 138 | "metadata": {}, 139 | "output_type": "execute_result" 140 | } 141 | ], 142 | "source": [ 143 | "state_series" 144 | ] 145 | }, 146 | { 147 | "cell_type": "code", 148 | "execution_count": 4, 149 | "metadata": { 150 | "collapsed": true 151 | }, 152 | "outputs": [], 153 | "source": [ 154 | "with tf.name_scope(\"hidden_layer\"):\n", 155 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 156 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 157 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 158 | "\n", 159 | "with tf.name_scope(\"dropout\"):\n", 160 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 161 | "\n", 162 | "with tf.name_scope(\"softmax_layer\"):\n", 163 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 164 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 165 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 166 | " predictions = tf.argmax(logits, 1)\n", 167 | "\n", 168 | "tv_all = tf.trainable_variables()\n", 169 | "tv_regu = []\n", 170 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 171 | "for t in tv_all:\n", 172 | " if t.name not in non_reg:\n", 173 | " if(t.name.find('biases')==-1):\n", 174 | " tv_regu.append(t)\n", 175 | "\n", 176 | "with tf.name_scope(\"loss\"):\n", 177 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 178 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 179 | " total_loss = loss + l2_loss\n", 180 | "\n", 181 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 182 | "\n", 183 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 184 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 5, 190 | "metadata": { 191 | "collapsed": true 192 | }, 193 | "outputs": [], 194 | "source": [ 195 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 196 | "vocab = pickle.load(f)\n", 197 | "f.close()\n", 198 | "\n", 
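As a shape check on the graph above: each of the six LSTMs (word, POS and dependency channels for the two sub-paths) is max-pooled over time to a 100-dimensional vector, and the concatenation of all six gives the 600-dimensional input that the `[600, 100]` hidden layer expects. A small NumPy sketch of that pooling-and-concatenation step with random stand-in values:

```python
import numpy as np

batch_size, max_len_path, state_size = 10, 10, 100

# One [batch, time, state] array per LSTM; in the notebook, outputs past a path's
# length are zeroed via dynamic_rnn's sequence_length argument.
channel_states = [np.random.randn(batch_size, max_len_path, state_size) for _ in range(6)]

# Max-pool over the time axis, mirroring tf.reduce_max(state_series, axis=1).
pooled = [s.max(axis=1) for s in channel_states]        # each [batch, 100]

# Concatenate word/POS/dep features for both sub-paths.
features = np.concatenate(pooled, axis=1)
print(features.shape)                                   # (10, 600)
```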
199 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 200 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 201 | "\n", 202 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 203 | "word2id[unknown_token] = word_vocab_size -1\n", 204 | "id2word[word_vocab_size-1] = unknown_token\n", 205 | "\n", 206 | "pos_tags_vocab = []\n", 207 | "for line in open(data_dir + '/pos_tags.txt'):\n", 208 | " pos_tags_vocab.append(line.strip())\n", 209 | "\n", 210 | "dep_vocab = []\n", 211 | "for line in open(data_dir + '/dependency_types.txt'):\n", 212 | " dep_vocab.append(line.strip())\n", 213 | "\n", 214 | "relation_vocab = []\n", 215 | "for line in open(data_dir + '/relation_types.txt'):\n", 216 | " relation_vocab.append(line.strip())\n", 217 | "\n", 218 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 219 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 220 | "\n", 221 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 222 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 223 | "\n", 224 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 225 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 226 | "\n", 227 | "pos_tag2id['OTH'] = 9\n", 228 | "id2pos_tag[9] = 'OTH'\n", 229 | "\n", 230 | "dep2id['OTH'] = 20\n", 231 | "id2dep[20] = 'OTH'\n", 232 | "\n", 233 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 234 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 235 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 236 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 237 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 238 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 239 | "\n", 240 | "def pos_tag(x):\n", 241 | " if x in JJ_pos_tags:\n", 242 | " return pos_tag2id['JJ']\n", 243 | " if x in NN_pos_tags:\n", 244 | " return pos_tag2id['NN']\n", 245 | " if x in RB_pos_tags:\n", 246 | " return pos_tag2id['RB']\n", 247 | " if x in PRP_pos_tags:\n", 248 | " return pos_tag2id['PRP']\n", 249 | " if x in VB_pos_tags:\n", 250 | " return pos_tag2id['VB']\n", 251 | " if x in _pos_tags:\n", 252 | " return pos_tag2id[x]\n", 253 | " else:\n", 254 | " return 9" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 6, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "sess = tf.Session()\n", 266 | "sess.run(tf.global_variables_initializer())\n", 267 | "saver = tf.train.Saver()" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 7, 273 | "metadata": { 274 | "collapsed": true 275 | }, 276 | "outputs": [], 277 | "source": [ 278 | "# f = open('data/word_embedding', 'rb')\n", 279 | "# word_embedding = pickle.load(f)\n", 280 | "# f.close()\n", 281 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 282 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 8, 288 | "metadata": { 289 | "collapsed": true 290 | }, 291 | "outputs": [], 292 | "source": [ 293 | "# model = tf.train.latest_checkpoint(model_dir)\n", 294 | "# saver.restore(sess, model)" 295 | ] 296 | }, 297 | { 298 | "cell_type": "code", 299 | "execution_count": 9, 300 | "metadata": { 301 | "scrolled": true 302 | }, 303 | "outputs": [ 304 | { 305 | "name": "stdout", 306 | "output_type": "stream", 307 | "text": [ 308 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 309 | ] 310 | } 311 | ], 312 | "source": [ 313 | "latest_embd = 
tf.train.latest_checkpoint(word_embd_dir)\n", 314 | "word_embedding_saver.restore(sess, latest_embd)" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 8, 320 | "metadata": { 321 | "collapsed": true 322 | }, 323 | "outputs": [], 324 | "source": [ 325 | "f = open(data_dir + '/train_paths', 'rb')\n", 326 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 327 | "f.close()\n", 328 | "\n", 329 | "relations = []\n", 330 | "for line in open(data_dir + '/train_relations.txt'):\n", 331 | " relations.append(line.strip().split()[1])\n", 332 | "\n", 333 | "length = len(word_p1)\n", 334 | "num_batches = int(length/batch_size)\n", 335 | "\n", 336 | "for i in range(length):\n", 337 | " for j, word in enumerate(word_p1[i]):\n", 338 | " word = word.lower()\n", 339 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 340 | " for k, word in enumerate(word_p2[i]):\n", 341 | " word = word.lower()\n", 342 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 343 | " for l, d in enumerate(dep_p1[i]):\n", 344 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 345 | " for m, d in enumerate(dep_p2[i]):\n", 346 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 347 | "\n", 348 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 349 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 350 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 351 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 352 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 353 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 354 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 355 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 356 | "path2_len = np.array([len(w) for w in word_p2])\n", 357 | "\n", 358 | "for i in range(length):\n", 359 | " for j, w in enumerate(word_p1[i]):\n", 360 | " word_p1_ids[i][j] = word2id[w]\n", 361 | " for j, w in enumerate(word_p2[i]):\n", 362 | " word_p2_ids[i][j] = word2id[w]\n", 363 | " for j, w in enumerate(pos_p1[i]):\n", 364 | " pos_p1_ids[i][j] = pos_tag(w)\n", 365 | " for j, w in enumerate(pos_p2[i]):\n", 366 | " pos_p2_ids[i][j] = pos_tag(w)\n", 367 | " for j, w in enumerate(dep_p1[i]):\n", 368 | " dep_p1_ids[i][j] = dep2id[w]\n", 369 | " for j, w in enumerate(dep_p2[i]):\n", 370 | " dep_p2_ids[i][j] = dep2id[w]" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 11, 376 | "metadata": { 377 | "scrolled": true 378 | }, 379 | "outputs": [ 380 | { 381 | "name": "stdout", 382 | "output_type": "stream", 383 | "text": [ 384 | "epoch: 0\n", 385 | "Step: 800 loss: 2.91281905636\n", 386 | "epoch: 1\n", 387 | "Saved Model\n", 388 | "Step: 1600 loss: 1.93908668235\n", 389 | "epoch: 2\n", 390 | "Saved Model\n", 391 | "Step: 2400 loss: 1.45170823216\n", 392 | "epoch: 3\n", 393 | "Saved Model\n", 394 | "Step: 3200 loss: 1.18255942896\n", 395 | "epoch: 4\n", 396 | "Step: 4000 loss: 1.00360578123\n", 397 | "Saved Model\n", 398 | "epoch: 5\n", 399 | "Step: 4800 loss: 0.854295852538\n", 400 | "epoch: 6\n", 401 | "Saved Model\n", 402 | "Step: 5600 loss: 0.748602524679\n", 403 | "epoch: 7\n", 404 | "Saved Model\n", 405 | "Step: 6400 loss: 0.661906255111\n", 406 | "epoch: 8\n", 407 | "Saved Model\n", 408 | "Step: 7200 loss: 0.587379012275\n", 409 | "epoch: 9\n", 410 | "Step: 8000 loss: 0.531537927147\n", 411 | "Saved Model\n", 412 | "epoch: 10\n", 413 | "Step: 8800 loss: 0.484521641694\n", 414 | "epoch: 11\n", 415 | "Saved 
Model\n", 416 | "Step: 9600 loss: 0.444365512617\n", 417 | "epoch: 12\n", 418 | "Saved Model\n", 419 | "Step: 10400 loss: 0.415288321041\n", 420 | "epoch: 13\n", 421 | "Saved Model\n", 422 | "Step: 11200 loss: 0.384827776505\n", 423 | "epoch: 14\n", 424 | "Step: 12000 loss: 0.361082672933\n", 425 | "Saved Model\n", 426 | "epoch: 15\n", 427 | "Step: 12800 loss: 0.338339183582\n", 428 | "epoch: 16\n", 429 | "Saved Model\n", 430 | "Step: 13600 loss: 0.319484538799\n", 431 | "epoch: 17\n", 432 | "Saved Model\n", 433 | "Step: 14400 loss: 0.297788869962\n", 434 | "epoch: 18\n", 435 | "Saved Model\n", 436 | "Step: 15200 loss: 0.28938733831\n", 437 | "epoch: 19\n", 438 | "Step: 16000 loss: 0.27373408448\n", 439 | "Saved Model\n" 440 | ] 441 | } 442 | ], 443 | "source": [ 444 | "num_epochs = 60\n", 445 | "for i in range(num_epochs):\n", 446 | " loss_per_epoch = 0\n", 447 | " for j in range(num_batches):\n", 448 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 449 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 450 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 451 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 452 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 453 | " \n", 454 | " feed_dict = {\n", 455 | " path_length:path_dict,\n", 456 | " word_ids:word_dict,\n", 457 | " pos_ids:pos_dict,\n", 458 | " dep_ids:dep_dict,\n", 459 | " y:y_dict}\n", 460 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 461 | " loss_per_epoch +=_loss\n", 462 | " if (j+1)%num_batches==0:\n", 463 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 464 | " saver.save(sess, model_dir + '/model')\n", 465 | " print(\"Saved Model\")" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 12, 471 | "metadata": { 472 | "scrolled": false 473 | }, 474 | "outputs": [ 475 | { 476 | "name": "stdout", 477 | "output_type": "stream", 478 | "text": [ 479 | "training accuracy 94.6875\n" 480 | ] 481 | } 482 | ], 483 | "source": [ 484 | "# training accuracy\n", 485 | "all_predictions = []\n", 486 | "for j in range(num_batches):\n", 487 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 488 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 489 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 490 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 491 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 492 | "\n", 493 | " feed_dict = {\n", 494 | " path_length:path_dict,\n", 495 | " word_ids:word_dict,\n", 496 | " pos_ids:pos_dict,\n", 497 | " dep_ids:dep_dict,\n", 498 | " y:y_dict}\n", 499 | " batch_predictions = sess.run(predictions, feed_dict)\n", 500 | " all_predictions.append(batch_predictions)\n", 501 | "\n", 502 | "y_pred = []\n", 503 | "for i in range(num_batches):\n", 504 | " for pred in all_predictions[i]:\n", 505 | " y_pred.append(pred)\n", 506 | "\n", 507 | "count = 0\n", 508 | "for i in range(batch_size*num_batches):\n", 509 | " count += y_pred[i]==rel_ids[i]\n", 510 | "accuracy = count/(batch_size*num_batches) * 100\n", 511 | "\n", 512 | "print(\"training accuracy\", accuracy)" 513 
| ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": 13, 518 | "metadata": {}, 519 | "outputs": [ 520 | { 521 | "name": "stdout", 522 | "output_type": "stream", 523 | "text": [ 524 | "test accuracy 60.036900369\n" 525 | ] 526 | } 527 | ], 528 | "source": [ 529 | "f = open(data_dir + '/test_paths', 'rb')\n", 530 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 531 | "f.close()\n", 532 | "\n", 533 | "relations = []\n", 534 | "for line in open(data_dir + '/test_relations.txt'):\n", 535 | " relations.append(line.strip().split()[0])\n", 536 | "\n", 537 | "length = len(word_p1)\n", 538 | "num_batches = int(length/batch_size)\n", 539 | "\n", 540 | "for i in range(length):\n", 541 | " for j, word in enumerate(word_p1[i]):\n", 542 | " word = word.lower()\n", 543 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 544 | " for k, word in enumerate(word_p2[i]):\n", 545 | " word = word.lower()\n", 546 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 547 | " for l, d in enumerate(dep_p1[i]):\n", 548 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 549 | " for m, d in enumerate(dep_p2[i]):\n", 550 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 551 | "\n", 552 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 553 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 554 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 555 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 556 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 557 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 558 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 559 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 560 | "path2_len = np.array([len(w) for w in word_p2])\n", 561 | "\n", 562 | "for i in range(length):\n", 563 | " for j, w in enumerate(word_p1[i]):\n", 564 | " word_p1_ids[i][j] = word2id[w]\n", 565 | " for j, w in enumerate(word_p2[i]):\n", 566 | " word_p2_ids[i][j] = word2id[w]\n", 567 | " for j, w in enumerate(pos_p1[i]):\n", 568 | " pos_p1_ids[i][j] = pos_tag(w)\n", 569 | " for j, w in enumerate(pos_p2[i]):\n", 570 | " pos_p2_ids[i][j] = pos_tag(w)\n", 571 | " for j, w in enumerate(dep_p1[i]):\n", 572 | " dep_p1_ids[i][j] = dep2id[w]\n", 573 | " for j, w in enumerate(dep_p2[i]):\n", 574 | " dep_p2_ids[i][j] = dep2id[w]\n", 575 | "\n", 576 | "# test predictions\n", 577 | "all_predictions = []\n", 578 | "for j in range(num_batches):\n", 579 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 580 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 581 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 582 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 583 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 584 | "\n", 585 | " feed_dict = {\n", 586 | " path_length:path_dict,\n", 587 | " word_ids:word_dict,\n", 588 | " pos_ids:pos_dict,\n", 589 | " dep_ids:dep_dict,\n", 590 | " y:y_dict}\n", 591 | " batch_predictions = sess.run(predictions, feed_dict)\n", 592 | " all_predictions.append(batch_predictions)\n", 593 | "\n", 594 | "y_pred = []\n", 595 | "for i in range(num_batches):\n", 596 | " for pred in all_predictions[i]:\n", 597 | " y_pred.append(pred)\n", 598 | "\n", 599 | "count = 0\n", 600 | "for i in 
range(batch_size*num_batches):\n", 601 | " count += y_pred[i]==rel_ids[i]\n", 602 | "accuracy = count/(batch_size*num_batches) * 100\n", 603 | "\n", 604 | "print(\"test accuracy\", accuracy)" 605 | ] 606 | } 607 | ], 608 | "metadata": { 609 | "kernelspec": { 610 | "display_name": "Python 3", 611 | "language": "python", 612 | "name": "python3" 613 | }, 614 | "language_info": { 615 | "codemirror_mode": { 616 | "name": "ipython", 617 | "version": 3 618 | }, 619 | "file_extension": ".py", 620 | "mimetype": "text/x-python", 621 | "name": "python", 622 | "nbconvert_exporter": "python", 623 | "pygments_lexer": "ipython3", 624 | "version": "3.5.2" 625 | } 626 | }, 627 | "nbformat": 4, 628 | "nbformat_minor": 2 629 | } 630 | -------------------------------------------------------------------------------- /LCA Shortest Path/modelv8.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd_wiki'\n", 20 | "pos_embd_dir = '../checkpoint/pos_embd'\n", 21 | "dep_embd_dir = '../checkpoint/dep_embd'\n", 22 | "model_dir = '../checkpoint/modelv8'\n", 23 | "\n", 24 | "word_embd_dim = 200\n", 25 | "pos_embd_dim = 25\n", 26 | "dep_embd_dim = 25\n", 27 | "word_vocab_size = 306561\n", 28 | "pos_vocab_size = 10\n", 29 | "dep_vocab_size = 21\n", 30 | "relation_classes = 19\n", 31 | "word_state_size = 100\n", 32 | "other_state_size = 100\n", 33 | "batch_size = 10\n", 34 | "channels = 3\n", 35 | "lambda_l2 = 0.0001\n", 36 | "max_len_path = 10\n", 37 | "starter_learning_rate = 0.001\n", 38 | "decay_steps = 2000\n", 39 | "decay_rate = 0.96" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 7, 45 | "metadata": { 46 | "collapsed": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "with tf.name_scope(\"input\"):\n", 51 | " path_length = tf.placeholder(tf.int32, shape=[2, batch_size], name=\"path1_length\")\n", 52 | " word_ids = tf.placeholder(tf.int32, shape=[2, batch_size, max_len_path], name=\"word_ids\")\n", 53 | " pos_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"pos_ids\")\n", 54 | " dep_ids = tf.placeholder(tf.int32, [2, batch_size, max_len_path], name=\"dep_ids\")\n", 55 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 56 | "\n", 57 | "with tf.name_scope(\"word_embedding\"):\n", 58 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 59 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 60 | " embedding_init = W.assign(embedding_placeholder)\n", 61 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 62 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 63 | "\n", 64 | "with tf.name_scope(\"pos_embedding\"):\n", 65 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 66 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 67 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 68 | "\n", 69 | "with tf.name_scope(\"dep_embedding\"):\n", 70 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, 
dep_embd_dim]), name=\"W\")\n", 71 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 72 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})\n", 73 | "\n", 74 | "with tf.name_scope(\"wordout\"):\n", 75 | " embedded_word_drop = tf.nn.dropout(embedded_word, 0.5)\n", 76 | "\n", 77 | "word_hidden_state = tf.zeros([batch_size, word_state_size], name='word_hidden_state')\n", 78 | "word_cell_state = tf.zeros([batch_size, word_state_size], name='word_cell_state')\n", 79 | "word_init_state = tf.contrib.rnn.LSTMStateTuple(word_hidden_state, word_cell_state)\n", 80 | "\n", 81 | "other_hidden_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"hidden_state\")\n", 82 | "other_cell_states = tf.zeros([channels-1, batch_size, other_state_size], name=\"cell_state\")\n", 83 | "\n", 84 | "other_init_states = [tf.contrib.rnn.LSTMStateTuple(other_hidden_states[i], other_cell_states[i]) for i in range(channels-1)]\n", 85 | "\n", 86 | "with tf.variable_scope(\"word_lstm1\"):\n", 87 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 88 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[0], sequence_length=path_length[0], initial_state=word_init_state)\n", 89 | " state_series_word1 = tf.reduce_max(state_series, axis=1)\n", 90 | "\n", 91 | "with tf.variable_scope(\"word_lstm2\"):\n", 92 | " cell = tf.contrib.rnn.BasicLSTMCell(word_state_size)\n", 93 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_drop[1], sequence_length=path_length[1], initial_state=word_init_state)\n", 94 | " state_series_word2 = tf.reduce_max(state_series, axis=1)\n", 95 | "\n", 96 | "with tf.variable_scope(\"pos_lstm1\"):\n", 97 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 98 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[0], sequence_length=path_length[0],initial_state=other_init_states[0])\n", 99 | " state_series_pos1 = tf.reduce_max(state_series, axis=1)\n", 100 | "\n", 101 | "with tf.variable_scope(\"pos_lstm2\"):\n", 102 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 103 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos[1], sequence_length=path_length[1],initial_state=other_init_states[0])\n", 104 | " state_series_pos2 = tf.reduce_max(state_series, axis=1)\n", 105 | "\n", 106 | "with tf.variable_scope(\"dep_lstm1\"):\n", 107 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 108 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[0], sequence_length=path_length[0], initial_state=other_init_states[1])\n", 109 | " state_series_dep1 = tf.reduce_max(state_series, axis=1)\n", 110 | "\n", 111 | "with tf.variable_scope(\"dep_lstm2\"):\n", 112 | " cell = tf.contrib.rnn.BasicLSTMCell(other_state_size)\n", 113 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep[1], sequence_length=path_length[1], initial_state=other_init_states[1])\n", 114 | " state_series_dep2 = tf.reduce_max(state_series, axis=1)\n", 115 | "\n", 116 | "state_series1 = tf.concat([state_series_word1, state_series_pos1, state_series_dep1], 1)\n", 117 | "state_series2 = tf.concat([state_series_word2, state_series_pos2, state_series_dep2], 1)\n", 118 | "\n", 119 | "state_series = tf.concat([state_series1, state_series2], 1)" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 8, 125 | "metadata": {}, 126 | "outputs": [ 127 | { 128 | "data": { 129 | "text/plain": [ 130 | "" 131 | ] 132 | }, 133 | "execution_count": 8, 134 | "metadata": {}, 
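modelv8 reuses the staircase learning-rate schedule of the earlier notebooks (starter rate 0.001, multiplied by 0.96 every 2000 steps). A tiny helper showing what `tf.train.exponential_decay(..., staircase=True)` works out to at a few steps; the step values are examples only:

```python
def decayed_lr(step, starter_lr=0.001, decay_rate=0.96, decay_steps=2000):
    """Staircase schedule equivalent to tf.train.exponential_decay(..., staircase=True)."""
    return starter_lr * decay_rate ** (step // decay_steps)

# With 800 batches per epoch, the rate first drops after 2.5 epochs.
for step in (0, 2000, 16000, 48000):
    print(step, decayed_lr(step))   # 0.001, 0.00096, ~0.00072, ~0.00038
```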
135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "state_series" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": 9, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "with tf.name_scope(\"hidden_layer\"):\n", 151 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 152 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 153 | " y_hidden_layer = tf.matmul(state_series, W) + b\n", 154 | "\n", 155 | "with tf.name_scope(\"dropout\"):\n", 156 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 157 | "\n", 158 | "with tf.name_scope(\"softmax_layer\"):\n", 159 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 160 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 161 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 162 | " predictions = tf.argmax(logits, 1)\n", 163 | "\n", 164 | "tv_all = tf.trainable_variables()\n", 165 | "tv_regu = []\n", 166 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 167 | "for t in tv_all:\n", 168 | " if t.name not in non_reg:\n", 169 | " if(t.name.find('biases')==-1):\n", 170 | " tv_regu.append(t)\n", 171 | "\n", 172 | "with tf.name_scope(\"loss\"):\n", 173 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 174 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 175 | " total_loss = loss + l2_loss\n", 176 | "\n", 177 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 178 | "\n", 179 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 180 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": 12, 186 | "metadata": { 187 | "collapsed": true 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "f = open(data_dir + '/word_embd_wiki', 'rb')\n", 192 | "vocab, word_embedding = pickle.load(f)\n", 193 | "f.close()\n", 194 | "\n", 195 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 196 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 197 | "\n", 198 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 199 | "\n", 200 | "pos_tags_vocab = []\n", 201 | "for line in open(data_dir + '/pos_tags.txt'):\n", 202 | " pos_tags_vocab.append(line.strip())\n", 203 | "\n", 204 | "dep_vocab = []\n", 205 | "for line in open(data_dir + '/dependency_types.txt'):\n", 206 | " dep_vocab.append(line.strip())\n", 207 | "\n", 208 | "relation_vocab = []\n", 209 | "for line in open(data_dir + '/relation_types.txt'):\n", 210 | " relation_vocab.append(line.strip())\n", 211 | "\n", 212 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 213 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 214 | "\n", 215 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 216 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 217 | "\n", 218 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 219 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 220 | "\n", 221 | "pos_tag2id['OTH'] = 9\n", 222 | "id2pos_tag[9] = 'OTH'\n", 223 | "\n", 224 | "dep2id['OTH'] = 20\n", 225 | "id2dep[20] = 'OTH'\n", 226 | "\n", 227 | "JJ_pos_tags = ['JJ', 
'JJR', 'JJS']\n", 228 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 229 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 230 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 231 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 232 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 233 | "\n", 234 | "def pos_tag(x):\n", 235 | " if x in JJ_pos_tags:\n", 236 | " return pos_tag2id['JJ']\n", 237 | " if x in NN_pos_tags:\n", 238 | " return pos_tag2id['NN']\n", 239 | " if x in RB_pos_tags:\n", 240 | " return pos_tag2id['RB']\n", 241 | " if x in PRP_pos_tags:\n", 242 | " return pos_tag2id['PRP']\n", 243 | " if x in VB_pos_tags:\n", 244 | " return pos_tag2id['VB']\n", 245 | " if x in _pos_tags:\n", 246 | " return pos_tag2id[x]\n", 247 | " else:\n", 248 | " return 9" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 13, 254 | "metadata": { 255 | "collapsed": true 256 | }, 257 | "outputs": [], 258 | "source": [ 259 | "sess = tf.Session()\n", 260 | "sess.run(tf.global_variables_initializer())\n", 261 | "saver = tf.train.Saver()" 262 | ] 263 | }, 264 | { 265 | "cell_type": "code", 266 | "execution_count": 14, 267 | "metadata": { 268 | "collapsed": true 269 | }, 270 | "outputs": [], 271 | "source": [ 272 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 273 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 274 | ] 275 | }, 276 | { 277 | "cell_type": "code", 278 | "execution_count": 15, 279 | "metadata": { 280 | "collapsed": true 281 | }, 282 | "outputs": [], 283 | "source": [ 284 | "# pos_embedding_saver.save(sess, pos_embd_dir + '/pos_embd')\n", 285 | "# dep_embedding_saver.save(sess, dep_embd_dir + '/dep_embd')" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": 16, 291 | "metadata": { 292 | "collapsed": true 293 | }, 294 | "outputs": [], 295 | "source": [ 296 | "# model = tf.train.latest_checkpoint(model_dir)\n", 297 | "# saver.restore(sess, model)" 298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 17, 303 | "metadata": { 304 | "scrolled": true 305 | }, 306 | "outputs": [ 307 | { 308 | "name": "stdout", 309 | "output_type": "stream", 310 | "text": [ 311 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd_wiki/word_embd\n" 312 | ] 313 | } 314 | ], 315 | "source": [ 316 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 317 | "word_embedding_saver.restore(sess, latest_embd)" 318 | ] 319 | }, 320 | { 321 | "cell_type": "code", 322 | "execution_count": 18, 323 | "metadata": { 324 | "collapsed": true 325 | }, 326 | "outputs": [], 327 | "source": [ 328 | "f = open(data_dir + '/train_paths', 'rb')\n", 329 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 330 | "f.close()\n", 331 | "\n", 332 | "relations = []\n", 333 | "for line in open(data_dir + '/train_relations.txt'):\n", 334 | " relations.append(line.strip().split()[1])\n", 335 | "\n", 336 | "length = len(word_p1)\n", 337 | "num_batches = int(length/batch_size)\n", 338 | "\n", 339 | "for i in range(length):\n", 340 | " for j, word in enumerate(word_p1[i]):\n", 341 | " word = word.lower()\n", 342 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 343 | " for k, word in enumerate(word_p2[i]):\n", 344 | " word = word.lower()\n", 345 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 346 | " for l, d in enumerate(dep_p1[i]):\n", 347 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 348 | " for m, d in enumerate(dep_p2[i]):\n", 349 | " dep_p2[i][m] = d 
if d in dep2id else 'OTH'\n", 350 | "\n", 351 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 352 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 353 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 354 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 355 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 356 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 357 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 358 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 359 | "path2_len = np.array([len(w) for w in word_p2])\n", 360 | "\n", 361 | "for i in range(length):\n", 362 | " for j, w in enumerate(word_p1[i]):\n", 363 | " word_p1_ids[i][j] = word2id[w]\n", 364 | " for j, w in enumerate(word_p2[i]):\n", 365 | " word_p2_ids[i][j] = word2id[w]\n", 366 | " for j, w in enumerate(pos_p1[i]):\n", 367 | " pos_p1_ids[i][j] = pos_tag(w)\n", 368 | " for j, w in enumerate(pos_p2[i]):\n", 369 | " pos_p2_ids[i][j] = pos_tag(w)\n", 370 | " for j, w in enumerate(dep_p1[i]):\n", 371 | " dep_p1_ids[i][j] = dep2id[w]\n", 372 | " for j, w in enumerate(dep_p2[i]):\n", 373 | " dep_p2_ids[i][j] = dep2id[w]" 374 | ] 375 | }, 376 | { 377 | "cell_type": "code", 378 | "execution_count": 20, 379 | "metadata": { 380 | "scrolled": true 381 | }, 382 | "outputs": [ 383 | { 384 | "name": "stdout", 385 | "output_type": "stream", 386 | "text": [ 387 | "Epoch: 1 Step: 800 loss: 1834.98579071\n", 388 | "Epoch: 2 Step: 1600 loss: 674.517935333\n", 389 | "Epoch: 3 Step: 2400 loss: 274.690037136\n", 390 | "Epoch: 4 Step: 3200 loss: 121.720370474\n", 391 | "Epoch: 5 Step: 4000 loss: 57.159888463\n", 392 | "Epoch: 6 Step: 4800 loss: 28.9982907772\n", 393 | "Epoch: 7 Step: 5600 loss: 16.1453694081\n", 394 | "Epoch: 8 Step: 6400 loss: 9.82831180751\n", 395 | "Epoch: 9 Step: 7200 loss: 6.59800129354\n", 396 | "Epoch: 10 Step: 8000 loss: 4.74144041985\n", 397 | "Epoch: 11 Step: 8800 loss: 3.62599194676\n", 398 | "Epoch: 12 Step: 9600 loss: 2.89340341777\n", 399 | "Epoch: 13 Step: 10400 loss: 2.38419453934\n", 400 | "Epoch: 14 Step: 11200 loss: 2.02302289039\n", 401 | "Epoch: 15 Step: 12000 loss: 1.72652905107\n", 402 | "Epoch: 16 Step: 12800 loss: 1.52037471995\n", 403 | "Epoch: 17 Step: 13600 loss: 1.32972317606\n", 404 | "Epoch: 18 Step: 14400 loss: 1.20203789197\n", 405 | "Epoch: 19 Step: 15200 loss: 1.08597725138\n", 406 | "Epoch: 20 Step: 16000 loss: 0.986360963807\n", 407 | "Epoch: 21 Step: 16800 loss: 0.898290062994\n", 408 | "Epoch: 22 Step: 17600 loss: 0.830409790054\n", 409 | "Epoch: 23 Step: 18400 loss: 0.782534971312\n", 410 | "Epoch: 24 Step: 19200 loss: 0.718714593202\n", 411 | "Epoch: 25 Step: 20000 loss: 0.668461259529\n", 412 | "Epoch: 26 Step: 20800 loss: 0.633814334124\n", 413 | "Epoch: 27 Step: 21600 loss: 0.594128802419\n", 414 | "Epoch: 28 Step: 22400 loss: 0.56548288703\n", 415 | "Epoch: 29 Step: 23200 loss: 0.526212990992\n", 416 | "Epoch: 30 Step: 24000 loss: 0.520392173678\n", 417 | "Epoch: 31 Step: 24800 loss: 0.487168050297\n", 418 | "Epoch: 32 Step: 25600 loss: 0.464592997283\n", 419 | "Epoch: 33 Step: 26400 loss: 0.445906150565\n", 420 | "Epoch: 34 Step: 27200 loss: 0.430318820551\n", 421 | "Epoch: 35 Step: 28000 loss: 0.415004718341\n", 422 | "Epoch: 36 Step: 28800 loss: 0.39048141662\n", 423 | "Epoch: 37 Step: 29600 loss: 0.378652221076\n", 424 | "Epoch: 38 Step: 30400 loss: 0.376885517202\n", 425 | "Epoch: 39 Step: 31200 loss: 0.361440741643\n", 426 | "Epoch: 40 Step: 32000 loss: 
0.345032765269\n", 427 | "Epoch: 41 Step: 32800 loss: 0.331929060183\n", 428 | "Epoch: 42 Step: 33600 loss: 0.322243774533\n", 429 | "Epoch: 43 Step: 34400 loss: 0.316909426395\n", 430 | "Epoch: 44 Step: 35200 loss: 0.307885918804\n", 431 | "Epoch: 45 Step: 36000 loss: 0.303443572205\n", 432 | "Epoch: 46 Step: 36800 loss: 0.284900524076\n", 433 | "Epoch: 47 Step: 37600 loss: 0.281887375377\n", 434 | "Epoch: 48 Step: 38400 loss: 0.279675952736\n", 435 | "Epoch: 49 Step: 39200 loss: 0.272306431141\n", 436 | "Epoch: 50 Step: 40000 loss: 0.267325288765\n", 437 | "Epoch: 51 Step: 40800 loss: 0.252997332923\n", 438 | "Epoch: 52 Step: 41600 loss: 0.257797217574\n", 439 | "Epoch: 53 Step: 42400 loss: 0.248855141364\n", 440 | "Epoch: 54 Step: 43200 loss: 0.241898285728\n", 441 | "Epoch: 55 Step: 44000 loss: 0.237289066594\n", 442 | "Epoch: 56 Step: 44800 loss: 0.241825930495\n", 443 | "Epoch: 57 Step: 45600 loss: 0.228385834955\n", 444 | "Epoch: 58 Step: 46400 loss: 0.225023462269\n", 445 | "Epoch: 59 Step: 47200 loss: 0.223864237741\n", 446 | "Epoch: 60 Step: 48000 loss: 0.216368767507\n", 447 | "Saved Model\n" 448 | ] 449 | } 450 | ], 451 | "source": [ 452 | "num_epochs = 60\n", 453 | "for i in range(num_epochs):\n", 454 | " loss_per_epoch = 0\n", 455 | " for j in range(num_batches):\n", 456 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 457 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 458 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 459 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 460 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 461 | " \n", 462 | " feed_dict = {\n", 463 | " path_length:path_dict,\n", 464 | " word_ids:word_dict,\n", 465 | " pos_ids:pos_dict,\n", 466 | " dep_ids:dep_dict,\n", 467 | " y:y_dict}\n", 468 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 469 | " loss_per_epoch +=_loss\n", 470 | " if (j+1)%num_batches==0:\n", 471 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 472 | " \n", 473 | "saver.save(sess, model_dir + '/model')\n", 474 | "print(\"Saved Model\")" 475 | ] 476 | }, 477 | { 478 | "cell_type": "code", 479 | "execution_count": 21, 480 | "metadata": { 481 | "scrolled": false 482 | }, 483 | "outputs": [ 484 | { 485 | "name": "stdout", 486 | "output_type": "stream", 487 | "text": [ 488 | "training accuracy 98.9625\n" 489 | ] 490 | } 491 | ], 492 | "source": [ 493 | "# training accuracy\n", 494 | "all_predictions = []\n", 495 | "for j in range(num_batches):\n", 496 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 497 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 498 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 499 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 500 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 501 | "\n", 502 | " feed_dict = {\n", 503 | " path_length:path_dict,\n", 504 | " word_ids:word_dict,\n", 505 | " pos_ids:pos_dict,\n", 506 | " dep_ids:dep_dict,\n", 507 | " y:y_dict}\n", 508 | " batch_predictions = sess.run(predictions, feed_dict)\n", 509 | " 
all_predictions.append(batch_predictions)\n", 510 | "\n", 511 | "y_pred = []\n", 512 | "for i in range(num_batches):\n", 513 | " for pred in all_predictions[i]:\n", 514 | " y_pred.append(pred)\n", 515 | "\n", 516 | "count = 0\n", 517 | "for i in range(batch_size*num_batches):\n", 518 | " count += y_pred[i]==rel_ids[i]\n", 519 | "accuracy = count/(batch_size*num_batches) * 100\n", 520 | "\n", 521 | "print(\"training accuracy\", accuracy)" 522 | ] 523 | }, 524 | { 525 | "cell_type": "code", 526 | "execution_count": 22, 527 | "metadata": {}, 528 | "outputs": [ 529 | { 530 | "name": "stdout", 531 | "output_type": "stream", 532 | "text": [ 533 | "test accuracy 61.8819188192\n" 534 | ] 535 | } 536 | ], 537 | "source": [ 538 | "f = open(data_dir + '/test_paths', 'rb')\n", 539 | "word_p1, word_p2, dep_p1, dep_p2, pos_p1, pos_p2 = pickle.load(f)\n", 540 | "f.close()\n", 541 | "\n", 542 | "relations = []\n", 543 | "for line in open(data_dir + '/test_relations.txt'):\n", 544 | " relations.append(line.strip().split()[0])\n", 545 | "\n", 546 | "length = len(word_p1)\n", 547 | "num_batches = int(length/batch_size)\n", 548 | "\n", 549 | "for i in range(length):\n", 550 | " for j, word in enumerate(word_p1[i]):\n", 551 | " word = word.lower()\n", 552 | " word_p1[i][j] = word if word in word2id else unknown_token \n", 553 | " for k, word in enumerate(word_p2[i]):\n", 554 | " word = word.lower()\n", 555 | " word_p2[i][k] = word if word in word2id else unknown_token \n", 556 | " for l, d in enumerate(dep_p1[i]):\n", 557 | " dep_p1[i][l] = d if d in dep2id else 'OTH'\n", 558 | " for m, d in enumerate(dep_p2[i]):\n", 559 | " dep_p2[i][m] = d if d in dep2id else 'OTH'\n", 560 | "\n", 561 | "word_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 562 | "word_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 563 | "pos_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 564 | "pos_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 565 | "dep_p1_ids = np.ones([length, max_len_path],dtype=int)\n", 566 | "dep_p2_ids = np.ones([length, max_len_path],dtype=int)\n", 567 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 568 | "path1_len = np.array([len(w) for w in word_p1], dtype=int)\n", 569 | "path2_len = np.array([len(w) for w in word_p2])\n", 570 | "\n", 571 | "for i in range(length):\n", 572 | " for j, w in enumerate(word_p1[i]):\n", 573 | " word_p1_ids[i][j] = word2id[w]\n", 574 | " for j, w in enumerate(word_p2[i]):\n", 575 | " word_p2_ids[i][j] = word2id[w]\n", 576 | " for j, w in enumerate(pos_p1[i]):\n", 577 | " pos_p1_ids[i][j] = pos_tag(w)\n", 578 | " for j, w in enumerate(pos_p2[i]):\n", 579 | " pos_p2_ids[i][j] = pos_tag(w)\n", 580 | " for j, w in enumerate(dep_p1[i]):\n", 581 | " dep_p1_ids[i][j] = dep2id[w]\n", 582 | " for j, w in enumerate(dep_p2[i]):\n", 583 | " dep_p2_ids[i][j] = dep2id[w]\n", 584 | "\n", 585 | "# test predictions\n", 586 | "all_predictions = []\n", 587 | "for j in range(num_batches):\n", 588 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 589 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 590 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 591 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 592 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 593 | "\n", 594 | " feed_dict = {\n", 595 | " path_length:path_dict,\n", 596 
| " word_ids:word_dict,\n", 597 | " pos_ids:pos_dict,\n", 598 | " dep_ids:dep_dict,\n", 599 | " y:y_dict}\n", 600 | " batch_predictions = sess.run(predictions, feed_dict)\n", 601 | " all_predictions.append(batch_predictions)\n", 602 | "\n", 603 | "y_pred = []\n", 604 | "for i in range(num_batches):\n", 605 | " for pred in all_predictions[i]:\n", 606 | " y_pred.append(pred)\n", 607 | "\n", 608 | "count = 0\n", 609 | "for i in range(batch_size*num_batches):\n", 610 | " count += y_pred[i]==rel_ids[i]\n", 611 | "accuracy = count/(batch_size*num_batches) * 100\n", 612 | "\n", 613 | "print(\"test accuracy\", accuracy)" 614 | ] 615 | }, 616 | { 617 | "cell_type": "code", 618 | "execution_count": 15, 619 | "metadata": { 620 | "collapsed": true 621 | }, 622 | "outputs": [], 623 | "source": [ 624 | "f1 = f1_score(rel_ids[:batch_size*num_batches], y_pred, average='macro')" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 16, 630 | "metadata": {}, 631 | "outputs": [ 632 | { 633 | "data": { 634 | "text/plain": [ 635 | "0.62487150135880543" 636 | ] 637 | }, 638 | "execution_count": 16, 639 | "metadata": {}, 640 | "output_type": "execute_result" 641 | } 642 | ], 643 | "source": [ 644 | "f1" 645 | ] 646 | } 647 | ], 648 | "metadata": { 649 | "kernelspec": { 650 | "display_name": "Python 3", 651 | "language": "python", 652 | "name": "python3" 653 | }, 654 | "language_info": { 655 | "codemirror_mode": { 656 | "name": "ipython", 657 | "version": 3 658 | }, 659 | "file_extension": ".py", 660 | "mimetype": "text/x-python", 661 | "name": "python", 662 | "nbconvert_exporter": "python", 663 | "pygments_lexer": "ipython3", 664 | "version": "3.5.2" 665 | } 666 | }, 667 | "nbformat": 4, 668 | "nbformat_minor": 2 669 | } 670 | -------------------------------------------------------------------------------- /LCA Shortest Path/path_extractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 8, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import os\n", 12 | "from nltk.parse import stanford\n", 13 | "import nltk" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 10, 19 | "metadata": { 20 | "collapsed": true 21 | }, 22 | "outputs": [], 23 | "source": [ 24 | "# Dependency Tree\n", 25 | "from nltk.parse.stanford import StanfordDependencyParser\n", 26 | "dep_parser=StanfordDependencyParser(model_path=\"/home/shanu/nltk/jars/englishPCFG.ser.gz\")" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 11, 32 | "metadata": { 33 | "collapsed": true 34 | }, 35 | "outputs": [], 36 | "source": [ 37 | "def lca(tree, index1, index2):\n", 38 | " node = index1\n", 39 | " path1 = []\n", 40 | " path2 = []\n", 41 | " path1.append(index1)\n", 42 | " path2.append(index2)\n", 43 | " while(node != tree.root):\n", 44 | " node = tree.nodes[node['head']]\n", 45 | " path1.append(node)\n", 46 | " node = index2\n", 47 | " while(node != tree.root):\n", 48 | " node = tree.nodes[node['head']]\n", 49 | " path2.append(node)\n", 50 | " for l1, l2 in zip(path1[::-1],path2[::-1]):\n", 51 | " if(l1==l2):\n", 52 | " temp = l1\n", 53 | " return temp" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 12, 59 | "metadata": { 60 | "collapsed": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "def path_lca(tree, node, lca_node):\n", 65 | " path = []\n", 66 | " path.append(node)\n", 67 | " while(node != 
lca_node):\n", 68 | " node = tree.nodes[node['head']]\n", 69 | " path.append(node)\n", 70 | " return path" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 13, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "import _pickle \n", 82 | "f = open('../data/training_data', 'rb')\n", 83 | "sentences, e1, e2 = _pickle.load(f)\n", 84 | "f.close()" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "sentences[7588] = 'The reaction mixture is kept in the dark at room temperature for 1.5 hours .'\n", 96 | "sentences[2608] = \"This strawberry sauce has about a million uses , is freezer-friendly , and is so much better than that jar of Smuckers strawberry sauce that you 've had sitting in your fridge since that time you made banana splits 1.5 years ago .\"" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": { 103 | "collapsed": true 104 | }, 105 | "outputs": [], 106 | "source": [ 107 | "## Uncomment this for test set. \n", 108 | "# sentences[2590] = \"The pendant with the bail measure 1.25'' .\"\n", 109 | "# sentences[2664] = \"The cabinet encloses a 6.5 inch cone woofer , 4 inch cone midrange , and a 0.86 inch balanced dome tweeter .\"" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": 20, 115 | "metadata": { 116 | "collapsed": true 117 | }, 118 | "outputs": [], 119 | "source": [ 120 | "word_path1 = []\n", 121 | "word_path2 = []\n", 122 | "rel_path1 = []\n", 123 | "rel_path2 = []\n", 124 | "pos_path1 = []\n", 125 | "pos_path2 = []\n", 126 | "for i in range(8000):\n", 127 | " word_path1.append(0)\n", 128 | " word_path2.append(0)\n", 129 | " rel_path1.append(0)\n", 130 | " rel_path2.append(0)\n", 131 | " pos_path1.append(0)\n", 132 | " pos_path2.append(0)" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 38, 138 | "metadata": { 139 | "scrolled": true 140 | }, 141 | "outputs": [ 142 | { 143 | "name": "stdout", 144 | "output_type": "stream", 145 | "text": [ 146 | "7588 success\n" 147 | ] 148 | } 149 | ], 150 | "source": [ 151 | "for i in range(8000):\n", 152 | " try:\n", 153 | " parse_tree = dep_parser.raw_parse(sentences[i])\n", 154 | " for trees in parse_tree:\n", 155 | " tree = trees\n", 156 | " node1 = tree.nodes[e1[i]+1]\n", 157 | " node2 = tree.nodes[e2[i]+1]\n", 158 | " if node1['address']!=None and node2['address']!=None:\n", 159 | " print(i, \"success\")\n", 160 | " lca_node = lca(tree, node1, node2)\n", 161 | " path1 = path_lca(tree, node1, lca_node)\n", 162 | " path2 = path_lca(tree, node2, lca_node)\n", 163 | "\n", 164 | " word_path1[i] = [p[\"word\"] for p in path1]\n", 165 | " word_path2[i] = [p[\"word\"] for p in path2]\n", 166 | " rel_path1[i] = [p[\"rel\"] for p in path1]\n", 167 | " rel_path2[i] = [p[\"rel\"] for p in path2]\n", 168 | " pos_path1[i] = [p[\"tag\"] for p in path1]\n", 169 | " pos_path2[i] = [p[\"tag\"] for p in path2]\n", 170 | " else:\n", 171 | " print(i, node1[\"address\"], node2[\"address\"])\n", 172 | " except AssertionError:\n", 173 | " print(i, \"error\")\n", 174 | " " 175 | ] 176 | }, 177 | { 178 | "cell_type": "code", 179 | "execution_count": 39, 180 | "metadata": { 181 | "collapsed": true 182 | }, 183 | "outputs": [], 184 | "source": [ 185 | "file = open('../data/train_paths', 'wb')\n", 186 | "_pickle.dump([word_path1, word_path2, rel_path1, rel_path2, pos_path1, pos_path2], file)" 
187 | ] 188 | } 189 | ], 190 | "metadata": { 191 | "kernelspec": { 192 | "display_name": "Python 3", 193 | "language": "python", 194 | "name": "python3" 195 | }, 196 | "language_info": { 197 | "codemirror_mode": { 198 | "name": "ipython", 199 | "version": 3 200 | }, 201 | "file_extension": ".py", 202 | "mimetype": "text/x-python", 203 | "name": "python", 204 | "nbconvert_exporter": "python", 205 | "pygments_lexer": "ipython3", 206 | "version": "3.5.2" 207 | } 208 | }, 209 | "nbformat": 4, 210 | "nbformat_minor": 2 211 | } 212 | -------------------------------------------------------------------------------- /LCA SubTree/README.md: -------------------------------------------------------------------------------- 1 | ## Relation Classification using LSTMs on LCA Sub Tree 2 | 3 | LSTMs are applied on the Sub Tree of Lowest Ancestor of two entities as a sequence when traversed. 4 | 5 | Model | Train-Accuracy | Test-Accuracy| Epochs 6 | --- | --- | ---| --- 7 | model2v1 | ? | 54.6 | 11 8 | model2v2 | ? | 55.2 | 10 9 | 10 | 11 | 12 | * dropout on hidden layer of 0.3 13 | * Learning rate = 0.001 14 | * Learning rate decay = 0.96 15 | * state size = 100 16 | * lambda_l2 = 0.0001 17 | 18 | 19 | ### [model2v1](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20SubTree/model2v1.ipynb) 20 | * Foward LSTM on the sequnence traversed on LCA Sub Tree. 21 | 22 | ### [model2v2](https://github.com/Sshanu/Relation-Classification/blob/master/LCA%20SubTree/model2v2.ipynb) 23 | * Bidirectional LSTM on the sequnence traversed on LCA Sub Tree. 24 | -------------------------------------------------------------------------------- /LCA SubTree/model2v1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "data_dir = '../data'\n", 18 | "ckpt_dir = '../checkpoint'\n", 19 | "word_embd_dir = '../checkpoint/word_embd'\n", 20 | "model_dir = '../checkpoint/model2v1'\n", 21 | "\n", 22 | "word_embd_dim = 100\n", 23 | "pos_embd_dim = 25\n", 24 | "dep_embd_dim = 25\n", 25 | "word_vocab_size = 400001\n", 26 | "pos_vocab_size = 10\n", 27 | "dep_vocab_size = 21\n", 28 | "relation_classes = 19\n", 29 | "state_size = 100\n", 30 | "batch_size = 10\n", 31 | "channels = 3\n", 32 | "lambda_l2 = 0.0001\n", 33 | "max_len_path = 70\n", 34 | "starter_learning_rate = 0.001\n", 35 | "decay_steps = 2000\n", 36 | "decay_rate = 0.96" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": 2, 42 | "metadata": { 43 | "collapsed": true 44 | }, 45 | "outputs": [], 46 | "source": [ 47 | "with tf.name_scope(\"input\"):\n", 48 | " path_length = tf.placeholder(tf.int32, shape=[batch_size], name=\"path1_length\")\n", 49 | " word_ids = tf.placeholder(tf.int32, shape=[batch_size, max_len_path], name=\"word_ids\")\n", 50 | " pos_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"pos_ids\")\n", 51 | " dep_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"dep_ids\")\n", 52 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 53 | "\n", 54 | "with tf.name_scope(\"word_embedding\"):\n", 55 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 56 | " 
embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 57 | " embedding_init = W.assign(embedding_placeholder)\n", 58 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 59 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 60 | "\n", 61 | "with tf.name_scope(\"pos_embedding\"):\n", 62 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 63 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 64 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 65 | "\n", 66 | "with tf.name_scope(\"dep_embedding\"):\n", 67 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 68 | " embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 69 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 4, 75 | "metadata": { 76 | "collapsed": true 77 | }, 78 | "outputs": [], 79 | "source": [ 80 | "with tf.variable_scope(\"word_lstm\"):\n", 81 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 82 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word, sequence_length=path_length, dtype=tf.float32)\n", 83 | " state_series_word = tf.reduce_max(state_series, axis=1)\n", 84 | "\n", 85 | "with tf.variable_scope(\"pos_lstm\"):\n", 86 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 87 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos, sequence_length=path_length, dtype=tf.float32)\n", 88 | " state_series_pos = tf.reduce_max(state_series, axis=1)\n", 89 | "\n", 90 | "with tf.variable_scope(\"dep_lstm\"):\n", 91 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 92 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep, sequence_length=path_length, dtype=tf.float32)\n", 93 | " state_series_dep = tf.reduce_max(state_series, axis=1)\n", 94 | " \n", 95 | "state_series = tf.concat([state_series_word, state_series_pos, state_series_dep], 1)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 5, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "data": { 105 | "text/plain": [ 106 | "" 107 | ] 108 | }, 109 | "execution_count": 5, 110 | "metadata": {}, 111 | "output_type": "execute_result" 112 | } 113 | ], 114 | "source": [ 115 | "state_series" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 6, 121 | "metadata": { 122 | "collapsed": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "with tf.name_scope(\"hidden_layer\"):\n", 127 | " W = tf.Variable(tf.truncated_normal([300, 100], -0.1, 0.1), name=\"W\")\n", 128 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 129 | " y_hidden_layer = tf.nn.relu(tf.matmul(state_series, W) + b)\n", 130 | "\n", 131 | "with tf.name_scope(\"dropout\"):\n", 132 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 133 | "\n", 134 | "with tf.name_scope(\"softmax_layer\"):\n", 135 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 136 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 137 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 138 | " predictions = tf.argmax(logits, 1)\n", 139 | "\n", 140 | "tv_all = tf.trainable_variables()\n", 141 | "tv_regu = []\n", 142 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 143 | "for t in tv_all:\n", 144 | " if t.name not in 
non_reg:\n", 145 | " if(t.name.find('biases')==-1):\n", 146 | " tv_regu.append(t)\n", 147 | "\n", 148 | "with tf.name_scope(\"loss\"):\n", 149 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 150 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 151 | " total_loss = loss + l2_loss\n", 152 | "\n", 153 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 154 | "\n", 155 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 156 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 161 | "execution_count": 2, 162 | "metadata": { 163 | "collapsed": true 164 | }, 165 | "outputs": [], 166 | "source": [ 167 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 168 | "vocab = pickle.load(f)\n", 169 | "f.close()\n", 170 | "\n", 171 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 172 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 173 | "\n", 174 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 175 | "word2id[unknown_token] = word_vocab_size -1\n", 176 | "id2word[word_vocab_size-1] = unknown_token\n", 177 | "\n", 178 | "pos_tags_vocab = []\n", 179 | "for line in open(data_dir + '/pos_tags.txt'):\n", 180 | " pos_tags_vocab.append(line.strip())\n", 181 | "\n", 182 | "dep_vocab = []\n", 183 | "for line in open(data_dir + '/dependency_types.txt'):\n", 184 | " dep_vocab.append(line.strip())\n", 185 | "\n", 186 | "relation_vocab = []\n", 187 | "for line in open(data_dir + '/relation_types.txt'):\n", 188 | " relation_vocab.append(line.strip())\n", 189 | "\n", 190 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 191 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 192 | "\n", 193 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 194 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 195 | "\n", 196 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 197 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 198 | "\n", 199 | "pos_tag2id['OTH'] = 9\n", 200 | "id2pos_tag[9] = 'OTH'\n", 201 | "\n", 202 | "dep2id['OTH'] = 20\n", 203 | "id2dep[20] = 'OTH'\n", 204 | "\n", 205 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 206 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 207 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 208 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 209 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 210 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 211 | "\n", 212 | "def pos_tag(x):\n", 213 | " if x in JJ_pos_tags:\n", 214 | " return pos_tag2id['JJ']\n", 215 | " if x in NN_pos_tags:\n", 216 | " return pos_tag2id['NN']\n", 217 | " if x in RB_pos_tags:\n", 218 | " return pos_tag2id['RB']\n", 219 | " if x in PRP_pos_tags:\n", 220 | " return pos_tag2id['PRP']\n", 221 | " if x in VB_pos_tags:\n", 222 | " return pos_tag2id['VB']\n", 223 | " if x in _pos_tags:\n", 224 | " return pos_tag2id[x]\n", 225 | " else:\n", 226 | " return 9" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 9, 232 | "metadata": { 233 | "collapsed": true 234 | }, 235 | "outputs": [], 236 | "source": [ 237 | "sess = tf.Session()\n", 238 | "sess.run(tf.global_variables_initializer())\n", 239 | "saver = tf.train.Saver()" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": { 246 | 
"collapsed": true 247 | }, 248 | "outputs": [], 249 | "source": [ 250 | "# f = open('data/word_embedding', 'rb')\n", 251 | "# word_embedding = pickle.load(f)\n", 252 | "# f.close()\n", 253 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 254 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 12, 260 | "metadata": { 261 | "collapsed": true 262 | }, 263 | "outputs": [], 264 | "source": [ 265 | "# model = tf.train.latest_checkpoint(model_dir)\n", 266 | "# saver.restore(sess, model)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "code", 271 | "execution_count": 9, 272 | "metadata": { 273 | "scrolled": true 274 | }, 275 | "outputs": [ 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 281 | ] 282 | } 283 | ], 284 | "source": [ 285 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 286 | "word_embedding_saver.restore(sess, latest_embd)" 287 | ] 288 | }, 289 | { 290 | "cell_type": "code", 291 | "execution_count": 13, 292 | "metadata": { 293 | "collapsed": true 294 | }, 295 | "outputs": [], 296 | "source": [ 297 | "f = open(data_dir + '/train_lca_paths', 'rb')\n", 298 | "word_p, dep_p, pos_p = pickle.load(f)\n", 299 | "f.close()\n", 300 | "relations = []\n", 301 | "for line in open(data_dir + '/train_relations.txt'):\n", 302 | " relations.append(line.strip().split()[1])\n", 303 | "\n", 304 | "length = len(word_p)\n", 305 | "num_batches = int(length/batch_size)\n", 306 | "\n", 307 | "for i in range(length):\n", 308 | " for j, word in enumerate(word_p[i]):\n", 309 | " word = word.lower()\n", 310 | " word_p[i][j] = word if word in word2id else unknown_token \n", 311 | " for l, d in enumerate(dep_p[i]):\n", 312 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 313 | " \n", 314 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 315 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 316 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 317 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 318 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 319 | "\n", 320 | "for i in range(length):\n", 321 | " for j, w in enumerate(word_p[i]):\n", 322 | " word_p_ids[i][j] = word2id[w]\n", 323 | " \n", 324 | " for j, w in enumerate(pos_p[i]):\n", 325 | " pos_p_ids[i][j] = pos_tag(w)\n", 326 | " \n", 327 | " for j, w in enumerate(dep_p[i]):\n", 328 | " dep_p_ids[i][j] = dep2id[w]" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": { 335 | "scrolled": true 336 | }, 337 | "outputs": [ 338 | { 339 | "name": "stdout", 340 | "output_type": "stream", 341 | "text": [ 342 | "Epoch: 1 Step: 800 loss: 2.85355205297\n", 343 | "Saved Model\n", 344 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 345 | "Saved Model\n", 346 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 347 | "Saved Model\n", 348 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 349 | "Saved Model\n", 350 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 351 | "Saved Model\n", 352 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 353 | "Saved Model\n", 354 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 355 | "Saved Model\n", 356 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 357 | "Saved Model\n", 358 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 359 | "Saved Model\n", 360 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 361 | 
"Saved Model\n", 362 | "Epoch: 11 Step: 8800 loss: 2.67341704309\n", 363 | "Saved Model\n", 364 | "Epoch: 12 Step: 9600 loss: 2.67321884066\n", 365 | "Saved Model\n", 366 | "Epoch: 13 Step: 10400 loss: 2.67310401961\n", 367 | "Saved Model\n", 368 | "Epoch: 14 Step: 11200 loss: 2.67295600712\n", 369 | "Saved Model\n", 370 | "Epoch: 15 Step: 12000 loss: 2.67288722694\n", 371 | "Saved Model\n", 372 | "Epoch: 16 Step: 12800 loss: 2.67282888472\n", 373 | "Saved Model\n", 374 | "Epoch: 17 Step: 13600 loss: 2.67277920395\n", 375 | "Saved Model\n", 376 | "Epoch: 18 Step: 14400 loss: 2.6727619794\n", 377 | "Saved Model\n", 378 | "Epoch: 19 Step: 15200 loss: 2.67268569678\n", 379 | "Saved Model\n", 380 | "Epoch: 20 Step: 16000 loss: 2.67266457796\n", 381 | "Saved Model\n", 382 | "Epoch: 21 Step: 16800 loss: 2.67263956338\n", 383 | "Saved Model\n", 384 | "Epoch: 22 Step: 17600 loss: 2.67261722207\n", 385 | "Saved Model\n", 386 | "Epoch: 23 Step: 18400 loss: 2.67261824235\n", 387 | "Saved Model\n", 388 | "Epoch: 24 Step: 19200 loss: 2.67256126881\n", 389 | "Saved Model\n", 390 | "Epoch: 25 Step: 20000 loss: 2.6725519672\n", 391 | "Saved Model\n", 392 | "Epoch: 26 Step: 20800 loss: 2.67253558069\n", 393 | "Saved Model\n", 394 | "Epoch: 27 Step: 21600 loss: 2.67252239197\n", 395 | "Saved Model\n", 396 | "Epoch: 28 Step: 22400 loss: 2.67252858594\n", 397 | "Saved Model\n", 398 | "Epoch: 29 Step: 23200 loss: 2.67248077154\n", 399 | "Saved Model\n", 400 | "Epoch: 30 Step: 24000 loss: 2.67247578681\n", 401 | "Saved Model\n", 402 | "Epoch: 31 Step: 24800 loss: 2.67246250227\n", 403 | "Saved Model\n", 404 | "Epoch: 32 Step: 25600 loss: 2.67245363146\n", 405 | "Saved Model\n", 406 | "Epoch: 33 Step: 26400 loss: 2.67246143714\n", 407 | "Saved Model\n", 408 | "Epoch: 34 Step: 27200 loss: 2.6724195759\n", 409 | "Saved Model\n", 410 | "Epoch: 35 Step: 28000 loss: 2.67241657913\n", 411 | "Saved Model\n", 412 | "Epoch: 36 Step: 28800 loss: 2.67240460932\n", 413 | "Saved Model\n", 414 | "Epoch: 37 Step: 29600 loss: 2.67239822775\n", 415 | "Saved Model\n" 416 | ] 417 | } 418 | ], 419 | "source": [ 420 | "num_epochs = 40\n", 421 | "for i in range(num_epochs):\n", 422 | " loss_per_epoch = 0\n", 423 | " for j in range(num_batches):\n", 424 | " feed_dict = {\n", 425 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 426 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 427 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 428 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 429 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 430 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 431 | " loss_per_epoch +=_loss\n", 432 | " if (j+1)%num_batches==0:\n", 433 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 434 | " saver.save(sess, model_dir + '/model')\n", 435 | " print(\"Saved Model\")" 436 | ] 437 | }, 438 | { 439 | "cell_type": "code", 440 | "execution_count": null, 441 | "metadata": { 442 | "collapsed": true, 443 | "scrolled": false 444 | }, 445 | "outputs": [], 446 | "source": [ 447 | "# training accuracy\n", 448 | "all_predictions = []\n", 449 | "for j in range(num_batches):\n", 450 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 451 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 452 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 453 | " 
dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 454 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 455 | "\n", 456 | " feed_dict = {\n", 457 | " path_length:path_dict,\n", 458 | " word_ids:word_dict,\n", 459 | " pos_ids:pos_dict,\n", 460 | " dep_ids:dep_dict,\n", 461 | " y:y_dict}\n", 462 | " batch_predictions = sess.run(predictions, feed_dict)\n", 463 | " all_predictions.append(batch_predictions)\n", 464 | "\n", 465 | "y_pred = []\n", 466 | "for i in range(num_batches):\n", 467 | " for pred in all_predictions[i]:\n", 468 | " y_pred.append(pred)\n", 469 | "\n", 470 | "count = 0\n", 471 | "for i in range(batch_size*num_batches):\n", 472 | " count += y_pred[i]==rel_ids[i]\n", 473 | "accuracy = count/(batch_size*num_batches) * 100\n", 474 | "\n", 475 | "print(\"training accuracy\", accuracy)" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 11, 481 | "metadata": { 482 | "collapsed": true 483 | }, 484 | "outputs": [], 485 | "source": [ 486 | "f = open(data_dir + '/test_lca_paths', 'rb')\n", 487 | "word_p, dep_p, pos_p = pickle.load(f)\n", 488 | "f.close()\n", 489 | "\n", 490 | "relations = []\n", 491 | "for line in open(data_dir + '/test_relations.txt'):\n", 492 | " relations.append(line.strip().split()[0])\n", 493 | "\n", 494 | "length = len(word_p1)\n", 495 | "num_batches = int(length/batch_size)\n", 496 | "\n", 497 | "for i in range(length):\n", 498 | " for j, word in enumerate(word_p[i]):\n", 499 | " word = word.lower()\n", 500 | " word_p[i][j] = word if word in word2id else unknown_token \n", 501 | " for l, d in enumerate(dep_p[i]):\n", 502 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 503 | " \n", 504 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 505 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 506 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 507 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 508 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 509 | "\n", 510 | "for i in range(length):\n", 511 | " for j, w in enumerate(word_p[i]):\n", 512 | " word_p_ids[i][j] = word2id[w]\n", 513 | " \n", 514 | " for j, w in enumerate(pos_p[i]):\n", 515 | " pos_p_ids[i][j] = pos_tag(w)\n", 516 | " \n", 517 | " for j, w in enumerate(dep_p[i]):\n", 518 | " dep_p_ids[i][j] = dep2id[w]\n", 519 | "\n", 520 | "# test predictions\n", 521 | "all_predictions = []\n", 522 | "for j in range(num_batches):\n", 523 | " path_dict = [path1_len[j*batch_size:(j+1)*batch_size], path2_len[j*batch_size:(j+1)*batch_size]]\n", 524 | " word_dict = [word_p1_ids[j*batch_size:(j+1)*batch_size], word_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 525 | " pos_dict = [pos_p1_ids[j*batch_size:(j+1)*batch_size], pos_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 526 | " dep_dict = [dep_p1_ids[j*batch_size:(j+1)*batch_size], dep_p2_ids[j*batch_size:(j+1)*batch_size]]\n", 527 | " y_dict = rel_ids[j*batch_size:(j+1)*batch_size]\n", 528 | "\n", 529 | " feed_dict = {\n", 530 | " path_length:path_dict,\n", 531 | " word_ids:word_dict,\n", 532 | " pos_ids:pos_dict,\n", 533 | " dep_ids:dep_dict,\n", 534 | " y:y_dict}\n", 535 | " batch_predictions = sess.run(predictions, feed_dict)\n", 536 | " all_predictions.append(batch_predictions)\n", 537 | "\n", 538 | "y_pred = []\n", 539 | "for i in range(num_batches):\n", 540 | " for pred in all_predictions[i]:\n", 541 | " y_pred.append(pred)\n", 542 | "\n", 543 | "count = 0\n", 544 | "for i in range(batch_size*num_batches):\n", 545 | " count 
+= y_pred[i]==rel_ids[i]\n", 546 | "accuracy = count/(batch_size*num_batches) * 100\n", 547 | "\n", 548 | "print(\"test accuracy\", accuracy)" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": null, 554 | "metadata": { 555 | "collapsed": true 556 | }, 557 | "outputs": [], 558 | "source": [] 559 | } 560 | ], 561 | "metadata": { 562 | "kernelspec": { 563 | "display_name": "Python 3", 564 | "language": "python", 565 | "name": "python3" 566 | }, 567 | "language_info": { 568 | "codemirror_mode": { 569 | "name": "ipython", 570 | "version": 3 571 | }, 572 | "file_extension": ".py", 573 | "mimetype": "text/x-python", 574 | "name": "python", 575 | "nbconvert_exporter": "python", 576 | "pygments_lexer": "ipython3", 577 | "version": "3.5.2" 578 | } 579 | }, 580 | "nbformat": 4, 581 | "nbformat_minor": 2 582 | } 583 | -------------------------------------------------------------------------------- /LCA SubTree/model2v2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import sys, os, _pickle as pickle\n", 12 | "import tensorflow as tf\n", 13 | "import numpy as np\n", 14 | "import nltk\n", 15 | "from sklearn.metrics import f1_score\n", 16 | "\n", 17 | "\n", 18 | "data_dir = '../data'\n", 19 | "ckpt_dir = '../checkpoint'\n", 20 | "word_embd_dir = '../checkpoint/word_embd'\n", 21 | "model_dir = '../checkpoint/model2v2'\n", 22 | "\n", 23 | "word_embd_dim = 100\n", 24 | "pos_embd_dim = 25\n", 25 | "dep_embd_dim = 25\n", 26 | "word_vocab_size = 400001\n", 27 | "pos_vocab_size = 10\n", 28 | "dep_vocab_size = 21\n", 29 | "relation_classes = 19\n", 30 | "state_size = 100\n", 31 | "batch_size = 10\n", 32 | "channels = 3\n", 33 | "lambda_l2 = 0.0001\n", 34 | "max_len_path = 70\n", 35 | "starter_learning_rate = 0.001\n", 36 | "decay_steps = 2000\n", 37 | "decay_rate = 0.96" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 2, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "with tf.name_scope(\"input\"):\n", 49 | " path_length = tf.placeholder(tf.int32, shape=[batch_size], name=\"path1_length\")\n", 50 | " word_ids = tf.placeholder(tf.int32, shape=[batch_size, max_len_path], name=\"word_ids\")\n", 51 | " pos_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"pos_ids\")\n", 52 | " dep_ids = tf.placeholder(tf.int32, [batch_size, max_len_path], name=\"dep_ids\")\n", 53 | " y = tf.placeholder(tf.int32, [batch_size], name=\"y\")\n", 54 | "\n", 55 | "with tf.name_scope(\"word_embedding\"):\n", 56 | " W = tf.Variable(tf.constant(0.0, shape=[word_vocab_size, word_embd_dim]), name=\"W\")\n", 57 | " embedding_placeholder = tf.placeholder(tf.float32,[word_vocab_size, word_embd_dim])\n", 58 | " embedding_init = W.assign(embedding_placeholder)\n", 59 | " embedded_word = tf.nn.embedding_lookup(W, word_ids)\n", 60 | " word_embedding_saver = tf.train.Saver({\"word_embedding/W\": W})\n", 61 | "\n", 62 | "with tf.name_scope(\"pos_embedding\"):\n", 63 | " W = tf.Variable(tf.random_uniform([pos_vocab_size, pos_embd_dim]), name=\"W\")\n", 64 | " embedded_pos = tf.nn.embedding_lookup(W, pos_ids)\n", 65 | " pos_embedding_saver = tf.train.Saver({\"pos_embedding/W\": W})\n", 66 | "\n", 67 | "with tf.name_scope(\"dep_embedding\"):\n", 68 | " W = tf.Variable(tf.random_uniform([dep_vocab_size, dep_embd_dim]), name=\"W\")\n", 69 | " 
embedded_dep = tf.nn.embedding_lookup(W, dep_ids)\n", 70 | " dep_embedding_saver = tf.train.Saver({\"dep_embedding/W\": W})" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 3, 76 | "metadata": { 77 | "collapsed": true 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "embedded_word_rev = tf.reverse(embedded_word, [1])\n", 82 | "embedded_pos_rev = tf.reverse(embedded_pos, [1])\n", 83 | "embedded_dep_rev = tf.reverse(embedded_dep, [1])\n", 84 | "path_length_rev = tf.reverse(path_length, [0])" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "collapsed": true 92 | }, 93 | "outputs": [], 94 | "source": [ 95 | "with tf.variable_scope(\"word_lstm_fw\"):\n", 96 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 97 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word, sequence_length=path_length, dtype=tf.float32)\n", 98 | " state_series_word_fw = tf.reduce_max(state_series, axis=1)\n", 99 | "\n", 100 | "with tf.variable_scope(\"pos_lstm_fw\"):\n", 101 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 102 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos, sequence_length=path_length, dtype=tf.float32)\n", 103 | " state_series_pos_fw = tf.reduce_max(state_series, axis=1)\n", 104 | "\n", 105 | "with tf.variable_scope(\"dep_lstm_fw\"):\n", 106 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 107 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep, sequence_length=path_length, dtype=tf.float32)\n", 108 | " state_series_dep_fw = tf.reduce_max(state_series, axis=1)\n", 109 | " \n", 110 | "with tf.variable_scope(\"word_lstm_bw\"):\n", 111 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 112 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_word_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 113 | " state_series_word_bw = tf.reduce_max(state_series, axis=1)\n", 114 | "\n", 115 | "with tf.variable_scope(\"pos_lstm_bw\"):\n", 116 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 117 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_pos_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 118 | " state_series_pos_bw = tf.reduce_max(state_series, axis=1)\n", 119 | "\n", 120 | "with tf.variable_scope(\"dep_lstm_bw\"):\n", 121 | " cell = tf.contrib.rnn.BasicLSTMCell(state_size)\n", 122 | " state_series, current_state = tf.nn.dynamic_rnn(cell, embedded_dep_rev, sequence_length=path_length_rev, dtype=tf.float32)\n", 123 | " state_series_dep_bw = tf.reduce_max(state_series, axis=1)\n", 124 | " \n", 125 | "state_series = tf.concat([state_series_word_fw, state_series_pos_fw, state_series_dep_fw, state_series_word_bw, state_series_pos_bw, state_series_dep_bw], 1)\n" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 5, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "data": { 135 | "text/plain": [ 136 | "" 137 | ] 138 | }, 139 | "execution_count": 5, 140 | "metadata": {}, 141 | "output_type": "execute_result" 142 | } 143 | ], 144 | "source": [ 145 | "state_series" 146 | ] 147 | }, 148 | { 149 | "cell_type": "code", 150 | "execution_count": 6, 151 | "metadata": { 152 | "collapsed": true 153 | }, 154 | "outputs": [], 155 | "source": [ 156 | "with tf.name_scope(\"hidden_layer\"):\n", 157 | " W = tf.Variable(tf.truncated_normal([600, 100], -0.1, 0.1), name=\"W\")\n", 158 | " b = tf.Variable(tf.zeros([100]), name=\"b\")\n", 159 | " y_hidden_layer = tf.matmul(state_series, W) 
+ b\n", 160 | "\n", 161 | "with tf.name_scope(\"dropout\"):\n", 162 | " y_hidden_layer_drop = tf.nn.dropout(y_hidden_layer, 0.3)\n", 163 | "\n", 164 | "with tf.name_scope(\"softmax_layer\"):\n", 165 | " W = tf.Variable(tf.truncated_normal([100, relation_classes], -0.1, 0.1), name=\"W\")\n", 166 | " b = tf.Variable(tf.zeros([relation_classes]), name=\"b\")\n", 167 | " logits = tf.matmul(y_hidden_layer_drop, W) + b\n", 168 | " predictions = tf.argmax(logits, 1)\n", 169 | "\n", 170 | "tv_all = tf.trainable_variables()\n", 171 | "tv_regu = []\n", 172 | "non_reg = [\"word_embedding/W:0\",\"pos_embedding/W:0\",'dep_embedding/W:0',\"global_step:0\",'hidden_layer/b:0','softmax_layer/b:0']\n", 173 | "for t in tv_all:\n", 174 | " if t.name not in non_reg:\n", 175 | " if(t.name.find('biases')==-1):\n", 176 | " tv_regu.append(t)\n", 177 | "\n", 178 | "with tf.name_scope(\"loss\"):\n", 179 | " l2_loss = lambda_l2 * tf.reduce_sum([ tf.nn.l2_loss(v) for v in tv_regu ])\n", 180 | " loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y))\n", 181 | " total_loss = loss + l2_loss\n", 182 | "\n", 183 | "global_step = tf.Variable(0, trainable=False, name=\"global_step\")\n", 184 | "\n", 185 | "learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, decay_steps, decay_rate, staircase=True)\n", 186 | "optimizer = tf.train.AdamOptimizer(learning_rate).minimize(total_loss, global_step=global_step)" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 7, 192 | "metadata": { 193 | "collapsed": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "f = open(data_dir + '/vocab.pkl', 'rb')\n", 198 | "vocab = pickle.load(f)\n", 199 | "f.close()\n", 200 | "\n", 201 | "word2id = dict((w, i) for i,w in enumerate(vocab))\n", 202 | "id2word = dict((i, w) for i,w in enumerate(vocab))\n", 203 | "\n", 204 | "unknown_token = \"UNKNOWN_TOKEN\"\n", 205 | "word2id[unknown_token] = word_vocab_size -1\n", 206 | "id2word[word_vocab_size-1] = unknown_token\n", 207 | "\n", 208 | "pos_tags_vocab = []\n", 209 | "for line in open(data_dir + '/pos_tags.txt'):\n", 210 | " pos_tags_vocab.append(line.strip())\n", 211 | "\n", 212 | "dep_vocab = []\n", 213 | "for line in open(data_dir + '/dependency_types.txt'):\n", 214 | " dep_vocab.append(line.strip())\n", 215 | "\n", 216 | "relation_vocab = []\n", 217 | "for line in open(data_dir + '/relation_types.txt'):\n", 218 | " relation_vocab.append(line.strip())\n", 219 | "\n", 220 | "rel2id = dict((w, i) for i,w in enumerate(relation_vocab))\n", 221 | "id2rel = dict((i, w) for i,w in enumerate(relation_vocab))\n", 222 | "\n", 223 | "pos_tag2id = dict((w, i) for i,w in enumerate(pos_tags_vocab))\n", 224 | "id2pos_tag = dict((i, w) for i,w in enumerate(pos_tags_vocab))\n", 225 | "\n", 226 | "dep2id = dict((w, i) for i,w in enumerate(dep_vocab))\n", 227 | "id2dep = dict((i, w) for i,w in enumerate(dep_vocab))\n", 228 | "\n", 229 | "pos_tag2id['OTH'] = 9\n", 230 | "id2pos_tag[9] = 'OTH'\n", 231 | "\n", 232 | "dep2id['OTH'] = 20\n", 233 | "id2dep[20] = 'OTH'\n", 234 | "\n", 235 | "JJ_pos_tags = ['JJ', 'JJR', 'JJS']\n", 236 | "NN_pos_tags = ['NN', 'NNS', 'NNP', 'NNPS']\n", 237 | "RB_pos_tags = ['RB', 'RBR', 'RBS']\n", 238 | "PRP_pos_tags = ['PRP', 'PRP$']\n", 239 | "VB_pos_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']\n", 240 | "_pos_tags = ['CC', 'CD', 'DT', 'IN']\n", 241 | "\n", 242 | "def pos_tag(x):\n", 243 | " if x in JJ_pos_tags:\n", 244 | " return pos_tag2id['JJ']\n", 245 | " if x in NN_pos_tags:\n", 246 | " 
return pos_tag2id['NN']\n", 247 | " if x in RB_pos_tags:\n", 248 | " return pos_tag2id['RB']\n", 249 | " if x in PRP_pos_tags:\n", 250 | " return pos_tag2id['PRP']\n", 251 | " if x in VB_pos_tags:\n", 252 | " return pos_tag2id['VB']\n", 253 | " if x in _pos_tags:\n", 254 | " return pos_tag2id[x]\n", 255 | " else:\n", 256 | " return 9" 257 | ] 258 | }, 259 | { 260 | "cell_type": "code", 261 | "execution_count": 8, 262 | "metadata": { 263 | "collapsed": true 264 | }, 265 | "outputs": [], 266 | "source": [ 267 | "sess = tf.Session()\n", 268 | "sess.run(tf.global_variables_initializer())\n", 269 | "saver = tf.train.Saver()" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 9, 275 | "metadata": { 276 | "collapsed": true 277 | }, 278 | "outputs": [], 279 | "source": [ 280 | "# f = open('data/word_embedding', 'rb')\n", 281 | "# word_embedding = pickle.load(f)\n", 282 | "# f.close()\n", 283 | "# sess.run(embedding_init, feed_dict={embedding_placeholder:word_embedding})\n", 284 | "# word_embedding_saver.save(sess, word_embd_dir + '/word_embd')" 285 | ] 286 | }, 287 | { 288 | "cell_type": "code", 289 | "execution_count": 10, 290 | "metadata": { 291 | "collapsed": true 292 | }, 293 | "outputs": [], 294 | "source": [ 295 | "# model = tf.train.latest_checkpoint(model_dir)\n", 296 | "# saver.restore(sess, model)" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 11, 302 | "metadata": { 303 | "scrolled": true 304 | }, 305 | "outputs": [ 306 | { 307 | "name": "stdout", 308 | "output_type": "stream", 309 | "text": [ 310 | "INFO:tensorflow:Restoring parameters from checkpoint/word_embd/word_embd\n" 311 | ] 312 | } 313 | ], 314 | "source": [ 315 | "latest_embd = tf.train.latest_checkpoint(word_embd_dir)\n", 316 | "word_embedding_saver.restore(sess, latest_embd)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "code", 321 | "execution_count": 12, 322 | "metadata": { 323 | "collapsed": true 324 | }, 325 | "outputs": [], 326 | "source": [ 327 | "f = open(data_dir + '/train_lca_paths', 'rb')\n", 328 | "word_p, dep_p, pos_p = pickle.load(f)\n", 329 | "f.close()\n", 330 | "relations = []\n", 331 | "for line in open(data_dir + '/train_relations.txt'):\n", 332 | " relations.append(line.strip().split()[1])\n", 333 | "\n", 334 | "length = len(word_p)\n", 335 | "num_batches = int(length/batch_size)\n", 336 | "\n", 337 | "for i in range(length):\n", 338 | " for j, word in enumerate(word_p[i]):\n", 339 | " word = word.lower()\n", 340 | " word_p[i][j] = word if word in word2id else unknown_token \n", 341 | " for l, d in enumerate(dep_p[i]):\n", 342 | " dep_p[i][l] = d if d in dep2id else 'OTH'\n", 343 | " \n", 344 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 345 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 346 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 347 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 348 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 349 | "\n", 350 | "for i in range(length):\n", 351 | " for j, w in enumerate(word_p[i]):\n", 352 | " word_p_ids[i][j] = word2id[w]\n", 353 | " \n", 354 | " for j, w in enumerate(pos_p[i]):\n", 355 | " pos_p_ids[i][j] = pos_tag(w)\n", 356 | " \n", 357 | " for j, w in enumerate(dep_p[i]):\n", 358 | " dep_p_ids[i][j] = dep2id[w]" 359 | ] 360 | }, 361 | { 362 | "cell_type": "code", 363 | "execution_count": 13, 364 | "metadata": { 365 | "scrolled": true 366 | }, 367 | "outputs": [ 368 | { 369 | "name": "stdout", 370 | "output_type": "stream", 371 | 
"text": [ 372 | "Epoch: 1 Step: 800 loss: 2.85300489247\n", 373 | "Saved Model\n", 374 | "Epoch: 2 Step: 1600 loss: 2.73827668965\n", 375 | "Saved Model\n", 376 | "Epoch: 3 Step: 2400 loss: 2.70001435518\n", 377 | "Saved Model\n", 378 | "Epoch: 4 Step: 3200 loss: 2.68624746531\n", 379 | "Saved Model\n", 380 | "Epoch: 5 Step: 4000 loss: 2.68042603165\n", 381 | "Saved Model\n", 382 | "Epoch: 6 Step: 4800 loss: 2.67750604913\n", 383 | "Saved Model\n", 384 | "Epoch: 7 Step: 5600 loss: 2.67583220631\n", 385 | "Saved Model\n", 386 | "Epoch: 8 Step: 6400 loss: 2.67482194766\n", 387 | "Saved Model\n", 388 | "Epoch: 9 Step: 7200 loss: 2.67411908716\n", 389 | "Saved Model\n", 390 | "Epoch: 10 Step: 8000 loss: 2.67369878128\n", 391 | "Saved Model\n" 392 | ] 393 | } 394 | ], 395 | "source": [ 396 | "num_epochs = 10\n", 397 | "for i in range(num_epochs):\n", 398 | " loss_per_epoch = 0\n", 399 | " for j in range(num_batches):\n", 400 | " feed_dict = {\n", 401 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 402 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 403 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 404 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 405 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 406 | " _, _loss, step = sess.run([optimizer, total_loss, global_step], feed_dict)\n", 407 | " loss_per_epoch +=_loss\n", 408 | " if (j+1)%num_batches==0:\n", 409 | " print(\"Epoch:\", i+1,\"Step:\", step, \"loss:\",loss_per_epoch/num_batches)\n", 410 | " saver.save(sess, model_dir + '/model')\n", 411 | " print(\"Saved Model\")" 412 | ] 413 | }, 414 | { 415 | "cell_type": "code", 416 | "execution_count": null, 417 | "metadata": { 418 | "collapsed": true, 419 | "scrolled": false 420 | }, 421 | "outputs": [], 422 | "source": [ 423 | "# training accuracy\n", 424 | "all_predictions = []\n", 425 | "for j in range(num_batches):\n", 426 | " feed_dict = {\n", 427 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 428 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 429 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 430 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 431 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 432 | " batch_predictions = sess.run(predictions, feed_dict)\n", 433 | " all_predictions.append(batch_predictions)\n", 434 | "\n", 435 | "y_pred = []\n", 436 | "for i in range(num_batches):\n", 437 | " for pred in all_predictions[i]:\n", 438 | " y_pred.append(pred)\n", 439 | "\n", 440 | "count = 0\n", 441 | "for i in range(batch_size*num_batches):\n", 442 | " count += y_pred[i]==rel_ids[i]\n", 443 | "accuracy = count/(batch_size*num_batches) * 100\n", 444 | "\n", 445 | "print(\"training accuracy\", accuracy)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": 11, 451 | "metadata": { 452 | "collapsed": true 453 | }, 454 | "outputs": [], 455 | "source": [ 456 | "f = open(data_dir + '/test_lca_paths', 'rb')\n", 457 | "word_p, dep_p, pos_p = pickle.load(f)\n", 458 | "f.close()\n", 459 | "\n", 460 | "relations = []\n", 461 | "for line in open(data_dir + '/test_relations.txt'):\n", 462 | " relations.append(line.strip().split()[0])\n", 463 | "\n", 464 | "length = len(word_p1)\n", 465 | "num_batches = int(length/batch_size)\n", 466 | "\n", 467 | "for i in range(length):\n", 468 | " for j, word in enumerate(word_p[i]):\n", 469 | " word = word.lower()\n", 470 | " word_p[i][j] = word if word in word2id else unknown_token \n", 471 | " for l, d in enumerate(dep_p[i]):\n", 472 | " 
dep_p[i][l] = d if d in dep2id else 'OTH'\n", 473 | " \n", 474 | "word_p_ids = np.ones([length, max_len_path],dtype=int)\n", 475 | "pos_p_ids = np.ones([length, max_len_path],dtype=int)\n", 476 | "dep_p_ids = np.ones([length, max_len_path],dtype=int)\n", 477 | "rel_ids = np.array([rel2id[rel] for rel in relations])\n", 478 | "path_len = np.array([len(w) for w in word_p], dtype=int)\n", 479 | "\n", 480 | "for i in range(length):\n", 481 | " for j, w in enumerate(word_p[i]):\n", 482 | " word_p_ids[i][j] = word2id[w]\n", 483 | " \n", 484 | " for j, w in enumerate(pos_p[i]):\n", 485 | " pos_p_ids[i][j] = pos_tag(w)\n", 486 | " \n", 487 | " for j, w in enumerate(dep_p[i]):\n", 488 | " dep_p_ids[i][j] = dep2id[w]\n", 489 | "\n", 490 | "# test predictions\n", 491 | "all_predictions = []\n", 492 | "for j in range(num_batches):\n", 493 | " feed_dict = {\n", 494 | " path_length:path_len[j*batch_size:(j+1)*batch_size],\n", 495 | " word_ids:word_p_ids[j*batch_size:(j+1)*batch_size],\n", 496 | " pos_ids:pos_p_ids[j*batch_size:(j+1)*batch_size],\n", 497 | " dep_ids:dep_p_ids[j*batch_size:(j+1)*batch_size],\n", 498 | " y:rel_ids[j*batch_size:(j+1)*batch_size]}\n", 499 | " batch_predictions = sess.run(predictions, feed_dict)\n", 500 | " all_predictions.append(batch_predictions)\n", 501 | "\n", 502 | "y_pred = []\n", 503 | "for i in range(num_batches):\n", 504 | " for pred in all_predictions[i]:\n", 505 | " y_pred.append(pred)\n", 506 | "\n", 507 | "count = 0\n", 508 | "for i in range(batch_size*num_batches):\n", 509 | " count += y_pred[i]==rel_ids[i]\n", 510 | "accuracy = count/(batch_size*num_batches) * 100\n", 511 | "\n", 512 | "print(\"test accuracy\", accuracy)" 513 | ] 514 | }, 515 | { 516 | "cell_type": "code", 517 | "execution_count": null, 518 | "metadata": { 519 | "collapsed": true 520 | }, 521 | "outputs": [], 522 | "source": [] 523 | } 524 | ], 525 | "metadata": { 526 | "kernelspec": { 527 | "display_name": "Python 3", 528 | "language": "python", 529 | "name": "python3" 530 | }, 531 | "language_info": { 532 | "codemirror_mode": { 533 | "name": "ipython", 534 | "version": 3 535 | }, 536 | "file_extension": ".py", 537 | "mimetype": "text/x-python", 538 | "name": "python", 539 | "nbconvert_exporter": "python", 540 | "pygments_lexer": "ipython3", 541 | "version": "3.5.2" 542 | } 543 | }, 544 | "nbformat": 4, 545 | "nbformat_minor": 2 546 | } 547 | -------------------------------------------------------------------------------- /LCA SubTree/path_extractor.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import os\n", 12 | "from nltk.parse import stanford\n", 13 | "import nltk\n", 14 | "os.environ['STANFORD_PARSER'] = '/home/shanu/nltk/jars/stanford-parser.jar'\n", 15 | "os.environ['STANFORD_MODELS'] = '/home/shanu/nltk/jars/stanford-parser-3.7.0-models.jar'" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "# Dependency Tree\n", 27 | "from nltk.parse.stanford import StanfordDependencyParser\n", 28 | "dep_parser=StanfordDependencyParser(model_path=\"/home/shanu/nltk/jars/englishPCFG.ser.gz\")" 29 | ] 30 | }, 31 | { 32 | "cell_type": "code", 33 | "execution_count": 3, 34 | "metadata": { 35 | "collapsed": true 36 | }, 37 | "outputs": [], 38 | "source": [ 39 | "def lca(tree, index1, 
index2):\n", 40 | " node = index1\n", 41 | " path1 = []\n", 42 | " path2 = []\n", 43 | " path1.append(index1)\n", 44 | " path2.append(index2)\n", 45 | " while(node != tree.root):\n", 46 | " node = tree.nodes[node['head']]\n", 47 | " path1.append(node)\n", 48 | " node = index2\n", 49 | " while(node != tree.root):\n", 50 | " node = tree.nodes[node['head']]\n", 51 | " path2.append(node)\n", 52 | " for l1, l2 in zip(path1[::-1],path2[::-1]):\n", 53 | " if(l1==l2):\n", 54 | " temp = l1\n", 55 | " return temp" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 4, 61 | "metadata": { 62 | "collapsed": true 63 | }, 64 | "outputs": [], 65 | "source": [ 66 | "def path_lca(tree, node, lca_node):\n", 67 | " path = []\n", 68 | " path.append(node)\n", 69 | " while(node != lca_node):\n", 70 | " node = tree.nodes[node['head']]\n", 71 | " path.append(node)\n", 72 | " return path" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": 5, 78 | "metadata": { 79 | "collapsed": true 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "def seq(lca):\n", 84 | " l=[lca]\n", 85 | " for key in tree.nodes[lca]['deps']:\n", 86 | " for i in tree.nodes[lca]['deps'][key]:\n", 87 | " l.extend(seq(i))\n", 88 | " return l" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": null, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 8, 103 | "metadata": { 104 | "collapsed": true 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "import _pickle \n", 109 | "f = open('../data/training_data', 'rb')\n", 110 | "sentences, e1, e2 = _pickle.load(f)\n", 111 | "f.close()" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": 9, 117 | "metadata": { 118 | "collapsed": true 119 | }, 120 | "outputs": [], 121 | "source": [ 122 | "sentences[7588] = 'The reaction mixture is kept in the dark at room temperature for 1.5 hours .'\n", 123 | "sentences[2608] = \"This strawberry sauce has about a million uses , is freezer-friendly , and is so much better than that jar of Smuckers strawberry sauce that you 've had sitting in your fridge since that time you made banana splits 1.5 years ago .\"" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 41, 129 | "metadata": { 130 | "collapsed": true 131 | }, 132 | "outputs": [], 133 | "source": [ 134 | "# sentences[2590] = \"The pendant with the bail measure 1.25'' .\"\n", 135 | "# sentences[2664] = \"The cabinet encloses a 6.5 inch cone woofer , 4 inch cone midrange , and a 0.86 inch balanced dome tweeter .\"" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 12, 141 | "metadata": { 142 | "collapsed": true 143 | }, 144 | "outputs": [], 145 | "source": [ 146 | "length = len(sentences)" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 13, 152 | "metadata": { 153 | "collapsed": true 154 | }, 155 | "outputs": [], 156 | "source": [ 157 | "word_p = []\n", 158 | "rel_p = []\n", 159 | "pos_p = []\n", 160 | "for i in range(length):\n", 161 | " word_p.append(0)\n", 162 | " rel_p.append(0)\n", 163 | " pos_p.append(0)" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 44, 169 | "metadata": { 170 | "scrolled": true 171 | }, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "2590 success [2, 1, 6, 4, 5, 3]\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "# for i in 
range(length):\n", 183 | "i = 2590\n", 184 | "try:\n", 185 | " parse_tree = dep_parser.raw_parse(sentences[i])\n", 186 | " for trees in parse_tree:\n", 187 | " tree = trees\n", 188 | " node1 = tree.nodes[e1[i]+1]\n", 189 | " node2 = tree.nodes[e2[i]+1]\n", 190 | " if node1['address']!=None and node2['address']!=None:\n", 191 | " lca_node = lca(tree, node1, node2)\n", 192 | " path = seq(lca_node['address'])\n", 193 | " print(i, \"success\", path)\n", 194 | "\n", 195 | " word_p[i] = [tree.nodes[p][\"word\"] for p in path]\n", 196 | " rel_p[i] = [tree.nodes[p][\"rel\"] for p in path]\n", 197 | " pos_p[i] = [tree.nodes[p][\"tag\"] for p in path]\n", 198 | " else:\n", 199 | "\n", 200 | " print(i, node1[\"address\"], node2[\"address\"])\n", 201 | "except AssertionError:\n", 202 | " print(i, \"error\")\n", 203 | " " 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": 45, 209 | "metadata": { 210 | "collapsed": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "file = open('../data/train_lca_paths', 'wb')\n", 215 | "_pickle.dump([word_p, rel_p, pos_p], file)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": null, 221 | "metadata": { 222 | "collapsed": true 223 | }, 224 | "outputs": [], 225 | "source": [] 226 | } 227 | ], 228 | "metadata": { 229 | "kernelspec": { 230 | "display_name": "Python 3", 231 | "language": "python", 232 | "name": "python3" 233 | }, 234 | "language_info": { 235 | "codemirror_mode": { 236 | "name": "ipython", 237 | "version": 3 238 | }, 239 | "file_extension": ".py", 240 | "mimetype": "text/x-python", 241 | "name": "python", 242 | "nbconvert_exporter": "python", 243 | "pygments_lexer": "ipython3", 244 | "version": "3.5.2" 245 | } 246 | }, 247 | "nbformat": 4, 248 | "nbformat_minor": 2 249 | } 250 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Shanu Kumar 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /LSTM Seq and Tree/README.md: -------------------------------------------------------------------------------- 1 | ## Relation Classification using LSTMs on Sequences and Tree Structures 2 | 3 | We implemented a architecture based on the paper [End-to-End Relation Extraction using LSTMs 4 | on Sequences and Tree Structures](http://www.aclweb.org/anthology/P/P16/P16-1105.pdf). This recurrent neural network based model captures both word sequence and dependency tree substructure information by stacking bidirectional treestructured LSTM-RNNs on bidirectional sequential LSTM-RNNs. This allows our model to jointly represent both entities and relations with shared parameters in a single model. 5 | 6 | 7 | Our model allows 8 | joint modeling of entities and relations in a single 9 | model by using both bidirectional sequential 10 | (left-to-right and right-to-left) and bidirectional 11 | tree-structured (bottom-up and top-down) LSTMRNNs. 12 | 13 | 14 | ## Model 15 | The model mainly consists of three representation layers: 16 | a embeddings layer, a word sequence based LSTM-RNN layer (sequence layer), and finally a dependency subtree based LSTM-RNN layer (dependency layer). 17 | 18 | ![Relation Classification Network](/img/lstm_tree.jpg) 19 | 20 | ### Embedding Layer 21 | Embedding layer consists of words, part-of-speech (POS) tags, dependency relations. 22 | 23 | ### Sequence Layer 24 | The sequence layer represents words in a linear sequence 25 | using the representations from the embedding layer. We represent the word sequence in a sentence with bidirectional LSTM-RNNs. 26 | The LSTM unit at t-th word receives the concatenation of word and POS embeddings as its input vector. 27 | 28 |

![Sequence Layer](/img/lstm_seq.jpg) 29 | 30 |
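A minimal TensorFlow 1.x sketch of this sequence layer is given below (the same TF version the notebooks use). It only illustrates the idea — the placeholder names, dimensions, and the choice of `tf.nn.bidirectional_dynamic_rnn` are assumptions made for the sketch, not code taken from the repository.

```python
import tensorflow as tf  # TensorFlow 1.x, as used elsewhere in this repository

# Illustrative sizes (assumptions, not the repository's settings)
batch_size, max_len = 10, 70
word_embd_dim, pos_embd_dim, state_size = 100, 25, 100

# Per-token word and POS embeddings, already looked up
word_embd = tf.placeholder(tf.float32, [batch_size, max_len, word_embd_dim])
pos_embd = tf.placeholder(tf.float32, [batch_size, max_len, pos_embd_dim])
seq_len = tf.placeholder(tf.int32, [batch_size])

# Input to the sequence layer: concatenation of word and POS embeddings
x = tf.concat([word_embd, pos_embd], axis=2)

# Bidirectional LSTM over the whole sentence
cell_fw = tf.contrib.rnn.BasicLSTMCell(state_size)
cell_bw = tf.contrib.rnn.BasicLSTMCell(state_size)
(h_fw, h_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    cell_fw, cell_bw, x, sequence_length=seq_len, dtype=tf.float32)

# Output vector s_t for each word: the two directions' hidden states concatenated
s = tf.concat([h_fw, h_bw], axis=2)  # shape: [batch_size, max_len, 2 * state_size]
```

Some notebooks in this repository (for example the LCA SubTree models) instead build the backward direction by running a second `tf.nn.dynamic_rnn` over inputs reversed with `tf.reverse`.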

31 | 32 | We also concatenate the hidden state vectors of the two directions’ LSTM units corresponding to each word (denoted as ↑ht and ↓ht) as its output vector (st), and pass it to the subsequent layers. 33 | 34 | ### Entity Detection 35 | We perform entity detection on top of the sequence 36 | layer. We employ a two-layered NN with an hidden layer and a softmax output layer for entity detection. 37 | 38 | ### Dependency Layer 39 | The dependency layer represents a relation between a pair of two target words (corresponding to a relation candidate in relation classification) in 40 | the dependency tree. 41 | 42 | This layer mainly focuses on the shortest path between a pair of target words in the dependency tree (i.e., the path between the least common node and the two target words). 43 | 44 | We employ bidirectional tree-structured LSTMRNNs (i.e., bottom-up and top-down) to represent a relation candidate by capturing the dependency 45 | structure around the target word pair. This bidirectional structure propagates to each node not only the information from the leaves but also information from the root. This is especially important for relation classification, which makes use of argument nodes near the bottom of the tree, and our top-down LSTM-RNN sends information from the top of the tree to such near-leaf nodes (unlike in standard bottom-up LSTM-RNNs). 46 | 47 | Tree-structured LSTM-RNN's equations : 48 |

![Tree-structured LSTM-RNN equations](/img/lstm_tree_eq.jpg) 49 | 50 |
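To make the bottom-up computation concrete, below is a minimal NumPy sketch of a single child-sum Tree-LSTM node update (the generic formulation of Tai et al., 2015). It is an illustration under assumptions — the function name, parameter layout, and sizes are invented for the sketch, and the exact equations used by the paper are the ones in the figure above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(x, child_h, child_c, W, U, b):
    """One bottom-up child-sum Tree-LSTM step.

    x       : input vector of the current node, shape (d_in,)
    child_h : hidden states of the node's children, shape (n_children, d_h)
    child_c : cell states of the node's children, shape (n_children, d_h)
    W, U, b : gate parameters, dicts keyed by 'i', 'o', 'u', 'f'
    """
    h_sum = child_h.sum(axis=0)  # children's hidden states are summed

    i = sigmoid(W['i'] @ x + U['i'] @ h_sum + b['i'])   # input gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_sum + b['o'])   # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_sum + b['u'])   # candidate cell state
    # one forget gate per child, computed from that child's own hidden state
    f = sigmoid(child_h @ U['f'].T + W['f'] @ x + b['f'])  # shape (n_children, d_h)

    c = i * u + (f * child_c).sum(axis=0)  # new cell state
    h = o * np.tanh(c)                     # new hidden state
    return h, c

# Toy usage with random parameters
d_in, d_h, n_children = 4, 3, 2
rng = np.random.RandomState(0)
W = {g: rng.randn(d_h, d_in) for g in 'iouf'}
U = {g: rng.randn(d_h, d_h) for g in 'iouf'}
b = {g: np.zeros(d_h) for g in 'iouf'}
h, c = tree_lstm_node(rng.randn(d_in),
                      rng.randn(n_children, d_h),
                      rng.randn(n_children, d_h),
                      W, U, b)
```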

51 | 52 | While we use one node from Shortest Dependency path, then the hidden and current states of the children of this node in Dependency Tree are taken as previous state in LSTM. 53 | 54 | We stack the dependency layers (corresponding to relation candidates) on top of the sequence layer to incorporate both word sequence and dependency tree structure information into the output. 55 | The dependency-layer LSMT unit at the t-th word recives as input, the concatenation of its corresponding hidden state vectors st in the sequence layer, dependency type embedding. 56 | 57 | ### Relation Classification 58 | The relation candidate vector is constructed as 59 | the concatenation dp = [↑hpA; ↓hp1; ↓hp2], where ↑hpA is the hidden state vector of the top LSTM unit in the bottom-up LSTM-RNN (representing the lowest common ancestor of the target word pair p), and ↓hp1, ↓hp2 are the hidden state vectors of the two LSTM units representing the first and second target words in the top-down LSTMRNN. 60 | 61 | Similarly to the entity detection, we employ a two-layered NN with an hidden layer and a softmax output layer. 62 | 63 | ### Training 64 | 65 | We update the model parameters including weights, biases, and embeddings by BPTT and Adam gradient descent with gradient clipping, L2-regularization 66 | (we regularize weights W and U, not the bias terms b). We also apply dropout to the embedding layer and to the final hidden layers for entity detection and relation classification. We employ entity pretraining to improve the model. 67 | 68 | ### Data 69 | 70 | SemEval-2010 Task 8 defines 9 relation types between nominals and a tenth type Other when two nouns have none of these relations and no direction is considered. 71 | ## Experiments 72 | 73 | Model | Train-Accuracy | Test-Accuracy| Epochs 74 | --- | --- | ---| --- 75 | model3v1 | 97.54 | 66.5 | 11 76 | model3v2 | 99.9 | 70.69 | 19 77 | 78 | 79 | * Learning rate = 0.001 80 | * Learning rate decay = 0.96 81 | * state size = 100 82 | * lambda_l2 = 0.0001 83 | * Gradient Clipping = 10 84 | * Entity Detection Pretrained 85 | 86 | 87 | ### [model3v1](https://github.com/Sshanu/Relation-Classification/blob/master/LSTM%20Seq%20and%20Tree/model3v1.ipynb) 88 | * Bidirectional LSTM over whole sentence 89 | * Bottom-up and Top-down LSTM along Shortest Dependency Path with childrens from Dependency tree. 90 | 91 | ### [model3v2](https://github.com/Sshanu/Relation-Classification/blob/master/LSTM%20Seq%20and%20Tree/model3v2.ipynb) 92 | * Dropout on hidden layers of both entity detection and relation classifier of 0.3. 93 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Relation Classification 2 | 3 | [![MIT License](https://img.shields.io/badge/license-MIT-green.svg)](https://opensource.org/licenses/MIT) 4 | 5 | Relation classification aims to categorize into predefined classes the relations btw pairs of given entities in texts. There are two ways to represent relations between entities using deep neural networks: recurrent neural networks (RNNs) and convolutional neural networks (CNNs). 
We have implemented three LSTM-RNN architectures for solving the task of relation classification: 6 | * [Relation classification using LSTM Networks along Shortest Dependency Paths.](https://github.com/Sshanu/Relation-Classification/tree/master/LCA%20Shortest%20Path) 7 | * [Relation classification using bidirectional LSTM Networks on LCA Sub Tree.](https://github.com/Sshanu/Relation-Classification/tree/master/LCA%20SubTree) 8 | * [Relation classification using LSTMS on Sequences and Tree Structures.](https://github.com/Sshanu/Relation-Classification/tree/master/LSTM%20Seq%20and%20Tree) 9 | 10 | We achieve better performance for solving this task using the last approach "[Relation classification using LSTMS on Sequences and Tree Structures.](https://github.com/Sshanu/Relation-Classification/tree/master/LSTM%20Seq%20and%20Tree)". 11 | 12 | 13 | ### References: 14 | 15 | > **End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures**
16 | > Makoto Miwa, Mohit Bansal
17 | > [http://www.aclweb.org/anthology/P/P16/P16-1105.pdf](http://www.aclweb.org/anthology/P/P16/P16-1105.pdf) 18 | > 19 | > **Abstract:** *We present a novel end-to-end neural 20 | model to extract entities and relations between them. Our recurrent neural network based model captures both word sequence and dependency tree substructure 21 | information by stacking bidirectional treestructured LSTM-RNNs on bidirectional 22 | sequential LSTM-RNNs. This allows our 23 | model to jointly represent both entities and 24 | relations with shared parameters in a single model. We further encourage detection of entities during training and use of 25 | entity information in relation extraction 26 | via entity pretraining and scheduled sampling. Our model improves over the stateof-the-art feature-based model on end-toend relation extraction, achieving 12.1% 27 | and 5.7% relative error reductions in F1- 28 | score on ACE2005 and ACE2004, respectively. We also show that our LSTMRNN based model compares favorably to 29 | the state-of-the-art CNN based model (in 30 | F1-score) on nominal relation classification (SemEval-2010 Task 8). Finally, we 31 | present an extensive ablation analysis of 32 | several model components* 33 | 34 | > **Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths**
35 | > Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, Zhi Jin
36 | > [http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP206.pdf](http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP206.pdf) 37 | > 38 | > **Abstract:** *Relation classification is an important research arena in the field of natural language processing (NLP). In this paper, we 39 | present SDP-LSTM, a novel neural network to classify the relation of two entities in a sentence. Our neural architecture 40 | leverages the shortest dependency path 41 | (SDP) between two entities; multichannel recurrent neural networks, with long 42 | short term memory (LSTM) units, pick 43 | up heterogeneous information along the 44 | SDP. Our proposed model has several distinct features: (1) The shortest dependency 45 | paths retain most relevant information (to 46 | relation classification), while eliminating 47 | irrelevant words in the sentence. (2) The 48 | multichannel LSTM networks allow effective information integration from heterogeneous sources over the dependency 49 | paths. (3) A customized dropout strategy 50 | regularizes the neural network to alleviate overfitting. We test our model on the 51 | SemEval 2010 relation classification task, 52 | and achieve an F1-score of 83.7%, higher 53 | than competing methods in the literature.* 54 | -------------------------------------------------------------------------------- /data/dependency_types.txt: -------------------------------------------------------------------------------- 1 | root 2 | nmod 3 | nsubj 4 | dobj 5 | nsubjpass 6 | compound 7 | conj 8 | acl 9 | advcl 10 | ccomp 11 | amod 12 | acl:relcl 13 | xcomp 14 | dep 15 | appos 16 | nmod:poss 17 | advmod 18 | parataxis 19 | csubj 20 | iobj 21 | -------------------------------------------------------------------------------- /data/full_postags_types.txt: -------------------------------------------------------------------------------- 1 | CC 2 | CD 3 | DT 4 | EX 5 | FW 6 | IN 7 | JJ 8 | JJR 9 | JJS 10 | LS 11 | MD 12 | NN 13 | NNS 14 | NNP 15 | NNPS 16 | PDT 17 | POS 18 | PRP 19 | PRP$ 20 | RB 21 | RBR 22 | RBS 23 | RP 24 | SYM 25 | TO 26 | UH 27 | VB 28 | VBD 29 | VBG 30 | VBN 31 | VBP 32 | VBZ 33 | WDT 34 | WP 35 | WP$ 36 | WRB 37 | -------------------------------------------------------------------------------- /data/pos_tags.txt: -------------------------------------------------------------------------------- 1 | CC 2 | CD 3 | DT 4 | IN 5 | JJ 6 | NN 7 | PRP 8 | RB 9 | VB 10 | -------------------------------------------------------------------------------- /data/relation_types.txt: -------------------------------------------------------------------------------- 1 | Other 2 | Entity-Destination(e1,e2) 3 | Cause-Effect(e2,e1) 4 | Member-Collection(e2,e1) 5 | Entity-Origin(e1,e2) 6 | Message-Topic(e1,e2) 7 | Component-Whole(e2,e1) 8 | Component-Whole(e1,e2) 9 | Instrument-Agency(e2,e1) 10 | Product-Producer(e2,e1) 11 | Content-Container(e1,e2) 12 | Cause-Effect(e1,e2) 13 | Product-Producer(e1,e2) 14 | Content-Container(e2,e1) 15 | Entity-Origin(e2,e1) 16 | Message-Topic(e2,e1) 17 | Instrument-Agency(e1,e2) 18 | Member-Collection(e1,e2) 19 | Entity-Destination(e2,e1) -------------------------------------------------------------------------------- /data/relation_typesv3.txt: -------------------------------------------------------------------------------- 1 | Other 2 | Entity-Destination 3 | Cause-Effect 4 | Member-Collection 5 | Entity-Origin 6 | Message-Topic 7 | Component-Whole 8 | Instrument-Agency 9 | Product-Producer 10 | Content-Container 11 | 
-------------------------------------------------------------------------------- /data/test_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_data -------------------------------------------------------------------------------- /data/test_lca_paths: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_lca_paths -------------------------------------------------------------------------------- /data/test_pathsv1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_pathsv1 -------------------------------------------------------------------------------- /data/test_pathsv3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/test_pathsv3 -------------------------------------------------------------------------------- /data/train_data: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_data -------------------------------------------------------------------------------- /data/train_lca_paths: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_lca_paths -------------------------------------------------------------------------------- /data/train_pathsv1: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_pathsv1 -------------------------------------------------------------------------------- /data/train_pathsv3: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/train_pathsv3 -------------------------------------------------------------------------------- /data/vocab.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab.pkl -------------------------------------------------------------------------------- /data/vocab_glove: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab_glove -------------------------------------------------------------------------------- /data/vocab_wiki: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/data/vocab_wiki -------------------------------------------------------------------------------- /img/lca.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lca.jpg -------------------------------------------------------------------------------- /img/lstm_seq.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_seq.jpg -------------------------------------------------------------------------------- /img/lstm_tree.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_tree.jpg -------------------------------------------------------------------------------- /img/lstm_tree_eq.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Sshanu/Relation-Classification-using-Bidirectional-LSTM-Tree/c779d9ae2caab28f55ff66b54ee194c30ad4b2ff/img/lstm_tree_eq.jpg -------------------------------------------------------------------------------- /preprocessing.py.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": true 8 | }, 9 | "outputs": [], 10 | "source": [ 11 | "import re, sys, nltk\n", 12 | "from nltk.tokenize.stanford import StanfordTokenizer\n", 13 | "path_to_jar = \"/home/shanu/nltk/jars/stanford-postagger.jar\"\n", 14 | "tokenizer = StanfordTokenizer(path_to_jar)\n", 15 | "import _pickle as pickle" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "metadata": { 22 | "collapsed": true 23 | }, 24 | "outputs": [], 25 | "source": [ 26 | "# Extracting the Relations \n", 27 | "# Please comment this when preprocessing the sentences.\n", 28 | "# for training data open \"TRAIN_FILE.TXT\" and for test data open \"TEST_FILE_FULL.TXT\"\n", 29 | "\n", 30 | "lines = []\n", 31 | "for line in open(\"data/TRAIN_FILE.TXT\"):\n", 32 | " lines.append(line.strip())\n", 33 | "\n", 34 | "relations = []\n", 35 | "for i, w in enumerate(lines):\n", 36 | " if((i+3)%4==0):\n", 37 | " relations.append(w)\n", 38 | " \n", 39 | "f = open(\"data/train_relations.txt\", 'w')\n", 40 | "for rel in relations:\n", 41 | " f.write(rel+'\\n')" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "# For preprocessing Training data open \"TRAIN_FILE.TXT and for Test data open \"TEST_FILE.txt\n", 53 | "\n", 54 | "lines = []\n", 55 | "for line in open(\"data/TRAIN_FILE.TXT\"): \n", 56 | " m = re.match(r'^([0-9]+)\\s\"(.+)\"$', line.strip())\n", 57 | " if(m is not None):\n", 58 | " lines.append(m.group(2))" 59 | ] 60 | }, 61 | { 62 | "cell_type": "code", 63 | "execution_count": null, 64 | "metadata": {}, 65 | "outputs": [], 66 | "source": [ 67 | "len(relations)" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": null, 73 
| "metadata": { 74 | "scrolled": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "sentences = []\n", 79 | "e1 = []\n", 80 | "e2 = []\n", 81 | "for j,line in enumerate(lines):\n", 82 | " text = []\n", 83 | " temp = []\n", 84 | " t = line.split(\"\")\n", 85 | " text.append(t[0])\n", 86 | " temp.append(t[0])\n", 87 | "\n", 88 | " t = t[1].split(\"\")\n", 89 | " e1_text = text\n", 90 | " e1_text = \" \".join(e1_text)\n", 91 | " e1_text = tokenizer.tokenize(e1_text)\n", 92 | " text.append(t[0])\n", 93 | " e11= t[0]\n", 94 | " y = tokenizer.tokenize(t[0])\n", 95 | " y[0] +=\"E11\"\n", 96 | " temp.append(\" \".join(y))\n", 97 | " t = t[1].split(\"\")\n", 98 | " text.append(t[0])\n", 99 | " temp.append(t[0])\n", 100 | " t = t[1].split(\"\")\n", 101 | " e22 = t[0]\n", 102 | " e2_text = text\n", 103 | " e2_text = \" \".join(e2_text)\n", 104 | " e2_text = tokenizer.tokenize(e2_text)\n", 105 | " text.append(t[0])\n", 106 | " text.append(t[1])\n", 107 | " y = tokenizer.tokenize(t[0])\n", 108 | " y[0] +=\"E22\"\n", 109 | " temp.append(\" \".join(y))\n", 110 | " temp.append(t[1])\n", 111 | "\n", 112 | " text = \" \".join(text)\n", 113 | " text = tokenizer.tokenize(text)\n", 114 | " temp = \" \".join(temp)\n", 115 | " temp = tokenizer.tokenize(temp)\n", 116 | "\n", 117 | " q1 = tokenizer.tokenize(e11)[0]\n", 118 | " q2 = tokenizer.tokenize(e22)[0]\n", 119 | " for i, word in enumerate(text):\n", 120 | " if(word.find(q1)!=-1):\n", 121 | " if(temp[i].find(\"E11\")!=-1):\n", 122 | " e1.append(i) \n", 123 | " break\n", 124 | " for i, word in enumerate(text):\n", 125 | " if(word.find(q2)!=-1):\n", 126 | " if(temp[i].find(\"E22\")!=-1):\n", 127 | " e2.append(i) \n", 128 | " text = \" \".join(text)\n", 129 | " sentences.append(text)\n", 130 | " print(j, text)" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": null, 136 | "metadata": {}, 137 | "outputs": [], 138 | "source": [ 139 | "len(sentences), len(e1), len(e2)" 140 | ] 141 | }, 142 | { 143 | "cell_type": "code", 144 | "execution_count": null, 145 | "metadata": { 146 | "collapsed": true 147 | }, 148 | "outputs": [], 149 | "source": [ 150 | "# for saving training data open \"train_data\" and for test data open \"test_data\"\n", 151 | "\n", 152 | "with open('data/train_data', 'wb') as f:\n", 153 | " pickle.dump((sentences, e1, e2), f)\n", 154 | " f.close()" 155 | ] 156 | }, 157 | { 158 | "cell_type": "code", 159 | "execution_count": null, 160 | "metadata": { 161 | "collapsed": true 162 | }, 163 | "outputs": [], 164 | "source": [] 165 | } 166 | ], 167 | "metadata": { 168 | "kernelspec": { 169 | "display_name": "Python 3", 170 | "language": "python", 171 | "name": "python3" 172 | }, 173 | "language_info": { 174 | "codemirror_mode": { 175 | "name": "ipython", 176 | "version": 3 177 | }, 178 | "file_extension": ".py", 179 | "mimetype": "text/x-python", 180 | "name": "python", 181 | "nbconvert_exporter": "python", 182 | "pygments_lexer": "ipython3", 183 | "version": "3.5.4" 184 | } 185 | }, 186 | "nbformat": 4, 187 | "nbformat_minor": 2 188 | } 189 | --------------------------------------------------------------------------------