├── LICENSE
├── README.md
├── preprocessing
│   ├── extract_wikipedia_corpora_boxer_test.py
│   └── extract_wikipedia_corpora_boxer_training.py
├── source
│   ├── boxer_graph_module.py
│   ├── em_inside_outside_algorithm.py
│   ├── explore_decoder_graph_explorative.py
│   ├── explore_decoder_graph_greedy.py
│   ├── explore_training_graph.py
│   ├── function_select_methods.py
│   ├── functions_configuration_file.py
│   ├── functions_model_files.py
│   ├── functions_prepare_elementtree_dot.py
│   ├── methods_feature_extract.py
│   ├── methods_training_graph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py
│   └── training_graph_module.py
├── start_learning_training_models.py
└── start_simplifying_complex_sentence.py
/LICENSE:
--------------------------------------------------------------------------------
1 | BSD 3-Clause License
2 |
3 | Copyright (c) 2018, Shashi Narayan
4 | All rights reserved.
5 |
6 | Redistribution and use in source and binary forms, with or without
7 | modification, are permitted provided that the following conditions are met:
8 |
9 | * Redistributions of source code must retain the above copyright notice, this
10 | list of conditions and the following disclaimer.
11 |
12 | * Redistributions in binary form must reproduce the above copyright notice,
13 | this list of conditions and the following disclaimer in the documentation
14 | and/or other materials provided with the distribution.
15 |
16 | * Neither the name of the copyright holder nor the names of its
17 | contributors may be used to endorse or promote products derived from
18 | this software without specific prior written permission.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Hybrid Simplification using Deep Semantics and Machine Translation
2 |
3 | Sentence simplification maps a sentence to a simpler, more readable
4 | one approximating its content. In practice, simplification is often
5 | modeled using four main operations: splitting a complex sentence into
6 | several simpler sentences; dropping and reordering phrases or
7 | constituents; and substituting words/phrases with simpler ones.
8 |
9 | This is the implementation from our ACL'14 paper. We have modified our
10 | code to let you choose which simplification operations you want to
11 | apply to your complex sentences. Please go through our paper for more
12 | details, and contact Shashi Narayan
13 | (shashi.narayan(at){ed.ac.uk,gmail.com}) with any queries.
14 |
15 | If you use our code, please cite the following paper.
16 |
17 | **Hybrid Simplification using Deep Semantics and Machine Translation,
18 | Shashi Narayan and Claire Gardent, The 52nd Annual Meeting of the
19 | Association for Computational Linguistics (ACL), Baltimore, June
20 | 2014. https://aclweb.org/anthology/P/P14/P14-1041.pdf.**
21 |
22 | > We present a hybrid approach to sentence simplification which
23 | > combines deep semantics and monolingual machine translation to
24 | > derive simple sentences from complex ones. The approach differs from
25 | > previous work in two main ways. First, it is semantic based in that
26 | > it takes as input a deep semantic representation rather than e.g., a
27 | > sentence or a parse tree. Second, it combines a simplification model
28 | > for splitting and deletion with a monolingual translation model for
29 | > phrase substitution and reordering. When compared against current
30 | > state of the art methods, our model yields significantly simpler
31 | > output that is both grammatical and meaning preserving.
32 |
33 | ### Requirements
34 |
35 | * Boxer 1.00: http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer, http://www.cl.cam.ac.uk/~sc609/candc-1.00.html
36 | * Moses: http://www.statmt.org/moses/?n=Development.GetStarted
37 | * Mgiza++: http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3
38 | * NLTK toolkit: http://www.nltk.org/
39 | * Python 2.7
40 | * Stanford Toolkit: http://nlp.stanford.edu/software/tagger.html
41 |
42 | ### Data preparation
43 |
44 |
45 | #### Training Data
46 |
47 | * code: ./preprocessing/extract_wikipedia_corpora_boxer_training.py
48 |
49 | * This code prepares the training data. It takes as input tokenized
50 | training (complex, simple) sentences and the Boxer output (XML
51 | format) of the complex sentences.
52 |
53 | * I will improve the interface of this script later. For now, you
54 | have to set the following parameters (C: complex sentence, S: simple
55 | sentence); the file formats are illustrated below.
56 |
57 | * ZHUDATA_FILE_ORG = Path to the file with combined
58 | complex-simple pairs. Format:
59 | C_1\nS^1_1\nS^2_1\n\nC_2\nS^1_2\nS^2_2\nS^3_2\n\n and so on.
60 |
61 | * ZHUDATA_FILE_MAIN = Path to the file with all tokenized complex
62 | sentences. Format: C_1\nC_2\n and so on.
63 |
64 | * ZHUDATA_FILE_SIMPLE = Path to the file with all tokenized
65 | simple sentences. Format: S^1_1\nS^2_1\nS^1_2\nS^2_2\nS^3_2\n and
66 | so on.
67 |
68 | * BOXER_DATADIR = Directory containing the Boxer output
69 | of ZHUDATA_FILE_MAIN.
70 |
71 | * CHUNK_SIZE = Size of the Boxer output chunks. The above script
72 | loads each Boxer XML file before parsing it, so it is much faster to
73 | process ZHUDATA_FILE_MAIN in chunks (say, of 10000 sentences).
74 |
75 | * boxer_main_filename = Boxer output file name pattern. For
76 | example:
77 | filename + "." + str(lower_index) + "-" + str(lower_index + CHUNK_SIZE)
78 |
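  | A made-up example of these formats, with two complex sentences (the
  | first split into two simple sentences, the second into three):
  |
  | ```
  | # ZHUDATA_FILE_ORG (hypothetical content)
  | The cat , which was hungry , ate quickly .
  | The cat was hungry .
  | It ate quickly .
  |
  | Rome , the capital of Italy , hosts the Vatican and attracts many tourists .
  | Rome is the capital of Italy .
  | Rome hosts the Vatican .
  | Rome attracts many tourists .
  | ```
  |
  | ZHUDATA_FILE_MAIN would then contain only the two complex lines, and
  | ZHUDATA_FILE_SIMPLE the five simple lines, in the same order.
  |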
79 | #### Test Data
80 |
81 | * code: ./preprocessing/extract_wikipedia_corpora_boxer_test.py
82 |
83 | * This code prepares the test data. It takes as input tokenized test
84 | (complex) sentences and their boxer outputs in xml format.
85 |
86 | * I will improve the interface of this script later. For now, you
87 | have to set the following parameters:
88 |
89 | * TEST_FILE_MAIN: Path to the file with all tokenized complex
90 | sentences. Format: C_1\nC_2\n and so on.
91 |
92 | * TEST_FILE_BOXER: Path to the Boxer XML output file for
93 | TEST_FILE_MAIN.
94 |
95 | ### Training
96 |
97 | * Training goes through three states: 1) building Boxer training
98 | graphs, 2) EM training, and 3) SMT training.
99 |
100 | ```
101 | python start_learning_training_models.py --help
102 |
103 | usage: python learn_training_models.py [-h] [--start-state Start_State]
104 | [--end-state End_State]
105 | [--transformation TRANSFORMATION_MODEL]
106 | [--max-split MAX_SPLIT_SIZE]
107 | [--restricted-drop-rel RESTRICTED_DROP_REL]
108 | [--allowed-drop-mod ALLOWED_DROP_MOD]
109 | [--method-training-graph Method_Training_Graph]
110 | [--method-feature-extract Method_Feature_Extract]
111 | [--train-boxer-graph Train_Boxer_Graph]
112 | [--num-em NUM_EM_ITERATION]
113 | [--lang-model Lang_Model]
114 | [--d2s-config D2S_Config] --output-dir
115 | Output_Directory
116 |
117 | Start the training process.
118 |
119 | optional arguments:
120 | -h, --help show this help message and exit
121 | --start-state Start_State
122 | Start state of the training process
123 | --end-state End_State
124 | End state of the training process
125 | --transformation TRANSFORMATION_MODEL
126 | Transformation models learned
127 | --max-split MAX_SPLIT_SIZE
128 | Maximum split size
129 | --restricted-drop-rel RESTRICTED_DROP_REL
130 | Restricted drop relations
131 | --allowed-drop-mod ALLOWED_DROP_MOD
132 | Allowed drop modifiers
133 | --method-training-graph Method_Training_Graph
134 | Operation set for training graph file
135 | --method-feature-extract Method_Feature_Extract
136 | Operation set for extracting features
137 | --train-boxer-graph Train_Boxer_Graph
138 | The training corpus file (xml, stanford-tokenized,
139 | boxer-graph)
140 | --num-em NUM_EM_ITERATION
141 | The number of EM Algorithm iterations
142 | --lang-model Lang_Model
143 | Language model information (in the moses format)
144 | --d2s-config D2S_Config
145 | D2S Configuration file
146 | --output-dir Output_Directory
147 | The output directory
148 | ```
149 |
150 | * Have a look at start_learning_training_models.py for more
151 | information on the parameter definitions and their default values.
152 |
153 | * train-boxer-graph: this is the output file from the training data
154 | preparation step.
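  |
  | A hypothetical invocation (file names are placeholders; --output-dir
  | is the only required option):
  |
  | ```
  | python start_learning_training_models.py \
  |     --train-boxer-graph training-data.boxer-graph.xml \
  |     --num-em 50 \
  |     --output-dir ./d2s-models
  | ```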
155 |
156 | ### Testing
157 |
158 | ```
159 | python start_simplifying_complex_sentence.py --help
160 |
161 | usage: python simplify_complex_sentence.py [-h]
162 | [--test-boxer-graph Test_Boxer_Graph]
163 | [--nbest-distinct N_Best_Distinct]
164 | [--explore-decoder Explore_Decoder]
165 | --d2s-config D2S_Config
166 | --output-dir Output_Directory
167 |
168 | Start simplifying complex sentences.
169 |
170 | optional arguments:
171 | -h, --help show this help message and exit
172 | --test-boxer-graph Test_Boxer_Graph
173 | The test corpus file (xml, stanford-tokenized, boxer-
174 | graph)
175 | --nbest-distinct N_Best_Distinct
176 | N Best Distinct produced from Moses
177 | --explore-decoder Explore_Decoder
178 | Method for generating the decoder graph
179 | --d2s-config D2S_Config
180 | D2S Configuration file
181 | --output-dir Output_Directory
182 | The output directory
183 | ```
184 |
185 | * test-boxer-graph: this is the output file from the test data
186 | preparation step (the *.boxer-graph.xml file generated above).
187 |
188 | * d2s-config: This is the output configuration file from the training
189 | stage.
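  |
  | A hypothetical invocation (file names are placeholders; --d2s-config
  | and --output-dir are required):
  |
  | ```
  | python start_simplifying_complex_sentence.py \
  |     --test-boxer-graph complex.tokenized.boxer-graph.xml \
  |     --d2s-config ./d2s-models/d2s.ini \
  |     --output-dir ./simplified-output
  | ```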
190 |
191 | ### ToDo
192 |
193 | * ToDo: Incorporate improvements from our arXiv
194 | paper. http://arxiv.org/pdf/1507.08452v1.pdf
195 |
196 | * OOD words at the border should be dropped.
197 | * Don't split at "TO".
198 | * Ensure a full stop at the end of each sentence (currently this is
199 | done as a post-processing step).
200 |
201 | * ToDo: Change to an online version of sentence simplification.
202 |
203 |
--------------------------------------------------------------------------------
/preprocessing/extract_wikipedia_corpora_boxer_test.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : extract_wikipedia_corpora_boxer_test.py =
4 | #description : Test data preparation =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,gmail.com}) =
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | import os
11 | import argparse
12 | import sys
13 | import random
14 | import base64
15 | import uuid
16 |
17 | import xml.etree.ElementTree as ET
18 | from xml.dom import minidom
19 |
20 | ### Global Variables
21 |
22 | # # Zhu
23 | # TEST_FILE_MAIN="/home/ankh/Data/Simplification/Test-Data/complex.tokenized"
24 | # TEST_FILE_BOXER="/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer.xml"
25 |
26 | # # NewSella
27 | # TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"
28 | # TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/boxer/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"
29 |
30 | # Wikilarge
31 | # TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.test.src"
32 | # TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.test.src"
33 |
34 | TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.valid.src"
35 | TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.valid.src"
36 |
37 |
38 | class Boxer_XML_Handler:
39 | def parse_boxer_xml(self, xdrs_elt):
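  |         """Parse one Boxer <xdrs> element into three flat structures:
  |
  |         arg_dict    -- discourse referent -> {"position": [(pos, wordid), ...],
  |                        "preds": [[symbol, [(pos, wordid), ...]], ...]},
  |                        collected from dr/prop/pred/named/card/timex elements
  |         rel_nodes   -- [symbol, arg1, arg2, [(pos, wordid), ...]] entries,
  |                        one per binary rel element (nn arguments are swapped)
  |         extra_nodes -- [symbol, [(pos, wordid), ...]] entries for the
  |                        not/or/imp/whq scope operators
  |         """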
40 | arg_dict = {}
41 | rel_nodes = []
42 | extra_nodes = []
43 |
44 | # Process all dr
45 | for dr in xdrs_elt.iter('dr'):
46 | arg = dr.attrib["name"]
47 | if arg not in arg_dict:
48 | arg_dict[arg] = {"position":[], "preds":[]}
49 |
50 | for index in dr.findall('index'):
51 | position = int(index.attrib["pos"])
52 | wordid = index.text
53 |
54 | if (position, wordid) not in arg_dict[arg]["position"]:
55 | arg_dict[arg]["position"].append((position, wordid))
56 |
57 | # Process all prop
58 | for prop in xdrs_elt.iter('prop'):
59 | arg = prop.attrib["argument"]
60 | if arg not in arg_dict:
61 | arg_dict[arg] = {"position":[], "preds":[]}
62 |
63 | for index in prop.findall('index'):
64 | position = int(index.attrib["pos"])
65 | wordid = index.text
66 |
67 | if (position, wordid) not in arg_dict[arg]["position"]:
68 | arg_dict[arg]["position"].append((position, wordid))
69 |
70 | # Process all pred
71 | for pred in xdrs_elt.iter('pred'):
72 | arg = pred.attrib["arg"]
73 | symbol = pred.attrib["symbol"]
74 |
75 | if arg not in arg_dict:
76 | arg_dict[arg] = {"position":[], "preds":[]}
77 |
78 | predicate = [symbol, []]
79 | for index in pred.findall('index'):
80 | position = int(index.attrib["pos"])
81 | wordid = index.text
82 |
83 | if (position, wordid) not in arg_dict[arg]["position"]:
84 | arg_dict[arg]["position"].append((position, wordid))
85 | if (position, wordid) not in predicate[1]:
86 | predicate[1].append((position, wordid))
87 | arg_dict[arg]["preds"].append(predicate)
88 |
89 | # Process all named
90 | for named in xdrs_elt.iter('named'):
91 | arg = named.attrib["arg"]
92 | symbol = named.attrib["symbol"]
93 |
94 | if arg not in arg_dict:
95 | arg_dict[arg] = {"position":[], "preds":[]}
96 |
97 | named_pred = [symbol, []]
98 | for index in named.findall('index'):
99 | position = int(index.attrib["pos"])
100 | wordid = index.text
101 |
102 | if (position, wordid) not in arg_dict[arg]["position"]:
103 | arg_dict[arg]["position"].append((position, wordid))
104 | if (position, wordid) not in named_pred[1]:
105 | named_pred[1].append((position, wordid))
106 | arg_dict[arg]["preds"].append(named_pred)
107 |
108 | # Process all card
109 | for card in xdrs_elt.iter('card'):
110 | arg = card.attrib["arg"]
111 | value = card.attrib["value"]
112 |
113 | if arg not in arg_dict:
114 | arg_dict[arg] = {"position":[], "preds":[]}
115 |
116 | card_pred = [value, []]
117 | for index in card.findall('index'):
118 | position = int(index.attrib["pos"])
119 | wordid = index.text
120 |
121 | if (position, wordid) not in arg_dict[arg]["position"]:
122 | arg_dict[arg]["position"].append((position, wordid))
123 | if (position, wordid) not in card_pred[1]:
124 | card_pred[1].append((position, wordid))
125 | arg_dict[arg]["preds"].append(card_pred)
126 |
127 | # Process all timex
128 | for timex in xdrs_elt.iter('timex'):
129 | arg = timex.attrib["arg"]
130 | datetime = ""
131 | for date in timex.iter('date'):
132 | datetime = date.text
133 | for time in timex.iter('time'):
134 | datetime = time.text
135 |
136 | if arg not in arg_dict:
137 | arg_dict[arg] = {"position":[], "preds":[]}
138 |
139 | timex_pred = [datetime, []]
140 | for index in timex.findall('index'):
141 | position = int(index.attrib["pos"])
142 | wordid = index.text
143 |
144 | if (position, wordid) not in arg_dict[arg]["position"]:
145 | arg_dict[arg]["position"].append((position, wordid))
146 | if (position, wordid) not in timex_pred[1]:
147 | timex_pred[1].append((position, wordid))
148 | arg_dict[arg]["preds"].append(timex_pred)
149 |
150 | # Process not/or/imp/whq
151 | for not_node in xdrs_elt.iter('not'):
152 | index_list = not_node.findall('index')
153 | if len(index_list) != 0:
154 | not_pred = ["not", []]
155 | for index in index_list:
156 | position = int(index.attrib["pos"])
157 | wordid = index.text
158 | if (position, wordid) not in not_pred[1]:
159 | not_pred[1].append((position, wordid))
160 | extra_nodes.append(not_pred)
161 | for or_node in xdrs_elt.iter('or'):
162 | index_list = or_node.findall('index')
163 | if len(index_list) != 0:
164 | or_pred = ["or", []]
165 | for index in index_list:
166 | position = int(index.attrib["pos"])
167 | wordid = index.text
168 | if (position, wordid) not in or_pred[1]:
169 | or_pred[1].append((position, wordid))
170 | extra_nodes.append(or_pred)
171 | for imp_node in xdrs_elt.iter('imp'):
172 | index_list = imp_node.findall('index')
173 | if len(index_list) != 0:
174 | imp_pred = ["imp", []]
175 | for index in index_list:
176 | position = int(index.attrib["pos"])
177 | wordid = index.text
178 | if (position, wordid) not in imp_pred[1]:
179 | imp_pred[1].append((position, wordid))
180 | extra_nodes.append(imp_pred)
181 | for whq_node in xdrs_elt.iter('whq'):
182 | index_list = whq_node.findall('index')
183 | if len(index_list) != 0:
184 | whq_pred = ["whq", []]
185 | for index in index_list:
186 | position = int(index.attrib["pos"])
187 | wordid = index.text
188 | if (position, wordid) not in whq_pred[1]:
189 | whq_pred[1].append((position, wordid))
190 | extra_nodes.append(whq_pred)
191 |
192 | # Process Rel node
193 | for rel_node in xdrs_elt.iter('rel'):
194 | arg1 = rel_node.attrib["arg1"]
195 | arg2 = rel_node.attrib["arg2"]
196 | symbol = rel_node.attrib["symbol"]
197 |
198 | if arg1 not in arg_dict:
199 | arg_dict[arg1] = {"position":[], "preds":[]}
200 | if arg2 not in arg_dict:
201 | arg_dict[arg2] = {"position":[], "preds":[]}
202 |
203 | index_list = rel_node.findall('index')
204 | if (len(index_list) != 0) or (len(index_list) == 0 and symbol=="nn"):
205 | if symbol=="nn":
206 | # Revert arguments
207 | temp = arg2
208 | arg2 = arg1
209 | arg1 = temp
210 | rel_nodes.append([symbol, arg1, arg2, []])
211 |
212 | elif symbol=="eq":
213 | if len(index_list) == 1:
214 | position = int(index_list[0].attrib["pos"])
215 | wordid = index_list[0].text
216 |
217 | for arg in arg_dict:
218 | if (position, wordid) in arg_dict[arg]["position"]:
219 | if ["event", []] in arg_dict[arg]["preds"]:
220 | rel_nodes.append([symbol, arg, arg1, [(position, wordid)]])
221 | rel_nodes.append([symbol, arg, arg2, [(position, wordid)]])
222 | else:
223 | rel_pred = [symbol, arg1, arg2, []]
224 | for index in index_list:
225 | position = int(index.attrib["pos"])
226 | wordid = index.text
227 | if (position, wordid) not in rel_pred[3]:
228 | rel_pred[3].append((position, wordid))
229 | rel_nodes.append(rel_pred)
230 |
231 |
232 | return arg_dict, rel_nodes, extra_nodes
233 |
234 | ### Extra Functions
235 |
236 | def prettify(elem):
237 | """Return a pretty-printed XML string for the Element.
238 | """
239 | rough_string = ET.tostring(elem)
240 | reparsed = minidom.parseString(rough_string)
241 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ")
242 | return prettyxml.encode("utf-8")
243 |
244 | ###################
245 |
246 | class Boxer_Element_Creator:
247 | def graph_boxer_element(self, xdrs_elt, sent_span):
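  |         """Build the <box> element for one sentence: <nodes> holds one
  |         <node> (span + predicates) per discourse referent, <rels> holds
  |         one <rel> per relation, and <edges> links parent and dependent
  |         nodes via their relation labels. Words of the sentence that Boxer
  |         left out of the discourse are added as OOD (out-of-discourse)
  |         nodes.
  |         """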
248 | # Parse the XDRS
249 | arg_dict, rel_nodes, extra_nodes = Boxer_XML_Handler().parse_boxer_xml(xdrs_elt)
250 | #print arg_dict, rel_nodes, extra_nodes
251 |
252 | # Adding extra nodes to arg_dict/node dict
253 | extra_node_count = 1
254 | for nodeinfo in extra_nodes:
255 | arg_dict["E"+str(extra_node_count)] = {"position":nodeinfo[1][:], "preds":[nodeinfo[:]]}
256 | extra_node_count += 1
257 |
258 | # Creating edge list and relation dict
259 | edge_list = []
260 | rel_dict = {}
261 |
262 | rel_node_count = 1
263 | for relinfo in rel_nodes:
264 | relation = relinfo[0]
265 | parentarg = relinfo[1]
266 | deparg = relinfo[2]
267 | indexlist = relinfo[3]
268 |
269 | rel_dict["R"+str(rel_node_count)] = [relation, indexlist[:]]
270 | edge_list.append((parentarg, deparg, "R"+str(rel_node_count)))
271 | rel_node_count += 1
272 |
273 | # Get the boxer span
274 | boxer_span = []
275 | for arg in arg_dict:
276 | for position in arg_dict[arg]["position"]:
277 | if position not in boxer_span:
278 | boxer_span.append(position)
279 | for relarg in rel_dict:
280 | for position in rel_dict[relarg][1]:
281 | if position not in boxer_span:
282 | boxer_span.append(position)
283 | boxer_span.sort()
284 | boxer_span_ids = [elt[1] for elt in boxer_span]
285 | # Adding nodes for left out word positions in discourse
286 | out_of_dis_count = 1
287 | for elt in sent_span:
288 | if elt not in boxer_span_ids:
289 | arg_dict["OOD"+str(out_of_dis_count)] = {"position":[(-1, elt)], "preds":[]}
290 | out_of_dis_count += 1
291 |
292 | # Prepare Boxer Tree
293 | boxer = ET.Element('box')
294 | nodes = ET.SubElement(boxer, "nodes")
295 | for arg in arg_dict:
296 | # print arg
297 | # print arg_dict[arg]
298 | node = ET.SubElement(nodes, "node")
299 | node.attrib = {"sym":arg}
300 |
301 | span = ET.SubElement(node, "span")
302 | position_list = arg_dict[arg]["position"]
303 | position_list.sort()
304 | for position in position_list:
305 | location = ET.SubElement(span, "loc")
306 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]}
307 |
308 | preds = ET.SubElement(node, "preds")
309 | predicate_list = arg_dict[arg]["preds"]
310 | predicate_list.sort()
311 | for predinfo in predicate_list:
312 | # print predinfo
313 | pred = ET.SubElement(preds, "pred")
314 | predsymbol = predinfo[0]
315 | pred.attrib = {"sym":predsymbol}
316 |
317 | position_list = predinfo[1]
318 | position_list.sort()
319 | for position in position_list:
320 | location = ET.SubElement(pred, "loc")
321 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]}
322 |
323 | rels = ET.SubElement(boxer, "rels")
324 | for relarg in rel_dict:
325 | rel = ET.SubElement(rels, "rel")
326 | rel.attrib = {"sym":relarg}
327 | pred = ET.SubElement(rel, "pred")
328 | pred.attrib = {"sym":rel_dict[relarg][0]}
329 | span = ET.SubElement(rel, "span")
330 | position_list = rel_dict[relarg][1]
331 | position_list.sort()
332 | for position in position_list:
333 | location = ET.SubElement(span, "loc")
334 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]}
335 |
336 | edges = ET.SubElement(boxer, "edges")
337 | for edgeinfo in edge_list:
338 | edge = ET.SubElement(edges, "edge")
339 | edge.attrib = {"par":edgeinfo[0], "dep":edgeinfo[1], "lab":edgeinfo[2]}
340 | return boxer
341 |
342 | def construct_boxer_element(self, xdrs_data):
343 | sentid = int(xdrs_data.get('{http://www.w3.org/XML/1998/namespace}id')[1:])
344 |
345 | # Creating Sentence Element
346 | sentence_xml = ""
347 | sentence_span = []
348 | sentence = ET.Element('s')
349 | words = (xdrs_data.find("words")).findall("word")
350 | postags = (xdrs_data.find("postags")).findall("postag")
351 | for word_elt, postag_elt in zip(words, postags):
352 | word = word_elt.text
353 | wordid = word_elt.get('{http://www.w3.org/XML/1998/namespace}id')# ("xml:id")
354 | pos = postag_elt.text
355 | posid = postag_elt.get("index")
356 |
57 |             if wordid != posid:
58 |                 print "Warning: word id and postag id do not match."
59 |                 sys.exit(1)
360 | else:
361 | sentence_xml += word +" "
362 | word_newelt = ET.SubElement(sentence, 'w')
363 | word_newelt.attrib = {"id":wordid, "pos":pos}
364 | word_newelt.text = word
365 |
366 | sentence_span.append(wordid)
367 |
368 | # Creating the head element
369 | headelt = ET.Element("main")
370 | headelt.append(sentence)
371 |
372 | # Creating boxer element
373 | boxer = self.graph_boxer_element(xdrs_data, sentence_span)
374 | headelt.append(boxer)
375 | return sentid, headelt
376 |
377 | def create_sentence_elt(self, main_sent):
378 | sentence = ET.Element('s')
379 | sentence.text = main_sent.decode('utf-8')
380 |
381 | # Creating the head element
382 | headelt = ET.Element("main")
383 | headelt.append(sentence)
384 |
385 | # Creating boxer element
386 | boxer = ET.Element("box")
387 | headelt.append(boxer)
388 | return headelt
389 |
390 | def get_sentence_elt(self, boxer_xml_file):
391 | sentid_sentence_dict = {}
392 | xdrs_output = ET.parse(boxer_xml_file)
393 | xdrs_list = xdrs_output.findall('xdrs')
394 | for xdrs_item in xdrs_list:
395 | sentid, head_elt = self.construct_boxer_element(xdrs_item)
396 | sentid_sentence_dict[sentid] = head_elt
397 | return sentid_sentence_dict
398 |
399 | if __name__ == "__main__":
400 | datadir = os.path.dirname(TEST_FILE_MAIN)
401 |
402 |     # Start parsing the test data file (tokenized)
403 | print "\nStart preparing Test Data file - Tokenized ..."
404 | print "Start parsing "+TEST_FILE_MAIN+" ..."
405 | main_wiki = []
406 | fdata = open(TEST_FILE_MAIN, "r")
407 | main_wiki = fdata.read().strip().split("\n")
408 | fdata.close()
409 | print "Total number of sentences from Test (tokenized): " + str(len(main_wiki))
410 |
411 | # Call boxer element creator
412 | print "\nStart boxer element creator ..."
413 | boxer_elt_creator = Boxer_Element_Creator()
414 | sentid_sentence_dict = boxer_elt_creator.get_sentence_elt(TEST_FILE_BOXER)
415 |
416 | # Start Writing final files
417 | print "Generating XML file : "+TEST_FILE_MAIN+".boxer-graph.xml ..."
418 | foutput = open(TEST_FILE_MAIN+".boxer-graph.xml", "w")
419 | foutput.write("\n")
420 | foutput.write("\n")
421 |
422 | main_index = 1
423 | for main_sent in main_wiki:
424 | # Start creating sentence subelement
425 | sentence = ET.Element('sentence')
426 | sentence.attrib={"id":str(main_index)}
427 |
428 | main_elt = ""
429 | if main_index in sentid_sentence_dict:
430 | main_elt = sentid_sentence_dict[main_index]
431 | else:
432 | main_elt = boxer_elt_creator.create_sentence_elt(main_sent)
433 |
434 | sentence.append(main_elt)
435 | foutput.write(prettify(sentence))
436 | main_index += 1
437 |
438 |
439 | foutput.write("\n")
440 | foutput.close()
441 |
442 | print len(sentid_sentence_dict)
443 |
444 | print sentid_sentence_dict.keys()
445 |
--------------------------------------------------------------------------------
/source/em_inside_outside_algorithm.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : em_inside_outside_algorithm.py =
4 | #description : EM Algorithm =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 |
11 | import function_select_methods
12 |
13 | class EM_InsideOutside_Optimiser:
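  |     """EM over training graphs using the inside-outside algorithm: the
  |     E-step computes inside (beta) and outside (alpha) probabilities over
  |     each training graph and accumulates expected counts for the four
  |     operation types (split, drop-rel, drop-mod, drop-ood); the M-step
  |     renormalises the probability tables from those counts.
  |     """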
14 | def __init__(self, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT):
15 | self.smt_sentence_pairs = smt_sentence_pairs
16 | self.probability_tables = probability_tables
17 | self.count_tables = count_tables
18 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT
19 |
20 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT)
21 |
22 | def initialize_probabilitytable_smt_input(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
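  |         """First pass over a training graph: register every feature observed
  |         at an operation node with uniform initial probabilities
  |         {"true": 0.5, "false": 0.5} and zeroed counts, and collect the
  |         sentence pairs for SMT training from the "fin" major nodes.
  |         """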
23 | #print sentid
24 |
25 | # Process all oper nodes
26 | for oper_node in training_graph.oper_nodes:
27 | oper_type = training_graph.get_opernode_type(oper_node)
28 | if oper_type not in self.probability_tables:
29 | self.probability_tables[oper_type] = {}
30 | if oper_type not in self.count_tables:
31 | self.count_tables[oper_type] = {}
32 |
33 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
34 | children_major_nodes = training_graph.find_children_of_opernode(oper_node)
35 |
36 | if oper_type == "split":
37 | # Parent main sentence
38 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
39 | parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node)
40 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos)
41 |
42 | # Children sentences
43 | children_sentences = []
44 | for child_major_node in children_major_nodes:
45 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node)
46 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node)
47 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos)
48 | children_sentences.append(child_sentence)
49 |
50 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node)
51 |
52 | #print split_candidate
53 |
54 | if split_candidate != None:
55 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
56 | if split_feature not in self.probability_tables["split"]:
57 | self.probability_tables["split"][split_feature] = {"true":0.5, "false":0.5}
58 | if split_feature not in self.count_tables["split"]:
59 | self.count_tables["split"][split_feature] = {"true":0, "false":0}
60 | else:
61 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node)
62 | #print not_applied_cands
63 | for split_candidate_left in not_applied_cands:
64 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph)
65 | #print split_feature_left
66 | if split_feature_left not in self.probability_tables["split"]:
67 | self.probability_tables["split"][split_feature_left] = {"true":0.5, "false":0.5}
68 | if split_feature_left not in self.count_tables["split"]:
69 | self.count_tables["split"][split_feature_left] = {"true":0, "false":0}
70 | #print self.probability_tables["split"]
71 |
72 | if oper_type == "drop-rel":
73 | rel_node = training_graph.get_opernode_oper_candidate(oper_node)
74 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
75 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_node, parent_nodeset, main_sent_dict, boxer_graph)
76 | if drop_rel_feature not in self.probability_tables["drop-rel"]:
77 | self.probability_tables["drop-rel"][drop_rel_feature] = {"true":0.5, "false":0.5}
78 | if drop_rel_feature not in self.count_tables["drop-rel"]:
79 | self.count_tables["drop-rel"][drop_rel_feature] = {"true":0, "false":0}
80 |
81 | if oper_type == "drop-mod":
82 | mod_cand = training_graph.get_opernode_oper_candidate(oper_node)
83 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_cand, main_sent_dict, boxer_graph)
84 | if drop_mod_feature not in self.probability_tables["drop-mod"]:
85 | self.probability_tables["drop-mod"][drop_mod_feature] = {"true":0.5, "false":0.5}
86 | if drop_mod_feature not in self.count_tables["drop-mod"]:
87 | self.count_tables["drop-mod"][drop_mod_feature] = {"true":0, "false":0}
88 |
89 | if oper_type == "drop-ood":
90 | ood_node = training_graph.get_opernode_oper_candidate(oper_node)
91 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
92 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_node, parent_nodeset, main_sent_dict, boxer_graph)
93 | if drop_ood_feature not in self.probability_tables["drop-ood"]:
94 | self.probability_tables["drop-ood"][drop_ood_feature] = {"true":0.5, "false":0.5}
95 | if drop_ood_feature not in self.count_tables["drop-ood"]:
96 | self.count_tables["drop-ood"][drop_ood_feature] = {"true":0, "false":0}
97 |
98 | #print self.probability_tables["split"]['as-as-patient_eq-eq_1']
99 | # if int(sentid) <= 3:
100 | # print self.probability_tables["split"]
101 |
102 | # Extract all sentence pairs for SMT from all "fin" major nodes
103 | self.smt_sentence_pairs[sentid] = training_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph)
104 |
105 | def reset_count_table(self):
106 | for oper_type in self.count_tables: # split, drop-rel, drop-mod, drop-ood
107 | for oper_feature_key in self.count_tables[oper_type]: # feature patterns
108 | for val in self.count_tables[oper_type][oper_feature_key]: # true, false
109 | self.count_tables[oper_type][oper_feature_key][val] = 0
110 |
111 | def iterate_over_probabilitytable(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
112 | #print sentid
113 | # Calculating beta-probability, inside probability
114 | #print "Calculating beta-probabilities (Inside probability) ..."
115 | bottom_nodes = training_graph.find_all_fin_majornode()
116 | beta_prob = self.calculate_inside_probability({}, bottom_nodes, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
117 | #print beta_prob
118 |
119 | # Calculating alpha-probability, outside probability
120 | #print "Calculating alpha-probabilities (Outside probability) ..."
121 | root_node = "MN-1"
122 | alpha_prob = self.calculate_outside_probability({}, [root_node], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob)
123 | #print alpha_prob
124 |
125 | # Updating counts for each operation happened in this sentence
126 | #print "Updating counts of each operation happened in this training sentence ..."
127 | self.update_count_for_operations(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob)
128 |
129 | def calculate_outside_probability(self, alpha_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob):
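  |         """Compute outside (alpha) probabilities top-down, starting from the
  |         root major node "MN-1" with alpha = 1. For a major node m,
  |             alpha(m) = sum over its parent operation nodes o of
  |                        alpha(o) * prod of beta(m') over o's other children m';
  |         for an operation node o with parent major node m,
  |             alpha(o) = alpha(m) * p(o),
  |         with p(o) given by fetch_probability. A node is queued only once
  |         all of its parents have been queued or processed.
  |         """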
130 | if len(tgnodes_to_process) == 0:
131 | return alpha_prob
132 |
133 | tgnode = tgnodes_to_process[0]
134 | if tgnode.startswith("MN"):
135 | # Major Nodes
136 | if tgnode == "MN-1":
137 | # Root major node
138 | alpha_prob[tgnode] = 1
139 | else:
140 | parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode)
141 | alpha_prob_tgnode = 0
142 | for parent_oper_node in parents_oper_nodes:
143 | alpha_prob_parent_oper_node = alpha_prob[parent_oper_node]
144 | children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node)
145 | beta_prod_product = 1
146 | for child_major_node in children_major_nodes:
147 | if child_major_node != tgnode:
148 | beta_prod_product = beta_prod_product * beta_prob[child_major_node]
149 | alpha_prob_tgnode += alpha_prob_parent_oper_node * beta_prod_product
150 | alpha_prob[tgnode] = alpha_prob_tgnode
151 |
152 | # Adding children to tgnodes_to_process
153 | children_oper_nodes = training_graph.find_children_of_majornode(tgnode)
154 | for child_oper_node in children_oper_nodes:
155 |                 # Check it's not already inserted
156 | if (child_oper_node not in alpha_prob) and (child_oper_node not in tgnodes_to_process):
157 | # Check its parent is already inserted
158 | parent_major_node = training_graph.find_parent_of_opernode(child_oper_node)
159 | if (parent_major_node in alpha_prob) or (parent_major_node in tgnodes_to_process):
160 | tgnodes_to_process.append(child_oper_node)
161 | else:
162 | # Oper nodes
163 | parent_major_node = training_graph.find_parent_of_opernode(tgnode)
164 | alpha_prob_tgnode = alpha_prob[parent_major_node] * self.fetch_probability(tgnode, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
165 | alpha_prob[tgnode] = alpha_prob_tgnode
166 |
167 | # Adding children to tgnodes_to_process
168 | children_major_nodes = training_graph.find_children_of_opernode(tgnode)
169 | for child_major_node in children_major_nodes:
170 |                 # Check it's not already inserted
171 | if (child_major_node not in alpha_prob) and (child_major_node not in tgnodes_to_process):
172 | # Check all its parents are already inserted
173 | parents_oper_nodes = training_graph.find_parents_of_majornode(child_major_node)
174 | flag = True
175 | for parent_oper_node in parents_oper_nodes:
176 | if (parent_oper_node not in alpha_prob) and (parent_oper_node not in tgnodes_to_process):
177 | flag = False
178 | break
179 | if flag == True:
180 | tgnodes_to_process.append(child_major_node)
181 |
182 | alpha_prob = self.calculate_outside_probability(alpha_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob)
183 | return alpha_prob
184 |
185 | def calculate_inside_probability(self, beta_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
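  |         """Compute inside (beta) probabilities bottom-up, starting from the
  |         "fin" (leaf) major nodes with beta = 1. For a major node m,
  |             beta(m) = sum over its child operation nodes o of p(o) * beta(o);
  |         for an operation node o,
  |             beta(o) = prod of beta(m') over its children major nodes m'.
  |         A node is queued only once all of its children have been queued
  |         or processed.
  |         """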
186 | if len(tgnodes_to_process) == 0:
187 | return beta_prob
188 |
189 | tgnode = tgnodes_to_process[0]
190 | if tgnode.startswith("MN"):
191 | # Major nodes
192 | major_node_type = training_graph.get_majornode_type(tgnode)
193 | if major_node_type == "fin":
194 | # Leaf major nodes
195 | beta_prob[tgnode] = 1
196 | else:
197 | children_oper_nodes = training_graph.find_children_of_majornode(tgnode)
198 | beta_prob_tgnode = 0
199 | for child_oper_node in children_oper_nodes:
200 | beta_prob_tgnode += self.fetch_probability(child_oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) * beta_prob[child_oper_node]
201 | beta_prob[tgnode] = beta_prob_tgnode
202 |
203 | # Adding parents to tgnodes_to_process
204 | parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode)
205 | for parent_oper_node in parents_oper_nodes:
206 |             # Check it's not already inserted
207 |             if (parent_oper_node not in beta_prob) and (parent_oper_node not in tgnodes_to_process):
208 |                 # Check all its children are already inserted
209 | children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node)
210 | flag = True
211 | for child_major_node in children_major_nodes:
212 | if (child_major_node not in beta_prob) and (child_major_node not in tgnodes_to_process):
213 | flag = False
214 | break
215 | if flag == True:
216 | tgnodes_to_process.append(parent_oper_node)
217 | else:
218 | # Oper nodes
219 | children_major_nodes = training_graph.find_children_of_opernode(tgnode)
220 | beta_prob_tgnode = 1
221 | for child_major_node in children_major_nodes:
222 | beta_prob_tgnode = beta_prob_tgnode * beta_prob[child_major_node]
223 | beta_prob[tgnode] = beta_prob_tgnode
224 |
225 | # Adding parent to tgnodes_to_process
226 | parent_major_node = training_graph.find_parent_of_opernode(tgnode)
227 |         # Check it's not already inserted
228 |         if (parent_major_node not in beta_prob) and (parent_major_node not in tgnodes_to_process):
229 |             # Check all its children are already inserted
230 | children_oper_nodes = training_graph.find_children_of_majornode(parent_major_node)
231 | flag = True
232 | for child_oper_node in children_oper_nodes:
233 | if (child_oper_node not in beta_prob) and (child_oper_node not in tgnodes_to_process):
234 | flag = False
235 | break
236 | if flag == True:
237 | tgnodes_to_process.append(parent_major_node)
238 |
239 | beta_prob = self.calculate_inside_probability(beta_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
240 | return beta_prob
241 |
242 | def fetch_probability(self, oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
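  |         """Probability of a single operation node: for "split", the "true"
  |         probability of the applied split candidate, or the product of the
  |         "false" probabilities of all rejected candidates when no split
  |         was applied; for drop-rel/drop-mod/drop-ood, the "true" or
  |         "false" entry of the candidate's feature depending on whether it
  |         was dropped.
  |         """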
243 | oper_node_type = training_graph.get_opernode_type(oper_node)
244 | if oper_node_type == "split":
245 | # Parent main sentence
246 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
247 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
248 | parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node)
249 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos)
250 |
251 | # Children sentences
252 | children_major_nodes = training_graph.find_children_of_opernode(oper_node)
253 | children_sentences = []
254 | for child_major_node in children_major_nodes:
255 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node)
256 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node)
257 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos)
258 | children_sentences.append(child_sentence)
259 |
260 | total_probability = 1
261 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node)
262 | if split_candidate != None:
263 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
264 | total_probability = self.probability_tables["split"][split_feature]["true"]
265 | return total_probability
266 | else:
267 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node)
268 | for split_candidate_left in not_applied_cands:
269 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph)
270 | total_probability = total_probability * self.probability_tables["split"][split_feature_left]["false"]
271 | return total_probability
272 |
273 | elif oper_node_type == "drop-rel":
274 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
275 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
276 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node)
277 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph)
278 | isDropped = training_graph.get_opernode_drop_result(oper_node)
279 | prob_value = 0
280 | if isDropped == "True":
281 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["true"]
282 | else:
283 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["false"]
284 | return prob_value
285 |
286 | elif oper_node_type == "drop-mod":
287 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node)
288 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph)
289 | isDropped = training_graph.get_opernode_drop_result(oper_node)
290 | prob_value = 0
291 | if isDropped == "True":
292 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["true"]
293 | else:
294 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["false"]
295 | return prob_value
296 |
297 | elif oper_node_type == "drop-ood":
298 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
299 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
300 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node)
301 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph)
302 | isDropped = training_graph.get_opernode_drop_result(oper_node)
303 | prob_value = 0
304 | if isDropped == "True":
305 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["true"]
306 | else:
307 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["false"]
308 | return prob_value
309 |
310 | def update_count_for_operations(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob):
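  |         """E-step count update. The expected count of an operation node o is
  |         the standard inside-outside posterior
  |             count(o) = alpha(o) * beta(o) / beta("MN-1"),
  |         added to the "true" or "false" cell of the feature(s) extracted
  |         at o.
  |         """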
311 | # Process all oper nodes
312 | for oper_node in training_graph.oper_nodes:
313 | # Calculating count
314 | root_inside_prob = beta_prob["MN-1"]
315 | oper_node_inside_prob = beta_prob[oper_node]
316 | oper_node_outside_prob = alpha_prob[oper_node]
317 | count_oper_node = (oper_node_inside_prob * oper_node_outside_prob) / root_inside_prob
318 |
319 | oper_node_type = training_graph.get_opernode_type(oper_node)
320 | if oper_node_type == "split":
321 | # Parent main sentence
322 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
323 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
324 | parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node)
325 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos)
326 |
327 | # Children sentences
328 | children_major_nodes = training_graph.find_children_of_opernode(oper_node)
329 | children_sentences = []
330 | for child_major_node in children_major_nodes:
331 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node)
332 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node)
333 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos)
334 | children_sentences.append(child_sentence)
335 |
336 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node)
337 | if split_candidate != None:
338 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
339 | self.count_tables["split"][split_feature]["true"] += count_oper_node
340 | else:
341 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node)
342 | for split_candidate_left in not_applied_cands:
343 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph)
344 | self.count_tables["split"][split_feature_left]["false"] += count_oper_node
345 |
346 | elif oper_node_type == "drop-rel":
347 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
348 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
349 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node)
350 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph)
351 | isDropped = training_graph.get_opernode_drop_result(oper_node)
352 | if isDropped == "True":
353 | self.count_tables["drop-rel"][drop_rel_feature]["true"] += count_oper_node
354 | else:
355 | self.count_tables["drop-rel"][drop_rel_feature]["false"] += count_oper_node
356 |
357 | elif oper_node_type == "drop-mod":
358 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node)
359 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph)
360 | isDropped = training_graph.get_opernode_drop_result(oper_node)
361 | if isDropped == "True":
362 | self.count_tables["drop-mod"][drop_mod_feature]["true"] += count_oper_node
363 | else:
364 | self.count_tables["drop-mod"][drop_mod_feature]["false"] += count_oper_node
365 |
366 | elif oper_node_type == "drop-ood":
367 | parent_major_node = training_graph.find_parent_of_opernode(oper_node)
368 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
369 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node)
370 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph)
371 | isDropped = training_graph.get_opernode_drop_result(oper_node)
372 | if isDropped == "True":
373 | self.count_tables["drop-ood"][drop_ood_feature]["true"] += count_oper_node
374 | else:
375 | self.count_tables["drop-ood"][drop_ood_feature]["false"] += count_oper_node
376 |
377 | def update_probability_table(self):
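  |         """M-step: renormalise each feature's {"true", "false"} probabilities
  |         from the accumulated expected counts; a feature whose counts sum
  |         to zero falls back to the uniform 0.5/0.5 distribution.
  |         """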
378 | for oper_type in self.probability_tables: # split, drop-ood, drop-rel, drop-mod
379 | for oper_feature_key in self.probability_tables[oper_type]: # feature patterns
380 | totalSum = 0
381 | for val in self.probability_tables[oper_type][oper_feature_key]:
382 | totalSum += self.count_tables[oper_type][oper_feature_key][val]
383 |
384 | # if totalSum == 0:
385 | # print oper_type
386 | # print oper_feature_key
387 | # print self.probability_tables[oper_type][oper_feature_key]
388 | # print self.count_tables[oper_type][oper_feature_key]
389 |
390 | if totalSum == 0:
391 | for val in self.probability_tables[oper_type][oper_feature_key]:
392 | self.probability_tables[oper_type][oper_feature_key][val] = 0.5 # Uniform 1.0/len(self.probability_tables[oper_type][oper_feature_key])
393 | else:
394 | for val in self.probability_tables[oper_type][oper_feature_key]:
395 | self.probability_tables[oper_type][oper_feature_key][val] = self.count_tables[oper_type][oper_feature_key][val] / totalSum
396 |
--------------------------------------------------------------------------------
/source/explore_decoder_graph_greedy.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : explore_decoder_graph_greedy.py =
4 | #description : Greedy decoder =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | from training_graph_module import Training_Graph
11 | import function_select_methods
12 |
13 | class Explore_Decoder_Graph_Greedy:
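  |     """Greedy decoder graph builder: each candidate operation is scored
  |     with the learned probability tables (backing off to 0.5 for unseen
  |     features) and only the most probable choice is expanded, so the
  |     decoder graph encodes a single best sequence of simplification
  |     decisions.
  |     """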
14 | def __init__(self, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, probability_tables, METHOD_FEATURE_EXTRACT):
15 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL
16 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE
17 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL
18 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD
19 |
20 | self.probability_tables = probability_tables
21 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT
22 |
23 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT)
24 |
25 | def explore_decoder_graph(self, sentid, main_sentence, main_sent_dict, boxer_graph):
26 | # Start a decoder graph
27 | decoder_graph = Training_Graph()
28 | nodes_2_process = []
29 |
30 | # Check if Discourse information is available
31 | if boxer_graph.isEmpty():
32 | # Adding finishing major node
33 | nodeset = boxer_graph.get_nodeset()
34 | filtered_mod_pos = []
35 | simple_sentences = []
36 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos)
37 |
38 | # Creating major node
39 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
40 | nodes_2_process.append(majornode_name) # isNew = True
41 | else:
42 | # DRS data is available for the complex sentence
43 | # Check to add the starting node
44 | nodeset = boxer_graph.get_nodeset()
45 | majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "split", nodeset, [], [])
46 | nodes_2_process.append(majornode_name) # isNew = True
47 |
48 | # Start expanding the decoder graph
49 | self.expand_decoder_graph(nodes_2_process[:], main_sent_dict, boxer_graph, decoder_graph)
50 | return decoder_graph
51 |
52 | def expand_decoder_graph(self, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
53 | if len(nodes_2_process) == 0:
54 | return
55 |
56 | node_name = nodes_2_process[0]
57 | operreq = decoder_graph.get_majornode_type(node_name)
58 | nodeset = decoder_graph.get_majornode_nodeset(node_name)[:]
59 | oper_candidates = decoder_graph.get_majornode_oper_candidates(node_name)[:]
60 | processed_oper_candidates = decoder_graph.get_majornode_processed_oper_candidates(node_name)[:]
61 | filtered_postions = decoder_graph.get_majornode_filtered_postions(node_name)[:]
62 |
63 | #print node_name, decoder_graph.major_nodes[node_name]
64 |
65 | #print node_name, operreq, nodeset, oper_candidates, processed_oper_candidates, filtered_postions
66 |
67 | if operreq == "split":
68 | split_candidate_tuples = oper_candidates
69 | nodes_2_process = self.process_split_node_decoder_graph(node_name, nodeset, split_candidate_tuples, nodes_2_process,
70 | main_sent_dict, boxer_graph, decoder_graph)
71 |
72 | if operreq == "drop-rel":
73 | relnode_candidates = oper_candidates
74 | processed_relnode_candidates = processed_oper_candidates
75 | filtered_mod_pos = filtered_postions
76 | nodes_2_process = self.process_droprel_node_decoder_graph(node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos,
77 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph)
78 |
79 | if operreq == "drop-mod":
80 | mod_candidates = oper_candidates
81 | processed_mod_pos = processed_oper_candidates
82 | filtered_mod_pos = filtered_postions
83 | nodes_2_process = self.process_dropmod_node_decoder_graph(node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos,
84 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph)
85 |
86 | if operreq == "drop-ood":
87 | oodnode_candidates = oper_candidates
88 | processed_oodnode_candidates = processed_oper_candidates
89 | filtered_mod_pos = filtered_postions
90 | nodes_2_process = self.process_dropood_node_decoder_graph(node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos,
91 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph)
92 |
93 | self.expand_decoder_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, decoder_graph)
94 |
95 | def process_split_node_decoder_graph(self, node_name, nodeset, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
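  |         """Score the no-split option (product of the "false" probabilities
  |         of all split candidates) against each split candidate (its "true"
  |         probability), then create the operation node and child major
  |         nodes for the single highest-scoring option only.
  |         """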
96 | # Calculating probabilities
97 | probability_results = []
98 |
99 | # Find the Parent main sentence
100 | parent_nodeset = nodeset[:]
101 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, [])
102 |
103 | # Explore no-split options
104 | probability = 1
105 | for split_candidate in split_candidate_tuples:
106 | # Get the probability
107 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, [parent_sentence], boxer_graph)
108 | if split_feature in self.probability_tables["split"]:
109 | probability = probability * self.probability_tables["split"][split_feature]["false"]
110 | else:
111 | probability = probability * 0.5
112 | #print split_candidate, split_feature, "false", probability
113 | probability_results.append((probability, None, []))
114 |
115 | # Explore all split options
116 | # Calculate all parent and following subtrees
117 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict()
118 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict)
119 | for split_candidate in split_candidate_tuples:
120 | # Find the children sentences
121 | children_sentences = []
122 |
123 | # Split on the split_candidate
124 | node_subgraph_nodeset_dict, node_span_dict = boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict)
125 | #print node_subgraph_nodeset_dict, node_span_dict
126 |
127 | # Sorting them depending on span
128 | split_results = []
129 | for tnodename in split_candidate:
130 | tspan = node_span_dict[tnodename]
131 | tnodeset = node_subgraph_nodeset_dict[tnodename][:]
132 | split_results.append((tspan, tnodeset, tnodename))
133 | split_results.sort()
134 |
135 | # Prospective children major nodes
136 | for item in split_results:
137 | child_nodeset = item[1]
138 | child_nodeset.sort()
139 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, [])
140 | children_sentences.append(child_sentence)
141 |
142 | #print children_sentences
143 |
144 | # Get the probability
145 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
146 | probability = 0
147 | if split_feature in self.probability_tables["split"]:
148 | probability = self.probability_tables["split"][split_feature]["true"]
149 | else:
150 | probability = 0.5
151 |
152 | #print split_candidate, split_feature, "true", probability
153 |
154 | probability_results.append((probability, split_candidate, split_results))
155 |
156 | # Sort probabilities
157 | probability_results.sort(reverse=True)
158 |
159 | ##
160 | #data = [(item[0], item[1]) for item in probability_results]
161 | #print data
162 | ##
163 |
164 | split_tuple = probability_results[0][1]
165 | if split_tuple != None:
166 | # Adding the operation node
167 | not_applied_cands = [item for item in split_candidate_tuples if item is not split_tuple]
168 | opernode_data = ("split", split_tuple, not_applied_cands)
169 | opernode_name = decoder_graph.create_opernode(opernode_data)
170 |             decoder_graph.create_edge((node_name, opernode_name, split_tuple)) # label with the chosen split, not the stale loop variable
171 |
172 | split_results = probability_results[0][2]
173 | for item in split_results:
174 | child_nodeset = item[1][:]
175 | child_nodeset.sort()
176 | parent_child_nodeset = item[2]
177 |
178 | # Check for adding OOD or subsequent nodes
179 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
180 | if isNew:
181 | nodes_2_process.append(child_majornode_name)
182 | decoder_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset))
183 | else:
184 | # Adding the operation node
185 | not_applied_cands = [item for item in split_candidate_tuples]
186 | opernode_data = ("split", None, not_applied_cands)
187 | opernode_name = decoder_graph.create_opernode(opernode_data)
188 | decoder_graph.create_edge((node_name, opernode_name, None))
189 |
190 | # Check for adding drop-rel or drop-mod or fin nodes
191 | child_nodeset = nodeset[:]
192 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
193 | if isNew:
194 | nodes_2_process.append(child_majornode_name)
195 | decoder_graph.create_edge((opernode_name, child_majornode_name, None))
196 |
197 | return nodes_2_process
198 |
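# A sketch of the scoring idea used above (illustrative, not from the
# original code): the no-split score is the product of P(false | feature)
# over all split candidates, with unseen features backing off to 0.5:
#
#   probability = 1.0
#   for feature in split_features:
#       probability *= split_table.get(feature, {"false": 0.5})["false"]
#
# e.g. two candidates with P(false) = 0.3 and an unseen one score 0.3 * 0.5 = 0.15.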
199 | def process_droprel_node_decoder_graph(self, node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos,
200 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
201 | relnode_to_process = relnode_candidates[0]
202 | processed_relnode_candidates.append(relnode_to_process)
203 |
204 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(relnode_to_process, nodeset, main_sent_dict, boxer_graph)
205 |
206 | if drop_rel_feature in self.probability_tables["drop-rel"]:
207 | drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["true"]
208 | not_drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["false"]
209 | if drop_prob > not_drop_prob:
210 | # Creating opernode for dropping
211 | opernode_data = ("drop-rel", relnode_to_process, "True")
212 | opernode_name = decoder_graph.create_opernode(opernode_data)
213 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process))
214 | # Check for adding REL or subsequent nodes, (nodeset is changed)
215 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos)
216 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos)
217 | if isNew:
218 | nodes_2_process.append(child_majornode_name)
219 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True"))
220 | return nodes_2_process
221 |
222 | # Creating opernode for not dropping
223 | opernode_data = ("drop-rel", relnode_to_process, "False")
224 | opernode_name = decoder_graph.create_opernode(opernode_data)
225 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process))
226 | # Check for adding REL or subsequent nodes, (nodeset is unchanged)
227 | child_nodeset = nodeset
228 | child_filtered_mod_pos = filtered_mod_pos
229 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos)
230 | if isNew:
231 | nodes_2_process.append(child_majornode_name)
232 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False"))
233 | return nodes_2_process
234 |
235 | def process_dropmod_node_decoder_graph(self, node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos,
236 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
237 | modcand_to_process = mod_candidates[0]
238 | modcand_position_to_process = modcand_to_process[0]
239 | modcand_word = main_sent_dict[modcand_position_to_process][0]
240 | modcand_node = modcand_to_process[1]
241 | processed_mod_pos.append(modcand_position_to_process)
242 |
243 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(modcand_to_process, main_sent_dict, boxer_graph)
244 | if drop_mod_feature in self.probability_tables["drop-mod"]:
245 | drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["true"]
246 | not_drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["false"]
247 | if drop_prob > not_drop_prob:
248 | # Drop this mod, adding the operation node
249 | opernode_data = ("drop-mod", modcand_to_process, "True")
250 | opernode_name = decoder_graph.create_opernode(opernode_data)
251 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process))
252 | # Check for adding further drop mods, (nodeset is unchanged)
253 | child_nodeset = nodeset
254 | filtered_mod_pos.append(modcand_position_to_process)
255 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos)
256 | if isNew:
257 | nodes_2_process.append(child_majornode_name)
258 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True"))
259 | return nodes_2_process
260 |
261 | # Don't drop this pos, adding the operation node
262 | opernode_data = ("drop-mod", modcand_to_process, "False")
263 | opernode_name = decoder_graph.create_opernode(opernode_data)
264 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process))
265 | # Check for adding further drop mods, (nodeset is unchanged)
266 | child_nodeset = nodeset
267 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos)
268 | if isNew:
269 | nodes_2_process.append(child_majornode_name)
270 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False"))
271 | return nodes_2_process
272 |
273 | def process_dropood_node_decoder_graph(self, node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos,
274 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
275 | oodnode_to_process = oodnode_candidates[0]
276 | processed_oodnode_candidates.append(oodnode_to_process)
277 |
278 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(oodnode_to_process, nodeset, main_sent_dict, boxer_graph)
279 |
280 | if drop_ood_feature in self.probability_tables["drop-ood"]:
281 | drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["true"]
282 | not_drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["false"]
283 | if drop_prob > not_drop_prob:
284 | # Creating opernode for dropping
285 | opernode_data = ("drop-ood", oodnode_to_process, "True")
286 | opernode_name = decoder_graph.create_opernode(opernode_data)
287 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process))
288 | # Check for adding OOD or subsequent nodes, (nodeset is changed)
289 | child_nodeset = nodeset[:] # copy before the in-place removal below
290 | child_nodeset.remove(oodnode_to_process)
291 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos)
292 | if isNew:
293 | nodes_2_process.append(child_majornode_name)
294 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True"))
295 | return nodes_2_process
296 |
297 | # Creating opernode for not dropping
298 | opernode_data = ("drop-ood", oodnode_to_process, "False")
299 | opernode_name = decoder_graph.create_opernode(opernode_data)
300 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process))
301 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged)
302 | child_nodeset = nodeset
303 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos)
304 | if isNew:
305 | nodes_2_process.append(child_majornode_name)
306 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False"))
307 | return nodes_2_process
308 |
309 | def addition_major_node(self, main_sent_dict, boxer_graph, decoder_graph, opertype, nodeset, processed_candidates, extra_data):
310 | # node type - value
311 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4}
312 | operval = type_val[opertype]
313 |
314 | # Simple sentences are not available at decoding time; an empty list keeps the data structures aligned with training
315 | simple_sentences = []
316 |
317 | # Checking for the addition of "split" major-node
318 | if operval <= type_val["split"]:
319 | if "split" in self.DISCOURSE_SENTENCE_MODEL: # gate each block on its own operation
320 | # Calculating Split Candidates - DRS Graph node tuples
321 | split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset, self.MAX_SPLIT_PAIR_SIZE)
322 | # print "split_candidate_tuples : " + str(split_candidate_tuples)
323 |
324 | if len(split_candidate_tuples) != 0:
325 | # Adding the major node for split
326 | majornode_data = ("split", nodeset[:], simple_sentences, split_candidate_tuples)
327 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
328 | return majornode_name, isNew
329 |
330 | if operval <= type_val["drop-rel"]:
331 | if "drop-rel" in self.DISCOURSE_SENTENCE_MODEL:
332 | # Calculate drop-rel candidates
333 | processed_relnode = processed_candidates[:] if opertype == "drop-rel" else []
334 | filtered_mod_pos = extra_data if opertype == "drop-rel" else []
335 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode)
336 | if len(relnode_set) != 0:
337 | # Adding the major nodes for drop-rel
338 | majornode_data = ("drop-rel", nodeset[:], simple_sentences, relnode_set, processed_relnode, filtered_mod_pos)
339 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
340 | return majornode_name, isNew
341 |
342 | if operval <= type_val["drop-mod"]:
343 | if "drop-mod" in self.DISCOURSE_SENTENCE_MODEL:
344 | # Calculate drop-mod candidates
345 | processed_mod_pos = processed_candidates[:] if opertype == "drop-mod" else []
346 | filtered_mod_pos = extra_data
347 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos)
348 | if len(modcand_set) != 0:
349 | # Adding the major nodes for drop-mod
350 | majornode_data = ("drop-mod", nodeset[:], simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos)
351 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
352 | return majornode_name, isNew
353 |
354 | if operval <= type_val["drop-ood"]:
355 | if "drop-ood" in self.DISCOURSE_SENTENCE_MODEL:
356 | # Check for drop-OOD node candidates
357 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else []
358 | filtered_mod_pos = extra_data
359 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes)
360 | if len(oodnode_candidates) != 0:
361 | # Adding the major node for drop-ood
362 | majornode_data = ("drop-ood", nodeset[:], simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos)
363 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
364 | return majornode_name, isNew
365 |
366 | # None of them matched, create "fin" node
367 | filtered_mod_pos = extra_data[:]
368 | majornode_data = ("fin", nodeset[:], simple_sentences, filtered_mod_pos)
369 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data)
370 | return majornode_name, isNew
371 |
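# The type_val ordering above fixes the operation pipeline: from a major
# node created for operation X, only X and later operations remain
# candidates before the "fin" node. Illustrative sketch:
#
#   order = ["split", "drop-rel", "drop-mod", "drop-ood"]
#   remaining = order[order.index(opertype):]
#
# e.g. opertype "drop-mod" leaves ["drop-mod", "drop-ood"] to try.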
372 |
--------------------------------------------------------------------------------
/source/explore_training_graph.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : explore_training_graph.py =
4 | #description : Training graph explorer =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 |
11 | from training_graph_module import Training_Graph
12 | import function_select_methods
13 | import functions_prepare_elementtree_dot
14 |
15 | class Explore_Training_Graph:
16 | def __init__(self, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE,
17 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH):
18 | self.output_stream = output_stream
19 |
20 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL
21 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE
22 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL
23 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD
24 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH
25 |
26 | self.method_training_graph = function_select_methods.select_training_graph_method(self.METHOD_TRAINING_GRAPH)
27 |
28 | def explore_training_graph(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph):
29 | # Start a training graph
30 | training_graph = Training_Graph()
31 | nodes_2_process = []
32 |
33 | # Check if Discourse information is available
34 | if boxer_graph.isEmpty():
35 | # Adding finishing major node
36 | nodeset = boxer_graph.get_nodeset()
37 | filtered_mod_pos = []
38 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos)
39 |
40 | # Creating major node
41 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
42 | nodes_2_process.append(majornode_name) # isNew = True
43 | else:
44 | # DRS data is available for the main sentence
45 | # Check to add the starting node
46 | nodeset = boxer_graph.get_nodeset()
47 | majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "split", nodeset, [], [])
48 | nodes_2_process.append(majornode_name) # isNew = True
49 |
50 | # Start expanding the training graph
51 | self.expand_training_graph(nodes_2_process[:], main_sent_dict, boxer_graph, training_graph)
52 |
53 | # Writing sentence element
54 | functions_prepare_elementtree_dot.prepare_write_sentence_element(self.output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
55 |
56 | # # Check to create visual representation
57 | # if int(sentid) <= 100:
58 | # functions_prepare_elementtree_dot.run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
59 |
60 | def expand_training_graph(self, nodes_2_process, main_sent_dict, boxer_graph, training_graph):
61 | #print nodes_2_process
62 | if len(nodes_2_process) == 0:
63 | return
64 |
65 | node_name = nodes_2_process[0]
66 | operreq = training_graph.get_majornode_type(node_name)
67 | nodeset = training_graph.get_majornode_nodeset(node_name)[:]
68 | simple_sentences = training_graph.get_majornode_simple_sentences(node_name)[:]
69 | oper_candidates = training_graph.get_majornode_oper_candidates(node_name)[:]
70 | processed_oper_candidates = training_graph.get_majornode_processed_oper_candidates(node_name)[:]
71 | filtered_positions = training_graph.get_majornode_filtered_postions(node_name)[:]
72 |
73 | if operreq == "split":
74 | split_candidate_tuples = oper_candidates
75 | nodes_2_process = self.process_split_node_training_graph(node_name, nodeset, simple_sentences, split_candidate_tuples,
76 | nodes_2_process, main_sent_dict, boxer_graph, training_graph)
77 |
78 | if operreq == "drop-rel":
79 | relnode_candidates = oper_candidates
80 | processed_relnode_candidates = processed_oper_candidates
81 | filtered_mod_pos = filtered_positions
82 | nodes_2_process = self.process_droprel_node_training_graph(node_name, nodeset, simple_sentences, relnode_candidates, processed_relnode_candidates, filtered_mod_pos,
83 | nodes_2_process, main_sent_dict, boxer_graph, training_graph)
84 |
85 | if operreq == "drop-mod":
86 | mod_candidates = oper_candidates
87 | processed_mod_pos = processed_oper_candidates
88 | filtered_mod_pos = filtered_positions
89 | nodes_2_process = self.process_dropmod_node_training_graph(node_name, nodeset, simple_sentences, mod_candidates, processed_mod_pos, filtered_mod_pos,
90 | nodes_2_process, main_sent_dict, boxer_graph, training_graph)
91 |
92 | if operreq == "drop-ood":
93 | oodnode_candidates = oper_candidates
94 | processed_oodnode_candidates = processed_oper_candidates
95 | filtered_mod_pos = filtered_positions
96 | nodes_2_process = self.process_dropood_node_training_graph(node_name, nodeset, simple_sentences, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos,
97 | nodes_2_process, main_sent_dict, boxer_graph, training_graph)
98 |
99 | self.expand_training_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, training_graph)
100 |
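# Note: expand_training_graph is a FIFO worklist in recursive form: the
# head of nodes_2_process is handled, newly created major nodes are
# appended, and the method recurses on the tail. An equivalent iterative
# drain (a sketch; it avoids deep recursion on very long queues) would be:
#
#   index = 0
#   while index < len(nodes_2_process):
#       ...dispatch on the node type, possibly appending new nodes...
#       index += 1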
101 | def process_split_node_training_graph(self, node_name, nodeset, simple_sentences, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, training_graph):
102 | split_candidate_results = []
103 | splitAchieved = False
104 | for split_candidate in split_candidate_tuples:
105 | isValidSplit, split_results = self.method_training_graph.process_split_candidate_for_split(split_candidate, simple_sentences, main_sent_dict, boxer_graph)
106 | # print "split_candidate : "+str(split_candidate) + " : " + str(isValidSplit)
107 | split_candidate_results.append((isValidSplit, split_results))
108 | if isValidSplit:
109 | splitAchieved = True
110 |
111 | if splitAchieved:
112 | # At least one split candidate succeeded
113 | for split_candidate, results_tuple in zip(split_candidate_tuples, split_candidate_results):
114 | if results_tuple[0]:
115 | # Adding the operation node
116 | not_applied_cands = [item for item in split_candidate_tuples if item is not split_candidate]
117 | opernode_data = ("split", split_candidate, not_applied_cands)
118 | opernode_name = training_graph.create_opernode(opernode_data)
119 | training_graph.create_edge((node_name, opernode_name, split_candidate))
120 |
121 | # Adding children major nodes
122 | for item in results_tuple[1]:
123 | child_nodeset = item[1]
124 | child_nodeset.sort()
125 | parent_child_nodeset = item[2]
126 | simple_sentence = item[3]
127 |
128 | # Check for adding drop-rel or subsequent nodes
129 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, [simple_sentence], boxer_graph, training_graph, "drop-rel", child_nodeset, [], [])
130 | if isNew:
131 | nodes_2_process.append(child_majornode_name)
132 | training_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset))
133 |
134 | else:
135 | # None of the split candidates succeeded, adding the operation node
136 | not_applied_cands = [item for item in split_candidate_tuples]
137 | opernode_data = ("split", None, not_applied_cands)
138 | opernode_name = training_graph.create_opernode(opernode_data)
139 | training_graph.create_edge((node_name, opernode_name, None))
140 |
141 | # Check for adding drop-rel or drop-mod or fin nodes
142 | child_nodeset = nodeset
143 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, [], [])
144 | if isNew:
145 | nodes_2_process.append(child_majornode_name)
146 | training_graph.create_edge((opernode_name, child_majornode_name, None))
147 |
148 | return nodes_2_process
149 |
150 | def process_droprel_node_training_graph(self, node_name, nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph):
151 | relnode_to_process = relnode_set[0]
152 | processed_relnode.append(relnode_to_process)
153 |
154 | isValidDrop = self.method_training_graph.process_rel_candidate_for_drop(relnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph)
155 | if isValidDrop:
156 | # Drop this rel node, adding the operation node
157 | opernode_data = ("drop-rel", relnode_to_process, "True")
158 | opernode_name = training_graph.create_opernode(opernode_data)
159 | training_graph.create_edge((node_name, opernode_name, relnode_to_process))
160 |
161 | # Check for adding REL or subsequent nodes, (nodeset is changed)
162 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos)
163 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos)
164 | if isNew:
165 | nodes_2_process.append(child_majornode_name)
166 | training_graph.create_edge((opernode_name, child_majornode_name, "True"))
167 | else:
168 | # Don't drop this rel node, adding the operation node
169 | opernode_data = ("drop-rel", relnode_to_process, "False")
170 | opernode_name = training_graph.create_opernode(opernode_data)
171 | training_graph.create_edge((node_name, opernode_name, relnode_to_process))
172 |
173 | # Check for adding REL or subsequent nodes, (nodeset is unchanged)
174 | child_nodeset = nodeset
175 | child_filtered_mod_pos = filtered_mod_pos
176 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos)
177 | if isNew:
178 | nodes_2_process.append(child_majornode_name)
179 | training_graph.create_edge((opernode_name, child_majornode_name, "False"))
180 |
181 | return nodes_2_process
182 |
183 | def process_dropmod_node_training_graph(self, node_name, nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph):
184 | modcand_to_process = modcand_set[0]
185 | modcand_position_to_process = modcand_to_process[0]
186 | processed_mod_pos.append(modcand_position_to_process)
187 |
188 | isValidDrop = self.method_training_graph.process_mod_candidate_for_drop(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph)
189 | if isValidDrop:
190 | # Drop this mod pos, adding the operation node
191 | opernode_data = ("drop-mod", modcand_to_process, "True")
192 | opernode_name = training_graph.create_opernode(opernode_data)
193 | training_graph.create_edge((node_name, opernode_name, modcand_to_process))
194 |
195 | # Check for adding further mod or subsequent nodes (nodeset is unchanged)
196 | child_nodeset = nodeset
197 | filtered_mod_pos.append(modcand_position_to_process)
198 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos)
199 | if isNew:
200 | nodes_2_process.append(child_majornode_name)
201 | training_graph.create_edge((opernode_name, child_majornode_name, "True"))
202 | else:
203 | # Don't drop this pos, adding the operation node
204 | opernode_data = ("drop-mod", modcand_to_process, "False")
205 | opernode_name = training_graph.create_opernode(opernode_data)
206 | training_graph.create_edge((node_name, opernode_name, modcand_to_process))
207 |
208 | # Check for adding further mod or subsequent nodes (nodeset is unchanged)
209 | child_nodeset = nodeset
210 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos)
211 | if isNew:
212 | nodes_2_process.append(child_majornode_name)
213 | training_graph.create_edge((opernode_name, child_majornode_name, "False"))
214 | return nodes_2_process
215 |
216 | def process_dropood_node_training_graph(self, node_name, nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph):
217 |
218 | oodnode_to_process = oodnode_set[0]
219 | processed_oodnode.append(oodnode_to_process)
220 |
221 | isValidDrop = self.method_training_graph.process_ood_candidate_for_drop(oodnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph)
222 | if isValidDrop:
223 | # Drop this ood node, adding the operation node
224 | opernode_data = ("drop-ood", oodnode_to_process, "True")
225 | opernode_name = training_graph.create_opernode(opernode_data)
226 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process))
227 |
228 | # Check for adding OOD or subsequent nodes, (nodeset is changed)
229 | child_nodeset = nodeset[:] # copy before the in-place removal below
230 | child_nodeset.remove(oodnode_to_process)
231 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos)
232 | if isNew:
233 | nodes_2_process.append(child_majornode_name)
234 | training_graph.create_edge((opernode_name, child_majornode_name, "True"))
235 | else:
236 | # Don't drop this ood node, adding the operation node
237 | opernode_data = ("drop-ood", oodnode_to_process, "False")
238 | opernode_name = training_graph.create_opernode(opernode_data)
239 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process))
240 |
241 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged)
242 | child_nodeset = nodeset
243 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos)
244 | if isNew:
245 | nodes_2_process.append(child_majornode_name)
246 | training_graph.create_edge((opernode_name, child_majornode_name, "False"))
247 |
248 | return nodes_2_process
249 |
250 | def addition_major_node(self, main_sent_dict, simple_sentences, boxer_graph, training_graph, opertype, nodeset, processed_candidates, extra_data):
251 | # node type - value
252 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4}
253 | operval = type_val[opertype]
254 |
255 | # Checking for the addition of "split" major-node
256 | if operval <= type_val["split"]:
257 | if "split" in self.DISCOURSE_SENTENCE_MODEL: # gate each block on its own operation
258 | # Calculating Split Candidates - DRS Graph node tuples
259 | split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset, self.MAX_SPLIT_PAIR_SIZE)
260 | # print "split_candidate_tuples : " + str(split_candidate_tuples)
261 |
262 | if len(split_candidate_tuples) != 0:
263 | # Adding the major node for split
264 | majornode_data = ("split", nodeset, simple_sentences, split_candidate_tuples)
265 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
266 | return majornode_name, isNew
267 |
268 | if operval <= type_val["drop-rel"]:
269 | if "drop-rel" in self.DISCOURSE_SENTENCE_MODEL:
270 | # Calculate drop-rel candidates
271 | processed_relnode = processed_candidates if opertype == "drop-rel" else []
272 | filtered_mod_pos = extra_data if opertype == "drop-rel" else []
273 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode)
274 | if len(relnode_set) != 0:
275 | # Adding the major nodes for drop-rel
276 | majornode_data = ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos)
277 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
278 | return majornode_name, isNew
279 |
280 | if operval <= type_val["drop-mod"]:
281 | if "drop-mod" in self.DISCOURSE_SENTENCE_MODEL:
282 | # Calculate drop-mod candidates
283 | processed_mod_pos = processed_candidates if opertype == "drop-mod" else []
284 | filtered_mod_pos = extra_data
285 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos)
286 | if len(modcand_set) != 0:
287 | # Adding the major nodes for drop-mod
288 | majornode_data = ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos)
289 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
290 | return majornode_name, isNew
291 |
292 | if operval <= type_val["drop-ood"]:
293 | if "drop-ood" in self.DISCOURSE_SENTENCE_MODEL:
294 | # Check for drop-OOD node candidates
295 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else []
296 | filtered_mod_pos = extra_data
297 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes)
298 | if len(oodnode_candidates) != 0:
299 | # Adding the major node for drop-ood
300 | majornode_data = ("drop-ood", nodeset, simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos)
301 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
302 | return majornode_name, isNew
303 |
304 |
305 | # None of them matched, create "fin" node
306 | filtered_mod_pos = extra_data
307 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos)
308 | majornode_name, isNew = training_graph.create_majornode(majornode_data)
309 | return majornode_name, isNew
310 |
--------------------------------------------------------------------------------
/source/function_select_methods.py:
--------------------------------------------------------------------------------
1 |
2 | #===================================================================================
3 | #description : Methods for training graph and features exploration =
4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
5 | #date : Created in 2014, Later revised in April 2016. =
6 | #version : 0.1 =
7 | #===================================================================================
8 |
9 |
10 | from methods_training_graph import Method_LED, Method_OVERLAP_LED
11 | from methods_feature_extract import Feature_Init, Feature_Nov27
12 |
13 | def select_training_graph_method(METHOD_TRAINING_GRAPH):
14 | return {
15 | "method-0.99-lteq-lt": Method_OVERLAP_LED(0.99, "lteq", "lt"),
16 | "method-0.75-lteq-lt": Method_OVERLAP_LED(0.75, "lteq", "lt"),
17 | "method-0.5-lteq-lteq": Method_OVERLAP_LED(0.5, "lteq", "lteq"),
18 | "method-led-lteq": Method_LED("lteq", "lteq", "lteq"),
19 | "method-led-lt": Method_LED("lt", "lt", "lt")
20 | }[METHOD_TRAINING_GRAPH]
21 |
22 | def select_feature_extract_method(METHOD_FEATURE_EXTRACT):
23 | return {
24 | "feature-init": Feature_Init(),
25 | "feature-Nov27": Feature_Nov27(),
26 | }[METHOD_FEATURE_EXTRACT]
27 |
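# Usage sketch (hypothetical argument values; both selectors are plain
# dictionary dispatch, so an unknown method name raises KeyError):
if __name__ == "__main__":
    method = select_training_graph_method("method-0.5-lteq-lteq")
    extractor = select_feature_extract_method("feature-Nov27")
    print(type(method).__name__ + " / " + type(extractor).__name__)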
--------------------------------------------------------------------------------
/source/functions_configuration_file.py:
--------------------------------------------------------------------------------
1 | #===================================================================================
2 | #title : functions_configuration_file.py =
3 | #description : Prepare/READ configuration file =
4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
5 | #date : Created in 2014, Later revised in April 2016. =
6 | #version : 0.1 =
7 | #===================================================================================
8 |
9 | def write_config_file(config_filename, config_data_dict):
10 | config_file = open(config_filename, "w")
11 |
12 | config_file.write("##############################################################\n"+
13 | "####### Discourse-Complex-Simple Congifuration File ##########\n"+
14 | "##############################################################\n\n")
15 |
16 | config_file.write("# Generation Information\n")
17 | if "TRAIN-BOXER-GRAPH" in config_data_dict:
18 | config_file.write("[TRAIN-BOXER-GRAPH]\n"+config_data_dict["TRAIN-BOXER-GRAPH"]+"\n\n")
19 |
20 | if "TRANSFORMATION-MODEL" in config_data_dict:
21 | config_file.write("[TRANSFORMATION-MODEL]\n"+" ".join(config_data_dict["TRANSFORMATION-MODEL"])+"\n\n")
22 |
23 | if "MAX-SPLIT-SIZE" in config_data_dict:
24 | config_file.write("[MAX-SPLIT-SIZE]\n"+str(config_data_dict["MAX-SPLIT-SIZE"])+"\n\n")
25 |
26 | if "RESTRICTED-DROP-RELATION" in config_data_dict:
27 | config_file.write("[RESTRICTED-DROP-RELATION]\n"+" ".join(config_data_dict["RESTRICTED-DROP-RELATION"])+"\n\n")
28 |
29 | if "ALLOWED-DROP-MODIFIER" in config_data_dict:
30 | config_file.write("[ALLOWED-DROP-MODIFIER]\n"+" ".join(config_data_dict["ALLOWED-DROP-MODIFIER"])+"\n\n")
31 |
32 | if "METHOD-TRAINING-GRAPH" in config_data_dict:
33 | config_file.write("[METHOD-TRAINING-GRAPH]\n"+config_data_dict["METHOD-TRAINING-GRAPH"]+"\n\n")
34 |
35 | if "METHOD-FEATURE-EXTRACT" in config_data_dict:
36 | config_file.write("[METHOD-FEATURE-EXTRACT]\n"+config_data_dict["METHOD-FEATURE-EXTRACT"]+"\n\n")
37 |
38 | if "NUM-EM-ITERATION" in config_data_dict:
39 | config_file.write("[NUM-EM-ITERATION]\n"+str(config_data_dict["NUM-EM-ITERATION"])+"\n\n")
40 |
41 | if "LANGUAGE-MODEL" in config_data_dict:
42 | config_file.write("[LANGUAGE-MODEL]\n"+config_data_dict["LANGUAGE-MODEL"]+"\n\n")
43 |
44 | config_file.write("# Step-1\n")
45 | if "TRAIN-TRAINING-GRAPH" in config_data_dict:
46 | config_file.write("[TRAIN-TRAINING-GRAPH]\n"+config_data_dict["TRAIN-TRAINING-GRAPH"]+"\n\n")
47 |
48 | config_file.write("# Step-2\n")
49 | if "TRANSFORMATION-MODEL-DIR" in config_data_dict:
50 | config_file.write("[TRANSFORMATION-MODEL-DIR]\n"+config_data_dict["TRANSFORMATION-MODEL-DIR"]+"\n\n")
51 |
52 | config_file.write("# Step-3\n")
53 | if "MOSES-COMPLEX-SIMPLE-DIR" in config_data_dict:
54 | config_file.write("[MOSES-COMPLEX-SIMPLE-DIR]\n"+config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"]+"\n\n")
55 |
56 | config_file.close()
57 |
58 |
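# The configuration file written above (and read back by parser_config_file
# below) pairs each bracketed section name with its value on the next line.
# A hypothetical fragment:
#
#   [MAX-SPLIT-SIZE]
#   2
#
#   [TRANSFORMATION-MODEL]
#   split drop-rel drop-mod drop-ood
#
#   [NUM-EM-ITERATION]
#   10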
59 | def parser_config_file(config_file):
60 | config_data = (open(config_file, "r").read().strip()).split("\n")
61 | config_data_dict = {}
62 | count = 0
63 | while count < len(config_data):
64 | if config_data[count].startswith("["):
65 | # Start Information
66 | if config_data[count].strip()[1:-1] == "TRAIN-BOXER-GRAPH":
67 | config_data_dict["TRAIN-BOXER-GRAPH"] = config_data[count+1].strip()
68 |
69 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL":
70 | config_data_dict["TRANSFORMATION-MODEL"] = config_data[count+1].strip().split()
71 |
72 | if config_data[count].strip()[1:-1] == "MAX-SPLIT-SIZE":
73 | config_data_dict["MAX-SPLIT-SIZE"] = int(config_data[count+1].strip())
74 |
75 | if config_data[count].strip()[1:-1] == "RESTRICTED-DROP-RELATION":
76 | config_data_dict["RESTRICTED-DROP-RELATION"] = config_data[count+1].strip().split()
77 |
78 | if config_data[count].strip()[1:-1] == "ALLOWED-DROP-MODIFIER":
79 | config_data_dict["ALLOWED-DROP-MODIFIER"] = config_data[count+1].strip().split()
80 |
81 | if config_data[count].strip()[1:-1] == "METHOD-TRAINING-GRAPH":
82 | config_data_dict["METHOD-TRAINING-GRAPH"] = config_data[count+1].strip()
83 |
84 | if config_data[count].strip()[1:-1] == "METHOD-FEATURE-EXTRACT":
85 | config_data_dict["METHOD-FEATURE-EXTRACT"] = config_data[count+1].strip()
86 |
87 | if config_data[count].strip()[1:-1] == "NUM-EM-ITERATION":
88 | config_data_dict["NUM-EM-ITERATION"] = int(config_data[count+1].strip())
89 |
90 | if config_data[count].strip()[1:-1] == "LANGUAGE-MODEL":
91 | config_data_dict["LANGUAGE-MODEL"] = config_data[count+1].strip()
92 |
93 | # Step 1
94 | if config_data[count].strip()[1:-1] == "TRAIN-TRAINING-GRAPH":
95 | config_data_dict["TRAIN-TRAINING-GRAPH"] = config_data[count+1].strip()
96 |
97 | # Step 2
98 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL-DIR":
99 | config_data_dict["TRANSFORMATION-MODEL-DIR"] = config_data[count+1].strip()
100 |
101 | # Step 3
102 | if config_data[count].strip()[1:-1] == "MOSES-COMPLEX-SIMPLE-DIR":
103 | config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"] = config_data[count+1].strip()
104 |
105 | count += 2
106 | else:
107 | count += 1
108 | return config_data_dict
109 |
--------------------------------------------------------------------------------
/source/functions_model_files.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : functions_model_files.py =
4 | #description : Model file Handler =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | def read_model_files(model_dir, transformation_model):
11 | probability_tables = {}
12 | for trans_method in transformation_model:
13 | modelfile = model_dir+"/D2S-"+trans_method.upper()+".model"
14 | probability_tables[trans_method] = {}
15 | with open(modelfile) as infile:
16 | for line in infile:
17 | data = line.split()
18 | if data[0] not in probability_tables[trans_method]:
19 | probability_tables[trans_method][data[0]] = {data[1]:float(data[2])}
20 | else:
21 | probability_tables[trans_method][data[0]][data[1]] = float(data[2])
22 | print modelfile + " done ..."
23 | return probability_tables
24 |
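# Each D2S-<OPERATION>.model file stores one probability per line: the
# feature string, the label and its value, tab-separated. Hypothetical
# lines from a D2S-DROP-MOD.model file:
#
#   quickly   true    0.6842
#   quickly   false   0.3158
#
# read_model_files folds them into probability_tables[method][feature][label].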
25 | def write_model_files(model_dir, probability_tables, smt_sentence_pairs):
26 | if "split" in probability_tables:
27 | print "Writing "+model_dir+"/D2S-SPLIT.model ..."
28 | foutput = open(model_dir+"/D2S-SPLIT.model", "w")
29 | split_feature_set = probability_tables["split"].keys()
30 | split_feature_set.sort()
31 | for item in split_feature_set:
32 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["split"][item]["true"])+"\n")
33 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["split"][item]["false"])+"\n")
34 | foutput.close()
35 |
36 | if "drop-ood" in probability_tables:
37 | print "Writing "+model_dir+"/D2S-DROP-OOD.model ..."
38 | foutput = open(model_dir+"/D2S-DROP-OOD.model", "w")
39 | drop_ood_feature_set = probability_tables["drop-ood"].keys()
40 | drop_ood_feature_set.sort()
41 | for item in drop_ood_feature_set:
42 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-ood"][item]["true"])+"\n")
43 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-ood"][item]["false"])+"\n")
44 | foutput.close()
45 |
46 | if "drop-rel" in probability_tables:
47 | print "Writing "+model_dir+"/D2S-DROP-REL.model ..."
48 | foutput = open(model_dir+"/D2S-DROP-REL.model", "w")
49 | drop_rel_feature_set = probability_tables["drop-rel"].keys()
50 | drop_rel_feature_set.sort()
51 | for item in drop_rel_feature_set:
52 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-rel"][item]["true"])+"\n")
53 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-rel"][item]["false"])+"\n")
54 | foutput.close()
55 |
56 | if "drop-mod" in probability_tables:
57 | print "Writing "+model_dir+"/D2S-DROP-MOD.model ..."
58 | foutput = open(model_dir+"/D2S-DROP-MOD.model", "w")
59 | drop_mod_feature_set = probability_tables["drop-mod"].keys()
60 | drop_mod_feature_set.sort()
61 | for item in drop_mod_feature_set:
62 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-mod"][item]["true"])+"\n")
63 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-mod"][item]["false"])+"\n")
64 | foutput.close()
65 |
66 | # Writing SMT training data
67 | print "Writing "+model_dir+"/D2S-SMT.source ..."
68 | print "Writing "+ model_dir+"/D2S-SMT.target ..."
69 | fsource = open(model_dir+"/D2S-SMT.source", "w")
70 | ftarget = open(model_dir+"/D2S-SMT.target", "w")
71 | for sentid in smt_sentence_pairs:
72 | # print sentid
73 | # print smt_sentence_pairs[sentid]
74 | for pair in smt_sentence_pairs[sentid]:
75 | fsource.write(pair[0].encode('utf-8')+"\n")
76 | ftarget.write(pair[1].encode('utf-8')+"\n")
77 | fsource.close()
78 | ftarget.close()
79 |
--------------------------------------------------------------------------------
/source/functions_prepare_elementtree_dot.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : functions_prepare_elementtree_dot.py =
4 | #description : Prepare dot file =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 |
11 | import os
12 | import xml.etree.ElementTree as ET
13 | from xml.dom import minidom
14 |
15 | def prettify_xml_element(element):
16 | """Return a pretty-printed XML string for the Element.
17 | """
18 | rough_string = ET.tostring(element)
19 | reparsed = minidom.parseString(rough_string)
20 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ")
21 | return prettyxml.encode("utf-8")
22 |
23 | ############################### Elementary Tree ##########################################
24 |
25 | def prepare_write_sentence_element(output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
26 | # Creating Sentence element
27 | sentence = ET.Element('sentence')
28 | sentence.attrib={"id":str(sentid)}
29 |
30 | # Writing main sentence
31 | main = ET.SubElement(sentence, "main")
32 | mainsent = ET.SubElement(main, "s")
33 | mainsent.text = main_sentence
34 | wordinfo = ET.SubElement(main, "winfo")
35 | mainpositions = main_sent_dict.keys()
36 | mainpositions.sort()
37 | for position in mainpositions:
38 | word = ET.SubElement(wordinfo, "w")
39 | word.text = main_sent_dict[position][0]
40 | word.attrib = {"id":str(position), "pos":main_sent_dict[position][1]}
41 |
42 | # Writing simple sentence
43 | simpleset = ET.SubElement(sentence, "simple-set")
44 | for simple_sentence in simple_sentences:
45 | simple = ET.SubElement(simpleset, "simple")
46 | simplesent = ET.SubElement(simple, "s")
47 | simplesent.text = simple_sentence
48 |
49 | # Writing boxer Data : boxer_graph
50 | boxer = boxer_graph.convert_to_elementarytree()
51 | sentence.append(boxer)
52 |
53 | # Writing Training Graph : training_graph
54 | traininggraph = training_graph.convert_to_elementarytree()
55 | sentence.append(traininggraph)
56 |
57 | output_stream.write(prettify_xml_element(sentence))
58 |
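# The sentence element written above has roughly this shape (a toy example;
# the boxer and training-graph children come from the graph modules' own
# convert_to_elementarytree methods, so they are elided here):
#
#   <sentence id="1">
#     <main>
#       <s>john slept in the park</s>
#       <winfo><w id="0" pos="NNP">john</w> ...</winfo>
#     </main>
#     <simple-set>
#       <simple><s>john slept</s></simple>
#     </simple-set>
#     ...
#   </sentence>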
59 | ############################ Dot - PNG File ###################################################
60 |
61 | def run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
62 | print "Creating boxer and training graphs for sentence id : "+sentid+" ..."
63 |
64 | # Start creating boxer graph
65 | foutput = open("/tmp/boxer-graph-"+sentid+".dot", "w")
66 | boxer_dotstring = boxer_graph.convert_to_dotstring(sentid, main_sentence, main_sent_dict, simple_sentences)
67 | foutput.write(boxer_dotstring)
68 | foutput.close()
69 | os.system("dot -Tpng /tmp/boxer-graph-"+sentid+".dot -o /tmp/boxer-graph-"+sentid+".png")
70 |
71 |
72 | # Start creating training graph
73 | foutput = open("/tmp/training-graph-"+sentid+".dot", "w")
74 | train_dotstring = training_graph.convert_to_dotstring(main_sent_dict, boxer_graph)
75 | foutput.write(train_dotstring)
76 | foutput.close()
77 | os.system("dot -Tpng /tmp/training-graph-"+sentid+".dot -o /tmp/training-graph-"+sentid+".png")
78 |
--------------------------------------------------------------------------------
/source/methods_feature_extract.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #description : Methods for features exploration =
4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
5 | #date : Created in 2014, Later revised in April 2016. =
6 | #version : 0.1 =
7 | #===================================================================================
8 |
9 |
10 | class Feature_Nov27:
11 |
12 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph):
13 | # Calculating iLength
14 | #iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list)
15 | # Get split tuple pattern
16 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple)
17 | #split_feature = split_pattern+"_"+str(iLength)
18 | split_feature = split_pattern
19 | return split_feature
20 |
21 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph):
22 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict)
23 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one
24 | span = boxer_graph.extract_span_min_max(nodeset)
25 | boundaryVal = "false"
26 | if ood_position <= span[0] or ood_position >= span[1]:
27 | boundaryVal = "true"
28 | drop_ood_feature = ood_word+"_"+boundaryVal
29 | return drop_ood_feature
30 |
31 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph):
32 | rel_word = boxer_graph.relations[rel_node]["predicates"]
33 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset)
34 | drop_rel_feature = rel_word+"_"
35 | if len(rel_span) <= 2:
36 | drop_rel_feature += "0-2"
37 | elif len(rel_span) <= 5:
38 | drop_rel_feature += "2-5"
39 | elif len(rel_span) <= 10:
40 | drop_rel_feature += "5-10"
41 | elif len(rel_span) <= 15:
42 | drop_rel_feature += "10-15"
43 | else:
44 | drop_rel_feature += "gt15"
45 | return drop_rel_feature
46 |
47 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph):
48 | mod_pos = int(mod_cand[0])
49 | mod_word = main_sent_dict[mod_pos][0]
50 | #mod_node = mod_cand[1]
51 | drop_mod_feature = mod_word
52 | return drop_mod_feature
53 |
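# Hypothetical feature strings produced by Feature_Nov27 (the exact split
# pattern depends on boxer_graph.get_pattern_4_split_candidate, defined in
# boxer_graph_module.py):
#   drop-ood: "today_true"  (OOD word + whether it lies on the span boundary)
#   drop-rel: "agent_2-5"   (relation predicate + bucketed span length)
#   drop-mod: "quickly"     (the modifier word itself)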
54 | class Feature_Init:
55 |
56 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph):
57 | # Calculating iLength
58 | iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list)
59 | # Get split tuple pattern
60 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple)
61 | split_feature = split_pattern+"_"+str(iLength)
62 | return split_feature
63 |
64 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph):
65 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict)
66 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one
67 | span = boxer_graph.extract_span_min_max(nodeset)
68 | boundaryVal = "false"
69 | if ood_position <= span[0] or ood_position >= span[1]:
70 | boundaryVal = "true"
71 | drop_ood_feature = ood_word+"_"+boundaryVal
72 | return drop_ood_feature
73 |
74 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph):
75 | rel_word = boxer_graph.relations[rel_node]["predicates"]
76 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset)
77 | drop_rel_feature = rel_word+"_"+str(len(rel_span))
78 | return drop_rel_feature
79 |
80 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph):
81 | mod_pos = int(mod_cand[0])
82 | mod_word = main_sent_dict[mod_pos][0]
83 | #mod_node = mod_cand[1]
84 | drop_mod_feature = mod_word
85 | return drop_mod_feature
86 |
87 |
--------------------------------------------------------------------------------
/source/methods_training_graph.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #description : Methods for training graph exploration =
4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
5 | #date : Created in 2014, Later revised in April 2016. =
6 | #version : 0.1 =
7 | #===================================================================================
8 |
9 | from nltk.metrics.distance import edit_distance
10 |
11 | # Compare edit distance
12 | def compare_edit_distance(operator, edit_dist_after_drop, edit_dist_before_drop):
13 | if operator == "lt":
14 | if edit_dist_after_drop < edit_dist_before_drop:
15 | return True
16 | else:
17 | return False
18 |
19 | if operator == "lteq":
20 | if edit_dist_after_drop <= edit_dist_before_drop:
21 | return True
22 | else:
23 | return False
24 |
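# A tiny worked example of the rule above (hypothetical sentences; the
# module compares word-level edit distances before and after a drop):
if __name__ == "__main__":
    simple = "john slept".split()
    before_drop = "john slept peacefully".split()
    after_drop = "john slept".split()
    dist_before = edit_distance(before_drop, simple)  # 1
    dist_after = edit_distance(after_drop, simple)    # 0
    print(compare_edit_distance("lt", dist_after, dist_before))  # True -> drop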
25 | # Split Candidate: Common for all classes
26 | def process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph):
27 | if len(split_candidate) != len(simple_sentences):
28 | # Number of split nodes does not match the number of simple sentences
29 | return False, []
30 |
31 | else:
32 | # Calculate all parent and following subtrees
33 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict()
34 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict)
35 |
36 | node_overlap_dict = {}
37 | for nodename in split_candidate:
38 | split_nodeset = parent_subgraph_nodeset_dict[nodename]
39 | subsentence = boxer_graph.extract_main_sentence(split_nodeset, main_sent_dict, [])
40 | subsentence_words_set = set(subsentence.split())
41 |
42 | overlap_data = []
43 | for index in range(len(simple_sentences)):
44 | simple_sent_words_set = set(simple_sentences[index].split())
45 | overlap_words_set = subsentence_words_set & simple_sent_words_set
46 | overlap_data.append((len(overlap_words_set), index))
47 | overlap_data.sort(reverse=True)
48 |
49 | node_overlap_dict[nodename] = overlap_data[0]
50 |
51 | # Check that every node's best match has nonzero overlap, else fail
52 | overlap_maxvalues = [node_overlap_dict[node][0] for node in node_overlap_dict]
53 | if 0 in overlap_maxvalues:
54 | return False, []
55 | else:
56 | # check the mapping covers all simple sentences
57 | overlap_max_simple_indices = [node_overlap_dict[node][1] for node in node_overlap_dict]
58 | if len(set(overlap_max_simple_indices)) == len(simple_sentences):
59 | # That's a valid split, attach unprocessed graph components
60 | node_subgraph_nodeset_dict, node_span_dict = boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict)
61 |
62 | results = []
63 | for nodename in split_candidate:
64 | span = node_span_dict[nodename]
65 | nodeset = node_subgraph_nodeset_dict[nodename][:]
66 | simple_sentence = simple_sentences[node_overlap_dict[nodename][1]]
67 | results.append((span, nodeset, nodename, simple_sentence))
68 | # Sort them by starting span
69 | results.sort()
70 | return True, results
71 | else:
72 | return False, []
73 |
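# Hypothetical walk-through of the check above: split nodes A and B yield
# the subsentences "john slept in the park" and "it was sunny"; against
# simple_sentences = ["john slept", "it was sunny"], A best-overlaps
# sentence 0 (2 shared words) and B sentence 1 (3 shared words). Both
# maxima are nonzero and together cover every simple sentence, so the
# split candidate is accepted.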
74 | # functions : Drop-REL Candidate
75 | def process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, overlap_percentage):
76 | simple_sentence = " ".join(simple_sentences)
77 | simple_words = simple_sentence.split()
78 |
79 | rel_phrase = boxer_graph.extract_relation_phrase(relnode_candidate, nodeset, main_sent_dict, filtered_mod_pos)
80 |
81 | #print relnode_candidate, rel_phrase
82 |
83 | rel_words = rel_phrase.split()
84 | if len(rel_words) == 0:
85 | return True
86 | else:
87 | found = 0
88 | for word in rel_words:
89 | if word in simple_words:
90 | found += 1
91 | percentage_found = found/float(len(rel_words))
92 |
93 | if percentage_found <= overlap_percentage:
94 | return True
95 | else:
96 | return False
97 |
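# e.g. (hypothetical values) with overlap_percentage = 0.5: rel_phrase
# "in the park" shares no word with the simple sentence "john slept", so
# percentage_found = 0.0 <= 0.5 and the relation is dropped; an empty
# relation phrase is always dropped.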
98 | def process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_rel):
99 | simple_sentence = " ".join(simple_sentences)
100 |
101 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos)
102 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split())
103 |
104 | temp_nodeset, temp_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_candidate, filtered_mod_pos)
105 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, temp_filtered_mod_pos)
106 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split())
107 |
108 | isDrop = compare_edit_distance(opr_drop_rel, edit_dist_after_drop, edit_dist_before_drop)
109 | return isDrop
110 |
111 | # functions : Drop-MOD Candidate
112 | def process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_mod):
113 | simple_sentence = " ".join(simple_sentences)
114 |
115 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos)
116 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split())
117 |
118 | modcand_position_to_process = modcand_to_process[0]
119 | temp_filtered_mod_pos = filtered_mod_pos[:]+[modcand_position_to_process]
120 | sentence_after_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, temp_filtered_mod_pos)
121 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split())
122 |
123 | isDrop = compare_edit_distance(opr_drop_mod, edit_dist_after_drop, edit_dist_before_drop)
124 | return isDrop
125 |
126 | # functions : Drop-OOD Candidate
127 | def process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_ood):
128 | simple_sentence = " ".join(simple_sentences)
129 |
130 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos)
131 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split())
132 |
133 | temp_nodeset = nodeset[:]
134 | temp_nodeset.remove(oodnode_candidate)
135 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, filtered_mod_pos)
136 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split())
137 |
138 | isDrop = compare_edit_distance(opr_drop_ood, edit_dist_after_drop, edit_dist_before_drop)
139 | return isDrop
140 |
141 | class Method_OVERLAP_LED:
142 | def __init__(self, overlap_percentage, opr_drop_mod, opr_drop_ood):
143 | self.overlap_percentage = overlap_percentage
144 | self.opr_drop_mod = opr_drop_mod
145 | self.opr_drop_ood = opr_drop_ood
146 |
147 | # Split candidate
148 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph):
149 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph)
150 | return isSplit, results
151 |
152 | # Drop-REL Candidate
153 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
154 | isDrop = process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.overlap_percentage)
155 | return isDrop
156 |
157 | # Drop-MOD Candidate
158 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
159 | isDrop = process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod)
160 | return isDrop
161 |
162 | # Drop-OOD Candidate
163 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
164 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood)
165 | return isDrop
166 |
167 | class Method_LED:
168 | def __init__(self, opr_drop_rel, opr_drop_mod, opr_drop_ood):
169 | self.opr_drop_rel = opr_drop_rel
170 | self.opr_drop_mod = opr_drop_mod
171 | self.opr_drop_ood = opr_drop_ood
172 |
173 | # Split candidate
174 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph):
175 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph)
176 | return isSplit, results
177 |
178 | # Drop-REL Candidate
179 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
180 | isDrop = process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_rel)
181 | return isDrop
182 |
183 | # Drop-MOD Candidate
184 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
185 | isDrop = process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod)
186 | return isDrop
187 |
188 | # Drop-OOD Candidate
189 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph):
190 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood)
191 | return isDrop
192 |
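# Illustrative usage sketch (the mapping from the method string is inferred
# from the class signatures and command-line choices such as
# "method-0.99-lteq-lt": the numeric part presumably sets the relation-overlap
# threshold, and the "lteq"/"lt" parts the edit-distance operators for the
# modifier and OOD drops):
#
#   method = Method_OVERLAP_LED(0.99, "lteq", "lt")
#   isDrop = method.process_ood_candidate_for_drop(
#       oodnode_candidate, filtered_mod_pos, nodeset,
#       simple_sentences, main_sent_dict, boxer_graph)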
--------------------------------------------------------------------------------
/source/saxparser_xml_stanfordtokenized_boxergraph.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : saxparser_xml_stanfordtokenized_boxergraph.py =
4 | #description : Boxer-Graph-XML-Handler =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | from xml.sax import handler, make_parser
11 |
12 | from boxer_graph_module import Boxer_Graph
13 | from explore_training_graph import Explore_Training_Graph
14 |
15 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph:
16 | def __init__(self, process, xmlfile, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH):
17 | # process: "training" or "testing"
18 | self.process = process
19 |
20 | self.xmlfile = xmlfile
21 |
22 | # output_stream: file stream for training and dictionary for testing
23 | self.output_stream = output_stream
24 |
25 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL
26 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE
27 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL
28 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD
29 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH
30 |
31 | def parse_xmlfile_generating_training_graph(self):
32 | handler = SAX_Handler(self.process, self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE,
33 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH)
34 |
35 | parser = make_parser()
36 | parser.setContentHandler(handler)
37 | parser.parse(self.xmlfile)
38 |
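# Illustrative usage sketch (argument values mirror the defaults in
# start_learning_training_models.py; file names are hypothetical):
#
#   xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph(
#       "training", "corpus.boxer-graph.xml", open("training-graph.xml", "w"),
#       ["split", "drop-ood", "drop-rel", "drop-mod"], 2,
#       ["agent", "patient", "eq", "theme"],
#       ["jj", "jjr", "jjs", "rb", "rbr", "rbs"], "method-0.99-lteq-lt")
#   xml_handler.parse_xmlfile_generating_training_graph()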
39 | class SAX_Handler(handler.ContentHandler):
40 | def __init__(self, process, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE,
41 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH):
42 | self.process = process
43 | self.output_stream = output_stream
44 |
45 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL
46 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE
47 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL
48 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD
49 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH
50 |
51 | # Training Graph Creator
52 | self.training_graph_handler = Explore_Training_Graph(self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE,
53 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH)
54 |
55 | # Sentence Data
56 | self.sentid = ""
57 | self.main_sentence = ""
58 | self.main_sent_dict = {}
59 | self.boxer_graph = Boxer_Graph()
60 |         self.simple_sentences = []
61 |
62 | # Sentence Flags, temporary variables
63 | self.isMain = False
64 |
65 | self.isS = False
66 | self.sentence = ""
67 | self.wordlist = []
68 |
69 | self.isW = False
70 | self.word = ""
71 | self.wid = ""
72 | self.wpos = ""
73 |
74 | self.isSimple = False
75 |
76 | # Boxer flags, temporary variables
77 | self.isNode = False
78 | self.isRel = False
79 | self.symbol = ""
80 | self.predsymbol = ""
81 | self.locationlist = []
82 |
83 | def startDocument(self):
84 | print "Start parsing the document ..."
85 |
86 | def endDocument(self):
87 | print "End parsing the document ..."
88 |
89 | def startElement(self, nameElt, attrOfElt):
90 | if nameElt == "sentence":
91 | self.sentid = attrOfElt["id"]
92 |
93 | # Refreshing Sentence Data
94 | self.main_sentence = ""
95 | self.main_sent_dict = {}
96 | self.boxer_graph = Boxer_Graph()
97 | self.simple_sentences = []
98 |
99 | if nameElt == "main":
100 | self.isMain = True
101 |
102 | if nameElt == "simple":
103 | self.isSimple = True
104 |
105 | if nameElt == "s":
106 | self.isS = True
107 | self.sentence = ""
108 | self.wordlist = []
109 |
110 | if nameElt == "w":
111 | self.isW = True
112 | self.wid = int(attrOfElt["id"][1:])
113 | self.wpos = attrOfElt["pos"]
114 | self.word = ""
115 |
116 | if nameElt == "node":
117 | self.isNode = True
118 | self.symbol = attrOfElt["sym"]
119 | self.boxer_graph.nodes[self.symbol] = {"positions":[], "predicates":[]}
120 |
121 | if nameElt == "rel":
122 | self.isRel = True
123 | self.symbol = attrOfElt["sym"]
124 | self.boxer_graph.relations[self.symbol] = {"positions":[], "predicates":""}
125 |
126 | if nameElt == "span":
127 | self.locationlist = []
128 |
129 | if nameElt == "pred":
130 | self.locationlist = []
131 | self.predsymbol = attrOfElt["sym"]
132 |
133 | if nameElt == "loc":
134 | if int(attrOfElt["id"][1:]) in self.main_sent_dict:
135 | self.locationlist.append(int(attrOfElt["id"][1:]))
136 |
137 | if nameElt == "edge":
138 | self.boxer_graph.edges.append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"]))
139 |
140 | def endElement(self, nameElt):
141 | if nameElt == "sentence":
142 | #print self.sentid
143 | # print self.main_sentence
144 | # print self.main_sent_dict
145 | # print self.simple_sentences
146 | # print self.boxer_graph
147 |
148 | if self.process == "training":
149 | self.training_graph_handler.explore_training_graph(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, self.boxer_graph)
150 |
151 | if self.process == "testing":
152 | self.output_stream[self.sentid] = [self.main_sentence, self.main_sent_dict, self.boxer_graph]
153 |
154 | # if len(self.main_sentence) > 600:
155 | # print self.sentid
156 | # if len(self.simple_sentences) == 6:
157 | # print self.sentid
158 |
159 | if int(self.sentid)%10000 == 0:
160 | print self.sentid + " training data processed ..."
161 |
162 | if nameElt == "main":
163 | self.isMain = False
164 | if len(self.wordlist) == 0:
165 | self.main_sentence = self.sentence.lower()
166 | else:
167 | self.main_sentence = (" ".join(self.wordlist)).lower()
168 |
169 | if nameElt == "simple":
170 | self.isSimple = False
171 | self.simple_sentences.append(self.sentence.lower())
172 |
173 | if nameElt == "s":
174 | self.isS = False
175 |
176 | if nameElt == "w":
177 | self.isW = False
178 | self.main_sent_dict[self.wid] = (self.word.lower(), self.wpos.lower())
179 | self.wordlist.append(self.word.lower())
180 |
181 | if nameElt == "node":
182 | self.isNode = False
183 | self.boxer_graph.nodes[self.symbol]["predicates"].sort()
184 |
185 | if nameElt == "rel":
186 | self.isRel = False
187 |
188 | if nameElt == "span":
189 | self.locationlist.sort()
190 | if self.isNode:
191 | self.boxer_graph.nodes[self.symbol]["positions"] = self.locationlist[:]
192 | if self.isRel:
193 | self.boxer_graph.relations[self.symbol]["positions"] = self.locationlist[:]
194 |
195 | if nameElt == "pred":
196 | self.locationlist.sort()
197 | if self.isNode:
198 | self.boxer_graph.nodes[self.symbol]["predicates"].append((self.predsymbol, self.locationlist[:]))
199 | if self.isRel:
200 | self.boxer_graph.relations[self.symbol]["predicates"] = self.predsymbol
201 |
202 | def characters(self, chrs):
203 | if self.isS:
204 | self.sentence += chrs
205 |
206 | if self.isW:
207 | self.word += chrs
208 |
209 |
--------------------------------------------------------------------------------
/source/saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py =
4 | #description : Boxer-Training-Graph-XML-Handler =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | from xml.sax import handler, make_parser
11 | from boxer_graph_module import Boxer_Graph
12 | from training_graph_module import Training_Graph
13 | from em_inside_outside_algorithm import EM_InsideOutside_Optimiser
14 | import copy
15 |
16 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph:
17 | def __init__(self, training_xmlfile, NUM_TRAINING_ITERATION, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT):
18 | self.training_xmlfile = training_xmlfile
19 | self.NUM_TRAINING_ITERATION = NUM_TRAINING_ITERATION
20 | self.smt_sentence_pairs = smt_sentence_pairs
21 | self.probability_tables = probability_tables
22 | self.count_tables = count_tables
23 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT
24 |
25 | self.em_io_handler = EM_InsideOutside_Optimiser(self.smt_sentence_pairs, self.probability_tables, self.count_tables, self.METHOD_FEATURE_EXTRACT)
26 |
27 | def parse_to_initialize_probabilitytable(self):
28 | # Initialize probability table and populate self.smt_sentence_pairs
29 | handler = SAX_Handler("init", self.em_io_handler)
30 | parser = make_parser()
31 | parser.setContentHandler(handler)
32 | print "Start parsing "+self.training_xmlfile+" ..."
33 | parser.parse(self.training_xmlfile)
34 |
35 | def parse_to_iterate_probabilitytable(self):
36 | handler = SAX_Handler("iter", self.em_io_handler)
37 | parser = make_parser()
38 | parser.setContentHandler(handler)
39 |
40 | for count in range(self.NUM_TRAINING_ITERATION):
41 | print "Starting iteration: "+str(count+1)+" ..."
42 |
43 | print "Resetting all counts to ZERO ..."
44 | self.em_io_handler.reset_count_table()
45 |
46 | print "Start parsing "+self.training_xmlfile+" ..."
47 | parser.parse(self.training_xmlfile)
48 | print "Ending iteration: "+str(count+1)+" ..."
49 |
50 | print "Updating probability table ..."
51 | self.em_io_handler.update_probability_table()
52 |
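# Illustrative usage sketch (mirrors the call sequence in
# start_learning_training_models.py; the file name is hypothetical). The
# "init" pass must run once before the "iter" passes: it populates the
# probability tables and smt_sentence_pairs that each EM iteration then
# re-estimates.
#
#   em_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph(
#       "corpus.training-graph.xml", 10, {}, {}, {}, "feature-Nov27")
#   em_handler.parse_to_initialize_probabilitytable()
#   em_handler.parse_to_iterate_probabilitytable()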
53 | class SAX_Handler(handler.ContentHandler):
54 | def __init__(self, stage, em_io_handler):
55 | # "init" or "iter" stage
56 | self.stage = stage
57 |
58 | # EM algorithm handler
59 | self.em_io_handler = em_io_handler
60 |
61 | # Sentence Data
62 | self.sentid = ""
63 | self.main_sentence = ""
64 |         self.simple_sentences = []
65 | self.main_sent_dict = {}
66 | # Boxer Data
67 | self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]}
68 | # Training Graph Data
69 | self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]}
70 |
71 | # Common TAG variables
72 | self.isS = False
73 | self.sentence = ""
74 |
75 | # Main
76 | self.isMain = False
77 | self.isWinfo = False
78 | self.isW = False
79 | self.word = ""
80 | self.wid = ""
81 | self.wpos = ""
82 |
83 | # Simple Set
84 | self.isSimple = False
85 |
86 | # Boxer
87 | self.isBoxer = False
88 |
89 | # TrainingGraph
90 | self.isTrainingGraph = False
91 |
92 | # Node
93 | self.isNode = False
94 | self.nodesym = ""
95 |
96 | # Span
97 | self.isSpan = False
98 |
99 | # pred
100 | self.isPred = False
101 | self.predsym = ""
102 |
103 | # relation
104 | self.isRel = False
105 | self.relsym = ""
106 |
107 | # major oper nodes
108 | self.isMajorNodes = False
109 | self.isOperNodes = False
110 |
111 | # type
112 | self.isType = False
113 | self.type = ""
114 |
115 | # Nodeset
116 | self.isNodeset = False
117 |
118 | # Split
119 | self.isSplitCandidate = False
120 | self.isSplitCandidateLeft = False
121 | self.isSC = False
122 |
123 | # Out of discourse OOD
124 | self.isOODCandidates = False
125 | self.isOODProcessed = False
126 |
127 | # Relations
128 | self.isRelCandidates = False
129 | self.isRelProcessed = False
130 |
131 | # Modifiers
132 | self.isModCandidates = False
133 | self.isModposProcessed = False
134 | self.isModposFiltered = False
135 |
136 | def startDocument(self):
137 | print "Start parsing the document ..."
138 |
139 | def endDocument(self):
140 | print "End parsing the document ..."
141 |
142 | def startElement(self, nameElt, attrOfElt):
143 | if nameElt == "sentence":
144 | self.sentid = attrOfElt["id"]
145 |
146 | # Refreshing Sentence Data
147 | self.main_sentence = ""
148 | self.simple_sentences = []
149 | self.main_sent_dict = {}
150 | # Refreshing Boxer Data
151 | self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]}
152 | # Refreshing Training Graph Data
153 | self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]}
154 |
155 | if nameElt == "main":
156 | self.isMain = True
157 |
158 | if nameElt == "s":
159 | self.isS = True
160 | self.sentence = ""
161 |
162 | if nameElt == "winfo":
163 | self.isWinfo = True
164 |
165 | if nameElt == "w":
166 | self.isW = True
167 | self.word = ""
168 | self.wid = int(attrOfElt["id"])
169 | self.wpos = attrOfElt["pos"]
170 |
171 | if nameElt == "simple":
172 | self.isSimple = True
173 |
174 | if nameElt == "box":
175 | self.isBoxer = True
176 |
177 | if nameElt == "train-graph":
178 | self.isTrainingGraph = True
179 |
180 | if nameElt == "major-nodes":
181 | self.isMajorNodes = True
182 |
183 | if nameElt == "oper-nodes":
184 | self.isOperNodes = True
185 |
186 | if nameElt == "node":
187 | self.isNode = True
188 | self.nodesym = attrOfElt["sym"]
189 |
190 | if self.isBoxer == True:
191 | self.boxer_graph["nodes"][self.nodesym] = {"positions": [], "predicates":[]}
192 |
193 | if self.isTrainingGraph == True:
194 | if self.isMajorNodes == True:
195 | self.training_graph["major-nodes"][self.nodesym] = {"type": "", "nodeset": [], "simple-sentences":[],
196 | "split-candidates":[],
197 | "ood-candidates":[], "ood-processed":[],
198 | "rel-candidates":[], "rel-processed":[],
199 | "mod-candidates":[], "modpos-processed":[], "modpos-filtered":[]}
200 |
201 | if self.isOperNodes == True:
202 | self.training_graph["oper-nodes"][self.nodesym] = {"type": "",
203 | "split-candidate":[], "not-split-candidates":[],
204 | "ood-candidate":"", "drop-result":"",
205 | "rel-candidate":"","mod-candidate":""}
206 |
207 | if nameElt == "rel":
208 | self.isRel = True
209 | self.relsym = attrOfElt["sym"]
210 |
211 | if self.isBoxer == True:
212 | self.boxer_graph["relations"][self.relsym] = {"positions": [], "predicates":""}
213 |
214 | if nameElt == "span":
215 | self.isSpan = True
216 |
217 | if nameElt == "pred":
218 | self.isPred = True
219 | self.predsym = attrOfElt["sym"]
220 |
221 | if self.isBoxer == True and self.isNode == True:
222 | self.boxer_graph["nodes"][self.nodesym]["predicates"].append([self.predsym, []])
223 |
224 | if self.isBoxer == True and self.isRel == True:
225 | self.boxer_graph["relations"][self.relsym]["predicates"] = self.predsym
226 |
227 | if nameElt == "loc":
228 | if self.isBoxer == True and self.isNode == True and self.isSpan == True:
229 | self.boxer_graph["nodes"][self.nodesym]["positions"].append(int(attrOfElt["id"]))
230 |
231 | if self.isBoxer == True and self.isNode == True and self.isPred == True:
232 | self.boxer_graph["nodes"][self.nodesym]["predicates"][-1][1].append(int(attrOfElt["id"]))
233 |
234 | if self.isBoxer == True and self.isRel == True and self.isSpan == True:
235 | self.boxer_graph["relations"][self.relsym]["positions"].append(int(attrOfElt["id"]))
236 |
237 | if self.isModposProcessed == True:
238 | if self.isMajorNodes == True:
239 | self.training_graph["major-nodes"][self.nodesym]["modpos-processed"].append(int(attrOfElt["id"]))
240 |
241 | if self.isModposFiltered == True:
242 | if self.isMajorNodes == True:
243 | self.training_graph["major-nodes"][self.nodesym]["modpos-filtered"].append(int(attrOfElt["id"]))
244 |
245 | if nameElt == "edge":
246 | if self.isBoxer == True:
247 | self.boxer_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"]))
248 |
249 | if self.isTrainingGraph == True:
250 | self.training_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"]))
251 |
252 | if nameElt == "type":
253 | self.isType = True
254 | self.type = ""
255 |
256 | if nameElt == "nodeset":
257 | self.isNodeset = True
258 |
259 | if nameElt == "n":
260 | if self.isNodeset == True:
261 | if self.isMajorNodes == True:
262 | self.training_graph["major-nodes"][self.nodesym]["nodeset"].append(attrOfElt["sym"])
263 | if self.isSC == True:
264 | if self.isSplitCandidate == True:
265 | if self.isMajorNodes == True:
266 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"][-1].append(attrOfElt["sym"])
267 | if self.isOperNodes == True:
268 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"].append(attrOfElt["sym"])
269 | if self.isSplitCandidateLeft == True:
270 | if self.isOperNodes == True:
271 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"][-1].append(attrOfElt["sym"])
272 |
273 | if self.isOODCandidates == True:
274 | if self.isMajorNodes == True:
275 | self.training_graph["major-nodes"][self.nodesym]["ood-candidates"].append(attrOfElt["sym"])
276 | if self.isOperNodes == True:
277 | self.training_graph["oper-nodes"][self.nodesym]["ood-candidate"] = attrOfElt["sym"]
278 |
279 | if self.isOODProcessed == True:
280 | if self.isMajorNodes == True:
281 | self.training_graph["major-nodes"][self.nodesym]["ood-processed"].append(attrOfElt["sym"])
282 |
283 | if self.isRelCandidates == True:
284 | if self.isMajorNodes == True:
285 | self.training_graph["major-nodes"][self.nodesym]["rel-candidates"].append(attrOfElt["sym"])
286 | if self.isOperNodes == True:
287 | self.training_graph["oper-nodes"][self.nodesym]["rel-candidate"] = attrOfElt["sym"]
288 |
289 | if self.isRelProcessed == True:
290 | if self.isMajorNodes == True:
291 | self.training_graph["major-nodes"][self.nodesym]["rel-processed"].append(attrOfElt["sym"])
292 |
293 | if self.isModCandidates == True:
294 | if self.isMajorNodes == True:
295 | self.training_graph["major-nodes"][self.nodesym]["mod-candidates"].append((attrOfElt["loc"], attrOfElt["sym"]))
296 | if self.isOperNodes == True:
297 | self.training_graph["oper-nodes"][self.nodesym]["mod-candidate"] = (attrOfElt["loc"], attrOfElt["sym"])
298 |
299 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied":
300 | self.isSplitCandidate = True
301 |
302 | if nameElt == "split-candidate-left":
303 | self.isSplitCandidateLeft = True
304 |
305 | if nameElt == "sc":
306 | self.isSC = True
307 | if self.isSplitCandidate == True:
308 | if self.isMajorNodes == True:
309 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"].append([])
310 | if self.isOperNodes == True:
311 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"] = []
312 | if self.isSplitCandidateLeft == True:
313 | if self.isOperNodes == True:
314 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"].append([])
315 |
316 | if nameElt == "ood-candidate" or nameElt == "ood-candidates":
317 | self.isOODCandidates = True
318 |
319 | if nameElt == "ood-processed":
320 | self.isOODProcessed = True
321 |
322 | if nameElt == "rel-candidate" or nameElt == "rel-candidates":
323 | self.isRelCandidates = True
324 |
325 | if nameElt == "rel-processed":
326 | self.isRelProcessed = True
327 |
328 | if nameElt == "mod-candidate" or nameElt == "mod-candidates":
329 | self.isModCandidates = True
330 |
331 | if nameElt == "mod-loc-processed":
332 | self.isModposProcessed = True
333 |
334 | if nameElt == "mod-loc-filtered":
335 | self.isModposFiltered = True
336 |
337 | if nameElt == "is-dropped":
338 | if self.isOperNodes == True:
339 | self.training_graph["oper-nodes"][self.nodesym]["drop-result"] = attrOfElt["val"]
340 |
341 | def endElement(self, nameElt):
342 | if nameElt == "sentence":
343 | # print self.sentid
344 | # print
345 | # print self.main_sentence
346 | # print
347 | # print self.main_sent_dict
348 | # print
349 | # print self.simple_sentences
350 | # print
351 | # print self.boxer_graph
352 | # print
353 | # print self.training_graph
354 |
355 | # Creating the original format of Boxer and Training Graph
356 | final_boxer_graph = Boxer_Graph()
357 | for nodename in self.boxer_graph["nodes"]:
358 | final_boxer_graph.nodes[nodename] = copy.copy(self.boxer_graph["nodes"][nodename])
359 | for nodename in self.boxer_graph["relations"]:
360 | final_boxer_graph.relations[nodename] = copy.copy(self.boxer_graph["relations"][nodename])
361 | final_boxer_graph.edges = self.boxer_graph["edges"][:]
362 |
363 | final_training_graph = Training_Graph()
364 | for nodename in self.training_graph["major-nodes"]:
365 | nodedict = self.training_graph["major-nodes"][nodename]
366 | if nodedict["type"] == "split":
367 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["split-candidates"][:])
368 | if nodedict["type"] == "drop-rel":
369 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["rel-candidates"][:],
370 | nodedict["rel-processed"][:], nodedict["modpos-filtered"][:])
371 | if nodedict["type"] == "drop-mod":
372 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["mod-candidates"][:],
373 | nodedict["modpos-processed"][:], nodedict["modpos-filtered"][:])
374 | if nodedict["type"] == "drop-ood":
375 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["ood-candidates"][:],
376 | nodedict["ood-processed"][:], nodedict["modpos-filtered"][:])
377 | if nodedict["type"] == "fin":
378 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["modpos-filtered"][:])
379 | for nodename in self.training_graph["oper-nodes"]:
380 | nodedict = self.training_graph["oper-nodes"][nodename]
381 | if nodedict["type"] == "split":
382 | if len(nodedict["split-candidate"]) == 0:
383 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], None, nodedict["not-split-candidates"][:])
384 | else:
385 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["split-candidate"], nodedict["not-split-candidates"][:])
386 | if nodedict["type"] == "drop-rel":
387 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["rel-candidate"], nodedict["drop-result"])
388 | if nodedict["type"] == "drop-mod":
389 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["mod-candidate"], nodedict["drop-result"])
390 | if nodedict["type"] == "drop-ood":
391 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["ood-candidate"], nodedict["drop-result"])
392 | final_training_graph.edges = self.training_graph["edges"][:]
393 |
394 | # Process various stage "init" or "iter"
395 | if self.stage == "init":
396 | self.em_io_handler.initialize_probabilitytable_smt_input(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph)
397 |
398 | if self.stage == "iter":
399 | self.em_io_handler.iterate_over_probabilitytable(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph)
400 |
401 | if int(self.sentid)%10000 == 0:
402 | print self.sentid + " training data processed ..."
403 |
404 | if nameElt == "main":
405 | self.isMain = False
406 |
407 | if nameElt == "s":
408 | self.isS = False
409 |
410 | if self.isMain == True:
411 | self.main_sentence = self.sentence
412 |
413 | if self.isSimple == True:
414 | if self.isNode == True:
415 | if self.isMajorNodes == True:
416 | self.training_graph["major-nodes"][self.nodesym]["simple-sentences"].append(self.sentence)
417 | else:
418 | self.simple_sentences.append(self.sentence)
419 |
420 | if nameElt == "winfo":
421 | self.isWinfo = False
422 |
423 | if nameElt == "w":
424 | self.isW = False
425 |
426 | if self.isWinfo == True:
427 | self.main_sent_dict[self.wid] = (self.word, self.wpos)
428 |
429 | if nameElt == "simple":
430 | self.isSimple = False
431 |
432 | if nameElt == "box":
433 | self.isBoxer = False
434 |
435 | if nameElt == "train-graph":
436 | self.isTrainingGraph = False
437 |
438 | if nameElt == "major-nodes":
439 | self.isMajorNodes = False
440 |
441 | if nameElt == "oper-nodes":
442 | self.isOperNodes = False
443 |
444 | if nameElt == "node":
445 | self.isNode = False
446 |
447 | if nameElt == "rel":
448 | self.isRel = False
449 |
450 | if nameElt == "span":
451 | self.isSpan = False
452 |
453 | if nameElt == "pred":
454 | self.isPred = False
455 |
456 | if nameElt == "type":
457 | self.isType = False
458 | if self.isMajorNodes == True:
459 | self.training_graph["major-nodes"][self.nodesym]["type"] = self.type
460 | if self.isOperNodes == True:
461 | self.training_graph["oper-nodes"][self.nodesym]["type"] = self.type
462 |
463 | if nameElt == "nodeset":
464 | self.isNodeset = False
465 |
466 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied":
467 | self.isSplitCandidate = False
468 |
469 | if nameElt == "split-candidate-left":
470 | self.isSplitCandidateLeft = False
471 |
472 | if nameElt == "sc":
473 | self.isSC = False
474 |
475 | if nameElt == "ood-candidate" or nameElt == "ood-candidates":
476 | self.isOODCandidates = False
477 |
478 | if nameElt == "ood-processed":
479 | self.isOODProcessed = False
480 |
481 | if nameElt == "rel-candidate" or nameElt == "rel-candidates":
482 | self.isRelCandidates = False
483 |
484 | if nameElt == "rel-processed":
485 | self.isRelProcessed = False
486 |
487 | if nameElt == "mod-candidate" or nameElt == "mod-candidates":
488 | self.isModCandidates = False
489 |
490 | if nameElt == "mod-loc-processed":
491 | self.isModposProcessed = False
492 |
493 | if nameElt == "mod-loc-filtered":
494 | self.isModposFiltered = False
495 |
496 | def characters(self, chrs):
497 | if self.isS:
498 | self.sentence += chrs
499 |
500 | if self.isW:
501 | self.word += chrs
502 |
503 | if self.isType:
504 | self.type += chrs
505 |
--------------------------------------------------------------------------------
/source/training_graph_module.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : training_graph_module.py =
4 | #description : Define Training Graph =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 |
11 | import xml.etree.ElementTree as ET
12 | import copy
13 |
14 | class Training_Graph:
15 | def __init__(self):
16 | '''
17 | self.major_nodes["MN-*"]
18 | ("split", nodeset, simple_sentences, split_candidate_tuples)
19 | ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos)
20 | ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos)
21 |         ("drop-ood", nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos)
22 | ("fin", nodeset, simple_sentences, filtered_mod_pos)
23 |
24 | self.oper_nodes["ON-*"]
25 | ("split", split_candidate, not_applied_cands)
26 | ("split", None, not_applied_cands)
27 | ("drop-rel", relnode_to_process, "True")
28 | ("drop-rel", relnode_to_process, "False")
29 | ("drop-mod", modcand_to_process, "True")
30 | ("drop-mod", modcand_to_process, "False")
31 | ("drop-ood", oodnode_to_process, "True")
32 | ("drop-ood", oodnode_to_process, "False")
33 |
34 | self.edges = [(par, dep, lab)]
35 |
36 | '''
37 | self.major_nodes = {}
38 | self.oper_nodes = {}
39 | self.edges = []
40 |
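    # Illustrative sketch (hypothetical node symbols and labels) of the
    # structures documented above: one "split" major node rewritten by one
    # operation node.
    #
    #   tg = Training_Graph()
    #   tg.major_nodes["MN-1"] = ("split", ["x1", "e1"], ["john slept ."], [("x1",)])
    #   tg.oper_nodes["ON-1"] = ("split", ("x1",), [])
    #   tg.edges.append(("MN-1", "ON-1", "split"))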
41 | def get_majornode_type(self, majornode_name):
42 | majornode_tuple = self.major_nodes[majornode_name]
43 | return majornode_tuple[0]
44 |
45 | def get_majornode_nodeset(self, majornode_name):
46 | majornode_tuple = self.major_nodes[majornode_name]
47 | return majornode_tuple[1]
48 |
49 | def get_majornode_simple_sentences(self, majornode_name):
50 | majornode_tuple = self.major_nodes[majornode_name]
51 | return majornode_tuple[2]
52 |
53 | def get_majornode_oper_candidates(self, majornode_name):
54 | majornode_tuple = self.major_nodes[majornode_name]
55 | if majornode_tuple[0] != "fin":
56 | return majornode_tuple[3]
57 | else:
58 | return []
59 |
60 | def get_majornode_processed_oper_candidates(self, majornode_name):
61 | majornode_tuple = self.major_nodes[majornode_name]
62 | if majornode_tuple[0] != "fin" and majornode_tuple[0] != "split":
63 | return majornode_tuple[4]
64 | else:
65 | return []
66 |
67 | def get_majornode_filtered_postions(self, majornode_name):
68 | majornode_tuple = self.major_nodes[majornode_name]
69 | if majornode_tuple[0] == "fin":
70 | return majornode_tuple[3]
71 | elif majornode_tuple[0] == "drop-rel" or majornode_tuple[0] == "drop-mod" or majornode_tuple[0] == "drop-ood":
72 | return majornode_tuple[5]
73 | else:
74 | return []
75 |
76 | def get_opernode_type(self, opernode_name):
77 | opernode_tuple = self.oper_nodes[opernode_name]
78 | return opernode_tuple[0]
79 |
80 | def get_opernode_oper_candidate(self, opernode_name):
81 | opernode_tuple = self.oper_nodes[opernode_name]
82 | return opernode_tuple[1]
83 |
84 | def get_opernode_failed_oper_candidates(self, opernode_name):
85 | opernode_tuple = self.oper_nodes[opernode_name]
86 | if opernode_tuple[0] == "split":
87 | return opernode_tuple[2]
88 | else:
89 | return []
90 |
91 | def get_opernode_drop_result(self, opernode_name):
92 | opernode_tuple = self.oper_nodes[opernode_name]
93 | if opernode_tuple[0] != "split":
94 | return opernode_tuple[2]
95 | else:
96 | return None
97 |
98 | # @@@@@@@@@@@@@@@@@@@@@ Create nodes @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
99 |
100 | def create_majornode(self, majornode_data):
101 | copy_data = copy.copy(majornode_data)
102 |
103 | # Check if node exists
104 | for node_name in self.major_nodes:
105 | node_data = self.major_nodes[node_name]
106 | if node_data == copy_data:
107 | return node_name, False
108 |
109 | # Otherwise create new node
110 | majornode_name = "MN-"+str(len(self.major_nodes)+1)
111 | self.major_nodes[majornode_name] = copy_data
112 | return majornode_name, True
113 |
114 | def create_opernode(self, opernode_data):
115 | copy_data = copy.copy(opernode_data)
116 | opernode_name = "ON-"+str(len(self.oper_nodes)+1)
117 | self.oper_nodes[opernode_name] = copy_data
118 | return opernode_name
119 |
120 | def create_edge(self, edge_data):
121 | self.edges.append(copy.copy(edge_data))
122 |
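    # Sketch of the de-duplication in create_majornode above (hypothetical
    # data): creating an identical major-node tuple twice returns the first
    # node's name with False, so different derivation paths converge on a
    # single node.
    #
    #   name1, new1 = tg.create_majornode(("fin", ["x1"], ["john slept ."], []))
    #   name2, new2 = tg.create_majornode(("fin", ["x1"], ["john slept ."], []))
    #   # name1 == name2 == "MN-1"; new1 is True, new2 is False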
123 | # @@@@@@@@@@@@@@@@@@@@@@@@ Final sentences @@@@@@@@@@@@@@@@@@@@@@@@@@
124 |
125 | def get_final_sentences(self, main_sentence, main_sent_dict, boxer_graph):
126 | fin_nodes = self.find_all_fin_majornode()
127 | print
128 | node_sent = []
129 | for node in fin_nodes:
130 | # intpart = int(node[3:]) # removing "MN-", lower int part sentence comes before
131 | if boxer_graph.isEmpty():
132 | #main_sentence = main_sentence.encode('utf-8')
133 | simple_sentences = self.get_majornode_simple_sentences(node)
134 | simple_sentence = " ".join(simple_sentences)
135 | #node_sent.append((intpart, main_sentence, simple_sentence))
136 |
137 | node_span = (0, len(main_sentence.split()))
138 | node_sent.append((node_span, main_sentence, simple_sentence))
139 |
140 | else:
141 | nodeset = self.get_majornode_nodeset(node)
142 | node_span = boxer_graph.extract_span_min_max(nodeset)
143 | filtered_pos = self.get_majornode_filtered_postions(node)
144 | main_sentence = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_pos)
145 | simple_sentences = self.get_majornode_simple_sentences(node)
146 | simple_sentence = " ".join(simple_sentences)
147 | #node_sent.append((intpart, main_sentence, simple_sentence))
148 | node_sent.append((node_span, main_sentence, simple_sentence))
149 | node_sent.sort()
150 | sentence_pairs = [(item[1], item[2]) for item in node_sent]
151 | #sentence_pairs = [(item[1].encode('utf-8'), item[2].encode('utf-8')) for item in node_sent]
152 | #print sentence_pairs
153 | return sentence_pairs
154 |
155 |
156 | # @@@@@@@@@@@@@@@@@@@@@ Find nodes in Training Graph @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
157 |
158 | def find_all_fin_majornode(self):
159 | fin_nodes = []
160 | for major_node in self.major_nodes:
161 | if self.major_nodes[major_node][0] == "fin":
162 | fin_nodes.append(major_node)
163 | return fin_nodes
164 |
165 | def find_children_of_majornode(self, major_node):
166 | children_oper_nodes = []
167 | for edge in self.edges:
168 | if edge[0] == major_node:
169 | children_oper_nodes.append(edge[1])
170 | return children_oper_nodes
171 |
172 | def find_children_of_opernode(self, oper_node):
173 | children_major_nodes = []
174 | for edge in self.edges:
175 | if edge[0] == oper_node:
176 | children_major_nodes.append(edge[1])
177 | return children_major_nodes
178 |
179 | def find_parents_of_majornode(self, major_node):
180 | parents_oper_nodes = []
181 | for edge in self.edges:
182 | if edge[1] == major_node:
183 | parent_oper_node = edge[0]
184 | parents_oper_nodes.append(parent_oper_node)
185 | return parents_oper_nodes
186 |
187 | def find_parent_of_opernode(self, oper_node):
188 | parent_major_node = ""
189 | for edge in self.edges:
190 | if edge[1] == oper_node:
191 | parent_major_node = edge[0]
192 | break
193 | return parent_major_node
194 |
195 | # @@@@@@@@@@@@ Training Graph -> Elementary Tree @@@@@@@@@@@@@@@@@@@@
196 |
197 | def convert_to_elementarytree(self):
198 | traininggraph = ET.Element("train-graph")
199 |
200 | # Major nodes
201 | major_nodes_elt = ET.SubElement(traininggraph, "major-nodes")
202 | for major_nodename in self.major_nodes:
203 | major_nodetype = self.get_majornode_type(major_nodename)
204 | major_nodeset = self.get_majornode_nodeset(major_nodename)
205 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename)
206 | oper_candidates = self.get_majornode_oper_candidates(major_nodename)
207 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename)
208 | filtered_postions = self.get_majornode_filtered_postions(major_nodename)
209 |
210 | major_node_elt = ET.SubElement(major_nodes_elt, "node")
211 | major_node_elt.attrib = {"sym":major_nodename}
212 |
213 | # Opertype
214 | major_nodetype_elt = ET.SubElement(major_node_elt, "type")
215 | major_nodetype_elt.text = major_nodetype
216 |
217 | # Nodeset
218 | major_nodeset_elt = ET.SubElement(major_node_elt, "nodeset")
219 | for node in major_nodeset:
220 | node_elt = ET.SubElement(major_nodeset_elt, "n")
221 | node_elt.attrib = {"sym":node}
222 |
223 | # Simple sentences
224 | major_simple_sentences_elt = ET.SubElement(major_node_elt, "simple-set")
225 | for simple_sentence in major_simple_sentences:
226 | major_simple_sentence_elt = ET.SubElement(major_simple_sentences_elt, "simple")
227 | sent_data_elt = ET.SubElement(major_simple_sentence_elt, "s")
228 | sent_data_elt.text = simple_sentence
229 |
230 | # Oper Candidates
231 | if major_nodetype == "split":
232 | split_candidate_tuples = oper_candidates
233 | major_split_candidates_elt = ET.SubElement(major_node_elt, "split-candidates")
234 | for split_candidate in split_candidate_tuples:
235 | major_split_candidate_elt = ET.SubElement(major_split_candidates_elt, "sc")
236 | for node in split_candidate:
237 | node_elt = ET.SubElement(major_split_candidate_elt, "n")
238 | node_elt.attrib = {"sym":str(node)}
239 |
240 | if major_nodetype == "drop-rel":
241 | relnode_set = oper_candidates
242 | major_relnode_set_elt = ET.SubElement(major_node_elt, "rel-candidates")
243 | for node in relnode_set:
244 | node_elt = ET.SubElement(major_relnode_set_elt, "n")
245 | node_elt.attrib = {"sym":str(node)}
246 |
247 | processed_relnodes = processed_oper_candidates
248 | major_processed_relnodes_elt = ET.SubElement(major_node_elt, "rel-processed")
249 | for node in processed_relnodes:
250 | node_elt = ET.SubElement(major_processed_relnodes_elt, "n")
251 | node_elt.attrib = {"sym":str(node)}
252 |
253 | filtered_mod_pos = filtered_postions
254 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered")
255 | for node in filtered_mod_pos:
256 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc")
257 | node_elt.attrib = {"id":str(node)}
258 |
259 | if major_nodetype == "drop-mod":
260 | modcand_set = oper_candidates
261 | major_modcand_set_elt = ET.SubElement(major_node_elt, "mod-candidates")
262 | for node in modcand_set:
263 | node_elt = ET.SubElement(major_modcand_set_elt, "n")
264 | node_elt.attrib = {"sym":node[1],"loc":str(node[0])}
265 |
266 | processed_mod_pos = processed_oper_candidates
267 | major_processed_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-processed")
268 | for node in processed_mod_pos:
269 | node_elt = ET.SubElement(major_processed_mod_pos_elt, "loc")
270 | node_elt.attrib = {"id":str(node)}
271 |
272 | filtered_mod_pos = filtered_postions
273 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered")
274 | for node in filtered_mod_pos:
275 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc")
276 | node_elt.attrib = {"id":str(node)}
277 |
278 | if major_nodetype == "drop-ood":
279 | oodnode_set = oper_candidates
280 | major_oodnode_set_elt = ET.SubElement(major_node_elt, "ood-candidates")
281 | for node in oodnode_set:
282 | node_elt = ET.SubElement(major_oodnode_set_elt, "n")
283 | node_elt.attrib = {"sym":str(node)}
284 |
285 | processed_oodnodes = processed_oper_candidates
286 | major_processed_oodnodes_elt = ET.SubElement(major_node_elt, "ood-processed")
287 | for node in processed_oodnodes:
288 | node_elt = ET.SubElement(major_processed_oodnodes_elt, "n")
289 | node_elt.attrib = {"sym":str(node)}
290 |
291 | filtered_mod_pos = filtered_postions
292 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered")
293 | for node in filtered_mod_pos:
294 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc")
295 | node_elt.attrib = {"id":str(node)}
296 |
297 | if major_nodetype == "fin":
298 | filtered_mod_pos = filtered_postions
299 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered")
300 | for node in filtered_mod_pos:
301 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc")
302 | node_elt.attrib = {"id":str(node)}
303 |
304 | # Oper nodes
305 | oper_nodes_elt = ET.SubElement(traininggraph, "oper-nodes")
306 | for oper_nodename in self.oper_nodes:
307 | oper_node_elt = ET.SubElement(oper_nodes_elt, "node")
308 | oper_node_elt.attrib = {"sym":oper_nodename}
309 |
310 | oper_nodedata = self.oper_nodes[oper_nodename]
311 |
312 | # Nodetype
313 | oper_nodetype = oper_nodedata[0]
314 | oper_nodetype_elt = ET.SubElement(oper_node_elt, "type")
315 | oper_nodetype_elt.text = oper_nodetype
316 |
317 | if oper_nodetype == "split":
318 | split_cand_applied = oper_nodedata[1]
319 | split_cand_applied_elt = ET.SubElement(oper_node_elt, "split-candidate-applied")
320 | if split_cand_applied != None:
321 | split_candidate_elt = ET.SubElement(split_cand_applied_elt, "sc")
322 | for node in split_cand_applied:
323 | node_elt = ET.SubElement(split_candidate_elt, "n")
324 | node_elt.attrib = {"sym":node}
325 |
326 | split_cand_left = oper_nodedata[2]
327 | split_cand_left_elt = ET.SubElement(oper_node_elt, "split-candidate-left")
328 | for split_candidate in split_cand_left:
329 | split_candidate_elt = ET.SubElement(split_cand_left_elt, "sc")
330 | for node in split_candidate:
331 | node_elt = ET.SubElement(split_candidate_elt, "n")
332 | node_elt.attrib = {"sym":node}
333 |
334 | if oper_nodetype == "drop-ood":
335 | oodnode_to_process = oper_nodedata[1]
336 | oodnode_to_process_elt = ET.SubElement(oper_node_elt, "ood-candidate")
337 | node_elt = ET.SubElement(oodnode_to_process_elt, "n")
338 | node_elt.attrib = {"sym":oodnode_to_process}
339 |
340 | dropped = oper_nodedata[2]
341 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped")
342 | dropped_elt.attrib = {"val":dropped}
343 |
344 | if oper_nodetype == "drop-rel":
345 | relnode_to_process = oper_nodedata[1]
346 | relnode_to_process_elt = ET.SubElement(oper_node_elt, "rel-candidate")
347 | node_elt = ET.SubElement(relnode_to_process_elt, "n")
348 | node_elt.attrib = {"sym":relnode_to_process}
349 |
350 | dropped = oper_nodedata[2]
351 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped")
352 | dropped_elt.attrib = {"val":dropped}
353 |
354 | if oper_nodetype == "drop-mod":
355 | modcand_to_process = oper_nodedata[1]
356 | modcand_to_process_elt = ET.SubElement(oper_node_elt, "mod-candidate")
357 | node_elt = ET.SubElement(modcand_to_process_elt, "n")
358 | node_elt.attrib = {"sym":modcand_to_process[1],"loc":str(modcand_to_process[0])}
359 |
360 | dropped = oper_nodedata[2]
361 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped")
362 | dropped_elt.attrib = {"val":dropped}
363 |
364 | tg_edges_elt = ET.SubElement(traininggraph, "tg-edges")
365 | for tg_edge in self.edges:
366 | tg_edge_elt = ET.SubElement(tg_edges_elt, "edge")
367 | tg_edge_elt.attrib = {"lab":str(tg_edge[2]), "par":tg_edge[0], "dep":tg_edge[1]}
368 |
369 | return traininggraph
370 |
371 | # @@@@@@@@@@@@ Training Graph -> Dot Graph @@@@@@@@@@@@@@@@@@@@
372 |
373 | def convert_to_dotstring(self, main_sent_dict, boxer_graph):
374 | dot_string = "digraph boxer{\n"
375 |
376 | nodename = 0
377 | node_graph_dict = {}
378 | # Writing Major nodes
379 | for major_nodename in self.major_nodes:
380 | major_nodetype = self.get_majornode_type(major_nodename)
381 | major_nodeset = self.get_majornode_nodeset(major_nodename)
382 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename)
383 | oper_candidates = self.get_majornode_oper_candidates(major_nodename)
384 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename)
385 | filtered_postions = self.get_majornode_filtered_postions(major_nodename)
386 |
387 | main_sentence = boxer_graph.extract_main_sentence(major_nodeset, main_sent_dict, filtered_postions)
388 | simple_sentence_string = " ".join(major_simple_sentences)
389 | major_node_data = [major_nodetype, major_nodeset[:], main_sentence, simple_sentence_string]
390 |
391 | if major_nodetype == "split":
392 | major_node_data += [oper_candidates[:]]
393 |
394 | if major_nodetype == "drop-rel" or major_nodetype == "drop-mod" or major_nodetype == "drop-ood":
395 | major_node_data += [oper_candidates[:], processed_oper_candidates[:], filtered_postions[:]]
396 |
397 | if major_nodetype == "fin":
398 | major_node_data += [filtered_postions[:]]
399 |
400 | major_node_string, nodename = self.textdot_majornode(nodename, major_nodename, major_node_data[:])
401 | node_graph_dict[major_nodename] = "struct"+str(nodename)
402 | dot_string += major_node_string+"\n"
403 |
404 | # Writing operation nodes
405 | for oper_nodename in self.oper_nodes:
406 | oper_node_string, nodename = self.textdot_opernode(nodename, oper_nodename, self.oper_nodes[oper_nodename])
407 | node_graph_dict[oper_nodename] = "struct"+str(nodename)
408 | dot_string += oper_node_string+"\n"
409 |
410 | # Writing edges
411 | for edge in self.edges:
412 | par_graphnode = node_graph_dict[edge[0]]
413 | dep_graphnode = node_graph_dict[edge[1]]
414 | dot_string += par_graphnode+" -> "+dep_graphnode+"[label=\""+str(edge[2])+"\"];\n"
415 | dot_string += "}"
416 | return dot_string
417 |
418 | def textdot_majornode(self, nodename, node, nodedata):
419 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{"
420 | textdot_node += "major-node: "+node+"|"
421 | index = 0
422 | for data in nodedata:
423 | textdot_node += self.processtext(str(data))
424 | index += 1
425 | if index < len(nodedata):
426 | textdot_node += "|"
427 | textdot_node += "}\"];"
428 | return textdot_node, nodename+1
429 |
430 | def textdot_opernode(self, nodename, node, nodedata):
431 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{"
432 | textdot_node += "oper-node: "+node+"|"
433 | index = 0
434 | for data in nodedata:
435 | textdot_node += self.processtext(str(data))
436 | index += 1
437 | if index < len(nodedata):
438 | textdot_node += "|"
439 | textdot_node += "}\"];"
440 | return textdot_node, nodename+1
441 |
442 | def processtext(self, inputstring):
443 | linesize = 100
444 | outputstring = ""
445 | index = 0
446 | substr = inputstring[index*linesize:(index+1)*linesize]
447 | while (substr!=""):
448 | outputstring += substr
449 | index += 1
450 | substr = inputstring[index*linesize:(index+1)*linesize]
451 | if substr!="":
452 | outputstring += "\\n"
453 | return outputstring
454 |
455 | # @@@@@@@@@@@@@@@@@@@@@@@@@@ DONE @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
456 |
--------------------------------------------------------------------------------
/start_learning_training_models.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : start_learning_training_models.py =
4 | #description : Training =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 |
11 | import os
12 | import argparse
13 | import sys
14 | import datetime
15 |
16 | # Needed for the wikilarge data; raising the recursion limit is otherwise not recommended
17 | sys.setrecursionlimit(10000)
18 |
19 | sys.path.append("./source")
20 | import functions_configuration_file
21 | import functions_model_files
22 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph
23 | from saxparser_xml_stanfordtokenized_boxergraph_traininggraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph
24 |
25 | MOSESDIR = "~/tools/mosesdecoder"
26 |
27 | if __name__=="__main__":
28 | # Command line arguments ##############
29 |     argparser = argparse.ArgumentParser(prog='python start_learning_training_models.py', description=('Start the training process.'))
30 |
31 | # Optional [default value: 1]
32 | argparser.add_argument('--start-state', help='Start state of the training process', choices=['1','2','3'], default='1', metavar=('Start_State'))
33 |
34 | # Optional [default value: 3]
35 | argparser.add_argument('--end-state', help='End state of the training process', choices=['1','2','3'], default='3', metavar=('End_State'))
36 |
37 |     # Optional [default value: split:drop-ood:drop-rel:drop-mod] (any combination of these, order is not important; drop-ood is only applied after split)
38 | argparser.add_argument('--transformation', help='Transformation models learned', default="split:drop-ood:drop-rel:drop-mod", metavar=('TRANSFORMATION_MODEL'))
39 |
40 | # Optional [default value: 2]
41 | argparser.add_argument('--max-split', help='Maximum split size', choices=['2','3'], default='2', metavar=('MAX_SPLIT_SIZE'))
42 |
43 | # Optional [default value: agent:patient:eq:theme], (order is not important)
44 | argparser.add_argument('--restricted-drop-rel', help='Restricted drop relations', default="agent:patient:eq:theme", metavar=('RESTRICTED_DROP_REL'))
45 |
46 | # Optional [default value: jj:jjr:jjs:rb:rbr:rbs], (order is not important)
47 | argparser.add_argument('--allowed-drop-mod', help='Allowed drop modifiers', default="jj:jjr:jjs:rb:rbr:rbs", metavar=('ALLOWED_DROP_MOD'))
48 |
49 |     # Optional [default value: method-0.99-lteq-lt, the most recent method]
50 | argparser.add_argument('--method-training-graph', help='Operation set for training graph file', choices=['method-led-lt', 'method-led-lteq', 'method-0.5-lteq-lteq',
51 | 'method-0.75-lteq-lt', 'method-0.99-lteq-lt'],
52 | default='method-0.99-lteq-lt', metavar=('Method_Training_Graph'))
53 |
54 |     # Optional [default value: feature-Nov27, the most recent feature set]
55 | argparser.add_argument('--method-feature-extract', help='Operation set for extracting features', choices=['feature-init', 'feature-Nov27'], default='feature-Nov27',
56 | metavar=('Method_Feature_Extract'))
57 |
58 | # Optional [default value: /home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml]
59 | argparser.add_argument('--train-boxer-graph', help='The training corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Train_Boxer_Graph'),
60 | default='/home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml')
61 |
62 | # Optional [default value: 10]
63 | argparser.add_argument('--num-em', help='The number of EM Algorithm iterations', metavar=('NUM_EM_ITERATION'), default='10')
64 |
65 |     # Optional [default value: the newsela language model below; other language models are left commented out]
66 | argparser.add_argument('--lang-model', help='Language model information (in the moses format)', metavar=('Lang_Model'),
67 | default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/newsela_lm/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.train.dst.arpa.en:0")
68 | # default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/wikilarge_lm/wiki.full.aner.ori.train.dst.arpa.en:0")
69 | # default="0:3:/home/ankh/Desktop/Sentence-Simplification/LANGUAGE-MODEL/simplewiki-20131030-data.srilm:0")
70 |
71 |     # Optional (Compulsory when start state is >= 2)
72 | argparser.add_argument('--d2s-config', help='D2S Configuration file', metavar=('D2S_Config'))
73 |
74 |     # Compulsory
75 | argparser.add_argument('--output-dir', help='The output directory',required=True, metavar=('Output_Directory'))
76 | # #####################################
77 | args_dict = vars(argparser.parse_args(sys.argv[1:]))
78 | # #####################################
79 |
80 | # Creating the output directory to store training models
81 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
82 | print timestamp+", Creating the output directory: "+args_dict['output_dir']
83 | try:
84 | os.mkdir(args_dict['output_dir'])
85 | print
86 | except OSError:
87 | print args_dict['output_dir'] + " directory already exists.\n"
88 |
89 | # Configuration dictionary
90 | D2S_Config_data = {}
91 | D2S_Config = args_dict['d2s_config']
92 | if D2S_Config != None:
93 | D2S_Config_data = functions_configuration_file.parser_config_file(D2S_Config)
94 | else:
95 | D2S_Config_data["TRAIN-BOXER-GRAPH"] = args_dict['train_boxer_graph']
96 | D2S_Config_data["TRANSFORMATION-MODEL"] = args_dict['transformation'].split(":")
97 | D2S_Config_data["MAX-SPLIT-SIZE"] = int(args_dict['max_split'])
98 | D2S_Config_data["RESTRICTED-DROP-RELATION"] = args_dict['restricted_drop_rel'].split(":")
99 | D2S_Config_data["ALLOWED-DROP-MODIFIER"] = args_dict['allowed_drop_mod'].split(":")
100 | D2S_Config_data["METHOD-TRAINING-GRAPH"] = args_dict['method_training_graph']
101 | D2S_Config_data["METHOD-FEATURE-EXTRACT"] = args_dict['method_feature_extract']
102 | D2S_Config_data["NUM-EM-ITERATION"] = int(args_dict['num_em'])
103 | D2S_Config_data["LANGUAGE-MODEL"] = args_dict['lang_model']
104 |
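    # Illustrative shape of the resulting configuration with the default
    # argument values (paths abbreviated):
    #   D2S_Config_data = {
    #       "TRAIN-BOXER-GRAPH": ".../PWKP_108016.tokenized.boxer-graph.xml",
    #       "TRANSFORMATION-MODEL": ["split", "drop-ood", "drop-rel", "drop-mod"],
    #       "MAX-SPLIT-SIZE": 2,
    #       "RESTRICTED-DROP-RELATION": ["agent", "patient", "eq", "theme"],
    #       "ALLOWED-DROP-MODIFIER": ["jj", "jjr", "jjs", "rb", "rbr", "rbs"],
    #       "METHOD-TRAINING-GRAPH": "method-0.99-lteq-lt",
    #       "METHOD-FEATURE-EXTRACT": "feature-Nov27",
    #       "NUM-EM-ITERATION": 10,
    #       "LANGUAGE-MODEL": "0:3:<path-to-arpa-lm>:0",
    #   }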
105 |     # Extracting arguments with their default values (used unless otherwise specified)
106 | START_STATE = int(args_dict['start_state'])
107 | END_STATE = int(args_dict['end_state'])
108 |
109 | # Start state: 1, Starting building training graph
110 | state = 1
111 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])):
112 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
113 | print timestamp+", Starting building training graph (Step-"+str(state)+") ..."
114 |
115 | print "Input training file (xml, stanford tokenized and boxer graph): " + D2S_Config_data["TRAIN-BOXER-GRAPH"] + " ..."
116 | TRAIN_TRAINING_GRAPH = args_dict['output_dir']+"/"+os.path.splitext(os.path.basename(D2S_Config_data["TRAIN-BOXER-GRAPH"]))[0]+".training-graph.xml"
117 | print "Generating training graph file (xml, stanford tokenized, boxer graph and training graph): "+TRAIN_TRAINING_GRAPH+" ..."
118 |
119 | foutput = open(TRAIN_TRAINING_GRAPH, "w")
120 | foutput.write("\n")
121 | foutput.write("\n")
122 |
123 | print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..."
124 | training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("training", D2S_Config_data["TRAIN-BOXER-GRAPH"], foutput, D2S_Config_data["TRANSFORMATION-MODEL"],
125 | D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"],
126 | D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"])
127 |
128 | print "Start generating training graph ..."
129 | print "Start parsing "+D2S_Config_data["TRAIN-BOXER-GRAPH"]+" ..."
130 | training_xml_handler.parse_xmlfile_generating_training_graph()
131 |
132 | foutput.write("\n")
133 | foutput.close()
134 |
135 | D2S_Config_data["TRAIN-TRAINING-GRAPH"] = TRAIN_TRAINING_GRAPH
136 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
137 | print timestamp+", Finished building training graph (Step-"+str(state)+")\n"
138 |
139 | # Start state: 2
140 | state = 2
141 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])):
142 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
143 | print timestamp+", Starting learning transformation models (Step-"+str(state)+") ..."
144 |
145 | if "TRAIN-TRAINING-GRAPH" not in D2S_Config_data:
146 | print "The training graph file (xml, stanford tokenized, boxer graph and training graph) is not available."
147 | print "Please enter the configuration file or start with the State 1."
148 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
149 | print timestamp+", No transformation models learned (Step-"+str(state)+")\n"
150 | exit(0)
151 |
152 | # @ Defining data structure @
153 | # Stores various sentence pairs (complex, simple) for SMT.
154 | smt_sentence_pairs = {}
155 | # probability tables - store all probabilities
156 | probability_tables = {}
157 | # count tables - store counts in next iteration
158 | count_tables = {}
159 | # @ @
160 |
161 | print "Creating the em-training XML file (stanford tokenized, boxer graph and training graph) handler ..."
162 | em_training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph(D2S_Config_data["TRAIN-TRAINING-GRAPH"], D2S_Config_data["NUM-EM-ITERATION"],
163 | smt_sentence_pairs, probability_tables, count_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
164 |
165 | print "Start Expectation Maximization (Inside-Outside) algorithm ..."
166 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
167 | print timestamp+", Step "+str(state)+".1: Initialization of probability tables and populating smt_sentence_pairs ..."
168 | em_training_xml_handler.parse_to_initialize_probabilitytable()
169 | # print probability_tables
170 |
171 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
172 | print timestamp+", Step "+str(state)+".2: Start iterating for EM Inside-Outside probabilities ..."
173 | em_training_xml_handler.parse_to_iterate_probabilitytable()
174 | # print probability_tables
175 |
176 | # Start writing model files
177 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
178 | print timestamp+", Step "+str(state)+".3: Start writing model files ..."
179 | # Creating the output directory to store training models
180 | model_dir = args_dict['output_dir']+"/TRANSFORMATION-MODEL-DIR"
181 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
182 | print timestamp+", Creating the output model directory: "+model_dir
183 | try:
184 | os.mkdir(model_dir)
185 | except OSError:
186 | print model_dir + " directory already exists."
187 |     # Writing model files
188 | functions_model_files.write_model_files(model_dir, probability_tables, smt_sentence_pairs)
189 |
190 | D2S_Config_data["TRANSFORMATION-MODEL-DIR"] = model_dir
191 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
192 | print timestamp+", Finished learning transformation models (Step-"+str(state)+")\n"
193 |
194 | # Start state: 3
195 | state = 3
196 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])):
197 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
198 | print timestamp+", Starting learning moses translation model (Step-"+str(state)+") ..."
199 |
200 | if "TRANSFORMATION-MODEL-DIR" not in D2S_Config_data:
201 | print "The moses training files are not available."
202 | print "Please enter the configuration file or start with the State 1."
203 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
204 | print timestamp+", No moses models learned (Step-"+str(state)+")\n"
205 | exit(0)
206 |
207 | # Preparing the moses directory
208 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
209 | print timestamp+", Step "+str(state)+".1: Preparing the moses directory ..."
210 | # Creating the output directory to store moses files
211 | moses_dir = args_dict['output_dir']+"/MOSES-COMPLEX-SIMPLE-DIR"
212 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
213 | print timestamp+", Creating the moses directory: "+moses_dir
214 | try:
215 | os.mkdir(moses_dir)
216 | except OSError:
217 | print moses_dir + " directory already exists."
218 | # Creating the corpus directory
219 |     moses_corpus_dir = moses_dir+"/corpus"
220 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
221 | print timestamp+", Creating the moses corpus directory: "+moses_corpus_dir
222 | try:
223 | os.mkdir(moses_corpus_dir)
224 | except OSError:
225 | print moses_corpus_dir + " directory already exists."
226 |
227 | # Cleaning the moses training file
228 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
229 | print timestamp+", Step "+str(state)+".2: Cleaning the moses training file ..."
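    # clean-corpus-n.perl takes <corpus-prefix> <l1-ext> <l2-ext> <out-prefix>
    # <min> <max>: pairs whose source or target side falls outside 1..95 tokens
    # (or violates Moses' default 9:1 length-ratio cap) are dropped.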
230 | command = MOSESDIR+"/scripts/training/clean-corpus-n.perl "+D2S_Config_data["TRANSFORMATION-MODEL-DIR"]+"/D2S-SMT source target "+moses_corpus_dir+"/D2S-SMT-clean 1 95"
231 | os.system(command)
232 |
233 | # Running moses training
234 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
235 | print timestamp+", Step "+str(state)+".3: Running the moses training ..."
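    # train-model.perl runs the full Moses training pipeline: word alignment
    # with MGIZA (-mgiza, using the binaries under -external-bin-dir), phrase
    # extraction and scoring, and wiring in the language model given by the
    # LANGUAGE-MODEL entry of the configuration file. -f/-e name the "source"
    # (complex) and "target" (simple) sides of the cleaned corpus.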
236 | command = (MOSESDIR+"/scripts/training/train-model.perl -mgiza -mgiza-cpus 3 -cores 3 -parallel -sort-buffer-size 3G -sort-batch-size 253 -sort-compress gzip -sort-parallel 3 "+
237 | "-root-dir "+moses_dir+" -corpus "+moses_corpus_dir+"/D2S-SMT-clean -f source -e target -external-bin-dir "+MOSESDIR+"/mgiza/mgizapp/bin "+
238 | "-lm "+D2S_Config_data["LANGUAGE-MODEL"])
239 | os.system(command)
240 |
241 | D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"] = moses_dir
242 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
243 | print timestamp+", Finished learning moses translation model (Step-"+str(state)+")\n"
244 |
245 | # Last Step
246 | config_file = args_dict['output_dir']+"/d2s.ini"
247 | print "Writing the configuration file: "+config_file+" ..."
248 | functions_configuration_file.write_config_file(config_file, D2S_Config_data)
249 |
250 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
251 | print timestamp+", Learning process done!!!"
252 |
253 |
--------------------------------------------------------------------------------
/start_simplifying_complex_sentence.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : start_simplifying_complex_sentence.py =
4 | #description : Testing =
5 | #author          : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com} =
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 |
10 | import os
11 | import argparse
12 | import sys
13 | import datetime
14 | from nltk.metrics.distance import edit_distance
15 |
16 | sys.path.append("./source")
17 | import functions_configuration_file
18 | import functions_model_files
19 | import functions_prepare_elementtree_dot
20 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph
21 | from explore_decoder_graph_greedy import Explore_Decoder_Graph_Greedy
22 | from explore_decoder_graph_explorative import Explore_Decoder_Graph_Explorative
23 |
24 | MOSESDIR = "~/tools/mosesdecoder"
25 |
26 | def get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER,
27 | probability_tables, METHOD_FEATURE_EXTRACT):
28 | mapper_transformation = {}
29 | moses_input = {}
30 | transformation_complex_count = 0
31 |
32 | # Transformation decoder
33 | decoder_graph_explorer = Explore_Decoder_Graph_Greedy(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER,
34 | probability_tables, METHOD_FEATURE_EXTRACT)
35 | for sentid in test_sentids:
36 | print sentid
37 | sent_data = test_boxerdata_dict[str(sentid)]
38 | main_sentence = sent_data[0]
39 | main_sent_dict = sent_data[1]
40 | boxer_graph = sent_data[2]
41 |
42 | # Explore decoder graph
43 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph)
44 |
45 | # # Generating boxer and decoder graph
46 | # if sentid not in [13, 28, 41]:
47 | # functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, decoder_graph)
48 |
49 | sentence_pairs = decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph)
50 | transformed_sentences = [item[0] for item in sentence_pairs]
51 |
52 | # Writing transformation results
53 | mapper_transformation[sentid] = []
54 | for sent in transformed_sentences:
55 | mapper_transformation[sentid].append(transformation_complex_count)
56 | moses_input[transformation_complex_count] = sent
57 | transformation_complex_count += 1
58 | return mapper_transformation, moses_input
59 |
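# Both explorers (greedy above, explorative below) return the same two maps.
# An illustrative example with hypothetical ids and sentences: if complex
# sentence 7 is split into two simpler sentences, then
#   mapper_transformation = {7: [0, 1]}
#   moses_input = {0: "first simpler sentence", 1: "second simpler sentence"}
# Moses later translates each moses_input line, and the map re-groups the
# translated outputs per original sentence.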
60 | def get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER,
61 | probability_tables, METHOD_FEATURE_EXTRACT):
62 | mapper_transformation = {}
63 | moses_input = {}
64 | transformation_complex_count = 0
65 |
66 | # Transformation decoder
67 | decoder_graph_explorer = Explore_Decoder_Graph_Explorative(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER,
68 | probability_tables, METHOD_FEATURE_EXTRACT)
69 | for sentid in test_sentids:
70 | print sentid
71 | sent_data = test_boxerdata_dict[str(sentid)]
72 | main_sentence = sent_data[0]
73 | main_sent_dict = sent_data[1]
74 | boxer_graph = sent_data[2]
75 |
76 | # Explore decoder graph
77 | print "Building decoder graph ..."
78 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph)
79 |
80 |         # Update edge probabilities bottom-up; unseen operations back off to 0.5/0.5
81 | print "Updating probability bottom-up ..."
82 | node_probability_dict, potential_edges = decoder_graph_explorer.start_probability_update(main_sentence, main_sent_dict, boxer_graph, decoder_graph)
83 |
84 | # Filtered decoder graph
85 | print "Creating filtered decoder graph ..."
86 | filtered_decoder_graph = decoder_graph_explorer.create_filtered_decoder_graph(potential_edges, main_sentence, main_sent_dict, boxer_graph, decoder_graph)
87 |
88 | # Generating boxer and decoder graph
89 | functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, filtered_decoder_graph)
90 |
91 | sentence_pairs = filtered_decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph)
92 | transformed_sentences = [item[0] for item in sentence_pairs]
93 |
94 | # Writing transformation results
95 | mapper_transformation[sentid] = []
96 | for sent in transformed_sentences:
97 | mapper_transformation[sentid].append(transformation_complex_count)
98 | moses_input[transformation_complex_count] = sent
99 | transformation_complex_count += 1
100 | return mapper_transformation, moses_input
101 |
102 | if __name__ == "__main__":
103 |     argparser = argparse.ArgumentParser(prog='python start_simplifying_complex_sentence.py', description=('Start simplifying complex sentences.'))
104 |
105 | # Optional [default value: /home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml]
106 | argparser.add_argument('--test-boxer-graph', help='The test corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Test_Boxer_Graph'),
107 | default='/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml')
108 |
109 | # Optional [default value: 10]
110 |     argparser.add_argument('--nbest-distinct', help='Number of distinct n-best candidates produced by Moses', metavar=('N_Best_Distinct'), default='10')
111 |
112 | # Optional [default value: greedy]
113 | argparser.add_argument('--explore-decoder', help='Method for generating the decoder graph', metavar=('Explore_Decoder'), choices=['greedy', 'explorative'], default='greedy')
114 |
115 |     # Compulsory
116 | argparser.add_argument('--d2s-config', help='D2S Configuration file', required=True, metavar=('D2S_Config'))
117 |
118 |     # Compulsory
119 | argparser.add_argument('--output-dir', help='The output directory', required=True, metavar=('Output_Directory'))
120 | # #####################################
121 | args_dict = vars(argparser.parse_args(sys.argv[1:]))
122 | # #####################################
123 |
124 | # STEP:1 Creating test directory in the output directory
125 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
126 | test_output_directory = args_dict['output_dir']+"/Test-Results-"+args_dict["explore_decoder"].upper()
127 | print timestamp+", Creating test result directory: "+test_output_directory
128 | try:
129 | os.mkdir(test_output_directory)
130 | except OSError:
131 | print test_output_directory + " directory already exists."
132 |
133 | # STEP:2 Configuration dictionary
134 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
135 | print "\n"+timestamp+", Reading the D2S Configuration file ..."
136 | D2S_Config_data = functions_configuration_file.parser_config_file(args_dict['d2s_config'])
137 |
138 | # STEP:3 Reading transformation model files
139 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
140 | print "\n"+timestamp+", Reading transformation model files ..."
141 | probability_tables = functions_model_files.read_model_files(D2S_Config_data["TRANSFORMATION-MODEL-DIR"], D2S_Config_data["TRANSFORMATION-MODEL"])
142 |
143 |     # STEP:4 Reading the test corpus file (xml, stanford-tokenized, boxer-graph) ...
144 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
145 | print "\n"+timestamp+", Start reading test corpus file (xml, stanford-tokenized, boxer-graph): "+args_dict['test_boxer_graph']+" ..."
146 | print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..."
147 | test_boxerdata_dict = {}
148 | test_sentids = []
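    # The SAX handler fills test_boxerdata_dict keyed by sentence id; each
    # value is unpacked below (and in get_*_decoder_graph above) as
    # (main_sentence, main_sent_dict, boxer_graph).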
149 | testing_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("testing", args_dict['test_boxer_graph'], test_boxerdata_dict, D2S_Config_data["TRANSFORMATION-MODEL"],
150 | D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"],
151 | D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"])
152 | print "Start parsing "+args_dict['test_boxer_graph']+" ..."
153 | testing_xml_handler.parse_xmlfile_generating_training_graph()
154 | test_sentids = [int(item) for item in test_boxerdata_dict.keys()]
155 | test_sentids.sort()
156 |
157 | # STEP:5 Applying the transformation models and creating the output of transformation
158 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
159 | print "\n"+timestamp+", Applying the transformation models and writing complex sentences after transformation ..."
160 | mapper_transformation = {}
161 | moses_input = {}
162 | if args_dict["explore_decoder"] == "greedy":
163 | mapper_transformation, moses_input = get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
164 | D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
165 | probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
166 | else:
167 | mapper_transformation, moses_input = get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
168 | D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
169 | probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
170 |
171 | print "Writing "+test_output_directory+"/transformation-output.moses-input ..."
172 | d2s_complex_file = open(test_output_directory+"/transformation-output.moses-input", "w")
173 | for sentid in test_sentids:
174 | for moses_input_id in mapper_transformation[sentid]:
175 | transformed_sent = moses_input[moses_input_id]
176 | d2s_complex_file.write(transformed_sent.encode('utf-8')+"\n")
177 | d2s_complex_file.close()
178 |
179 | print "Writing "+test_output_directory+"/transformation-output.map ..."
180 | d2s_complex_map = open(test_output_directory+"/transformation-output.map", "w")
181 | sentids = mapper_transformation.keys()
182 | sentids.sort()
183 | for sentid in sentids:
184 | d2s_complex_map.write(str(sentid)+" ")
185 | for item in mapper_transformation[sentid]:
186 | d2s_complex_map.write(str(item)+" ")
187 | d2s_complex_map.write("\n")
188 | d2s_complex_map.close()
189 |
190 | print "Writing "+test_output_directory+"/transformation-output.simple ..."
191 | d2s_complex_file = open(test_output_directory+"/transformation-output.simple", "w")
192 | for sentid in test_sentids:
193 | simple_sentence = []
194 | for moses_input_id in mapper_transformation[sentid]:
195 | transformed_sent = moses_input[moses_input_id]
196 | simple_sentence.append(transformed_sent)
197 | d2s_complex_file.write((" ".join(simple_sentence)).encode('utf-8')+"\n")
198 | d2s_complex_file.close()
199 |
200 | # STEP:6 Running Moses
201 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
202 | print "\n"+timestamp+", Applying the moses translation model ..."
203 | command = (MOSESDIR+"/bin/moses -f "+D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"]+"/model/moses.ini "+
204 | "-n-best-list "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"+" "+args_dict['nbest_distinct']+" distinct "+
205 | "-input-file "+test_output_directory+"/transformation-output.moses-input")
206 | os.system(command)
207 |
208 | # Reading the moses output file
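    # Each line of the n-best list follows the Moses format
    #   <input line number> ||| <hypothesis> ||| <feature scores> ||| <total score>
    # so parts[0] keys back into moses_input and parts[1] is the simplified text.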
209 | print "Parsing the moses output file: "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"
210 | moses_output = {}
211 | finput = open(test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple", "r")
212 | datalines = finput.readlines()
213 | for line in datalines:
214 | parts = line.split(" ||| ")
215 | sentid = int(parts[0].strip())
216 | sent = parts[1].strip()
217 | if sentid not in moses_output:
218 | moses_output[sentid] = [sent]
219 | else:
220 | moses_output[sentid].append(sent)
221 | finput.close()
222 |
223 | # Storing the best moses output
224 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
225 | print "\n"+timestamp+", Best output of moses ..."
226 | final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple.best"
227 | print "Writing to the file: "+final_output_filename
228 | final_output_file = open(final_output_filename, "w")
229 | for sentid in test_sentids:
230 | simple_sentence = []
231 | for moses_input_id in mapper_transformation[sentid]:
232 | moses_simple_output_best = moses_output[moses_input_id][0]
233 | simple_sentence.append(moses_simple_output_best)
234 | final_output_file.write(" ".join(simple_sentence)+"\n")
235 | final_output_file.close()
236 |
237 | # Running posthoc reranking
238 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
239 | print "\n"+timestamp+", Running the post-hoc reranking ..."
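    # Reranking principle: score each Moses hypothesis by its character-level
    # edit distance from the transformed input it came from, and rank the most
    # different (most rewritten) candidate first, demoting near-copies.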
240 | posthoc_reranked = {}
241 | for sentid in test_sentids:
242 | for moses_input_id in mapper_transformation[sentid]:
243 | moses_complex_input = moses_input[moses_input_id]
244 | moses_simple_outputs = moses_output[moses_input_id]
245 |
246 | posthoc_reranked[moses_input_id] = []
247 | for simple_output in moses_simple_outputs:
248 | edit_dist = edit_distance(simple_output, moses_complex_input)
249 | posthoc_reranked[moses_input_id].append((edit_dist, simple_output))
250 |
251 |             # Candidates more different from the input are ranked at the top
252 | posthoc_reranked[moses_input_id].sort(reverse=True)
253 |
254 | # Writing post-hoc reranked output
255 | final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple"
256 | print "Writing to the file: "+final_output_filename
257 | final_output_file = open(final_output_filename, "w")
258 | for sentid in test_sentids:
259 | for moses_input_id in mapper_transformation[sentid]:
260 | for item in posthoc_reranked[moses_input_id]:
261 | final_output_file.write(str(moses_input_id)+"\t"+str(item[0])+"\t"+item[1]+"\n")
262 | final_output_file.write("\n")
263 | final_output_file.close()
264 |
265 | # Writing post-hoc reranked best output
266 | final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple.best"
267 | print "Writing to the file: "+final_output_filename
268 | final_output_file = open(final_output_filename, "w")
269 | for sentid in test_sentids:
270 | simple_sentence = []
271 | for moses_input_id in mapper_transformation[sentid]:
272 | simple_output_best = posthoc_reranked[moses_input_id][0][1]
273 | simple_sentence.append(simple_output_best)
274 | final_output_file.write(" ".join(simple_sentence)+"\n")
275 | final_output_file.close()
276 |
277 |
--------------------------------------------------------------------------------