├── LICENSE
├── README.md
├── preprocessing
│   ├── extract_wikipedia_corpora_boxer_test.py
│   └── extract_wikipedia_corpora_boxer_training.py
├── source
│   ├── boxer_graph_module.py
│   ├── em_inside_outside_algorithm.py
│   ├── explore_decoder_graph_explorative.py
│   ├── explore_decoder_graph_greedy.py
│   ├── explore_training_graph.py
│   ├── function_select_methods.py
│   ├── functions_configuration_file.py
│   ├── functions_model_files.py
│   ├── functions_prepare_elementtree_dot.py
│   ├── methods_feature_extract.py
│   ├── methods_training_graph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py
│   └── training_graph_module.py
├── start_learning_training_models.py
└── start_simplifying_complex_sentence.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
BSD 3-Clause License

Copyright (c) 2018, Shashi Narayan
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Hybrid Simplification using Deep Semantics and Machine Translation

Sentence simplification maps a sentence to a simpler, more readable
one approximating its content. In practice, simplification is often
modeled using four main operations: splitting a complex sentence into
several simpler sentences; dropping phrases or constituents;
reordering phrases or constituents; and substituting words/phrases
with simpler ones.

This is the implementation of our ACL'14 paper. We have modified our
code to let you choose which simplification operations you want to
apply to your complex sentences. Please see our paper for more
details, and contact Shashi Narayan
(shashi.narayan(at){ed.ac.uk,gmail.com}) with any queries.

If you use our code, please cite the following paper.
**Hybrid Simplification using Deep Semantics and Machine Translation,
Shashi Narayan and Claire Gardent, The 52nd Annual Meeting of the
Association for Computational Linguistics (ACL), Baltimore,
June 2014. https://aclweb.org/anthology/P/P14/P14-1041.pdf.**

> We present a hybrid approach to sentence simplification which
> combines deep semantics and monolingual machine translation to
> derive simple sentences from complex ones. The approach differs from
> previous work in two main ways. First, it is semantic based in that
> it takes as input a deep semantic representation rather than e.g., a
> sentence or a parse tree. Second, it combines a simplification model
> for splitting and deletion with a monolingual translation model for
> phrase substitution and reordering. When compared against current
> state of the art methods, our model yields significantly simpler
> output that is both grammatical and meaning preserving.

### Requirements

* Boxer 1.00: http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer, http://www.cl.cam.ac.uk/~sc609/candc-1.00.html
* Moses: http://www.statmt.org/moses/?n=Development.GetStarted
* Mgiza++: http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3
* NLTK toolkit: http://www.nltk.org/
* Python 2.7
* Stanford Toolkit: http://nlp.stanford.edu/software/tagger.html

### Data preparation

#### Training Data

* code: ./preprocessing/extract_wikipedia_corpora_boxer_training.py

* This code prepares the training data. It takes as input the tokenized
  training (complex, simple) sentences and the Boxer output (XML
  format) of the complex sentences.

* I will improve the interface of this script later. For now you
  have to set the following parameters (C: complex sentence, S: simple
  sentence):

    * ZHUDATA_FILE_ORG = Address of the file with combined
      complex-simple pairs. Format:
      C_1\nS^1_1\nS^2_1\n\nC_2\nS^1_2\nS^2_2\nS^3_2\n\n and so on
      (a parsing sketch for this format is given after this section).

    * ZHUDATA_FILE_MAIN = Address of the file with all tokenized complex
      sentences. Format: C_1\nC_2\n and so on.

    * ZHUDATA_FILE_SIMPLE = Address of the file with all tokenized
      simple sentences. Format: S^1_1\nS^2_1\nS^1_2\nS^2_2\nS^3_2\n and
      so on.

    * BOXER_DATADIR = Directory which contains the Boxer output
      of ZHUDATA_FILE_MAIN.

    * CHUNK_SIZE = Size of the Boxer output chunks. The script loads
      each Boxer XML file into memory before parsing it, so it is much
      faster to process ZHUDATA_FILE_MAIN in chunks (say, of 10000
      sentences).

    * boxer_main_filename = Boxer output file name pattern. For
      example:
      "filename."+str(lower_index)+"-"+str(lower_index+CHUNK_SIZE)

#### Test Data

* code: ./preprocessing/extract_wikipedia_corpora_boxer_test.py

* This code prepares the test data. It takes as input the tokenized test
  (complex) sentences and their Boxer outputs in XML format.

* I will improve the interface of this script later. For now you
  have to set the following parameters:

    * TEST_FILE_MAIN: Address of the file with all tokenized complex
      sentences. Format: C_1\nC_2\n and so on.

    * TEST_FILE_BOXER: Address of the Boxer XML output file for
      TEST_FILE_MAIN.
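For concreteness, the combined format of ZHUDATA_FILE_ORG can be read back into aligned (complex, simple-set) pairs with a few lines of Python. This is only an illustrative sketch, not part of the released scripts; the function name and path are made up:

```python
# Sketch: parse the blank-line-separated blocks of ZHUDATA_FILE_ORG.
# Each block holds one complex sentence followed by its simple sentences.
def read_aligned_pairs(path):
    pairs = []
    with open(path) as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        pairs.append((lines[0], lines[1:]))  # (C_i, [S^1_i, S^2_i, ...])
    return pairs

# e.g. pairs = read_aligned_pairs("zhudata.org")  # hypothetical path
```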
### Training

* Training goes through three states: 1) building the Boxer training
  graphs, 2) EM training, and 3) SMT training.

```
python start_learning_training_models.py --help

usage: python learn_training_models.py [-h] [--start-state Start_State]
                                       [--end-state End_State]
                                       [--transformation TRANSFORMATION_MODEL]
                                       [--max-split MAX_SPLIT_SIZE]
                                       [--restricted-drop-rel RESTRICTED_DROP_REL]
                                       [--allowed-drop-mod ALLOWED_DROP_MOD]
                                       [--method-training-graph Method_Training_Graph]
                                       [--method-feature-extract Method_Feature_Extract]
                                       [--train-boxer-graph Train_Boxer_Graph]
                                       [--num-em NUM_EM_ITERATION]
                                       [--lang-model Lang_Model]
                                       [--d2s-config D2S_Config] --output-dir
                                       Output_Directory

Start the training process.

optional arguments:
  -h, --help            show this help message and exit
  --start-state Start_State
                        Start state of the training process
  --end-state End_State
                        End state of the training process
  --transformation TRANSFORMATION_MODEL
                        Transformation models learned
  --max-split MAX_SPLIT_SIZE
                        Maximum split size
  --restricted-drop-rel RESTRICTED_DROP_REL
                        Restricted drop relations
  --allowed-drop-mod ALLOWED_DROP_MOD
                        Allowed drop modifiers
  --method-training-graph Method_Training_Graph
                        Operation set for training graph file
  --method-feature-extract Method_Feature_Extract
                        Operation set for extracting features
  --train-boxer-graph Train_Boxer_Graph
                        The training corpus file (xml, stanford-tokenized,
                        boxer-graph)
  --num-em NUM_EM_ITERATION
                        The number of EM Algorithm iterations
  --lang-model Lang_Model
                        Language model information (in the moses format)
  --d2s-config D2S_Config
                        D2S Configuration file
  --output-dir Output_Directory
                        The output directory
```

* Have a look at start_learning_training_models.py for more
  information on the arguments' definitions and their default values.

* train-boxer-graph: this is the output file from the training data
  preparation. (An illustrative end-to-end invocation is given after
  the Testing section below.)

### Testing

```
python start_simplifying_complex_sentence.py --help

usage: python simplify_complex_sentence.py [-h]
                                           [--test-boxer-graph Test_Boxer_Graph]
                                           [--nbest-distinct N_Best_Distinct]
                                           [--explore-decoder Explore_Decoder]
                                           --d2s-config D2S_Config
                                           --output-dir Output_Directory

Start simplifying complex sentences.

optional arguments:
  -h, --help            show this help message and exit
  --test-boxer-graph Test_Boxer_Graph
                        The test corpus file (xml, stanford-tokenized, boxer-
                        graph)
  --nbest-distinct N_Best_Distinct
                        N Best Distinct produced from Moses
  --explore-decoder Explore_Decoder
                        Method for generating the decoder graph
  --d2s-config D2S_Config
                        D2S Configuration file
  --output-dir Output_Directory
                        The output directory
```

* test-boxer-graph: this is the output file from the test data
  preparation.

* d2s-config: This is the output configuration file from the training
  stage.
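As a concrete illustration, a full train-then-simplify run could look like the following. All paths and option values here are hypothetical placeholders, not defaults shipped with the code:

```
# Training: build the Boxer training graphs, run EM, train the SMT models
python start_learning_training_models.py \
    --train-boxer-graph train.boxer-graph.xml \
    --num-em 50 \
    --lang-model lm.info \
    --output-dir ./models

# Testing: simplify with the configuration file written by training
python start_simplifying_complex_sentence.py \
    --test-boxer-graph test.boxer-graph.xml \
    --d2s-config ./models/d2s-config \
    --output-dir ./simplified
```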
### ToDo

* ToDo: Incorporate improvements from our arXiv
  paper. http://arxiv.org/pdf/1507.08452v1.pdf

    * OOD words at the border should be dropped.
    * Don't split at "TO".
    * Full stop at the end of the sentence (currently this is done as a
      post-processing step).

* ToDo: Change to an online version of sentence simplification.

--------------------------------------------------------------------------------
/preprocessing/extract_wikipedia_corpora_boxer_test.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
#===================================================================================
#title       : extract_wikipedia_corpora_boxer_test.py                             =
#description : Test data preparation                                               =
#author      : Shashi Narayan, shashi.narayan(at){ed.ac.uk,gmail.com})             =
#date        : Created in 2014, Later revised in April 2016.                       =
#version     : 0.1                                                                 =
#===================================================================================

import os
import argparse
import sys
import random
import base64
import uuid

import xml.etree.ElementTree as ET
from xml.dom import minidom

### Global Variables

# # Zhu
# TEST_FILE_MAIN="/home/ankh/Data/Simplification/Test-Data/complex.tokenized"
# TEST_FILE_BOXER="/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer.xml"

# # NewSella
# TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"
# TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/boxer/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"

# Wikilarge
# TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.test.src"
# TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.test.src"

TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.valid.src"
TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.valid.src"


class Boxer_XML_Handler:
    def parse_boxer_xml(self, xdrs_elt):
        arg_dict = {}
        rel_nodes = []
        extra_nodes = []

        # Process all dr
        for dr in xdrs_elt.iter('dr'):
            arg = dr.attrib["name"]
            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            for index in dr.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))

        # Process all prop
        for prop in xdrs_elt.iter('prop'):
            arg = prop.attrib["argument"]
            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            for index in prop.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))

        # Process all pred
        for pred in xdrs_elt.iter('pred'):
            arg = pred.attrib["arg"]
            symbol = pred.attrib["symbol"]

            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            predicate = [symbol, []]
            for index in pred.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))
                if (position, wordid) not in predicate[1]:
                    predicate[1].append((position,
wordid)) 87 | arg_dict[arg]["preds"].append(predicate) 88 | 89 | # Process all named 90 | for named in xdrs_elt.iter('named'): 91 | arg = named.attrib["arg"] 92 | symbol = named.attrib["symbol"] 93 | 94 | if arg not in arg_dict: 95 | arg_dict[arg] = {"position":[], "preds":[]} 96 | 97 | named_pred = [symbol, []] 98 | for index in named.findall('index'): 99 | position = int(index.attrib["pos"]) 100 | wordid = index.text 101 | 102 | if (position, wordid) not in arg_dict[arg]["position"]: 103 | arg_dict[arg]["position"].append((position, wordid)) 104 | if (position, wordid) not in named_pred[1]: 105 | named_pred[1].append((position, wordid)) 106 | arg_dict[arg]["preds"].append(named_pred) 107 | 108 | # Process all card 109 | for card in xdrs_elt.iter('card'): 110 | arg = card.attrib["arg"] 111 | value = card.attrib["value"] 112 | 113 | if arg not in arg_dict: 114 | arg_dict[arg] = {"position":[], "preds":[]} 115 | 116 | card_pred = [value, []] 117 | for index in card.findall('index'): 118 | position = int(index.attrib["pos"]) 119 | wordid = index.text 120 | 121 | if (position, wordid) not in arg_dict[arg]["position"]: 122 | arg_dict[arg]["position"].append((position, wordid)) 123 | if (position, wordid) not in card_pred[1]: 124 | card_pred[1].append((position, wordid)) 125 | arg_dict[arg]["preds"].append(card_pred) 126 | 127 | # Process all timex 128 | for timex in xdrs_elt.iter('timex'): 129 | arg = timex.attrib["arg"] 130 | datetime = "" 131 | for date in timex.iter('date'): 132 | datetime = date.text 133 | for time in timex.iter('time'): 134 | datetime = time.text 135 | 136 | if arg not in arg_dict: 137 | arg_dict[arg] = {"position":[], "preds":[]} 138 | 139 | timex_pred = [datetime, []] 140 | for index in timex.findall('index'): 141 | position = int(index.attrib["pos"]) 142 | wordid = index.text 143 | 144 | if (position, wordid) not in arg_dict[arg]["position"]: 145 | arg_dict[arg]["position"].append((position, wordid)) 146 | if (position, wordid) not in timex_pred[1]: 147 | timex_pred[1].append((position, wordid)) 148 | arg_dict[arg]["preds"].append(timex_pred) 149 | 150 | # Process not/or/imp/whq 151 | for not_node in xdrs_elt.iter('not'): 152 | index_list = not_node.findall('index') 153 | if len(index_list) != 0: 154 | not_pred = ["not", []] 155 | for index in index_list: 156 | position = int(index.attrib["pos"]) 157 | wordid = index.text 158 | if (position, wordid) not in not_pred[1]: 159 | not_pred[1].append((position, wordid)) 160 | extra_nodes.append(not_pred) 161 | for or_node in xdrs_elt.iter('or'): 162 | index_list = or_node.findall('index') 163 | if len(index_list) != 0: 164 | or_pred = ["or", []] 165 | for index in index_list: 166 | position = int(index.attrib["pos"]) 167 | wordid = index.text 168 | if (position, wordid) not in or_pred[1]: 169 | or_pred[1].append((position, wordid)) 170 | extra_nodes.append(or_pred) 171 | for imp_node in xdrs_elt.iter('imp'): 172 | index_list = imp_node.findall('index') 173 | if len(index_list) != 0: 174 | imp_pred = ["imp", []] 175 | for index in index_list: 176 | position = int(index.attrib["pos"]) 177 | wordid = index.text 178 | if (position, wordid) not in imp_pred[1]: 179 | imp_pred[1].append((position, wordid)) 180 | extra_nodes.append(imp_pred) 181 | for whq_node in xdrs_elt.iter('whq'): 182 | index_list = whq_node.findall('index') 183 | if len(index_list) != 0: 184 | whq_pred = ["whq", []] 185 | for index in index_list: 186 | position = int(index.attrib["pos"]) 187 | wordid = index.text 188 | if (position, wordid) not in whq_pred[1]: 
189 | whq_pred[1].append((position, wordid)) 190 | extra_nodes.append(whq_pred) 191 | 192 | # Process Rel node 193 | for rel_node in xdrs_elt.iter('rel'): 194 | arg1 = rel_node.attrib["arg1"] 195 | arg2 = rel_node.attrib["arg2"] 196 | symbol = rel_node.attrib["symbol"] 197 | 198 | if arg1 not in arg_dict: 199 | arg_dict[arg1] = {"position":[], "preds":[]} 200 | if arg2 not in arg_dict: 201 | arg_dict[arg2] = {"position":[], "preds":[]} 202 | 203 | index_list = rel_node.findall('index') 204 | if (len(index_list) != 0) or (len(index_list) == 0 and symbol=="nn"): 205 | if symbol=="nn": 206 | # Revert arguments 207 | temp = arg2 208 | arg2 = arg1 209 | arg1 = temp 210 | rel_nodes.append([symbol, arg1, arg2, []]) 211 | 212 | elif symbol=="eq": 213 | if len(index_list) == 1: 214 | position = int(index_list[0].attrib["pos"]) 215 | wordid = index_list[0].text 216 | 217 | for arg in arg_dict: 218 | if (position, wordid) in arg_dict[arg]["position"]: 219 | if ["event", []] in arg_dict[arg]["preds"]: 220 | rel_nodes.append([symbol, arg, arg1, [(position, wordid)]]) 221 | rel_nodes.append([symbol, arg, arg2, [(position, wordid)]]) 222 | else: 223 | rel_pred = [symbol, arg1, arg2, []] 224 | for index in index_list: 225 | position = int(index.attrib["pos"]) 226 | wordid = index.text 227 | if (position, wordid) not in rel_pred[3]: 228 | rel_pred[3].append((position, wordid)) 229 | rel_nodes.append(rel_pred) 230 | 231 | 232 | return arg_dict, rel_nodes, extra_nodes 233 | 234 | ### Extra Functions 235 | 236 | def prettify(elem): 237 | """Return a pretty-printed XML string for the Element. 238 | """ 239 | rough_string = ET.tostring(elem) 240 | reparsed = minidom.parseString(rough_string) 241 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ") 242 | return prettyxml.encode("utf-8") 243 | 244 | ################### 245 | 246 | class Boxer_Element_Creator: 247 | def graph_boxer_element(self, xdrs_elt, sent_span): 248 | # Parse the XDRS 249 | arg_dict, rel_nodes, extra_nodes = Boxer_XML_Handler().parse_boxer_xml(xdrs_elt) 250 | #print arg_dict, rel_nodes, extra_nodes 251 | 252 | # Adding extra nodes to arg_dict/node dict 253 | extra_node_count = 1 254 | for nodeinfo in extra_nodes: 255 | arg_dict["E"+str(extra_node_count)] = {"position":nodeinfo[1][:], "preds":[nodeinfo[:]]} 256 | extra_node_count += 1 257 | 258 | # Creating edge list and relation dict 259 | edge_list = [] 260 | rel_dict = {} 261 | 262 | rel_node_count = 1 263 | for relinfo in rel_nodes: 264 | relation = relinfo[0] 265 | parentarg = relinfo[1] 266 | deparg = relinfo[2] 267 | indexlist = relinfo[3] 268 | 269 | rel_dict["R"+str(rel_node_count)] = [relation, indexlist[:]] 270 | edge_list.append((parentarg, deparg, "R"+str(rel_node_count))) 271 | rel_node_count += 1 272 | 273 | # Get the boxer span 274 | boxer_span = [] 275 | for arg in arg_dict: 276 | for position in arg_dict[arg]["position"]: 277 | if position not in boxer_span: 278 | boxer_span.append(position) 279 | for relarg in rel_dict: 280 | for position in rel_dict[relarg][1]: 281 | if position not in boxer_span: 282 | boxer_span.append(position) 283 | boxer_span.sort() 284 | boxer_span_ids = [elt[1] for elt in boxer_span] 285 | # Adding nodes for left out word positions in discourse 286 | out_of_dis_count = 1 287 | for elt in sent_span: 288 | if elt not in boxer_span_ids: 289 | arg_dict["OOD"+str(out_of_dis_count)] = {"position":[(-1, elt)], "preds":[]} 290 | out_of_dis_count += 1 291 | 292 | # Prepare Boxer Tree 293 | boxer = ET.Element('box') 294 | nodes = ET.SubElement(boxer, 
"nodes") 295 | for arg in arg_dict: 296 | # print arg 297 | # print arg_dict[arg] 298 | node = ET.SubElement(nodes, "node") 299 | node.attrib = {"sym":arg} 300 | 301 | span = ET.SubElement(node, "span") 302 | position_list = arg_dict[arg]["position"] 303 | position_list.sort() 304 | for position in position_list: 305 | location = ET.SubElement(span, "loc") 306 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 307 | 308 | preds = ET.SubElement(node, "preds") 309 | predicate_list = arg_dict[arg]["preds"] 310 | predicate_list.sort() 311 | for predinfo in predicate_list: 312 | # print predinfo 313 | pred = ET.SubElement(preds, "pred") 314 | predsymbol = predinfo[0] 315 | pred.attrib = {"sym":predsymbol} 316 | 317 | position_list = predinfo[1] 318 | position_list.sort() 319 | for position in position_list: 320 | location = ET.SubElement(pred, "loc") 321 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 322 | 323 | rels = ET.SubElement(boxer, "rels") 324 | for relarg in rel_dict: 325 | rel = ET.SubElement(rels, "rel") 326 | rel.attrib = {"sym":relarg} 327 | pred = ET.SubElement(rel, "pred") 328 | pred.attrib = {"sym":rel_dict[relarg][0]} 329 | span = ET.SubElement(rel, "span") 330 | position_list = rel_dict[relarg][1] 331 | position_list.sort() 332 | for position in position_list: 333 | location = ET.SubElement(span, "loc") 334 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 335 | 336 | edges = ET.SubElement(boxer, "edges") 337 | for edgeinfo in edge_list: 338 | edge = ET.SubElement(edges, "edge") 339 | edge.attrib = {"par":edgeinfo[0], "dep":edgeinfo[1], "lab":edgeinfo[2]} 340 | return boxer 341 | 342 | def construct_boxer_element(self, xdrs_data): 343 | sentid = int(xdrs_data.get('{http://www.w3.org/XML/1998/namespace}id')[1:]) 344 | 345 | # Creating Sentence Element 346 | sentence_xml = "" 347 | sentence_span = [] 348 | sentence = ET.Element('s') 349 | words = (xdrs_data.find("words")).findall("word") 350 | postags = (xdrs_data.find("postags")).findall("postag") 351 | for word_elt, postag_elt in zip(words, postags): 352 | word = word_elt.text 353 | wordid = word_elt.get('{http://www.w3.org/XML/1998/namespace}id')# ("xml:id") 354 | pos = postag_elt.text 355 | posid = postag_elt.get("index") 356 | 357 | if wordid != posid: 358 | print "Warning: Both ids did not match." 
359 | exit(0) 360 | else: 361 | sentence_xml += word +" " 362 | word_newelt = ET.SubElement(sentence, 'w') 363 | word_newelt.attrib = {"id":wordid, "pos":pos} 364 | word_newelt.text = word 365 | 366 | sentence_span.append(wordid) 367 | 368 | # Creating the head element 369 | headelt = ET.Element("main") 370 | headelt.append(sentence) 371 | 372 | # Creating boxer element 373 | boxer = self.graph_boxer_element(xdrs_data, sentence_span) 374 | headelt.append(boxer) 375 | return sentid, headelt 376 | 377 | def create_sentence_elt(self, main_sent): 378 | sentence = ET.Element('s') 379 | sentence.text = main_sent.decode('utf-8') 380 | 381 | # Creating the head element 382 | headelt = ET.Element("main") 383 | headelt.append(sentence) 384 | 385 | # Creating boxer element 386 | boxer = ET.Element("box") 387 | headelt.append(boxer) 388 | return headelt 389 | 390 | def get_sentence_elt(self, boxer_xml_file): 391 | sentid_sentence_dict = {} 392 | xdrs_output = ET.parse(boxer_xml_file) 393 | xdrs_list = xdrs_output.findall('xdrs') 394 | for xdrs_item in xdrs_list: 395 | sentid, head_elt = self.construct_boxer_element(xdrs_item) 396 | sentid_sentence_dict[sentid] = head_elt 397 | return sentid_sentence_dict 398 | 399 | if __name__ == "__main__": 400 | datadir = os.path.dirname(TEST_FILE_MAIN) 401 | 402 | # Start parsing Zhu Data file - Tokenized ..." 403 | print "\nStart preparing Test Data file - Tokenized ..." 404 | print "Start parsing "+TEST_FILE_MAIN+" ..." 405 | main_wiki = [] 406 | fdata = open(TEST_FILE_MAIN, "r") 407 | main_wiki = fdata.read().strip().split("\n") 408 | fdata.close() 409 | print "Total number of sentences from Test (tokenized): " + str(len(main_wiki)) 410 | 411 | # Call boxer element creator 412 | print "\nStart boxer element creator ..." 413 | boxer_elt_creator = Boxer_Element_Creator() 414 | sentid_sentence_dict = boxer_elt_creator.get_sentence_elt(TEST_FILE_BOXER) 415 | 416 | # Start Writing final files 417 | print "Generating XML file : "+TEST_FILE_MAIN+".boxer-graph.xml ..." 418 | foutput = open(TEST_FILE_MAIN+".boxer-graph.xml", "w") 419 | foutput.write("\n") 420 | foutput.write("\n") 421 | 422 | main_index = 1 423 | for main_sent in main_wiki: 424 | # Start creating sentence subelement 425 | sentence = ET.Element('sentence') 426 | sentence.attrib={"id":str(main_index)} 427 | 428 | main_elt = "" 429 | if main_index in sentid_sentence_dict: 430 | main_elt = sentid_sentence_dict[main_index] 431 | else: 432 | main_elt = boxer_elt_creator.create_sentence_elt(main_sent) 433 | 434 | sentence.append(main_elt) 435 | foutput.write(prettify(sentence)) 436 | main_index += 1 437 | 438 | 439 | foutput.write("\n") 440 | foutput.close() 441 | 442 | print len(sentid_sentence_dict) 443 | 444 | print sentid_sentence_dict.keys() 445 | -------------------------------------------------------------------------------- /source/em_inside_outside_algorithm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : em_inside_outside_algorithm.py = 4 | #description : EM Algorithm = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import function_select_methods 12 | 13 | class EM_InsideOutside_Optimiser: 14 | def __init__(self, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT): 15 | self.smt_sentence_pairs = smt_sentence_pairs 16 | self.probability_tables = probability_tables 17 | self.count_tables = count_tables 18 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 19 | 20 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT) 21 | 22 | def initialize_probabilitytable_smt_input(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 23 | #print sentid 24 | 25 | # Process all oper nodes 26 | for oper_node in training_graph.oper_nodes: 27 | oper_type = training_graph.get_opernode_type(oper_node) 28 | if oper_type not in self.probability_tables: 29 | self.probability_tables[oper_type] = {} 30 | if oper_type not in self.count_tables: 31 | self.count_tables[oper_type] = {} 32 | 33 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 34 | children_major_nodes = training_graph.find_children_of_opernode(oper_node) 35 | 36 | if oper_type == "split": 37 | # Parent main sentence 38 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 39 | parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node) 40 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos) 41 | 42 | # Children sentences 43 | children_sentences = [] 44 | for child_major_node in children_major_nodes: 45 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node) 46 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node) 47 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos) 48 | children_sentences.append(child_sentence) 49 | 50 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node) 51 | 52 | #print split_candidate 53 | 54 | if split_candidate != None: 55 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 56 | if split_feature not in self.probability_tables["split"]: 57 | self.probability_tables["split"][split_feature] = {"true":0.5, "false":0.5} 58 | if split_feature not in self.count_tables["split"]: 59 | self.count_tables["split"][split_feature] = {"true":0, "false":0} 60 | else: 61 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 62 | #print not_applied_cands 63 | for split_candidate_left in not_applied_cands: 64 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 65 | #print split_feature_left 66 | if split_feature_left not in self.probability_tables["split"]: 67 | self.probability_tables["split"][split_feature_left] = {"true":0.5, "false":0.5} 68 | if split_feature_left not in self.count_tables["split"]: 69 | self.count_tables["split"][split_feature_left] = {"true":0, "false":0} 70 | #print self.probability_tables["split"] 71 | 72 | if oper_type == "drop-rel": 73 | rel_node = training_graph.get_opernode_oper_candidate(oper_node) 74 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 75 | drop_rel_feature = 
self.method_feature_extract.get_drop_rel_feature(rel_node, parent_nodeset, main_sent_dict, boxer_graph) 76 | if drop_rel_feature not in self.probability_tables["drop-rel"]: 77 | self.probability_tables["drop-rel"][drop_rel_feature] = {"true":0.5, "false":0.5} 78 | if drop_rel_feature not in self.count_tables["drop-rel"]: 79 | self.count_tables["drop-rel"][drop_rel_feature] = {"true":0, "false":0} 80 | 81 | if oper_type == "drop-mod": 82 | mod_cand = training_graph.get_opernode_oper_candidate(oper_node) 83 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_cand, main_sent_dict, boxer_graph) 84 | if drop_mod_feature not in self.probability_tables["drop-mod"]: 85 | self.probability_tables["drop-mod"][drop_mod_feature] = {"true":0.5, "false":0.5} 86 | if drop_mod_feature not in self.count_tables["drop-mod"]: 87 | self.count_tables["drop-mod"][drop_mod_feature] = {"true":0, "false":0} 88 | 89 | if oper_type == "drop-ood": 90 | ood_node = training_graph.get_opernode_oper_candidate(oper_node) 91 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 92 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_node, parent_nodeset, main_sent_dict, boxer_graph) 93 | if drop_ood_feature not in self.probability_tables["drop-ood"]: 94 | self.probability_tables["drop-ood"][drop_ood_feature] = {"true":0.5, "false":0.5} 95 | if drop_ood_feature not in self.count_tables["drop-ood"]: 96 | self.count_tables["drop-ood"][drop_ood_feature] = {"true":0, "false":0} 97 | 98 | #print self.probability_tables["split"]['as-as-patient_eq-eq_1'] 99 | # if int(sentid) <= 3: 100 | # print self.probability_tables["split"] 101 | 102 | # Extract all sentence pairs for SMT from all "fin" major nodes 103 | self.smt_sentence_pairs[sentid] = training_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph) 104 | 105 | def reset_count_table(self): 106 | for oper_type in self.count_tables: # split, drop-rel, drop-mod, drop-ood 107 | for oper_feature_key in self.count_tables[oper_type]: # feature patterns 108 | for val in self.count_tables[oper_type][oper_feature_key]: # true, false 109 | self.count_tables[oper_type][oper_feature_key][val] = 0 110 | 111 | def iterate_over_probabilitytable(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 112 | #print sentid 113 | # Calculating beta-probability, inside probability 114 | #print "Calculating beta-probabilities (Inside probability) ..." 115 | bottom_nodes = training_graph.find_all_fin_majornode() 116 | beta_prob = self.calculate_inside_probability({}, bottom_nodes, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 117 | #print beta_prob 118 | 119 | # Calculating alpha-probability, outside probability 120 | #print "Calculating alpha-probabilities (Outside probability) ..." 121 | root_node = "MN-1" 122 | alpha_prob = self.calculate_outside_probability({}, [root_node], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob) 123 | #print alpha_prob 124 | 125 | # Updating counts for each operation happened in this sentence 126 | #print "Updating counts of each operation happened in this training sentence ..." 
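        # Expected count of each operation node o, computed below as
        #   count(o) = alpha(o) * beta(o) / beta("MN-1")
        # i.e. outside probability times inside probability, normalised by
        # the inside probability of the root major node.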
127 | self.update_count_for_operations(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob) 128 | 129 | def calculate_outside_probability(self, alpha_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob): 130 | if len(tgnodes_to_process) == 0: 131 | return alpha_prob 132 | 133 | tgnode = tgnodes_to_process[0] 134 | if tgnode.startswith("MN"): 135 | # Major Nodes 136 | if tgnode == "MN-1": 137 | # Root major node 138 | alpha_prob[tgnode] = 1 139 | else: 140 | parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode) 141 | alpha_prob_tgnode = 0 142 | for parent_oper_node in parents_oper_nodes: 143 | alpha_prob_parent_oper_node = alpha_prob[parent_oper_node] 144 | children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node) 145 | beta_prod_product = 1 146 | for child_major_node in children_major_nodes: 147 | if child_major_node != tgnode: 148 | beta_prod_product = beta_prod_product * beta_prob[child_major_node] 149 | alpha_prob_tgnode += alpha_prob_parent_oper_node * beta_prod_product 150 | alpha_prob[tgnode] = alpha_prob_tgnode 151 | 152 | # Adding children to tgnodes_to_process 153 | children_oper_nodes = training_graph.find_children_of_majornode(tgnode) 154 | for child_oper_node in children_oper_nodes: 155 | # Check its not already inserted 156 | if (child_oper_node not in alpha_prob) and (child_oper_node not in tgnodes_to_process): 157 | # Check its parent is already inserted 158 | parent_major_node = training_graph.find_parent_of_opernode(child_oper_node) 159 | if (parent_major_node in alpha_prob) or (parent_major_node in tgnodes_to_process): 160 | tgnodes_to_process.append(child_oper_node) 161 | else: 162 | # Oper nodes 163 | parent_major_node = training_graph.find_parent_of_opernode(tgnode) 164 | alpha_prob_tgnode = alpha_prob[parent_major_node] * self.fetch_probability(tgnode, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 165 | alpha_prob[tgnode] = alpha_prob_tgnode 166 | 167 | # Adding children to tgnodes_to_process 168 | children_major_nodes = training_graph.find_children_of_opernode(tgnode) 169 | for child_major_node in children_major_nodes: 170 | # Check its not already inserted 171 | if (child_major_node not in alpha_prob) and (child_major_node not in tgnodes_to_process): 172 | # Check all its parents are already inserted 173 | parents_oper_nodes = training_graph.find_parents_of_majornode(child_major_node) 174 | flag = True 175 | for parent_oper_node in parents_oper_nodes: 176 | if (parent_oper_node not in alpha_prob) and (parent_oper_node not in tgnodes_to_process): 177 | flag = False 178 | break 179 | if flag == True: 180 | tgnodes_to_process.append(child_major_node) 181 | 182 | alpha_prob = self.calculate_outside_probability(alpha_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob) 183 | return alpha_prob 184 | 185 | def calculate_inside_probability(self, beta_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 186 | if len(tgnodes_to_process) == 0: 187 | return beta_prob 188 | 189 | tgnode = tgnodes_to_process[0] 190 | if tgnode.startswith("MN"): 191 | # Major nodes 192 | major_node_type = training_graph.get_majornode_type(tgnode) 193 | if major_node_type == "fin": 194 | # Leaf major nodes 195 | beta_prob[tgnode] = 1 196 | else: 197 | children_oper_nodes = 
training_graph.find_children_of_majornode(tgnode)
                beta_prob_tgnode = 0
                for child_oper_node in children_oper_nodes:
                    beta_prob_tgnode += self.fetch_probability(child_oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) * beta_prob[child_oper_node]
                beta_prob[tgnode] = beta_prob_tgnode

            # Adding parents to tgnodes_to_process
            parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode)
            for parent_oper_node in parents_oper_nodes:
                # Check it's not already inserted
                if (parent_oper_node not in beta_prob) and (parent_oper_node not in tgnodes_to_process):
                    # Check all its children are already inserted
                    children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node)
                    flag = True
                    for child_major_node in children_major_nodes:
                        if (child_major_node not in beta_prob) and (child_major_node not in tgnodes_to_process):
                            flag = False
                            break
                    if flag == True:
                        tgnodes_to_process.append(parent_oper_node)
        else:
            # Oper nodes
            children_major_nodes = training_graph.find_children_of_opernode(tgnode)
            beta_prob_tgnode = 1
            for child_major_node in children_major_nodes:
                beta_prob_tgnode = beta_prob_tgnode * beta_prob[child_major_node]
            beta_prob[tgnode] = beta_prob_tgnode

            # Adding parent to tgnodes_to_process
            parent_major_node = training_graph.find_parent_of_opernode(tgnode)
            # Check it's not already inserted
            if (parent_major_node not in beta_prob) and (parent_major_node not in tgnodes_to_process):
                # Check all its children are already inserted
                children_oper_nodes = training_graph.find_children_of_majornode(parent_major_node)
                flag = True
                for child_oper_node in children_oper_nodes:
                    if (child_oper_node not in beta_prob) and (child_oper_node not in tgnodes_to_process):
                        flag = False
                        break
                if flag == True:
                    tgnodes_to_process.append(parent_major_node)

        beta_prob = self.calculate_inside_probability(beta_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
        return beta_prob

    def fetch_probability(self, oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
        oper_node_type = training_graph.get_opernode_type(oper_node)
        if oper_node_type == "split":
            # Parent main sentence
            parent_major_node = training_graph.find_parent_of_opernode(oper_node)
            parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
            parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node)
            parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos)

            # Children sentences
            children_major_nodes = training_graph.find_children_of_opernode(oper_node)
            children_sentences = []
            for child_major_node in children_major_nodes:
                child_nodeset = training_graph.get_majornode_nodeset(child_major_node)
                child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node)
                child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos)
                children_sentences.append(child_sentence)

            total_probability = 1
            split_candidate = training_graph.get_opernode_oper_candidate(oper_node)
            if split_candidate != None:
                split_feature =
self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 264 | total_probability = self.probability_tables["split"][split_feature]["true"] 265 | return total_probability 266 | else: 267 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 268 | for split_candidate_left in not_applied_cands: 269 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 270 | total_probability = total_probability * self.probability_tables["split"][split_feature_left]["false"] 271 | return total_probability 272 | 273 | elif oper_node_type == "drop-rel": 274 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 275 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 276 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node) 277 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph) 278 | isDropped = training_graph.get_opernode_drop_result(oper_node) 279 | prob_value = 0 280 | if isDropped == "True": 281 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["true"] 282 | else: 283 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["false"] 284 | return prob_value 285 | 286 | elif oper_node_type == "drop-mod": 287 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node) 288 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph) 289 | isDropped = training_graph.get_opernode_drop_result(oper_node) 290 | prob_value = 0 291 | if isDropped == "True": 292 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["true"] 293 | else: 294 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["false"] 295 | return prob_value 296 | 297 | elif oper_node_type == "drop-ood": 298 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 299 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 300 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node) 301 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph) 302 | isDropped = training_graph.get_opernode_drop_result(oper_node) 303 | prob_value = 0 304 | if isDropped == "True": 305 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["true"] 306 | else: 307 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["false"] 308 | return prob_value 309 | 310 | def update_count_for_operations(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob): 311 | # Process all oper nodes 312 | for oper_node in training_graph.oper_nodes: 313 | # Calculating count 314 | root_inside_prob = beta_prob["MN-1"] 315 | oper_node_inside_prob = beta_prob[oper_node] 316 | oper_node_outside_prob = alpha_prob[oper_node] 317 | count_oper_node = (oper_node_inside_prob * oper_node_outside_prob) / root_inside_prob 318 | 319 | oper_node_type = training_graph.get_opernode_type(oper_node) 320 | if oper_node_type == "split": 321 | # Parent main sentence 322 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 323 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 324 | parent_filtered_mod_pos = 
training_graph.get_majornode_filtered_postions(parent_major_node) 325 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos) 326 | 327 | # Children sentences 328 | children_major_nodes = training_graph.find_children_of_opernode(oper_node) 329 | children_sentences = [] 330 | for child_major_node in children_major_nodes: 331 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node) 332 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node) 333 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos) 334 | children_sentences.append(child_sentence) 335 | 336 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node) 337 | if split_candidate != None: 338 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 339 | self.count_tables["split"][split_feature]["true"] += count_oper_node 340 | else: 341 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 342 | for split_candidate_left in not_applied_cands: 343 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 344 | self.count_tables["split"][split_feature_left]["false"] += count_oper_node 345 | 346 | elif oper_node_type == "drop-rel": 347 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 348 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 349 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node) 350 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph) 351 | isDropped = training_graph.get_opernode_drop_result(oper_node) 352 | if isDropped == "True": 353 | self.count_tables["drop-rel"][drop_rel_feature]["true"] += count_oper_node 354 | else: 355 | self.count_tables["drop-rel"][drop_rel_feature]["false"] += count_oper_node 356 | 357 | elif oper_node_type == "drop-mod": 358 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node) 359 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph) 360 | isDropped = training_graph.get_opernode_drop_result(oper_node) 361 | if isDropped == "True": 362 | self.count_tables["drop-mod"][drop_mod_feature]["true"] += count_oper_node 363 | else: 364 | self.count_tables["drop-mod"][drop_mod_feature]["false"] += count_oper_node 365 | 366 | elif oper_node_type == "drop-ood": 367 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 368 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 369 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node) 370 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph) 371 | isDropped = training_graph.get_opernode_drop_result(oper_node) 372 | if isDropped == "True": 373 | self.count_tables["drop-ood"][drop_ood_feature]["true"] += count_oper_node 374 | else: 375 | self.count_tables["drop-ood"][drop_ood_feature]["false"] += count_oper_node 376 | 377 | def update_probability_table(self): 378 | for oper_type in self.probability_tables: # split, drop-ood, drop-rel, drop-mod 379 | for oper_feature_key in self.probability_tables[oper_type]: # feature patterns 380 | totalSum = 0 381 | for 
val in self.probability_tables[oper_type][oper_feature_key]: 382 | totalSum += self.count_tables[oper_type][oper_feature_key][val] 383 | 384 | # if totalSum == 0: 385 | # print oper_type 386 | # print oper_feature_key 387 | # print self.probability_tables[oper_type][oper_feature_key] 388 | # print self.count_tables[oper_type][oper_feature_key] 389 | 390 | if totalSum == 0: 391 | for val in self.probability_tables[oper_type][oper_feature_key]: 392 | self.probability_tables[oper_type][oper_feature_key][val] = 0.5 # Uniform 1.0/len(self.probability_tables[oper_type][oper_feature_key]) 393 | else: 394 | for val in self.probability_tables[oper_type][oper_feature_key]: 395 | self.probability_tables[oper_type][oper_feature_key][val] = self.count_tables[oper_type][oper_feature_key][val] / totalSum 396 | -------------------------------------------------------------------------------- /source/explore_decoder_graph_greedy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : explore_decoder_graph_greedy.py = 4 | #description : Greedy decoder = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from training_graph_module import Training_Graph 11 | import function_select_methods 12 | 13 | class Explore_Decoder_Graph_Greedy: 14 | def __init__(self, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, probability_tables, METHOD_FEATURE_EXTRACT): 15 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 16 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 17 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 18 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 19 | 20 | self.probability_tables = probability_tables 21 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 22 | 23 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT) 24 | 25 | def explore_decoder_graph(self, sentid, main_sentence, main_sent_dict, boxer_graph): 26 | # Start a decoder graph 27 | decoder_graph = Training_Graph() 28 | nodes_2_process = [] 29 | 30 | # Check if Discourse information is available 31 | if boxer_graph.isEmpty(): 32 | # Adding finishing major node 33 | nodeset = boxer_graph.get_nodeset() 34 | filtered_mod_pos = [] 35 | simple_sentences = [] 36 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 37 | 38 | # Creating major node 39 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 40 | nodes_2_process.append(majornode_name) # isNew = True 41 | else: 42 | # DRS data is available for the complex sentence 43 | # Check to add the starting node 44 | nodeset = boxer_graph.get_nodeset() 45 | majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "split", nodeset, [], []) 46 | nodes_2_process.append(majornode_name) # isNew = True 47 | 48 | # Start expanding the decoder graph 49 | self.expand_decoder_graph(nodes_2_process[:], main_sent_dict, boxer_graph, decoder_graph) 50 | return decoder_graph 51 | 52 | def expand_decoder_graph(self, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 53 | if len(nodes_2_process) == 0: 54 | return 55 | 56 | node_name = nodes_2_process[0] 57 | operreq = 
decoder_graph.get_majornode_type(node_name) 58 | nodeset = decoder_graph.get_majornode_nodeset(node_name)[:] 59 | oper_candidates = decoder_graph.get_majornode_oper_candidates(node_name)[:] 60 | processed_oper_candidates = decoder_graph.get_majornode_processed_oper_candidates(node_name)[:] 61 | filtered_postions = decoder_graph.get_majornode_filtered_postions(node_name)[:] 62 | 63 | #print node_name, decoder_graph.major_nodes[node_name] 64 | 65 | #print node_name, operreq, nodeset, oper_candidates, processed_oper_candidates, filtered_postions 66 | 67 | if operreq == "split": 68 | split_candidate_tuples = oper_candidates 69 | nodes_2_process = self.process_split_node_decoder_graph(node_name, nodeset, split_candidate_tuples, nodes_2_process, 70 | main_sent_dict, boxer_graph, decoder_graph) 71 | 72 | if operreq == "drop-rel": 73 | relnode_candidates = oper_candidates 74 | processed_relnode_candidates = processed_oper_candidates 75 | filtered_mod_pos = filtered_postions 76 | nodes_2_process = self.process_droprel_node_decoder_graph(node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos, 77 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 78 | 79 | if operreq == "drop-mod": 80 | mod_candidates = oper_candidates 81 | processed_mod_pos = processed_oper_candidates 82 | filtered_mod_pos = filtered_postions 83 | nodes_2_process = self.process_dropmod_node_decoder_graph(node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos, 84 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 85 | 86 | if operreq == "drop-ood": 87 | oodnode_candidates = oper_candidates 88 | processed_oodnode_candidates = processed_oper_candidates 89 | filtered_mod_pos = filtered_postions 90 | nodes_2_process = self.process_dropood_node_decoder_graph(node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 91 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 92 | 93 | self.expand_decoder_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, decoder_graph) 94 | 95 | def process_split_node_decoder_graph(self, node_name, nodeset, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 96 | # Calculating probabilities 97 | probability_results = [] 98 | 99 | # Find the Parent main sentence 100 | parent_nodeset = nodeset[:] 101 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, []) 102 | 103 | # Explore no-split options 104 | probability = 1 105 | for split_candidate in split_candidate_tuples: 106 | # Get the probability 107 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, [parent_sentence], boxer_graph) 108 | if split_feature in self.probability_tables["split"]: 109 | probability = probability * self.probability_tables["split"][split_feature]["false"] 110 | else: 111 | probability = probability * 0.5 112 | #print split_candidate, split_feature, "false", probability 113 | probability_results.append((probability, None, [])) 114 | 115 | # Explore all split options 116 | # Calculate all parent and following subtrees 117 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict() 118 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict) 119 | for split_candidate in split_candidate_tuples: 120 | # Find the children sentences 121 | children_sentences = [] 122 | 123 | # Split on the split_candidate 124 | node_subgraph_nodeset_dict, node_span_dict = 
boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict)
            #print node_subgraph_nodeset_dict, node_span_dict

            # Sorting them depending on span
            split_results = []
            for tnodename in split_candidate:
                tspan = node_span_dict[tnodename]
                tnodeset = node_subgraph_nodeset_dict[tnodename][:]
                split_results.append((tspan, tnodeset, tnodename))
            split_results.sort()

            # Prospective children major nodes
            for item in split_results:
                child_nodeset = item[1]
                child_nodeset.sort()
                child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, [])
                children_sentences.append(child_sentence)

            #print children_sentences

            # Get the probability
            split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
            probability = 0
            if split_feature in self.probability_tables["split"]:
                probability = self.probability_tables["split"][split_feature]["true"]
            else:
                probability = 0.5

            #print split_candidate, split_feature, "true", probability

            probability_results.append((probability, split_candidate, split_results))

        # Sort probabilities
        probability_results.sort(reverse=True)

        ##
        #data = [(item[0], item[1]) for item in probability_results]
        #print data
        ##

        split_tuple = probability_results[0][1]
        if split_tuple != None:
            # Adding the operation node
            not_applied_cands = [item for item in split_candidate_tuples if item is not split_tuple]
            opernode_data = ("split", split_tuple, not_applied_cands)
            opernode_name = decoder_graph.create_opernode(opernode_data)
            # Label the edge with the chosen split candidate
            decoder_graph.create_edge((node_name, opernode_name, split_tuple))

            split_results = probability_results[0][2]
            for item in split_results:
                child_nodeset = item[1][:]
                child_nodeset.sort()
                parent_child_nodeset = item[2]

                # Check for adding OOD or subsequent nodes
                child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
                if isNew:
                    nodes_2_process.append(child_majornode_name)
                decoder_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset))
        else:
            # Adding the operation node
            not_applied_cands = [item for item in split_candidate_tuples]
            opernode_data = ("split", None, not_applied_cands)
            opernode_name = decoder_graph.create_opernode(opernode_data)
            decoder_graph.create_edge((node_name, opernode_name, None))

            # Check for adding drop-rel or drop-mod or fin nodes
            child_nodeset = nodeset[:]
            child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
            if isNew:
                nodes_2_process.append(child_majornode_name)
            decoder_graph.create_edge((opernode_name, child_majornode_name, None))

        return nodes_2_process

    def process_droprel_node_decoder_graph(self, node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos,
                                           nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
        relnode_to_process = relnode_candidates[0]
        processed_relnode_candidates.append(relnode_to_process)

        drop_rel_feature =
self.method_feature_extract.get_drop_rel_feature(relnode_to_process, nodeset, main_sent_dict, boxer_graph) 205 | 206 | if drop_rel_feature in self.probability_tables["drop-rel"]: 207 | drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["true"] 208 | not_drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["false"] 209 | if drop_prob > not_drop_prob: 210 | # Creating opernode for droping 211 | opernode_data = ("drop-rel", relnode_to_process, "True") 212 | opernode_name = decoder_graph.create_opernode(opernode_data) 213 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process)) 214 | # Check for adding REL or subsequent nodes, (nodeset is changed) 215 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos) 216 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos) 217 | if isNew: 218 | nodes_2_process.append(child_majornode_name) 219 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 220 | return nodes_2_process 221 | 222 | # Creating opernode for not droping 223 | opernode_data = ("drop-rel", relnode_to_process, "False") 224 | opernode_name = decoder_graph.create_opernode(opernode_data) 225 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process)) 226 | # Check for adding REL or subsequent nodes, (nodeset is unchanged) 227 | child_nodeset = nodeset 228 | child_filtered_mod_pos = filtered_mod_pos 229 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos) 230 | if isNew: 231 | nodes_2_process.append(child_majornode_name) 232 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 233 | return nodes_2_process 234 | 235 | def process_dropmod_node_decoder_graph(self, node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos, 236 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 237 | modcand_to_process = mod_candidates[0] 238 | modcand_position_to_process = modcand_to_process[0] 239 | modcand_word = main_sent_dict[modcand_position_to_process][0] 240 | modcand_node = modcand_to_process[1] 241 | processed_mod_pos.append(modcand_position_to_process) 242 | 243 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(modcand_to_process, main_sent_dict, boxer_graph) 244 | if drop_mod_feature in self.probability_tables["drop-mod"]: 245 | drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["true"] 246 | not_drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["false"] 247 | if drop_prob > not_drop_prob: 248 | # Drop this mod, adding the operation node 249 | opernode_data = ("drop-mod", modcand_to_process, "True") 250 | opernode_name = decoder_graph.create_opernode(opernode_data) 251 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process)) 252 | # Check for adding further drop mods, (nodeset is unchanged) 253 | child_nodeset = nodeset 254 | filtered_mod_pos.append(modcand_position_to_process) 255 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 256 | if isNew: 257 | nodes_2_process.append(child_majornode_name) 258 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 259 | 
return nodes_2_process 260 | 261 | # Dont drop this pos, adding the operation node 262 | opernode_data = ("drop-mod", modcand_to_process, "False") 263 | opernode_name = decoder_graph.create_opernode(opernode_data) 264 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process)) 265 | # Check for adding further drop mods, (nodeset is unchanged) 266 | child_nodeset = nodeset 267 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 268 | if isNew: 269 | nodes_2_process.append(child_majornode_name) 270 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 271 | return nodes_2_process 272 | 273 | def process_dropood_node_decoder_graph(self, node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 274 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 275 | oodnode_to_process = oodnode_candidates[0] 276 | processed_oodnode_candidates.append(oodnode_to_process) 277 | 278 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(oodnode_to_process, nodeset, main_sent_dict, boxer_graph) 279 | 280 | if drop_ood_feature in self.probability_tables["drop-ood"]: 281 | drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["true"] 282 | not_drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["false"] 283 | if drop_prob > not_drop_prob: 284 | # Creating opernode for droping 285 | opernode_data = ("drop-ood", oodnode_to_process, "True") 286 | opernode_name = decoder_graph.create_opernode(opernode_data) 287 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 288 | # Check for adding OOD or subsequent nodes, (nodeset is changed) 289 | child_nodeset = nodeset 290 | child_nodeset.remove(oodnode_to_process) 291 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos) 292 | if isNew: 293 | nodes_2_process.append(child_majornode_name) 294 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 295 | return nodes_2_process 296 | 297 | # Creating opernode for not droping 298 | opernode_data = ("drop-ood", oodnode_to_process, "False") 299 | opernode_name = decoder_graph.create_opernode(opernode_data) 300 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 301 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged) 302 | child_nodeset = nodeset 303 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos) 304 | if isNew: 305 | nodes_2_process.append(child_majornode_name) 306 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 307 | return nodes_2_process 308 | 309 | def addition_major_node(self, main_sent_dict, boxer_graph, decoder_graph, opertype, nodeset, processed_candidates, extra_data): 310 | # node type - value 311 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4} 312 | operval = type_val[opertype] 313 | 314 | # simple sentences not available, used to match data structures 315 | simple_sentences = [] 316 | 317 | # Checking for the addition of "split" major-node 318 | if operval <= type_val["split"]: 319 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 320 | # Calculating Split Candidates - DRS Graph node tuples 321 | 
split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset, self.MAX_SPLIT_PAIR_SIZE) 322 | # print "split_candidate_tuples : " + str(split_candidate_tuples) 323 | 324 | if len(split_candidate_tuples) != 0: 325 | # Adding the major node for split 326 | majornode_data = ("split", nodeset[:], simple_sentences, split_candidate_tuples) 327 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 328 | return majornode_name, isNew 329 | 330 | if operval <= type_val["drop-rel"]: 331 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 332 | # Calculate drop-rel candidates 333 | processed_relnode = processed_candidates[:] if opertype == "drop-rel" else [] 334 | filtered_mod_pos = extra_data if opertype == "drop-rel" else [] 335 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode) 336 | if len(relnode_set) != 0: 337 | # Adding the major nodes for drop-rel 338 | majornode_data = ("drop-rel", nodeset[:], simple_sentences, relnode_set, processed_relnode, filtered_mod_pos) 339 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 340 | return majornode_name, isNew 341 | 342 | if operval <= type_val["drop-mod"]: 343 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 344 | # Calculate drop-mod candidates 345 | processed_mod_pos = processed_candidates[:] if opertype == "drop-mod" else [] 346 | filtered_mod_pos = extra_data 347 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos) 348 | if len(modcand_set) != 0: 349 | # Adding the major nodes for drop-mod 350 | majornode_data = ("drop-mod", nodeset[:], simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos) 351 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 352 | return majornode_name, isNew 353 | 354 | if operval <= type_val["drop-ood"]: 355 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 356 | # Check for drop-OOD node candidates 357 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else [] 358 | filtered_mod_pos = extra_data 359 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes) 360 | if len(oodnode_candidates) != 0: 361 | # Adding the major node for drop-ood 362 | majornode_data = ("drop-ood", nodeset[:], simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos) 363 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 364 | return majornode_name, isNew 365 | 366 | # None of them matched, create "fin" node 367 | filtered_mod_pos = extra_data[:] 368 | majornode_data = ("fin", nodeset[:], simple_sentences, filtered_mod_pos) 369 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 370 | return majornode_name, isNew 371 | 372 | -------------------------------------------------------------------------------- /source/explore_training_graph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : explore_training_graph.py = 4 | #description : Training graph explorer = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | from training_graph_module import Training_Graph 12 | import function_select_methods 13 | import functions_prepare_elementtree_dot 14 | 15 | class Explore_Training_Graph: 16 | def __init__(self, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, 17 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 18 | self.output_stream = output_stream 19 | 20 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 21 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 22 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 23 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 24 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 25 | 26 | self.method_training_graph = function_select_methods.select_training_graph_method(self.METHOD_TRAINING_GRAPH) 27 | 28 | def explore_training_graph(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph): 29 | # Start a training graph 30 | training_graph = Training_Graph() 31 | nodes_2_process = [] 32 | 33 | # Check if Discourse information is available 34 | if boxer_graph.isEmpty(): 35 | # Adding finishing major node 36 | nodeset = boxer_graph.get_nodeset() 37 | filtered_mod_pos = [] 38 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 39 | 40 | # Creating major node 41 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 42 | nodes_2_process.append(majornode_name) # isNew = True 43 | else: 44 | # DRS data is available for the main sentence 45 | # Check to add the starting node 46 | nodeset = boxer_graph.get_nodeset() 47 | majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "split", nodeset, [], []) 48 | nodes_2_process.append(majornode_name) # isNew = True 49 | 50 | # Start expanding the training graph 51 | self.expand_training_graph(nodes_2_process[:], main_sent_dict, boxer_graph, training_graph) 52 | 53 | # Writing sentence element 54 | functions_prepare_elementtree_dot.prepare_write_sentence_element(self.output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 55 | 56 | # # Check to create visual representation 57 | # if int(sentid) <= 100: 58 | # functions_prepare_elementtree_dot.run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 59 | 60 | def expand_training_graph(self, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 61 | #print nodes_2_process 62 | if len(nodes_2_process) == 0: 63 | return 64 | 65 | node_name = nodes_2_process[0] 66 | operreq = training_graph.get_majornode_type(node_name) 67 | nodeset = training_graph.get_majornode_nodeset(node_name)[:] 68 | simple_sentences = training_graph.get_majornode_simple_sentences(node_name)[:] 69 | oper_candidates = training_graph.get_majornode_oper_candidates(node_name)[:] 70 | processed_oper_candidates = training_graph.get_majornode_processed_oper_candidates(node_name)[:] 71 | filtered_postions = training_graph.get_majornode_filtered_postions(node_name)[:] 72 | 73 | if operreq == "split": 74 | split_candidate_tuples = oper_candidates 75 | nodes_2_process = self.process_split_node_training_graph(node_name, nodeset, simple_sentences, split_candidate_tuples, 76 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 77 | 78 | if operreq == "drop-rel": 79 | relnode_candidates = oper_candidates 80 | processed_relnode_candidates = 
processed_oper_candidates 81 | filtered_mod_pos = filtered_postions 82 | nodes_2_process = self.process_droprel_node_training_graph(node_name, nodeset, simple_sentences, relnode_candidates, processed_relnode_candidates, filtered_mod_pos, 83 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 84 | 85 | if operreq == "drop-mod": 86 | mod_candidates = oper_candidates 87 | processed_mod_pos = processed_oper_candidates 88 | filtered_mod_pos = filtered_postions 89 | nodes_2_process = self.process_dropmod_node_training_graph(node_name, nodeset, simple_sentences, mod_candidates, processed_mod_pos, filtered_mod_pos, 90 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 91 | 92 | if operreq == "drop-ood": 93 | oodnode_candidates = oper_candidates 94 | processed_oodnode_candidates = processed_oper_candidates 95 | filtered_mod_pos = filtered_postions 96 | nodes_2_process = self.process_dropood_node_training_graph(node_name, nodeset, simple_sentences, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 97 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 98 | 99 | self.expand_training_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, training_graph) 100 | 101 | def process_split_node_training_graph(self, node_name, nodeset, simple_sentences, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 102 | split_candidate_results = [] 103 | splitAchieved = False 104 | for split_candidate in split_candidate_tuples: 105 | isValidSplit, split_results = self.method_training_graph.process_split_candidate_for_split(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 106 | # print "split_candidate : "+str(split_candidate) + " : " + str(isValidSplit) 107 | split_candidate_results.append((isValidSplit, split_results)) 108 | if isValidSplit: 109 | splitAchieved = True 110 | 111 | if splitAchieved: 112 | # At least one split candidate succeeded 113 | for split_candidate, results_tuple in zip(split_candidate_tuples, split_candidate_results): 114 | if results_tuple[0]: 115 | # Adding the operation node 116 | not_applied_cands = [item for item in split_candidate_tuples if item is not split_candidate] 117 | opernode_data = ("split", split_candidate, not_applied_cands) 118 | opernode_name = training_graph.create_opernode(opernode_data) 119 | training_graph.create_edge((node_name, opernode_name, split_candidate)) 120 | 121 | # Adding children major nodes 122 | for item in results_tuple[1]: 123 | child_nodeset = item[1] 124 | child_nodeset.sort() 125 | parent_child_nodeset = item[2] 126 | simple_sentence = item[3] 127 | 128 | # Check for adding OOD or subsequent nodes 129 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, [simple_sentence], boxer_graph, training_graph, "drop-rel", child_nodeset, [], []) 130 | if isNew: 131 | nodes_2_process.append(child_majornode_name) 132 | training_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset)) 133 | 134 | else: 135 | # None of the split candidates succeeded, adding the operation node 136 | not_applied_cands = [item for item in split_candidate_tuples] 137 | opernode_data = ("split", None, not_applied_cands) 138 | opernode_name = training_graph.create_opernode(opernode_data) 139 | training_graph.create_edge((node_name, opernode_name, None)) 140 | 141 | # Check for adding drop-rel or drop-mod or fin nodes 142 | child_nodeset = nodeset 143 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences,
boxer_graph, training_graph, "drop-rel", child_nodeset, [], []) 144 | if isNew: 145 | nodes_2_process.append(child_majornode_name) 146 | training_graph.create_edge((opernode_name, child_majornode_name, None)) 147 | 148 | return nodes_2_process 149 | 150 | def process_droprel_node_training_graph(self, node_name, nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 151 | relnode_to_process = relnode_set[0] 152 | processed_relnode.append(relnode_to_process) 153 | 154 | isValidDrop = self.method_training_graph.process_rel_candidate_for_drop(relnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 155 | if isValidDrop: 156 | # Drop this rel node, adding the operation node 157 | opernode_data = ("drop-rel", relnode_to_process, "True") 158 | opernode_name = training_graph.create_opernode(opernode_data) 159 | training_graph.create_edge((node_name, opernode_name, relnode_to_process)) 160 | 161 | # Check for adding REL or subsequent nodes, (nodeset is changed) 162 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos) 163 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos) 164 | if isNew: 165 | nodes_2_process.append(child_majornode_name) 166 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 167 | else: 168 | # Don't drop this rel node, adding the operation node 169 | opernode_data = ("drop-rel", relnode_to_process, "False") 170 | opernode_name = training_graph.create_opernode(opernode_data) 171 | training_graph.create_edge((node_name, opernode_name, relnode_to_process)) 172 | 173 | # Check for adding REL or subsequent nodes, (nodeset is unchanged) 174 | child_nodeset = nodeset 175 | child_filtered_mod_pos = filtered_mod_pos 176 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos) 177 | if isNew: 178 | nodes_2_process.append(child_majornode_name) 179 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 180 | 181 | return nodes_2_process 182 | 183 | def process_dropmod_node_training_graph(self, node_name, nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 184 | modcand_to_process = modcand_set[0] 185 | modcand_position_to_process = modcand_to_process[0] 186 | processed_mod_pos.append(modcand_position_to_process) 187 | 188 | isValidDrop = self.method_training_graph.process_mod_candidate_for_drop(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 189 | if isValidDrop: 190 | # Drop this mod pos, adding the operation node 191 | opernode_data = ("drop-mod", modcand_to_process, "True") 192 | opernode_name = training_graph.create_opernode(opernode_data) 193 | training_graph.create_edge((node_name, opernode_name, modcand_to_process)) 194 | 195 | # Check for adding mod and their subsequent nodes, (nodeset is not changed) 196 | child_nodeset = nodeset 197 | filtered_mod_pos.append(modcand_position_to_process) 198 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos,
filtered_mod_pos) 199 | if isNew: 200 | nodes_2_process.append(child_majornode_name) 201 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 202 | else: 203 | # Don't drop this pos, adding the operation node 204 | opernode_data = ("drop-mod", modcand_to_process, "False") 205 | opernode_name = training_graph.create_opernode(opernode_data) 206 | training_graph.create_edge((node_name, opernode_name, modcand_to_process)) 207 | 208 | # Check for adding mod and their subsequent nodes, (nodeset is not changed) 209 | child_nodeset = nodeset 210 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 211 | if isNew: 212 | nodes_2_process.append(child_majornode_name) 213 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 214 | return nodes_2_process 215 | 216 | def process_dropood_node_training_graph(self, node_name, nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 217 | 218 | oodnode_to_process = oodnode_set[0] 219 | processed_oodnode.append(oodnode_to_process) 220 | 221 | isValidDrop = self.method_training_graph.process_ood_candidate_for_drop(oodnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 222 | if isValidDrop: 223 | # Drop this ood node, adding the operation node 224 | opernode_data = ("drop-ood", oodnode_to_process, "True") 225 | opernode_name = training_graph.create_opernode(opernode_data) 226 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 227 | 228 | # Check for adding OOD or subsequent nodes, (nodeset is changed) 229 | child_nodeset = nodeset 230 | child_nodeset.remove(oodnode_to_process) 231 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos) 232 | if isNew: 233 | nodes_2_process.append(child_majornode_name) 234 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 235 | else: 236 | # Don't drop this ood node, adding the operation node 237 | opernode_data = ("drop-ood", oodnode_to_process, "False") 238 | opernode_name = training_graph.create_opernode(opernode_data) 239 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 240 | 241 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged) 242 | child_nodeset = nodeset 243 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos) 244 | if isNew: 245 | nodes_2_process.append(child_majornode_name) 246 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 247 | 248 | return nodes_2_process 249 | 250 | def addition_major_node(self, main_sent_dict, simple_sentences, boxer_graph, training_graph, opertype, nodeset, processed_candidates, extra_data): 251 | # node type - value 252 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4} 253 | operval = type_val[opertype] 254 | 255 | # Checking for the addition of "split" major-node 256 | if operval <= type_val["split"]: 257 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 258 | # Calculating Split Candidates - DRS Graph node tuples 259 | split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset,
self.MAX_SPLIT_PAIR_SIZE) 260 | # print "split_candidate_tuples : " + str(split_candidate_tuples) 261 | 262 | if len(split_candidate_tuples) != 0: 263 | # Adding the major node for split 264 | majornode_data = ("split", nodeset, simple_sentences, split_candidate_tuples) 265 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 266 | return majornode_name, isNew 267 | 268 | if operval <= type_val["drop-rel"]: 269 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 270 | # Calculate drop-rel candidates 271 | processed_relnode = processed_candidates if opertype == "drop-rel" else [] 272 | filtered_mod_pos = extra_data if opertype == "drop-rel" else [] 273 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode) 274 | if len(relnode_set) != 0: 275 | # Adding the major nodes for drop-rel 276 | majornode_data = ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos) 277 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 278 | return majornode_name, isNew 279 | 280 | if operval <= type_val["drop-mod"]: 281 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 282 | # Calculate drop-mod candidates 283 | processed_mod_pos = processed_candidates if opertype == "drop-mod" else [] 284 | filtered_mod_pos = extra_data 285 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos) 286 | if len(modcand_set) != 0: 287 | # Adding the major nodes for drop-mod 288 | majornode_data = ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos) 289 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 290 | return majornode_name, isNew 291 | 292 | if operval <= type_val["drop-ood"]: 293 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 294 | # Check for drop-OOD node candidates 295 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else [] 296 | filtered_mod_pos = extra_data 297 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes) 298 | if len(oodnode_candidates) != 0: 299 | # Adding the major node for drop-ood 300 | majornode_data = ("drop-ood", nodeset, simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos) 301 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 302 | return majornode_name, isNew 303 | 304 | 305 | # None of them matched, create "fin" node 306 | filtered_mod_pos = extra_data 307 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 308 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 309 | return majornode_name, isNew 310 | -------------------------------------------------------------------------------- /source/function_select_methods.py: -------------------------------------------------------------------------------- 1 | 2 | #=================================================================================== 3 | #description : Methods for training graph and features exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. 
= 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | 10 | from methods_training_graph import Method_LED, Method_OVERLAP_LED 11 | from methods_feature_extract import Feature_Init, Feature_Nov27 12 | 13 | def select_training_graph_method(METHOD_TRAINING_GRAPH): 14 | return { 15 | "method-0.99-lteq-lt": Method_OVERLAP_LED(0.99, "lteq", "lt"), 16 | "method-0.75-lteq-lt": Method_OVERLAP_LED(0.75, "lteq", "lt"), 17 | "method-0.5-lteq-lteq": Method_OVERLAP_LED(0.5, "lteq", "lteq"), 18 | "method-led-lteq": Method_LED("lteq", "lteq", "lteq"), 19 | "method-led-lt": Method_LED("lt", "lt", "lt") 20 | }[METHOD_TRAINING_GRAPH] 21 | 22 | def select_feature_extract_method(METHOD_FEATURE_EXTRACT): 23 | return { 24 | "feature-init": Feature_Init(), 25 | "feature-Nov27": Feature_Nov27(), 26 | }[METHOD_FEATURE_EXTRACT] 27 |
-------------------------------------------------------------------------------- /source/functions_configuration_file.py: -------------------------------------------------------------------------------- 1 | #=================================================================================== 2 | #title : functions_configuration_file.py = 3 | #description : Prepare/read configuration file = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | def write_config_file(config_filename, config_data_dict): 10 | config_file = open(config_filename, "w") 11 | 12 | config_file.write("##############################################################\n"+ 13 | "####### Discourse-Complex-Simple Configuration File ##########\n"+ 14 | "##############################################################\n\n") 15 | 16 | config_file.write("# Generation Information\n") 17 | if "TRAIN-BOXER-GRAPH" in config_data_dict: 18 | config_file.write("[TRAIN-BOXER-GRAPH]\n"+config_data_dict["TRAIN-BOXER-GRAPH"]+"\n\n") 19 | 20 | if "TRANSFORMATION-MODEL" in config_data_dict: 21 | config_file.write("[TRANSFORMATION-MODEL]\n"+" ".join(config_data_dict["TRANSFORMATION-MODEL"])+"\n\n") 22 | 23 | if "MAX-SPLIT-SIZE" in config_data_dict: 24 | config_file.write("[MAX-SPLIT-SIZE]\n"+str(config_data_dict["MAX-SPLIT-SIZE"])+"\n\n") 25 | 26 | if "RESTRICTED-DROP-RELATION" in config_data_dict: 27 | config_file.write("[RESTRICTED-DROP-RELATION]\n"+" ".join(config_data_dict["RESTRICTED-DROP-RELATION"])+"\n\n") 28 | 29 | if "ALLOWED-DROP-MODIFIER" in config_data_dict: 30 | config_file.write("[ALLOWED-DROP-MODIFIER]\n"+" ".join(config_data_dict["ALLOWED-DROP-MODIFIER"])+"\n\n") 31 | 32 | if "METHOD-TRAINING-GRAPH" in config_data_dict: 33 | config_file.write("[METHOD-TRAINING-GRAPH]\n"+config_data_dict["METHOD-TRAINING-GRAPH"]+"\n\n") 34 | 35 | if "METHOD-FEATURE-EXTRACT" in config_data_dict: 36 | config_file.write("[METHOD-FEATURE-EXTRACT]\n"+config_data_dict["METHOD-FEATURE-EXTRACT"]+"\n\n") 37 | 38 | if "NUM-EM-ITERATION" in config_data_dict: 39 | config_file.write("[NUM-EM-ITERATION]\n"+str(config_data_dict["NUM-EM-ITERATION"])+"\n\n") 40 | 41 | if "LANGUAGE-MODEL" in config_data_dict: 42 | config_file.write("[LANGUAGE-MODEL]\n"+config_data_dict["LANGUAGE-MODEL"]+"\n\n") 43 | 44 | config_file.write("# Step-1\n") 45 | if "TRAIN-TRAINING-GRAPH" in config_data_dict: 46 | config_file.write("[TRAIN-TRAINING-GRAPH]\n"+config_data_dict["TRAIN-TRAINING-GRAPH"]+"\n\n") 47 | 48 |
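# (Editor's note, not part of the original source.) An illustrative sketch of the file this function emits, with placeholder values rather than ones taken from this repository: each section is a bracketed key followed by its value on the next line, e.g.
#
#   [MAX-SPLIT-SIZE]
#   2
#
#   [METHOD-TRAINING-GRAPH]
#   method-0.5-lteq-lteq
#
# parser_config_file below recovers the dictionary by scanning for "[KEY]" lines and reading the single line that follows each one.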
config_file.write("# Step-2\n") 49 | if "TRANSFORMATION-MODEL-DIR" in config_data_dict: 50 | config_file.write("[TRANSFORMATION-MODEL-DIR]\n"+config_data_dict["TRANSFORMATION-MODEL-DIR"]+"\n\n") 51 | 52 | config_file.write("# Step-3\n") 53 | if "MOSES-COMPLEX-SIMPLE-DIR" in config_data_dict: 54 | config_file.write("[MOSES-COMPLEX-SIMPLE-DIR]\n"+config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"]+"\n\n") 55 | 56 | config_file.close() 57 | 58 | 59 | def parser_config_file(config_file): 60 | config_data = (open(config_file, "r").read().strip()).split("\n") 61 | config_data_dict = {} 62 | count = 0 63 | while count < len(config_data): 64 | if config_data[count].startswith("["): 65 | # Start Information 66 | if config_data[count].strip()[1:-1] == "TRAIN-BOXER-GRAPH": 67 | config_data_dict["TRAIN-BOXER-GRAPH"] = config_data[count+1].strip() 68 | 69 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL": 70 | config_data_dict["TRANSFORMATION-MODEL"] = config_data[count+1].strip().split() 71 | 72 | if config_data[count].strip()[1:-1] == "MAX-SPLIT-SIZE": 73 | config_data_dict["MAX-SPLIT-SIZE"] = int(config_data[count+1].strip()) 74 | 75 | if config_data[count].strip()[1:-1] == "RESTRICTED-DROP-RELATION": 76 | config_data_dict["RESTRICTED-DROP-RELATION"] = config_data[count+1].strip().split() 77 | 78 | if config_data[count].strip()[1:-1] == "ALLOWED-DROP-MODIFIER": 79 | config_data_dict["ALLOWED-DROP-MODIFIER"] = config_data[count+1].strip().split() 80 | 81 | if config_data[count].strip()[1:-1] == "METHOD-TRAINING-GRAPH": 82 | config_data_dict["METHOD-TRAINING-GRAPH"] = config_data[count+1].strip() 83 | 84 | if config_data[count].strip()[1:-1] == "METHOD-FEATURE-EXTRACT": 85 | config_data_dict["METHOD-FEATURE-EXTRACT"] = config_data[count+1].strip() 86 | 87 | if config_data[count].strip()[1:-1] == "NUM-EM-ITERATION": 88 | config_data_dict["NUM-EM-ITERATION"] = int(config_data[count+1].strip()) 89 | 90 | if config_data[count].strip()[1:-1] == "LANGUAGE-MODEL": 91 | config_data_dict["LANGUAGE-MODEL"] = config_data[count+1].strip() 92 | 93 | # Step 1 94 | if config_data[count].strip()[1:-1] == "TRAIN-TRAINING-GRAPH": 95 | config_data_dict["TRAIN-TRAINING-GRAPH"] = config_data[count+1].strip() 96 | 97 | # Step 2 98 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL-DIR": 99 | config_data_dict["TRANSFORMATION-MODEL-DIR"] = config_data[count+1].strip() 100 | 101 | # Step 3 102 | if config_data[count].strip()[1:-1] == "MOSES-COMPLEX-SIMPLE-DIR": 103 | config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"] = config_data[count+1].strip() 104 | 105 | count += 2 106 | else: 107 | count += 1 108 | return config_data_dict 109 | -------------------------------------------------------------------------------- /source/functions_model_files.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : functions_model_files.py = 4 | #description : Model file Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | def read_model_files(model_dir, transformation_model): 11 | probability_tables = {} 12 | for trans_method in transformation_model: 13 | modelfile = model_dir+"/D2S-"+trans_method.upper()+".model" 14 | probability_tables[trans_method] = {} 15 | with open(modelfile) as infile: 16 | for line in infile: 17 | data = line.split() 18 | if data[0] not in probability_tables[trans_method]: 19 | probability_tables[trans_method][data[0]] = {data[1]:float(data[2])} 20 | else: 21 | probability_tables[trans_method][data[0]][data[1]] = float(data[2]) 22 | print modelfile + " done ..." 23 | return probability_tables 24 | 25 | def write_model_files(model_dir, probability_tables, smt_sentence_pairs): 26 | if "split" in probability_tables: 27 | print "Writing "+model_dir+"/D2S-SPLIT.model ..." 28 | foutput = open(model_dir+"/D2S-SPLIT.model", "w") 29 | split_feature_set = probability_tables["split"].keys() 30 | split_feature_set.sort() 31 | for item in split_feature_set: 32 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["split"][item]["true"])+"\n") 33 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["split"][item]["false"])+"\n") 34 | foutput.close() 35 | 36 | if "drop-ood" in probability_tables: 37 | print "Writing "+model_dir+"/D2S-DROP-OOD.model ..." 38 | foutput = open(model_dir+"/D2S-DROP-OOD.model", "w") 39 | drop_ood_feature_set = probability_tables["drop-ood"].keys() 40 | drop_ood_feature_set.sort() 41 | for item in drop_ood_feature_set: 42 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-ood"][item]["true"])+"\n") 43 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-ood"][item]["false"])+"\n") 44 | foutput.close() 45 | 46 | if "drop-rel" in probability_tables: 47 | print "Writing "+model_dir+"/D2S-DROP-REL.model ..." 48 | foutput = open(model_dir+"/D2S-DROP-REL.model", "w") 49 | drop_rel_feature_set = probability_tables["drop-rel"].keys() 50 | drop_rel_feature_set.sort() 51 | for item in drop_rel_feature_set: 52 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-rel"][item]["true"])+"\n") 53 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-rel"][item]["false"])+"\n") 54 | foutput.close() 55 | 56 | if "drop-mod" in probability_tables: 57 | print "Writing "+model_dir+"/D2S-DROP-MOD.model ..." 58 | foutput = open(model_dir+"/D2S-DROP-MOD.model", "w") 59 | drop_mod_feature_set = probability_tables["drop-mod"].keys() 60 | drop_mod_feature_set.sort() 61 | for item in drop_mod_feature_set: 62 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-mod"][item]["true"])+"\n") 63 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-mod"][item]["false"])+"\n") 64 | foutput.close() 65 | 66 | # Writing SMT training data 67 | print "Writing "+model_dir+"/D2S-SMT.source ..." 68 | print "Writing "+ model_dir+"/D2S-SMT.target ..." 
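# (Editor's note, not part of the original source.) Each D2S-*.model file read and written in this module holds one "<feature>\t<true|false>\t<probability>" triple per line. With a hypothetical feature name, the two lines
#   vp-conj-vp   true    0.75
#   vp-conj-vp   false   0.25
# load back through read_model_files above as
#   probability_tables["split"]["vp-conj-vp"] == {"true": 0.75, "false": 0.25}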
69 | fsource = open(model_dir+"/D2S-SMT.source", "w") 70 | ftarget = open(model_dir+"/D2S-SMT.target", "w") 71 | for sentid in smt_sentence_pairs: 72 | # print sentid 73 | # print smt_sentence_pairs[sentid] 74 | for pair in smt_sentence_pairs[sentid]: 75 | fsource.write(pair[0].encode('utf-8')+"\n") 76 | ftarget.write(pair[1].encode('utf-8')+"\n") 77 | fsource.close() 78 | ftarget.close() 79 | -------------------------------------------------------------------------------- /source/functions_prepare_elementtree_dot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : functions_prepare_elementtree_dot.py = 4 | #description : Prepare dot file = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import os 12 | import xml.etree.ElementTree as ET 13 | from xml.dom import minidom 14 | 15 | def prettify_xml_element(element): 16 | """Return a pretty-printed XML string for the Element. 17 | """ 18 | rough_string = ET.tostring(element) 19 | reparsed = minidom.parseString(rough_string) 20 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ") 21 | return prettyxml.encode("utf-8") 22 | 23 | ############################### Elementary Tree ########################################## 24 | 25 | def prepare_write_sentence_element(output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 26 | # Creating Sentence element 27 | sentence = ET.Element('sentence') 28 | sentence.attrib={"id":str(sentid)} 29 | 30 | # Writing main sentence 31 | main = ET.SubElement(sentence, "main") 32 | mainsent = ET.SubElement(main, "s") 33 | mainsent.text = main_sentence 34 | wordinfo = ET.SubElement(main, "winfo") 35 | mainpositions = main_sent_dict.keys() 36 | mainpositions.sort() 37 | for position in mainpositions: 38 | word = ET.SubElement(wordinfo, "w") 39 | word.text = main_sent_dict[position][0] 40 | word.attrib = {"id":str(position), "pos":main_sent_dict[position][1]} 41 | 42 | # Writing simple sentence 43 | simpleset = ET.SubElement(sentence, "simple-set") 44 | for simple_sentence in simple_sentences: 45 | simple = ET.SubElement(simpleset, "simple") 46 | simplesent = ET.SubElement(simple, "s") 47 | simplesent.text = simple_sentence 48 | 49 | # Writing boxer Data : boxer_graph 50 | boxer = boxer_graph.convert_to_elementarytree() 51 | sentence.append(boxer) 52 | 53 | # Writing Training Graph : training_graph 54 | traininggraph = training_graph.convert_to_elementarytree() 55 | sentence.append(traininggraph) 56 | 57 | output_stream.write(prettify_xml_element(sentence)) 58 | 59 | ############################ Dot - PNG File ################################################### 60 | 61 | def run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 62 | print "Creating boxer and training graphs for sentence id : "+sentid+" ..." 
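# (Editor's note, not part of the original source.) The two blocks below write Graphviz .dot files under /tmp and shell out to the "dot" binary via os.system, so Graphviz must be installed and on PATH for the PNG rendering to succeed.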
63 | 64 | # Start creating boxer graph 65 | foutput = open("/tmp/boxer-graph-"+sentid+".dot", "w") 66 | boxer_dotstring = boxer_graph.convert_to_dotstring(sentid, main_sentence, main_sent_dict, simple_sentences) 67 | foutput.write(boxer_dotstring) 68 | foutput.close() 69 | os.system("dot -Tpng /tmp/boxer-graph-"+sentid+".dot -o /tmp/boxer-graph-"+sentid+".png") 70 | 71 | 72 | # Start creating training graph 73 | foutput = open("/tmp/training-graph-"+sentid+".dot", "w") 74 | train_dotstring = training_graph.convert_to_dotstring(main_sent_dict, boxer_graph) 75 | foutput.write(train_dotstring) 76 | foutput.close() 77 | os.system("dot -Tpng /tmp/training-graph-"+sentid+".dot -o /tmp/training-graph-"+sentid+".png") 78 | -------------------------------------------------------------------------------- /source/methods_feature_extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #description : Methods for features exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | 10 | class Feature_Nov27: 11 | 12 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph): 13 | # Calculating iLength 14 | #iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list) 15 | # Get split tuple pattern 16 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple) 17 | #split_feature = split_pattern+"_"+str(iLength) 18 | split_feature = split_pattern 19 | return split_feature 20 | 21 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph): 22 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict) 23 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one 24 | span = boxer_graph.extract_span_min_max(nodeset) 25 | boundaryVal = "false" 26 | if ood_position <= span[0] or ood_position >= span[1]: 27 | boundaryVal = "true" 28 | drop_ood_feature = ood_word+"_"+boundaryVal 29 | return drop_ood_feature 30 | 31 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph): 32 | rel_word = boxer_graph.relations[rel_node]["predicates"] 33 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset) 34 | drop_rel_feature = rel_word+"_" 35 | if len(rel_span) <= 2: 36 | drop_rel_feature += "0-2" 37 | elif len(rel_span) <= 5: 38 | drop_rel_feature += "2-5" 39 | elif len(rel_span) <= 10: 40 | drop_rel_feature += "5-10" 41 | elif len(rel_span) <= 15: 42 | drop_rel_feature += "10-15" 43 | else: 44 | drop_rel_feature += "gt15" 45 | return drop_rel_feature 46 | 47 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph): 48 | mod_pos = int(mod_cand[0]) 49 | mod_word = main_sent_dict[mod_pos][0] 50 | #mod_node = mod_cand[1] 51 | drop_mod_feature = mod_word 52 | return drop_mod_feature 53 | 54 | class Feature_Init: 55 | 56 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph): 57 | # Calculating iLength 58 | iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list) 59 | # Get split tuple pattern 60 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple) 61 | split_feature = 
split_pattern+"_"+str(iLength) 62 | return split_feature 63 | 64 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph): 65 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict) 66 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one 67 | span = boxer_graph.extract_span_min_max(nodeset) 68 | boundaryVal = "false" 69 | if ood_position <= span[0] or ood_position >= span[1]: 70 | boundaryVal = "true" 71 | drop_ood_feature = ood_word+"_"+boundaryVal 72 | return drop_ood_feature 73 | 74 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph): 75 | rel_word = boxer_graph.relations[rel_node]["predicates"] 76 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset) 77 | drop_rel_feature = rel_word+"_"+str(len(rel_span)) 78 | return drop_rel_feature 79 | 80 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph): 81 | mod_pos = int(mod_cand[0]) 82 | mod_word = main_sent_dict[mod_pos][0] 83 | #mod_node = mod_cand[1] 84 | drop_mod_feature = mod_word 85 | return drop_mod_feature 86 | 87 | -------------------------------------------------------------------------------- /source/methods_training_graph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #description : Methods for training graph exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | from nltk.metrics.distance import edit_distance 10 | 11 | # Compare edit distance 12 | def compare_edit_distance(operator,edit_dist_after_drop, edit_dist_before_drop): 13 | if operator == "lt": 14 | if edit_dist_after_drop < edit_dist_before_drop: 15 | return True 16 | else: 17 | return False 18 | 19 | if operator == "lteq": 20 | if edit_dist_after_drop <= edit_dist_before_drop: 21 | return True 22 | else: 23 | return False 24 | 25 | # Split Candidate: Common for all clsses 26 | def process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph): 27 | if len(split_candidate) != len(simple_sentences): 28 | # Number of events is less than number of simple sentences 29 | return False, [] 30 | 31 | else: 32 | # Calculate all parent and following subtrees 33 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict() 34 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict) 35 | 36 | node_overlap_dict = {} 37 | for nodename in split_candidate: 38 | split_nodeset = parent_subgraph_nodeset_dict[nodename] 39 | subsentence = boxer_graph.extract_main_sentence(split_nodeset, main_sent_dict, []) 40 | subsentence_words_set = set(subsentence.split()) 41 | 42 | overlap_data = [] 43 | for index in range(len(simple_sentences)): 44 | simple_sent_words_set = set(simple_sentences[index].split()) 45 | overlap_words_set = subsentence_words_set & simple_sent_words_set 46 | overlap_data.append((len(overlap_words_set), index)) 47 | overlap_data.sort(reverse=True) 48 | 49 | node_overlap_dict[nodename] = overlap_data[0] 50 | 51 | # Check that every node has some overlap in their maximum overlap else fail 52 | overlap_maxvalues = [node_overlap_dict[node][0] for node in node_overlap_dict] 53 | if 0 in 
overlap_maxvalues: 54 | return False, [] 55 | else: 56 | # check the mapping covers all simple sentences 57 | overlap_max_simple_indixes = [node_overlap_dict[node][1] for node in node_overlap_dict] 58 | if len(set(overlap_max_simple_indixes)) == len(simple_sentences): 59 | # Thats a valid split, attach unprocessed graph components 60 | node_subgraph_nodeset_dict, node_span_dict = boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict) 61 | 62 | results = [] 63 | for nodename in split_candidate: 64 | span = node_span_dict[nodename] 65 | nodeset = node_subgraph_nodeset_dict[nodename][:] 66 | simple_sentence = simple_sentences[node_overlap_dict[nodename][1]] 67 | results.append((span, nodeset, nodename, simple_sentence)) 68 | # Sort them based on starting 69 | results.sort() 70 | return True, results 71 | else: 72 | return False, [] 73 | 74 | # functions : Drop-REL Candidate 75 | def process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, overlap_percentage): 76 | simple_sentence = " ".join(simple_sentences) 77 | simple_words = simple_sentence.split() 78 | 79 | rel_phrase = boxer_graph.extract_relation_phrase(relnode_candidate, nodeset, main_sent_dict, filtered_mod_pos) 80 | 81 | #print relnode_candidate, rel_phrase 82 | 83 | rel_words = rel_phrase.split() 84 | if len(rel_words) == 0: 85 | return True 86 | else: 87 | found = 0 88 | for word in rel_words: 89 | if word in simple_words: 90 | found += 1 91 | percentage_found = found/float(len(rel_words)) 92 | 93 | if percentage_found <= overlap_percentage: 94 | return True 95 | else: 96 | return False 97 | 98 | def process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_rel): 99 | simple_sentence = " ".join(simple_sentences) 100 | 101 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 102 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 103 | 104 | temp_nodeset, temp_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_candidate, filtered_mod_pos) 105 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, temp_filtered_mod_pos) 106 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 107 | 108 | isDrop = compare_edit_distance(opr_drop_rel, edit_dist_after_drop, edit_dist_before_drop) 109 | return isDrop 110 | 111 | # functions : Drop-MOD Candidate 112 | def process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_mod): 113 | simple_sentence = " ".join(simple_sentences) 114 | 115 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 116 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 117 | 118 | modcand_position_to_process = modcand_to_process[0] 119 | temp_filtered_mod_pos = filtered_mod_pos[:]+[modcand_position_to_process] 120 | sentence_after_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, temp_filtered_mod_pos) 121 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 122 | 123 | isDrop = compare_edit_distance(opr_drop_mod, edit_dist_after_drop, edit_dist_before_drop) 124 | return isDrop 125 | 126 | # functions : Drop-OOD Candidate 127 | def 
process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_ood): 128 | simple_sentence = " ".join(simple_sentences) 129 | 130 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 131 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 132 | 133 | temp_nodeset = nodeset[:] 134 | temp_nodeset.remove(oodnode_candidate) 135 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, filtered_mod_pos) 136 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 137 | 138 | isDrop = compare_edit_distance(opr_drop_ood, edit_dist_after_drop, edit_dist_before_drop) 139 | return isDrop 140 | 141 | class Method_OVERLAP_LED: 142 | def __init__(self, overlap_percentage, opr_drop_mod, opr_drop_ood): 143 | self.overlap_percentage = overlap_percentage 144 | self.opr_drop_mod = opr_drop_mod 145 | self.opr_drop_ood = opr_drop_ood 146 | 147 | # Split candidate 148 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph): 149 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 150 | return isSplit, results 151 | 152 | # Drop-REL Candidate 153 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 154 | isDrop = process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.overlap_percentage) 155 | return isDrop 156 | 157 | # Drop-MOD Candidate 158 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 159 | isDrop = process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod) 160 | return isDrop 161 | 162 | # Drop-OOD Candidate 163 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 164 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood) 165 | return isDrop 166 | 167 | class Method_LED: 168 | def __init__(self, opr_drop_rel, opr_drop_mod, opr_drop_ood): 169 | self.opr_drop_rel = opr_drop_rel 170 | self.opr_drop_mod = opr_drop_mod 171 | self.opr_drop_ood = opr_drop_ood 172 | 173 | # Split candidate 174 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph): 175 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 176 | return isSplit, results 177 | 178 | # Drop-REL Candidate 179 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 180 | isDrop = process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_rel) 181 | return isDrop 182 | 183 | # Drop-MOD Candidate 184 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 185 | isDrop = 
process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod) 186 | return isDrop 187 | 188 | # Drop-OOD Candidate 189 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 190 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood) 191 | return isDrop 192 | -------------------------------------------------------------------------------- /source/saxparser_xml_stanfordtokenized_boxergraph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : saxparser_xml_stanfordtokenized_boxergraph.py = 4 | #description : Boxer-Graph-XML-Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from xml.sax import handler, make_parser 11 | 12 | from boxer_graph_module import Boxer_Graph 13 | from explore_training_graph import Explore_Training_Graph 14 | 15 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph: 16 | def __init__(self, process, xmlfile, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 17 | # process: "training" or "testing" 18 | self.process = process 19 | 20 | self.xmlfile = xmlfile 21 | 22 | # output_stream: file stream for training and dictionary for testing 23 | self.output_stream = output_stream 24 | 25 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 26 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 27 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 28 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 29 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 30 | 31 | def parse_xmlfile_generating_training_graph(self): 32 | handler = SAX_Handler(self.process, self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE, 33 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH) 34 | 35 | parser = make_parser() 36 | parser.setContentHandler(handler) 37 | parser.parse(self.xmlfile) 38 | 39 | class SAX_Handler(handler.ContentHandler): 40 | def __init__(self, process, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, 41 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 42 | self.process = process 43 | self.output_stream = output_stream 44 | 45 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 46 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 47 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 48 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 49 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 50 | 51 | # Training Graph Creator 52 | self.training_graph_handler = Explore_Training_Graph(self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE, 53 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH) 54 | 55 | # Sentence Data 56 | self.sentid = "" 57 | self.main_sentence = "" 58 | self.main_sent_dict = {} 59 | self.boxer_graph = Boxer_Graph() 60 | self.simple_sentences = [] 61 | 62 | # Sentence Flags, temporary variables 63 | self.isMain = False 64 | 65 | self.isS = False
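# (Editor's note, not part of the original source.) The is* flags and the temporary buffers below record which XML element the SAX stream is currently inside; characters() at the bottom of this class appends incoming text only while the corresponding flag is set.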
66 | self.sentence = "" 67 | self.wordlist = [] 68 | 69 | self.isW = False 70 | self.word = "" 71 | self.wid = "" 72 | self.wpos = "" 73 | 74 | self.isSimple = False 75 | 76 | # Boxer flags, temporary variables 77 | self.isNode = False 78 | self.isRel = False 79 | self.symbol = "" 80 | self.predsymbol = "" 81 | self.locationlist = [] 82 | 83 | def startDocument(self): 84 | print "Start parsing the document ..." 85 | 86 | def endDocument(self): 87 | print "End parsing the document ..." 88 | 89 | def startElement(self, nameElt, attrOfElt): 90 | if nameElt == "sentence": 91 | self.sentid = attrOfElt["id"] 92 | 93 | # Refreshing Sentence Data 94 | self.main_sentence = "" 95 | self.main_sent_dict = {} 96 | self.boxer_graph = Boxer_Graph() 97 | self.simple_sentences = [] 98 | 99 | if nameElt == "main": 100 | self.isMain = True 101 | 102 | if nameElt == "simple": 103 | self.isSimple = True 104 | 105 | if nameElt == "s": 106 | self.isS = True 107 | self.sentence = "" 108 | self.wordlist = [] 109 | 110 | if nameElt == "w": 111 | self.isW = True 112 | self.wid = int(attrOfElt["id"][1:]) 113 | self.wpos = attrOfElt["pos"] 114 | self.word = "" 115 | 116 | if nameElt == "node": 117 | self.isNode = True 118 | self.symbol = attrOfElt["sym"] 119 | self.boxer_graph.nodes[self.symbol] = {"positions":[], "predicates":[]} 120 | 121 | if nameElt == "rel": 122 | self.isRel = True 123 | self.symbol = attrOfElt["sym"] 124 | self.boxer_graph.relations[self.symbol] = {"positions":[], "predicates":""} 125 | 126 | if nameElt == "span": 127 | self.locationlist = [] 128 | 129 | if nameElt == "pred": 130 | self.locationlist = [] 131 | self.predsymbol = attrOfElt["sym"] 132 | 133 | if nameElt == "loc": 134 | if int(attrOfElt["id"][1:]) in self.main_sent_dict: 135 | self.locationlist.append(int(attrOfElt["id"][1:])) 136 | 137 | if nameElt == "edge": 138 | self.boxer_graph.edges.append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 139 | 140 | def endElement(self, nameElt): 141 | if nameElt == "sentence": 142 | #print self.sentid 143 | # print self.main_sentence 144 | # print self.main_sent_dict 145 | # print self.simple_sentences 146 | # print self.boxer_graph 147 | 148 | if self.process == "training": 149 | self.training_graph_handler.explore_training_graph(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, self.boxer_graph) 150 | 151 | if self.process == "testing": 152 | self.output_stream[self.sentid] = [self.main_sentence, self.main_sent_dict, self.boxer_graph] 153 | 154 | # if len(self.main_sentence) > 600: 155 | # print self.sentid 156 | # if len(self.simple_sentences) == 6: 157 | # print self.sentid 158 | 159 | if int(self.sentid)%10000 == 0: 160 | print self.sentid + " training data processed ..." 
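        # Usage sketch (illustrative only; parameter values are the defaults
        # from start_learning_training_models.py, file names here invented):
        #
        #   # training mode: streams training-graph XML to an open file handle
        #   foutput = open("train.training-graph.xml", "w")
        #   SAXPARSER_XML_StanfordTokenized_BoxerGraph(
        #       "training", "train.boxer-graph.xml", foutput,
        #       ["split", "drop-ood", "drop-rel", "drop-mod"], 2,
        #       ["agent", "patient", "eq", "theme"],
        #       ["jj", "jjr", "jjs", "rb", "rbr", "rbs"],
        #       "method-0.99-lteq-lt").parse_xmlfile_generating_training_graph()
        #
        #   # testing mode: fills the given dict with
        #   # {sentid: [main_sentence, main_sent_dict, boxer_graph]}
        #   test_dict = {}
        #   SAXPARSER_XML_StanfordTokenized_BoxerGraph(
        #       "testing", "complex.boxer-graph.xml", test_dict,
        #       ...same model parameters...).parse_xmlfile_generating_training_graph()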
161 | 162 | if nameElt == "main": 163 | self.isMain = False 164 | if len(self.wordlist) == 0: 165 | self.main_sentence = self.sentence.lower() 166 | else: 167 | self.main_sentence = (" ".join(self.wordlist)).lower() 168 | 169 | if nameElt == "simple": 170 | self.isSimple = False 171 | self.simple_sentences.append(self.sentence.lower()) 172 | 173 | if nameElt == "s": 174 | self.isS = False 175 | 176 | if nameElt == "w": 177 | self.isW = False 178 | self.main_sent_dict[self.wid] = (self.word.lower(), self.wpos.lower()) 179 | self.wordlist.append(self.word.lower()) 180 | 181 | if nameElt == "node": 182 | self.isNode = False 183 | self.boxer_graph.nodes[self.symbol]["predicates"].sort() 184 | 185 | if nameElt == "rel": 186 | self.isRel = False 187 | 188 | if nameElt == "span": 189 | self.locationlist.sort() 190 | if self.isNode: 191 | self.boxer_graph.nodes[self.symbol]["positions"] = self.locationlist[:] 192 | if self.isRel: 193 | self.boxer_graph.relations[self.symbol]["positions"] = self.locationlist[:] 194 | 195 | if nameElt == "pred": 196 | self.locationlist.sort() 197 | if self.isNode: 198 | self.boxer_graph.nodes[self.symbol]["predicates"].append((self.predsymbol, self.locationlist[:])) 199 | if self.isRel: 200 | self.boxer_graph.relations[self.symbol]["predicates"] = self.predsymbol 201 | 202 | def characters(self, chrs): 203 | if self.isS: 204 | self.sentence += chrs 205 | 206 | if self.isW: 207 | self.word += chrs 208 | 209 | -------------------------------------------------------------------------------- /source/saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py = 4 | #description : Boxer-Training-Graph-XML-Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from xml.sax import handler, make_parser 11 | from boxer_graph_module import Boxer_Graph 12 | from training_graph_module import Training_Graph 13 | from em_inside_outside_algorithm import EM_InsideOutside_Optimiser 14 | import copy 15 | 16 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph: 17 | def __init__(self, training_xmlfile, NUM_TRAINING_ITERATION, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT): 18 | self.training_xmlfile = training_xmlfile 19 | self.NUM_TRAINING_ITERATION = NUM_TRAINING_ITERATION 20 | self.smt_sentence_pairs = smt_sentence_pairs 21 | self.probability_tables = probability_tables 22 | self.count_tables = count_tables 23 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 24 | 25 | self.em_io_handler = EM_InsideOutside_Optimiser(self.smt_sentence_pairs, self.probability_tables, self.count_tables, self.METHOD_FEATURE_EXTRACT) 26 | 27 | def parse_to_initialize_probabilitytable(self): 28 | # Initialize probability table and populate self.smt_sentence_pairs 29 | handler = SAX_Handler("init", self.em_io_handler) 30 | parser = make_parser() 31 | parser.setContentHandler(handler) 32 | print "Start parsing "+self.training_xmlfile+" ..." 
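        # One pass over the training-graph file: the "init" handler routes each
        # parsed sentence to the EM optimiser, which initialises the probability
        # tables and fills smt_sentence_pairs for the later SMT step.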
33 |         parser.parse(self.training_xmlfile)
34 | 
35 |     def parse_to_iterate_probabilitytable(self):
36 |         handler = SAX_Handler("iter", self.em_io_handler)
37 |         parser = make_parser()
38 |         parser.setContentHandler(handler)
39 | 
40 |         for count in range(self.NUM_TRAINING_ITERATION):
41 |             print "Starting iteration: "+str(count+1)+" ..."
42 | 
43 |             print "Resetting all counts to ZERO ..."
44 |             self.em_io_handler.reset_count_table()
45 | 
46 |             print "Start parsing "+self.training_xmlfile+" ..."
47 |             parser.parse(self.training_xmlfile)
48 |             print "Ending iteration: "+str(count+1)+" ..."
49 | 
50 |             print "Updating probability table ..."
51 |             self.em_io_handler.update_probability_table()
52 | 
53 | class SAX_Handler(handler.ContentHandler):
54 |     def __init__(self, stage, em_io_handler):
55 |         # "init" or "iter" stage
56 |         self.stage = stage
57 | 
58 |         # EM algorithm handler
59 |         self.em_io_handler = em_io_handler
60 | 
61 |         # Sentence Data
62 |         self.sentid = ""
63 |         self.main_sentence = ""
64 |         self.simple_sentences = []
65 |         self.main_sent_dict = {}
66 |         # Boxer Data
67 |         self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]}
68 |         # Training Graph Data
69 |         self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]}
70 | 
71 |         # Common TAG variables
72 |         self.isS = False
73 |         self.sentence = ""
74 | 
75 |         # Main
76 |         self.isMain = False
77 |         self.isWinfo = False
78 |         self.isW = False
79 |         self.word = ""
80 |         self.wid = ""
81 |         self.wpos = ""
82 | 
83 |         # Simple Set
84 |         self.isSimple = False
85 | 
86 |         # Boxer
87 |         self.isBoxer = False
88 | 
89 |         # TrainingGraph
90 |         self.isTrainingGraph = False
91 | 
92 |         # Node
93 |         self.isNode = False
94 |         self.nodesym = ""
95 | 
96 |         # Span
97 |         self.isSpan = False
98 | 
99 |         # pred
100 |         self.isPred = False
101 |         self.predsym = ""
102 | 
103 |         # relation
104 |         self.isRel = False
105 |         self.relsym = ""
106 | 
107 |         # major oper nodes
108 |         self.isMajorNodes = False
109 |         self.isOperNodes = False
110 | 
111 |         # type
112 |         self.isType = False
113 |         self.type = ""
114 | 
115 |         # Nodeset
116 |         self.isNodeset = False
117 | 
118 |         # Split
119 |         self.isSplitCandidate = False
120 |         self.isSplitCandidateLeft = False
121 |         self.isSC = False
122 | 
123 |         # Out of discourse OOD
124 |         self.isOODCandidates = False
125 |         self.isOODProcessed = False
126 | 
127 |         # Relations
128 |         self.isRelCandidates = False
129 |         self.isRelProcessed = False
130 | 
131 |         # Modifiers
132 |         self.isModCandidates = False
133 |         self.isModposProcessed = False
134 |         self.isModposFiltered = False
135 | 
136 |     def startDocument(self):
137 |         print "Start parsing the document ..."
138 | 
139 |     def endDocument(self):
140 |         print "End parsing the document ..."
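    # Each closing </sentence> rebuilds the Boxer_Graph and Training_Graph
    # objects from the parsed data and hands them to the EM optimiser:
    # stage "init" calls initialize_probabilitytable_smt_input(), stage "iter"
    # calls iterate_over_probabilitytable() (see endElement below).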
141 | 142 | def startElement(self, nameElt, attrOfElt): 143 | if nameElt == "sentence": 144 | self.sentid = attrOfElt["id"] 145 | 146 | # Refreshing Sentence Data 147 | self.main_sentence = "" 148 | self.simple_sentences = [] 149 | self.main_sent_dict = {} 150 | # Refreshing Boxer Data 151 | self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]} 152 | # Refreshing Training Graph Data 153 | self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]} 154 | 155 | if nameElt == "main": 156 | self.isMain = True 157 | 158 | if nameElt == "s": 159 | self.isS = True 160 | self.sentence = "" 161 | 162 | if nameElt == "winfo": 163 | self.isWinfo = True 164 | 165 | if nameElt == "w": 166 | self.isW = True 167 | self.word = "" 168 | self.wid = int(attrOfElt["id"]) 169 | self.wpos = attrOfElt["pos"] 170 | 171 | if nameElt == "simple": 172 | self.isSimple = True 173 | 174 | if nameElt == "box": 175 | self.isBoxer = True 176 | 177 | if nameElt == "train-graph": 178 | self.isTrainingGraph = True 179 | 180 | if nameElt == "major-nodes": 181 | self.isMajorNodes = True 182 | 183 | if nameElt == "oper-nodes": 184 | self.isOperNodes = True 185 | 186 | if nameElt == "node": 187 | self.isNode = True 188 | self.nodesym = attrOfElt["sym"] 189 | 190 | if self.isBoxer == True: 191 | self.boxer_graph["nodes"][self.nodesym] = {"positions": [], "predicates":[]} 192 | 193 | if self.isTrainingGraph == True: 194 | if self.isMajorNodes == True: 195 | self.training_graph["major-nodes"][self.nodesym] = {"type": "", "nodeset": [], "simple-sentences":[], 196 | "split-candidates":[], 197 | "ood-candidates":[], "ood-processed":[], 198 | "rel-candidates":[], "rel-processed":[], 199 | "mod-candidates":[], "modpos-processed":[], "modpos-filtered":[]} 200 | 201 | if self.isOperNodes == True: 202 | self.training_graph["oper-nodes"][self.nodesym] = {"type": "", 203 | "split-candidate":[], "not-split-candidates":[], 204 | "ood-candidate":"", "drop-result":"", 205 | "rel-candidate":"","mod-candidate":""} 206 | 207 | if nameElt == "rel": 208 | self.isRel = True 209 | self.relsym = attrOfElt["sym"] 210 | 211 | if self.isBoxer == True: 212 | self.boxer_graph["relations"][self.relsym] = {"positions": [], "predicates":""} 213 | 214 | if nameElt == "span": 215 | self.isSpan = True 216 | 217 | if nameElt == "pred": 218 | self.isPred = True 219 | self.predsym = attrOfElt["sym"] 220 | 221 | if self.isBoxer == True and self.isNode == True: 222 | self.boxer_graph["nodes"][self.nodesym]["predicates"].append([self.predsym, []]) 223 | 224 | if self.isBoxer == True and self.isRel == True: 225 | self.boxer_graph["relations"][self.relsym]["predicates"] = self.predsym 226 | 227 | if nameElt == "loc": 228 | if self.isBoxer == True and self.isNode == True and self.isSpan == True: 229 | self.boxer_graph["nodes"][self.nodesym]["positions"].append(int(attrOfElt["id"])) 230 | 231 | if self.isBoxer == True and self.isNode == True and self.isPred == True: 232 | self.boxer_graph["nodes"][self.nodesym]["predicates"][-1][1].append(int(attrOfElt["id"])) 233 | 234 | if self.isBoxer == True and self.isRel == True and self.isSpan == True: 235 | self.boxer_graph["relations"][self.relsym]["positions"].append(int(attrOfElt["id"])) 236 | 237 | if self.isModposProcessed == True: 238 | if self.isMajorNodes == True: 239 | self.training_graph["major-nodes"][self.nodesym]["modpos-processed"].append(int(attrOfElt["id"])) 240 | 241 | if self.isModposFiltered == True: 242 | if self.isMajorNodes == True: 243 | 
self.training_graph["major-nodes"][self.nodesym]["modpos-filtered"].append(int(attrOfElt["id"])) 244 | 245 | if nameElt == "edge": 246 | if self.isBoxer == True: 247 | self.boxer_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 248 | 249 | if self.isTrainingGraph == True: 250 | self.training_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 251 | 252 | if nameElt == "type": 253 | self.isType = True 254 | self.type = "" 255 | 256 | if nameElt == "nodeset": 257 | self.isNodeset = True 258 | 259 | if nameElt == "n": 260 | if self.isNodeset == True: 261 | if self.isMajorNodes == True: 262 | self.training_graph["major-nodes"][self.nodesym]["nodeset"].append(attrOfElt["sym"]) 263 | if self.isSC == True: 264 | if self.isSplitCandidate == True: 265 | if self.isMajorNodes == True: 266 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"][-1].append(attrOfElt["sym"]) 267 | if self.isOperNodes == True: 268 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"].append(attrOfElt["sym"]) 269 | if self.isSplitCandidateLeft == True: 270 | if self.isOperNodes == True: 271 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"][-1].append(attrOfElt["sym"]) 272 | 273 | if self.isOODCandidates == True: 274 | if self.isMajorNodes == True: 275 | self.training_graph["major-nodes"][self.nodesym]["ood-candidates"].append(attrOfElt["sym"]) 276 | if self.isOperNodes == True: 277 | self.training_graph["oper-nodes"][self.nodesym]["ood-candidate"] = attrOfElt["sym"] 278 | 279 | if self.isOODProcessed == True: 280 | if self.isMajorNodes == True: 281 | self.training_graph["major-nodes"][self.nodesym]["ood-processed"].append(attrOfElt["sym"]) 282 | 283 | if self.isRelCandidates == True: 284 | if self.isMajorNodes == True: 285 | self.training_graph["major-nodes"][self.nodesym]["rel-candidates"].append(attrOfElt["sym"]) 286 | if self.isOperNodes == True: 287 | self.training_graph["oper-nodes"][self.nodesym]["rel-candidate"] = attrOfElt["sym"] 288 | 289 | if self.isRelProcessed == True: 290 | if self.isMajorNodes == True: 291 | self.training_graph["major-nodes"][self.nodesym]["rel-processed"].append(attrOfElt["sym"]) 292 | 293 | if self.isModCandidates == True: 294 | if self.isMajorNodes == True: 295 | self.training_graph["major-nodes"][self.nodesym]["mod-candidates"].append((attrOfElt["loc"], attrOfElt["sym"])) 296 | if self.isOperNodes == True: 297 | self.training_graph["oper-nodes"][self.nodesym]["mod-candidate"] = (attrOfElt["loc"], attrOfElt["sym"]) 298 | 299 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied": 300 | self.isSplitCandidate = True 301 | 302 | if nameElt == "split-candidate-left": 303 | self.isSplitCandidateLeft = True 304 | 305 | if nameElt == "sc": 306 | self.isSC = True 307 | if self.isSplitCandidate == True: 308 | if self.isMajorNodes == True: 309 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"].append([]) 310 | if self.isOperNodes == True: 311 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"] = [] 312 | if self.isSplitCandidateLeft == True: 313 | if self.isOperNodes == True: 314 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"].append([]) 315 | 316 | if nameElt == "ood-candidate" or nameElt == "ood-candidates": 317 | self.isOODCandidates = True 318 | 319 | if nameElt == "ood-processed": 320 | self.isOODProcessed = True 321 | 322 | if nameElt == "rel-candidate" or nameElt == "rel-candidates": 323 | 
self.isRelCandidates = True 324 | 325 | if nameElt == "rel-processed": 326 | self.isRelProcessed = True 327 | 328 | if nameElt == "mod-candidate" or nameElt == "mod-candidates": 329 | self.isModCandidates = True 330 | 331 | if nameElt == "mod-loc-processed": 332 | self.isModposProcessed = True 333 | 334 | if nameElt == "mod-loc-filtered": 335 | self.isModposFiltered = True 336 | 337 | if nameElt == "is-dropped": 338 | if self.isOperNodes == True: 339 | self.training_graph["oper-nodes"][self.nodesym]["drop-result"] = attrOfElt["val"] 340 | 341 | def endElement(self, nameElt): 342 | if nameElt == "sentence": 343 | # print self.sentid 344 | # print 345 | # print self.main_sentence 346 | # print 347 | # print self.main_sent_dict 348 | # print 349 | # print self.simple_sentences 350 | # print 351 | # print self.boxer_graph 352 | # print 353 | # print self.training_graph 354 | 355 | # Creating the original format of Boxer and Training Graph 356 | final_boxer_graph = Boxer_Graph() 357 | for nodename in self.boxer_graph["nodes"]: 358 | final_boxer_graph.nodes[nodename] = copy.copy(self.boxer_graph["nodes"][nodename]) 359 | for nodename in self.boxer_graph["relations"]: 360 | final_boxer_graph.relations[nodename] = copy.copy(self.boxer_graph["relations"][nodename]) 361 | final_boxer_graph.edges = self.boxer_graph["edges"][:] 362 | 363 | final_training_graph = Training_Graph() 364 | for nodename in self.training_graph["major-nodes"]: 365 | nodedict = self.training_graph["major-nodes"][nodename] 366 | if nodedict["type"] == "split": 367 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["split-candidates"][:]) 368 | if nodedict["type"] == "drop-rel": 369 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["rel-candidates"][:], 370 | nodedict["rel-processed"][:], nodedict["modpos-filtered"][:]) 371 | if nodedict["type"] == "drop-mod": 372 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["mod-candidates"][:], 373 | nodedict["modpos-processed"][:], nodedict["modpos-filtered"][:]) 374 | if nodedict["type"] == "drop-ood": 375 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["ood-candidates"][:], 376 | nodedict["ood-processed"][:], nodedict["modpos-filtered"][:]) 377 | if nodedict["type"] == "fin": 378 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["modpos-filtered"][:]) 379 | for nodename in self.training_graph["oper-nodes"]: 380 | nodedict = self.training_graph["oper-nodes"][nodename] 381 | if nodedict["type"] == "split": 382 | if len(nodedict["split-candidate"]) == 0: 383 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], None, nodedict["not-split-candidates"][:]) 384 | else: 385 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["split-candidate"], nodedict["not-split-candidates"][:]) 386 | if nodedict["type"] == "drop-rel": 387 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["rel-candidate"], nodedict["drop-result"]) 388 | if nodedict["type"] == "drop-mod": 389 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["mod-candidate"], nodedict["drop-result"]) 390 | if nodedict["type"] == "drop-ood": 391 | 
final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["ood-candidate"], nodedict["drop-result"]) 392 | final_training_graph.edges = self.training_graph["edges"][:] 393 | 394 | # Process various stage "init" or "iter" 395 | if self.stage == "init": 396 | self.em_io_handler.initialize_probabilitytable_smt_input(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph) 397 | 398 | if self.stage == "iter": 399 | self.em_io_handler.iterate_over_probabilitytable(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph) 400 | 401 | if int(self.sentid)%10000 == 0: 402 | print self.sentid + " training data processed ..." 403 | 404 | if nameElt == "main": 405 | self.isMain = False 406 | 407 | if nameElt == "s": 408 | self.isS = False 409 | 410 | if self.isMain == True: 411 | self.main_sentence = self.sentence 412 | 413 | if self.isSimple == True: 414 | if self.isNode == True: 415 | if self.isMajorNodes == True: 416 | self.training_graph["major-nodes"][self.nodesym]["simple-sentences"].append(self.sentence) 417 | else: 418 | self.simple_sentences.append(self.sentence) 419 | 420 | if nameElt == "winfo": 421 | self.isWinfo = False 422 | 423 | if nameElt == "w": 424 | self.isW = False 425 | 426 | if self.isWinfo == True: 427 | self.main_sent_dict[self.wid] = (self.word, self.wpos) 428 | 429 | if nameElt == "simple": 430 | self.isSimple = False 431 | 432 | if nameElt == "box": 433 | self.isBoxer = False 434 | 435 | if nameElt == "train-graph": 436 | self.isTrainingGraph = False 437 | 438 | if nameElt == "major-nodes": 439 | self.isMajorNodes = False 440 | 441 | if nameElt == "oper-nodes": 442 | self.isOperNodes = False 443 | 444 | if nameElt == "node": 445 | self.isNode = False 446 | 447 | if nameElt == "rel": 448 | self.isRel = False 449 | 450 | if nameElt == "span": 451 | self.isSpan = False 452 | 453 | if nameElt == "pred": 454 | self.isPred = False 455 | 456 | if nameElt == "type": 457 | self.isType = False 458 | if self.isMajorNodes == True: 459 | self.training_graph["major-nodes"][self.nodesym]["type"] = self.type 460 | if self.isOperNodes == True: 461 | self.training_graph["oper-nodes"][self.nodesym]["type"] = self.type 462 | 463 | if nameElt == "nodeset": 464 | self.isNodeset = False 465 | 466 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied": 467 | self.isSplitCandidate = False 468 | 469 | if nameElt == "split-candidate-left": 470 | self.isSplitCandidateLeft = False 471 | 472 | if nameElt == "sc": 473 | self.isSC = False 474 | 475 | if nameElt == "ood-candidate" or nameElt == "ood-candidates": 476 | self.isOODCandidates = False 477 | 478 | if nameElt == "ood-processed": 479 | self.isOODProcessed = False 480 | 481 | if nameElt == "rel-candidate" or nameElt == "rel-candidates": 482 | self.isRelCandidates = False 483 | 484 | if nameElt == "rel-processed": 485 | self.isRelProcessed = False 486 | 487 | if nameElt == "mod-candidate" or nameElt == "mod-candidates": 488 | self.isModCandidates = False 489 | 490 | if nameElt == "mod-loc-processed": 491 | self.isModposProcessed = False 492 | 493 | if nameElt == "mod-loc-filtered": 494 | self.isModposFiltered = False 495 | 496 | def characters(self, chrs): 497 | if self.isS: 498 | self.sentence += chrs 499 | 500 | if self.isW: 501 | self.word += chrs 502 | 503 | if self.isType: 504 | self.type += chrs 505 | -------------------------------------------------------------------------------- 
/source/training_graph_module.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : training_graph_module.py =
4 | #description : Define Training Graph =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 | 
10 | 
11 | import xml.etree.ElementTree as ET
12 | import copy
13 | 
14 | class Training_Graph:
15 |     def __init__(self):
16 |         '''
17 |         self.major_nodes["MN-*"]
18 |             ("split", nodeset, simple_sentences, split_candidate_tuples)
19 |             ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos)
20 |             ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos)
21 |             ("drop-ood", nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos)
22 |             ("fin", nodeset, simple_sentences, filtered_mod_pos)
23 | 
24 |         self.oper_nodes["ON-*"]
25 |             ("split", split_candidate, not_applied_cands)
26 |             ("split", None, not_applied_cands)
27 |             ("drop-rel", relnode_to_process, "True")
28 |             ("drop-rel", relnode_to_process, "False")
29 |             ("drop-mod", modcand_to_process, "True")
30 |             ("drop-mod", modcand_to_process, "False")
31 |             ("drop-ood", oodnode_to_process, "True")
32 |             ("drop-ood", oodnode_to_process, "False")
33 | 
34 |         self.edges = [(par, dep, lab)]
35 | 
36 |         '''
37 |         self.major_nodes = {}
38 |         self.oper_nodes = {}
39 |         self.edges = []
40 | 
41 |     def get_majornode_type(self, majornode_name):
42 |         majornode_tuple = self.major_nodes[majornode_name]
43 |         return majornode_tuple[0]
44 | 
45 |     def get_majornode_nodeset(self, majornode_name):
46 |         majornode_tuple = self.major_nodes[majornode_name]
47 |         return majornode_tuple[1]
48 | 
49 |     def get_majornode_simple_sentences(self, majornode_name):
50 |         majornode_tuple = self.major_nodes[majornode_name]
51 |         return majornode_tuple[2]
52 | 
53 |     def get_majornode_oper_candidates(self, majornode_name):
54 |         majornode_tuple = self.major_nodes[majornode_name]
55 |         if majornode_tuple[0] != "fin":
56 |             return majornode_tuple[3]
57 |         else:
58 |             return []
59 | 
60 |     def get_majornode_processed_oper_candidates(self, majornode_name):
61 |         majornode_tuple = self.major_nodes[majornode_name]
62 |         if majornode_tuple[0] != "fin" and majornode_tuple[0] != "split":
63 |             return majornode_tuple[4]
64 |         else:
65 |             return []
66 | 
67 |     def get_majornode_filtered_postions(self, majornode_name):
68 |         majornode_tuple = self.major_nodes[majornode_name]
69 |         if majornode_tuple[0] == "fin":
70 |             return majornode_tuple[3]
71 |         elif majornode_tuple[0] == "drop-rel" or majornode_tuple[0] == "drop-mod" or majornode_tuple[0] == "drop-ood":
72 |             return majornode_tuple[5]
73 |         else:
74 |             return []
75 | 
76 |     def get_opernode_type(self, opernode_name):
77 |         opernode_tuple = self.oper_nodes[opernode_name]
78 |         return opernode_tuple[0]
79 | 
80 |     def get_opernode_oper_candidate(self, opernode_name):
81 |         opernode_tuple = self.oper_nodes[opernode_name]
82 |         return opernode_tuple[1]
83 | 
84 |     def get_opernode_failed_oper_candidates(self, opernode_name):
85 |         opernode_tuple = self.oper_nodes[opernode_name]
86 |         if opernode_tuple[0] == "split":
87 |             return opernode_tuple[2]
88 |         else:
89 |             return []
90 | 
91 |     def get_opernode_drop_result(self, opernode_name):
92 |         opernode_tuple = 
self.oper_nodes[opernode_name] 93 | if opernode_tuple[0] != "split": 94 | return opernode_tuple[2] 95 | else: 96 | return None 97 | 98 | # @@@@@@@@@@@@@@@@@@@@@ Create nodes @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 99 | 100 | def create_majornode(self, majornode_data): 101 | copy_data = copy.copy(majornode_data) 102 | 103 | # Check if node exists 104 | for node_name in self.major_nodes: 105 | node_data = self.major_nodes[node_name] 106 | if node_data == copy_data: 107 | return node_name, False 108 | 109 | # Otherwise create new node 110 | majornode_name = "MN-"+str(len(self.major_nodes)+1) 111 | self.major_nodes[majornode_name] = copy_data 112 | return majornode_name, True 113 | 114 | def create_opernode(self, opernode_data): 115 | copy_data = copy.copy(opernode_data) 116 | opernode_name = "ON-"+str(len(self.oper_nodes)+1) 117 | self.oper_nodes[opernode_name] = copy_data 118 | return opernode_name 119 | 120 | def create_edge(self, edge_data): 121 | self.edges.append(copy.copy(edge_data)) 122 | 123 | # @@@@@@@@@@@@@@@@@@@@@@@@ Final sentences @@@@@@@@@@@@@@@@@@@@@@@@@@ 124 | 125 | def get_final_sentences(self, main_sentence, main_sent_dict, boxer_graph): 126 | fin_nodes = self.find_all_fin_majornode() 127 | print 128 | node_sent = [] 129 | for node in fin_nodes: 130 | # intpart = int(node[3:]) # removing "MN-", lower int part sentence comes before 131 | if boxer_graph.isEmpty(): 132 | #main_sentence = main_sentence.encode('utf-8') 133 | simple_sentences = self.get_majornode_simple_sentences(node) 134 | simple_sentence = " ".join(simple_sentences) 135 | #node_sent.append((intpart, main_sentence, simple_sentence)) 136 | 137 | node_span = (0, len(main_sentence.split())) 138 | node_sent.append((node_span, main_sentence, simple_sentence)) 139 | 140 | else: 141 | nodeset = self.get_majornode_nodeset(node) 142 | node_span = boxer_graph.extract_span_min_max(nodeset) 143 | filtered_pos = self.get_majornode_filtered_postions(node) 144 | main_sentence = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_pos) 145 | simple_sentences = self.get_majornode_simple_sentences(node) 146 | simple_sentence = " ".join(simple_sentences) 147 | #node_sent.append((intpart, main_sentence, simple_sentence)) 148 | node_sent.append((node_span, main_sentence, simple_sentence)) 149 | node_sent.sort() 150 | sentence_pairs = [(item[1], item[2]) for item in node_sent] 151 | #sentence_pairs = [(item[1].encode('utf-8'), item[2].encode('utf-8')) for item in node_sent] 152 | #print sentence_pairs 153 | return sentence_pairs 154 | 155 | 156 | # @@@@@@@@@@@@@@@@@@@@@ Find nodes in Training Graph @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 157 | 158 | def find_all_fin_majornode(self): 159 | fin_nodes = [] 160 | for major_node in self.major_nodes: 161 | if self.major_nodes[major_node][0] == "fin": 162 | fin_nodes.append(major_node) 163 | return fin_nodes 164 | 165 | def find_children_of_majornode(self, major_node): 166 | children_oper_nodes = [] 167 | for edge in self.edges: 168 | if edge[0] == major_node: 169 | children_oper_nodes.append(edge[1]) 170 | return children_oper_nodes 171 | 172 | def find_children_of_opernode(self, oper_node): 173 | children_major_nodes = [] 174 | for edge in self.edges: 175 | if edge[0] == oper_node: 176 | children_major_nodes.append(edge[1]) 177 | return children_major_nodes 178 | 179 | def find_parents_of_majornode(self, major_node): 180 | parents_oper_nodes = [] 181 | for edge in self.edges: 182 | if edge[1] == major_node: 183 | parent_oper_node = edge[0] 184 | 
parents_oper_nodes.append(parent_oper_node) 185 | return parents_oper_nodes 186 | 187 | def find_parent_of_opernode(self, oper_node): 188 | parent_major_node = "" 189 | for edge in self.edges: 190 | if edge[1] == oper_node: 191 | parent_major_node = edge[0] 192 | break 193 | return parent_major_node 194 | 195 | # @@@@@@@@@@@@ Training Graph -> Elementary Tree @@@@@@@@@@@@@@@@@@@@ 196 | 197 | def convert_to_elementarytree(self): 198 | traininggraph = ET.Element("train-graph") 199 | 200 | # Major nodes 201 | major_nodes_elt = ET.SubElement(traininggraph, "major-nodes") 202 | for major_nodename in self.major_nodes: 203 | major_nodetype = self.get_majornode_type(major_nodename) 204 | major_nodeset = self.get_majornode_nodeset(major_nodename) 205 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename) 206 | oper_candidates = self.get_majornode_oper_candidates(major_nodename) 207 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename) 208 | filtered_postions = self.get_majornode_filtered_postions(major_nodename) 209 | 210 | major_node_elt = ET.SubElement(major_nodes_elt, "node") 211 | major_node_elt.attrib = {"sym":major_nodename} 212 | 213 | # Opertype 214 | major_nodetype_elt = ET.SubElement(major_node_elt, "type") 215 | major_nodetype_elt.text = major_nodetype 216 | 217 | # Nodeset 218 | major_nodeset_elt = ET.SubElement(major_node_elt, "nodeset") 219 | for node in major_nodeset: 220 | node_elt = ET.SubElement(major_nodeset_elt, "n") 221 | node_elt.attrib = {"sym":node} 222 | 223 | # Simple sentences 224 | major_simple_sentences_elt = ET.SubElement(major_node_elt, "simple-set") 225 | for simple_sentence in major_simple_sentences: 226 | major_simple_sentence_elt = ET.SubElement(major_simple_sentences_elt, "simple") 227 | sent_data_elt = ET.SubElement(major_simple_sentence_elt, "s") 228 | sent_data_elt.text = simple_sentence 229 | 230 | # Oper Candidates 231 | if major_nodetype == "split": 232 | split_candidate_tuples = oper_candidates 233 | major_split_candidates_elt = ET.SubElement(major_node_elt, "split-candidates") 234 | for split_candidate in split_candidate_tuples: 235 | major_split_candidate_elt = ET.SubElement(major_split_candidates_elt, "sc") 236 | for node in split_candidate: 237 | node_elt = ET.SubElement(major_split_candidate_elt, "n") 238 | node_elt.attrib = {"sym":str(node)} 239 | 240 | if major_nodetype == "drop-rel": 241 | relnode_set = oper_candidates 242 | major_relnode_set_elt = ET.SubElement(major_node_elt, "rel-candidates") 243 | for node in relnode_set: 244 | node_elt = ET.SubElement(major_relnode_set_elt, "n") 245 | node_elt.attrib = {"sym":str(node)} 246 | 247 | processed_relnodes = processed_oper_candidates 248 | major_processed_relnodes_elt = ET.SubElement(major_node_elt, "rel-processed") 249 | for node in processed_relnodes: 250 | node_elt = ET.SubElement(major_processed_relnodes_elt, "n") 251 | node_elt.attrib = {"sym":str(node)} 252 | 253 | filtered_mod_pos = filtered_postions 254 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 255 | for node in filtered_mod_pos: 256 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 257 | node_elt.attrib = {"id":str(node)} 258 | 259 | if major_nodetype == "drop-mod": 260 | modcand_set = oper_candidates 261 | major_modcand_set_elt = ET.SubElement(major_node_elt, "mod-candidates") 262 | for node in modcand_set: 263 | node_elt = ET.SubElement(major_modcand_set_elt, "n") 264 | node_elt.attrib = {"sym":node[1],"loc":str(node[0])} 265 | 
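                # Serialised shape sketch for a drop-mod major node (nodeset and
                # simple-set children omitted; symbol/position values invented):
                #   <node sym="MN-4">
                #     <type>drop-mod</type>
                #     <mod-candidates><n sym="x5" loc="7"/></mod-candidates>
                #     <mod-loc-processed><loc id="7"/></mod-loc-processed>
                #     <mod-loc-filtered><loc id="7"/></mod-loc-filtered>
                #   </node>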
266 | processed_mod_pos = processed_oper_candidates 267 | major_processed_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-processed") 268 | for node in processed_mod_pos: 269 | node_elt = ET.SubElement(major_processed_mod_pos_elt, "loc") 270 | node_elt.attrib = {"id":str(node)} 271 | 272 | filtered_mod_pos = filtered_postions 273 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 274 | for node in filtered_mod_pos: 275 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 276 | node_elt.attrib = {"id":str(node)} 277 | 278 | if major_nodetype == "drop-ood": 279 | oodnode_set = oper_candidates 280 | major_oodnode_set_elt = ET.SubElement(major_node_elt, "ood-candidates") 281 | for node in oodnode_set: 282 | node_elt = ET.SubElement(major_oodnode_set_elt, "n") 283 | node_elt.attrib = {"sym":str(node)} 284 | 285 | processed_oodnodes = processed_oper_candidates 286 | major_processed_oodnodes_elt = ET.SubElement(major_node_elt, "ood-processed") 287 | for node in processed_oodnodes: 288 | node_elt = ET.SubElement(major_processed_oodnodes_elt, "n") 289 | node_elt.attrib = {"sym":str(node)} 290 | 291 | filtered_mod_pos = filtered_postions 292 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 293 | for node in filtered_mod_pos: 294 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 295 | node_elt.attrib = {"id":str(node)} 296 | 297 | if major_nodetype == "fin": 298 | filtered_mod_pos = filtered_postions 299 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 300 | for node in filtered_mod_pos: 301 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 302 | node_elt.attrib = {"id":str(node)} 303 | 304 | # Oper nodes 305 | oper_nodes_elt = ET.SubElement(traininggraph, "oper-nodes") 306 | for oper_nodename in self.oper_nodes: 307 | oper_node_elt = ET.SubElement(oper_nodes_elt, "node") 308 | oper_node_elt.attrib = {"sym":oper_nodename} 309 | 310 | oper_nodedata = self.oper_nodes[oper_nodename] 311 | 312 | # Nodetype 313 | oper_nodetype = oper_nodedata[0] 314 | oper_nodetype_elt = ET.SubElement(oper_node_elt, "type") 315 | oper_nodetype_elt.text = oper_nodetype 316 | 317 | if oper_nodetype == "split": 318 | split_cand_applied = oper_nodedata[1] 319 | split_cand_applied_elt = ET.SubElement(oper_node_elt, "split-candidate-applied") 320 | if split_cand_applied != None: 321 | split_candidate_elt = ET.SubElement(split_cand_applied_elt, "sc") 322 | for node in split_cand_applied: 323 | node_elt = ET.SubElement(split_candidate_elt, "n") 324 | node_elt.attrib = {"sym":node} 325 | 326 | split_cand_left = oper_nodedata[2] 327 | split_cand_left_elt = ET.SubElement(oper_node_elt, "split-candidate-left") 328 | for split_candidate in split_cand_left: 329 | split_candidate_elt = ET.SubElement(split_cand_left_elt, "sc") 330 | for node in split_candidate: 331 | node_elt = ET.SubElement(split_candidate_elt, "n") 332 | node_elt.attrib = {"sym":node} 333 | 334 | if oper_nodetype == "drop-ood": 335 | oodnode_to_process = oper_nodedata[1] 336 | oodnode_to_process_elt = ET.SubElement(oper_node_elt, "ood-candidate") 337 | node_elt = ET.SubElement(oodnode_to_process_elt, "n") 338 | node_elt.attrib = {"sym":oodnode_to_process} 339 | 340 | dropped = oper_nodedata[2] 341 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 342 | dropped_elt.attrib = {"val":dropped} 343 | 344 | if oper_nodetype == "drop-rel": 345 | relnode_to_process = oper_nodedata[1] 346 | relnode_to_process_elt = 
ET.SubElement(oper_node_elt, "rel-candidate") 347 | node_elt = ET.SubElement(relnode_to_process_elt, "n") 348 | node_elt.attrib = {"sym":relnode_to_process} 349 | 350 | dropped = oper_nodedata[2] 351 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 352 | dropped_elt.attrib = {"val":dropped} 353 | 354 | if oper_nodetype == "drop-mod": 355 | modcand_to_process = oper_nodedata[1] 356 | modcand_to_process_elt = ET.SubElement(oper_node_elt, "mod-candidate") 357 | node_elt = ET.SubElement(modcand_to_process_elt, "n") 358 | node_elt.attrib = {"sym":modcand_to_process[1],"loc":str(modcand_to_process[0])} 359 | 360 | dropped = oper_nodedata[2] 361 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 362 | dropped_elt.attrib = {"val":dropped} 363 | 364 | tg_edges_elt = ET.SubElement(traininggraph, "tg-edges") 365 | for tg_edge in self.edges: 366 | tg_edge_elt = ET.SubElement(tg_edges_elt, "edge") 367 | tg_edge_elt.attrib = {"lab":str(tg_edge[2]), "par":tg_edge[0], "dep":tg_edge[1]} 368 | 369 | return traininggraph 370 | 371 | # @@@@@@@@@@@@ Training Graph -> Dot Graph @@@@@@@@@@@@@@@@@@@@ 372 | 373 | def convert_to_dotstring(self, main_sent_dict, boxer_graph): 374 | dot_string = "digraph boxer{\n" 375 | 376 | nodename = 0 377 | node_graph_dict = {} 378 | # Writing Major nodes 379 | for major_nodename in self.major_nodes: 380 | major_nodetype = self.get_majornode_type(major_nodename) 381 | major_nodeset = self.get_majornode_nodeset(major_nodename) 382 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename) 383 | oper_candidates = self.get_majornode_oper_candidates(major_nodename) 384 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename) 385 | filtered_postions = self.get_majornode_filtered_postions(major_nodename) 386 | 387 | main_sentence = boxer_graph.extract_main_sentence(major_nodeset, main_sent_dict, filtered_postions) 388 | simple_sentence_string = " ".join(major_simple_sentences) 389 | major_node_data = [major_nodetype, major_nodeset[:], main_sentence, simple_sentence_string] 390 | 391 | if major_nodetype == "split": 392 | major_node_data += [oper_candidates[:]] 393 | 394 | if major_nodetype == "drop-rel" or major_nodetype == "drop-mod" or major_nodetype == "drop-ood": 395 | major_node_data += [oper_candidates[:], processed_oper_candidates[:], filtered_postions[:]] 396 | 397 | if major_nodetype == "fin": 398 | major_node_data += [filtered_postions[:]] 399 | 400 | major_node_string, nodename = self.textdot_majornode(nodename, major_nodename, major_node_data[:]) 401 | node_graph_dict[major_nodename] = "struct"+str(nodename) 402 | dot_string += major_node_string+"\n" 403 | 404 | # Writing operation nodes 405 | for oper_nodename in self.oper_nodes: 406 | oper_node_string, nodename = self.textdot_opernode(nodename, oper_nodename, self.oper_nodes[oper_nodename]) 407 | node_graph_dict[oper_nodename] = "struct"+str(nodename) 408 | dot_string += oper_node_string+"\n" 409 | 410 | # Writing edges 411 | for edge in self.edges: 412 | par_graphnode = node_graph_dict[edge[0]] 413 | dep_graphnode = node_graph_dict[edge[1]] 414 | dot_string += par_graphnode+" -> "+dep_graphnode+"[label=\""+str(edge[2])+"\"];\n" 415 | dot_string += "}" 416 | return dot_string 417 | 418 | def textdot_majornode(self, nodename, node, nodedata): 419 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{" 420 | textdot_node += "major-node: "+node+"|" 421 | index = 0 422 | for data in nodedata: 423 | textdot_node += self.processtext(str(data)) 
424 | index += 1 425 | if index < len(nodedata): 426 | textdot_node += "|" 427 | textdot_node += "}\"];" 428 | return textdot_node, nodename+1 429 | 430 | def textdot_opernode(self, nodename, node, nodedata): 431 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{" 432 | textdot_node += "oper-node: "+node+"|" 433 | index = 0 434 | for data in nodedata: 435 | textdot_node += self.processtext(str(data)) 436 | index += 1 437 | if index < len(nodedata): 438 | textdot_node += "|" 439 | textdot_node += "}\"];" 440 | return textdot_node, nodename+1 441 | 442 | def processtext(self, inputstring): 443 | linesize = 100 444 | outputstring = "" 445 | index = 0 446 | substr = inputstring[index*linesize:(index+1)*linesize] 447 | while (substr!=""): 448 | outputstring += substr 449 | index += 1 450 | substr = inputstring[index*linesize:(index+1)*linesize] 451 | if substr!="": 452 | outputstring += "\\n" 453 | return outputstring 454 | 455 | # @@@@@@@@@@@@@@@@@@@@@@@@@@ DONE @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 456 | -------------------------------------------------------------------------------- /start_learning_training_models.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : start_learning_training_models.py = 4 | #description : Training = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import os 12 | import argparse 13 | import sys 14 | import datetime 15 | 16 | # Used for wikilarge data: not recommended 17 | sys.setrecursionlimit(10000) 18 | 19 | sys.path.append("./source") 20 | import functions_configuration_file 21 | import functions_model_files 22 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph 23 | from saxparser_xml_stanfordtokenized_boxergraph_traininggraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph 24 | 25 | MOSESDIR = "~/tools/mosesdecoder" 26 | 27 | if __name__=="__main__": 28 | # Command line arguments ############## 29 | argparser = argparse.ArgumentParser(prog='python learn_training_models.py', description=('Start the training process.')) 30 | 31 | # Optional [default value: 1] 32 | argparser.add_argument('--start-state', help='Start state of the training process', choices=['1','2','3'], default='1', metavar=('Start_State')) 33 | 34 | # Optional [default value: 3] 35 | argparser.add_argument('--end-state', help='End state of the training process', choices=['1','2','3'], default='3', metavar=('End_State')) 36 | 37 | # Optional [default value: split:drop-ood:drop-rel:drop-mod] (Any of their combinations, order is not important), drop-ood only applied after split 38 | argparser.add_argument('--transformation', help='Transformation models learned', default="split:drop-ood:drop-rel:drop-mod", metavar=('TRANSFORMATION_MODEL')) 39 | 40 | # Optional [default value: 2] 41 | argparser.add_argument('--max-split', help='Maximum split size', choices=['2','3'], default='2', metavar=('MAX_SPLIT_SIZE')) 42 | 43 | # Optional [default value: agent:patient:eq:theme], (order is not important) 44 | argparser.add_argument('--restricted-drop-rel', help='Restricted drop relations', default="agent:patient:eq:theme", metavar=('RESTRICTED_DROP_REL')) 45 | 46 
# Optional [default value: jj:jjr:jjs:rb:rbr:rbs], (order is not important)
47 |     argparser.add_argument('--allowed-drop-mod', help='Allowed drop modifiers', default="jj:jjr:jjs:rb:rbr:rbs", metavar=('ALLOWED_DROP_MOD'))
48 | 
49 |     # Optional [default value: the most recent method]
50 |     argparser.add_argument('--method-training-graph', help='Operation set for training graph file', choices=['method-led-lt', 'method-led-lteq', 'method-0.5-lteq-lteq',
51 |                                                                                                               'method-0.75-lteq-lt', 'method-0.99-lteq-lt'],
52 |                            default='method-0.99-lteq-lt', metavar=('Method_Training_Graph'))
53 | 
54 |     # Optional [default value: the most recent method]
55 |     argparser.add_argument('--method-feature-extract', help='Operation set for extracting features', choices=['feature-init', 'feature-Nov27'], default='feature-Nov27',
56 |                            metavar=('Method_Feature_Extract'))
57 | 
58 |     # Optional [default value: /home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml]
59 |     argparser.add_argument('--train-boxer-graph', help='The training corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Train_Boxer_Graph'),
60 |                            default='/home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml')
61 | 
62 |     # Optional [default value: 10]
63 |     argparser.add_argument('--num-em', help='The number of EM Algorithm iterations', metavar=('NUM_EM_ITERATION'), default='10')
64 | 
65 |     # Optional [default value: 0:3:/home/ankh/Desktop/Sentence-Simplification/LANGUAGE-MODEL/simplewiki-20131030-data.srilm:0]
66 |     argparser.add_argument('--lang-model', help='Language model information (in the moses format)', metavar=('Lang_Model'),
67 |                            default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/newsela_lm/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.train.dst.arpa.en:0")
68 |     # default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/wikilarge_lm/wiki.full.aner.ori.train.dst.arpa.en:0")
69 |     # default="0:3:/home/ankh/Desktop/Sentence-Simplification/LANGUAGE-MODEL/simplewiki-20131030-data.srilm:0")
70 | 
71 |     # Optional (compulsory when start state is >= 2)
72 |     argparser.add_argument('--d2s-config', help='D2S Configuration file', metavar=('D2S_Config'))
73 | 
74 |     # Compulsory
75 |     argparser.add_argument('--output-dir', help='The output directory', required=True, metavar=('Output_Directory'))
76 |     # #####################################
77 |     args_dict = vars(argparser.parse_args(sys.argv[1:]))
78 |     # #####################################
79 | 
80 |     # Creating the output directory to store training models
81 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
82 |     print timestamp+", Creating the output directory: "+args_dict['output_dir']
83 |     try:
84 |         os.mkdir(args_dict['output_dir'])
85 |         print
86 |     except OSError:
87 |         print args_dict['output_dir'] + " directory already exists.\n"
88 | 
89 |     # Configuration dictionary
90 |     D2S_Config_data = {}
91 |     D2S_Config = args_dict['d2s_config']
92 |     if D2S_Config != None:
93 |         D2S_Config_data = functions_configuration_file.parser_config_file(D2S_Config)
94 |     else:
95 |         D2S_Config_data["TRAIN-BOXER-GRAPH"] = args_dict['train_boxer_graph']
96 |         D2S_Config_data["TRANSFORMATION-MODEL"] = args_dict['transformation'].split(":")
97 |         D2S_Config_data["MAX-SPLIT-SIZE"] = int(args_dict['max_split'])
98 |         D2S_Config_data["RESTRICTED-DROP-RELATION"] = args_dict['restricted_drop_rel'].split(":")
99 |         D2S_Config_data["ALLOWED-DROP-MODIFIER"] = args_dict['allowed_drop_mod'].split(":")
D2S_Config_data["METHOD-TRAINING-GRAPH"] = args_dict['method_training_graph'] 101 | D2S_Config_data["METHOD-FEATURE-EXTRACT"] = args_dict['method_feature_extract'] 102 | D2S_Config_data["NUM-EM-ITERATION"] = int(args_dict['num_em']) 103 | D2S_Config_data["LANGUAGE-MODEL"] = args_dict['lang_model'] 104 | 105 | # Extracting arguments with their default values (default unless its specified) 106 | START_STATE = int(args_dict['start_state']) 107 | END_STATE = int(args_dict['end_state']) 108 | 109 | # Start state: 1, Starting building training graph 110 | state = 1 111 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])): 112 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 113 | print timestamp+", Starting building training graph (Step-"+str(state)+") ..." 114 | 115 | print "Input training file (xml, stanford tokenized and boxer graph): " + D2S_Config_data["TRAIN-BOXER-GRAPH"] + " ..." 116 | TRAIN_TRAINING_GRAPH = args_dict['output_dir']+"/"+os.path.splitext(os.path.basename(D2S_Config_data["TRAIN-BOXER-GRAPH"]))[0]+".training-graph.xml" 117 | print "Generating training graph file (xml, stanford tokenized, boxer graph and training graph): "+TRAIN_TRAINING_GRAPH+" ..." 118 | 119 | foutput = open(TRAIN_TRAINING_GRAPH, "w") 120 | foutput.write("\n") 121 | foutput.write("\n") 122 | 123 | print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..." 124 | training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("training", D2S_Config_data["TRAIN-BOXER-GRAPH"], foutput, D2S_Config_data["TRANSFORMATION-MODEL"], 125 | D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"], 126 | D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"]) 127 | 128 | print "Start generating training graph ..." 129 | print "Start parsing "+D2S_Config_data["TRAIN-BOXER-GRAPH"]+" ..." 130 | training_xml_handler.parse_xmlfile_generating_training_graph() 131 | 132 | foutput.write("\n") 133 | foutput.close() 134 | 135 | D2S_Config_data["TRAIN-TRAINING-GRAPH"] = TRAIN_TRAINING_GRAPH 136 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 137 | print timestamp+", Finished building training graph (Step-"+str(state)+")\n" 138 | 139 | # Start state: 2 140 | state = 2 141 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])): 142 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 143 | print timestamp+", Starting learning transformation models (Step-"+str(state)+") ..." 144 | 145 | if "TRAIN-TRAINING-GRAPH" not in D2S_Config_data: 146 | print "The training graph file (xml, stanford tokenized, boxer graph and training graph) is not available." 147 | print "Please enter the configuration file or start with the State 1." 148 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 149 | print timestamp+", No transformation models learned (Step-"+str(state)+")\n" 150 | exit(0) 151 | 152 | # @ Defining data structure @ 153 | # Stores various sentence pairs (complex, simple) for SMT. 154 | smt_sentence_pairs = {} 155 | # probability tables - store all probabilities 156 | probability_tables = {} 157 | # count tables - store counts in next iteration 158 | count_tables = {} 159 | # @ @ 160 | 161 | print "Creating the em-training XML file (stanford tokenized, boxer graph and training graph) handler ..." 
162 |         em_training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph(D2S_Config_data["TRAIN-TRAINING-GRAPH"], D2S_Config_data["NUM-EM-ITERATION"],
163 |                                                                                             smt_sentence_pairs, probability_tables, count_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
164 | 
165 |         print "Start Expectation Maximization (Inside-Outside) algorithm ..."
166 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
167 |         print timestamp+", Step "+str(state)+".1: Initialization of probability tables and populating smt_sentence_pairs ..."
168 |         em_training_xml_handler.parse_to_initialize_probabilitytable()
169 |         # print probability_tables
170 | 
171 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
172 |         print timestamp+", Step "+str(state)+".2: Start iterating for EM Inside-Outside probabilities ..."
173 |         em_training_xml_handler.parse_to_iterate_probabilitytable()
174 |         # print probability_tables
175 | 
176 |         # Start writing model files
177 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
178 |         print timestamp+", Step "+str(state)+".3: Start writing model files ..."
179 |         # Creating the output directory to store the transformation model files
180 |         model_dir = args_dict['output_dir']+"/TRANSFORMATION-MODEL-DIR"
181 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
182 |         print timestamp+", Creating the output model directory: "+model_dir
183 |         try:
184 |             os.mkdir(model_dir)
185 |         except OSError:
186 |             print model_dir + " directory already exists."
187 |         # Writing model files
188 |         functions_model_files.write_model_files(model_dir, probability_tables, smt_sentence_pairs)
189 | 
190 |         D2S_Config_data["TRANSFORMATION-MODEL-DIR"] = model_dir
191 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
192 |         print timestamp+", Finished learning transformation models (Step-"+str(state)+")\n"
193 | 
194 |     # Start state: 3
195 |     state = 3
196 |     if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])):
197 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
198 |         print timestamp+", Starting learning moses translation model (Step-"+str(state)+") ..."
199 | 
200 |         if "TRANSFORMATION-MODEL-DIR" not in D2S_Config_data:
201 |             print "The moses training files are not available."
202 |             print "Please enter the configuration file or start with State 1."
203 |             timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
204 |             print timestamp+", No moses models learned (Step-"+str(state)+")\n"
205 |             exit(0)
206 | 
207 |         # Preparing the moses directory
208 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
209 |         print timestamp+", Step "+str(state)+".1: Preparing the moses directory ..."
210 |         # Creating the output directory to store moses files
211 |         moses_dir = args_dict['output_dir']+"/MOSES-COMPLEX-SIMPLE-DIR"
212 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
213 |         print timestamp+", Creating the moses directory: "+moses_dir
214 |         try:
215 |             os.mkdir(moses_dir)
216 |         except OSError:
217 |             print moses_dir + " directory already exists."
218 |         # Creating the corpus directory
219 |         moses_corpus_dir = args_dict['output_dir']+"/MOSES-COMPLEX-SIMPLE-DIR/corpus"
220 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
221 |         print timestamp+", Creating the moses corpus directory: "+moses_corpus_dir
222 |         try:
223 |             os.mkdir(moses_corpus_dir)
224 |         except OSError:
225 |             print moses_corpus_dir + " directory already exists."
226 | 227 | # Cleaning the moses training file 228 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 229 | print timestamp+", Step "+str(state)+".2: Cleaning the moses training file ..." 230 | command = MOSESDIR+"/scripts/training/clean-corpus-n.perl "+D2S_Config_data["TRANSFORMATION-MODEL-DIR"]+"/D2S-SMT source target "+moses_corpus_dir+"/D2S-SMT-clean 1 95" 231 | os.system(command) 232 | 233 | # Running moses training 234 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 235 | print timestamp+", Step "+str(state)+".3: Running the moses training ..." 236 | command = (MOSESDIR+"/scripts/training/train-model.perl -mgiza -mgiza-cpus 3 -cores 3 -parallel -sort-buffer-size 3G -sort-batch-size 253 -sort-compress gzip -sort-parallel 3 "+ 237 | "-root-dir "+moses_dir+" -corpus "+moses_corpus_dir+"/D2S-SMT-clean -f source -e target -external-bin-dir "+MOSESDIR+"/mgiza/mgizapp/bin "+ 238 | "-lm "+D2S_Config_data["LANGUAGE-MODEL"]) 239 | os.system(command) 240 | 241 | D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"] = moses_dir 242 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 243 | print timestamp+", Finished learning moses translation model (Step-"+str(state)+")\n" 244 | 245 | # Last Step 246 | config_file = args_dict['output_dir']+"/d2s.ini" 247 | print "Writing the configuration file: "+config_file+" ..." 248 | functions_configuration_file.write_config_file(config_file, D2S_Config_data) 249 | 250 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 251 | print timestamp+", Learning process done!!!" 252 | 253 | -------------------------------------------------------------------------------- /start_simplifying_complex_sentence.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : start_simplifying_complex_sentence.py = 4 | #description : Testing = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | import os 11 | import argparse 12 | import sys 13 | import datetime 14 | from nltk.metrics.distance import edit_distance 15 | 16 | sys.path.append("./source") 17 | import functions_configuration_file 18 | import functions_model_files 19 | import functions_prepare_elementtree_dot 20 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph 21 | from explore_decoder_graph_greedy import Explore_Decoder_Graph_Greedy 22 | from explore_decoder_graph_explorative import Explore_Decoder_Graph_Explorative 23 | 24 | MOSESDIR = "~/tools/mosesdecoder" 25 | 26 | def get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 27 | probability_tables, METHOD_FEATURE_EXTRACT): 28 | mapper_transformation = {} 29 | moses_input = {} 30 | transformation_complex_count = 0 31 | 32 | # Transformation decoder 33 | decoder_graph_explorer = Explore_Decoder_Graph_Greedy(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 34 | probability_tables, METHOD_FEATURE_EXTRACT) 35 | for sentid in test_sentids: 36 | print sentid 37 | sent_data = test_boxerdata_dict[str(sentid)] 38 | main_sentence = sent_data[0] 39 | main_sent_dict = sent_data[1] 40 | boxer_graph = sent_data[2] 41 | 42 | # Explore decoder graph 43 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph) 44 | 45 | # # Generating boxer and decoder graph 46 | # if sentid not in [13, 28, 41]: 47 | # functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, decoder_graph) 48 | 49 | sentence_pairs = decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph) 50 | transformed_sentences = [item[0] for item in sentence_pairs] 51 | 52 | # Writing transformation results 53 | mapper_transformation[sentid] = [] 54 | for sent in transformed_sentences: 55 | mapper_transformation[sentid].append(transformation_complex_count) 56 | moses_input[transformation_complex_count] = sent 57 | transformation_complex_count += 1 58 | return mapper_transformation, moses_input 59 | 60 | def get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 61 | probability_tables, METHOD_FEATURE_EXTRACT): 62 | mapper_transformation = {} 63 | moses_input = {} 64 | transformation_complex_count = 0 65 | 66 | # Transformation decoder 67 | decoder_graph_explorer = Explore_Decoder_Graph_Explorative(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 68 | probability_tables, METHOD_FEATURE_EXTRACT) 69 | for sentid in test_sentids: 70 | print sentid 71 | sent_data = test_boxerdata_dict[str(sentid)] 72 | main_sentence = sent_data[0] 73 | main_sent_dict = sent_data[1] 74 | boxer_graph = sent_data[2] 75 | 76 | # Explore decoder graph 77 | print "Building decoder graph ..." 78 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph) 79 | 80 | # Start updating edges with the probabilities, for unseen : 0.5/0.5 81 | print "Updating probability bottom-up ..." 
82 |         node_probability_dict, potential_edges = decoder_graph_explorer.start_probability_update(main_sentence, main_sent_dict, boxer_graph, decoder_graph)
83 | 
84 |         # Filtered decoder graph
85 |         print "Creating filtered decoder graph ..."
86 |         filtered_decoder_graph = decoder_graph_explorer.create_filtered_decoder_graph(potential_edges, main_sentence, main_sent_dict, boxer_graph, decoder_graph)
87 | 
88 |         # Generating boxer and decoder graph
89 |         functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, filtered_decoder_graph)
90 | 
91 |         sentence_pairs = filtered_decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph)
92 |         transformed_sentences = [item[0] for item in sentence_pairs]
93 | 
94 |         # Writing transformation results
95 |         mapper_transformation[sentid] = []
96 |         for sent in transformed_sentences:
97 |             mapper_transformation[sentid].append(transformation_complex_count)
98 |             moses_input[transformation_complex_count] = sent
99 |             transformation_complex_count += 1
100 |     return mapper_transformation, moses_input
101 | 
102 | if __name__ == "__main__":
103 |     argparser = argparse.ArgumentParser(prog='python start_simplifying_complex_sentence.py', description=('Start simplifying complex sentences.'))
104 | 
105 |     # Optional [default value: /home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml]
106 |     argparser.add_argument('--test-boxer-graph', help='The test corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Test_Boxer_Graph'),
107 |                            default='/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml')
108 | 
109 |     # Optional [default value: 10]
110 |     argparser.add_argument('--nbest-distinct', help='Number of distinct n-best outputs produced by Moses', metavar=('N_Best_Distinct'), default='10')
111 | 
112 |     # Optional [default value: greedy]
113 |     argparser.add_argument('--explore-decoder', help='Method for generating the decoder graph', metavar=('Explore_Decoder'), choices=['greedy', 'explorative'], default='greedy')
114 | 
115 |     # Compulsory
116 |     argparser.add_argument('--d2s-config', help='D2S Configuration file', required=True, metavar=('D2S_Config'))
117 | 
118 |     # Compulsory
119 |     argparser.add_argument('--output-dir', help='The output directory', required=True, metavar=('Output_Directory'))
120 |     # #####################################
121 |     args_dict = vars(argparser.parse_args(sys.argv[1:]))
122 |     # #####################################
123 | 
124 |     # STEP:1 Creating the test directory in the output directory
125 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
126 |     test_output_directory = args_dict['output_dir']+"/Test-Results-"+args_dict["explore_decoder"].upper()
127 |     print timestamp+", Creating test result directory: "+test_output_directory
128 |     try:
129 |         os.mkdir(test_output_directory)
130 |     except OSError:
131 |         print test_output_directory + " directory already exists."
132 | 
133 |     # STEP:2 Configuration dictionary
134 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
135 |     print "\n"+timestamp+", Reading the D2S Configuration file ..."
136 |     D2S_Config_data = functions_configuration_file.parser_config_file(args_dict['d2s_config'])
137 | 
138 |     # STEP:3 Reading transformation model files
139 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
140 |     print "\n"+timestamp+", Reading transformation model files ..."
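    # Editor's note (assumption): these model files are the probability tables
    # written out during training (see the tail of start_learning_training_models.py
    # above); read_model_files is taken to simply load them back into memory as
    # probability_tables so the decoder can score candidate transformations.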
141 |     probability_tables = functions_model_files.read_model_files(D2S_Config_data["TRANSFORMATION-MODEL-DIR"], D2S_Config_data["TRANSFORMATION-MODEL"])
142 | 
143 |     # STEP:4 Reading the test corpus file (xml, stanford-tokenized, boxer-graph)
144 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
145 |     print "\n"+timestamp+", Start reading test corpus file (xml, stanford-tokenized, boxer-graph): "+args_dict['test_boxer_graph']+" ..."
146 |     print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..."
147 |     test_boxerdata_dict = {}
148 |     test_sentids = []
149 |     testing_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("testing", args_dict['test_boxer_graph'], test_boxerdata_dict, D2S_Config_data["TRANSFORMATION-MODEL"],
150 |                                                                      D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"],
151 |                                                                      D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"])
152 |     print "Start parsing "+args_dict['test_boxer_graph']+" ..."
153 |     testing_xml_handler.parse_xmlfile_generating_training_graph()
154 |     test_sentids = [int(item) for item in test_boxerdata_dict.keys()]
155 |     test_sentids.sort()
156 | 
157 |     # STEP:5 Applying the transformation models and creating the output of the transformation
158 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
159 |     print "\n"+timestamp+", Applying the transformation models and writing complex sentences after transformation ..."
160 |     mapper_transformation = {}
161 |     moses_input = {}
162 |     if args_dict["explore_decoder"] == "greedy":
163 |         mapper_transformation, moses_input = get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
164 |                                                                        D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
165 |                                                                        probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
166 |     else:
167 |         mapper_transformation, moses_input = get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
168 |                                                                            D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
169 |                                                                            probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
170 | 
171 |     print "Writing "+test_output_directory+"/transformation-output.moses-input ..."
172 |     d2s_complex_file = open(test_output_directory+"/transformation-output.moses-input", "w")
173 |     for sentid in test_sentids:
174 |         for moses_input_id in mapper_transformation[sentid]:
175 |             transformed_sent = moses_input[moses_input_id]
176 |             d2s_complex_file.write(transformed_sent.encode('utf-8')+"\n")
177 |     d2s_complex_file.close()
178 | 
179 |     print "Writing "+test_output_directory+"/transformation-output.map ..."
180 |     d2s_complex_map = open(test_output_directory+"/transformation-output.map", "w")  # each line: sentid followed by its moses_input ids
181 |     sentids = mapper_transformation.keys()
182 |     sentids.sort()
183 |     for sentid in sentids:
184 |         d2s_complex_map.write(str(sentid)+" ")
185 |         for item in mapper_transformation[sentid]:
186 |             d2s_complex_map.write(str(item)+" ")
187 |         d2s_complex_map.write("\n")
188 |     d2s_complex_map.close()
189 | 
190 |     print "Writing "+test_output_directory+"/transformation-output.simple ..."
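    # Editor's note: transformation-output.simple carries one line per test
    # sentence id, in sorted order; each line concatenates every split sentence
    # produced for that complex sentence, before any Moses substitution or
    # reordering has been applied.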
191 |     d2s_complex_file = open(test_output_directory+"/transformation-output.simple", "w")
192 |     for sentid in test_sentids:
193 |         simple_sentence = []
194 |         for moses_input_id in mapper_transformation[sentid]:
195 |             transformed_sent = moses_input[moses_input_id]
196 |             simple_sentence.append(transformed_sent)
197 |         d2s_complex_file.write((" ".join(simple_sentence)).encode('utf-8')+"\n")
198 |     d2s_complex_file.close()
199 | 
200 |     # STEP:6 Running Moses
201 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
202 |     print "\n"+timestamp+", Applying the moses translation model ..."
203 |     command = (MOSESDIR+"/bin/moses -f "+D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"]+"/model/moses.ini "+
204 |                "-n-best-list "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"+" "+args_dict['nbest_distinct']+" distinct "+
205 |                "-input-file "+test_output_directory+"/transformation-output.moses-input")
206 |     os.system(command)
207 | 
208 |     # Reading the moses output file
209 |     print "Parsing the moses output file: "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"
210 |     moses_output = {}
211 |     finput = open(test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple", "r")
212 |     datalines = finput.readlines()
213 |     for line in datalines:
214 |         parts = line.split(" ||| ")
215 |         sentid = int(parts[0].strip())  # the n-best list indexes Moses input segments, i.e. moses_input_id
216 |         sent = parts[1].strip()
217 |         if sentid not in moses_output:
218 |             moses_output[sentid] = [sent]
219 |         else:
220 |             moses_output[sentid].append(sent)
221 |     finput.close()
222 | 
223 |     # Storing the best moses output
224 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
225 |     print "\n"+timestamp+", Best output of moses ..."
226 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple.best"
227 |     print "Writing to the file: "+final_output_filename
228 |     final_output_file = open(final_output_filename, "w")
229 |     for sentid in test_sentids:
230 |         simple_sentence = []
231 |         for moses_input_id in mapper_transformation[sentid]:
232 |             moses_simple_output_best = moses_output[moses_input_id][0]  # the first n-best entry is Moses' highest-scoring candidate
233 |             simple_sentence.append(moses_simple_output_best)
234 |         final_output_file.write(" ".join(simple_sentence)+"\n")
235 |     final_output_file.close()
236 | 
237 |     # Running post-hoc reranking
238 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
239 |     print "\n"+timestamp+", Running the post-hoc reranking ..."
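    # Editor's note: the reranker below scores every n-best candidate by its
    # character-level Levenshtein distance (NLTK's edit_distance) from the
    # sentence that was fed to Moses, then ranks the most dissimilar candidates
    # first, presumably to favour outputs that changed the most during phrase
    # substitution and reordering.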
240 |     posthoc_reranked = {}
241 |     for sentid in test_sentids:
242 |         for moses_input_id in mapper_transformation[sentid]:
243 |             moses_complex_input = moses_input[moses_input_id]
244 |             moses_simple_outputs = moses_output[moses_input_id]
245 | 
246 |             posthoc_reranked[moses_input_id] = []
247 |             for simple_output in moses_simple_outputs:
248 |                 edit_dist = edit_distance(simple_output, moses_complex_input)
249 |                 posthoc_reranked[moses_input_id].append((edit_dist, simple_output))
250 | 
251 |             # Candidates most different from the Moses input are ranked at the top
252 |             posthoc_reranked[moses_input_id].sort(reverse=True)
253 | 
254 |     # Writing post-hoc reranked output
255 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple"
256 |     print "Writing to the file: "+final_output_filename
257 |     final_output_file = open(final_output_filename, "w")
258 |     for sentid in test_sentids:
259 |         for moses_input_id in mapper_transformation[sentid]:
260 |             for item in posthoc_reranked[moses_input_id]:
261 |                 final_output_file.write(str(moses_input_id)+"\t"+str(item[0])+"\t"+item[1]+"\n")
262 |             final_output_file.write("\n")
263 |     final_output_file.close()
264 | 
265 |     # Writing post-hoc reranked best output
266 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple.best"
267 |     print "Writing to the file: "+final_output_filename
268 |     final_output_file = open(final_output_filename, "w")
269 |     for sentid in test_sentids:
270 |         simple_sentence = []
271 |         for moses_input_id in mapper_transformation[sentid]:
272 |             simple_output_best = posthoc_reranked[moses_input_id][0][1]
273 |             simple_sentence.append(simple_output_best)
274 |         final_output_file.write(" ".join(simple_sentence)+"\n")
275 |     final_output_file.close()
276 | 
277 | 
278 |     # test_boxerdata_dict = {}
279 |     # test_sentids = []
280 | 
281 |     # mapper_transformation = {}
282 |     # moses_input = {}
283 | 
284 |     # moses_output = {}
285 | 
286 |     # posthoc_reranked = {}
287 | 
--------------------------------------------------------------------------------