├── LICENSE
├── README.md
├── preprocessing
│   ├── extract_wikipedia_corpora_boxer_test.py
│   └── extract_wikipedia_corpora_boxer_training.py
├── source
│   ├── boxer_graph_module.py
│   ├── em_inside_outside_algorithm.py
│   ├── explore_decoder_graph_explorative.py
│   ├── explore_decoder_graph_greedy.py
│   ├── explore_training_graph.py
│   ├── function_select_methods.py
│   ├── functions_configuration_file.py
│   ├── functions_model_files.py
│   ├── functions_prepare_elementtree_dot.py
│   ├── methods_feature_extract.py
│   ├── methods_training_graph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph.py
│   ├── saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py
│   └── training_graph_module.py
├── start_learning_training_models.py
└── start_simplifying_complex_sentence.py

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
BSD 3-Clause License

Copyright (c) 2018, Shashi Narayan
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Hybrid Simplification using Deep Semantics and Machine Translation

Sentence simplification maps a sentence to a simpler, more readable
one approximating its content. In practice, simplification is often
modeled using four main operations: splitting a complex sentence into
several simpler sentences; dropping phrases or constituents;
reordering phrases or constituents; and substituting words/phrases
with simpler ones.

This is the implementation of our ACL'14 paper. We have modified our
code to let you choose which simplification operations you want to
apply to your complex sentences. Please see our paper for more
details, and contact Shashi Narayan
(shashi.narayan(at){ed.ac.uk,gmail.com}) with any queries.

If you use our code, please cite the following paper.
**Hybrid Simplification using Deep Semantics and Machine Translation,
Shashi Narayan and Claire Gardent, The 52nd Annual Meeting of the
Association for Computational Linguistics (ACL), Baltimore,
June 2014. https://aclweb.org/anthology/P/P14/P14-1041.pdf.**

> We present a hybrid approach to sentence simplification which
> combines deep semantics and monolingual machine translation to
> derive simple sentences from complex ones. The approach differs from
> previous work in two main ways. First, it is semantic based in that
> it takes as input a deep semantic representation rather than e.g., a
> sentence or a parse tree. Second, it combines a simplification model
> for splitting and deletion with a monolingual translation model for
> phrase substitution and reordering. When compared against current
> state of the art methods, our model yields significantly simpler
> output that is both grammatical and meaning preserving.

### Requirements

* Boxer 1.00: http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer, http://www.cl.cam.ac.uk/~sc609/candc-1.00.html
* Moses: http://www.statmt.org/moses/?n=Development.GetStarted
* Mgiza++: http://www.statmt.org/moses/?n=Moses.ExternalTools#ntoc3
* NLTK toolkit: http://www.nltk.org/
* Python 2.7
* Stanford Toolkit: http://nlp.stanford.edu/software/tagger.html

### Data preparation

#### Training Data

* code: ./preprocessing/extract_wikipedia_corpora_boxer_training.py

* This code prepares the training data. It takes as input the tokenized
  training (complex, simple) sentences and the Boxer output (XML
  format) of the complex sentences.

* I will improve the interface of this script later. For now you
  have to set the following parameters (C: complex sentence, S: simple
  sentence):

    * ZHUDATA_FILE_ORG = Address of the file with combined
      complex-simple pairs. Format:
      C_1\nS^1_1\nS^2_1\n\nC_2\nS^1_2\nS^2_2\nS^3_2\n\n and so on
      (a parsing sketch for this format is given after this section).

    * ZHUDATA_FILE_MAIN = Address of the file with all tokenized complex
      sentences. Format: C_1\nC_2\n and so on.

    * ZHUDATA_FILE_SIMPLE = Address of the file with all tokenized
      simple sentences. Format: S^1_1\nS^2_1\nS^1_2\nS^2_2\nS^3_2\n and
      so on.

    * BOXER_DATADIR = Directory which contains the Boxer output
      of ZHUDATA_FILE_MAIN.

    * CHUNK_SIZE = Size of the Boxer output chunks. The script loads
      each Boxer XML file into memory before parsing it, so it is much
      faster to process ZHUDATA_FILE_MAIN in chunks (say, of 10000
      sentences).

    * boxer_main_filename = Boxer output file name pattern. For
      example:
      "filename."+str(lower_index)+"-"+str(lower_index+CHUNK_SIZE)

#### Test Data

* code: ./preprocessing/extract_wikipedia_corpora_boxer_test.py

* This code prepares the test data. It takes as input the tokenized test
  (complex) sentences and their Boxer outputs in XML format.

* I will improve the interface of this script later. For now you
  have to set the following parameters:

    * TEST_FILE_MAIN: Address of the file with all tokenized complex
      sentences. Format: C_1\nC_2\n and so on.

    * TEST_FILE_BOXER: Address of the Boxer XML output file for
      TEST_FILE_MAIN.
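For concreteness, the combined format of ZHUDATA_FILE_ORG can be read back into aligned (complex, simple-set) pairs with a few lines of Python. This is only an illustrative sketch, not part of the released scripts; the function name and path are made up:

```python
# Sketch: parse the blank-line-separated blocks of ZHUDATA_FILE_ORG.
# Each block holds one complex sentence followed by its simple sentences.
def read_aligned_pairs(path):
    pairs = []
    with open(path) as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        pairs.append((lines[0], lines[1:]))  # (C_i, [S^1_i, S^2_i, ...])
    return pairs

# e.g. pairs = read_aligned_pairs("zhudata.org")  # hypothetical path
```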
### Training

* Training goes through three states: 1) building the Boxer training
  graphs, 2) EM training, and 3) SMT training.

```
python start_learning_training_models.py --help

usage: python learn_training_models.py [-h] [--start-state Start_State]
                                       [--end-state End_State]
                                       [--transformation TRANSFORMATION_MODEL]
                                       [--max-split MAX_SPLIT_SIZE]
                                       [--restricted-drop-rel RESTRICTED_DROP_REL]
                                       [--allowed-drop-mod ALLOWED_DROP_MOD]
                                       [--method-training-graph Method_Training_Graph]
                                       [--method-feature-extract Method_Feature_Extract]
                                       [--train-boxer-graph Train_Boxer_Graph]
                                       [--num-em NUM_EM_ITERATION]
                                       [--lang-model Lang_Model]
                                       [--d2s-config D2S_Config] --output-dir
                                       Output_Directory

Start the training process.

optional arguments:
  -h, --help            show this help message and exit
  --start-state Start_State
                        Start state of the training process
  --end-state End_State
                        End state of the training process
  --transformation TRANSFORMATION_MODEL
                        Transformation models learned
  --max-split MAX_SPLIT_SIZE
                        Maximum split size
  --restricted-drop-rel RESTRICTED_DROP_REL
                        Restricted drop relations
  --allowed-drop-mod ALLOWED_DROP_MOD
                        Allowed drop modifiers
  --method-training-graph Method_Training_Graph
                        Operation set for training graph file
  --method-feature-extract Method_Feature_Extract
                        Operation set for extracting features
  --train-boxer-graph Train_Boxer_Graph
                        The training corpus file (xml, stanford-tokenized,
                        boxer-graph)
  --num-em NUM_EM_ITERATION
                        The number of EM Algorithm iterations
  --lang-model Lang_Model
                        Language model information (in the moses format)
  --d2s-config D2S_Config
                        D2S Configuration file
  --output-dir Output_Directory
                        The output directory
```

* Have a look at start_learning_training_models.py for more
  information on the arguments' definitions and their default values.

* train-boxer-graph: this is the output file from the training data
  preparation. (An illustrative end-to-end invocation is given after
  the Testing section below.)

### Testing

```
python start_simplifying_complex_sentence.py --help

usage: python simplify_complex_sentence.py [-h]
                                           [--test-boxer-graph Test_Boxer_Graph]
                                           [--nbest-distinct N_Best_Distinct]
                                           [--explore-decoder Explore_Decoder]
                                           --d2s-config D2S_Config
                                           --output-dir Output_Directory

Start simplifying complex sentences.

optional arguments:
  -h, --help            show this help message and exit
  --test-boxer-graph Test_Boxer_Graph
                        The test corpus file (xml, stanford-tokenized, boxer-
                        graph)
  --nbest-distinct N_Best_Distinct
                        N Best Distinct produced from Moses
  --explore-decoder Explore_Decoder
                        Method for generating the decoder graph
  --d2s-config D2S_Config
                        D2S Configuration file
  --output-dir Output_Directory
                        The output directory
```

* test-boxer-graph: this is the output file from the test data
  preparation.

* d2s-config: This is the output configuration file from the training
  stage.
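As a concrete illustration, a full train-then-simplify run could look like the following. All paths and option values here are hypothetical placeholders, not defaults shipped with the code:

```
# Training: build the Boxer training graphs, run EM, train the SMT models
python start_learning_training_models.py \
    --train-boxer-graph train.boxer-graph.xml \
    --num-em 50 \
    --lang-model lm.info \
    --output-dir ./models

# Testing: simplify with the configuration file written by training
python start_simplifying_complex_sentence.py \
    --test-boxer-graph test.boxer-graph.xml \
    --d2s-config ./models/d2s-config \
    --output-dir ./simplified
```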
### ToDo

* ToDo: Incorporate improvements from our arXiv
  paper. http://arxiv.org/pdf/1507.08452v1.pdf

    * OOD words at the border should be dropped.
    * Don't split at "TO".
    * Full stop at the end of the sentence (currently this is done as a
      post-processing step).

* ToDo: Change to an online version of sentence simplification.

--------------------------------------------------------------------------------
/preprocessing/extract_wikipedia_corpora_boxer_test.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
#===================================================================================
#title       : extract_wikipedia_corpora_boxer_test.py                             =
#description : Test data preparation                                               =
#author      : Shashi Narayan, shashi.narayan(at){ed.ac.uk,gmail.com})             =
#date        : Created in 2014, Later revised in April 2016.                       =
#version     : 0.1                                                                 =
#===================================================================================

import os
import argparse
import sys
import random
import base64
import uuid

import xml.etree.ElementTree as ET
from xml.dom import minidom

### Global Variables

# # Zhu
# TEST_FILE_MAIN="/home/ankh/Data/Simplification/Test-Data/complex.tokenized"
# TEST_FILE_BOXER="/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer.xml"

# # NewSella
# TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"
# TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/Newsella/boxer/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.valid.src"

# Wikilarge
# TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.test.src"
# TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.test.src"

TEST_FILE_MAIN="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/wiki.full.aner.ori.valid.src"
TEST_FILE_BOXER="/gpfs/scratch/snarayan/Sentence-Simplification/WikiLarge/boxer/wiki.full.aner.ori.valid.src"


class Boxer_XML_Handler:
    def parse_boxer_xml(self, xdrs_elt):
        arg_dict = {}
        rel_nodes = []
        extra_nodes = []

        # Process all dr
        for dr in xdrs_elt.iter('dr'):
            arg = dr.attrib["name"]
            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            for index in dr.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))

        # Process all prop
        for prop in xdrs_elt.iter('prop'):
            arg = prop.attrib["argument"]
            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            for index in prop.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))

        # Process all pred
        for pred in xdrs_elt.iter('pred'):
            arg = pred.attrib["arg"]
            symbol = pred.attrib["symbol"]

            if arg not in arg_dict:
                arg_dict[arg] = {"position":[], "preds":[]}

            predicate = [symbol, []]
            for index in pred.findall('index'):
                position = int(index.attrib["pos"])
                wordid = index.text

                if (position, wordid) not in arg_dict[arg]["position"]:
                    arg_dict[arg]["position"].append((position, wordid))
                if (position, wordid) not in predicate[1]:
                    predicate[1].append((position,
wordid)) 87 | arg_dict[arg]["preds"].append(predicate) 88 | 89 | # Process all named 90 | for named in xdrs_elt.iter('named'): 91 | arg = named.attrib["arg"] 92 | symbol = named.attrib["symbol"] 93 | 94 | if arg not in arg_dict: 95 | arg_dict[arg] = {"position":[], "preds":[]} 96 | 97 | named_pred = [symbol, []] 98 | for index in named.findall('index'): 99 | position = int(index.attrib["pos"]) 100 | wordid = index.text 101 | 102 | if (position, wordid) not in arg_dict[arg]["position"]: 103 | arg_dict[arg]["position"].append((position, wordid)) 104 | if (position, wordid) not in named_pred[1]: 105 | named_pred[1].append((position, wordid)) 106 | arg_dict[arg]["preds"].append(named_pred) 107 | 108 | # Process all card 109 | for card in xdrs_elt.iter('card'): 110 | arg = card.attrib["arg"] 111 | value = card.attrib["value"] 112 | 113 | if arg not in arg_dict: 114 | arg_dict[arg] = {"position":[], "preds":[]} 115 | 116 | card_pred = [value, []] 117 | for index in card.findall('index'): 118 | position = int(index.attrib["pos"]) 119 | wordid = index.text 120 | 121 | if (position, wordid) not in arg_dict[arg]["position"]: 122 | arg_dict[arg]["position"].append((position, wordid)) 123 | if (position, wordid) not in card_pred[1]: 124 | card_pred[1].append((position, wordid)) 125 | arg_dict[arg]["preds"].append(card_pred) 126 | 127 | # Process all timex 128 | for timex in xdrs_elt.iter('timex'): 129 | arg = timex.attrib["arg"] 130 | datetime = "" 131 | for date in timex.iter('date'): 132 | datetime = date.text 133 | for time in timex.iter('time'): 134 | datetime = time.text 135 | 136 | if arg not in arg_dict: 137 | arg_dict[arg] = {"position":[], "preds":[]} 138 | 139 | timex_pred = [datetime, []] 140 | for index in timex.findall('index'): 141 | position = int(index.attrib["pos"]) 142 | wordid = index.text 143 | 144 | if (position, wordid) not in arg_dict[arg]["position"]: 145 | arg_dict[arg]["position"].append((position, wordid)) 146 | if (position, wordid) not in timex_pred[1]: 147 | timex_pred[1].append((position, wordid)) 148 | arg_dict[arg]["preds"].append(timex_pred) 149 | 150 | # Process not/or/imp/whq 151 | for not_node in xdrs_elt.iter('not'): 152 | index_list = not_node.findall('index') 153 | if len(index_list) != 0: 154 | not_pred = ["not", []] 155 | for index in index_list: 156 | position = int(index.attrib["pos"]) 157 | wordid = index.text 158 | if (position, wordid) not in not_pred[1]: 159 | not_pred[1].append((position, wordid)) 160 | extra_nodes.append(not_pred) 161 | for or_node in xdrs_elt.iter('or'): 162 | index_list = or_node.findall('index') 163 | if len(index_list) != 0: 164 | or_pred = ["or", []] 165 | for index in index_list: 166 | position = int(index.attrib["pos"]) 167 | wordid = index.text 168 | if (position, wordid) not in or_pred[1]: 169 | or_pred[1].append((position, wordid)) 170 | extra_nodes.append(or_pred) 171 | for imp_node in xdrs_elt.iter('imp'): 172 | index_list = imp_node.findall('index') 173 | if len(index_list) != 0: 174 | imp_pred = ["imp", []] 175 | for index in index_list: 176 | position = int(index.attrib["pos"]) 177 | wordid = index.text 178 | if (position, wordid) not in imp_pred[1]: 179 | imp_pred[1].append((position, wordid)) 180 | extra_nodes.append(imp_pred) 181 | for whq_node in xdrs_elt.iter('whq'): 182 | index_list = whq_node.findall('index') 183 | if len(index_list) != 0: 184 | whq_pred = ["whq", []] 185 | for index in index_list: 186 | position = int(index.attrib["pos"]) 187 | wordid = index.text 188 | if (position, wordid) not in whq_pred[1]: 
189 | whq_pred[1].append((position, wordid)) 190 | extra_nodes.append(whq_pred) 191 | 192 | # Process Rel node 193 | for rel_node in xdrs_elt.iter('rel'): 194 | arg1 = rel_node.attrib["arg1"] 195 | arg2 = rel_node.attrib["arg2"] 196 | symbol = rel_node.attrib["symbol"] 197 | 198 | if arg1 not in arg_dict: 199 | arg_dict[arg1] = {"position":[], "preds":[]} 200 | if arg2 not in arg_dict: 201 | arg_dict[arg2] = {"position":[], "preds":[]} 202 | 203 | index_list = rel_node.findall('index') 204 | if (len(index_list) != 0) or (len(index_list) == 0 and symbol=="nn"): 205 | if symbol=="nn": 206 | # Revert arguments 207 | temp = arg2 208 | arg2 = arg1 209 | arg1 = temp 210 | rel_nodes.append([symbol, arg1, arg2, []]) 211 | 212 | elif symbol=="eq": 213 | if len(index_list) == 1: 214 | position = int(index_list[0].attrib["pos"]) 215 | wordid = index_list[0].text 216 | 217 | for arg in arg_dict: 218 | if (position, wordid) in arg_dict[arg]["position"]: 219 | if ["event", []] in arg_dict[arg]["preds"]: 220 | rel_nodes.append([symbol, arg, arg1, [(position, wordid)]]) 221 | rel_nodes.append([symbol, arg, arg2, [(position, wordid)]]) 222 | else: 223 | rel_pred = [symbol, arg1, arg2, []] 224 | for index in index_list: 225 | position = int(index.attrib["pos"]) 226 | wordid = index.text 227 | if (position, wordid) not in rel_pred[3]: 228 | rel_pred[3].append((position, wordid)) 229 | rel_nodes.append(rel_pred) 230 | 231 | 232 | return arg_dict, rel_nodes, extra_nodes 233 | 234 | ### Extra Functions 235 | 236 | def prettify(elem): 237 | """Return a pretty-printed XML string for the Element. 238 | """ 239 | rough_string = ET.tostring(elem) 240 | reparsed = minidom.parseString(rough_string) 241 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ") 242 | return prettyxml.encode("utf-8") 243 | 244 | ################### 245 | 246 | class Boxer_Element_Creator: 247 | def graph_boxer_element(self, xdrs_elt, sent_span): 248 | # Parse the XDRS 249 | arg_dict, rel_nodes, extra_nodes = Boxer_XML_Handler().parse_boxer_xml(xdrs_elt) 250 | #print arg_dict, rel_nodes, extra_nodes 251 | 252 | # Adding extra nodes to arg_dict/node dict 253 | extra_node_count = 1 254 | for nodeinfo in extra_nodes: 255 | arg_dict["E"+str(extra_node_count)] = {"position":nodeinfo[1][:], "preds":[nodeinfo[:]]} 256 | extra_node_count += 1 257 | 258 | # Creating edge list and relation dict 259 | edge_list = [] 260 | rel_dict = {} 261 | 262 | rel_node_count = 1 263 | for relinfo in rel_nodes: 264 | relation = relinfo[0] 265 | parentarg = relinfo[1] 266 | deparg = relinfo[2] 267 | indexlist = relinfo[3] 268 | 269 | rel_dict["R"+str(rel_node_count)] = [relation, indexlist[:]] 270 | edge_list.append((parentarg, deparg, "R"+str(rel_node_count))) 271 | rel_node_count += 1 272 | 273 | # Get the boxer span 274 | boxer_span = [] 275 | for arg in arg_dict: 276 | for position in arg_dict[arg]["position"]: 277 | if position not in boxer_span: 278 | boxer_span.append(position) 279 | for relarg in rel_dict: 280 | for position in rel_dict[relarg][1]: 281 | if position not in boxer_span: 282 | boxer_span.append(position) 283 | boxer_span.sort() 284 | boxer_span_ids = [elt[1] for elt in boxer_span] 285 | # Adding nodes for left out word positions in discourse 286 | out_of_dis_count = 1 287 | for elt in sent_span: 288 | if elt not in boxer_span_ids: 289 | arg_dict["OOD"+str(out_of_dis_count)] = {"position":[(-1, elt)], "preds":[]} 290 | out_of_dis_count += 1 291 | 292 | # Prepare Boxer Tree 293 | boxer = ET.Element('box') 294 | nodes = ET.SubElement(boxer, 
"nodes") 295 | for arg in arg_dict: 296 | # print arg 297 | # print arg_dict[arg] 298 | node = ET.SubElement(nodes, "node") 299 | node.attrib = {"sym":arg} 300 | 301 | span = ET.SubElement(node, "span") 302 | position_list = arg_dict[arg]["position"] 303 | position_list.sort() 304 | for position in position_list: 305 | location = ET.SubElement(span, "loc") 306 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 307 | 308 | preds = ET.SubElement(node, "preds") 309 | predicate_list = arg_dict[arg]["preds"] 310 | predicate_list.sort() 311 | for predinfo in predicate_list: 312 | # print predinfo 313 | pred = ET.SubElement(preds, "pred") 314 | predsymbol = predinfo[0] 315 | pred.attrib = {"sym":predsymbol} 316 | 317 | position_list = predinfo[1] 318 | position_list.sort() 319 | for position in position_list: 320 | location = ET.SubElement(pred, "loc") 321 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 322 | 323 | rels = ET.SubElement(boxer, "rels") 324 | for relarg in rel_dict: 325 | rel = ET.SubElement(rels, "rel") 326 | rel.attrib = {"sym":relarg} 327 | pred = ET.SubElement(rel, "pred") 328 | pred.attrib = {"sym":rel_dict[relarg][0]} 329 | span = ET.SubElement(rel, "span") 330 | position_list = rel_dict[relarg][1] 331 | position_list.sort() 332 | for position in position_list: 333 | location = ET.SubElement(span, "loc") 334 | location.attrib = {"id":position[1]}#{"pos":str(position[0]),"id":position[1]} 335 | 336 | edges = ET.SubElement(boxer, "edges") 337 | for edgeinfo in edge_list: 338 | edge = ET.SubElement(edges, "edge") 339 | edge.attrib = {"par":edgeinfo[0], "dep":edgeinfo[1], "lab":edgeinfo[2]} 340 | return boxer 341 | 342 | def construct_boxer_element(self, xdrs_data): 343 | sentid = int(xdrs_data.get('{http://www.w3.org/XML/1998/namespace}id')[1:]) 344 | 345 | # Creating Sentence Element 346 | sentence_xml = "" 347 | sentence_span = [] 348 | sentence = ET.Element('s') 349 | words = (xdrs_data.find("words")).findall("word") 350 | postags = (xdrs_data.find("postags")).findall("postag") 351 | for word_elt, postag_elt in zip(words, postags): 352 | word = word_elt.text 353 | wordid = word_elt.get('{http://www.w3.org/XML/1998/namespace}id')# ("xml:id") 354 | pos = postag_elt.text 355 | posid = postag_elt.get("index") 356 | 357 | if wordid != posid: 358 | print "Warning: Both ids did not match." 
359 | exit(0) 360 | else: 361 | sentence_xml += word +" " 362 | word_newelt = ET.SubElement(sentence, 'w') 363 | word_newelt.attrib = {"id":wordid, "pos":pos} 364 | word_newelt.text = word 365 | 366 | sentence_span.append(wordid) 367 | 368 | # Creating the head element 369 | headelt = ET.Element("main") 370 | headelt.append(sentence) 371 | 372 | # Creating boxer element 373 | boxer = self.graph_boxer_element(xdrs_data, sentence_span) 374 | headelt.append(boxer) 375 | return sentid, headelt 376 | 377 | def create_sentence_elt(self, main_sent): 378 | sentence = ET.Element('s') 379 | sentence.text = main_sent.decode('utf-8') 380 | 381 | # Creating the head element 382 | headelt = ET.Element("main") 383 | headelt.append(sentence) 384 | 385 | # Creating boxer element 386 | boxer = ET.Element("box") 387 | headelt.append(boxer) 388 | return headelt 389 | 390 | def get_sentence_elt(self, boxer_xml_file): 391 | sentid_sentence_dict = {} 392 | xdrs_output = ET.parse(boxer_xml_file) 393 | xdrs_list = xdrs_output.findall('xdrs') 394 | for xdrs_item in xdrs_list: 395 | sentid, head_elt = self.construct_boxer_element(xdrs_item) 396 | sentid_sentence_dict[sentid] = head_elt 397 | return sentid_sentence_dict 398 | 399 | if __name__ == "__main__": 400 | datadir = os.path.dirname(TEST_FILE_MAIN) 401 | 402 | # Start parsing Zhu Data file - Tokenized ..." 403 | print "\nStart preparing Test Data file - Tokenized ..." 404 | print "Start parsing "+TEST_FILE_MAIN+" ..." 405 | main_wiki = [] 406 | fdata = open(TEST_FILE_MAIN, "r") 407 | main_wiki = fdata.read().strip().split("\n") 408 | fdata.close() 409 | print "Total number of sentences from Test (tokenized): " + str(len(main_wiki)) 410 | 411 | # Call boxer element creator 412 | print "\nStart boxer element creator ..." 413 | boxer_elt_creator = Boxer_Element_Creator() 414 | sentid_sentence_dict = boxer_elt_creator.get_sentence_elt(TEST_FILE_BOXER) 415 | 416 | # Start Writing final files 417 | print "Generating XML file : "+TEST_FILE_MAIN+".boxer-graph.xml ..." 418 | foutput = open(TEST_FILE_MAIN+".boxer-graph.xml", "w") 419 | foutput.write("\n") 420 | foutput.write("\n") 421 | 422 | main_index = 1 423 | for main_sent in main_wiki: 424 | # Start creating sentence subelement 425 | sentence = ET.Element('sentence') 426 | sentence.attrib={"id":str(main_index)} 427 | 428 | main_elt = "" 429 | if main_index in sentid_sentence_dict: 430 | main_elt = sentid_sentence_dict[main_index] 431 | else: 432 | main_elt = boxer_elt_creator.create_sentence_elt(main_sent) 433 | 434 | sentence.append(main_elt) 435 | foutput.write(prettify(sentence)) 436 | main_index += 1 437 | 438 | 439 | foutput.write("\n") 440 | foutput.close() 441 | 442 | print len(sentid_sentence_dict) 443 | 444 | print sentid_sentence_dict.keys() 445 | -------------------------------------------------------------------------------- /source/em_inside_outside_algorithm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : em_inside_outside_algorithm.py = 4 | #description : EM Algorithm = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import function_select_methods 12 | 13 | class EM_InsideOutside_Optimiser: 14 | def __init__(self, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT): 15 | self.smt_sentence_pairs = smt_sentence_pairs 16 | self.probability_tables = probability_tables 17 | self.count_tables = count_tables 18 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 19 | 20 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT) 21 | 22 | def initialize_probabilitytable_smt_input(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 23 | #print sentid 24 | 25 | # Process all oper nodes 26 | for oper_node in training_graph.oper_nodes: 27 | oper_type = training_graph.get_opernode_type(oper_node) 28 | if oper_type not in self.probability_tables: 29 | self.probability_tables[oper_type] = {} 30 | if oper_type not in self.count_tables: 31 | self.count_tables[oper_type] = {} 32 | 33 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 34 | children_major_nodes = training_graph.find_children_of_opernode(oper_node) 35 | 36 | if oper_type == "split": 37 | # Parent main sentence 38 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 39 | parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node) 40 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos) 41 | 42 | # Children sentences 43 | children_sentences = [] 44 | for child_major_node in children_major_nodes: 45 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node) 46 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node) 47 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos) 48 | children_sentences.append(child_sentence) 49 | 50 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node) 51 | 52 | #print split_candidate 53 | 54 | if split_candidate != None: 55 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 56 | if split_feature not in self.probability_tables["split"]: 57 | self.probability_tables["split"][split_feature] = {"true":0.5, "false":0.5} 58 | if split_feature not in self.count_tables["split"]: 59 | self.count_tables["split"][split_feature] = {"true":0, "false":0} 60 | else: 61 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 62 | #print not_applied_cands 63 | for split_candidate_left in not_applied_cands: 64 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 65 | #print split_feature_left 66 | if split_feature_left not in self.probability_tables["split"]: 67 | self.probability_tables["split"][split_feature_left] = {"true":0.5, "false":0.5} 68 | if split_feature_left not in self.count_tables["split"]: 69 | self.count_tables["split"][split_feature_left] = {"true":0, "false":0} 70 | #print self.probability_tables["split"] 71 | 72 | if oper_type == "drop-rel": 73 | rel_node = training_graph.get_opernode_oper_candidate(oper_node) 74 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 75 | drop_rel_feature = 
self.method_feature_extract.get_drop_rel_feature(rel_node, parent_nodeset, main_sent_dict, boxer_graph) 76 | if drop_rel_feature not in self.probability_tables["drop-rel"]: 77 | self.probability_tables["drop-rel"][drop_rel_feature] = {"true":0.5, "false":0.5} 78 | if drop_rel_feature not in self.count_tables["drop-rel"]: 79 | self.count_tables["drop-rel"][drop_rel_feature] = {"true":0, "false":0} 80 | 81 | if oper_type == "drop-mod": 82 | mod_cand = training_graph.get_opernode_oper_candidate(oper_node) 83 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_cand, main_sent_dict, boxer_graph) 84 | if drop_mod_feature not in self.probability_tables["drop-mod"]: 85 | self.probability_tables["drop-mod"][drop_mod_feature] = {"true":0.5, "false":0.5} 86 | if drop_mod_feature not in self.count_tables["drop-mod"]: 87 | self.count_tables["drop-mod"][drop_mod_feature] = {"true":0, "false":0} 88 | 89 | if oper_type == "drop-ood": 90 | ood_node = training_graph.get_opernode_oper_candidate(oper_node) 91 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 92 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_node, parent_nodeset, main_sent_dict, boxer_graph) 93 | if drop_ood_feature not in self.probability_tables["drop-ood"]: 94 | self.probability_tables["drop-ood"][drop_ood_feature] = {"true":0.5, "false":0.5} 95 | if drop_ood_feature not in self.count_tables["drop-ood"]: 96 | self.count_tables["drop-ood"][drop_ood_feature] = {"true":0, "false":0} 97 | 98 | #print self.probability_tables["split"]['as-as-patient_eq-eq_1'] 99 | # if int(sentid) <= 3: 100 | # print self.probability_tables["split"] 101 | 102 | # Extract all sentence pairs for SMT from all "fin" major nodes 103 | self.smt_sentence_pairs[sentid] = training_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph) 104 | 105 | def reset_count_table(self): 106 | for oper_type in self.count_tables: # split, drop-rel, drop-mod, drop-ood 107 | for oper_feature_key in self.count_tables[oper_type]: # feature patterns 108 | for val in self.count_tables[oper_type][oper_feature_key]: # true, false 109 | self.count_tables[oper_type][oper_feature_key][val] = 0 110 | 111 | def iterate_over_probabilitytable(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 112 | #print sentid 113 | # Calculating beta-probability, inside probability 114 | #print "Calculating beta-probabilities (Inside probability) ..." 115 | bottom_nodes = training_graph.find_all_fin_majornode() 116 | beta_prob = self.calculate_inside_probability({}, bottom_nodes, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 117 | #print beta_prob 118 | 119 | # Calculating alpha-probability, outside probability 120 | #print "Calculating alpha-probabilities (Outside probability) ..." 121 | root_node = "MN-1" 122 | alpha_prob = self.calculate_outside_probability({}, [root_node], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob) 123 | #print alpha_prob 124 | 125 | # Updating counts for each operation happened in this sentence 126 | #print "Updating counts of each operation happened in this training sentence ..." 
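        # Expected count of each operation node o, computed below as
        #   count(o) = alpha(o) * beta(o) / beta("MN-1")
        # i.e. outside probability times inside probability, normalised by
        # the inside probability of the root major node.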
127 | self.update_count_for_operations(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob) 128 | 129 | def calculate_outside_probability(self, alpha_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob): 130 | if len(tgnodes_to_process) == 0: 131 | return alpha_prob 132 | 133 | tgnode = tgnodes_to_process[0] 134 | if tgnode.startswith("MN"): 135 | # Major Nodes 136 | if tgnode == "MN-1": 137 | # Root major node 138 | alpha_prob[tgnode] = 1 139 | else: 140 | parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode) 141 | alpha_prob_tgnode = 0 142 | for parent_oper_node in parents_oper_nodes: 143 | alpha_prob_parent_oper_node = alpha_prob[parent_oper_node] 144 | children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node) 145 | beta_prod_product = 1 146 | for child_major_node in children_major_nodes: 147 | if child_major_node != tgnode: 148 | beta_prod_product = beta_prod_product * beta_prob[child_major_node] 149 | alpha_prob_tgnode += alpha_prob_parent_oper_node * beta_prod_product 150 | alpha_prob[tgnode] = alpha_prob_tgnode 151 | 152 | # Adding children to tgnodes_to_process 153 | children_oper_nodes = training_graph.find_children_of_majornode(tgnode) 154 | for child_oper_node in children_oper_nodes: 155 | # Check its not already inserted 156 | if (child_oper_node not in alpha_prob) and (child_oper_node not in tgnodes_to_process): 157 | # Check its parent is already inserted 158 | parent_major_node = training_graph.find_parent_of_opernode(child_oper_node) 159 | if (parent_major_node in alpha_prob) or (parent_major_node in tgnodes_to_process): 160 | tgnodes_to_process.append(child_oper_node) 161 | else: 162 | # Oper nodes 163 | parent_major_node = training_graph.find_parent_of_opernode(tgnode) 164 | alpha_prob_tgnode = alpha_prob[parent_major_node] * self.fetch_probability(tgnode, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 165 | alpha_prob[tgnode] = alpha_prob_tgnode 166 | 167 | # Adding children to tgnodes_to_process 168 | children_major_nodes = training_graph.find_children_of_opernode(tgnode) 169 | for child_major_node in children_major_nodes: 170 | # Check its not already inserted 171 | if (child_major_node not in alpha_prob) and (child_major_node not in tgnodes_to_process): 172 | # Check all its parents are already inserted 173 | parents_oper_nodes = training_graph.find_parents_of_majornode(child_major_node) 174 | flag = True 175 | for parent_oper_node in parents_oper_nodes: 176 | if (parent_oper_node not in alpha_prob) and (parent_oper_node not in tgnodes_to_process): 177 | flag = False 178 | break 179 | if flag == True: 180 | tgnodes_to_process.append(child_major_node) 181 | 182 | alpha_prob = self.calculate_outside_probability(alpha_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, beta_prob) 183 | return alpha_prob 184 | 185 | def calculate_inside_probability(self, beta_prob, tgnodes_to_process, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 186 | if len(tgnodes_to_process) == 0: 187 | return beta_prob 188 | 189 | tgnode = tgnodes_to_process[0] 190 | if tgnode.startswith("MN"): 191 | # Major nodes 192 | major_node_type = training_graph.get_majornode_type(tgnode) 193 | if major_node_type == "fin": 194 | # Leaf major nodes 195 | beta_prob[tgnode] = 1 196 | else: 197 | children_oper_nodes = 
training_graph.find_children_of_majornode(tgnode)
                beta_prob_tgnode = 0
                for child_oper_node in children_oper_nodes:
                    beta_prob_tgnode += self.fetch_probability(child_oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) * beta_prob[child_oper_node]
                beta_prob[tgnode] = beta_prob_tgnode

            # Adding parents to tgnodes_to_process
            parents_oper_nodes = training_graph.find_parents_of_majornode(tgnode)
            for parent_oper_node in parents_oper_nodes:
                # Check it's not already inserted
                if (parent_oper_node not in beta_prob) and (parent_oper_node not in tgnodes_to_process):
                    # Check all its children are already inserted
                    children_major_nodes = training_graph.find_children_of_opernode(parent_oper_node)
                    flag = True
                    for child_major_node in children_major_nodes:
                        if (child_major_node not in beta_prob) and (child_major_node not in tgnodes_to_process):
                            flag = False
                            break
                    if flag == True:
                        tgnodes_to_process.append(parent_oper_node)
        else:
            # Oper nodes
            children_major_nodes = training_graph.find_children_of_opernode(tgnode)
            beta_prob_tgnode = 1
            for child_major_node in children_major_nodes:
                beta_prob_tgnode = beta_prob_tgnode * beta_prob[child_major_node]
            beta_prob[tgnode] = beta_prob_tgnode

            # Adding parent to tgnodes_to_process
            parent_major_node = training_graph.find_parent_of_opernode(tgnode)
            # Check it's not already inserted
            if (parent_major_node not in beta_prob) and (parent_major_node not in tgnodes_to_process):
                # Check all its children are already inserted
                children_oper_nodes = training_graph.find_children_of_majornode(parent_major_node)
                flag = True
                for child_oper_node in children_oper_nodes:
                    if (child_oper_node not in beta_prob) and (child_oper_node not in tgnodes_to_process):
                        flag = False
                        break
                if flag == True:
                    tgnodes_to_process.append(parent_major_node)

        beta_prob = self.calculate_inside_probability(beta_prob, tgnodes_to_process[1:], main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph)
        return beta_prob

    def fetch_probability(self, oper_node, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph):
        oper_node_type = training_graph.get_opernode_type(oper_node)
        if oper_node_type == "split":
            # Parent main sentence
            parent_major_node = training_graph.find_parent_of_opernode(oper_node)
            parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node)
            parent_filtered_mod_pos = training_graph.get_majornode_filtered_postions(parent_major_node)
            parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos)

            # Children sentences
            children_major_nodes = training_graph.find_children_of_opernode(oper_node)
            children_sentences = []
            for child_major_node in children_major_nodes:
                child_nodeset = training_graph.get_majornode_nodeset(child_major_node)
                child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node)
                child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos)
                children_sentences.append(child_sentence)

            total_probability = 1
            split_candidate = training_graph.get_opernode_oper_candidate(oper_node)
            if split_candidate != None:
                split_feature =
self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 264 | total_probability = self.probability_tables["split"][split_feature]["true"] 265 | return total_probability 266 | else: 267 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 268 | for split_candidate_left in not_applied_cands: 269 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 270 | total_probability = total_probability * self.probability_tables["split"][split_feature_left]["false"] 271 | return total_probability 272 | 273 | elif oper_node_type == "drop-rel": 274 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 275 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 276 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node) 277 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph) 278 | isDropped = training_graph.get_opernode_drop_result(oper_node) 279 | prob_value = 0 280 | if isDropped == "True": 281 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["true"] 282 | else: 283 | prob_value = self.probability_tables["drop-rel"][drop_rel_feature]["false"] 284 | return prob_value 285 | 286 | elif oper_node_type == "drop-mod": 287 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node) 288 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph) 289 | isDropped = training_graph.get_opernode_drop_result(oper_node) 290 | prob_value = 0 291 | if isDropped == "True": 292 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["true"] 293 | else: 294 | prob_value = self.probability_tables["drop-mod"][drop_mod_feature]["false"] 295 | return prob_value 296 | 297 | elif oper_node_type == "drop-ood": 298 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 299 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 300 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node) 301 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph) 302 | isDropped = training_graph.get_opernode_drop_result(oper_node) 303 | prob_value = 0 304 | if isDropped == "True": 305 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["true"] 306 | else: 307 | prob_value = self.probability_tables["drop-ood"][drop_ood_feature]["false"] 308 | return prob_value 309 | 310 | def update_count_for_operations(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph, alpha_prob, beta_prob): 311 | # Process all oper nodes 312 | for oper_node in training_graph.oper_nodes: 313 | # Calculating count 314 | root_inside_prob = beta_prob["MN-1"] 315 | oper_node_inside_prob = beta_prob[oper_node] 316 | oper_node_outside_prob = alpha_prob[oper_node] 317 | count_oper_node = (oper_node_inside_prob * oper_node_outside_prob) / root_inside_prob 318 | 319 | oper_node_type = training_graph.get_opernode_type(oper_node) 320 | if oper_node_type == "split": 321 | # Parent main sentence 322 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 323 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 324 | parent_filtered_mod_pos = 
training_graph.get_majornode_filtered_postions(parent_major_node) 325 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, parent_filtered_mod_pos) 326 | 327 | # Children sentences 328 | children_major_nodes = training_graph.find_children_of_opernode(oper_node) 329 | children_sentences = [] 330 | for child_major_node in children_major_nodes: 331 | child_nodeset = training_graph.get_majornode_nodeset(child_major_node) 332 | child_filtered_mod_pos = training_graph.get_majornode_filtered_postions(child_major_node) 333 | child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, child_filtered_mod_pos) 334 | children_sentences.append(child_sentence) 335 | 336 | split_candidate = training_graph.get_opernode_oper_candidate(oper_node) 337 | if split_candidate != None: 338 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph) 339 | self.count_tables["split"][split_feature]["true"] += count_oper_node 340 | else: 341 | not_applied_cands = training_graph.get_opernode_failed_oper_candidates(oper_node) 342 | for split_candidate_left in not_applied_cands: 343 | split_feature_left = self.method_feature_extract.get_split_feature(split_candidate_left, parent_sentence, children_sentences, boxer_graph) 344 | self.count_tables["split"][split_feature_left]["false"] += count_oper_node 345 | 346 | elif oper_node_type == "drop-rel": 347 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 348 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 349 | rel_candidate = training_graph.get_opernode_oper_candidate(oper_node) 350 | drop_rel_feature = self.method_feature_extract.get_drop_rel_feature(rel_candidate, parent_nodeset, main_sent_dict, boxer_graph) 351 | isDropped = training_graph.get_opernode_drop_result(oper_node) 352 | if isDropped == "True": 353 | self.count_tables["drop-rel"][drop_rel_feature]["true"] += count_oper_node 354 | else: 355 | self.count_tables["drop-rel"][drop_rel_feature]["false"] += count_oper_node 356 | 357 | elif oper_node_type == "drop-mod": 358 | mod_candidate = training_graph.get_opernode_oper_candidate(oper_node) 359 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(mod_candidate, main_sent_dict, boxer_graph) 360 | isDropped = training_graph.get_opernode_drop_result(oper_node) 361 | if isDropped == "True": 362 | self.count_tables["drop-mod"][drop_mod_feature]["true"] += count_oper_node 363 | else: 364 | self.count_tables["drop-mod"][drop_mod_feature]["false"] += count_oper_node 365 | 366 | elif oper_node_type == "drop-ood": 367 | parent_major_node = training_graph.find_parent_of_opernode(oper_node) 368 | parent_nodeset = training_graph.get_majornode_nodeset(parent_major_node) 369 | ood_candidate = training_graph.get_opernode_oper_candidate(oper_node) 370 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(ood_candidate, parent_nodeset, main_sent_dict, boxer_graph) 371 | isDropped = training_graph.get_opernode_drop_result(oper_node) 372 | if isDropped == "True": 373 | self.count_tables["drop-ood"][drop_ood_feature]["true"] += count_oper_node 374 | else: 375 | self.count_tables["drop-ood"][drop_ood_feature]["false"] += count_oper_node 376 | 377 | def update_probability_table(self): 378 | for oper_type in self.probability_tables: # split, drop-ood, drop-rel, drop-mod 379 | for oper_feature_key in self.probability_tables[oper_type]: # feature patterns 380 | totalSum = 0 381 | for 
val in self.probability_tables[oper_type][oper_feature_key]: 382 | totalSum += self.count_tables[oper_type][oper_feature_key][val] 383 | 384 | # if totalSum == 0: 385 | # print oper_type 386 | # print oper_feature_key 387 | # print self.probability_tables[oper_type][oper_feature_key] 388 | # print self.count_tables[oper_type][oper_feature_key] 389 | 390 | if totalSum == 0: 391 | for val in self.probability_tables[oper_type][oper_feature_key]: 392 | self.probability_tables[oper_type][oper_feature_key][val] = 0.5 # Uniform 1.0/len(self.probability_tables[oper_type][oper_feature_key]) 393 | else: 394 | for val in self.probability_tables[oper_type][oper_feature_key]: 395 | self.probability_tables[oper_type][oper_feature_key][val] = self.count_tables[oper_type][oper_feature_key][val] / totalSum 396 | -------------------------------------------------------------------------------- /source/explore_decoder_graph_greedy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : explore_decoder_graph_greedy.py = 4 | #description : Greedy decoder = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from training_graph_module import Training_Graph 11 | import function_select_methods 12 | 13 | class Explore_Decoder_Graph_Greedy: 14 | def __init__(self, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, probability_tables, METHOD_FEATURE_EXTRACT): 15 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 16 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 17 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 18 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 19 | 20 | self.probability_tables = probability_tables 21 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 22 | 23 | self.method_feature_extract = function_select_methods.select_feature_extract_method(self.METHOD_FEATURE_EXTRACT) 24 | 25 | def explore_decoder_graph(self, sentid, main_sentence, main_sent_dict, boxer_graph): 26 | # Start a decoder graph 27 | decoder_graph = Training_Graph() 28 | nodes_2_process = [] 29 | 30 | # Check if Discourse information is available 31 | if boxer_graph.isEmpty(): 32 | # Adding finishing major node 33 | nodeset = boxer_graph.get_nodeset() 34 | filtered_mod_pos = [] 35 | simple_sentences = [] 36 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 37 | 38 | # Creating major node 39 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 40 | nodes_2_process.append(majornode_name) # isNew = True 41 | else: 42 | # DRS data is available for the complex sentence 43 | # Check to add the starting node 44 | nodeset = boxer_graph.get_nodeset() 45 | majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "split", nodeset, [], []) 46 | nodes_2_process.append(majornode_name) # isNew = True 47 | 48 | # Start expanding the decoder graph 49 | self.expand_decoder_graph(nodes_2_process[:], main_sent_dict, boxer_graph, decoder_graph) 50 | return decoder_graph 51 | 52 | def expand_decoder_graph(self, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 53 | if len(nodes_2_process) == 0: 54 | return 55 | 56 | node_name = nodes_2_process[0] 57 | operreq = 
decoder_graph.get_majornode_type(node_name) 58 | nodeset = decoder_graph.get_majornode_nodeset(node_name)[:] 59 | oper_candidates = decoder_graph.get_majornode_oper_candidates(node_name)[:] 60 | processed_oper_candidates = decoder_graph.get_majornode_processed_oper_candidates(node_name)[:] 61 | filtered_postions = decoder_graph.get_majornode_filtered_postions(node_name)[:] 62 | 63 | #print node_name, decoder_graph.major_nodes[node_name] 64 | 65 | #print node_name, operreq, nodeset, oper_candidates, processed_oper_candidates, filtered_postions 66 | 67 | if operreq == "split": 68 | split_candidate_tuples = oper_candidates 69 | nodes_2_process = self.process_split_node_decoder_graph(node_name, nodeset, split_candidate_tuples, nodes_2_process, 70 | main_sent_dict, boxer_graph, decoder_graph) 71 | 72 | if operreq == "drop-rel": 73 | relnode_candidates = oper_candidates 74 | processed_relnode_candidates = processed_oper_candidates 75 | filtered_mod_pos = filtered_postions 76 | nodes_2_process = self.process_droprel_node_decoder_graph(node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos, 77 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 78 | 79 | if operreq == "drop-mod": 80 | mod_candidates = oper_candidates 81 | processed_mod_pos = processed_oper_candidates 82 | filtered_mod_pos = filtered_postions 83 | nodes_2_process = self.process_dropmod_node_decoder_graph(node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos, 84 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 85 | 86 | if operreq == "drop-ood": 87 | oodnode_candidates = oper_candidates 88 | processed_oodnode_candidates = processed_oper_candidates 89 | filtered_mod_pos = filtered_postions 90 | nodes_2_process = self.process_dropood_node_decoder_graph(node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 91 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph) 92 | 93 | self.expand_decoder_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, decoder_graph) 94 | 95 | def process_split_node_decoder_graph(self, node_name, nodeset, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 96 | # Calculating probabilities 97 | probability_results = [] 98 | 99 | # Find the Parent main sentence 100 | parent_nodeset = nodeset[:] 101 | parent_sentence = boxer_graph.extract_main_sentence(parent_nodeset, main_sent_dict, []) 102 | 103 | # Explore no-split options 104 | probability = 1 105 | for split_candidate in split_candidate_tuples: 106 | # Get the probability 107 | split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, [parent_sentence], boxer_graph) 108 | if split_feature in self.probability_tables["split"]: 109 | probability = probability * self.probability_tables["split"][split_feature]["false"] 110 | else: 111 | probability = probability * 0.5 112 | #print split_candidate, split_feature, "false", probability 113 | probability_results.append((probability, None, [])) 114 | 115 | # Explore all split options 116 | # Calculate all parent and following subtrees 117 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict() 118 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict) 119 | for split_candidate in split_candidate_tuples: 120 | # Find the children sentences 121 | children_sentences = [] 122 | 123 | # Split on the split_candidate 124 | node_subgraph_nodeset_dict, node_span_dict = 
boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict)
            #print node_subgraph_nodeset_dict, node_span_dict

            # Sorting them depending on span
            split_results = []
            for tnodename in split_candidate:
                tspan = node_span_dict[tnodename]
                tnodeset = node_subgraph_nodeset_dict[tnodename][:]
                split_results.append((tspan, tnodeset, tnodename))
            split_results.sort()

            # Prospective children major nodes
            for item in split_results:
                child_nodeset = item[1]
                child_nodeset.sort()
                child_sentence = boxer_graph.extract_main_sentence(child_nodeset, main_sent_dict, [])
                children_sentences.append(child_sentence)

            #print children_sentences

            # Get the probability
            split_feature = self.method_feature_extract.get_split_feature(split_candidate, parent_sentence, children_sentences, boxer_graph)
            probability = 0
            if split_feature in self.probability_tables["split"]:
                probability = self.probability_tables["split"][split_feature]["true"]
            else:
                probability = 0.5

            #print split_candidate, split_feature, "true", probability

            probability_results.append((probability, split_candidate, split_results))

        # Sort probabilities
        probability_results.sort(reverse=True)

        ##
        #data = [(item[0], item[1]) for item in probability_results]
        #print data
        ##

        split_tuple = probability_results[0][1]
        if split_tuple != None:
            # Adding the operation node
            not_applied_cands = [item for item in split_candidate_tuples if item is not split_tuple]
            opernode_data = ("split", split_tuple, not_applied_cands)
            opernode_name = decoder_graph.create_opernode(opernode_data)
            # Label the edge with the chosen split candidate
            decoder_graph.create_edge((node_name, opernode_name, split_tuple))

            split_results = probability_results[0][2]
            for item in split_results:
                child_nodeset = item[1][:]
                child_nodeset.sort()
                parent_child_nodeset = item[2]

                # Check for adding OOD or subsequent nodes
                child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
                if isNew:
                    nodes_2_process.append(child_majornode_name)
                decoder_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset))
        else:
            # Adding the operation node
            not_applied_cands = [item for item in split_candidate_tuples]
            opernode_data = ("split", None, not_applied_cands)
            opernode_name = decoder_graph.create_opernode(opernode_data)
            decoder_graph.create_edge((node_name, opernode_name, None))

            # Check for adding drop-rel or drop-mod or fin nodes
            child_nodeset = nodeset[:]
            child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, [], [])
            if isNew:
                nodes_2_process.append(child_majornode_name)
            decoder_graph.create_edge((opernode_name, child_majornode_name, None))

        return nodes_2_process

    def process_droprel_node_decoder_graph(self, node_name, nodeset, relnode_candidates, processed_relnode_candidates, filtered_mod_pos,
                                           nodes_2_process, main_sent_dict, boxer_graph, decoder_graph):
        relnode_to_process = relnode_candidates[0]
        processed_relnode_candidates.append(relnode_to_process)

        drop_rel_feature =
self.method_feature_extract.get_drop_rel_feature(relnode_to_process, nodeset, main_sent_dict, boxer_graph) 205 | 206 | if drop_rel_feature in self.probability_tables["drop-rel"]: 207 | drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["true"] 208 | not_drop_prob = self.probability_tables["drop-rel"][drop_rel_feature]["false"] 209 | if drop_prob > not_drop_prob: 210 | # Creating opernode for droping 211 | opernode_data = ("drop-rel", relnode_to_process, "True") 212 | opernode_name = decoder_graph.create_opernode(opernode_data) 213 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process)) 214 | # Check for adding REL or subsequent nodes, (nodeset is changed) 215 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos) 216 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos) 217 | if isNew: 218 | nodes_2_process.append(child_majornode_name) 219 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 220 | return nodes_2_process 221 | 222 | # Creating opernode for not droping 223 | opernode_data = ("drop-rel", relnode_to_process, "False") 224 | opernode_name = decoder_graph.create_opernode(opernode_data) 225 | decoder_graph.create_edge((node_name, opernode_name, relnode_to_process)) 226 | # Check for adding REL or subsequent nodes, (nodeset is unchanged) 227 | child_nodeset = nodeset 228 | child_filtered_mod_pos = filtered_mod_pos 229 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-rel", child_nodeset, processed_relnode_candidates, child_filtered_mod_pos) 230 | if isNew: 231 | nodes_2_process.append(child_majornode_name) 232 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 233 | return nodes_2_process 234 | 235 | def process_dropmod_node_decoder_graph(self, node_name, nodeset, mod_candidates, processed_mod_pos, filtered_mod_pos, 236 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 237 | modcand_to_process = mod_candidates[0] 238 | modcand_position_to_process = modcand_to_process[0] 239 | modcand_word = main_sent_dict[modcand_position_to_process][0] 240 | modcand_node = modcand_to_process[1] 241 | processed_mod_pos.append(modcand_position_to_process) 242 | 243 | drop_mod_feature = self.method_feature_extract.get_drop_mod_feature(modcand_to_process, main_sent_dict, boxer_graph) 244 | if drop_mod_feature in self.probability_tables["drop-mod"]: 245 | drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["true"] 246 | not_drop_prob = self.probability_tables["drop-mod"][drop_mod_feature]["false"] 247 | if drop_prob > not_drop_prob: 248 | # Drop this mod, adding the operation node 249 | opernode_data = ("drop-mod", modcand_to_process, "True") 250 | opernode_name = decoder_graph.create_opernode(opernode_data) 251 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process)) 252 | # Check for adding further drop mods, (nodeset is unchanged) 253 | child_nodeset = nodeset 254 | filtered_mod_pos.append(modcand_position_to_process) 255 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 256 | if isNew: 257 | nodes_2_process.append(child_majornode_name) 258 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 259 | 
return nodes_2_process 260 | 261 | # Dont drop this pos, adding the operation node 262 | opernode_data = ("drop-mod", modcand_to_process, "False") 263 | opernode_name = decoder_graph.create_opernode(opernode_data) 264 | decoder_graph.create_edge((node_name, opernode_name, modcand_to_process)) 265 | # Check for adding further drop mods, (nodeset is unchanged) 266 | child_nodeset = nodeset 267 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 268 | if isNew: 269 | nodes_2_process.append(child_majornode_name) 270 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 271 | return nodes_2_process 272 | 273 | def process_dropood_node_decoder_graph(self, node_name, nodeset, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 274 | nodes_2_process, main_sent_dict, boxer_graph, decoder_graph): 275 | oodnode_to_process = oodnode_candidates[0] 276 | processed_oodnode_candidates.append(oodnode_to_process) 277 | 278 | drop_ood_feature = self.method_feature_extract.get_drop_ood_feature(oodnode_to_process, nodeset, main_sent_dict, boxer_graph) 279 | 280 | if drop_ood_feature in self.probability_tables["drop-ood"]: 281 | drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["true"] 282 | not_drop_prob = self.probability_tables["drop-ood"][drop_ood_feature]["false"] 283 | if drop_prob > not_drop_prob: 284 | # Creating opernode for droping 285 | opernode_data = ("drop-ood", oodnode_to_process, "True") 286 | opernode_name = decoder_graph.create_opernode(opernode_data) 287 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 288 | # Check for adding OOD or subsequent nodes, (nodeset is changed) 289 | child_nodeset = nodeset 290 | child_nodeset.remove(oodnode_to_process) 291 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos) 292 | if isNew: 293 | nodes_2_process.append(child_majornode_name) 294 | decoder_graph.create_edge((opernode_name, child_majornode_name, "True")) 295 | return nodes_2_process 296 | 297 | # Creating opernode for not droping 298 | opernode_data = ("drop-ood", oodnode_to_process, "False") 299 | opernode_name = decoder_graph.create_opernode(opernode_data) 300 | decoder_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 301 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged) 302 | child_nodeset = nodeset 303 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, boxer_graph, decoder_graph, "drop-ood", child_nodeset, processed_oodnode_candidates, filtered_mod_pos) 304 | if isNew: 305 | nodes_2_process.append(child_majornode_name) 306 | decoder_graph.create_edge((opernode_name, child_majornode_name, "False")) 307 | return nodes_2_process 308 | 309 | def addition_major_node(self, main_sent_dict, boxer_graph, decoder_graph, opertype, nodeset, processed_candidates, extra_data): 310 | # node type - value 311 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4} 312 | operval = type_val[opertype] 313 | 314 | # simple sentences not available, used to match data structures 315 | simple_sentences = [] 316 | 317 | # Checking for the addition of "split" major-node 318 | if operval <= type_val["split"]: 319 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 320 | # Calculating Split Candidates - DRS Graph node tuples 321 | 
split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset, self.MAX_SPLIT_PAIR_SIZE) 322 | # print "split_candidate_tuples : " + str(split_candidate_tuples) 323 | 324 | if len(split_candidate_tuples) != 0: 325 | # Adding the major node for split 326 | majornode_data = ("split", nodeset[:], simple_sentences, split_candidate_tuples) 327 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 328 | return majornode_name, isNew 329 | 330 | if operval <= type_val["drop-rel"]: 331 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 332 | # Calculate drop-rel candidates 333 | processed_relnode = processed_candidates[:] if opertype == "drop-rel" else [] 334 | filtered_mod_pos = extra_data if opertype == "drop-rel" else [] 335 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode) 336 | if len(relnode_set) != 0: 337 | # Adding the major nodes for drop-rel 338 | majornode_data = ("drop-rel", nodeset[:], simple_sentences, relnode_set, processed_relnode, filtered_mod_pos) 339 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 340 | return majornode_name, isNew 341 | 342 | if operval <= type_val["drop-mod"]: 343 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 344 | # Calculate drop-mod candidates 345 | processed_mod_pos = processed_candidates[:] if opertype == "drop-mod" else [] 346 | filtered_mod_pos = extra_data 347 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos) 348 | if len(modcand_set) != 0: 349 | # Adding the major nodes for drop-mod 350 | majornode_data = ("drop-mod", nodeset[:], simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos) 351 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 352 | return majornode_name, isNew 353 | 354 | if operval <= type_val["drop-ood"]: 355 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 356 | # Check for drop-OOD node candidates 357 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else [] 358 | filtered_mod_pos = extra_data 359 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes) 360 | if len(oodnode_candidates) != 0: 361 | # Adding the major node for drop-ood 362 | majornode_data = ("drop-ood", nodeset[:], simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos) 363 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 364 | return majornode_name, isNew 365 | 366 | # None of them matched, create "fin" node 367 | filtered_mod_pos = extra_data[:] 368 | majornode_data = ("fin", nodeset[:], simple_sentences, filtered_mod_pos) 369 | majornode_name, isNew = decoder_graph.create_majornode(majornode_data) 370 | return majornode_name, isNew 371 | 372 | -------------------------------------------------------------------------------- /source/explore_training_graph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : explore_training_graph.py = 4 | #description : Training graph explorer = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | from training_graph_module import Training_Graph 12 | import function_select_methods 13 | import functions_prepare_elementtree_dot 14 | 15 | class Explore_Training_Graph: 16 | def __init__(self, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, 17 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 18 | self.output_stream = output_stream 19 | 20 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 21 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 22 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 23 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 24 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 25 | 26 | self.method_training_graph = function_select_methods.select_training_graph_method(self.METHOD_TRAINING_GRAPH) 27 | 28 | def explore_training_graph(self, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph): 29 | # Start a training graph 30 | training_graph = Training_Graph() 31 | nodes_2_process = [] 32 | 33 | # Check if Discourse information is available 34 | if boxer_graph.isEmpty(): 35 | # Adding finishing major node 36 | nodeset = boxer_graph.get_nodeset() 37 | filtered_mod_pos = [] 38 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 39 | 40 | # Creating major node 41 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 42 | nodes_2_process.append(majornode_name) # isNew = True 43 | else: 44 | # DRS data is available for the main sentence 45 | # Check to add the starting node 46 | nodeset = boxer_graph.get_nodeset() 47 | majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "split", nodeset, [], []) 48 | nodes_2_process.append(majornode_name) # isNew = True 49 | 50 | # Start expanding the training graph 51 | self.expand_training_graph(nodes_2_process[:], main_sent_dict, boxer_graph, training_graph) 52 | 53 | # Writing sentence element 54 | functions_prepare_elementtree_dot.prepare_write_sentence_element(self.output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 55 | 56 | # # Check to create visual representation 57 | # if int(sentid) <= 100: 58 | # functions_prepare_elementtree_dot.run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph) 59 | 60 | def expand_training_graph(self, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 61 | #print nodes_2_process 62 | if len(nodes_2_process) == 0: 63 | return 64 | 65 | node_name = nodes_2_process[0] 66 | operreq = training_graph.get_majornode_type(node_name) 67 | nodeset = training_graph.get_majornode_nodeset(node_name)[:] 68 | simple_sentences = training_graph.get_majornode_simple_sentences(node_name)[:] 69 | oper_candidates = training_graph.get_majornode_oper_candidates(node_name)[:] 70 | processed_oper_candidates = training_graph.get_majornode_processed_oper_candidates(node_name)[:] 71 | filtered_postions = training_graph.get_majornode_filtered_postions(node_name)[:] 72 | 73 | if operreq == "split": 74 | split_candidate_tuples = oper_candidates 75 | nodes_2_process = self.process_split_node_training_graph(node_name, nodeset, simple_sentences, split_candidate_tuples, 76 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 77 | 78 | if operreq == "drop-rel": 79 | relnode_candidates = oper_candidates 80 | processed_relnode_candidates = 
processed_oper_candidates 81 | filtered_mod_pos = filtered_postions 82 | nodes_2_process = self.process_droprel_node_training_graph(node_name, nodeset, simple_sentences, relnode_candidates, processed_relnode_candidates, filtered_mod_pos, 83 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 84 | 85 | if operreq == "drop-mod": 86 | mod_candidates = oper_candidates 87 | processed_mod_pos = processed_oper_candidates 88 | filtered_mod_pos = filtered_postions 89 | nodes_2_process = self.process_dropmod_node_training_graph(node_name, nodeset, simple_sentences, mod_candidates, processed_mod_pos, filtered_mod_pos, 90 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 91 | 92 | if operreq == "drop-ood": 93 | oodnode_candidates = oper_candidates 94 | processed_oodnode_candidates = processed_oper_candidates 95 | filtered_mod_pos = filtered_postions 96 | nodes_2_process = self.process_dropood_node_training_graph(node_name, nodeset, simple_sentences, oodnode_candidates, processed_oodnode_candidates, filtered_mod_pos, 97 | nodes_2_process, main_sent_dict, boxer_graph, training_graph) 98 | 99 | self.expand_training_graph(nodes_2_process[1:], main_sent_dict, boxer_graph, training_graph) 100 | 101 | def process_split_node_training_graph(self, node_name, nodeset, simple_sentences, split_candidate_tuples, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 102 | split_candidate_results = [] 103 | splitAchieved = False 104 | for split_candidate in split_candidate_tuples: 105 | isValidSplit, split_results = self.method_training_graph.process_split_candidate_for_split(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 106 | # print "split_candidate : "+str(split_candidate) + " : " + str(isValidSplit) 107 | split_candidate_results.append((isValidSplit, split_results)) 108 | if isValidSplit: 109 | splitAchieved = True 110 | 111 | if splitAchieved: 112 | # At least one split candidate succeeded 113 | for split_candidate, results_tuple in zip(split_candidate_tuples, split_candidate_results): 114 | if results_tuple[0]: 115 | # Adding the operation node 116 | not_applied_cands = [item for item in split_candidate_tuples if item is not split_candidate] 117 | opernode_data = ("split", split_candidate, not_applied_cands) 118 | opernode_name = training_graph.create_opernode(opernode_data) 119 | training_graph.create_edge((node_name, opernode_name, split_candidate)) 120 | 121 | # Adding children major nodes 122 | for item in results_tuple[1]: 123 | child_nodeset = item[1] 124 | child_nodeset.sort() 125 | parent_child_nodeset = item[2] 126 | simple_sentence = item[3] 127 | 128 | # Check for adding OOD or subsequent nodes 129 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, [simple_sentence], boxer_graph, training_graph, "drop-rel", child_nodeset, [], []) 130 | if isNew: 131 | nodes_2_process.append(child_majornode_name) 132 | training_graph.create_edge((opernode_name, child_majornode_name, parent_child_nodeset)) 133 | 134 | else: 135 | # None of the split candidates succeeded, adding the operation node 136 | not_applied_cands = [item for item in split_candidate_tuples] 137 | opernode_data = ("split", None, not_applied_cands) 138 | opernode_name = training_graph.create_opernode(opernode_data) 139 | training_graph.create_edge((node_name, opernode_name, None)) 140 | 141 | # Check for adding drop-rel or drop-mod or fin nodes 142 | child_nodeset = nodeset 143 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences,
boxer_graph, training_graph, "drop-rel", child_nodeset, [], []) 144 | if isNew: 145 | nodes_2_process.append(child_majornode_name) 146 | training_graph.create_edge((opernode_name, child_majornode_name, None)) 147 | 148 | return nodes_2_process 149 | 150 | def process_droprel_node_training_graph(self, node_name, nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 151 | relnode_to_process = relnode_set[0] 152 | processed_relnode.append(relnode_to_process) 153 | 154 | isValidDrop = self.method_training_graph.process_rel_candidate_for_drop(relnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 155 | if isValidDrop: 156 | # Drop this rel node, adding the operation node 157 | opernode_data = ("drop-rel", relnode_to_process, "True") 158 | opernode_name = training_graph.create_opernode(opernode_data) 159 | training_graph.create_edge((node_name, opernode_name, relnode_to_process)) 160 | 161 | # Check for adding REL or subsequent nodes, (nodeset is changed) 162 | child_nodeset, child_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_to_process, filtered_mod_pos) 163 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos) 164 | if isNew: 165 | nodes_2_process.append(child_majornode_name) 166 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 167 | else: 168 | # Don't drop this rel node, adding the operation node 169 | opernode_data = ("drop-rel", relnode_to_process, "False") 170 | opernode_name = training_graph.create_opernode(opernode_data) 171 | training_graph.create_edge((node_name, opernode_name, relnode_to_process)) 172 | 173 | # Check for adding REL or subsequent nodes, (nodeset is unchanged) 174 | child_nodeset = nodeset 175 | child_filtered_mod_pos = filtered_mod_pos 176 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-rel", child_nodeset, processed_relnode, child_filtered_mod_pos) 177 | if isNew: 178 | nodes_2_process.append(child_majornode_name) 179 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 180 | 181 | return nodes_2_process 182 | 183 | def process_dropmod_node_training_graph(self, node_name, nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 184 | modcand_to_process = modcand_set[0] 185 | modcand_position_to_process = modcand_to_process[0] 186 | processed_mod_pos.append(modcand_position_to_process) 187 | 188 | isValidDrop = self.method_training_graph.process_mod_candidate_for_drop(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 189 | if isValidDrop: 190 | # Drop this mod pos, adding the operation node 191 | opernode_data = ("drop-mod", modcand_to_process, "True") 192 | opernode_name = training_graph.create_opernode(opernode_data) 193 | training_graph.create_edge((node_name, opernode_name, modcand_to_process)) 194 | 195 | # Check for adding mod and their subsequent nodes, (nodeset is not changed) 196 | child_nodeset = nodeset 197 | filtered_mod_pos.append(modcand_position_to_process) 198 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos,
filtered_mod_pos) 199 | if isNew: 200 | nodes_2_process.append(child_majornode_name) 201 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 202 | else: 203 | # Don't drop this pos, adding the operation node 204 | opernode_data = ("drop-mod", modcand_to_process, "False") 205 | opernode_name = training_graph.create_opernode(opernode_data) 206 | training_graph.create_edge((node_name, opernode_name, modcand_to_process)) 207 | 208 | # Check for adding mod and their subsequent nodes, (nodeset is not changed) 209 | child_nodeset = nodeset 210 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-mod", child_nodeset, processed_mod_pos, filtered_mod_pos) 211 | if isNew: 212 | nodes_2_process.append(child_majornode_name) 213 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 214 | return nodes_2_process 215 | 216 | def process_dropood_node_training_graph(self, node_name, nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos, nodes_2_process, main_sent_dict, boxer_graph, training_graph): 217 | 218 | oodnode_to_process = oodnode_set[0] 219 | processed_oodnode.append(oodnode_to_process) 220 | 221 | isValidDrop = self.method_training_graph.process_ood_candidate_for_drop(oodnode_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph) 222 | if isValidDrop: 223 | # Drop this ood node, adding the operation node 224 | opernode_data = ("drop-ood", oodnode_to_process, "True") 225 | opernode_name = training_graph.create_opernode(opernode_data) 226 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 227 | 228 | # Check for adding OOD or subsequent nodes, (nodeset is changed) 229 | child_nodeset = nodeset 230 | child_nodeset.remove(oodnode_to_process) 231 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos) 232 | if isNew: 233 | nodes_2_process.append(child_majornode_name) 234 | training_graph.create_edge((opernode_name, child_majornode_name, "True")) 235 | else: 236 | # Don't drop this ood node, adding the operation node 237 | opernode_data = ("drop-ood", oodnode_to_process, "False") 238 | opernode_name = training_graph.create_opernode(opernode_data) 239 | training_graph.create_edge((node_name, opernode_name, oodnode_to_process)) 240 | 241 | # Check for adding OOD or subsequent nodes, (nodeset is unchanged) 242 | child_nodeset = nodeset 243 | child_majornode_name, isNew = self.addition_major_node(main_sent_dict, simple_sentences, boxer_graph, training_graph, "drop-ood", child_nodeset, processed_oodnode, filtered_mod_pos) 244 | if isNew: 245 | nodes_2_process.append(child_majornode_name) 246 | training_graph.create_edge((opernode_name, child_majornode_name, "False")) 247 | 248 | return nodes_2_process 249 | 250 | def addition_major_node(self, main_sent_dict, simple_sentences, boxer_graph, training_graph, opertype, nodeset, processed_candidates, extra_data): 251 | # node type - value 252 | type_val = {"split":1, "drop-rel":2, "drop-mod":3, "drop-ood":4} 253 | operval = type_val[opertype] 254 | 255 | # Checking for the addition of "split" major-node 256 | if operval <= type_val["split"]: 257 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 258 | # Calculating Split Candidates - DRS Graph node tuples 259 | split_candidate_tuples = boxer_graph.extract_split_candidate_tuples(nodeset,
self.MAX_SPLIT_PAIR_SIZE) 260 | # print "split_candidate_tuples : " + str(split_candidate_tuples) 261 | 262 | if len(split_candidate_tuples) != 0: 263 | # Adding the major node for split 264 | majornode_data = ("split", nodeset, simple_sentences, split_candidate_tuples) 265 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 266 | return majornode_name, isNew 267 | 268 | if operval <= type_val["drop-rel"]: 269 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 270 | # Calculate drop-rel candidates 271 | processed_relnode = processed_candidates if opertype == "drop-rel" else [] 272 | filtered_mod_pos = extra_data if opertype == "drop-rel" else [] 273 | relnode_set = boxer_graph.extract_drop_rel_candidates(nodeset, self.RESTRICTED_DROP_REL, processed_relnode) 274 | if len(relnode_set) != 0: 275 | # Adding the major nodes for drop-rel 276 | majornode_data = ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos) 277 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 278 | return majornode_name, isNew 279 | 280 | if operval <= type_val["drop-mod"]: 281 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 282 | # Calculate drop-mod candidates 283 | processed_mod_pos = processed_candidates if opertype == "drop-mod" else [] 284 | filtered_mod_pos = extra_data 285 | modcand_set = boxer_graph.extract_drop_mod_candidates(nodeset, main_sent_dict, self.ALLOWED_DROP_MOD, processed_mod_pos) 286 | if len(modcand_set) != 0: 287 | # Adding the major nodes for drop-mod 288 | majornode_data = ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos) 289 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 290 | return majornode_name, isNew 291 | 292 | if operval <= type_val["drop-ood"]: 293 | if opertype in self.DISCOURSE_SENTENCE_MODEL: 294 | # Check for drop-OOD node candidates 295 | processed_oodnodes = processed_candidates if opertype == "drop-ood" else [] 296 | filtered_mod_pos = extra_data 297 | oodnode_candidates = boxer_graph.extract_ood_candidates(nodeset, processed_oodnodes) 298 | if len(oodnode_candidates) != 0: 299 | # Adding the major node for drop-ood 300 | majornode_data = ("drop-ood", nodeset, simple_sentences, oodnode_candidates, processed_oodnodes, filtered_mod_pos) 301 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 302 | return majornode_name, isNew 303 | 304 | 305 | # None of them matched, create "fin" node 306 | filtered_mod_pos = extra_data 307 | majornode_data = ("fin", nodeset, simple_sentences, filtered_mod_pos) 308 | majornode_name, isNew = training_graph.create_majornode(majornode_data) 309 | return majornode_name, isNew 310 | -------------------------------------------------------------------------------- /source/function_select_methods.py: -------------------------------------------------------------------------------- 1 | 2 | #=================================================================================== 3 | #description : Methods for training graph and features exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. 
= 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | 10 | from methods_training_graph import Method_LED, Method_OVERLAP_LED 11 | from methods_feature_extract import Feature_Init, Feature_Nov27 12 | 13 | def select_training_graph_method(METHOD_TRAINING_GRAPH): 14 | return { 15 | "method-0.99-lteq-lt": Method_OVERLAP_LED(0.99, "lteq", "lt"), 16 | "method-0.75-lteq-lt": Method_OVERLAP_LED(0.75, "lteq", "lt"), 17 | "method-0.5-lteq-lteq": Method_OVERLAP_LED(0.5, "lteq", "lteq"), 18 | "method-led-lteq": Method_LED("lteq", "lteq", "lteq"), 19 | "method-led-lt": Method_LED("lt", "lt", "lt") 20 | }[METHOD_TRAINING_GRAPH] 21 | 22 | def select_feature_extract_method(METHOD_FEATURE_EXTRACT): 23 | return { 24 | "feature-init": Feature_Init(), 25 | "feature-Nov27": Feature_Nov27(), 26 | }[METHOD_FEATURE_EXTRACT] 27 |
-------------------------------------------------------------------------------- /source/functions_configuration_file.py: -------------------------------------------------------------------------------- 1 | #=================================================================================== 2 | #title : functions_configuration_file.py = 3 | #description : Prepare/read configuration file = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | def write_config_file(config_filename, config_data_dict): 10 | config_file = open(config_filename, "w") 11 | 12 | config_file.write("##############################################################\n"+ 13 | "####### Discourse-Complex-Simple Configuration File ##########\n"+ 14 | "##############################################################\n\n") 15 | 16 | config_file.write("# Generation Information\n") 17 | if "TRAIN-BOXER-GRAPH" in config_data_dict: 18 | config_file.write("[TRAIN-BOXER-GRAPH]\n"+config_data_dict["TRAIN-BOXER-GRAPH"]+"\n\n") 19 | 20 | if "TRANSFORMATION-MODEL" in config_data_dict: 21 | config_file.write("[TRANSFORMATION-MODEL]\n"+" ".join(config_data_dict["TRANSFORMATION-MODEL"])+"\n\n") 22 | 23 | if "MAX-SPLIT-SIZE" in config_data_dict: 24 | config_file.write("[MAX-SPLIT-SIZE]\n"+str(config_data_dict["MAX-SPLIT-SIZE"])+"\n\n") 25 | 26 | if "RESTRICTED-DROP-RELATION" in config_data_dict: 27 | config_file.write("[RESTRICTED-DROP-RELATION]\n"+" ".join(config_data_dict["RESTRICTED-DROP-RELATION"])+"\n\n") 28 | 29 | if "ALLOWED-DROP-MODIFIER" in config_data_dict: 30 | config_file.write("[ALLOWED-DROP-MODIFIER]\n"+" ".join(config_data_dict["ALLOWED-DROP-MODIFIER"])+"\n\n") 31 | 32 | if "METHOD-TRAINING-GRAPH" in config_data_dict: 33 | config_file.write("[METHOD-TRAINING-GRAPH]\n"+config_data_dict["METHOD-TRAINING-GRAPH"]+"\n\n") 34 | 35 | if "METHOD-FEATURE-EXTRACT" in config_data_dict: 36 | config_file.write("[METHOD-FEATURE-EXTRACT]\n"+config_data_dict["METHOD-FEATURE-EXTRACT"]+"\n\n") 37 | 38 | if "NUM-EM-ITERATION" in config_data_dict: 39 | config_file.write("[NUM-EM-ITERATION]\n"+str(config_data_dict["NUM-EM-ITERATION"])+"\n\n") 40 | 41 | if "LANGUAGE-MODEL" in config_data_dict: 42 | config_file.write("[LANGUAGE-MODEL]\n"+config_data_dict["LANGUAGE-MODEL"]+"\n\n") 43 | 44 | config_file.write("# Step-1\n") 45 | if "TRAIN-TRAINING-GRAPH" in config_data_dict: 46 | config_file.write("[TRAIN-TRAINING-GRAPH]\n"+config_data_dict["TRAIN-TRAINING-GRAPH"]+"\n\n") 47 | 48 |
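# (Editor's note, not part of the original source.) An illustrative sketch of the file this function emits, with placeholder values rather than ones taken from this repository: each section is a bracketed key followed by its value on the next line, e.g.
#
#   [MAX-SPLIT-SIZE]
#   2
#
#   [METHOD-TRAINING-GRAPH]
#   method-0.5-lteq-lteq
#
# parser_config_file below recovers the dictionary by scanning for "[KEY]" lines and reading the single line that follows each one.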
config_file.write("# Step-2\n") 49 | if "TRANSFORMATION-MODEL-DIR" in config_data_dict: 50 | config_file.write("[TRANSFORMATION-MODEL-DIR]\n"+config_data_dict["TRANSFORMATION-MODEL-DIR"]+"\n\n") 51 | 52 | config_file.write("# Step-3\n") 53 | if "MOSES-COMPLEX-SIMPLE-DIR" in config_data_dict: 54 | config_file.write("[MOSES-COMPLEX-SIMPLE-DIR]\n"+config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"]+"\n\n") 55 | 56 | config_file.close() 57 | 58 | 59 | def parser_config_file(config_file): 60 | config_data = (open(config_file, "r").read().strip()).split("\n") 61 | config_data_dict = {} 62 | count = 0 63 | while count < len(config_data): 64 | if config_data[count].startswith("["): 65 | # Start Information 66 | if config_data[count].strip()[1:-1] == "TRAIN-BOXER-GRAPH": 67 | config_data_dict["TRAIN-BOXER-GRAPH"] = config_data[count+1].strip() 68 | 69 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL": 70 | config_data_dict["TRANSFORMATION-MODEL"] = config_data[count+1].strip().split() 71 | 72 | if config_data[count].strip()[1:-1] == "MAX-SPLIT-SIZE": 73 | config_data_dict["MAX-SPLIT-SIZE"] = int(config_data[count+1].strip()) 74 | 75 | if config_data[count].strip()[1:-1] == "RESTRICTED-DROP-RELATION": 76 | config_data_dict["RESTRICTED-DROP-RELATION"] = config_data[count+1].strip().split() 77 | 78 | if config_data[count].strip()[1:-1] == "ALLOWED-DROP-MODIFIER": 79 | config_data_dict["ALLOWED-DROP-MODIFIER"] = config_data[count+1].strip().split() 80 | 81 | if config_data[count].strip()[1:-1] == "METHOD-TRAINING-GRAPH": 82 | config_data_dict["METHOD-TRAINING-GRAPH"] = config_data[count+1].strip() 83 | 84 | if config_data[count].strip()[1:-1] == "METHOD-FEATURE-EXTRACT": 85 | config_data_dict["METHOD-FEATURE-EXTRACT"] = config_data[count+1].strip() 86 | 87 | if config_data[count].strip()[1:-1] == "NUM-EM-ITERATION": 88 | config_data_dict["NUM-EM-ITERATION"] = int(config_data[count+1].strip()) 89 | 90 | if config_data[count].strip()[1:-1] == "LANGUAGE-MODEL": 91 | config_data_dict["LANGUAGE-MODEL"] = config_data[count+1].strip() 92 | 93 | # Step 1 94 | if config_data[count].strip()[1:-1] == "TRAIN-TRAINING-GRAPH": 95 | config_data_dict["TRAIN-TRAINING-GRAPH"] = config_data[count+1].strip() 96 | 97 | # Step 2 98 | if config_data[count].strip()[1:-1] == "TRANSFORMATION-MODEL-DIR": 99 | config_data_dict["TRANSFORMATION-MODEL-DIR"] = config_data[count+1].strip() 100 | 101 | # Step 3 102 | if config_data[count].strip()[1:-1] == "MOSES-COMPLEX-SIMPLE-DIR": 103 | config_data_dict["MOSES-COMPLEX-SIMPLE-DIR"] = config_data[count+1].strip() 104 | 105 | count += 2 106 | else: 107 | count += 1 108 | return config_data_dict 109 | -------------------------------------------------------------------------------- /source/functions_model_files.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : functions_model_files.py = 4 | #description : Model file Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | def read_model_files(model_dir, transformation_model): 11 | probability_tables = {} 12 | for trans_method in transformation_model: 13 | modelfile = model_dir+"/D2S-"+trans_method.upper()+".model" 14 | probability_tables[trans_method] = {} 15 | with open(modelfile) as infile: 16 | for line in infile: 17 | data = line.split() 18 | if data[0] not in probability_tables[trans_method]: 19 | probability_tables[trans_method][data[0]] = {data[1]:float(data[2])} 20 | else: 21 | probability_tables[trans_method][data[0]][data[1]] = float(data[2]) 22 | print modelfile + " done ..." 23 | return probability_tables 24 | 25 | def write_model_files(model_dir, probability_tables, smt_sentence_pairs): 26 | if "split" in probability_tables: 27 | print "Writing "+model_dir+"/D2S-SPLIT.model ..." 28 | foutput = open(model_dir+"/D2S-SPLIT.model", "w") 29 | split_feature_set = probability_tables["split"].keys() 30 | split_feature_set.sort() 31 | for item in split_feature_set: 32 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["split"][item]["true"])+"\n") 33 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["split"][item]["false"])+"\n") 34 | foutput.close() 35 | 36 | if "drop-ood" in probability_tables: 37 | print "Writing "+model_dir+"/D2S-DROP-OOD.model ..." 38 | foutput = open(model_dir+"/D2S-DROP-OOD.model", "w") 39 | drop_ood_feature_set = probability_tables["drop-ood"].keys() 40 | drop_ood_feature_set.sort() 41 | for item in drop_ood_feature_set: 42 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-ood"][item]["true"])+"\n") 43 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-ood"][item]["false"])+"\n") 44 | foutput.close() 45 | 46 | if "drop-rel" in probability_tables: 47 | print "Writing "+model_dir+"/D2S-DROP-REL.model ..." 48 | foutput = open(model_dir+"/D2S-DROP-REL.model", "w") 49 | drop_rel_feature_set = probability_tables["drop-rel"].keys() 50 | drop_rel_feature_set.sort() 51 | for item in drop_rel_feature_set: 52 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-rel"][item]["true"])+"\n") 53 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-rel"][item]["false"])+"\n") 54 | foutput.close() 55 | 56 | if "drop-mod" in probability_tables: 57 | print "Writing "+model_dir+"/D2S-DROP-MOD.model ..." 58 | foutput = open(model_dir+"/D2S-DROP-MOD.model", "w") 59 | drop_mod_feature_set = probability_tables["drop-mod"].keys() 60 | drop_mod_feature_set.sort() 61 | for item in drop_mod_feature_set: 62 | foutput.write(item.encode('utf-8')+"\t"+"true"+"\t"+str(probability_tables["drop-mod"][item]["true"])+"\n") 63 | foutput.write(item.encode('utf-8')+"\t"+"false"+"\t"+str(probability_tables["drop-mod"][item]["false"])+"\n") 64 | foutput.close() 65 | 66 | # Writing SMT training data 67 | print "Writing "+model_dir+"/D2S-SMT.source ..." 68 | print "Writing "+ model_dir+"/D2S-SMT.target ..." 
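# (Editor's note, not part of the original source.) Each D2S-*.model file read and written in this module holds one "<feature>\t<true|false>\t<probability>" triple per line. With a hypothetical feature name, the two lines
#   vp-conj-vp   true    0.75
#   vp-conj-vp   false   0.25
# load back through read_model_files above as
#   probability_tables["split"]["vp-conj-vp"] == {"true": 0.75, "false": 0.25}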
69 | fsource = open(model_dir+"/D2S-SMT.source", "w") 70 | ftarget = open(model_dir+"/D2S-SMT.target", "w") 71 | for sentid in smt_sentence_pairs: 72 | # print sentid 73 | # print smt_sentence_pairs[sentid] 74 | for pair in smt_sentence_pairs[sentid]: 75 | fsource.write(pair[0].encode('utf-8')+"\n") 76 | ftarget.write(pair[1].encode('utf-8')+"\n") 77 | fsource.close() 78 | ftarget.close() 79 | -------------------------------------------------------------------------------- /source/functions_prepare_elementtree_dot.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : functions_prepare_elementtree_dot.py = 4 | #description : Prepare dot file = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import os 12 | import xml.etree.ElementTree as ET 13 | from xml.dom import minidom 14 | 15 | def prettify_xml_element(element): 16 | """Return a pretty-printed XML string for the Element. 17 | """ 18 | rough_string = ET.tostring(element) 19 | reparsed = minidom.parseString(rough_string) 20 | prettyxml = reparsed.documentElement.toprettyxml(indent=" ") 21 | return prettyxml.encode("utf-8") 22 | 23 | ############################### Elementary Tree ########################################## 24 | 25 | def prepare_write_sentence_element(output_stream, sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 26 | # Creating Sentence element 27 | sentence = ET.Element('sentence') 28 | sentence.attrib={"id":str(sentid)} 29 | 30 | # Writing main sentence 31 | main = ET.SubElement(sentence, "main") 32 | mainsent = ET.SubElement(main, "s") 33 | mainsent.text = main_sentence 34 | wordinfo = ET.SubElement(main, "winfo") 35 | mainpositions = main_sent_dict.keys() 36 | mainpositions.sort() 37 | for position in mainpositions: 38 | word = ET.SubElement(wordinfo, "w") 39 | word.text = main_sent_dict[position][0] 40 | word.attrib = {"id":str(position), "pos":main_sent_dict[position][1]} 41 | 42 | # Writing simple sentence 43 | simpleset = ET.SubElement(sentence, "simple-set") 44 | for simple_sentence in simple_sentences: 45 | simple = ET.SubElement(simpleset, "simple") 46 | simplesent = ET.SubElement(simple, "s") 47 | simplesent.text = simple_sentence 48 | 49 | # Writing boxer Data : boxer_graph 50 | boxer = boxer_graph.convert_to_elementarytree() 51 | sentence.append(boxer) 52 | 53 | # Writing Training Graph : training_graph 54 | traininggraph = training_graph.convert_to_elementarytree() 55 | sentence.append(traininggraph) 56 | 57 | output_stream.write(prettify_xml_element(sentence)) 58 | 59 | ############################ Dot - PNG File ################################################### 60 | 61 | def run_visual_graph_creator(sentid, main_sentence, main_sent_dict, simple_sentences, boxer_graph, training_graph): 62 | print "Creating boxer and training graphs for sentence id : "+sentid+" ..." 
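# (Editor's note, not part of the original source.) The two blocks below write Graphviz .dot files under /tmp and shell out to the "dot" binary via os.system, so Graphviz must be installed and on PATH for the PNG rendering to succeed.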
63 | 64 | # Start creating boxer graph 65 | foutput = open("/tmp/boxer-graph-"+sentid+".dot", "w") 66 | boxer_dotstring = boxer_graph.convert_to_dotstring(sentid, main_sentence, main_sent_dict, simple_sentences) 67 | foutput.write(boxer_dotstring) 68 | foutput.close() 69 | os.system("dot -Tpng /tmp/boxer-graph-"+sentid+".dot -o /tmp/boxer-graph-"+sentid+".png") 70 | 71 | 72 | # Start creating training graph 73 | foutput = open("/tmp/training-graph-"+sentid+".dot", "w") 74 | train_dotstring = training_graph.convert_to_dotstring(main_sent_dict, boxer_graph) 75 | foutput.write(train_dotstring) 76 | foutput.close() 77 | os.system("dot -Tpng /tmp/training-graph-"+sentid+".dot -o /tmp/training-graph-"+sentid+".png") 78 | -------------------------------------------------------------------------------- /source/methods_feature_extract.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #description : Methods for features exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | 10 | class Feature_Nov27: 11 | 12 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph): 13 | # Calculating iLength 14 | #iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list) 15 | # Get split tuple pattern 16 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple) 17 | #split_feature = split_pattern+"_"+str(iLength) 18 | split_feature = split_pattern 19 | return split_feature 20 | 21 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph): 22 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict) 23 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one 24 | span = boxer_graph.extract_span_min_max(nodeset) 25 | boundaryVal = "false" 26 | if ood_position <= span[0] or ood_position >= span[1]: 27 | boundaryVal = "true" 28 | drop_ood_feature = ood_word+"_"+boundaryVal 29 | return drop_ood_feature 30 | 31 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph): 32 | rel_word = boxer_graph.relations[rel_node]["predicates"] 33 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset) 34 | drop_rel_feature = rel_word+"_" 35 | if len(rel_span) <= 2: 36 | drop_rel_feature += "0-2" 37 | elif len(rel_span) <= 5: 38 | drop_rel_feature += "2-5" 39 | elif len(rel_span) <= 10: 40 | drop_rel_feature += "5-10" 41 | elif len(rel_span) <= 15: 42 | drop_rel_feature += "10-15" 43 | else: 44 | drop_rel_feature += "gt15" 45 | return drop_rel_feature 46 | 47 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph): 48 | mod_pos = int(mod_cand[0]) 49 | mod_word = main_sent_dict[mod_pos][0] 50 | #mod_node = mod_cand[1] 51 | drop_mod_feature = mod_word 52 | return drop_mod_feature 53 | 54 | class Feature_Init: 55 | 56 | def get_split_feature(self, split_tuple, parent_sentence, children_sentence_list, boxer_graph): 57 | # Calculating iLength 58 | iLength = boxer_graph.calculate_iLength(parent_sentence, children_sentence_list) 59 | # Get split tuple pattern 60 | split_pattern = boxer_graph.get_pattern_4_split_candidate(split_tuple) 61 | split_feature = 
split_pattern+"_"+str(iLength) 62 | return split_feature 63 | 64 | def get_drop_ood_feature(self, ood_node, nodeset, main_sent_dict, boxer_graph): 65 | ood_word = boxer_graph.extract_oodword(ood_node, main_sent_dict) 66 | ood_position = boxer_graph.nodes[ood_node]["positions"][0] # length of positions is one 67 | span = boxer_graph.extract_span_min_max(nodeset) 68 | boundaryVal = "false" 69 | if ood_position <= span[0] or ood_position >= span[1]: 70 | boundaryVal = "true" 71 | drop_ood_feature = ood_word+"_"+boundaryVal 72 | return drop_ood_feature 73 | 74 | def get_drop_rel_feature(self, rel_node, nodeset, main_sent_dict, boxer_graph): 75 | rel_word = boxer_graph.relations[rel_node]["predicates"] 76 | rel_span = boxer_graph.extract_span_for_nodeset_with_rel(rel_node, nodeset) 77 | drop_rel_feature = rel_word+"_"+str(len(rel_span)) 78 | return drop_rel_feature 79 | 80 | def get_drop_mod_feature(self, mod_cand, main_sent_dict, boxer_graph): 81 | mod_pos = int(mod_cand[0]) 82 | mod_word = main_sent_dict[mod_pos][0] 83 | #mod_node = mod_cand[1] 84 | drop_mod_feature = mod_word 85 | return drop_mod_feature 86 | 87 | -------------------------------------------------------------------------------- /source/methods_training_graph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #description : Methods for training graph exploration = 4 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 5 | #date : Created in 2014, Later revised in April 2016. = 6 | #version : 0.1 = 7 | #=================================================================================== 8 | 9 | from nltk.metrics.distance import edit_distance 10 | 11 | # Compare edit distance 12 | def compare_edit_distance(operator,edit_dist_after_drop, edit_dist_before_drop): 13 | if operator == "lt": 14 | if edit_dist_after_drop < edit_dist_before_drop: 15 | return True 16 | else: 17 | return False 18 | 19 | if operator == "lteq": 20 | if edit_dist_after_drop <= edit_dist_before_drop: 21 | return True 22 | else: 23 | return False 24 | 25 | # Split Candidate: Common for all clsses 26 | def process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph): 27 | if len(split_candidate) != len(simple_sentences): 28 | # Number of events is less than number of simple sentences 29 | return False, [] 30 | 31 | else: 32 | # Calculate all parent and following subtrees 33 | parent_subgraph_nodeset_dict = boxer_graph.extract_parent_subgraph_nodeset_dict() 34 | #print "parent_subgraph_nodeset_dict : "+str(parent_subgraph_nodeset_dict) 35 | 36 | node_overlap_dict = {} 37 | for nodename in split_candidate: 38 | split_nodeset = parent_subgraph_nodeset_dict[nodename] 39 | subsentence = boxer_graph.extract_main_sentence(split_nodeset, main_sent_dict, []) 40 | subsentence_words_set = set(subsentence.split()) 41 | 42 | overlap_data = [] 43 | for index in range(len(simple_sentences)): 44 | simple_sent_words_set = set(simple_sentences[index].split()) 45 | overlap_words_set = subsentence_words_set & simple_sent_words_set 46 | overlap_data.append((len(overlap_words_set), index)) 47 | overlap_data.sort(reverse=True) 48 | 49 | node_overlap_dict[nodename] = overlap_data[0] 50 | 51 | # Check that every node has some overlap in their maximum overlap else fail 52 | overlap_maxvalues = [node_overlap_dict[node][0] for node in node_overlap_dict] 53 | if 0 in 
overlap_maxvalues: 54 | return False, [] 55 | else: 56 | # check the mapping covers all simple sentences 57 | overlap_max_simple_indixes = [node_overlap_dict[node][1] for node in node_overlap_dict] 58 | if len(set(overlap_max_simple_indixes)) == len(simple_sentences): 59 | # Thats a valid split, attach unprocessed graph components 60 | node_subgraph_nodeset_dict, node_span_dict = boxer_graph.partition_drs_for_successful_candidate(split_candidate, parent_subgraph_nodeset_dict) 61 | 62 | results = [] 63 | for nodename in split_candidate: 64 | span = node_span_dict[nodename] 65 | nodeset = node_subgraph_nodeset_dict[nodename][:] 66 | simple_sentence = simple_sentences[node_overlap_dict[nodename][1]] 67 | results.append((span, nodeset, nodename, simple_sentence)) 68 | # Sort them based on starting 69 | results.sort() 70 | return True, results 71 | else: 72 | return False, [] 73 | 74 | # functions : Drop-REL Candidate 75 | def process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, overlap_percentage): 76 | simple_sentence = " ".join(simple_sentences) 77 | simple_words = simple_sentence.split() 78 | 79 | rel_phrase = boxer_graph.extract_relation_phrase(relnode_candidate, nodeset, main_sent_dict, filtered_mod_pos) 80 | 81 | #print relnode_candidate, rel_phrase 82 | 83 | rel_words = rel_phrase.split() 84 | if len(rel_words) == 0: 85 | return True 86 | else: 87 | found = 0 88 | for word in rel_words: 89 | if word in simple_words: 90 | found += 1 91 | percentage_found = found/float(len(rel_words)) 92 | 93 | if percentage_found <= overlap_percentage: 94 | return True 95 | else: 96 | return False 97 | 98 | def process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_rel): 99 | simple_sentence = " ".join(simple_sentences) 100 | 101 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 102 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 103 | 104 | temp_nodeset, temp_filtered_mod_pos = boxer_graph.drop_relation(nodeset, relnode_candidate, filtered_mod_pos) 105 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, temp_filtered_mod_pos) 106 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 107 | 108 | isDrop = compare_edit_distance(opr_drop_rel, edit_dist_after_drop, edit_dist_before_drop) 109 | return isDrop 110 | 111 | # functions : Drop-MOD Candidate 112 | def process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_mod): 113 | simple_sentence = " ".join(simple_sentences) 114 | 115 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 116 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 117 | 118 | modcand_position_to_process = modcand_to_process[0] 119 | temp_filtered_mod_pos = filtered_mod_pos[:]+[modcand_position_to_process] 120 | sentence_after_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, temp_filtered_mod_pos) 121 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 122 | 123 | isDrop = compare_edit_distance(opr_drop_mod, edit_dist_after_drop, edit_dist_before_drop) 124 | return isDrop 125 | 126 | # functions : Drop-OOD Candidate 127 | def 
process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, opr_drop_ood): 128 | simple_sentence = " ".join(simple_sentences) 129 | 130 | sentence_before_drop = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_mod_pos) 131 | edit_dist_before_drop = edit_distance(sentence_before_drop.split(), simple_sentence.split()) 132 | 133 | temp_nodeset = nodeset[:] 134 | temp_nodeset.remove(oodnode_candidate) 135 | sentence_after_drop = boxer_graph.extract_main_sentence(temp_nodeset, main_sent_dict, filtered_mod_pos) 136 | edit_dist_after_drop = edit_distance(sentence_after_drop.split(), simple_sentence.split()) 137 | 138 | isDrop = compare_edit_distance(opr_drop_ood, edit_dist_after_drop, edit_dist_before_drop) 139 | return isDrop 140 | 141 | class Method_OVERLAP_LED: 142 | def __init__(self, overlap_percentage, opr_drop_mod, opr_drop_ood): 143 | self.overlap_percentage = overlap_percentage 144 | self.opr_drop_mod = opr_drop_mod 145 | self.opr_drop_ood = opr_drop_ood 146 | 147 | # Split candidate 148 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph): 149 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 150 | return isSplit, results 151 | 152 | # Drop-REL Candidate 153 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 154 | isDrop = process_rel_candidate_for_drop_overlap(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.overlap_percentage) 155 | return isDrop 156 | 157 | # Drop-MOD Candidate 158 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 159 | isDrop = process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod) 160 | return isDrop 161 | 162 | # Drop-OOD Candidate 163 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 164 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood) 165 | return isDrop 166 | 167 | class Method_LED: 168 | def __init__(self, opr_drop_rel, opr_drop_mod, opr_drop_ood): 169 | self.opr_drop_rel = opr_drop_rel 170 | self.opr_drop_mod = opr_drop_mod 171 | self.opr_drop_ood = opr_drop_ood 172 | 173 | # Split candidate 174 | def process_split_candidate_for_split(self, split_candidate, simple_sentences, main_sent_dict, boxer_graph): 175 | isSplit, results = process_split_candidate_for_split_common(split_candidate, simple_sentences, main_sent_dict, boxer_graph) 176 | return isSplit, results 177 | 178 | # Drop-REL Candidate 179 | def process_rel_candidate_for_drop(self, relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 180 | isDrop = process_rel_candidate_for_drop_led(relnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_rel) 181 | return isDrop 182 | 183 | # Drop-MOD Candidate 184 | def process_mod_candidate_for_drop(self, modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 185 | isDrop = 
process_mod_candidate_for_drop_led(modcand_to_process, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_mod) 186 | return isDrop 187 | 188 | # Drop-OOD Candidate 189 | def process_ood_candidate_for_drop(self, oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph): 190 | isDrop = process_ood_candidate_for_drop_led(oodnode_candidate, filtered_mod_pos, nodeset, simple_sentences, main_sent_dict, boxer_graph, self.opr_drop_ood) 191 | return isDrop 192 | -------------------------------------------------------------------------------- /source/saxparser_xml_stanfordtokenized_boxergraph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : saxparser_xml_stanfordtokenized_boxergraph.py = 4 | #description : Boxer-Graph-XML-Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from xml.sax import handler, make_parser 11 | 12 | from boxer_graph_module import Boxer_Graph 13 | from explore_training_graph import Explore_Training_Graph 14 | 15 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph: 16 | def __init__(self, process, xmlfile, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 17 | # process: "training" or "testing" 18 | self.process = process 19 | 20 | self.xmlfile = xmlfile 21 | 22 | # output_stream: file stream for training and dictionary for testing 23 | self.output_stream = output_stream 24 | 25 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 26 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 27 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 28 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 29 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 30 | 31 | def parse_xmlfile_generating_training_graph(self): 32 | handler = SAX_Handler(self.process, self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE, 33 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH) 34 | 35 | parser = make_parser() 36 | parser.setContentHandler(handler) 37 | parser.parse(self.xmlfile) 38 | 39 | class SAX_Handler(handler.ContentHandler): 40 | def __init__(self, process, output_stream, DISCOURSE_SENTENCE_MODEL, MAX_SPLIT_PAIR_SIZE, 41 | RESTRICTED_DROP_REL, ALLOWED_DROP_MOD, METHOD_TRAINING_GRAPH): 42 | self.process = process 43 | self.output_stream = output_stream 44 | 45 | self.DISCOURSE_SENTENCE_MODEL = DISCOURSE_SENTENCE_MODEL 46 | self.MAX_SPLIT_PAIR_SIZE = MAX_SPLIT_PAIR_SIZE 47 | self.RESTRICTED_DROP_REL = RESTRICTED_DROP_REL 48 | self.ALLOWED_DROP_MOD = ALLOWED_DROP_MOD 49 | self.METHOD_TRAINING_GRAPH = METHOD_TRAINING_GRAPH 50 | 51 | # Training Graph Creator 52 | self.training_graph_handler = Explore_Training_Graph(self.output_stream, self.DISCOURSE_SENTENCE_MODEL, self.MAX_SPLIT_PAIR_SIZE, 53 | self.RESTRICTED_DROP_REL, self.ALLOWED_DROP_MOD, self.METHOD_TRAINING_GRAPH) 54 | 55 | # Sentence Data 56 | self.sentid = "" 57 | self.main_sentence = "" 58 | self.main_sent_dict = {} 59 | self.boxer_graph = Boxer_Graph() 60 | self.simple_sentences = [] 61 | 62 | # Sentence Flags, temporary variables 63 | self.isMain = False 64 | 65 | self.isS = False
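# (Editor's note, not part of the original source.) The is* flags and the temporary buffers below record which XML element the SAX stream is currently inside; characters() at the bottom of this class appends incoming text only while the corresponding flag is set.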
66 | self.sentence = "" 67 | self.wordlist = [] 68 | 69 | self.isW = False 70 | self.word = "" 71 | self.wid = "" 72 | self.wpos = "" 73 | 74 | self.isSimple = False 75 | 76 | # Boxer flags, temporary variables 77 | self.isNode = False 78 | self.isRel = False 79 | self.symbol = "" 80 | self.predsymbol = "" 81 | self.locationlist = [] 82 | 83 | def startDocument(self): 84 | print "Start parsing the document ..." 85 | 86 | def endDocument(self): 87 | print "End parsing the document ..." 88 | 89 | def startElement(self, nameElt, attrOfElt): 90 | if nameElt == "sentence": 91 | self.sentid = attrOfElt["id"] 92 | 93 | # Refreshing Sentence Data 94 | self.main_sentence = "" 95 | self.main_sent_dict = {} 96 | self.boxer_graph = Boxer_Graph() 97 | self.simple_sentences = [] 98 | 99 | if nameElt == "main": 100 | self.isMain = True 101 | 102 | if nameElt == "simple": 103 | self.isSimple = True 104 | 105 | if nameElt == "s": 106 | self.isS = True 107 | self.sentence = "" 108 | self.wordlist = [] 109 | 110 | if nameElt == "w": 111 | self.isW = True 112 | self.wid = int(attrOfElt["id"][1:]) 113 | self.wpos = attrOfElt["pos"] 114 | self.word = "" 115 | 116 | if nameElt == "node": 117 | self.isNode = True 118 | self.symbol = attrOfElt["sym"] 119 | self.boxer_graph.nodes[self.symbol] = {"positions":[], "predicates":[]} 120 | 121 | if nameElt == "rel": 122 | self.isRel = True 123 | self.symbol = attrOfElt["sym"] 124 | self.boxer_graph.relations[self.symbol] = {"positions":[], "predicates":""} 125 | 126 | if nameElt == "span": 127 | self.locationlist = [] 128 | 129 | if nameElt == "pred": 130 | self.locationlist = [] 131 | self.predsymbol = attrOfElt["sym"] 132 | 133 | if nameElt == "loc": 134 | if int(attrOfElt["id"][1:]) in self.main_sent_dict: 135 | self.locationlist.append(int(attrOfElt["id"][1:])) 136 | 137 | if nameElt == "edge": 138 | self.boxer_graph.edges.append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 139 | 140 | def endElement(self, nameElt): 141 | if nameElt == "sentence": 142 | #print self.sentid 143 | # print self.main_sentence 144 | # print self.main_sent_dict 145 | # print self.simple_sentences 146 | # print self.boxer_graph 147 | 148 | if self.process == "training": 149 | self.training_graph_handler.explore_training_graph(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, self.boxer_graph) 150 | 151 | if self.process == "testing": 152 | self.output_stream[self.sentid] = [self.main_sentence, self.main_sent_dict, self.boxer_graph] 153 | 154 | # if len(self.main_sentence) > 600: 155 | # print self.sentid 156 | # if len(self.simple_sentences) == 6: 157 | # print self.sentid 158 | 159 | if int(self.sentid)%10000 == 0: 160 | print self.sentid + " training data processed ..." 
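        # Usage sketch (illustrative only; parameter values are the defaults
        # from start_learning_training_models.py, file names here invented):
        #
        #   # training mode: streams training-graph XML to an open file handle
        #   foutput = open("train.training-graph.xml", "w")
        #   SAXPARSER_XML_StanfordTokenized_BoxerGraph(
        #       "training", "train.boxer-graph.xml", foutput,
        #       ["split", "drop-ood", "drop-rel", "drop-mod"], 2,
        #       ["agent", "patient", "eq", "theme"],
        #       ["jj", "jjr", "jjs", "rb", "rbr", "rbs"],
        #       "method-0.99-lteq-lt").parse_xmlfile_generating_training_graph()
        #
        #   # testing mode: fills the given dict with
        #   # {sentid: [main_sentence, main_sent_dict, boxer_graph]}
        #   test_dict = {}
        #   SAXPARSER_XML_StanfordTokenized_BoxerGraph(
        #       "testing", "complex.boxer-graph.xml", test_dict,
        #       ...same model parameters...).parse_xmlfile_generating_training_graph()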
161 | 162 | if nameElt == "main": 163 | self.isMain = False 164 | if len(self.wordlist) == 0: 165 | self.main_sentence = self.sentence.lower() 166 | else: 167 | self.main_sentence = (" ".join(self.wordlist)).lower() 168 | 169 | if nameElt == "simple": 170 | self.isSimple = False 171 | self.simple_sentences.append(self.sentence.lower()) 172 | 173 | if nameElt == "s": 174 | self.isS = False 175 | 176 | if nameElt == "w": 177 | self.isW = False 178 | self.main_sent_dict[self.wid] = (self.word.lower(), self.wpos.lower()) 179 | self.wordlist.append(self.word.lower()) 180 | 181 | if nameElt == "node": 182 | self.isNode = False 183 | self.boxer_graph.nodes[self.symbol]["predicates"].sort() 184 | 185 | if nameElt == "rel": 186 | self.isRel = False 187 | 188 | if nameElt == "span": 189 | self.locationlist.sort() 190 | if self.isNode: 191 | self.boxer_graph.nodes[self.symbol]["positions"] = self.locationlist[:] 192 | if self.isRel: 193 | self.boxer_graph.relations[self.symbol]["positions"] = self.locationlist[:] 194 | 195 | if nameElt == "pred": 196 | self.locationlist.sort() 197 | if self.isNode: 198 | self.boxer_graph.nodes[self.symbol]["predicates"].append((self.predsymbol, self.locationlist[:])) 199 | if self.isRel: 200 | self.boxer_graph.relations[self.symbol]["predicates"] = self.predsymbol 201 | 202 | def characters(self, chrs): 203 | if self.isS: 204 | self.sentence += chrs 205 | 206 | if self.isW: 207 | self.word += chrs 208 | 209 | -------------------------------------------------------------------------------- /source/saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : saxparser_xml_stanfordtokenized_boxergraph_traininggraph.py = 4 | #description : Boxer-Training-Graph-XML-Handler = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | from xml.sax import handler, make_parser 11 | from boxer_graph_module import Boxer_Graph 12 | from training_graph_module import Training_Graph 13 | from em_inside_outside_algorithm import EM_InsideOutside_Optimiser 14 | import copy 15 | 16 | class SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph: 17 | def __init__(self, training_xmlfile, NUM_TRAINING_ITERATION, smt_sentence_pairs, probability_tables, count_tables, METHOD_FEATURE_EXTRACT): 18 | self.training_xmlfile = training_xmlfile 19 | self.NUM_TRAINING_ITERATION = NUM_TRAINING_ITERATION 20 | self.smt_sentence_pairs = smt_sentence_pairs 21 | self.probability_tables = probability_tables 22 | self.count_tables = count_tables 23 | self.METHOD_FEATURE_EXTRACT = METHOD_FEATURE_EXTRACT 24 | 25 | self.em_io_handler = EM_InsideOutside_Optimiser(self.smt_sentence_pairs, self.probability_tables, self.count_tables, self.METHOD_FEATURE_EXTRACT) 26 | 27 | def parse_to_initialize_probabilitytable(self): 28 | # Initialize probability table and populate self.smt_sentence_pairs 29 | handler = SAX_Handler("init", self.em_io_handler) 30 | parser = make_parser() 31 | parser.setContentHandler(handler) 32 | print "Start parsing "+self.training_xmlfile+" ..." 
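        # One pass over the training-graph file: the "init" handler routes each
        # parsed sentence to the EM optimiser, which initialises the probability
        # tables and fills smt_sentence_pairs for the later SMT step.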
33 |         parser.parse(self.training_xmlfile)
34 | 
35 |     def parse_to_iterate_probabilitytable(self):
36 |         handler = SAX_Handler("iter", self.em_io_handler)
37 |         parser = make_parser()
38 |         parser.setContentHandler(handler)
39 | 
40 |         for count in range(self.NUM_TRAINING_ITERATION):
41 |             print "Starting iteration: "+str(count+1)+" ..."
42 | 
43 |             print "Resetting all counts to ZERO ..."
44 |             self.em_io_handler.reset_count_table()
45 | 
46 |             print "Start parsing "+self.training_xmlfile+" ..."
47 |             parser.parse(self.training_xmlfile)
48 |             print "Ending iteration: "+str(count+1)+" ..."
49 | 
50 |             print "Updating probability table ..."
51 |             self.em_io_handler.update_probability_table()
52 | 
53 | class SAX_Handler(handler.ContentHandler):
54 |     def __init__(self, stage, em_io_handler):
55 |         # "init" or "iter" stage
56 |         self.stage = stage
57 | 
58 |         # EM algorithm handler
59 |         self.em_io_handler = em_io_handler
60 | 
61 |         # Sentence Data
62 |         self.sentid = ""
63 |         self.main_sentence = ""
64 |         self.simple_sentences = []
65 |         self.main_sent_dict = {}
66 |         # Boxer Data
67 |         self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]}
68 |         # Training Graph Data
69 |         self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]}
70 | 
71 |         # Common TAG variables
72 |         self.isS = False
73 |         self.sentence = ""
74 | 
75 |         # Main
76 |         self.isMain = False
77 |         self.isWinfo = False
78 |         self.isW = False
79 |         self.word = ""
80 |         self.wid = ""
81 |         self.wpos = ""
82 | 
83 |         # Simple Set
84 |         self.isSimple = False
85 | 
86 |         # Boxer
87 |         self.isBoxer = False
88 | 
89 |         # TrainingGraph
90 |         self.isTrainingGraph = False
91 | 
92 |         # Node
93 |         self.isNode = False
94 |         self.nodesym = ""
95 | 
96 |         # Span
97 |         self.isSpan = False
98 | 
99 |         # pred
100 |         self.isPred = False
101 |         self.predsym = ""
102 | 
103 |         # relation
104 |         self.isRel = False
105 |         self.relsym = ""
106 | 
107 |         # major oper nodes
108 |         self.isMajorNodes = False
109 |         self.isOperNodes = False
110 | 
111 |         # type
112 |         self.isType = False
113 |         self.type = ""
114 | 
115 |         # Nodeset
116 |         self.isNodeset = False
117 | 
118 |         # Split
119 |         self.isSplitCandidate = False
120 |         self.isSplitCandidateLeft = False
121 |         self.isSC = False
122 | 
123 |         # Out of discourse OOD
124 |         self.isOODCandidates = False
125 |         self.isOODProcessed = False
126 | 
127 |         # Relations
128 |         self.isRelCandidates = False
129 |         self.isRelProcessed = False
130 | 
131 |         # Modifiers
132 |         self.isModCandidates = False
133 |         self.isModposProcessed = False
134 |         self.isModposFiltered = False
135 | 
136 |     def startDocument(self):
137 |         print "Start parsing the document ..."
138 | 
139 |     def endDocument(self):
140 |         print "End parsing the document ..."
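    # Each closing </sentence> rebuilds the Boxer_Graph and Training_Graph
    # objects from the parsed data and hands them to the EM optimiser:
    # stage "init" calls initialize_probabilitytable_smt_input(), stage "iter"
    # calls iterate_over_probabilitytable() (see endElement below).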
141 | 142 | def startElement(self, nameElt, attrOfElt): 143 | if nameElt == "sentence": 144 | self.sentid = attrOfElt["id"] 145 | 146 | # Refreshing Sentence Data 147 | self.main_sentence = "" 148 | self.simple_sentences = [] 149 | self.main_sent_dict = {} 150 | # Refreshing Boxer Data 151 | self.boxer_graph = {"nodes":{}, "relations":{}, "edges":[]} 152 | # Refreshing Training Graph Data 153 | self.training_graph = {"major-nodes":{}, "oper-nodes":{}, "edges":[]} 154 | 155 | if nameElt == "main": 156 | self.isMain = True 157 | 158 | if nameElt == "s": 159 | self.isS = True 160 | self.sentence = "" 161 | 162 | if nameElt == "winfo": 163 | self.isWinfo = True 164 | 165 | if nameElt == "w": 166 | self.isW = True 167 | self.word = "" 168 | self.wid = int(attrOfElt["id"]) 169 | self.wpos = attrOfElt["pos"] 170 | 171 | if nameElt == "simple": 172 | self.isSimple = True 173 | 174 | if nameElt == "box": 175 | self.isBoxer = True 176 | 177 | if nameElt == "train-graph": 178 | self.isTrainingGraph = True 179 | 180 | if nameElt == "major-nodes": 181 | self.isMajorNodes = True 182 | 183 | if nameElt == "oper-nodes": 184 | self.isOperNodes = True 185 | 186 | if nameElt == "node": 187 | self.isNode = True 188 | self.nodesym = attrOfElt["sym"] 189 | 190 | if self.isBoxer == True: 191 | self.boxer_graph["nodes"][self.nodesym] = {"positions": [], "predicates":[]} 192 | 193 | if self.isTrainingGraph == True: 194 | if self.isMajorNodes == True: 195 | self.training_graph["major-nodes"][self.nodesym] = {"type": "", "nodeset": [], "simple-sentences":[], 196 | "split-candidates":[], 197 | "ood-candidates":[], "ood-processed":[], 198 | "rel-candidates":[], "rel-processed":[], 199 | "mod-candidates":[], "modpos-processed":[], "modpos-filtered":[]} 200 | 201 | if self.isOperNodes == True: 202 | self.training_graph["oper-nodes"][self.nodesym] = {"type": "", 203 | "split-candidate":[], "not-split-candidates":[], 204 | "ood-candidate":"", "drop-result":"", 205 | "rel-candidate":"","mod-candidate":""} 206 | 207 | if nameElt == "rel": 208 | self.isRel = True 209 | self.relsym = attrOfElt["sym"] 210 | 211 | if self.isBoxer == True: 212 | self.boxer_graph["relations"][self.relsym] = {"positions": [], "predicates":""} 213 | 214 | if nameElt == "span": 215 | self.isSpan = True 216 | 217 | if nameElt == "pred": 218 | self.isPred = True 219 | self.predsym = attrOfElt["sym"] 220 | 221 | if self.isBoxer == True and self.isNode == True: 222 | self.boxer_graph["nodes"][self.nodesym]["predicates"].append([self.predsym, []]) 223 | 224 | if self.isBoxer == True and self.isRel == True: 225 | self.boxer_graph["relations"][self.relsym]["predicates"] = self.predsym 226 | 227 | if nameElt == "loc": 228 | if self.isBoxer == True and self.isNode == True and self.isSpan == True: 229 | self.boxer_graph["nodes"][self.nodesym]["positions"].append(int(attrOfElt["id"])) 230 | 231 | if self.isBoxer == True and self.isNode == True and self.isPred == True: 232 | self.boxer_graph["nodes"][self.nodesym]["predicates"][-1][1].append(int(attrOfElt["id"])) 233 | 234 | if self.isBoxer == True and self.isRel == True and self.isSpan == True: 235 | self.boxer_graph["relations"][self.relsym]["positions"].append(int(attrOfElt["id"])) 236 | 237 | if self.isModposProcessed == True: 238 | if self.isMajorNodes == True: 239 | self.training_graph["major-nodes"][self.nodesym]["modpos-processed"].append(int(attrOfElt["id"])) 240 | 241 | if self.isModposFiltered == True: 242 | if self.isMajorNodes == True: 243 | 
self.training_graph["major-nodes"][self.nodesym]["modpos-filtered"].append(int(attrOfElt["id"])) 244 | 245 | if nameElt == "edge": 246 | if self.isBoxer == True: 247 | self.boxer_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 248 | 249 | if self.isTrainingGraph == True: 250 | self.training_graph["edges"].append((attrOfElt["par"], attrOfElt["dep"], attrOfElt["lab"])) 251 | 252 | if nameElt == "type": 253 | self.isType = True 254 | self.type = "" 255 | 256 | if nameElt == "nodeset": 257 | self.isNodeset = True 258 | 259 | if nameElt == "n": 260 | if self.isNodeset == True: 261 | if self.isMajorNodes == True: 262 | self.training_graph["major-nodes"][self.nodesym]["nodeset"].append(attrOfElt["sym"]) 263 | if self.isSC == True: 264 | if self.isSplitCandidate == True: 265 | if self.isMajorNodes == True: 266 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"][-1].append(attrOfElt["sym"]) 267 | if self.isOperNodes == True: 268 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"].append(attrOfElt["sym"]) 269 | if self.isSplitCandidateLeft == True: 270 | if self.isOperNodes == True: 271 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"][-1].append(attrOfElt["sym"]) 272 | 273 | if self.isOODCandidates == True: 274 | if self.isMajorNodes == True: 275 | self.training_graph["major-nodes"][self.nodesym]["ood-candidates"].append(attrOfElt["sym"]) 276 | if self.isOperNodes == True: 277 | self.training_graph["oper-nodes"][self.nodesym]["ood-candidate"] = attrOfElt["sym"] 278 | 279 | if self.isOODProcessed == True: 280 | if self.isMajorNodes == True: 281 | self.training_graph["major-nodes"][self.nodesym]["ood-processed"].append(attrOfElt["sym"]) 282 | 283 | if self.isRelCandidates == True: 284 | if self.isMajorNodes == True: 285 | self.training_graph["major-nodes"][self.nodesym]["rel-candidates"].append(attrOfElt["sym"]) 286 | if self.isOperNodes == True: 287 | self.training_graph["oper-nodes"][self.nodesym]["rel-candidate"] = attrOfElt["sym"] 288 | 289 | if self.isRelProcessed == True: 290 | if self.isMajorNodes == True: 291 | self.training_graph["major-nodes"][self.nodesym]["rel-processed"].append(attrOfElt["sym"]) 292 | 293 | if self.isModCandidates == True: 294 | if self.isMajorNodes == True: 295 | self.training_graph["major-nodes"][self.nodesym]["mod-candidates"].append((attrOfElt["loc"], attrOfElt["sym"])) 296 | if self.isOperNodes == True: 297 | self.training_graph["oper-nodes"][self.nodesym]["mod-candidate"] = (attrOfElt["loc"], attrOfElt["sym"]) 298 | 299 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied": 300 | self.isSplitCandidate = True 301 | 302 | if nameElt == "split-candidate-left": 303 | self.isSplitCandidateLeft = True 304 | 305 | if nameElt == "sc": 306 | self.isSC = True 307 | if self.isSplitCandidate == True: 308 | if self.isMajorNodes == True: 309 | self.training_graph["major-nodes"][self.nodesym]["split-candidates"].append([]) 310 | if self.isOperNodes == True: 311 | self.training_graph["oper-nodes"][self.nodesym]["split-candidate"] = [] 312 | if self.isSplitCandidateLeft == True: 313 | if self.isOperNodes == True: 314 | self.training_graph["oper-nodes"][self.nodesym]["not-split-candidates"].append([]) 315 | 316 | if nameElt == "ood-candidate" or nameElt == "ood-candidates": 317 | self.isOODCandidates = True 318 | 319 | if nameElt == "ood-processed": 320 | self.isOODProcessed = True 321 | 322 | if nameElt == "rel-candidate" or nameElt == "rel-candidates": 323 | 
self.isRelCandidates = True 324 | 325 | if nameElt == "rel-processed": 326 | self.isRelProcessed = True 327 | 328 | if nameElt == "mod-candidate" or nameElt == "mod-candidates": 329 | self.isModCandidates = True 330 | 331 | if nameElt == "mod-loc-processed": 332 | self.isModposProcessed = True 333 | 334 | if nameElt == "mod-loc-filtered": 335 | self.isModposFiltered = True 336 | 337 | if nameElt == "is-dropped": 338 | if self.isOperNodes == True: 339 | self.training_graph["oper-nodes"][self.nodesym]["drop-result"] = attrOfElt["val"] 340 | 341 | def endElement(self, nameElt): 342 | if nameElt == "sentence": 343 | # print self.sentid 344 | # print 345 | # print self.main_sentence 346 | # print 347 | # print self.main_sent_dict 348 | # print 349 | # print self.simple_sentences 350 | # print 351 | # print self.boxer_graph 352 | # print 353 | # print self.training_graph 354 | 355 | # Creating the original format of Boxer and Training Graph 356 | final_boxer_graph = Boxer_Graph() 357 | for nodename in self.boxer_graph["nodes"]: 358 | final_boxer_graph.nodes[nodename] = copy.copy(self.boxer_graph["nodes"][nodename]) 359 | for nodename in self.boxer_graph["relations"]: 360 | final_boxer_graph.relations[nodename] = copy.copy(self.boxer_graph["relations"][nodename]) 361 | final_boxer_graph.edges = self.boxer_graph["edges"][:] 362 | 363 | final_training_graph = Training_Graph() 364 | for nodename in self.training_graph["major-nodes"]: 365 | nodedict = self.training_graph["major-nodes"][nodename] 366 | if nodedict["type"] == "split": 367 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["split-candidates"][:]) 368 | if nodedict["type"] == "drop-rel": 369 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["rel-candidates"][:], 370 | nodedict["rel-processed"][:], nodedict["modpos-filtered"][:]) 371 | if nodedict["type"] == "drop-mod": 372 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["mod-candidates"][:], 373 | nodedict["modpos-processed"][:], nodedict["modpos-filtered"][:]) 374 | if nodedict["type"] == "drop-ood": 375 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["ood-candidates"][:], 376 | nodedict["ood-processed"][:], nodedict["modpos-filtered"][:]) 377 | if nodedict["type"] == "fin": 378 | final_training_graph.major_nodes[nodename] = (nodedict["type"], nodedict["nodeset"][:], nodedict["simple-sentences"][:], nodedict["modpos-filtered"][:]) 379 | for nodename in self.training_graph["oper-nodes"]: 380 | nodedict = self.training_graph["oper-nodes"][nodename] 381 | if nodedict["type"] == "split": 382 | if len(nodedict["split-candidate"]) == 0: 383 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], None, nodedict["not-split-candidates"][:]) 384 | else: 385 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["split-candidate"], nodedict["not-split-candidates"][:]) 386 | if nodedict["type"] == "drop-rel": 387 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["rel-candidate"], nodedict["drop-result"]) 388 | if nodedict["type"] == "drop-mod": 389 | final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["mod-candidate"], nodedict["drop-result"]) 390 | if nodedict["type"] == "drop-ood": 391 | 
final_training_graph.oper_nodes[nodename] = (nodedict["type"], nodedict["ood-candidate"], nodedict["drop-result"]) 392 | final_training_graph.edges = self.training_graph["edges"][:] 393 | 394 | # Process various stage "init" or "iter" 395 | if self.stage == "init": 396 | self.em_io_handler.initialize_probabilitytable_smt_input(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph) 397 | 398 | if self.stage == "iter": 399 | self.em_io_handler.iterate_over_probabilitytable(self.sentid, self.main_sentence, self.main_sent_dict, self.simple_sentences, final_boxer_graph, final_training_graph) 400 | 401 | if int(self.sentid)%10000 == 0: 402 | print self.sentid + " training data processed ..." 403 | 404 | if nameElt == "main": 405 | self.isMain = False 406 | 407 | if nameElt == "s": 408 | self.isS = False 409 | 410 | if self.isMain == True: 411 | self.main_sentence = self.sentence 412 | 413 | if self.isSimple == True: 414 | if self.isNode == True: 415 | if self.isMajorNodes == True: 416 | self.training_graph["major-nodes"][self.nodesym]["simple-sentences"].append(self.sentence) 417 | else: 418 | self.simple_sentences.append(self.sentence) 419 | 420 | if nameElt == "winfo": 421 | self.isWinfo = False 422 | 423 | if nameElt == "w": 424 | self.isW = False 425 | 426 | if self.isWinfo == True: 427 | self.main_sent_dict[self.wid] = (self.word, self.wpos) 428 | 429 | if nameElt == "simple": 430 | self.isSimple = False 431 | 432 | if nameElt == "box": 433 | self.isBoxer = False 434 | 435 | if nameElt == "train-graph": 436 | self.isTrainingGraph = False 437 | 438 | if nameElt == "major-nodes": 439 | self.isMajorNodes = False 440 | 441 | if nameElt == "oper-nodes": 442 | self.isOperNodes = False 443 | 444 | if nameElt == "node": 445 | self.isNode = False 446 | 447 | if nameElt == "rel": 448 | self.isRel = False 449 | 450 | if nameElt == "span": 451 | self.isSpan = False 452 | 453 | if nameElt == "pred": 454 | self.isPred = False 455 | 456 | if nameElt == "type": 457 | self.isType = False 458 | if self.isMajorNodes == True: 459 | self.training_graph["major-nodes"][self.nodesym]["type"] = self.type 460 | if self.isOperNodes == True: 461 | self.training_graph["oper-nodes"][self.nodesym]["type"] = self.type 462 | 463 | if nameElt == "nodeset": 464 | self.isNodeset = False 465 | 466 | if nameElt == "split-candidates" or nameElt == "split-candidate-applied": 467 | self.isSplitCandidate = False 468 | 469 | if nameElt == "split-candidate-left": 470 | self.isSplitCandidateLeft = False 471 | 472 | if nameElt == "sc": 473 | self.isSC = False 474 | 475 | if nameElt == "ood-candidate" or nameElt == "ood-candidates": 476 | self.isOODCandidates = False 477 | 478 | if nameElt == "ood-processed": 479 | self.isOODProcessed = False 480 | 481 | if nameElt == "rel-candidate" or nameElt == "rel-candidates": 482 | self.isRelCandidates = False 483 | 484 | if nameElt == "rel-processed": 485 | self.isRelProcessed = False 486 | 487 | if nameElt == "mod-candidate" or nameElt == "mod-candidates": 488 | self.isModCandidates = False 489 | 490 | if nameElt == "mod-loc-processed": 491 | self.isModposProcessed = False 492 | 493 | if nameElt == "mod-loc-filtered": 494 | self.isModposFiltered = False 495 | 496 | def characters(self, chrs): 497 | if self.isS: 498 | self.sentence += chrs 499 | 500 | if self.isW: 501 | self.word += chrs 502 | 503 | if self.isType: 504 | self.type += chrs 505 | -------------------------------------------------------------------------------- 
/source/training_graph_module.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | #===================================================================================
3 | #title : training_graph_module.py =
4 | #description : Define Training Graph =
5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})=
6 | #date : Created in 2014, Later revised in April 2016. =
7 | #version : 0.1 =
8 | #===================================================================================
9 | 
10 | 
11 | import xml.etree.ElementTree as ET
12 | import copy
13 | 
14 | class Training_Graph:
15 |     def __init__(self):
16 |         '''
17 |         self.major_nodes["MN-*"]
18 |             ("split", nodeset, simple_sentences, split_candidate_tuples)
19 |             ("drop-rel", nodeset, simple_sentences, relnode_set, processed_relnode, filtered_mod_pos)
20 |             ("drop-mod", nodeset, simple_sentences, modcand_set, processed_mod_pos, filtered_mod_pos)
21 |             ("drop-ood", nodeset, simple_sentences, oodnode_set, processed_oodnode, filtered_mod_pos)
22 |             ("fin", nodeset, simple_sentences, filtered_mod_pos)
23 | 
24 |         self.oper_nodes["ON-*"]
25 |             ("split", split_candidate, not_applied_cands)
26 |             ("split", None, not_applied_cands)
27 |             ("drop-rel", relnode_to_process, "True")
28 |             ("drop-rel", relnode_to_process, "False")
29 |             ("drop-mod", modcand_to_process, "True")
30 |             ("drop-mod", modcand_to_process, "False")
31 |             ("drop-ood", oodnode_to_process, "True")
32 |             ("drop-ood", oodnode_to_process, "False")
33 | 
34 |         self.edges = [(par, dep, lab)]
35 | 
36 |         '''
37 |         self.major_nodes = {}
38 |         self.oper_nodes = {}
39 |         self.edges = []
40 | 
41 |     def get_majornode_type(self, majornode_name):
42 |         majornode_tuple = self.major_nodes[majornode_name]
43 |         return majornode_tuple[0]
44 | 
45 |     def get_majornode_nodeset(self, majornode_name):
46 |         majornode_tuple = self.major_nodes[majornode_name]
47 |         return majornode_tuple[1]
48 | 
49 |     def get_majornode_simple_sentences(self, majornode_name):
50 |         majornode_tuple = self.major_nodes[majornode_name]
51 |         return majornode_tuple[2]
52 | 
53 |     def get_majornode_oper_candidates(self, majornode_name):
54 |         majornode_tuple = self.major_nodes[majornode_name]
55 |         if majornode_tuple[0] != "fin":
56 |             return majornode_tuple[3]
57 |         else:
58 |             return []
59 | 
60 |     def get_majornode_processed_oper_candidates(self, majornode_name):
61 |         majornode_tuple = self.major_nodes[majornode_name]
62 |         if majornode_tuple[0] != "fin" and majornode_tuple[0] != "split":
63 |             return majornode_tuple[4]
64 |         else:
65 |             return []
66 | 
67 |     def get_majornode_filtered_postions(self, majornode_name):
68 |         majornode_tuple = self.major_nodes[majornode_name]
69 |         if majornode_tuple[0] == "fin":
70 |             return majornode_tuple[3]
71 |         elif majornode_tuple[0] == "drop-rel" or majornode_tuple[0] == "drop-mod" or majornode_tuple[0] == "drop-ood":
72 |             return majornode_tuple[5]
73 |         else:
74 |             return []
75 | 
76 |     def get_opernode_type(self, opernode_name):
77 |         opernode_tuple = self.oper_nodes[opernode_name]
78 |         return opernode_tuple[0]
79 | 
80 |     def get_opernode_oper_candidate(self, opernode_name):
81 |         opernode_tuple = self.oper_nodes[opernode_name]
82 |         return opernode_tuple[1]
83 | 
84 |     def get_opernode_failed_oper_candidates(self, opernode_name):
85 |         opernode_tuple = self.oper_nodes[opernode_name]
86 |         if opernode_tuple[0] == "split":
87 |             return opernode_tuple[2]
88 |         else:
89 |             return []
90 | 
91 |     def get_opernode_drop_result(self, opernode_name):
92 |         opernode_tuple = 
self.oper_nodes[opernode_name] 93 | if opernode_tuple[0] != "split": 94 | return opernode_tuple[2] 95 | else: 96 | return None 97 | 98 | # @@@@@@@@@@@@@@@@@@@@@ Create nodes @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 99 | 100 | def create_majornode(self, majornode_data): 101 | copy_data = copy.copy(majornode_data) 102 | 103 | # Check if node exists 104 | for node_name in self.major_nodes: 105 | node_data = self.major_nodes[node_name] 106 | if node_data == copy_data: 107 | return node_name, False 108 | 109 | # Otherwise create new node 110 | majornode_name = "MN-"+str(len(self.major_nodes)+1) 111 | self.major_nodes[majornode_name] = copy_data 112 | return majornode_name, True 113 | 114 | def create_opernode(self, opernode_data): 115 | copy_data = copy.copy(opernode_data) 116 | opernode_name = "ON-"+str(len(self.oper_nodes)+1) 117 | self.oper_nodes[opernode_name] = copy_data 118 | return opernode_name 119 | 120 | def create_edge(self, edge_data): 121 | self.edges.append(copy.copy(edge_data)) 122 | 123 | # @@@@@@@@@@@@@@@@@@@@@@@@ Final sentences @@@@@@@@@@@@@@@@@@@@@@@@@@ 124 | 125 | def get_final_sentences(self, main_sentence, main_sent_dict, boxer_graph): 126 | fin_nodes = self.find_all_fin_majornode() 127 | print 128 | node_sent = [] 129 | for node in fin_nodes: 130 | # intpart = int(node[3:]) # removing "MN-", lower int part sentence comes before 131 | if boxer_graph.isEmpty(): 132 | #main_sentence = main_sentence.encode('utf-8') 133 | simple_sentences = self.get_majornode_simple_sentences(node) 134 | simple_sentence = " ".join(simple_sentences) 135 | #node_sent.append((intpart, main_sentence, simple_sentence)) 136 | 137 | node_span = (0, len(main_sentence.split())) 138 | node_sent.append((node_span, main_sentence, simple_sentence)) 139 | 140 | else: 141 | nodeset = self.get_majornode_nodeset(node) 142 | node_span = boxer_graph.extract_span_min_max(nodeset) 143 | filtered_pos = self.get_majornode_filtered_postions(node) 144 | main_sentence = boxer_graph.extract_main_sentence(nodeset, main_sent_dict, filtered_pos) 145 | simple_sentences = self.get_majornode_simple_sentences(node) 146 | simple_sentence = " ".join(simple_sentences) 147 | #node_sent.append((intpart, main_sentence, simple_sentence)) 148 | node_sent.append((node_span, main_sentence, simple_sentence)) 149 | node_sent.sort() 150 | sentence_pairs = [(item[1], item[2]) for item in node_sent] 151 | #sentence_pairs = [(item[1].encode('utf-8'), item[2].encode('utf-8')) for item in node_sent] 152 | #print sentence_pairs 153 | return sentence_pairs 154 | 155 | 156 | # @@@@@@@@@@@@@@@@@@@@@ Find nodes in Training Graph @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 157 | 158 | def find_all_fin_majornode(self): 159 | fin_nodes = [] 160 | for major_node in self.major_nodes: 161 | if self.major_nodes[major_node][0] == "fin": 162 | fin_nodes.append(major_node) 163 | return fin_nodes 164 | 165 | def find_children_of_majornode(self, major_node): 166 | children_oper_nodes = [] 167 | for edge in self.edges: 168 | if edge[0] == major_node: 169 | children_oper_nodes.append(edge[1]) 170 | return children_oper_nodes 171 | 172 | def find_children_of_opernode(self, oper_node): 173 | children_major_nodes = [] 174 | for edge in self.edges: 175 | if edge[0] == oper_node: 176 | children_major_nodes.append(edge[1]) 177 | return children_major_nodes 178 | 179 | def find_parents_of_majornode(self, major_node): 180 | parents_oper_nodes = [] 181 | for edge in self.edges: 182 | if edge[1] == major_node: 183 | parent_oper_node = edge[0] 184 | 
parents_oper_nodes.append(parent_oper_node) 185 | return parents_oper_nodes 186 | 187 | def find_parent_of_opernode(self, oper_node): 188 | parent_major_node = "" 189 | for edge in self.edges: 190 | if edge[1] == oper_node: 191 | parent_major_node = edge[0] 192 | break 193 | return parent_major_node 194 | 195 | # @@@@@@@@@@@@ Training Graph -> Elementary Tree @@@@@@@@@@@@@@@@@@@@ 196 | 197 | def convert_to_elementarytree(self): 198 | traininggraph = ET.Element("train-graph") 199 | 200 | # Major nodes 201 | major_nodes_elt = ET.SubElement(traininggraph, "major-nodes") 202 | for major_nodename in self.major_nodes: 203 | major_nodetype = self.get_majornode_type(major_nodename) 204 | major_nodeset = self.get_majornode_nodeset(major_nodename) 205 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename) 206 | oper_candidates = self.get_majornode_oper_candidates(major_nodename) 207 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename) 208 | filtered_postions = self.get_majornode_filtered_postions(major_nodename) 209 | 210 | major_node_elt = ET.SubElement(major_nodes_elt, "node") 211 | major_node_elt.attrib = {"sym":major_nodename} 212 | 213 | # Opertype 214 | major_nodetype_elt = ET.SubElement(major_node_elt, "type") 215 | major_nodetype_elt.text = major_nodetype 216 | 217 | # Nodeset 218 | major_nodeset_elt = ET.SubElement(major_node_elt, "nodeset") 219 | for node in major_nodeset: 220 | node_elt = ET.SubElement(major_nodeset_elt, "n") 221 | node_elt.attrib = {"sym":node} 222 | 223 | # Simple sentences 224 | major_simple_sentences_elt = ET.SubElement(major_node_elt, "simple-set") 225 | for simple_sentence in major_simple_sentences: 226 | major_simple_sentence_elt = ET.SubElement(major_simple_sentences_elt, "simple") 227 | sent_data_elt = ET.SubElement(major_simple_sentence_elt, "s") 228 | sent_data_elt.text = simple_sentence 229 | 230 | # Oper Candidates 231 | if major_nodetype == "split": 232 | split_candidate_tuples = oper_candidates 233 | major_split_candidates_elt = ET.SubElement(major_node_elt, "split-candidates") 234 | for split_candidate in split_candidate_tuples: 235 | major_split_candidate_elt = ET.SubElement(major_split_candidates_elt, "sc") 236 | for node in split_candidate: 237 | node_elt = ET.SubElement(major_split_candidate_elt, "n") 238 | node_elt.attrib = {"sym":str(node)} 239 | 240 | if major_nodetype == "drop-rel": 241 | relnode_set = oper_candidates 242 | major_relnode_set_elt = ET.SubElement(major_node_elt, "rel-candidates") 243 | for node in relnode_set: 244 | node_elt = ET.SubElement(major_relnode_set_elt, "n") 245 | node_elt.attrib = {"sym":str(node)} 246 | 247 | processed_relnodes = processed_oper_candidates 248 | major_processed_relnodes_elt = ET.SubElement(major_node_elt, "rel-processed") 249 | for node in processed_relnodes: 250 | node_elt = ET.SubElement(major_processed_relnodes_elt, "n") 251 | node_elt.attrib = {"sym":str(node)} 252 | 253 | filtered_mod_pos = filtered_postions 254 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 255 | for node in filtered_mod_pos: 256 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 257 | node_elt.attrib = {"id":str(node)} 258 | 259 | if major_nodetype == "drop-mod": 260 | modcand_set = oper_candidates 261 | major_modcand_set_elt = ET.SubElement(major_node_elt, "mod-candidates") 262 | for node in modcand_set: 263 | node_elt = ET.SubElement(major_modcand_set_elt, "n") 264 | node_elt.attrib = {"sym":node[1],"loc":str(node[0])} 265 | 
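                # Serialised shape sketch for a drop-mod major node (nodeset and
                # simple-set children omitted; symbol/position values invented):
                #   <node sym="MN-4">
                #     <type>drop-mod</type>
                #     <mod-candidates><n sym="x5" loc="7"/></mod-candidates>
                #     <mod-loc-processed><loc id="7"/></mod-loc-processed>
                #     <mod-loc-filtered><loc id="7"/></mod-loc-filtered>
                #   </node>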
266 | processed_mod_pos = processed_oper_candidates 267 | major_processed_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-processed") 268 | for node in processed_mod_pos: 269 | node_elt = ET.SubElement(major_processed_mod_pos_elt, "loc") 270 | node_elt.attrib = {"id":str(node)} 271 | 272 | filtered_mod_pos = filtered_postions 273 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 274 | for node in filtered_mod_pos: 275 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 276 | node_elt.attrib = {"id":str(node)} 277 | 278 | if major_nodetype == "drop-ood": 279 | oodnode_set = oper_candidates 280 | major_oodnode_set_elt = ET.SubElement(major_node_elt, "ood-candidates") 281 | for node in oodnode_set: 282 | node_elt = ET.SubElement(major_oodnode_set_elt, "n") 283 | node_elt.attrib = {"sym":str(node)} 284 | 285 | processed_oodnodes = processed_oper_candidates 286 | major_processed_oodnodes_elt = ET.SubElement(major_node_elt, "ood-processed") 287 | for node in processed_oodnodes: 288 | node_elt = ET.SubElement(major_processed_oodnodes_elt, "n") 289 | node_elt.attrib = {"sym":str(node)} 290 | 291 | filtered_mod_pos = filtered_postions 292 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 293 | for node in filtered_mod_pos: 294 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 295 | node_elt.attrib = {"id":str(node)} 296 | 297 | if major_nodetype == "fin": 298 | filtered_mod_pos = filtered_postions 299 | major_filtered_mod_pos_elt = ET.SubElement(major_node_elt, "mod-loc-filtered") 300 | for node in filtered_mod_pos: 301 | node_elt = ET.SubElement(major_filtered_mod_pos_elt, "loc") 302 | node_elt.attrib = {"id":str(node)} 303 | 304 | # Oper nodes 305 | oper_nodes_elt = ET.SubElement(traininggraph, "oper-nodes") 306 | for oper_nodename in self.oper_nodes: 307 | oper_node_elt = ET.SubElement(oper_nodes_elt, "node") 308 | oper_node_elt.attrib = {"sym":oper_nodename} 309 | 310 | oper_nodedata = self.oper_nodes[oper_nodename] 311 | 312 | # Nodetype 313 | oper_nodetype = oper_nodedata[0] 314 | oper_nodetype_elt = ET.SubElement(oper_node_elt, "type") 315 | oper_nodetype_elt.text = oper_nodetype 316 | 317 | if oper_nodetype == "split": 318 | split_cand_applied = oper_nodedata[1] 319 | split_cand_applied_elt = ET.SubElement(oper_node_elt, "split-candidate-applied") 320 | if split_cand_applied != None: 321 | split_candidate_elt = ET.SubElement(split_cand_applied_elt, "sc") 322 | for node in split_cand_applied: 323 | node_elt = ET.SubElement(split_candidate_elt, "n") 324 | node_elt.attrib = {"sym":node} 325 | 326 | split_cand_left = oper_nodedata[2] 327 | split_cand_left_elt = ET.SubElement(oper_node_elt, "split-candidate-left") 328 | for split_candidate in split_cand_left: 329 | split_candidate_elt = ET.SubElement(split_cand_left_elt, "sc") 330 | for node in split_candidate: 331 | node_elt = ET.SubElement(split_candidate_elt, "n") 332 | node_elt.attrib = {"sym":node} 333 | 334 | if oper_nodetype == "drop-ood": 335 | oodnode_to_process = oper_nodedata[1] 336 | oodnode_to_process_elt = ET.SubElement(oper_node_elt, "ood-candidate") 337 | node_elt = ET.SubElement(oodnode_to_process_elt, "n") 338 | node_elt.attrib = {"sym":oodnode_to_process} 339 | 340 | dropped = oper_nodedata[2] 341 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 342 | dropped_elt.attrib = {"val":dropped} 343 | 344 | if oper_nodetype == "drop-rel": 345 | relnode_to_process = oper_nodedata[1] 346 | relnode_to_process_elt = 
ET.SubElement(oper_node_elt, "rel-candidate") 347 | node_elt = ET.SubElement(relnode_to_process_elt, "n") 348 | node_elt.attrib = {"sym":relnode_to_process} 349 | 350 | dropped = oper_nodedata[2] 351 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 352 | dropped_elt.attrib = {"val":dropped} 353 | 354 | if oper_nodetype == "drop-mod": 355 | modcand_to_process = oper_nodedata[1] 356 | modcand_to_process_elt = ET.SubElement(oper_node_elt, "mod-candidate") 357 | node_elt = ET.SubElement(modcand_to_process_elt, "n") 358 | node_elt.attrib = {"sym":modcand_to_process[1],"loc":str(modcand_to_process[0])} 359 | 360 | dropped = oper_nodedata[2] 361 | dropped_elt = ET.SubElement(oper_node_elt, "is-dropped") 362 | dropped_elt.attrib = {"val":dropped} 363 | 364 | tg_edges_elt = ET.SubElement(traininggraph, "tg-edges") 365 | for tg_edge in self.edges: 366 | tg_edge_elt = ET.SubElement(tg_edges_elt, "edge") 367 | tg_edge_elt.attrib = {"lab":str(tg_edge[2]), "par":tg_edge[0], "dep":tg_edge[1]} 368 | 369 | return traininggraph 370 | 371 | # @@@@@@@@@@@@ Training Graph -> Dot Graph @@@@@@@@@@@@@@@@@@@@ 372 | 373 | def convert_to_dotstring(self, main_sent_dict, boxer_graph): 374 | dot_string = "digraph boxer{\n" 375 | 376 | nodename = 0 377 | node_graph_dict = {} 378 | # Writing Major nodes 379 | for major_nodename in self.major_nodes: 380 | major_nodetype = self.get_majornode_type(major_nodename) 381 | major_nodeset = self.get_majornode_nodeset(major_nodename) 382 | major_simple_sentences = self.get_majornode_simple_sentences(major_nodename) 383 | oper_candidates = self.get_majornode_oper_candidates(major_nodename) 384 | processed_oper_candidates = self.get_majornode_processed_oper_candidates(major_nodename) 385 | filtered_postions = self.get_majornode_filtered_postions(major_nodename) 386 | 387 | main_sentence = boxer_graph.extract_main_sentence(major_nodeset, main_sent_dict, filtered_postions) 388 | simple_sentence_string = " ".join(major_simple_sentences) 389 | major_node_data = [major_nodetype, major_nodeset[:], main_sentence, simple_sentence_string] 390 | 391 | if major_nodetype == "split": 392 | major_node_data += [oper_candidates[:]] 393 | 394 | if major_nodetype == "drop-rel" or major_nodetype == "drop-mod" or major_nodetype == "drop-ood": 395 | major_node_data += [oper_candidates[:], processed_oper_candidates[:], filtered_postions[:]] 396 | 397 | if major_nodetype == "fin": 398 | major_node_data += [filtered_postions[:]] 399 | 400 | major_node_string, nodename = self.textdot_majornode(nodename, major_nodename, major_node_data[:]) 401 | node_graph_dict[major_nodename] = "struct"+str(nodename) 402 | dot_string += major_node_string+"\n" 403 | 404 | # Writing operation nodes 405 | for oper_nodename in self.oper_nodes: 406 | oper_node_string, nodename = self.textdot_opernode(nodename, oper_nodename, self.oper_nodes[oper_nodename]) 407 | node_graph_dict[oper_nodename] = "struct"+str(nodename) 408 | dot_string += oper_node_string+"\n" 409 | 410 | # Writing edges 411 | for edge in self.edges: 412 | par_graphnode = node_graph_dict[edge[0]] 413 | dep_graphnode = node_graph_dict[edge[1]] 414 | dot_string += par_graphnode+" -> "+dep_graphnode+"[label=\""+str(edge[2])+"\"];\n" 415 | dot_string += "}" 416 | return dot_string 417 | 418 | def textdot_majornode(self, nodename, node, nodedata): 419 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{" 420 | textdot_node += "major-node: "+node+"|" 421 | index = 0 422 | for data in nodedata: 423 | textdot_node += self.processtext(str(data)) 
424 | index += 1 425 | if index < len(nodedata): 426 | textdot_node += "|" 427 | textdot_node += "}\"];" 428 | return textdot_node, nodename+1 429 | 430 | def textdot_opernode(self, nodename, node, nodedata): 431 | textdot_node = "struct"+str(nodename+1)+" [shape=record,label=\"{" 432 | textdot_node += "oper-node: "+node+"|" 433 | index = 0 434 | for data in nodedata: 435 | textdot_node += self.processtext(str(data)) 436 | index += 1 437 | if index < len(nodedata): 438 | textdot_node += "|" 439 | textdot_node += "}\"];" 440 | return textdot_node, nodename+1 441 | 442 | def processtext(self, inputstring): 443 | linesize = 100 444 | outputstring = "" 445 | index = 0 446 | substr = inputstring[index*linesize:(index+1)*linesize] 447 | while (substr!=""): 448 | outputstring += substr 449 | index += 1 450 | substr = inputstring[index*linesize:(index+1)*linesize] 451 | if substr!="": 452 | outputstring += "\\n" 453 | return outputstring 454 | 455 | # @@@@@@@@@@@@@@@@@@@@@@@@@@ DONE @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 456 | -------------------------------------------------------------------------------- /start_learning_training_models.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : start_learning_training_models.py = 4 | #description : Training = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. = 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | 11 | import os 12 | import argparse 13 | import sys 14 | import datetime 15 | 16 | # Used for wikilarge data: not recommended 17 | sys.setrecursionlimit(10000) 18 | 19 | sys.path.append("./source") 20 | import functions_configuration_file 21 | import functions_model_files 22 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph 23 | from saxparser_xml_stanfordtokenized_boxergraph_traininggraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph 24 | 25 | MOSESDIR = "~/tools/mosesdecoder" 26 | 27 | if __name__=="__main__": 28 | # Command line arguments ############## 29 | argparser = argparse.ArgumentParser(prog='python learn_training_models.py', description=('Start the training process.')) 30 | 31 | # Optional [default value: 1] 32 | argparser.add_argument('--start-state', help='Start state of the training process', choices=['1','2','3'], default='1', metavar=('Start_State')) 33 | 34 | # Optional [default value: 3] 35 | argparser.add_argument('--end-state', help='End state of the training process', choices=['1','2','3'], default='3', metavar=('End_State')) 36 | 37 | # Optional [default value: split:drop-ood:drop-rel:drop-mod] (Any of their combinations, order is not important), drop-ood only applied after split 38 | argparser.add_argument('--transformation', help='Transformation models learned', default="split:drop-ood:drop-rel:drop-mod", metavar=('TRANSFORMATION_MODEL')) 39 | 40 | # Optional [default value: 2] 41 | argparser.add_argument('--max-split', help='Maximum split size', choices=['2','3'], default='2', metavar=('MAX_SPLIT_SIZE')) 42 | 43 | # Optional [default value: agent:patient:eq:theme], (order is not important) 44 | argparser.add_argument('--restricted-drop-rel', help='Restricted drop relations', default="agent:patient:eq:theme", metavar=('RESTRICTED_DROP_REL')) 45 | 46 
# Optional [default value: jj:jjr:jjs:rb:rbr:rbs], (order is not important)
47 |     argparser.add_argument('--allowed-drop-mod', help='Allowed drop modifiers', default="jj:jjr:jjs:rb:rbr:rbs", metavar=('ALLOWED_DROP_MOD'))
48 | 
49 |     # Optional [default value: the most recent method]
50 |     argparser.add_argument('--method-training-graph', help='Operation set for training graph file', choices=['method-led-lt', 'method-led-lteq', 'method-0.5-lteq-lteq',
51 |                                                                                                               'method-0.75-lteq-lt', 'method-0.99-lteq-lt'],
52 |                            default='method-0.99-lteq-lt', metavar=('Method_Training_Graph'))
53 | 
54 |     # Optional [default value: the most recent method]
55 |     argparser.add_argument('--method-feature-extract', help='Operation set for extracting features', choices=['feature-init', 'feature-Nov27'], default='feature-Nov27',
56 |                            metavar=('Method_Feature_Extract'))
57 | 
58 |     # Optional [default value: /home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml]
59 |     argparser.add_argument('--train-boxer-graph', help='The training corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Train_Boxer_Graph'),
60 |                            default='/home/ankh/Data/Simplification/Zhu-2010/PWKP_108016.tokenized.boxer-graph.xml')
61 | 
62 |     # Optional [default value: 10]
63 |     argparser.add_argument('--num-em', help='The number of EM Algorithm iterations', metavar=('NUM_EM_ITERATION'), default='10')
64 | 
65 |     # Optional [default value: 0:3:/home/ankh/Desktop/Sentence-Simplification/LANGUAGE-MODEL/simplewiki-20131030-data.srilm:0]
66 |     argparser.add_argument('--lang-model', help='Language model information (in the moses format)', metavar=('Lang_Model'),
67 |                            default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/newsela_lm/V0V4_V1V4_V2V4_V3V4_V0V3_V0V2_V1V3.aner.ori.train.dst.arpa.en:0")
68 |     # default="0:3:/gpfs/scratch/snarayan/Sentence-Simplification/Language-Model/wikilarge_lm/wiki.full.aner.ori.train.dst.arpa.en:0")
69 |     # default="0:3:/home/ankh/Desktop/Sentence-Simplification/LANGUAGE-MODEL/simplewiki-20131030-data.srilm:0")
70 | 
71 |     # Optional (compulsory when start state is >= 2)
72 |     argparser.add_argument('--d2s-config', help='D2S Configuration file', metavar=('D2S_Config'))
73 | 
74 |     # Compulsory
75 |     argparser.add_argument('--output-dir', help='The output directory', required=True, metavar=('Output_Directory'))
76 |     # #####################################
77 |     args_dict = vars(argparser.parse_args(sys.argv[1:]))
78 |     # #####################################
79 | 
80 |     # Creating the output directory to store training models
81 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
82 |     print timestamp+", Creating the output directory: "+args_dict['output_dir']
83 |     try:
84 |         os.mkdir(args_dict['output_dir'])
85 |         print
86 |     except OSError:
87 |         print args_dict['output_dir'] + " directory already exists.\n"
88 | 
89 |     # Configuration dictionary
90 |     D2S_Config_data = {}
91 |     D2S_Config = args_dict['d2s_config']
92 |     if D2S_Config != None:
93 |         D2S_Config_data = functions_configuration_file.parser_config_file(D2S_Config)
94 |     else:
95 |         D2S_Config_data["TRAIN-BOXER-GRAPH"] = args_dict['train_boxer_graph']
96 |         D2S_Config_data["TRANSFORMATION-MODEL"] = args_dict['transformation'].split(":")
97 |         D2S_Config_data["MAX-SPLIT-SIZE"] = int(args_dict['max_split'])
98 |         D2S_Config_data["RESTRICTED-DROP-RELATION"] = args_dict['restricted_drop_rel'].split(":")
99 |         D2S_Config_data["ALLOWED-DROP-MODIFIER"] = args_dict['allowed_drop_mod'].split(":")
D2S_Config_data["METHOD-TRAINING-GRAPH"] = args_dict['method_training_graph'] 101 | D2S_Config_data["METHOD-FEATURE-EXTRACT"] = args_dict['method_feature_extract'] 102 | D2S_Config_data["NUM-EM-ITERATION"] = int(args_dict['num_em']) 103 | D2S_Config_data["LANGUAGE-MODEL"] = args_dict['lang_model'] 104 | 105 | # Extracting arguments with their default values (default unless its specified) 106 | START_STATE = int(args_dict['start_state']) 107 | END_STATE = int(args_dict['end_state']) 108 | 109 | # Start state: 1, Starting building training graph 110 | state = 1 111 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])): 112 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 113 | print timestamp+", Starting building training graph (Step-"+str(state)+") ..." 114 | 115 | print "Input training file (xml, stanford tokenized and boxer graph): " + D2S_Config_data["TRAIN-BOXER-GRAPH"] + " ..." 116 | TRAIN_TRAINING_GRAPH = args_dict['output_dir']+"/"+os.path.splitext(os.path.basename(D2S_Config_data["TRAIN-BOXER-GRAPH"]))[0]+".training-graph.xml" 117 | print "Generating training graph file (xml, stanford tokenized, boxer graph and training graph): "+TRAIN_TRAINING_GRAPH+" ..." 118 | 119 | foutput = open(TRAIN_TRAINING_GRAPH, "w") 120 | foutput.write("\n") 121 | foutput.write("\n") 122 | 123 | print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..." 124 | training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("training", D2S_Config_data["TRAIN-BOXER-GRAPH"], foutput, D2S_Config_data["TRANSFORMATION-MODEL"], 125 | D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"], 126 | D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"]) 127 | 128 | print "Start generating training graph ..." 129 | print "Start parsing "+D2S_Config_data["TRAIN-BOXER-GRAPH"]+" ..." 130 | training_xml_handler.parse_xmlfile_generating_training_graph() 131 | 132 | foutput.write("\n") 133 | foutput.close() 134 | 135 | D2S_Config_data["TRAIN-TRAINING-GRAPH"] = TRAIN_TRAINING_GRAPH 136 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 137 | print timestamp+", Finished building training graph (Step-"+str(state)+")\n" 138 | 139 | # Start state: 2 140 | state = 2 141 | if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])): 142 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 143 | print timestamp+", Starting learning transformation models (Step-"+str(state)+") ..." 144 | 145 | if "TRAIN-TRAINING-GRAPH" not in D2S_Config_data: 146 | print "The training graph file (xml, stanford tokenized, boxer graph and training graph) is not available." 147 | print "Please enter the configuration file or start with the State 1." 148 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 149 | print timestamp+", No transformation models learned (Step-"+str(state)+")\n" 150 | exit(0) 151 | 152 | # @ Defining data structure @ 153 | # Stores various sentence pairs (complex, simple) for SMT. 154 | smt_sentence_pairs = {} 155 | # probability tables - store all probabilities 156 | probability_tables = {} 157 | # count tables - store counts in next iteration 158 | count_tables = {} 159 | # @ @ 160 | 161 | print "Creating the em-training XML file (stanford tokenized, boxer graph and training graph) handler ..." 
162 |         em_training_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph_TrainingGraph(D2S_Config_data["TRAIN-TRAINING-GRAPH"], D2S_Config_data["NUM-EM-ITERATION"],
163 |                                                                                             smt_sentence_pairs, probability_tables, count_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
164 | 
165 |         print "Start Expectation Maximization (Inside-Outside) algorithm ..."
166 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
167 |         print timestamp+", Step "+str(state)+".1: Initialization of probability tables and populating smt_sentence_pairs ..."
168 |         em_training_xml_handler.parse_to_initialize_probabilitytable()
169 |         # print probability_tables
170 | 
171 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
172 |         print timestamp+", Step "+str(state)+".2: Start iterating for EM Inside-Outside probabilities ..."
173 |         em_training_xml_handler.parse_to_iterate_probabilitytable()
174 |         # print probability_tables
175 | 
176 |         # Start writing model files
177 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
178 |         print timestamp+", Step "+str(state)+".3: Start writing model files ..."
179 |         # Creating the output directory to store the transformation model files
180 |         model_dir = args_dict['output_dir']+"/TRANSFORMATION-MODEL-DIR"
181 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
182 |         print timestamp+", Creating the output model directory: "+model_dir
183 |         try:
184 |             os.mkdir(model_dir)
185 |         except OSError:
186 |             print model_dir + " directory already exists."
187 |         # Writing model files
188 |         functions_model_files.write_model_files(model_dir, probability_tables, smt_sentence_pairs)
189 | 
190 |         D2S_Config_data["TRANSFORMATION-MODEL-DIR"] = model_dir
191 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
192 |         print timestamp+", Finished learning transformation models (Step-"+str(state)+")\n"
193 | 
194 |     # Start state: 3
195 |     state = 3
196 |     if (int(args_dict['start_state']) <= state) and (state <= int(args_dict['end_state'])):
197 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
198 |         print timestamp+", Starting learning moses translation model (Step-"+str(state)+") ..."
199 | 
200 |         if "TRANSFORMATION-MODEL-DIR" not in D2S_Config_data:
201 |             print "The moses training files are not available."
202 |             print "Please enter the configuration file or start with State 1."
203 |             timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
204 |             print timestamp+", No moses models learned (Step-"+str(state)+")\n"
205 |             exit(0)
206 | 
207 |         # Preparing the moses directory
208 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
209 |         print timestamp+", Step "+str(state)+".1: Preparing the moses directory ..."
210 |         # Creating the output directory to store moses files
211 |         moses_dir = args_dict['output_dir']+"/MOSES-COMPLEX-SIMPLE-DIR"
212 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
213 |         print timestamp+", Creating the moses directory: "+moses_dir
214 |         try:
215 |             os.mkdir(moses_dir)
216 |         except OSError:
217 |             print moses_dir + " directory already exists."
218 |         # Creating the corpus directory
219 |         moses_corpus_dir = args_dict['output_dir']+"/MOSES-COMPLEX-SIMPLE-DIR/corpus"
220 |         timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
221 |         print timestamp+", Creating the moses corpus directory: "+moses_corpus_dir
222 |         try:
223 |             os.mkdir(moses_corpus_dir)
224 |         except OSError:
225 |             print moses_corpus_dir + " directory already exists."
226 | 227 | # Cleaning the moses training file 228 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 229 | print timestamp+", Step "+str(state)+".2: Cleaning the moses training file ..." 230 | command = MOSESDIR+"/scripts/training/clean-corpus-n.perl "+D2S_Config_data["TRANSFORMATION-MODEL-DIR"]+"/D2S-SMT source target "+moses_corpus_dir+"/D2S-SMT-clean 1 95" 231 | os.system(command) 232 | 233 | # Running moses training 234 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 235 | print timestamp+", Step "+str(state)+".3: Running the moses training ..." 236 | command = (MOSESDIR+"/scripts/training/train-model.perl -mgiza -mgiza-cpus 3 -cores 3 -parallel -sort-buffer-size 3G -sort-batch-size 253 -sort-compress gzip -sort-parallel 3 "+ 237 | "-root-dir "+moses_dir+" -corpus "+moses_corpus_dir+"/D2S-SMT-clean -f source -e target -external-bin-dir "+MOSESDIR+"/mgiza/mgizapp/bin "+ 238 | "-lm "+D2S_Config_data["LANGUAGE-MODEL"]) 239 | os.system(command) 240 | 241 | D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"] = moses_dir 242 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 243 | print timestamp+", Finished learning moses translation model (Step-"+str(state)+")\n" 244 | 245 | # Last Step 246 | config_file = args_dict['output_dir']+"/d2s.ini" 247 | print "Writing the configuration file: "+config_file+" ..." 248 | functions_configuration_file.write_config_file(config_file, D2S_Config_data) 249 | 250 | timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p") 251 | print timestamp+", Learning process done!!!" 252 | 253 | -------------------------------------------------------------------------------- /start_simplifying_complex_sentence.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #=================================================================================== 3 | #title : start_simplifying_complex_sentence.py = 4 | #description : Testing = 5 | #author : Shashi Narayan, shashi.narayan(at){ed.ac.uk,loria.fr,gmail.com})= 6 | #date : Created in 2014, Later revised in April 2016. 
= 7 | #version : 0.1 = 8 | #=================================================================================== 9 | 10 | import os 11 | import argparse 12 | import sys 13 | import datetime 14 | from nltk.metrics.distance import edit_distance 15 | 16 | sys.path.append("./source") 17 | import functions_configuration_file 18 | import functions_model_files 19 | import functions_prepare_elementtree_dot 20 | from saxparser_xml_stanfordtokenized_boxergraph import SAXPARSER_XML_StanfordTokenized_BoxerGraph 21 | from explore_decoder_graph_greedy import Explore_Decoder_Graph_Greedy 22 | from explore_decoder_graph_explorative import Explore_Decoder_Graph_Explorative 23 | 24 | MOSESDIR = "~/tools/mosesdecoder" 25 | 26 | def get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 27 | probability_tables, METHOD_FEATURE_EXTRACT): 28 | mapper_transformation = {} 29 | moses_input = {} 30 | transformation_complex_count = 0 31 | 32 | # Transformation decoder 33 | decoder_graph_explorer = Explore_Decoder_Graph_Greedy(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 34 | probability_tables, METHOD_FEATURE_EXTRACT) 35 | for sentid in test_sentids: 36 | print sentid 37 | sent_data = test_boxerdata_dict[str(sentid)] 38 | main_sentence = sent_data[0] 39 | main_sent_dict = sent_data[1] 40 | boxer_graph = sent_data[2] 41 | 42 | # Explore decoder graph 43 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph) 44 | 45 | # # Generating boxer and decoder graph 46 | # if sentid not in [13, 28, 41]: 47 | # functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, decoder_graph) 48 | 49 | sentence_pairs = decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph) 50 | transformed_sentences = [item[0] for item in sentence_pairs] 51 | 52 | # Writing transformation results 53 | mapper_transformation[sentid] = [] 54 | for sent in transformed_sentences: 55 | mapper_transformation[sentid].append(transformation_complex_count) 56 | moses_input[transformation_complex_count] = sent 57 | transformation_complex_count += 1 58 | return mapper_transformation, moses_input 59 | 60 | def get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 61 | probability_tables, METHOD_FEATURE_EXTRACT): 62 | mapper_transformation = {} 63 | moses_input = {} 64 | transformation_complex_count = 0 65 | 66 | # Transformation decoder 67 | decoder_graph_explorer = Explore_Decoder_Graph_Explorative(TRANSFORMATION_MODEL, MAX_SPLIT_SIZE, RESTRICTED_DROP_RELATION, ALLOWED_DROP_MODIFIER, 68 | probability_tables, METHOD_FEATURE_EXTRACT) 69 | for sentid in test_sentids: 70 | print sentid 71 | sent_data = test_boxerdata_dict[str(sentid)] 72 | main_sentence = sent_data[0] 73 | main_sent_dict = sent_data[1] 74 | boxer_graph = sent_data[2] 75 | 76 | # Explore decoder graph 77 | print "Building decoder graph ..." 78 | decoder_graph = decoder_graph_explorer.explore_decoder_graph(str(sentid), main_sentence, main_sent_dict, boxer_graph) 79 | 80 | # Start updating edges with the probabilities, for unseen : 0.5/0.5 81 | print "Updating probability bottom-up ..." 
82 |         node_probability_dict, potential_edges = decoder_graph_explorer.start_probability_update(main_sentence, main_sent_dict, boxer_graph, decoder_graph)
83 | 
84 |         # Filtered decoder graph
85 |         print "Creating filtered decoder graph ..."
86 |         filtered_decoder_graph = decoder_graph_explorer.create_filtered_decoder_graph(potential_edges, main_sentence, main_sent_dict, boxer_graph, decoder_graph)
87 | 
88 |         # Generating boxer and decoder graph
89 |         functions_prepare_elementtree_dot.run_visual_graph_creator(str(sentid), main_sentence, main_sent_dict, [], boxer_graph, filtered_decoder_graph)
90 | 
91 |         sentence_pairs = filtered_decoder_graph.get_final_sentences(main_sentence, main_sent_dict, boxer_graph)
92 |         transformed_sentences = [item[0] for item in sentence_pairs]
93 | 
94 |         # Writing transformation results
95 |         mapper_transformation[sentid] = []
96 |         for sent in transformed_sentences:
97 |             mapper_transformation[sentid].append(transformation_complex_count)
98 |             moses_input[transformation_complex_count] = sent
99 |             transformation_complex_count += 1
100 |     return mapper_transformation, moses_input
101 | 
102 | if __name__ == "__main__":
103 |     argparser = argparse.ArgumentParser(prog='python start_simplifying_complex_sentence.py', description=('Start simplifying complex sentences.'))
104 | 
105 |     # Optional [default value: /home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml]
106 |     argparser.add_argument('--test-boxer-graph', help='The test corpus file (xml, stanford-tokenized, boxer-graph)', metavar=('Test_Boxer_Graph'),
107 |                            default='/home/ankh/Data/Simplification/Test-Data/complex.tokenized.boxer-graph.xml')
108 | 
109 |     # Optional [default value: 10]
110 |     argparser.add_argument('--nbest-distinct', help='Number of distinct n-best outputs produced by Moses', metavar=('N_Best_Distinct'), default='10')
111 | 
112 |     # Optional [default value: greedy]
113 |     argparser.add_argument('--explore-decoder', help='Method for generating the decoder graph', metavar=('Explore_Decoder'), choices=['greedy', 'explorative'], default='greedy')
114 | 
115 |     # Compulsory
116 |     argparser.add_argument('--d2s-config', help='D2S Configuration file', required=True, metavar=('D2S_Config'))
117 | 
118 |     # Compulsory
119 |     argparser.add_argument('--output-dir', help='The output directory', required=True, metavar=('Output_Directory'))
120 |     # #####################################
121 |     args_dict = vars(argparser.parse_args(sys.argv[1:]))
122 |     # #####################################
123 | 
124 |     # STEP:1 Creating the test directory in the output directory
125 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
126 |     test_output_directory = args_dict['output_dir']+"/Test-Results-"+args_dict["explore_decoder"].upper()
127 |     print timestamp+", Creating test result directory: "+test_output_directory
128 |     try:
129 |         os.mkdir(test_output_directory)
130 |     except OSError:
131 |         print test_output_directory + " directory already exists."
132 | 
133 |     # STEP:2 Configuration dictionary
134 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
135 |     print "\n"+timestamp+", Reading the D2S Configuration file ..."
136 |     D2S_Config_data = functions_configuration_file.parser_config_file(args_dict['d2s_config'])
137 | 
138 |     # STEP:3 Reading transformation model files
139 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
140 |     print "\n"+timestamp+", Reading transformation model files ..."
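    # Editor's note (assumption): these model files are the probability tables
    # written out during training (see the tail of start_learning_training_models.py
    # above); read_model_files is taken to simply load them back into memory as
    # probability_tables so the decoder can score candidate transformations.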
141 |     probability_tables = functions_model_files.read_model_files(D2S_Config_data["TRANSFORMATION-MODEL-DIR"], D2S_Config_data["TRANSFORMATION-MODEL"])
142 | 
143 |     # STEP:4 Reading the test corpus file (xml, stanford-tokenized, boxer-graph)
144 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
145 |     print "\n"+timestamp+", Start reading test corpus file (xml, stanford-tokenized, boxer-graph): "+args_dict['test_boxer_graph']+" ..."
146 |     print "Creating the SAX file (xml, stanford tokenized and boxer graph) handler ..."
147 |     test_boxerdata_dict = {}
148 |     test_sentids = []
149 |     testing_xml_handler = SAXPARSER_XML_StanfordTokenized_BoxerGraph("testing", args_dict['test_boxer_graph'], test_boxerdata_dict, D2S_Config_data["TRANSFORMATION-MODEL"],
150 |                                                                      D2S_Config_data["MAX-SPLIT-SIZE"], D2S_Config_data["RESTRICTED-DROP-RELATION"],
151 |                                                                      D2S_Config_data["ALLOWED-DROP-MODIFIER"], D2S_Config_data["METHOD-TRAINING-GRAPH"])
152 |     print "Start parsing "+args_dict['test_boxer_graph']+" ..."
153 |     testing_xml_handler.parse_xmlfile_generating_training_graph()
154 |     test_sentids = [int(item) for item in test_boxerdata_dict.keys()]
155 |     test_sentids.sort()
156 | 
157 |     # STEP:5 Applying the transformation models and creating the output of the transformation
158 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
159 |     print "\n"+timestamp+", Applying the transformation models and writing complex sentences after transformation ..."
160 |     mapper_transformation = {}
161 |     moses_input = {}
162 |     if args_dict["explore_decoder"] == "greedy":
163 |         mapper_transformation, moses_input = get_greedy_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
164 |                                                                        D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
165 |                                                                        probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
166 |     else:
167 |         mapper_transformation, moses_input = get_explorative_decoder_graph(test_boxerdata_dict, test_sentids, D2S_Config_data["TRANSFORMATION-MODEL"], D2S_Config_data["MAX-SPLIT-SIZE"],
168 |                                                                            D2S_Config_data["RESTRICTED-DROP-RELATION"], D2S_Config_data["ALLOWED-DROP-MODIFIER"],
169 |                                                                            probability_tables, D2S_Config_data["METHOD-FEATURE-EXTRACT"])
170 | 
171 |     print "Writing "+test_output_directory+"/transformation-output.moses-input ..."
172 |     d2s_complex_file = open(test_output_directory+"/transformation-output.moses-input", "w")
173 |     for sentid in test_sentids:
174 |         for moses_input_id in mapper_transformation[sentid]:
175 |             transformed_sent = moses_input[moses_input_id]
176 |             d2s_complex_file.write(transformed_sent.encode('utf-8')+"\n")
177 |     d2s_complex_file.close()
178 | 
179 |     print "Writing "+test_output_directory+"/transformation-output.map ..."
180 |     d2s_complex_map = open(test_output_directory+"/transformation-output.map", "w")  # each line: sentid followed by its moses_input ids
181 |     sentids = mapper_transformation.keys()
182 |     sentids.sort()
183 |     for sentid in sentids:
184 |         d2s_complex_map.write(str(sentid)+" ")
185 |         for item in mapper_transformation[sentid]:
186 |             d2s_complex_map.write(str(item)+" ")
187 |         d2s_complex_map.write("\n")
188 |     d2s_complex_map.close()
189 | 
190 |     print "Writing "+test_output_directory+"/transformation-output.simple ..."
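    # Editor's note: transformation-output.simple carries one line per test
    # sentence id, in sorted order; each line concatenates every split sentence
    # produced for that complex sentence, before any Moses substitution or
    # reordering has been applied.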
191 |     d2s_complex_file = open(test_output_directory+"/transformation-output.simple", "w")
192 |     for sentid in test_sentids:
193 |         simple_sentence = []
194 |         for moses_input_id in mapper_transformation[sentid]:
195 |             transformed_sent = moses_input[moses_input_id]
196 |             simple_sentence.append(transformed_sent)
197 |         d2s_complex_file.write((" ".join(simple_sentence)).encode('utf-8')+"\n")
198 |     d2s_complex_file.close()
199 | 
200 |     # STEP:6 Running Moses
201 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
202 |     print "\n"+timestamp+", Applying the moses translation model ..."
203 |     command = (MOSESDIR+"/bin/moses -f "+D2S_Config_data["MOSES-COMPLEX-SIMPLE-DIR"]+"/model/moses.ini "+
204 |                "-n-best-list "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"+" "+args_dict['nbest_distinct']+" distinct "+
205 |                "-input-file "+test_output_directory+"/transformation-output.moses-input")
206 |     os.system(command)
207 | 
208 |     # Reading the moses output file
209 |     print "Parsing the moses output file: "+test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple"
210 |     moses_output = {}
211 |     finput = open(test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple", "r")
212 |     datalines = finput.readlines()
213 |     for line in datalines:
214 |         parts = line.split(" ||| ")
215 |         sentid = int(parts[0].strip())  # the n-best list indexes Moses input segments, i.e. moses_input_id
216 |         sent = parts[1].strip()
217 |         if sentid not in moses_output:
218 |             moses_output[sentid] = [sent]
219 |         else:
220 |             moses_output[sentid].append(sent)
221 |     finput.close()
222 | 
223 |     # Storing the best moses output
224 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
225 |     print "\n"+timestamp+", Best output of moses ..."
226 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.simple.best"
227 |     print "Writing to the file: "+final_output_filename
228 |     final_output_file = open(final_output_filename, "w")
229 |     for sentid in test_sentids:
230 |         simple_sentence = []
231 |         for moses_input_id in mapper_transformation[sentid]:
232 |             moses_simple_output_best = moses_output[moses_input_id][0]  # the first n-best entry is Moses' highest-scoring candidate
233 |             simple_sentence.append(moses_simple_output_best)
234 |         final_output_file.write(" ".join(simple_sentence)+"\n")
235 |     final_output_file.close()
236 | 
237 |     # Running post-hoc reranking
238 |     timestamp = datetime.datetime.now().strftime("%A%d-%B%Y-%I%M%p")
239 |     print "\n"+timestamp+", Running the post-hoc reranking ..."
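    # Editor's note: the reranker below scores every n-best candidate by its
    # character-level Levenshtein distance (NLTK's edit_distance) from the
    # sentence that was fed to Moses, then ranks the most dissimilar candidates
    # first, presumably to favour outputs that changed the most during phrase
    # substitution and reordering.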
240 |     posthoc_reranked = {}
241 |     for sentid in test_sentids:
242 |         for moses_input_id in mapper_transformation[sentid]:
243 |             moses_complex_input = moses_input[moses_input_id]
244 |             moses_simple_outputs = moses_output[moses_input_id]
245 | 
246 |             posthoc_reranked[moses_input_id] = []
247 |             for simple_output in moses_simple_outputs:
248 |                 edit_dist = edit_distance(simple_output, moses_complex_input)
249 |                 posthoc_reranked[moses_input_id].append((edit_dist, simple_output))
250 | 
251 |             # Candidates most different from the Moses input are ranked at the top
252 |             posthoc_reranked[moses_input_id].sort(reverse=True)
253 | 
254 |     # Writing post-hoc reranked output
255 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple"
256 |     print "Writing to the file: "+final_output_filename
257 |     final_output_file = open(final_output_filename, "w")
258 |     for sentid in test_sentids:
259 |         for moses_input_id in mapper_transformation[sentid]:
260 |             for item in posthoc_reranked[moses_input_id]:
261 |                 final_output_file.write(str(moses_input_id)+"\t"+str(item[0])+"\t"+item[1]+"\n")
262 |             final_output_file.write("\n")
263 |     final_output_file.close()
264 | 
265 |     # Writing post-hoc reranked best output
266 |     final_output_filename = test_output_directory+"/transformation-output.moses-"+args_dict['nbest_distinct']+"best-distinct.post-hoc-reranking.simple.best"
267 |     print "Writing to the file: "+final_output_filename
268 |     final_output_file = open(final_output_filename, "w")
269 |     for sentid in test_sentids:
270 |         simple_sentence = []
271 |         for moses_input_id in mapper_transformation[sentid]:
272 |             simple_output_best = posthoc_reranked[moses_input_id][0][1]
273 |             simple_sentence.append(simple_output_best)
274 |         final_output_file.write(" ".join(simple_sentence)+"\n")
275 |     final_output_file.close()
276 | 
277 | 
278 |     # test_boxerdata_dict = {}
279 |     # test_sentids = []
280 | 
281 |     # mapper_transformation = {}
282 |     # moses_input = {}
283 | 
284 |     # moses_output = {}
285 | 
286 |     # posthoc_reranked = {}
287 | 
--------------------------------------------------------------------------------