├── input.txt
├── main.py
├── README.md
└── node.py


/input.txt:
--------------------------------------------------------------------------------
A
ABCD
AB
B
BD
BAD


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
from node import Node
#Set this to True to get a full tree print out
VERBOSE = True

#Open the input file and read each word into a list
with open('input.txt', 'r') as f:
    wordlist = f.read().splitlines()
wordlist = [s.upper() for s in wordlist]
#Sort this list into alphabetical order
#This may be a waste of time...
wordlist.sort()

#Remove all commented lines, along with blank lines, which would
# otherwise raise an IndexError when their first character is read
wordlist = [w for w in wordlist if w and not w.startswith('#')]
#Ensure the list has some words in it
if(len(wordlist) == 0):
    print('No elements found in list.')
    exit(0)

#This empty node will act as the head of the tree
head = Node()

#Add each word to the tree by creating new nodes if they do not exist
for word in wordlist:
    parent = head
    for letter in word:
        if(parent.getChildByData(letter) is None):
            parent.addChild(letter)
        parent = parent.getChildByData(letter)
    parent.validEnd = True

#Collapse the tree to get rid of extra nodes that can never
# be reached by themselves.
head.collapseTree()

if(VERBOSE):
    #printTree returns the text representation, so it has to be printed
    print(head.printTree())
#This generates the regEx one letter at a time.
#Should be updated to generate from the head
fullRegEx = head.generateRegEx()
print(fullRegEx)


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# regexGenerator
## Overview
This Python script generates a regular expression that can be used to match words from a word list. An online demo can be found on [my repl.it page](https://repl.it/@jkoppp/RegExGenerator). This demo is not as actively maintained and may not match this repo exactly.
## Installation
This program requires Python 3 to run. It can also be run online using [repl.it](https://repl.it).

Note: For the prettyPrintTree method of the Node class to work, graphviz must be installed on the system. As of right now, this is not working on repl.it. The script can generate a .gv file that can then be rendered using another tool, but if graphviz is installed on the machine the script runs on, it should be able to render straight to .png or a similar format.

## Full Description
### Purpose
This script was originally created to help with Reddit AutoModerator configuration. AutoModerator can check posts for certain keywords by using a regular expression. This script converts a text file of keywords into a regular expression that matches any of them.
### Main
The program uses the Node class to build a tree data structure. Each node in the raw tree holds one character, and each path from the HEAD down to a node marked as a valid end spells a word from the list. This structure allows the regular expression to be generated recursively and lets words share their common leading letters so those letters do not need to be repeated.

Example: A regular expression that is looking for 'New York', 'New Jersey', or 'New Mexico' can check for 'New ' and then, if that is found, check for the three possible endings.
The shared characters at the beginning do not need to be repeated. The raw tree for this example would look like this.

![](https://i.imgur.com/HnPMnz5.png)

This is very inefficient because most of those nodes are not needed separately. In this word list, an 'N' is never found without an 'E' after it, so these two nodes can be combined into one 'NE' node. The program accounts for this and optimizes the original tree. This optimization uses recursion to combine nodes that are only ever found together. The optimized tree is much smaller and can be seen below. More information about this optimization can be found in the [Tree Optimization](#tree-optimization) section.

![](https://i.imgur.com/Su7fTPC.png)

The regular expression is then generated from this optimized tree.

Example: The regular expression for this example looks like this.
````
((NEW ((JERSEY)|(MEXICO)|(YORK))))
````
You can see how the common characters are grouped together and the shared characters are only included once.

### Node Class
This class provides functions to build and access the tree data structure used in this project. Each node has at most one parent node and any number of child nodes. The HEAD of the tree should be the only node with no parent. Child nodes can be accessed directly through the node's child list, or you can search for a child by the data it contains. For this project, the data is always a string.
#### Visualizing the tree
There is a print method that returns a visual representation of the tree. The default string method returns a text-based representation of the tree. Each node is printed with indentation to show what level the node is on. Each node is also marked with a + or - to note whether the node is the end of a word in the list or not. A + represents a valid end to a word. A more advanced print method is also included.
The pretty print method uses graphviz to create a graphical representation of the tree. This method uses square nodes to represent valid word ends and circles for other nodes. However, this method is still under development and should be used with caution. The images used in this file were generated using the pretty print method but had to be rendered separately. The two print methods are shown below for the same tree.

![](https://i.imgur.com/9MIMtkZ.png)
![](https://i.imgur.com/w9cINfF.png)

#### Tree Optimization
The optimization of the tree is done in the collapseTree method. This method moves through the tree recursively and checks each node. If a node meets three criteria, it can be combined with its child.
+ The node must have only one child
+ The node must not be the end of a word in the input list
+ The node must not be the HEAD. The HEAD of the tree cannot be collapsed.

Example: If the input list includes the words 'A', 'ABE', 'ABCD', 'F', 'FGH', and 'FGHI', then the raw tree that is generated looks like this.

![](https://i.imgur.com/xTATQvI.png)

The square nodes are valid ends, which means that a word on the list ends at that node. 'FGH' is a valid word so that node is a valid end, but 'FG' is not on the list so that node is not a valid end. So, in this example, only two nodes meet all three criteria for the optimization.
+ Nodes A, F, and H cannot be collapsed because they are valid ends
+ Nodes D, E, and I do not have children so they cannot be combined
+ Node B cannot be collapsed because it has more than one child
+ Nodes C and G meet all three criteria so they can be combined with the node below them.

This optimization results in the following tree. You can see that nodes C and G were combined with the node below them.
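
The three criteria can also be sketched in a few lines of standalone Python. This is a minimal illustration, not the project's code: `TrieNode`, `build`, `can_collapse`, `collapse`, and `labels` are hypothetical names, and the word list assumes 'A' is one of the input words, as the bullets above imply by treating node A as a valid end.

```python
class TrieNode:
    """Hypothetical stand-in for the project's Node class."""

    def __init__(self, data=None, valid_end=False):
        self.data = data            # None marks the HEAD
        self.valid_end = valid_end  # True if a word ends at this node
        self.children = []

    def child(self, data):
        """Return the child holding `data`, creating it if needed."""
        for node in self.children:
            if node.data == data:
                return node
        node = TrieNode(data)
        self.children.append(node)
        return node


def build(words):
    """Build the raw tree with one character per node."""
    head = TrieNode()
    for word in words:
        node = head
        for letter in word:
            node = node.child(letter)
        node.valid_end = True
    return head


def can_collapse(node):
    """The three criteria: one child, not a word end, not the HEAD."""
    return (len(node.children) == 1
            and not node.valid_end
            and node.data is not None)


def collapse(node):
    """Merge a node with its only child for as long as the criteria hold."""
    while can_collapse(node):
        only = node.children[0]
        node.data += only.data
        node.valid_end = only.valid_end
        node.children = only.children
    for child in node.children:
        collapse(child)


def labels(node):
    """Return node labels in depth-first order, skipping the HEAD."""
    found = [] if node.data is None else [node.data]
    for child in node.children:
        found.extend(labels(child))
    return found


# Assumed word list: the example words plus 'A'
head = build(['A', 'ABE', 'ABCD', 'F', 'FGH', 'FGHI'])
collapse(head)
print(labels(head))  # ['A', 'B', 'E', 'CD', 'F', 'GH', 'I']
```

In the printed labels, only C and G have merged with the node below them, giving 'CD' and 'GH'; every other node fails at least one of the three checks.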

![](https://i.imgur.com/w9cINfF.png)


--------------------------------------------------------------------------------
/node.py:
--------------------------------------------------------------------------------
from graphviz import Graph
class Node:
    ########################################################
    # __init__
    #   Description: Creates a new node
    #   Parameters:
    #     letter (Optional) - String that holds the node data
    #     par (Optional) - This node's parent
    #     end (Optional) - Boolean that stores whether or not
    #                      this node is a valid end to a word
    #   Returns: Nothing
    #########################################################
    def __init__(self, letter=None, par=None, end=False):
        self.parent = par
        self.data = letter
        self.validEnd = end
        self.child = []
        return

    ########################################################
    # addChild
    #   Description: Creates and adds a new child node
    #   Parameters:
    #     data - String that holds the child's data
    #   Returns: Nothing
    #########################################################
    def addChild(self, data):
        self.child.append(Node(data, self))
        return

    ########################################################
    # getChildByData
    #   Description: Searches for and returns a child that
    #                matches the data.
    #   Parameters:
    #     d - String to search for
    #   Returns: Child with matching data or None if no child
    #            is found
    #########################################################
    def getChildByData(self, d):
        for i in self.child:
            if i.data == d:
                return i
        return None

    ########################################################
    # collapseTree
    #   Description: Recursive method that reduces the tree's
    #                size by combining elements that do not
    #                occur in the word list separately
    #   Parameters: None
    #   Returns: None
    #########################################################
    def collapseTree(self):
        while((len(self.child) == 1) and not(self.validEnd) and (self.data is not None)):
            self.data += self.child[0].data
            self.validEnd = self.child[0].validEnd
            self.child = self.child[0].child
            #Re-point the adopted children at this node so that their
            # parent links do not refer to the node that was merged away
            for adopted in self.child:
                adopted.parent = self
        for child in self.child:
            child.collapseTree()
        return

    ########################################################
    # generateRegEx
    #   Description: Creates a regular expression string that
    #                matches all words found in the list.
    #   Parameters:
    #     regEx (Private) - This parameter is only used for
    #                       recursion. Do not pass a
    #                       parameter when calling this function
    #   Returns: A regular expression as a string
    #########################################################
    def generateRegEx(self, regEx=""):
        if(self.data is not None):
            regEx += '('
            regEx += self.data
        if(len(self.child) > 1):
            regEx += '('
        for c in self.child:
            regEx = c.generateRegEx(regEx)
        if(len(self.child) > 1):
            regEx += ')'
            if(self.validEnd):
                #A word can also end here, so the group of children is optional
                regEx += '?'
        if(self.data is not None):
            if(self is not self.parent.child[-1]):
                #Not Last Child
                regEx += ")|"
            else:
                if(len(self.parent.child) > 1):
                    #Last And Not Only Child
                    regEx += ')'
                elif(self.parent.validEnd):
                    #Only Child And Parent is End
                    regEx += ")?"
                else:
                    #Only Child and Parent is not End
                    regEx += ')'
        return regEx

    ########################################################
    # printTree
    #   Description: Creates a text based representation of the
    #                tree.
    #   Parameters:
    #     string (Private) - This parameter is only used for
    #                        recursion. Do not pass a
    #                        parameter when calling this
    #                        function
    #     indent (Private) - This parameter is only used for
    #                        recursion. Do not pass a
    #                        parameter when calling this
    #                        function
    #   Returns: Text representation of the tree as a string
    #########################################################
    def printTree(self, string="", indent=""):
        if(self.data is None):
            if(self.validEnd):
                string += indent+'+'+"None"
            else:
                string += indent+'-'+"None"
        else:
            if(self.validEnd):
                string += indent+'|-+'+self.data
            else:
                string += indent+'|--'+self.data
        string += "\n"
        for chi in self.child:
            string = chi.printTree(string, indent+" ")
        return string

    ########################################################
    # __str__
    #   Description: This function ensures that if you use the
    #                print() function a string representation
    #                is given
    #   Parameters: None
    #   Returns: Text representation of the tree as a string
    #########################################################
    def __str__(self):
        return self.printTree()

    ########################################################
    # prettyPrintTree
    #   Description: This function uses graphviz to
    #                create a graphical
    #                representation of the
    #                tree. This method requires graphviz to
    #                be installed on the machine or else the
    #                graphic will have to be generated separately
    #   Parameters:
    #     graph - A graphviz Graph object
    #     parent_index (Private) - This parameter is only used
    #                              for recursion. Do not pass a
    #                              parameter when calling this
    #                              function
    #     current_index (Private) - This parameter is only used
    #                               for recursion. Do not pass a
    #                               parameter when calling this
    #                               function
    #   Returns: The next unused node index. The graph object
    #            that was passed in is modified in place.
    #########################################################
    def prettyPrintTree(self, graph, parent_index=None, current_index=0):
        if(self.data is None):
            label = "HEAD"
        else:
            label = self.data
        if(self.validEnd):
            graph.node(str(current_index), label, shape="square")
        else:
            graph.node(str(current_index), label, shape="circle")
        if(parent_index is not None):
            graph.edge(str(parent_index), str(current_index))
        index = current_index+1
        for chi in self.child:
            index = chi.prettyPrintTree(graph, current_index, index)
        return index

--------------------------------------------------------------------------------