├── .gitignore ├── README.md ├── assembler.py ├── basicblock.py ├── decoder.py ├── deobfuscator.py ├── disassembler.py ├── images ├── after-forwarder.png ├── after-merge.png ├── before-forwarder.png └── before-merge.png ├── instruction.py ├── main.py ├── simplifier.py ├── utils ├── __init__.py ├── rendergraph.py └── striplineno.py └── verifier.py /.gitignore: -------------------------------------------------------------------------------- 1 | .idea/ 2 | *.pyc 3 | *.svg -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Bytecode Simplifier 2 | 3 | Bytecode simplifier is a tool to deobfuscate PjOrion protected python scripts. 4 | This is a complete rewrite of my older tool [PjOrion Deobfuscator](https://github.com/extremecoders-re/PjOrion-Deobfuscator) 5 | 6 | ## Pre-requisites 7 | 8 | You need to have the following packages pre-installed: 9 | - [networkx](http://networkx.github.io/) 10 | - [pydotplus](http://pydotplus.readthedocs.io/) 11 | 12 | Both of the packages are `pip` installable. Additionally, make sure graphviz executable is in your path for `pydotplus` to work. 13 | `pydotplus` is required only for drawing graphs and if you do not want this feature you can comment out the `render_graph` calls in `deobfuscate` function in file [deobfuscator.py](deobfuscator.py) 14 | 15 | ## Statutory Warning 16 | 17 | PjOrion obfuscates the original file and introduces several wrapper layers on top of it. The purpose of these layers is simply to (sort of) decrypt the next inner layer and execute it via an `EXEC_STMT` instruction. Hence you CANNOT use this tool as-is on an obfuscated file. First, you would need to remove the wrapper layers and get hold of the actual obfuscated code object. Then you can marshal the obfuscated code to disk and run this tool on it which should hopefully give you back the deobfuscated code. 
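The "marshal the obfuscated code to disk" step above can be sketched as follows. This is a minimal illustration, not part of the tool: `dump_code_object` and `load_code_object` are hypothetical helper names, and the header that `main.py` expects in front of the marshalled data is an assumption (for CPython 2.7 a `.pyc` header is 8 bytes: a 4-byte magic number followed by a 4-byte timestamp).

```python
import marshal


def dump_code_object(code_obj, path, header=b''):
    """Write `header` followed by the marshalled code object to `path`.

    `header` is supplied by the caller. Which header main.py accepts is an
    assumption here; on CPython 2.7 it would be the 8-byte pyc header.
    """
    with open(path, 'wb') as fh:
        fh.write(header)
        fh.write(marshal.dumps(code_obj))


def load_code_object(path, header_size=0):
    """Skip `header_size` bytes, then unmarshal the code object from `path`."""
    with open(path, 'rb') as fh:
        fh.seek(header_size)
        return marshal.loads(fh.read())
```

The code object passed in would be the innermost one recovered after peeling off the wrapper layers. Marshalled data is only guaranteed to load on the same Python version that wrote it, so the dump should be produced with the interpreter version the obfuscated script targets.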
18 | 19 | Refer to [this](https://0xec.blogspot.com/2017/07/deobfuscating-pjorion-using-bytecode.html) blog post for details. 20 | 21 | ## Implementation details 22 | 23 | ### Analysis 24 | 25 | Bytecode simplifier analyzes obfuscated code using a recursive traversal disassembly approach. 26 | The instructions are disassembled and combined into basic blocks. A basic block is a sequence of instructions with a single entry and a single exit. 27 | 28 | ### Representation 29 | 30 | A basic block may end with a control flow instruction such as an `if` statement. An `if` statement has two branches - true and false. 31 | Corresponding to these branches, basic blocks have edges between them. This entire structure is represented by a [Directed Graph](https://en.wikipedia.org/wiki/Directed_graph). 32 | We use [`networkx.DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) for this purpose. 33 | 34 | The nodes in the graph represent the basic blocks whereas the edges between the nodes represent the control flow between the basic blocks. 35 | 36 | A conditional control transfer instruction like `if` has two branches. The true branch is executed when the condition holds. We tag this branch as the explicit edge because it is explicitly specified by the `if` statement. The other branch, viz. the false one, is tagged as an implicit edge, as it is logically deduced. 37 | 38 | An unconditional control transfer instruction like `jump` has only an explicit edge. 39 | 40 | ### De-obfuscation 41 | 42 | Once we have the code in graph form it is easier to reason about it. Hence simplification of the obfuscated code is done on this graph. 43 | 44 | Two simplification strategies are principally used: 45 | - Forwarder elimination 46 | - Basic block merging 47 | 48 | #### Forwarder elimination 49 | A forwarder is a basic block that consists of a single unconditional control flow instruction. 
A forwarder transfers the execution to some other basic block without doing any useful work. 50 | 51 | **Before forwarder elimination** 52 | 53 | ![](images/before-forwarder.png) 54 | 55 | The basic block highlighted in yellow is a forwarder block. It can be eliminated to give the structure below. 56 | 57 | **After forwarder elimination** 58 | 59 | ![](images/after-forwarder.png) 60 | 61 | However, a forwarder cannot always be eliminated, specifically in cases where the forwarder has implicit in-edges. Refer to [simplifier.py](simplifier.py) for the details. 62 | 63 | #### Basic block merging 64 | 65 | A basic block can be merged into its predecessor if and only if there is exactly one predecessor and the predecessor has this basic block as its lone successor. 66 | 67 | **Before merging** 68 | 69 | ![](images/before-merge.png) 70 | 71 | The highlighted instructions of the basic blocks have been merged to form a bigger basic block shown below. Control transfer instructions have been removed as they are not needed anymore. 72 | 73 | **After merging** 74 | 75 | ![](images/after-merge.png) 76 | 77 | 78 | ### Assembly 79 | 80 | Once the graph has been simplified, we need to assemble the basic blocks back into flat code. This is implemented in the file [assembler.py](assembler.py). 81 | 82 | The assembly process itself consists of several sub-stages: 83 | 84 | 1. **DFS**: Do a depth-first search of the graph to determine the order of the basic blocks when they are laid out sequentially. (This is done in a post-order fashion.) 85 | 2. **Removal of redundant jumps**: If basic block A has a jump instruction to block B, but block B is immediately located after A, then the jump instruction can safely be removed. This feature is ***experimental*** and is known to break decompilers. However, the functionality of the code is not affected. The advantage of this feature is that it reduces code size. This feature is disabled by default. 86 | 3. 
**Changing `JUMP_FORWARD` to `JUMP_ABSOLUTE`**: If basic block A has a relative control flow instruction to block B, then block B must be located after block A in the generated layout. This is because relative control flow instructions are USUALLY used to refer to addresses located after them. If block B ends up before block A and the relative c.f. instruction is `JUMP_FORWARD`, we can change it to `JUMP_ABSOLUTE`. 87 | 4. **Re-introducing forwarder blocks**: For other relative c.f. instructions like `SETUP_LOOP`, `SETUP_EXCEPT` etc., we need to create a new forwarder block consisting of an absolute jump instruction to block B, and make the relative control flow instruction in block A point to the forwarder block. This works since the forwarder block will naturally be after block A in the generated layout and relative instructions can always be used to point to blocks located after them, i.e. blocks that have a higher address. 88 | 5. **Calculation of block addresses**: Once the layout of the blocks is fixed, we need to calculate the address of each block. 89 | 6. **Calculation of instruction operands**: Instructions like `JUMP_FORWARD` & `SETUP_LOOP` use the operand to refer to other instructions. This reference is an integer denoting the offset/absolute address of the target. 90 | 7. **Emitting code**: This is the final code generation phase where the instructions are converted to their binary representations. 91 | 92 | ## Example usage 93 | 94 | ``` 95 | $ python main.py --ifile=obfuscated.pyc --ofile=deobfuscated.pyc 96 | 97 | INFO:__main__:Opening file obfuscated.pyc 98 | INFO:__main__:Input pyc file header matched 99 | DEBUG:__main__:Unmarshalling file 100 | INFO:__main__:Processing code object \x0b\x08\x0c\x19\x0b\x0e\x03 101 | DEBUG:deobfuscator:Code entrypoint matched PjOrion signature v1 102 | INFO:deobfuscator:Original code entrypoint at 124 103 | INFO:deobfuscator:Starting control flow analysis... 104 | DEBUG:disassembler:Finding leaders... 
105 | DEBUG:disassembler:Start leader at 124 106 | DEBUG:disassembler:End leader at 127 107 | . 108 | 109 | . 110 | DEBUG:disassembler:Found 904 leaders 111 | DEBUG:disassembler:Constructing basic blocks... 112 | DEBUG:disassembler:Creating basic block 0x27dc5a8 spanning from 13 to 13, both inclusive 113 | DEBUG:disassembler:Creating basic block 0x2837800 spanning from 5369 to 5370, end exclusive 114 | . 115 | 116 | . 117 | DEBUG:disassembler:461 basic blocks created 118 | DEBUG:disassembler:Constructing edges between basic blocks... 119 | DEBUG:disassembler:Adding explicit edge from block 0x2a98080 to 0x2aa88a0 120 | DEBUG:disassembler:Adding explicit edge from block 0x2aa80f8 to 0x2a9ab70 121 | DEBUG:disassembler:Basic block 0x2aa8dc8 has xreference 122 | . 123 | 124 | . 125 | INFO:deobfuscator:Control flow analysis completed. 126 | INFO:deobfuscator:Starting simplication of basic blocks... 127 | DEBUG:simplifier:Eliminating forwarders... 128 | INFO:simplifier:Adding implicit edge from block 0x2aa8058 to 0x2a9ab70 129 | INFO:simplifier:Adding explicit edge from block 0x2b07ee0 to 0x2a9ab70 130 | DEBUG:simplifier:Forwarder basic block 0x2aa80f8 eliminated 131 | . 132 | 133 | . 134 | INFO: 135 | INFO:simplifier:307 basic blocks merged. 136 | INFO:deobfuscator:Simplication of basic blocks completed. 137 | INFO:deobfuscator:Beginning verification of simplified basic block graph... 138 | INFO:deobfuscator:Verification succeeded. 139 | INFO:deobfuscator:Assembling basic blocks... 140 | DEBUG:assembler:Performing a DFS on the graph to generate the layout of the blocks. 141 | DEBUG:assembler:Morphing some JUMP_ABSOLUTE instructions to make file decompilable. 142 | DEBUG:assembler:Verifying generated layout... 143 | INFO:assembler:Basic block 0x2b0e940 uses a relative control transfer instruction to access block 0x2abb3a0 located before it. 144 | INFO:assembler:Basic block 0x2ab5300 uses a relative control transfer instruction to access block 0x2ada918 located before it. 
145 | DEBUG:assembler:Successfully verified layout. 146 | DEBUG:assembler:Calculating addresses of basic blocks. 147 | DEBUG:assembler:Calculating instruction operands. 148 | DEBUG:assembler:Generating code... 149 | INFO:deobfuscator:Successfully assembled. 150 | INFO:__main__:Successfully deobfuscated code object main 151 | INFO:__main__:Collecting constants for code object main 152 | INFO:__main__:Generating new code object for main 153 | INFO:__main__:Generating new code object for \x0b\x08\x0c\x19\x0b\x0e\x03 154 | INFO:__main__:Writing deobfuscated code object to disk 155 | INFO:__main__:Success 156 | ``` 157 | -------------------------------------------------------------------------------- /assembler.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import cStringIO 4 | import networkx as nx 5 | import dis 6 | 7 | from basicblock import BasicBlock 8 | from instruction import Instruction 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class Assembler: 14 | def __init__(self, bb_graph): 15 | """ 16 | :param bb_graph: The graph of basic blocks 17 | :type bb_graph: nx.DiGraph 18 | """ 19 | self.bb_graph = bb_graph 20 | self.bb_ordered = [None] * self.bb_graph.number_of_nodes() 21 | self.idx = self.bb_graph.number_of_nodes() - 1 22 | 23 | def assemble(self): 24 | """ 25 | Assembling consists of several sub-stages: 26 | 1. Depth first search the graph to fix the order of the nodes when they are laid out sequentially. 27 | (This is done in a post-order fashion) 28 | 2. 
29 | 30 | """ 31 | entryblock = nx.get_node_attributes(self.bb_graph, 'isEntry').keys()[0] 32 | logger.debug('Performing a DFS on the graph to generate the layout of the blocks.') 33 | self.dfs(entryblock) 34 | 35 | logger.debug('Morphing some JUMP_ABSOLUTE instructions to make file decompilable.') 36 | self.convert_abs_to_rel() 37 | 38 | # self.remove_redundant_jumps() 39 | 40 | # If basic block A has a relative control flow instruction to block B, then block B 41 | # must be located after block A in the generated layout. 42 | # This is because relative control flow instructions are USUALLY used to refer to 43 | # addresses located after it. 44 | 45 | # If the relative c.f. instruction is JUMP_FORWARD we can change to JUMP_ABSOLUTE without 46 | # any further modifications. 47 | 48 | # For other relative c.f instructions like SETUP_LOOP, SETUP_EXCEPT etc, 49 | # we need to create an new forwarder block consisting of an absolute jump instruction 50 | # to block B, and make the relative control flow instruction in block A to point to 51 | # the forwarder block. This works since the forwarder block will naturally be after 52 | # block A in the generated layout and relative instructions can be always used to point 53 | # to blocks located after it, i.e. have a higher address. 
54 | logger.debug('Verifying generated layout...') 55 | for idx in xrange(len(self.bb_ordered)): 56 | block = self.bb_ordered[idx] 57 | for ins in block.instruction_iter(): 58 | if ins.opcode in dis.hasjrel: 59 | targetBlock = ins.argval 60 | 61 | # Check if target block occurs before the current block 62 | if self.bb_ordered.index(targetBlock) <= idx: 63 | logger.info( 64 | 'Basic block {} uses a relative control transfer instruction to access block {} located before it.'.format( 65 | hex(id(block)), hex(id(targetBlock)))) 66 | 67 | # Modify relative jump to absolute jump 68 | if ins.mnemonic == 'JUMP_FORWARD': 69 | ins.opcode = dis.opmap['JUMP_ABSOLUTE'] 70 | 71 | # If instruction is a relative control transfer instruction 72 | # but is not JUMP_FORWARD (like SETUP_LOOP) 73 | else: 74 | # Create a new forwarder block 75 | bb = self.create_forwarder_block(targetBlock) 76 | 77 | # Make the original instruction point to the new block 78 | ins.argval = bb 79 | 80 | # Append new block at end 81 | self.bb_ordered.append(bb) 82 | 83 | logger.debug('Successfully verified layout.') 84 | self.calculate_block_addresses() 85 | self.calculate_ins_operands() 86 | return self.emit() 87 | 88 | def create_forwarder_block(self, target): 89 | """ 90 | Create a new basic block consisting of a `JUMP_ABSOLUTE` 91 | instruction to target block 92 | 93 | :param target: The target basic block to jump to 94 | :type target: BasicBlock 95 | :return: The new basic block 96 | :rtype: BasicBlock 97 | """ 98 | bb = BasicBlock() 99 | ins = Instruction(dis.opmap['JUMP_ABSOLUTE'], target, 3) 100 | ins.argval = target 101 | bb.add_instruction(ins) 102 | return bb 103 | 104 | def dfs(self, bb): 105 | """ 106 | Depth first search. 
107 | Ported from: https://github.com/python/cpython/blob/2.7/Python/compile.c#L3409 108 | 109 | :param bb: The basic block 110 | :type bb: basicblock.BasicBlock 111 | """ 112 | 113 | # Return if the current block has already been visited 114 | if bb.b_seen: 115 | return 116 | 117 | # Mark this block as visited 118 | bb.b_seen = True 119 | 120 | # Recursively dfs on all out going explicit edges 121 | for o_edge in self.bb_graph.out_edges(bb, data=True): 122 | # o_edge is a tuple (edge src, edge dest, edge attrib dict) 123 | if o_edge[2]['edge_type'] == 'explicit': 124 | self.dfs(o_edge[1]) 125 | 126 | # Iterate over the instructions in the basic block 127 | for ins in bb.instruction_iter(): 128 | # Recursively dfs if instruction have xreferences 129 | if ins.has_xref(): 130 | self.dfs(ins.argval) 131 | 132 | # Recursively dfs on all out going implicit edges 133 | for o_edge in self.bb_graph.out_edges(bb, data=True): 134 | # o_edge is a tuple (edge src, edge dest, edge attrib dict) 135 | if o_edge[2]['edge_type'] == 'implicit': 136 | self.dfs(o_edge[1]) 137 | 138 | # Add the basic block in a reversed order 139 | self.bb_ordered[self.idx] = bb 140 | self.idx -= 1 141 | 142 | def calculate_block_addresses(self): 143 | """ 144 | Once the layout of the blocks are fixed, we need to calculate the address of each block. 145 | """ 146 | logger.debug('Calculating addresses of basic blocks.') 147 | size = 0 148 | for block in self.bb_ordered: 149 | block.address = size 150 | size += block.size() 151 | 152 | def calculate_ins_operands(self): 153 | """ 154 | Instructions like JUMP_FORWARD & SETUP_LOOP uses the operand to refer to other instructions. 155 | This reference is an integer denoting the offset/absolute address of the target. 
This function 156 | calculates the values of these operands. 157 | """ 158 | logger.debug('Calculating instruction operands.') 159 | for block in self.bb_ordered: 160 | addr = block.address 161 | for ins in block.instruction_iter(): 162 | addr += ins.size 163 | if ins.opcode in dis.hasjabs: 164 | # ins.argval is a BasicBlock 165 | ins.arg = ins.argval.address 166 | # TODO 167 | # We do not generate EXTENDED_ARG opcode at the moment, 168 | # hence size of opcode argument can only be 2 bytes 169 | assert ins.arg <= 0xFFFF 170 | elif ins.opcode in dis.hasjrel: 171 | ins.arg = ins.argval.address - addr 172 | # a relative jump can USUALLY only go forward 173 | assert ins.arg >= 0 174 | assert ins.arg <= 0xFFFF 175 | 176 | def emit(self): 177 | logger.debug('Generating code...') 178 | codestring = cStringIO.StringIO() 179 | for block in self.bb_ordered: 180 | for ins in block.instruction_iter(): 181 | codestring.write(ins.assemble()) 182 | return codestring.getvalue() 183 | 184 | def convert_abs_to_rel(self): 185 | """ 186 | A JUMP_ABSOLUTE instruction from basic block A to block B can be replaced with a JUMP_FORWARD 187 | if block B is located after block A. 188 | This conversion is not really required, but some decompilers like uncompyle fail without it. 189 | """ 190 | for idx in xrange(len(self.bb_ordered)): 191 | block = self.bb_ordered[idx] 192 | 193 | # Fetch the last instruction 194 | ins = block.instructions[-1] 195 | 196 | # A JUMP_ABSOLUTE instruction whose target block is located after it 197 | if ins.mnemonic == 'JUMP_ABSOLUTE' and self.bb_ordered.index(ins.argval) > idx: 198 | ins.opcode = dis.opmap['JUMP_FORWARD'] 199 | ins.mnemonic = 'JUMP_FORWARD' 200 | 201 | def remove_redundant_jumps(self): 202 | """ 203 | If basic block A has a jump instruction to block B, but block B is immediately located after A, 204 | then the jump instruction can safely be removed. 205 | This feature is experimental and may break decompilers. 
The advantage of this feature is that it reduces 206 | generated code size. 207 | """ 208 | logger.warning('Removing redundant jump instruction. This feature is EXPERIMENTAL.') 209 | numRemoved = 0 210 | 211 | for idx in xrange(len(self.bb_ordered)): 212 | block = self.bb_ordered[idx] 213 | 214 | # Fetch the last instruction 215 | ins = block.instructions[-1] 216 | 217 | if ins.mnemonic == 'JUMP_ABSOLUTE' or ins.mnemonic == 'JUMP_FORWARD': 218 | target = ins.argval 219 | # If target block is immediately located after it 220 | if self.bb_ordered.index(target) == idx + 1: 221 | # Remove the instruction 222 | del block.instructions[-1] 223 | numRemoved += 1 224 | logger.debug('Removed {} redundant jump instructions'.format(numRemoved)) 225 | -------------------------------------------------------------------------------- /basicblock.py: -------------------------------------------------------------------------------- 1 | class BasicBlock: 2 | """ 3 | A basic block is a sequence of instructions that has a single entry and a single exit. 4 | Execution begins from the top and ends at the bottom. There can be no branching 5 | in between. 6 | """ 7 | 8 | def __init__(self): 9 | self.address = 0 10 | self.instructions = [] 11 | self.has_xrefs_to = False 12 | # Instructions which xreference this basic block 13 | self.xref_instructions = [] 14 | 15 | # b_seen is used to perform a DFS of basicblocks 16 | self.b_seen = False 17 | 18 | def add_instruction(self, ins): 19 | self.instructions.append(ins) 20 | 21 | def instruction_iter(self): 22 | """ 23 | An iterator for traversing over the instructions. 
24 | 25 | :return: An iterator for iterating over the instruction 26 | """ 27 | for ins in self.instructions: 28 | yield ins 29 | 30 | def size(self): 31 | """ 32 | Calculates the size of the basic block 33 | :return: 34 | """ 35 | return reduce(lambda x, ins: x + ins.size, self.instructions, 0) 36 | -------------------------------------------------------------------------------- /decoder.py: -------------------------------------------------------------------------------- 1 | import dis 2 | 3 | from instruction import Instruction 4 | 5 | 6 | class Decoder: 7 | """ 8 | Class to decode raw bytes into instruction. 9 | """ 10 | 11 | def __init__(self, insBytes): 12 | self.insBytes = insBytes 13 | 14 | def decode_at(self, offset): 15 | assert offset < len(self.insBytes) 16 | 17 | opcode = self.insBytes[offset] 18 | 19 | if opcode == dis.opmap['EXTENDED_ARG']: 20 | raise Exception('EXTENDED_ARG not yet implemented') 21 | 22 | # Invalid instruction 23 | if opcode not in dis.opmap.values(): 24 | return Instruction(-1, None, 1) 25 | 26 | if opcode < dis.HAVE_ARGUMENT: 27 | return Instruction(opcode, None, 1) 28 | 29 | if opcode >= dis.HAVE_ARGUMENT: 30 | arg = (self.insBytes[offset + 2] << 8) | self.insBytes[offset + 1] 31 | return Instruction(opcode, arg, 3) 32 | -------------------------------------------------------------------------------- /deobfuscator.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | from assembler import Assembler 4 | from simplifier import Simplifier 5 | from decoder import Decoder 6 | from disassembler import Disassembler 7 | from utils.rendergraph import render_graph 8 | from verifier import verify_graph 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | def find_oep(insBytes): 14 | """ 15 | Finds the original entry point of a code object obfuscated by PjOrion. 16 | If the entrypoint does not match the predefine signature it will return 0. 
17 | 18 | :param insBytes: the code object 19 | :type insBytes: bytearray 20 | :returns: the entrypoint 21 | :rtype: int 22 | """ 23 | 24 | dec = Decoder(insBytes) 25 | ins = dec.decode_at(0) 26 | 27 | try: 28 | # First instruction sets up an exception handler 29 | assert ins.mnemonic == 'SETUP_EXCEPT' 30 | 31 | # Get location of exception handler 32 | exc_handler = 0 + ins.arg + ins.size 33 | 34 | # Second instruction is intentionally invalid, on execution 35 | # control transfers to exception handler 36 | assert dec.decode_at(3).is_opcode_valid() == False 37 | 38 | assert dec.decode_at(exc_handler).mnemonic == 'POP_TOP' 39 | assert dec.decode_at(exc_handler + 1).mnemonic == 'POP_TOP' 40 | assert dec.decode_at(exc_handler + 2).mnemonic == 'POP_TOP' 41 | logger.debug('Code entrypoint matched PjOrion signature v1') 42 | oep = exc_handler + 3 43 | except: 44 | if ins.mnemonic == 'JUMP_FORWARD': 45 | oep = 0 + ins.arg + ins.size 46 | logger.debug('Code entrypoint matched PjOrion signature v2') 47 | elif ins.mnemonic == 'JUMP_ABSOLUTE': 48 | oep = ins.arg 49 | logger.debug('Code entrypoint matched PjOrion signature v2') 50 | else: 51 | logger.warning('Code entrypoint did not match PjOrion signature') 52 | oep = 0 53 | 54 | return oep 55 | 56 | 57 | def deobfuscate(codestring): 58 | # Instructions are stored as a string, we need 59 | # to convert it to an array of the raw bytes 60 | insBytes = bytearray(codestring) 61 | 62 | oep = find_oep(insBytes) 63 | logger.info('Original code entrypoint at {}'.format(oep)) 64 | 65 | logger.info('Starting control flow analysis...') 66 | disasm = Disassembler(insBytes, oep) 67 | disasm.find_leaders() 68 | disasm.construct_basic_blocks() 69 | disasm.build_bb_edges() 70 | logger.info('Control flow analysis completed.') 71 | logger.info('Starting simplication of basic blocks...') 72 | render_graph(disasm.bb_graph, 'before.svg') 73 | simplifier = Simplifier(disasm.bb_graph) 74 | simplifier.eliminate_forwarders() 75 | 
render_graph(simplifier.bb_graph, 'after_forwarder.svg') 76 | simplifier.merge_basic_blocks() 77 | logger.info('Simplification of basic blocks completed.') 78 | simplified_graph = simplifier.bb_graph 79 | render_graph(simplified_graph, 'after.svg') 80 | logger.info('Beginning verification of simplified basic block graph...') 81 | 82 | if not verify_graph(simplified_graph): 83 | logger.error('Verification failed.') 84 | raise SystemExit 85 | 86 | logger.info('Verification succeeded.') 87 | logger.info('Assembling basic blocks...') 88 | asm = Assembler(simplified_graph) 89 | codestring = asm.assemble() 90 | logger.info('Successfully assembled. ') 91 | return codestring 92 | -------------------------------------------------------------------------------- /disassembler.py: -------------------------------------------------------------------------------- 1 | import Queue 2 | import logging 3 | import collections 4 | import dis 5 | import networkx as nx 6 | 7 | from basicblock import BasicBlock 8 | from decoder import Decoder 9 | 10 | logger = logging.getLogger(__name__) 11 | 12 | 13 | class Disassembler: 14 | """ 15 | A Recursive traversal disassembler. 16 | """ 17 | 18 | def __init__(self, insBytes, entrypoint): 19 | self.insBytes = insBytes 20 | self.entrypoint = entrypoint 21 | self.leaders = None 22 | self.bb_graph = nx.DiGraph() 23 | 24 | def get_next_ins_addresses(self, ins, addr): 25 | """ 26 | Given an instruction and an address at which this resides, this function 27 | returns a dictionary of addresses of the instruction expected to be executed next. 28 | 29 | explicit addresses are indicated by the control flow of the instruction. 30 | implicit address is the address of the instruction located sequentially after. 
31 | 32 | :rtype: dict 33 | """ 34 | next_addresses = {} 35 | 36 | if ins.mnemonic == 'JUMP_IF_FALSE_OR_POP': 37 | next_addresses['implicit'] = addr + ins.size 38 | next_addresses['explicit'] = ins.arg 39 | 40 | elif ins.mnemonic == 'JUMP_IF_TRUE_OR_POP': 41 | next_addresses['implicit'] = addr + ins.size 42 | next_addresses['explicit'] = ins.arg 43 | 44 | elif ins.mnemonic == 'JUMP_ABSOLUTE': 45 | next_addresses['explicit'] = ins.arg 46 | 47 | elif ins.mnemonic == 'POP_JUMP_IF_FALSE': 48 | next_addresses['implicit'] = addr + ins.size 49 | next_addresses['explicit'] = ins.arg 50 | 51 | elif ins.mnemonic == 'POP_JUMP_IF_TRUE': 52 | next_addresses['implicit'] = addr + ins.size 53 | next_addresses['explicit'] = ins.arg 54 | 55 | elif ins.mnemonic == 'CONTINUE_LOOP': 56 | next_addresses['explicit'] = ins.arg 57 | 58 | elif ins.mnemonic == 'FOR_ITER': 59 | next_addresses['implicit'] = addr + ins.size 60 | next_addresses['explicit'] = addr + ins.size + ins.arg 61 | 62 | elif ins.mnemonic == 'JUMP_FORWARD': 63 | next_addresses['explicit'] = addr + ins.size + ins.arg 64 | 65 | elif ins.mnemonic == 'RETURN_VALUE': 66 | pass 67 | 68 | else: 69 | next_addresses['implicit'] = addr + ins.size 70 | 71 | return next_addresses 72 | 73 | def get_ins_xref(self, ins, addr): 74 | """ 75 | An instruction may reference other instruction. 76 | Example: SETUP_EXCEPT exc_handler 77 | the exception handler is the xref. 
78 | """ 79 | xref_ins = (dis.opmap[x] for x in ('SETUP_LOOP', 'SETUP_EXCEPT', 'SETUP_FINALLY', 'SETUP_WITH')) 80 | if ins.opcode in xref_ins: 81 | return addr + ins.size + ins.arg 82 | else: 83 | return None 84 | 85 | def find_leaders(self): 86 | logger.debug('Finding leaders...') 87 | 88 | # A leader is a mark that identifies either the start or end of a basic block 89 | # address is the positional offset of the leader within the instruction bytes 90 | # type can be either S (starting) or E (ending) 91 | Leader = collections.namedtuple('leader', ['address', 'type']) 92 | 93 | # Set to contain all the leaders. We use a set to prevent duplicates 94 | leader_set = set() 95 | 96 | # The entrypoint is automatically the start of a basic block, and hence a start leader 97 | leader_set.add(Leader(self.entrypoint, 'S')) 98 | logger.debug('Start leader at {}'.format(self.entrypoint)) 99 | 100 | # Queue to contain list of addresses, from where linear sweep disassembling would start 101 | analysis_Q = Queue.Queue() 102 | 103 | # Start analysis from the entrypoint 104 | analysis_Q.put(self.entrypoint) 105 | 106 | # Already analyzed addresses must not be analyzed later, else we would get into an infinite loop 107 | # while processing instructions that branch backwards to an previously analyzed address. 108 | # The already_analyzed set would contains the addresses that have been previously encountered. 109 | already_analyzed = set() 110 | 111 | # Create the decoder 112 | dec = Decoder(self.insBytes) 113 | 114 | while not analysis_Q.empty(): 115 | addr = analysis_Q.get() 116 | 117 | while True: 118 | ins = dec.decode_at(addr) 119 | 120 | # Put the current address into the already_analyzed set 121 | already_analyzed.add(addr) 122 | 123 | # If current instruction is a return, stop disassembling further. 
124 | # current address is an end leader 125 | if ins.is_ret(): 126 | leader_set.add(Leader(addr, 'E')) 127 | logger.debug('End leader at {}'.format(addr)) 128 | break 129 | 130 | # If current instruction is control flow, stop disassembling further. 131 | # The current instruction is an end leader, its control flow target(s) are start leaders 132 | if ins.is_control_flow(): 133 | # Current instruction is an end leader 134 | leader_set.add(Leader(addr, 'E')) 135 | logger.debug('End leader at {}'.format(addr)) 136 | 137 | # The addresses where execution is expected to transfer are start leaders 138 | for target in self.get_next_ins_addresses(ins, addr).values(): 139 | leader_set.add(Leader(target, 'S')) 140 | logger.debug('Start leader at {}'.format(target)) 141 | 142 | # Put into analysis queue if not already analyzed 143 | if target not in already_analyzed: 144 | analysis_Q.put(target) 145 | break 146 | 147 | # Current instruction is not control flow 148 | else: 149 | # Get cross refs 150 | xref = self.get_ins_xref(ins, addr) 151 | nextAddress = self.get_next_ins_addresses(ins, addr).values() 152 | 153 | # Non control flow instruction should only have a single possible next address 154 | assert len(nextAddress) == 1 155 | 156 | # The immediate next instruction positionally 157 | addr = nextAddress[0] 158 | 159 | # If the instruction has xrefs, they are start leaders 160 | if xref is not None: 161 | leader_set.add(Leader(xref, 'S')) 162 | logger.debug('Start leader at {}'.format(xref)) 163 | 164 | # Put into analysis queue if not already analyzed 165 | if xref not in already_analyzed: 166 | analysis_Q.put(xref) 167 | 168 | # Comparator function to sort the leaders according to increasing offsets 169 | def __leaderSortFunc(elem1, elem2): 170 | if elem1.address != elem2.address: 171 | return elem1.address - elem2.address 172 | else: 173 | if elem1.type == 'S': 174 | return -1 175 | else: 176 | return 1 177 | 178 | logger.debug('Found {} 
leaders'.format(len(leader_set))) 179 | self.leaders = sorted(leader_set, cmp=__leaderSortFunc) 180 | 181 | def construct_basic_blocks(self): 182 | """ 183 | Once we have obtained the leaders, i.e. the boundaries where a basic block may start or end, 184 | we need to build the basic blocks by parsing the leaders. A basic block spans from the starting leader 185 | up to the immediate next end leader as per their addresses. 186 | """ 187 | logger.debug('Constructing basic blocks...') 188 | idx = 0 189 | dec = Decoder(self.insBytes) 190 | 191 | while idx < len(self.leaders): 192 | # Get a pair of leaders 193 | leader1, leader2 = self.leaders[idx], self.leaders[idx + 1] 194 | 195 | # Get the addresses of the respective leaders 196 | addr1, addr2 = leader1.address, leader2.address 197 | 198 | # Create a new basic block 199 | bb = BasicBlock() 200 | 201 | # Set the address of the basic block 202 | bb.address = addr1 203 | 204 | # The offset variable is used to track the position of the individual instructions within the basic block 205 | offset = 0 206 | 207 | # Add the basic block to the graph, tagging the one at the entrypoint with isEntry 208 | if addr1 == self.entrypoint: 209 | self.bb_graph.add_node(bb, isEntry=True) 210 | else: 211 | self.bb_graph.add_node(bb) 212 | 213 | # Redundant: the node was already added above (re-adding it keeps its attributes) 214 | self.bb_graph.add_node(bb) 215 | 216 | # Leader1 is start leader, leader2 is end leader 217 | # All instructions inclusive of leader1 and leader2 are part of this basic block 218 | if leader1.type == 'S' and leader2.type == 'E': 219 | logger.debug( 220 | 'Creating basic block {} spanning from {} to {}, both inclusive'.format(hex(id(bb)), 221 | leader1.address, 222 | leader2.address)) 223 | while addr1 + offset <= addr2: 224 | ins = dec.decode_at(addr1 + offset) 225 | bb.add_instruction(ins) 226 | offset += ins.size 227 | idx += 2 228 | 229 | # Both leader1 and leader2 are start leaders 230 | # Instructions inclusive of leader1 but exclusive of leader2 are part of this basic block 231 | elif 
leader1.type == 'S' and leader2.type == 'S': 232 | logger.debug( 233 | 'Creating basic block {} spanning from {} to {}, end exclusive'.format(hex(id(bb)), leader1.address, 234 | leader2.address)) 235 | while addr1 + offset < addr2: 236 | ins = dec.decode_at(addr1 + offset) 237 | bb.add_instruction(ins) 238 | offset += ins.size 239 | idx += 1 240 | 241 | logger.debug('{} basic blocks created'.format(self.bb_graph.number_of_nodes())) 242 | 243 | def find_bb_by_address(self, address): 244 | for bb in self.bb_graph.nodes(): 245 | if bb.address == address: 246 | return bb 247 | 248 | def build_bb_edges(self): 249 | """ 250 | The list of basic blocks forms a graph. The basic block themselves are the vertices with edges between them. 251 | Edges refer to the control flow between the basic block. 252 | """ 253 | logger.debug('Constructing edges between basic blocks...') 254 | 255 | for bb in self.bb_graph.nodes(): 256 | offset = 0 257 | 258 | for idx in xrange(len(bb.instructions)): 259 | ins = bb.instructions[idx] 260 | 261 | # If instruction has an xref, resolve it 262 | xref = self.get_ins_xref(ins, bb.address + offset) 263 | if xref is not None: 264 | xref_bb = self.find_bb_by_address(xref) 265 | ins.argval = xref_bb 266 | xref_bb.has_xrefs_to = True 267 | xref_bb.xref_instructions.append(ins) 268 | logger.debug('Basic block {} has xreference'.format(hex(id(bb)))) 269 | 270 | nextInsAddr = self.get_next_ins_addresses(ins, bb.address + offset) 271 | 272 | # Check of this is is the last instruction of this basic block. 273 | # This is required to construct edges 274 | if idx == len(bb.instructions) - 1: 275 | # A control flow instruction can be of two types: conditional and un-conditional. 276 | # An un-conditional control flow instruction can have only a single successor instruction 277 | # which is indicated by its argument. 
278 |                     if ins.is_unconditional():
279 |                         assert len(nextInsAddr) == 1 and nextInsAddr.has_key('explicit')
280 |                         target = nextInsAddr['explicit']
281 |                         targetBB = self.find_bb_by_address(target)
282 |                         ins.argval = targetBB
283 |
284 |                         # Add edge
285 |                         self.bb_graph.add_edge(bb, targetBB, edge_type='explicit')
286 |
287 |                         logger.debug(
288 |                             'Adding explicit edge from block {} to {}'.format(hex(id(bb)), hex(id(targetBB))))
289 |
290 |                     # A conditional control flow instruction has two possible successor instructions.
291 |                     # One is explicit and indicated by its argument, the other is implicit and is the
292 |                     # immediate next instruction according to address.
293 |                     elif ins.is_conditional():
294 |                         assert len(nextInsAddr) == 2
295 |                         # target1 is the implicit successor instruction, i.e. the immediate next instruction
296 |                         # target2 is the explicit successor instruction, i.e. the branch target indicated by the argument
297 |                         target1, target2 = nextInsAddr['implicit'], nextInsAddr['explicit']
298 |
299 |                         target1BB = self.find_bb_by_address(target1)
300 |                         target2BB = self.find_bb_by_address(target2)
301 |
302 |                         ins.argval = target2BB
303 |
304 |                         # Add the two edges
305 |                         self.bb_graph.add_edge(bb, target1BB, edge_type='implicit')
306 |                         self.bb_graph.add_edge(bb, target2BB, edge_type='explicit')
307 |
308 |                         logger.debug(
309 |                             'Adding implicit edge from block {} to {}'.format(hex(id(bb)), hex(id(target1BB))))
310 |
311 |                         logger.debug(
312 |                             'Adding explicit edge from block {} to {}'.format(hex(id(bb)), hex(id(target2BB))))
313 |
314 |                     # RETURN_VALUE
315 |                     elif ins.is_ret():
316 |                         nx.set_node_attributes(self.bb_graph, {bb: {'isTerminal': True}})
317 |                         # Does not have any successors
318 |                         assert len(nextInsAddr) == 0
319 |
320 |                     # The last instruction does not have an explicit control flow
321 |                     else:
322 |                         assert len(nextInsAddr) == 1 and nextInsAddr.has_key('implicit')
323 |                         nextBB = self.find_bb_by_address(nextInsAddr['implicit'])
324 |
325 |                         # Add edge
326 |                         self.bb_graph.add_edge(bb, nextBB, edge_type='implicit')
327 |
328 |                 offset += ins.size
329 |
--------------------------------------------------------------------------------
/images/after-forwarder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/after-forwarder.png
--------------------------------------------------------------------------------
/images/after-merge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/after-merge.png
--------------------------------------------------------------------------------
/images/before-forwarder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/before-forwarder.png
--------------------------------------------------------------------------------
/images/before-merge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/before-merge.png
--------------------------------------------------------------------------------
/instruction.py:
--------------------------------------------------------------------------------
1 | import dis
2 |
3 |
4 | class Instruction:
5 |     """
6 |     This class represents an instruction.
7 |     """
8 |
9 |     def __init__(self, opcode, arg, size):
10 |         # Numeric code for operation, corresponding to the opcode values
11 |         self.opcode = opcode
12 |
13 |         # Numeric argument to operation (if any), otherwise None
14 |         self.arg = arg
15 |
16 |         # The size of the instruction including the argument
17 |         self.size = size
18 |
19 |         # Resolved arg value (if known), otherwise same as arg
20 |         self.argval = arg
21 |
22 |         # Human readable name for operation
23 |         self.mnemonic = dis.opname[self.opcode]
24 |
25 |     def is_opcode_valid(self):
26 |         """
27 |         Checks whether the instruction is legal. A legal instruction has an opcode
28 |         which is understood by the CPython VM.
29 |         """
30 |         return self.opcode in dis.opmap.values()
31 |
32 |     def is_ret(self):
33 |         """
34 |         Checks whether the instruction is a return
35 |         :return:
36 |         """
37 |         return self.opcode == dis.opmap['RETURN_VALUE']
38 |
39 |     def is_control_flow(self):
40 |         """
41 |         Checks whether the instruction causes a change of control flow.
42 |         A control flow instruction can be conditional or unconditional
43 |         :return:
44 |         """
45 |         # All control flow instructions
46 |         cfIns = (
47 |             'JUMP_IF_FALSE_OR_POP', 'JUMP_IF_TRUE_OR_POP', 'JUMP_ABSOLUTE', 'POP_JUMP_IF_FALSE', 'POP_JUMP_IF_TRUE',
48 |             'CONTINUE_LOOP', 'FOR_ITER', 'JUMP_FORWARD')
49 |         return self.mnemonic in cfIns
50 |
51 |     def is_conditional(self):
52 |         """
53 |         Checks whether the instruction is a conditional control flow instruction.
54 |         A conditional control flow instruction has two possible successor instructions.
55 |         """
56 |         conditionalIns = (
57 |             'JUMP_IF_FALSE_OR_POP', 'JUMP_IF_TRUE_OR_POP', 'POP_JUMP_IF_FALSE', 'POP_JUMP_IF_TRUE', 'FOR_ITER')
58 |         return self.is_control_flow() and self.mnemonic in conditionalIns
59 |
60 |     def is_unconditional(self):
61 |         """
62 |         Checks whether the instruction is an unconditional control flow instruction.
63 |         An unconditional control flow instruction has exactly one successor instruction.
64 | """ 65 | unconditionalIns = ('JUMP_ABSOLUTE', 'JUMP_FORWARD', 'CONTINUE_LOOP') 66 | return self.is_control_flow() and self.mnemonic in unconditionalIns 67 | 68 | def has_xref(self): 69 | """ 70 | Checks whether the instruction has xreferences. 71 | """ 72 | return self.mnemonic in ('SETUP_LOOP', 'SETUP_EXCEPT', 'SETUP_FINALLY', 'SETUP_WITH') 73 | 74 | def assemble(self): 75 | if self.size == 1: 76 | return chr(self.opcode) 77 | else: 78 | return chr(self.opcode) + chr(self.arg & 0xFF) + chr((self.arg >> 8) & 0xFF) 79 | 80 | def __str__(self): 81 | return '{} {} {}'.format(self.opcode, self.mnemonic, self.arg) 82 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | import marshal 4 | import logging 5 | import types 6 | 7 | logging.basicConfig(level=logging.DEBUG) 8 | logger = logging.getLogger(__name__) 9 | 10 | from deobfuscator import deobfuscate 11 | 12 | 13 | def parse_code_object(codeObject): 14 | logger.info('Processing code object {}'.format(codeObject.co_name).encode('string_escape')) 15 | co_argcount = codeObject.co_argcount 16 | co_nlocals = codeObject.co_nlocals 17 | co_stacksize = codeObject.co_stacksize 18 | co_flags = codeObject.co_flags 19 | 20 | co_codestring = deobfuscate(codeObject.co_code) 21 | logger.info('Successfully deobfuscated code object {}'.format(codeObject.co_name).encode('string_escape')) 22 | co_names = codeObject.co_names 23 | 24 | co_varnames = codeObject.co_varnames 25 | co_filename = codeObject.co_filename 26 | co_name = codeObject.co_name 27 | co_firstlineno = codeObject.co_firstlineno 28 | co_lnotab = codeObject.co_lnotab 29 | 30 | logger.info('Collecting constants for code object {}'.format(codeObject.co_name).encode('string_escape')) 31 | mod_const = [] 32 | for const in codeObject.co_consts: 33 | if isinstance(const, types.CodeType): 34 | logger.info( 35 | 'Code object {} 
contains embedded code object {}'.format(codeObject.co_name, const.co_name).encode( 36 | 'string_escape')) 37 | mod_const.append(parse_code_object(const)) 38 | else: 39 | mod_const.append(const) 40 | co_constants = tuple(mod_const) 41 | 42 | logger.info('Generating new code object for {}'.format(codeObject.co_name).encode('string_escape')) 43 | return types.CodeType(co_argcount, co_nlocals, co_stacksize, co_flags, 44 | co_codestring, co_constants, co_names, co_varnames, 45 | co_filename, co_name, co_firstlineno, co_lnotab) 46 | 47 | 48 | def process(ifile, ofile): 49 | logger.info('Opening file ' + ifile) 50 | ifPtr = open(ifile, 'rb') 51 | header = ifPtr.read(8) 52 | if not header.startswith('\x03\xF3\x0D\x0A'): 53 | raise SystemExit('[!] Header mismatch. The input file is not a valid pyc file.') 54 | logger.info('Input pyc file header matched') 55 | logger.debug('Unmarshalling file') 56 | rootCodeObject = marshal.load(ifPtr) 57 | ifPtr.close() 58 | deob = parse_code_object(rootCodeObject) 59 | logger.info('Writing deobfuscated code object to disk') 60 | ofPtr = open(ofile, 'wb') 61 | ofPtr.write(header) 62 | marshal.dump(deob, ofPtr) 63 | ofPtr.close() 64 | logger.info('Success') 65 | 66 | 67 | if __name__ == '__main__': 68 | parser = argparse.ArgumentParser() 69 | parser.add_argument('-i', '--ifile', help='Input pyc file name', required=True) 70 | parser.add_argument('-o', '--ofile', help='Output pyc file name', required=True) 71 | args = parser.parse_args() 72 | process(args.ifile, args.ofile) 73 | -------------------------------------------------------------------------------- /simplifier.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import networkx as nx 3 | 4 | logger = logging.getLogger(__name__) 5 | 6 | 7 | class Simplifier: 8 | def __init__(self, bb_graph): 9 | """ 10 | 11 | :param bb_graph: The basic block graph 12 | :type bb_graph: nx.DiGraph 13 | """ 14 | self.bb_graph = bb_graph 15 | 16 | 
def eliminate_forwarders(self): 17 | """ 18 | Eliminates a basic block that acts as a forwarder, i.e. only consists of a single un-conditional 19 | control flow instructions. 20 | """ 21 | numEliminated = 0 22 | logger.debug('Eliminating forwarders...') 23 | 24 | # flag variable to indicate whether a basic block was eliminated in a pass 25 | bb_eliminated = True 26 | 27 | # Loop until no basic block can be eliminated any more 28 | while bb_eliminated: 29 | bb_eliminated = False 30 | for bb in self.bb_graph.nodes(): 31 | # Must have a single instruction 32 | if len(bb.instructions) == 1: 33 | ins = bb.instructions[0] 34 | if ins.mnemonic == 'JUMP_ABSOLUTE' or ins.mnemonic == 'JUMP_FORWARD': 35 | # Must have a single successor 36 | assert self.bb_graph.out_degree(bb) == 1 37 | 38 | forwarderBB = bb 39 | forwardedBB = list(self.bb_graph.successors(bb))[0] 40 | 41 | # Check if forwardedBB has atleast one implicit in edge 42 | forwardedBB_in_edge_exists = len(filter(lambda edge: edge[2]['edge_type'] == 'implicit', 43 | self.bb_graph.in_edges(forwardedBB, data=True))) > 0 44 | 45 | # Check if forwarderBB has atleast one implicit in edge 46 | forwarderBB_in_edge_exists = len(filter(lambda edge: edge[2]['edge_type'] == 'implicit', 47 | self.bb_graph.in_edges(forwarderBB, data=True))) > 0 48 | 49 | # Cannot delete block 50 | if forwardedBB_in_edge_exists and forwarderBB_in_edge_exists: 51 | continue 52 | 53 | # Remove the edge between forwarder and forwarded 54 | self.bb_graph.remove_edge(forwarderBB, forwardedBB) 55 | 56 | # Iterate over the predecessors of the forwarder 57 | for predecessorBB in list(self.bb_graph.predecessors(forwarderBB)): 58 | # Get existing edge type 59 | e_type = self.bb_graph.get_edge_data(predecessorBB, forwarderBB)['edge_type'] 60 | 61 | # Remove the edge between the predecessor and the forwarder 62 | self.bb_graph.remove_edge(predecessorBB, forwarderBB) 63 | 64 | # Add edge between the predecessor and the forwarded block 65 | 
self.bb_graph.add_edge(predecessorBB, forwardedBB, edge_type=e_type) 66 | 67 | logger.info('Adding {} edge from block {} to {}'.format(e_type, hex(id(predecessorBB)), 68 | hex(id(forwardedBB)))) 69 | 70 | # Get last instruction of the predecessor 71 | last_ins = predecessorBB.instructions[-1] 72 | 73 | # Check if the last instruction of the predecessor points to the forwarder 74 | if last_ins.argval == forwarderBB: 75 | # Change the xref to the forwarded 76 | last_ins.argval = forwardedBB 77 | 78 | # Check if the forwarder has xrefs, if so patch them appropriately 79 | if forwarderBB.has_xrefs_to: 80 | forwardedBB.has_xrefs_to = True 81 | for xref_ins in forwarderBB.xref_instructions: 82 | xref_ins.argval = forwardedBB 83 | forwardedBB.xref_instructions.append(xref_ins) 84 | 85 | logger.debug('Forwarder basic block {} eliminated'.format(hex(id(bb)))) 86 | 87 | # There must not be any edges left 88 | assert self.bb_graph.degree(forwarderBB) == 0 89 | 90 | # Remove the node from the graph 91 | self.bb_graph.remove_node(forwarderBB) 92 | del forwarderBB 93 | bb_eliminated = True 94 | numEliminated += 1 95 | break 96 | logger.info('{} basic blocks eliminated'.format(numEliminated)) 97 | 98 | def merge_basic_blocks(self): 99 | """ 100 | Merges a basic block into its predecessor iff the basic block has exactly one predecessor 101 | and the predecessor has this basic block as its lone successor 102 | 103 | :param bb_graph: A graph of basic blocks 104 | :type bb_graph: nx.DiGraph 105 | :returns: The simplified graph of basic blocks 106 | :rtype: nx.DiGraph 107 | """ 108 | numMerged = 0 109 | logger.debug('Merging basic blocks...') 110 | # flag variable to indicate whether a basic block was eliminated in a pass 111 | bb_merged = True 112 | 113 | # Loop until no basic block can be eliminated any more 114 | while bb_merged: 115 | bb_merged = False 116 | for bb in self.bb_graph.nodes(): 117 | # The basic block should not have any xrefs and must have exactly one predecessor 118 
|                 if not bb.has_xrefs_to and self.bb_graph.in_degree(bb) == 1:
119 |                     predecessorBB = list(self.bb_graph.predecessors(bb))[0]
120 |
121 |                     # Predecessor basic block must have exactly one successor, and that successor must be bb
122 |                     if self.bb_graph.out_degree(predecessorBB) == 1 and list(self.bb_graph.successors(predecessorBB))[0] == bb:
123 |                         # The predecessor block will be the merged block
124 |                         mergedBB = predecessorBB
125 |
126 |                         # Get the last instruction
127 |                         last_ins = mergedBB.instructions[-1]
128 |
129 |                         # Check if the last instruction is an un-conditional jump
130 |                         if last_ins.mnemonic == 'JUMP_FORWARD' or last_ins.mnemonic == 'JUMP_ABSOLUTE':
131 |                             # Remove the instruction as it is unnecessary after the blocks are merged
132 |                             del mergedBB.instructions[-1]
133 |
134 |                         # Merge the block by adding all instructions
135 |                         for ins in bb.instructions:
136 |                             mergedBB.add_instruction(ins)
137 |
138 |                         # If bb is a terminal node, mark the mergedBB as terminal too
139 |                         if bb in nx.get_node_attributes(self.bb_graph, 'isTerminal').keys():
140 |                             nx.set_node_attributes(self.bb_graph, {mergedBB: {'isTerminal': True}})
141 |
142 |                         # Remove the edge
143 |                         self.bb_graph.remove_edge(mergedBB, bb)
144 |
145 |                         # Iterate over a copy, since edges are removed inside the loop
146 |                         for successorBB in list(self.bb_graph.successors(bb)):
147 |                             # Get existing type
148 |                             e_type = self.bb_graph.get_edge_data(bb, successorBB)['edge_type']
149 |                             self.bb_graph.add_edge(mergedBB, successorBB, edge_type=e_type)
150 |                             logger.info('Adding {} edge from block {} to {}'.format(e_type, hex(id(mergedBB)),
151 |                                                                                     hex(id(successorBB))))
152 |                             self.bb_graph.remove_edge(bb, successorBB)
153 |
154 |                         logger.debug('Basic block {} merged with block {}'.format(hex(id(bb)), hex(id(mergedBB))))
155 |                         assert self.bb_graph.degree(bb) == 0
156 |                         self.bb_graph.remove_node(bb)
157 |                         del bb
158 |                         bb_merged = True
159 |                         numMerged += 1
160 |                         break
161 |         logger.info('{} basic blocks merged.'.format(numMerged))
162 |
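The merge rule used by `merge_basic_blocks` can be illustrated without networkx. The following is only a minimal sketch, not the repository's code: it applies the same condition (a block with exactly one predecessor whose only successor is that block) to plain dictionaries. The function name `merge_linear_blocks`, the string representation of instructions and the Python 3 syntax are assumptions made for this example; xref tracking and edge types are deliberately left out.

```python
def merge_linear_blocks(blocks, succ):
    """Merge every block into its sole predecessor when that predecessor
    has no other successor.

    blocks: dict mapping a block name to a list of instruction strings
    succ:   dict mapping a block name to a list of successor block names
    (Both structures are hypothetical stand-ins for the basic block graph.)
    """
    # Work on copies so the caller's data is left untouched
    blocks = {name: list(ins) for name, ins in blocks.items()}
    succ = {name: list(targets) for name, targets in succ.items()}

    changed = True
    while changed:
        changed = False

        # Recompute the predecessor lists on every pass
        preds = {name: [] for name in blocks}
        for src, targets in succ.items():
            for dst in targets:
                preds[dst].append(src)

        for name in list(blocks):
            # The block must have exactly one predecessor
            if len(preds[name]) != 1:
                continue
            pred = preds[name][0]

            # The predecessor must have this block as its lone successor
            if pred == name or succ[pred] != [name]:
                continue

            # A trailing unconditional jump is redundant after merging
            if blocks[pred] and blocks[pred][-1].split()[0] in ('JUMP_FORWARD', 'JUMP_ABSOLUTE'):
                blocks[pred].pop()

            # Move the instructions and the outgoing edges to the predecessor
            blocks[pred].extend(blocks.pop(name))
            succ[pred] = succ.pop(name)
            changed = True
            break

    return blocks, succ
```

As in the original method, the loop restarts after every merge, because merging changes the set of blocks being iterated over.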
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/utils/__init__.py -------------------------------------------------------------------------------- /utils/rendergraph.py: -------------------------------------------------------------------------------- 1 | import pydotplus 2 | import networkx as nx 3 | 4 | 5 | def render_bb(bb, is_entry, is_terminal): 6 | dot = hex(id(bb)) + ':\l\l' 7 | 8 | for ins in bb.instruction_iter(): 9 | dot += ins.mnemonic.ljust(22) 10 | 11 | if ins.is_control_flow() or ins.has_xref(): 12 | dot += hex(id(ins.argval)) 13 | elif ins.arg is not None: 14 | dot += str(ins.arg) 15 | dot += '\l' 16 | 17 | if not is_entry and not is_terminal: 18 | return pydotplus.Node(hex(id(bb)), label=dot, shape='box', fontname='Consolas') 19 | 20 | if is_entry: 21 | return pydotplus.Node(hex(id(bb)), label=dot, shape='box', style='filled', color='cyan', fontname='Consolas') 22 | 23 | return pydotplus.Node(hex(id(bb)), label=dot, shape='box', style='filled', color='orange', fontname='Consolas') 24 | 25 | 26 | def render_graph(bb_graph, filename): 27 | """ 28 | Renders a basic block graph to file 29 | 30 | :param bb_graph: The Graph to render 31 | :type bb_graph: networkx.DiGraph 32 | """ 33 | graph = pydotplus.Dot(graph_type='digraph', rankdir='TB') 34 | entryblock = nx.get_node_attributes(bb_graph, 'isEntry').keys()[0] 35 | returnblocks = nx.get_node_attributes(bb_graph, 'isTerminal').keys() 36 | 37 | nodedict = {} 38 | 39 | for bb in bb_graph.nodes(): 40 | node = render_bb(bb, bb == entryblock, bb in returnblocks) 41 | if bb == entryblock: 42 | sub = pydotplus.Subgraph('sub', rank='source') 43 | sub.add_node(node) 44 | graph.add_subgraph(sub) 45 | else: 46 | graph.add_node(node) 47 | nodedict[bb] = node 48 | 49 | for edge in bb_graph.edges(data=True): 50 | src = nodedict[edge[0]] 51 | dest = nodedict[edge[1]] 52 | 
e_style = 'dashed' if edge[2]['edge_type'] == 'implicit' else 'solid' 53 | 54 | graph.add_edge(pydotplus.Edge(src, dest, style=e_style)) 55 | # graph.set('splines', 'ortho') 56 | # graph.set_prog('neato') 57 | # graph.set('dpi', '100') 58 | 59 | graph.write(filename, format='svg') 60 | -------------------------------------------------------------------------------- /utils/striplineno.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | import marshal 4 | import types 5 | 6 | 7 | def process(co): 8 | co_constants = [] 9 | for const in co.co_consts: 10 | if isinstance(const, types.CodeType): 11 | co_constants.append(process(const)) 12 | else: 13 | co_constants.append(const) 14 | 15 | return types.CodeType(co.co_argcount, co.co_nlocals, co.co_stacksize, co.co_flags, 16 | co.co_code, tuple(co_constants), co.co_names, co.co_varnames, 17 | co.co_filename, co.co_name, 1, '') 18 | 19 | 20 | def main(): 21 | print sys.argv[1] 22 | fn = sys.argv[1] 23 | inf = open(fn, 'rb') 24 | header = inf.read(4) 25 | assert header == '\x03\xf3\x0d\x0a' 26 | inf.read(4) # Discard 27 | co = marshal.load(inf) 28 | inf.close() 29 | outf = open('noline.pyc', 'wb') 30 | outf.write('\x03\xf3\x0d\x0a\0\0\0\0') 31 | marshal.dump(process(co), outf) 32 | outf.close() 33 | 34 | 35 | if __name__ == '__main__': 36 | main() 37 | -------------------------------------------------------------------------------- /verifier.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import networkx as nx 3 | 4 | logger = logging.getLogger(__name__) 5 | 6 | 7 | def verify_graph(bb_graph): 8 | """ 9 | Verify the graph for correctness 10 | 11 | :param bb_graph: 12 | :type bb_graph: nx.DiGraph 13 | """ 14 | try: 15 | # There must exists exactly one entry point 16 | numEntryPoint = len(nx.get_node_attributes(bb_graph, 'isEntry')) 17 | if numEntryPoint != 1: 18 | logger.error('Basic block graph has {} 
entrypoint(s)'.format(numEntryPoint))
19 |             raise Exception
20 |
21 |         # The entrypoint must have an in degree of zero
22 |         i_degree_entry = bb_graph.in_degree(nx.get_node_attributes(bb_graph, 'isEntry').keys()[0])
23 |
24 |         if i_degree_entry != 0:
25 |             logger.error('The entry point basic block has an in degree of {}'.format(i_degree_entry))
26 |             raise Exception
27 |
28 |         for bb in bb_graph.nodes():
29 |             o_degree = bb_graph.out_degree(bb)
30 |             # A basic block can have 0, 1 or 2 successors
31 |             if o_degree > 2:
32 |                 logger.error('Basic block {} has an out degree of {}'.format(hex(id(bb)), o_degree))
33 |                 raise Exception
34 |
35 |             # A basic block having an out degree of 0 must have a RETURN_VALUE as the last instruction
36 |             if o_degree == 0:
37 |                 if bb.instructions[-1].mnemonic != 'RETURN_VALUE':
38 |                     logger.error('Basic block {} has an out degree of zero, but does not end with RETURN_VALUE'.format(
39 |                         hex(id(bb))))
40 |                     raise Exception
41 |
42 |             # A basic block having an out degree of 2 cannot have both out edges of the same type (explicit or implicit)
43 |             if o_degree == 2:
44 |                 o_edges = list(bb_graph.out_edges(bb, data=True))
45 |                 if o_edges[0][2]['edge_type'] == 'explicit' and o_edges[1][2]['edge_type'] == 'explicit':
46 |                     logger.error('Basic block {} has both out edges of explicit type'.format(hex(id(bb))))
47 |                     raise Exception
48 |                 if o_edges[0][2]['edge_type'] == 'implicit' and o_edges[1][2]['edge_type'] == 'implicit':
49 |                     logger.error('Basic block {} has both out edges of implicit type'.format(hex(id(bb))))
50 |                     raise Exception
51 |
52 |             i_degree = bb_graph.in_degree(bb)
53 |
54 |             # If in degree is greater than zero
55 |             if i_degree > 0:
56 |                 numImplicitEdges = 0
57 |                 for edge in bb_graph.in_edges(bb, data=True):
58 |                     if edge[2]['edge_type'] == 'implicit':
59 |                         numImplicitEdges += 1
60 |
61 |                 if numImplicitEdges > 1:
62 |                     logger.error('Basic block {} has {} implicit in edges'.format(hex(id(bb)), numImplicitEdges))
63 |                     raise Exception
64 |
65 |             if i_degree == o_degree == 0:
66 |                 logger.error('Orphaned block {} has no edges'.format(hex(id(bb))))
67 |
68 |     except Exception as ex:
69 |         print ex
70 |         return False
71 |     return True
72 |
--------------------------------------------------------------------------------
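The edge rules enforced by `verify_graph` can be summarised in a few lines. The sketch below is an illustration only, under the assumption that the graph is given as a plain list of `(src, dst, edge_type)` tuples rather than a `networkx.DiGraph`; the helper name `check_edges` and the Python 3 syntax are hypothetical and not part of the repository. It covers three of the checks: at most two out edges per block, no two out edges of the same type, and at most one implicit in edge per block.

```python
def check_edges(edges):
    """Return a list of rule violations for a list of
    (src, dst, edge_type) tuples, where edge_type is
    'explicit' or 'implicit'. An empty list means no violation."""
    problems = []
    out_types = {}    # block -> edge types of its out edges
    implicit_in = {}  # block -> number of implicit in edges

    for src, dst, edge_type in edges:
        out_types.setdefault(src, []).append(edge_type)
        if edge_type == 'implicit':
            implicit_in[dst] = implicit_in.get(dst, 0) + 1

    for block, types in sorted(out_types.items()):
        # A basic block can have 0, 1 or 2 successors
        if len(types) > 2:
            problems.append('{} has an out degree of {}'.format(block, len(types)))
        # With two successors, one edge is explicit and the other implicit
        elif len(types) == 2 and types[0] == types[1]:
            problems.append('{} has both out edges of {} type'.format(block, types[0]))

    for block, count in sorted(implicit_in.items()):
        # Only one block can fall through into a given block
        if count > 1:
            problems.append('{} has {} implicit in edges'.format(block, count))

    return problems
```

Collecting the violations in a list, instead of raising on the first one as the original does, is a design choice of this sketch that makes it easier to see every broken rule at once.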