├── .gitignore
├── README.md
├── assembler.py
├── basicblock.py
├── decoder.py
├── deobfuscator.py
├── disassembler.py
├── images
    ├── after-forwarder.png
    ├── after-merge.png
    ├── before-forwarder.png
    └── before-merge.png
├── instruction.py
├── main.py
├── simplifier.py
├── utils
    ├── __init__.py
    ├── rendergraph.py
    └── striplineno.py
└── verifier.py


/.gitignore:
--------------------------------------------------------------------------------
1 | .idea/
2 | *.pyc
3 | *.svg


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # Bytecode Simplifier
  2 | 
  3 | Bytecode Simplifier is a tool to deobfuscate PjOrion-protected Python scripts.
  4 | This is a complete rewrite of my older tool [PjOrion Deobfuscator](https://github.com/extremecoders-re/PjOrion-Deobfuscator)
  5 | 
  6 | ## Pre-requisites
  7 | 
  8 | You need to have the following packages pre-installed:
  9 | - [networkx](http://networkx.github.io/)
 10 | - [pydotplus](http://pydotplus.readthedocs.io/)
 11 | 
 12 | Both packages are `pip` installable. Additionally, make sure the Graphviz executable is in your `PATH` for `pydotplus` to work.
 13 | `pydotplus` is required only for drawing graphs. If you do not need this feature, comment out the `render_graph` calls in the `deobfuscate` function in [deobfuscator.py](deobfuscator.py).
 14 | 
 15 | ## Statutory Warning 
 16 | 
 17 | PjOrion obfuscates the original file and introduces several wrapper layers on top of it. The purpose of these layers is simply to (sort of) decrypt the next inner layer and execute it via an `EXEC_STMT` instruction. Hence you CANNOT use this tool as-is on an obfuscated file. First, you would need to remove the wrapper layers and get hold of the actual obfuscated code object. Then you can marshal the obfuscated code to disk and run this tool on it which should hopefully give you back the deobfuscated code.
 18 | 
 19 | Refer to [this](https://0xec.blogspot.com/2017/07/deobfuscating-pjorion-using-bytecode.html) blog post for details.
 20 | 
 21 | ## Implementation details
 22 | 
 23 | ### Analysis
 24 | 
 25 | Bytecode simplifier analyzes obfuscated code using a recursive traversal disassembly approach. 
 26 | The instructions are disassembled and combined into basic blocks. A basic block is a sequence of instructions with a single entry and a single exit.
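
The idea of recursive traversal can be sketched on a toy program. This is only an illustration, not the tool's disassembler: the `PROGRAM` table, the one-unit instruction size, and the mnemonics `LOAD`, `JUNK`, `JUMP`, `JUMP_IF_FALSE` and `RETURN` are all made up for the example. The sketch follows control flow from the entry point with a work queue, records where basic blocks start, and never decodes bytes that no path reaches.

```python
from collections import deque

# Toy program: address -> (mnemonic, jump target). Every instruction is one
# unit long, and all mnemonics here are hypothetical.
PROGRAM = {
    0: ('LOAD', None),
    1: ('JUMP', 3),
    2: ('JUNK', None),          # no path reaches this address
    3: ('JUMP_IF_FALSE', 5),
    4: ('RETURN', None),
    5: ('RETURN', None),
}
BLOCK_ENDERS = ('JUMP', 'JUMP_IF_FALSE', 'RETURN')


def successors(program, addr):
    """Addresses that may execute right after the instruction at addr."""
    mnemonic, target = program[addr]
    if mnemonic == 'RETURN':
        return []
    if mnemonic == 'JUMP':
        return [target]
    if mnemonic == 'JUMP_IF_FALSE':
        return [target, addr + 1]
    return [addr + 1]


def traverse(program, entry):
    """Return (sorted block start addresses, sorted reached addresses)."""
    starts, seen = {entry}, set()
    queue = deque([entry])
    while queue:
        addr = queue.popleft()
        # Walk straight-line code until a block-ending instruction
        # or an address that was already analyzed.
        while addr not in seen:
            seen.add(addr)
            nxt = successors(program, addr)
            if program[addr][0] in BLOCK_ENDERS:
                starts.update(nxt)   # every branch target starts a block
                queue.extend(nxt)
                break
            addr = nxt[0]
    return sorted(starts), sorted(seen)
```

Here `traverse(PROGRAM, 0)` reports block starts at 0, 3, 4 and 5 and never visits address 2, which is how junk bytes placed between real instructions get skipped.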
 27 | 
 28 | ### Representation
 29 | 
 30 | A basic block may end with a control flow instruction such as an `if` statement. An `if` statement has two branches - true and false. 
 31 | Each branch corresponds to an edge between two basic blocks. This entire structure is represented by a [Directed Graph](https://en.wikipedia.org/wiki/Directed_graph).
 32 | We use [`networkx.DiGraph`](https://networkx.github.io/documentation/development/reference/classes.digraph.html) for this purpose.
 33 | 
 34 | The nodes in the graph represent the basic blocks whereas the edges between the nodes represent the control flow between the basic blocks.
 35 | 
 36 | A conditional control transfer instruction like `if` has two branches: the true branch, which is executed when the condition holds, and the false branch, which is executed otherwise. We tag the true branch as the explicit edge because it is explicitly specified by the `if` statement. The false branch is tagged as an implicit edge because it is logically deduced.
 37 | 
 38 | An unconditional control transfer instruction like `jump` has only an explicit edge.
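
A minimal sketch of edge tagging, assuming a plain dictionary instead of `networkx.DiGraph` so that it has no dependencies. The block names and the `add_edge`/`out_edges` helpers are hypothetical. The edge leading to the jump target is tagged explicit and the fall-through edge is tagged implicit.

```python
# Edges are stored as {source: {destination: edge_type}}.
# Block names are invented for this example.
edges = {}


def add_edge(src, dst, edge_type):
    """Record a control flow edge tagged 'explicit' or 'implicit'."""
    assert edge_type in ('explicit', 'implicit')
    edges.setdefault(src, {})[dst] = edge_type


def out_edges(src, edge_type):
    """Successors of src reached through edges of the given type."""
    return sorted(dst for dst, kind in edges.get(src, {}).items()
                  if kind == edge_type)


# 'cond' ends with a conditional jump: the jump target is the explicit
# edge, the block that follows sequentially is the implicit edge.
add_edge('cond', 'else_body', 'explicit')
add_edge('cond', 'then_body', 'implicit')

# 'then_body' ends with an unconditional jump: one explicit edge only.
add_edge('then_body', 'exit', 'explicit')
```

With a `networkx.DiGraph` the same tag would be stored as an edge attribute, for example `graph.add_edge(src, dst, edge_type='explicit')`.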
 39 | 
 40 | ### De-obfuscation
 41 | 
 42 | Once we have the code in a graph form it is easier to reason about the code. Hence simplification of the obfuscated code is done on this graph.
 43 | 
 44 | Two simplification strategies are principally used:
 45 |  - Forwarder elimination
 46 |  - Basic block merging
 47 | 
 48 | #### Forwarder elimination
 49 | A forwarder is a basic block that consists of a single unconditional control flow instruction. A forwarder transfers execution to some other basic block without doing any useful work. 
 50 | 
 51 | **Before forwarder elimination**
 52 | 
 53 | ![](images/before-forwarder.png)
 54 | 
 55 | The basic block highlighted in yellow is a forwarder block. It can be eliminated to give the structure below.
 56 | 
 57 | **After forwarder elimination**
 58 | 
 59 | ![](images/after-forwarder.png)
 60 | 
 61 | However, a forwarder cannot always be eliminated, specifically in cases where the forwarder has implicit in-edges. Refer to [simplifier.py](simplifier.py) for the details.
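
Forwarder elimination can be sketched as follows. This is a conservative simplification and not the logic of `simplifier.py`: it assumes a dictionary graph of the form `{source: {destination: edge_type}}`, and it refuses to touch any forwarder that has an implicit in-edge. The function name `eliminate_forwarder` is hypothetical.

```python
def eliminate_forwarder(edges, forwarder):
    """
    Remove a forwarder block and re-point its predecessors at its target.

    edges has the shape {source: {destination: edge_type}}.
    Returns True if the block was removed, False if it was left alone.
    """
    outs = edges.get(forwarder, {})

    # A forwarder has exactly one out-edge, and it is an explicit jump.
    if list(outs.values()) != ['explicit']:
        return False
    target = next(iter(outs))

    preds = [src for src, dsts in edges.items() if forwarder in dsts]

    # Assumption of this sketch: a block reached by fall-through
    # (an implicit in-edge) is never removed.
    if any(edges[p][forwarder] == 'implicit' for p in preds):
        return False

    # Each predecessor now jumps straight to the forwarder's target,
    # keeping the type of its original edge.
    for p in preds:
        edges[p][target] = edges[p].pop(forwarder)

    del edges[forwarder]
    return True
```

For example, with `{'a': {'fwd': 'explicit'}, 'fwd': {'b': 'explicit'}, 'b': {}}` the call `eliminate_forwarder(edges, 'fwd')` leaves `{'a': {'b': 'explicit'}, 'b': {}}`.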
 62 | 
 63 | #### Basic block merging
 64 | 
 65 | A basic block can be merged onto its predecessor if and only if there is exactly one predecessor and the predecessor has this basic block as its lone successor.
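
The merge condition can be sketched on the same kind of dictionary graph. This is an illustration under assumptions, not the code in `simplifier.py`: instructions are plain mnemonic strings, and the helpers `can_merge` and `merge_into_predecessor` are hypothetical names.

```python
def predecessors(edges, block):
    """All blocks that have an edge into the given block."""
    return [src for src, dsts in edges.items() if block in dsts]


def can_merge(edges, block):
    """True if block has exactly one predecessor whose only successor is block."""
    preds = predecessors(edges, block)
    return (len(preds) == 1
            and preds[0] != block
            and list(edges[preds[0]]) == [block])


def merge_into_predecessor(edges, instructions, block):
    """
    Append block's instructions to its predecessor and remove block.

    edges:        {source: {destination: edge_type}}
    instructions: {block: [mnemonic, ...]}
    """
    if not can_merge(edges, block):
        return False
    pred = predecessors(edges, block)[0]

    # The jump that linked the two blocks is no longer needed.
    if instructions[pred] and instructions[pred][-1].startswith('JUMP'):
        instructions[pred].pop()

    instructions[pred].extend(instructions.pop(block))

    # The merged block inherits the out-edges of the block it absorbed.
    edges[pred] = edges.pop(block)
    return True
```

Merging is usually repeated until no block satisfies `can_merge` any more, since one merge can make the next one possible.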
 66 | 
 67 | **Before merging**
 68 | 
 69 | ![](images/before-merge.png)
 70 | 
 71 | The highlighted instructions of the basic blocks have been merged to form a bigger basic block shown below. Control transfer instructions have been removed as they are not needed anymore.
 72 | 
 73 | **After merging**
 74 | 
 75 | ![](images/after-merge.png)
 76 | 
 77 | 
 78 | ### Assembly 
 79 | 
 80 | Once the graph has been simplified, we need to assemble the basic blocks back into flat code. This is implemented in the file [assembler.py](assembler.py). 
 81 | 
 82 | The assembly process in itself consists of sub-stages:
 83 | 
 84 | 1. **DFS**: Do a depth-first search of the graph to determine the order of the basic blocks when they are laid out sequentially. (This is done in a post-order fashion)
 85 | 2. **Removal of redundant jumps**: If basic block A has a jump instruction to block B, but block B is immediately located after A,  then the jump instruction can safely be removed. This feature is ***experimental*** and is known to break decompilers. However, the functionality of code is not affected. The advantage of this feature is it reduces code size. This feature is disabled by default.
 86 | 3. **Changing `JUMP_FORWARD` to `JUMP_ABSOLUTE`**: If basic block A has a relative control flow instruction to block B, then block B must be located after block A in the generated layout. This is because relative control flow instructions are USUALLY used to refer to addresses located after them. If the relative c.f. instruction is `JUMP_FORWARD` we can change it to `JUMP_ABSOLUTE`.
 87 | 4. **Re-introducing forwarder blocks**: For other relative c.f. instructions like `SETUP_LOOP`, `SETUP_EXCEPT` etc., we need to create a new forwarder block consisting of an absolute jump instruction to block B, and make the relative control flow instruction in block A point to the forwarder block. This works since the forwarder block will naturally be after block A in the generated layout, and relative instructions can always be used to point to blocks located after them, i.e. blocks that have a higher address.
 88 | 5. **Calculation of block addresses**: Once the layouts of the blocks are fixed, we need to calculate the address of each block.
 89 | 6. **Calculation of instruction operands**: Instructions like `JUMP_FORWARD` & `SETUP_LOOP` use the operand to refer to other instructions. This reference is an integer denoting the offset/absolute address of the target.
 90 | 7. **Emitting code**: This is the final code generation phase where the instructions are converted to their binary representations.
 91 | 
 92 | ## Example usage
 93 | 
 94 | ```
 95 | $ python main.py --ifile=obfuscated.pyc --ofile=deobfuscated.pyc
 96 | 
 97 | INFO:__main__:Opening file obfuscated.pyc
 98 | INFO:__main__:Input pyc file header matched
 99 | DEBUG:__main__:Unmarshalling file
100 | INFO:__main__:Processing code object \x0b\x08\x0c\x19\x0b\x0e\x03
101 | DEBUG:deobfuscator:Code entrypoint matched PjOrion signature v1
102 | INFO:deobfuscator:Original code entrypoint at 124
103 | INFO:deobfuscator:Starting control flow analysis...
104 | DEBUG:disassembler:Finding leaders...
105 | DEBUG:disassembler:Start leader at 124
106 | DEBUG:disassembler:End leader at 127
107 | .
108 | <snip>
109 | .
110 | DEBUG:disassembler:Found 904 leaders
111 | DEBUG:disassembler:Constructing basic blocks...
112 | DEBUG:disassembler:Creating basic block 0x27dc5a8 spanning from 13 to 13, both inclusive
113 | DEBUG:disassembler:Creating basic block 0x2837800 spanning from 5369 to 5370, end exclusive
114 | .
115 | <snip>
116 | .
117 | DEBUG:disassembler:461 basic blocks created
118 | DEBUG:disassembler:Constructing edges between basic blocks...
119 | DEBUG:disassembler:Adding explicit edge from block 0x2a98080 to 0x2aa88a0
120 | DEBUG:disassembler:Adding explicit edge from block 0x2aa80f8 to 0x2a9ab70
121 | DEBUG:disassembler:Basic block 0x2aa8dc8 has xreference
122 | .
123 | <snip>
124 | .
125 | INFO:deobfuscator:Control flow analysis completed.
126 | INFO:deobfuscator:Starting simplication of basic blocks...
127 | DEBUG:simplifier:Eliminating forwarders...
128 | INFO:simplifier:Adding implicit edge from block 0x2aa8058 to 0x2a9ab70
129 | INFO:simplifier:Adding explicit edge from block 0x2b07ee0 to 0x2a9ab70
130 | DEBUG:simplifier:Forwarder basic block 0x2aa80f8 eliminated
131 | .
132 | <snip>
133 | .
134 | INFO:
135 | INFO:simplifier:307 basic blocks merged.
136 | INFO:deobfuscator:Simplication of basic blocks completed.
137 | INFO:deobfuscator:Beginning verification of simplified basic block graph...
138 | INFO:deobfuscator:Verification succeeded.
139 | INFO:deobfuscator:Assembling basic blocks...
140 | DEBUG:assembler:Performing a DFS on the graph to generate the layout of the blocks.
141 | DEBUG:assembler:Morphing some JUMP_ABSOLUTE instructions to make file decompilable.
142 | DEBUG:assembler:Verifying generated layout...
143 | INFO:assembler:Basic block 0x2b0e940 uses a relative control transfer instruction to access block 0x2abb3a0 located before it.
144 | INFO:assembler:Basic block 0x2ab5300 uses a relative control transfer instruction to access block 0x2ada918 located before it.
145 | DEBUG:assembler:Successfully verified layout.
146 | DEBUG:assembler:Calculating addresses of basic blocks.
147 | DEBUG:assembler:Calculating instruction operands.
148 | DEBUG:assembler:Generating code...
149 | INFO:deobfuscator:Successfully assembled. 
150 | INFO:__main__:Successfully deobfuscated code object main
151 | INFO:__main__:Collecting constants for code object main
152 | INFO:__main__:Generating new code object for main
153 | INFO:__main__:Generating new code object for \x0b\x08\x0c\x19\x0b\x0e\x03
154 | INFO:__main__:Writing deobfuscated code object to disk
155 | INFO:__main__:Success
156 | ```
157 | 


--------------------------------------------------------------------------------
/assembler.py:
--------------------------------------------------------------------------------
  1 | import logging
  2 | 
  3 | import cStringIO
  4 | import networkx as nx
  5 | import dis
  6 | 
  7 | from basicblock import BasicBlock
  8 | from instruction import Instruction
  9 | 
 10 | logger = logging.getLogger(__name__)
 11 | 
 12 | 
 13 | class Assembler:
 14 |     def __init__(self, bb_graph):
 15 |         """
 16 |         :param bb_graph: The graph of basic blocks
 17 |         :type bb_graph: nx.DiGraph
 18 |         """
 19 |         self.bb_graph = bb_graph
 20 |         self.bb_ordered = [None] * self.bb_graph.number_of_nodes()
 21 |         self.idx = self.bb_graph.number_of_nodes() - 1
 22 | 
 23 |     def assemble(self):
 24 |         """
 25 |         Assembling consists of several sub-stages:
 26 |         1. Depth first search the graph to fix the order of the nodes when they are laid out sequentially.
 27 |             (This is done in a post-order fashion)
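 28 |         2. Convert JUMP_ABSOLUTE to JUMP_FORWARD where the target lies ahead, then repair relative jumps that point backwards.
 29 |         3. Calculate block addresses and instruction operands, then emit the code.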
 30 |         """
 31 |         entryblock = nx.get_node_attributes(self.bb_graph, 'isEntry').keys()[0]
 32 |         logger.debug('Performing a DFS on the graph to generate the layout of the blocks.')
 33 |         self.dfs(entryblock)
 34 | 
 35 |         logger.debug('Morphing some JUMP_ABSOLUTE instructions to make file decompilable.')
 36 |         self.convert_abs_to_rel()
 37 | 
 38 |         # self.remove_redundant_jumps()
 39 | 
 40 |         # If basic block A has a relative control flow instruction to block B, then block B
 41 |         # must be located after block A in the generated layout.
 42 |         # This is because relative control flow instructions are USUALLY used to refer to
 43 |         # addresses located after it.
 44 | 
 45 |         # If the relative c.f. instruction is JUMP_FORWARD we can change to JUMP_ABSOLUTE without
 46 |         # any further modifications.
 47 | 
 48 |         # For other relative c.f instructions like SETUP_LOOP, SETUP_EXCEPT etc,
 49 |         # we need to create a new forwarder block consisting of an absolute jump instruction
 50 |         # to block B, and make the relative control flow instruction in block A to point to
 51 |         # the forwarder block. This works since the forwarder block will naturally be after
 52 |         # block A in the generated layout and relative instructions can be always used to point
 53 |         # to blocks located after it, i.e. have a higher address.
 54 |         logger.debug('Verifying generated layout...')
 55 |         for idx in xrange(len(self.bb_ordered)):
 56 |             block = self.bb_ordered[idx]
 57 |             for ins in block.instruction_iter():
 58 |                 if ins.opcode in dis.hasjrel:
 59 |                     targetBlock = ins.argval
 60 | 
 61 |                     # Check if target block occurs before the current block
 62 |                     if self.bb_ordered.index(targetBlock) <= idx:
 63 |                         logger.info(
 64 |                             'Basic block {} uses a relative control transfer instruction to access block {} located before it.'.format(
 65 |                                 hex(id(block)), hex(id(targetBlock))))
 66 | 
 67 |                         # Modify relative jump to absolute jump
 68 |                         if ins.mnemonic == 'JUMP_FORWARD':
 69 |                             ins.opcode = dis.opmap['JUMP_ABSOLUTE']
 70 | 
 71 |                         # If instruction is a relative control transfer instruction
 72 |                         # but is not JUMP_FORWARD (like SETUP_LOOP)
 73 |                         else:
 74 |                             # Create a new forwarder block
 75 |                             bb = self.create_forwarder_block(targetBlock)
 76 | 
 77 |                             # Make the original instruction point to the new block
 78 |                             ins.argval = bb
 79 | 
 80 |                             # Append new block at end
 81 |                             self.bb_ordered.append(bb)
 82 | 
 83 |         logger.debug('Successfully verified layout.')
 84 |         self.calculate_block_addresses()
 85 |         self.calculate_ins_operands()
 86 |         return self.emit()
 87 | 
 88 |     def create_forwarder_block(self, target):
 89 |         """
 90 |         Create a new basic block consisting of a `JUMP_ABSOLUTE`
 91 |         instruction to target block
 92 | 
 93 |         :param target: The target basic block to jump to
 94 |         :type target: BasicBlock
 95 |         :return: The new basic block
 96 |         :rtype: BasicBlock
 97 |         """
 98 |         bb = BasicBlock()
 99 |         ins = Instruction(dis.opmap['JUMP_ABSOLUTE'], target, 3)
100 |         ins.argval = target
101 |         bb.add_instruction(ins)
102 |         return bb
103 | 
104 |     def dfs(self, bb):
105 |         """
106 |         Depth first search.
107 |         Ported from: https://github.com/python/cpython/blob/2.7/Python/compile.c#L3409
108 | 
109 |         :param bb: The basic block
110 |         :type bb: basicblock.BasicBlock
111 |         """
112 | 
113 |         # Return if the current block has already been visited
114 |         if bb.b_seen:
115 |             return
116 | 
117 |         # Mark this block as visited
118 |         bb.b_seen = True
119 | 
120 |         # Recursively dfs on all out going explicit edges
121 |         for o_edge in self.bb_graph.out_edges(bb, data=True):
122 |             # o_edge is a tuple (edge src, edge dest, edge attrib dict)
123 |             if o_edge[2]['edge_type'] == 'explicit':
124 |                 self.dfs(o_edge[1])
125 | 
126 |         # Iterate over the instructions in the basic block
127 |         for ins in bb.instruction_iter():
128 |             # Recursively dfs if the instruction has xreferences
129 |             if ins.has_xref():
130 |                 self.dfs(ins.argval)
131 | 
132 |         # Recursively dfs on all out going implicit edges
133 |         for o_edge in self.bb_graph.out_edges(bb, data=True):
134 |             # o_edge is a tuple (edge src, edge dest, edge attrib dict)
135 |             if o_edge[2]['edge_type'] == 'implicit':
136 |                 self.dfs(o_edge[1])
137 | 
138 |         # Add the basic block in a reversed order
139 |         self.bb_ordered[self.idx] = bb
140 |         self.idx -= 1
141 | 
142 |     def calculate_block_addresses(self):
143 |         """
144 |         Once the layout of the blocks is fixed, we need to calculate the address of each block.
145 |         """
146 |         logger.debug('Calculating addresses of basic blocks.')
147 |         size = 0
148 |         for block in self.bb_ordered:
149 |             block.address = size
150 |             size += block.size()
151 | 
152 |     def calculate_ins_operands(self):
153 |         """
154 |         Instructions like JUMP_FORWARD & SETUP_LOOP use the operand to refer to other instructions.
155 |         This reference is an integer denoting the offset/absolute address of the target. This function
156 |         calculates the values of these operands.
157 |         """
158 |         logger.debug('Calculating instruction operands.')
159 |         for block in self.bb_ordered:
160 |             addr = block.address
161 |             for ins in block.instruction_iter():
162 |                 addr += ins.size
163 |                 if ins.opcode in dis.hasjabs:
164 |                     # ins.argval is a BasicBlock
165 |                     ins.arg = ins.argval.address
166 |                     # TODO
167 |                     # We do not generate EXTENDED_ARG opcode at the moment,
168 |                     # hence size of opcode argument can only be 2 bytes
169 |                     assert ins.arg <= 0xFFFF
170 |                 elif ins.opcode in dis.hasjrel:
171 |                     ins.arg = ins.argval.address - addr
172 |                     # relative jump can USUALLY go forward
173 |                     assert ins.arg >= 0
174 |                     assert ins.arg <= 0xFFFF
175 | 
176 |     def emit(self):
177 |         logger.debug('Generating code...')
178 |         codestring = cStringIO.StringIO()
179 |         for block in self.bb_ordered:
180 |             for ins in block.instruction_iter():
181 |                 codestring.write(ins.assemble())
182 |         return codestring.getvalue()
183 | 
184 |     def convert_abs_to_rel(self):
185 |         """
186 |         A JUMP_ABSOLUTE instruction from basic block A to block B can be replaced with a JUMP_FORWARD
187 |         if block B is located after block A.
188 |         This conversion is not strictly required, but some decompilers like uncompyle fail without it.
189 |         """
190 |         for idx in xrange(len(self.bb_ordered)):
191 |             block = self.bb_ordered[idx]
192 | 
193 |             # Fetch the last instruction
194 |             ins = block.instructions[-1]
195 | 
196 |             # A JUMP_ABSOLUTE instruction whose target block is located after it
197 |             if ins.mnemonic == 'JUMP_ABSOLUTE' and self.bb_ordered.index(ins.argval) > idx:
198 |                 ins.opcode = dis.opmap['JUMP_FORWARD']
199 |                 ins.mnemonic = 'JUMP_FORWARD'
200 | 
201 |     def remove_redundant_jumps(self):
202 |         """
203 |         If basic block A has a jump instruction to block B, but block B is immediately located after A,
204 |         then the jump instruction can safely be removed.
205 |         This feature is experimental and may break decompilers. The advantage of this feature is it reduces
206 |         generated code size.
207 |         """
208 |         logger.warning('Removing redundant jump instruction. This feature is EXPERIMENTAL.')
209 |         numRemoved = 0
210 | 
211 |         for idx in xrange(len(self.bb_ordered)):
212 |             block = self.bb_ordered[idx]
213 | 
214 |             # Fetch the last instruction
215 |             ins = block.instructions[-1]
216 | 
217 |             if ins.mnemonic == 'JUMP_ABSOLUTE' or ins.mnemonic == 'JUMP_FORWARD':
218 |                 target = ins.argval
219 |                 # If target block is immediately located after it
220 |                 if self.bb_ordered.index(target) == idx + 1:
221 |                     # Remove the instruction
222 |                     del block.instructions[-1]
223 |                     numRemoved += 1
224 |         logger.debug('Removed {} redundant jump instructions'.format(numRemoved))
225 | 


--------------------------------------------------------------------------------
/basicblock.py:
--------------------------------------------------------------------------------
 1 | class BasicBlock:
 2 |     """
 3 |     A basic block is a sequence of instructions that has a single entry and a single exit.
 4 |     Execution begins from the top and ends at the bottom. There can be no branching
 5 |     in between.
 6 |     """
 7 | 
 8 |     def __init__(self):
 9 |         self.address = 0
10 |         self.instructions = []
11 |         self.has_xrefs_to = False
12 |         # Instructions which xreference this basic block
13 |         self.xref_instructions = []
14 | 
15 |         # b_seen is used to perform a DFS of basicblocks
16 |         self.b_seen = False
17 | 
18 |     def add_instruction(self, ins):
19 |         self.instructions.append(ins)
20 | 
21 |     def instruction_iter(self):
22 |         """
23 |         An iterator for traversing over the instructions.
24 | 
25 |         :return: An iterator for iterating over the instruction
26 |         """
27 |         for ins in self.instructions:
28 |             yield ins
29 | 
30 |     def size(self):
31 |         """
32 |         Calculates the size of the basic block
33 |         :return: The sum of the sizes of the instructions in the block
34 |         """
35 |         return reduce(lambda x, ins: x + ins.size, self.instructions, 0)
36 | 


--------------------------------------------------------------------------------
/decoder.py:
--------------------------------------------------------------------------------
 1 | import dis
 2 | 
 3 | from instruction import Instruction
 4 | 
 5 | 
 6 | class Decoder:
 7 |     """
 8 |     Class to decode raw bytes into instruction.
 9 |     """
10 | 
11 |     def __init__(self, insBytes):
12 |         self.insBytes = insBytes
13 | 
14 |     def decode_at(self, offset):
15 |         assert offset < len(self.insBytes)
16 | 
17 |         opcode = self.insBytes[offset]
18 | 
19 |         if opcode == dis.opmap['EXTENDED_ARG']:
20 |             raise Exception('EXTENDED_ARG not yet implemented')
21 | 
22 |         # Invalid instruction
23 |         if opcode not in dis.opmap.values():
24 |             return Instruction(-1, None, 1)
25 | 
26 |         if opcode < dis.HAVE_ARGUMENT:
27 |             return Instruction(opcode, None, 1)
28 | 
29 |         if opcode >= dis.HAVE_ARGUMENT:
30 |             arg = (self.insBytes[offset + 2] << 8) | self.insBytes[offset + 1]
31 |             return Instruction(opcode, arg, 3)
32 | 


--------------------------------------------------------------------------------
/deobfuscator.py:
--------------------------------------------------------------------------------
 1 | import logging
 2 | 
 3 | from assembler import Assembler
 4 | from simplifier import Simplifier
 5 | from decoder import Decoder
 6 | from disassembler import Disassembler
 7 | from utils.rendergraph import render_graph
 8 | from verifier import verify_graph
 9 | 
10 | logger = logging.getLogger(__name__)
11 | 
12 | 
13 | def find_oep(insBytes):
14 |     """
15 |     Finds the original entry point of a code object obfuscated by PjOrion.
16 |     If the entrypoint does not match the predefined signature it will return 0.
17 | 
18 |     :param insBytes: the raw bytecode of the code object
19 |     :type insBytes: bytearray
20 |     :returns: the entrypoint
21 |     :rtype: int
22 |     """
23 | 
24 |     dec = Decoder(insBytes)
25 |     ins = dec.decode_at(0)
26 | 
27 |     try:
28 |         # First instruction sets up an exception handler
29 |         assert ins.mnemonic == 'SETUP_EXCEPT'
30 | 
31 |         # Get location of exception handler
32 |         exc_handler = 0 + ins.arg + ins.size
33 | 
34 |         # Second instruction is intentionally invalid, on execution
35 |         # control transfers to exception handler
36 |         assert not dec.decode_at(3).is_opcode_valid()
37 | 
38 |         assert dec.decode_at(exc_handler).mnemonic == 'POP_TOP'
39 |         assert dec.decode_at(exc_handler + 1).mnemonic == 'POP_TOP'
40 |         assert dec.decode_at(exc_handler + 2).mnemonic == 'POP_TOP'
41 |         logger.debug('Code entrypoint matched PjOrion signature v1')
42 |         oep = exc_handler + 3
43 |     except Exception:
44 |         if ins.mnemonic == 'JUMP_FORWARD':
45 |             oep = 0 + ins.arg + ins.size
46 |             logger.debug('Code entrypoint matched PjOrion signature v2')
47 |         elif ins.mnemonic == 'JUMP_ABSOLUTE':
48 |             oep = ins.arg
49 |             logger.debug('Code entrypoint matched PjOrion signature v2')
50 |         else:
51 |             logger.warning('Code entrypoint did not match PjOrion signature')
52 |             oep = 0
53 | 
54 |     return oep
55 | 
56 | 
57 | def deobfuscate(codestring):
58 |     # Instructions are stored as a string, we need
59 |     # to convert it to an array of the raw bytes
60 |     insBytes = bytearray(codestring)
61 | 
62 |     oep = find_oep(insBytes)
63 |     logger.info('Original code entrypoint at {}'.format(oep))
64 | 
65 |     logger.info('Starting control flow analysis...')
66 |     disasm = Disassembler(insBytes, oep)
67 |     disasm.find_leaders()
68 |     disasm.construct_basic_blocks()
69 |     disasm.build_bb_edges()
70 |     logger.info('Control flow analysis completed.')
71 |     logger.info('Starting simplification of basic blocks...')
72 |     render_graph(disasm.bb_graph, 'before.svg')
73 |     simplifier = Simplifier(disasm.bb_graph)
74 |     simplifier.eliminate_forwarders()
75 |     render_graph(simplifier.bb_graph, 'after_forwarder.svg')
76 |     simplifier.merge_basic_blocks()
77 |     logger.info('Simplification of basic blocks completed.')
78 |     simplified_graph = simplifier.bb_graph
79 |     render_graph(simplified_graph, 'after.svg')
80 |     logger.info('Beginning verification of simplified basic block graph...')
81 | 
82 |     if not verify_graph(simplified_graph):
83 |         logger.error('Verification failed.')
84 |         raise SystemExit
85 | 
86 |     logger.info('Verification succeeded.')
87 |     logger.info('Assembling basic blocks...')
88 |     asm = Assembler(simplified_graph)
89 |     codestring = asm.assemble()
90 |     logger.info('Successfully assembled. ')
91 |     return codestring
92 | 


--------------------------------------------------------------------------------
/disassembler.py:
--------------------------------------------------------------------------------
  1 | import Queue
  2 | import logging
  3 | import collections
  4 | import dis
  5 | import networkx as nx
  6 | 
  7 | from basicblock import BasicBlock
  8 | from decoder import Decoder
  9 | 
 10 | logger = logging.getLogger(__name__)
 11 | 
 12 | 
 13 | class Disassembler:
 14 |     """
 15 |     A Recursive traversal disassembler.
 16 |     """
 17 | 
 18 |     def __init__(self, insBytes, entrypoint):
 19 |         self.insBytes = insBytes
 20 |         self.entrypoint = entrypoint
 21 |         self.leaders = None
 22 |         self.bb_graph = nx.DiGraph()
 23 | 
 24 |     def get_next_ins_addresses(self, ins, addr):
 25 |         """
 26 |         Given an instruction and an address at which this resides, this function
 27 |         returns a dictionary of addresses of the instruction expected to be executed next.
 28 | 
 29 |         explicit addresses are indicated by the control flow of the instruction.
 30 |         implicit address is the address of the instruction located sequentially after.
 31 | 
 32 |         :rtype: dict
 33 |         """
 34 |         next_addresses = {}
 35 | 
 36 |         if ins.mnemonic == 'JUMP_IF_FALSE_OR_POP':
 37 |             next_addresses['implicit'] = addr + ins.size
 38 |             next_addresses['explicit'] = ins.arg
 39 | 
 40 |         elif ins.mnemonic == 'JUMP_IF_TRUE_OR_POP':
 41 |             next_addresses['implicit'] = addr + ins.size
 42 |             next_addresses['explicit'] = ins.arg
 43 | 
 44 |         elif ins.mnemonic == 'JUMP_ABSOLUTE':
 45 |             next_addresses['explicit'] = ins.arg
 46 | 
 47 |         elif ins.mnemonic == 'POP_JUMP_IF_FALSE':
 48 |             next_addresses['implicit'] = addr + ins.size
 49 |             next_addresses['explicit'] = ins.arg
 50 | 
 51 |         elif ins.mnemonic == 'POP_JUMP_IF_TRUE':
 52 |             next_addresses['implicit'] = addr + ins.size
 53 |             next_addresses['explicit'] = ins.arg
 54 | 
 55 |         elif ins.mnemonic == 'CONTINUE_LOOP':
 56 |             next_addresses['explicit'] = ins.arg
 57 | 
 58 |         elif ins.mnemonic == 'FOR_ITER':
 59 |             next_addresses['implicit'] = addr + ins.size
 60 |             next_addresses['explicit'] = addr + ins.size + ins.arg
 61 | 
 62 |         elif ins.mnemonic == 'JUMP_FORWARD':
 63 |             next_addresses['explicit'] = addr + ins.size + ins.arg
 64 | 
 65 |         elif ins.mnemonic == 'RETURN_VALUE':
 66 |             pass
 67 | 
 68 |         else:
 69 |             next_addresses['implicit'] = addr + ins.size
 70 | 
 71 |         return next_addresses
 72 | 
 73 |     def get_ins_xref(self, ins, addr):
 74 |         """
 75 |         An instruction may reference other instruction.
 76 |         Example: SETUP_EXCEPT exc_handler
 77 |         the exception handler is the xref.
 78 |         """
 79 |         xref_ins = (dis.opmap[x] for x in ('SETUP_LOOP', 'SETUP_EXCEPT', 'SETUP_FINALLY', 'SETUP_WITH'))
 80 |         if ins.opcode in xref_ins:
 81 |             return addr + ins.size + ins.arg
 82 |         else:
 83 |             return None
 84 | 
 85 |     def find_leaders(self):
 86 |         logger.debug('Finding leaders...')
 87 | 
 88 |         # A leader is a mark that identifies either the start or end of a basic block
 89 |         # address is the positional offset of the leader within the instruction bytes
 90 |         # type can be either S (starting) or E (ending)
 91 |         Leader = collections.namedtuple('leader', ['address', 'type'])
 92 | 
 93 |         # Set to contain all the leaders. We use a set to prevent duplicates
 94 |         leader_set = set()
 95 | 
 96 |         # The entrypoint is automatically the start of a basic block, and hence a start leader
 97 |         leader_set.add(Leader(self.entrypoint, 'S'))
 98 |         logger.debug('Start leader at {}'.format(self.entrypoint))
 99 | 
100 |         # Queue to contain list of addresses, from where linear sweep disassembling would start
101 |         analysis_Q = Queue.Queue()
102 | 
103 |         # Start analysis from the entrypoint
104 |         analysis_Q.put(self.entrypoint)
105 | 
106 |         # Already analyzed addresses must not be analyzed later, else we would get into an infinite loop
107 |         # while processing instructions that branch backwards to a previously analyzed address.
108 |         # The already_analyzed set contains the addresses that have been previously encountered.
109 |         already_analyzed = set()
110 | 
111 |         # Create the decoder
112 |         dec = Decoder(self.insBytes)
113 | 
114 |         while not analysis_Q.empty():
115 |             addr = analysis_Q.get()
116 | 
117 |             while True:
118 |                 ins = dec.decode_at(addr)
119 | 
120 |                 # Put the current address into the already_analyzed set
121 |                 already_analyzed.add(addr)
122 | 
123 |                 # If current instruction is a return, stop disassembling further.
124 |                 # current address is an end leader
125 |                 if ins.is_ret():
126 |                     leader_set.add(Leader(addr, 'E'))
127 |                     logger.debug('End leader at {}'.format(addr))
128 |                     break
129 | 
130 |                 # If current instruction is control flow, stop disassembling further.
131 |                 # the current instr is an end leader, control flow target(s) is(are) start leaders
132 |                 if ins.is_control_flow():
133 |                     # Current instruction is an end leader
134 |                     leader_set.add(Leader(addr, 'E'))
135 |                     logger.debug('End leader at {}'.format(addr))
136 | 
137 |                     # The list of addresses where execution is expected to transfer are starting leaders
138 |                     for target in self.get_next_ins_addresses(ins, addr).values():
139 |                         leader_set.add(Leader(target, 'S'))
140 |                         logger.debug('Start leader at {}'.format(target))
141 | 
142 |                         # Put into analysis queue if not already analyzed
143 |                         if target not in already_analyzed:
144 |                             analysis_Q.put(target)
145 |                     break
146 | 
147 |                 # Current instruction is not control flow
148 |                 else:
149 |                     # Get cross refs
150 |                     xref = self.get_ins_xref(ins, addr)
151 |                     nextAddress = self.get_next_ins_addresses(ins, addr).values()
152 | 
153 |                     # Non control flow instruction should only have a single possible next address
154 |                     assert len(nextAddress) == 1
155 | 
156 |                     # The immediate next instruction positionally
157 |                     addr = nextAddress[0]
158 | 
159 |                     # If the instruction has xrefs, they are start leaders
160 |                     if xref is not None:
161 |                         leader_set.add(Leader(xref, 'S'))
162 |                         logger.debug('Start leader at {}'.format(xref))
163 | 
164 |                         # Put into analysis queue if not already analyzed
165 |                         if xref not in already_analyzed:
166 |                             analysis_Q.put(xref)
167 | 
168 |         # Comparator function to sort the leaders according to increasing offsets
169 |         def __leaderSortFunc(elem1, elem2):
170 |             if elem1.address != elem2.address:
171 |                 return elem1.address - elem2.address
172 |             else:
173 |                 if elem1.type == 'S':
174 |                     return -1
175 |                 else:
176 |                     return 1
177 | 
178 |         logger.debug('Found {} leaders'.format(len(leader_set)))
179 |         self.leaders = sorted(leader_set, cmp=__leaderSortFunc)
180 | 
181 |     def construct_basic_blocks(self):
182 |         """
183 |         Once we have obtained the leaders, i.e. the boundaries where a basic block may start or end,
184 |         we need to build the basic blocks by parsing the leaders. A basic block spans from the starting leader
185 |         up to the immediate next end leader as per their addresses.
186 |         """
187 |         logger.debug('Constructing basic blocks...')
188 |         idx = 0
189 |         dec = Decoder(self.insBytes)
190 | 
191 |         while idx < len(self.leaders):
192 |             # Get a pair of leaders
193 |             leader1, leader2 = self.leaders[idx], self.leaders[idx + 1]
194 | 
195 |             # Get the addresses of the respective leaders
196 |             addr1, addr2 = leader1.address, leader2.address
197 | 
198 |             # Create a new basic block
199 |             bb = BasicBlock()
200 | 
201 |             # Set the address of the basic block
202 |             bb.address = addr1
203 | 
204 |             # The offset variable is used to track the position of the individual instructions within the basic block
205 |             offset = 0
206 | 
207 |             # Store the basic block at the entrypoint separately
208 |             if addr1 == self.entrypoint:
209 |                 self.bb_graph.add_node(bb, isEntry=True)
210 |             else:
211 |                 self.bb_graph.add_node(bb)
212 | 
213 |             # Add the basic block to the graph (redundant: the node was already added above and keeps its attributes)
214 |             self.bb_graph.add_node(bb)
215 | 
216 |             # Leader1 is start leader, leader2 is end leader
217 |             # All instructions inclusive of leader1 and leader2 are part of this basic block
218 |             if leader1.type == 'S' and leader2.type == 'E':
219 |                 logger.debug(
220 |                     'Creating basic block {} spanning from {} to {}, both inclusive'.format(hex(id(bb)),
221 |                                                                                             leader1.address,
222 |                                                                                             leader2.address))
223 |                 while addr1 + offset <= addr2:
224 |                     ins = dec.decode_at(addr1 + offset)
225 |                     bb.add_instruction(ins)
226 |                     offset += ins.size
227 |                 idx += 2
228 | 
229 |             # Both Leader1 and leader2 are start leader
230 |             # Instructions inclusive of leader1 but exclusive of leader2 are part of this basic block
231 |             elif leader1.type == 'S' and leader2.type == 'S':
232 |                 logger.debug(
233 |                     'Creating basic block {} spanning from {} to {}, end exclusive'.format(hex(id(bb)), leader1.address,
234 |                                                                                            leader2.address))
235 |                 while addr1 + offset < addr2:
236 |                     ins = dec.decode_at(addr1 + offset)
237 |                     bb.add_instruction(ins)
238 |                     offset += ins.size
239 |                 idx += 1
240 | 
241 |         logger.debug('{} basic blocks created'.format(self.bb_graph.number_of_nodes()))
242 | 
243 |     def find_bb_by_address(self, address):
244 |         for bb in self.bb_graph.nodes():
245 |             if bb.address == address:
246 |                 return bb
247 | 
248 |     def build_bb_edges(self):
249 |         """
250 |         The list of basic blocks forms a graph. The basic blocks themselves are the vertices with edges between them.
251 |         Edges refer to the control flow between the basic blocks.
252 |         """
253 |         logger.debug('Constructing edges between basic blocks...')
254 | 
255 |         for bb in self.bb_graph.nodes():
256 |             offset = 0
257 | 
258 |             for idx in xrange(len(bb.instructions)):
259 |                 ins = bb.instructions[idx]
260 | 
261 |                 # If instruction has an xref, resolve it
262 |                 xref = self.get_ins_xref(ins, bb.address + offset)
263 |                 if xref is not None:
264 |                     xref_bb = self.find_bb_by_address(xref)
265 |                     ins.argval = xref_bb
266 |                     xref_bb.has_xrefs_to = True
267 |                     xref_bb.xref_instructions.append(ins)
268 |                     logger.debug('Basic block {} has xreference'.format(hex(id(xref_bb))))
269 | 
270 |                 nextInsAddr = self.get_next_ins_addresses(ins, bb.address + offset)
271 | 
272 |                 # Check if this is the last instruction of this basic block.
273 |                 # This is required to construct edges
274 |                 if idx == len(bb.instructions) - 1:
275 |                     # A control flow instruction can be of two types: conditional and un-conditional.
276 |                     # An un-conditional control flow instruction can have only a single successor instruction
277 |                     # which is indicated by its argument.
278 |                     if ins.is_unconditional():
279 |                         assert len(nextInsAddr) == 1 and nextInsAddr.has_key('explicit')
280 |                         target = nextInsAddr['explicit']
281 |                         targetBB = self.find_bb_by_address(target)
282 |                         ins.argval = targetBB
283 | 
284 |                         # Add edge
285 |                         self.bb_graph.add_edge(bb, targetBB, edge_type='explicit')
286 | 
287 |                         logger.debug(
288 |                             'Adding explicit edge from block {} to {}'.format(hex(id(bb)), hex(id(targetBB))))
289 | 
290 |                     # A conditional control flow instruction has two possible successor instructions.
291 |                     # One is explicit and indicated by its argument, the other is implicit and is the
292 |                     # immediate next instruction according to address.
293 |                     elif ins.is_conditional():
294 |                         assert len(nextInsAddr) == 2
295 |                         # target1 is the implicit successor instruction, i.e the immediate next instruction
296 |                         # target2 is the explicit successor instruction, i.e. the branch target indicated by the argument
297 |                         target1, target2 = nextInsAddr['implicit'], nextInsAddr['explicit']
298 | 
299 |                         target1BB = self.find_bb_by_address(target1)
300 |                         target2BB = self.find_bb_by_address(target2)
301 | 
302 |                         ins.argval = target2BB
303 | 
304 |                         # Add the two edges
305 |                         self.bb_graph.add_edge(bb, target1BB, edge_type='implicit')
306 |                         self.bb_graph.add_edge(bb, target2BB, edge_type='explicit')
307 | 
308 |                         logger.debug(
309 |                             'Adding implicit edge from block {} to {}'.format(hex(id(bb)), hex(id(target1BB))))
310 | 
311 |                         logger.debug(
312 |                             'Adding explicit edge from block {} to {}'.format(hex(id(bb)), hex(id(target2BB))))
313 | 
314 |                     # RETURN_VALUE
315 |                     elif ins.is_ret():
316 |                         nx.set_node_attributes(self.bb_graph, {bb: {'isTerminal': True}})
317 |                         # Does not have any successors
318 |                         assert len(nextInsAddr) == 0
319 | 
320 |                     # The last instruction does not have an explicit control flow
321 |                     else:
322 |                         assert len(nextInsAddr) == 1 and nextInsAddr.has_key('implicit')
323 |                         nextBB = self.find_bb_by_address(nextInsAddr['implicit'])
324 | 
325 |                         # Add edge
326 |                         self.bb_graph.add_edge(bb, nextBB, edge_type='implicit')
327 | 
328 |                 offset += ins.size
329 | 
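
The leader ordering used by `find_leaders` relies on `sorted(..., cmp=...)`, which exists only on Python 2. The sketch below shows a key-based ordering that should give the same result: leaders sort by address, and at equal addresses a start leader comes before an end leader. The `Leader` tuple and `sort_leaders` helper here are stand-ins written for illustration, not code from the repository.

```python
import collections

# Hypothetical stand-in for the Leader tuple built inside find_leaders
Leader = collections.namedtuple('Leader', ['address', 'type'])


def sort_leaders(leaders):
    # Order by address; at equal addresses 'S' (start) sorts before 'E' (end)
    return sorted(leaders, key=lambda l: (l.address, 0 if l.type == 'S' else 1))


# Address 6 holds a single-instruction block, so it is both a start and an end leader
leaders = {Leader(6, 'E'), Leader(0, 'S'), Leader(6, 'S'), Leader(3, 'E')}
ordered = sort_leaders(leaders)
print(ordered)
```

With this ordering, `construct_basic_blocks` can walk the list pairwise: an `S` followed by an `E` closes a block inclusively, while an `S` followed by another `S` closes it exclusively.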


--------------------------------------------------------------------------------
/images/after-forwarder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/after-forwarder.png


--------------------------------------------------------------------------------
/images/after-merge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/after-merge.png


--------------------------------------------------------------------------------
/images/before-forwarder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/before-forwarder.png


--------------------------------------------------------------------------------
/images/before-merge.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/images/before-merge.png


--------------------------------------------------------------------------------
/instruction.py:
--------------------------------------------------------------------------------
 1 | import dis
 2 | 
 3 | 
 4 | class Instruction:
 5 |     """
 6 |     This class represents an instruction.
 7 |     """
 8 | 
 9 |     def __init__(self, opcode, arg, size):
10 |         # Numeric code for operation, corresponding to the opcode values
11 |         self.opcode = opcode
12 | 
13 |         # Numeric argument to operation(if any), otherwise None
14 |         self.arg = arg
15 | 
16 |         # The size of the instruction including the argument
17 |         self.size = size
18 | 
19 |         # Resolved arg value (if known), otherwise same as arg
20 |         self.argval = arg
21 | 
22 |         # Human readable name for operation
23 |         self.mnemonic = dis.opname[self.opcode]
24 | 
25 |     def is_opcode_valid(self):
26 |         """
27 |         Checks whether the instruction is legal. A legal instruction has an opcode
28 |         which is understood by the CPython VM.
29 |         """
30 |         return self.opcode in dis.opmap.values()
31 | 
32 |     def is_ret(self):
33 |         """
34 |         Checks whether the instruction is a return
35 |         :return:
36 |         """
37 |         return self.opcode == dis.opmap['RETURN_VALUE']
38 | 
39 |     def is_control_flow(self):
40 |         """
41 |         Checks whether the instruction causes a change of control flow.
42 |         A control flow instruction can be conditional or unconditional
43 |         :return:
44 |         """
45 |         # All control flow instructions
46 |         cfIns = (
47 |             'JUMP_IF_FALSE_OR_POP', 'JUMP_IF_TRUE_OR_POP', 'JUMP_ABSOLUTE', 'POP_JUMP_IF_FALSE', 'POP_JUMP_IF_TRUE',
48 |             'CONTINUE_LOOP', 'FOR_ITER', 'JUMP_FORWARD')
49 |         return self.mnemonic in cfIns
50 | 
51 |     def is_conditional(self):
52 |         """
53 |         Checks whether the instruction is a conditional control flow instruction.
54 |         A conditional control flow instruction has two possible successor instructions.
55 |         """
56 |         conditionalIns = (
57 |             'JUMP_IF_FALSE_OR_POP', 'JUMP_IF_TRUE_OR_POP', 'POP_JUMP_IF_FALSE', 'POP_JUMP_IF_TRUE', 'FOR_ITER')
58 |         return self.is_control_flow() and self.mnemonic in conditionalIns
59 | 
60 |     def is_unconditional(self):
61 |         """
62 |         Checks whether the instruction is an unconditional control flow instruction.
63 |         An unconditional control flow instruction has exactly one successor instruction.
64 |         """
65 |         unconditionalIns = ('JUMP_ABSOLUTE', 'JUMP_FORWARD', 'CONTINUE_LOOP')
66 |         return self.is_control_flow() and self.mnemonic in unconditionalIns
67 | 
68 |     def has_xref(self):
69 |         """
70 |         Checks whether the instruction has xreferences.
71 |         """
72 |         return self.mnemonic in ('SETUP_LOOP', 'SETUP_EXCEPT', 'SETUP_FINALLY', 'SETUP_WITH')
73 | 
74 |     def assemble(self):
75 |         if self.size == 1:
76 |             return chr(self.opcode)
77 |         else:
78 |             return chr(self.opcode) + chr(self.arg & 0xFF) + chr((self.arg >> 8) & 0xFF)
79 | 
80 |     def __str__(self):
81 |         return '{} {} {}'.format(self.opcode, self.mnemonic, self.arg)
82 | 
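
`Instruction.assemble` emits one byte for an argument-less instruction and three bytes otherwise, with the 16-bit argument stored low byte first. A minimal sketch of that encoding and its inverse, using `struct` so that it runs on both Python 2 and Python 3, could look as follows. The helper names are hypothetical, and the opcode numbers (83 for `RETURN_VALUE`, 113 for `JUMP_ABSOLUTE`, 90 for `HAVE_ARGUMENT`) are assumed to be the Python 2.7 values.

```python
import struct

# Assumption: Python 2.7 layout, where opcodes >= 90 carry a 2-byte argument
HAVE_ARGUMENT = 90


def assemble(opcode, arg=None):
    # One byte for the opcode, plus a little-endian 16-bit argument if present
    if arg is None:
        return struct.pack('<B', opcode)
    return struct.pack('<BH', opcode, arg & 0xFFFF)


def decode_one(code, addr):
    # Returns (opcode, arg, size) for the instruction starting at addr
    opcode = struct.unpack_from('<B', code, addr)[0]
    if opcode < HAVE_ARGUMENT:
        return opcode, None, 1
    arg = struct.unpack_from('<H', code, addr + 1)[0]
    return opcode, arg, 3


jump = assemble(113, 0x1234)   # JUMP_ABSOLUTE 0x1234 (assumed opcode number)
ret = assemble(83)             # RETURN_VALUE (assumed opcode number)
print(repr(jump), repr(ret))
```

Unlike the original method, this sketch decides the length from whether an argument was given rather than from a stored `size` field.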


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
 1 | import argparse
 2 | 
 3 | import marshal
 4 | import logging
 5 | import types
 6 | 
 7 | logging.basicConfig(level=logging.DEBUG)
 8 | logger = logging.getLogger(__name__)
 9 | 
10 | from deobfuscator import deobfuscate
11 | 
12 | 
13 | def parse_code_object(codeObject):
14 |     logger.info('Processing code object {}'.format(codeObject.co_name).encode('string_escape'))
15 |     co_argcount = codeObject.co_argcount
16 |     co_nlocals = codeObject.co_nlocals
17 |     co_stacksize = codeObject.co_stacksize
18 |     co_flags = codeObject.co_flags
19 | 
20 |     co_codestring = deobfuscate(codeObject.co_code)
21 |     logger.info('Successfully deobfuscated code object {}'.format(codeObject.co_name).encode('string_escape'))
22 |     co_names = codeObject.co_names
23 | 
24 |     co_varnames = codeObject.co_varnames
25 |     co_filename = codeObject.co_filename
26 |     co_name = codeObject.co_name
27 |     co_firstlineno = codeObject.co_firstlineno
28 |     co_lnotab = codeObject.co_lnotab
29 | 
30 |     logger.info('Collecting constants for code object {}'.format(codeObject.co_name).encode('string_escape'))
31 |     mod_const = []
32 |     for const in codeObject.co_consts:
33 |         if isinstance(const, types.CodeType):
34 |             logger.info(
35 |                 'Code object {} contains embedded code object {}'.format(codeObject.co_name, const.co_name).encode(
36 |                     'string_escape'))
37 |             mod_const.append(parse_code_object(const))
38 |         else:
39 |             mod_const.append(const)
40 |     co_constants = tuple(mod_const)
41 | 
42 |     logger.info('Generating new code object for {}'.format(codeObject.co_name).encode('string_escape'))
43 |     return types.CodeType(co_argcount, co_nlocals, co_stacksize, co_flags,
44 |                           co_codestring, co_constants, co_names, co_varnames,
45 |                           co_filename, co_name, co_firstlineno, co_lnotab)
46 | 
47 | 
48 | def process(ifile, ofile):
49 |     logger.info('Opening file ' + ifile)
50 |     ifPtr = open(ifile, 'rb')
51 |     header = ifPtr.read(8)
52 |     if not header.startswith('\x03\xF3\x0D\x0A'):
53 |         raise SystemExit('[!] Header mismatch. The input file is not a valid pyc file.')
54 |     logger.info('Input pyc file header matched')
55 |     logger.debug('Unmarshalling file')
56 |     rootCodeObject = marshal.load(ifPtr)
57 |     ifPtr.close()
58 |     deob = parse_code_object(rootCodeObject)
59 |     logger.info('Writing deobfuscated code object to disk')
60 |     ofPtr = open(ofile, 'wb')
61 |     ofPtr.write(header)
62 |     marshal.dump(deob, ofPtr)
63 |     ofPtr.close()
64 |     logger.info('Success')
65 | 
66 | 
67 | if __name__ == '__main__':
68 |     parser = argparse.ArgumentParser()
69 |     parser.add_argument('-i', '--ifile', help='Input pyc file name', required=True)
70 |     parser.add_argument('-o', '--ofile', help='Output pyc file name', required=True)
71 |     args = parser.parse_args()
72 |     process(args.ifile, args.ofile)
73 | 


--------------------------------------------------------------------------------
/simplifier.py:
--------------------------------------------------------------------------------
  1 | import logging
  2 | import networkx as nx
  3 | 
  4 | logger = logging.getLogger(__name__)
  5 | 
  6 | 
  7 | class Simplifier:
  8 |     def __init__(self, bb_graph):
  9 |         """
 10 | 
 11 |         :param bb_graph: The basic block graph
 12 |         :type bb_graph: nx.DiGraph
 13 |         """
 14 |         self.bb_graph = bb_graph
 15 | 
 16 |     def eliminate_forwarders(self):
 17 |         """
 18 |         Eliminates a basic block that acts as a forwarder, i.e. only consists of a single un-conditional
 19 |         control flow instruction.
 20 |         """
 21 |         numEliminated = 0
 22 |         logger.debug('Eliminating forwarders...')
 23 | 
 24 |         # flag variable to indicate whether a basic block was eliminated in a pass
 25 |         bb_eliminated = True
 26 | 
 27 |         # Loop until no basic block can be eliminated any more
 28 |         while bb_eliminated:
 29 |             bb_eliminated = False
 30 |             for bb in self.bb_graph.nodes():
 31 |                 # Must have a single instruction
 32 |                 if len(bb.instructions) == 1:
 33 |                     ins = bb.instructions[0]
 34 |                     if ins.mnemonic == 'JUMP_ABSOLUTE' or ins.mnemonic == 'JUMP_FORWARD':
 35 |                         # Must have a single successor
 36 |                         assert self.bb_graph.out_degree(bb) == 1
 37 | 
 38 |                         forwarderBB = bb
 39 |                         forwardedBB = list(self.bb_graph.successors(bb))[0]
 40 | 
 41 |                         # Check if forwardedBB has at least one implicit in edge
 42 |                         forwardedBB_in_edge_exists = len(filter(lambda edge: edge[2]['edge_type'] == 'implicit',
 43 |                                                                 self.bb_graph.in_edges(forwardedBB, data=True))) > 0
 44 | 
 45 |                         # Check if forwarderBB has at least one implicit in edge
 46 |                         forwarderBB_in_edge_exists = len(filter(lambda edge: edge[2]['edge_type'] == 'implicit',
 47 |                                                                 self.bb_graph.in_edges(forwarderBB, data=True))) > 0
 48 | 
 49 |                         # Cannot delete block
 50 |                         if forwardedBB_in_edge_exists and forwarderBB_in_edge_exists:
 51 |                             continue
 52 | 
 53 |                         # Remove the edge between forwarder and forwarded
 54 |                         self.bb_graph.remove_edge(forwarderBB, forwardedBB)
 55 | 
 56 |                         # Iterate over the predecessors of the forwarder
 57 |                         for predecessorBB in list(self.bb_graph.predecessors(forwarderBB)):
 58 |                             # Get existing edge type
 59 |                             e_type = self.bb_graph.get_edge_data(predecessorBB, forwarderBB)['edge_type']
 60 | 
 61 |                             # Remove the edge between the predecessor and the forwarder
 62 |                             self.bb_graph.remove_edge(predecessorBB, forwarderBB)
 63 | 
 64 |                             # Add edge between the predecessor and the forwarded block
 65 |                             self.bb_graph.add_edge(predecessorBB, forwardedBB, edge_type=e_type)
 66 | 
 67 |                             logger.info('Adding {} edge from block {} to {}'.format(e_type, hex(id(predecessorBB)),
 68 |                                                                                     hex(id(forwardedBB))))
 69 | 
 70 |                             # Get last instruction of the predecessor
 71 |                             last_ins = predecessorBB.instructions[-1]
 72 | 
 73 |                             # Check if the last instruction of the predecessor points to the forwarder
 74 |                             if last_ins.argval == forwarderBB:
 75 |                                 # Change the xref to the forwarded
 76 |                                 last_ins.argval = forwardedBB
 77 | 
 78 |                         # Check if the forwarder has xrefs, if so patch them appropriately
 79 |                         if forwarderBB.has_xrefs_to:
 80 |                             forwardedBB.has_xrefs_to = True
 81 |                             for xref_ins in forwarderBB.xref_instructions:
 82 |                                 xref_ins.argval = forwardedBB
 83 |                                 forwardedBB.xref_instructions.append(xref_ins)
 84 | 
 85 |                         logger.debug('Forwarder basic block {} eliminated'.format(hex(id(bb))))
 86 | 
 87 |                         # There must not be any edges left
 88 |                         assert self.bb_graph.degree(forwarderBB) == 0
 89 | 
 90 |                         # Remove the node from the graph
 91 |                         self.bb_graph.remove_node(forwarderBB)
 92 |                         del forwarderBB
 93 |                         bb_eliminated = True
 94 |                         numEliminated += 1
 95 |                         break
 96 |         logger.info('{} basic blocks eliminated'.format(numEliminated))
 97 | 
 98 |     def merge_basic_blocks(self):
 99 |         """
100 |         Merges a basic block into its predecessor iff the basic block has exactly one predecessor
101 |         and the predecessor has this basic block as its lone successor
102 | 
103 |         The graph is taken from self.bb_graph and is modified in place;
104 |         the method takes no graph argument and returns nothing.
105 |         Blocks that are targets of xrefs (has_xrefs_to) are never merged
106 |         into their predecessor.
107 |         """
108 |         numMerged = 0
109 |         logger.debug('Merging basic blocks...')
110 |         # flag variable to indicate whether a basic block was merged in a pass
111 |         bb_merged = True
112 | 
113 |         # Loop until no basic block can be merged any more
114 |         while bb_merged:
115 |             bb_merged = False
116 |             for bb in self.bb_graph.nodes():
117 |                 # The basic block should not have any xrefs and must have exactly one predecessor
118 |                 if not bb.has_xrefs_to and self.bb_graph.in_degree(bb) == 1:
119 |                     predecessorBB = list(self.bb_graph.predecessors(bb))[0]
120 | 
121 |                     # Predecessor basic block must have exactly one successor
122 |                     if self.bb_graph.out_degree(predecessorBB) == 1 and list(self.bb_graph.successors(predecessorBB))[0] == bb:
123 |                         # The predecessor block will be the merged block
124 |                         mergedBB = predecessorBB
125 | 
126 |                         # Get the last instruction
127 |                         last_ins = mergedBB.instructions[-1]
128 | 
129 |                         # Check if the last instruction is an un-conditional jump
130 |                         if last_ins.mnemonic == 'JUMP_FORWARD' or last_ins.mnemonic == 'JUMP_ABSOLUTE':
131 |                             # Remove the instruction as it is unnecessary after the blocks are merged
132 |                             del mergedBB.instructions[-1]
133 | 
134 |                         # Merge the block by adding all instructions
135 |                         for ins in bb.instructions:
136 |                             mergedBB.add_instruction(ins)
137 | 
138 |                         # If bb is a terminal node, mark the mergedBB as terminal too
139 |                         if bb in nx.get_node_attributes(self.bb_graph, 'isTerminal').keys():
140 |                             nx.set_node_attributes(self.bb_graph, {mergedBB: {'isTerminal': True}})
141 | 
142 |                         # Remove the edge
143 |                         self.bb_graph.remove_edge(mergedBB, bb)
144 | 
145 |                         for successorBB in list(self.bb_graph.successors(bb)):
146 |                             # Get existing type
147 |                             e_type = self.bb_graph.get_edge_data(bb, successorBB)['edge_type']
148 | 
149 |                             self.bb_graph.add_edge(mergedBB, successorBB, edge_type=e_type)
150 |                             logger.info('Adding {} edge from block {} to {}'.format(e_type, hex(id(mergedBB)),
151 |                                                                                     hex(id(successorBB))))
152 |                             self.bb_graph.remove_edge(bb, successorBB)
153 | 
154 |                         logger.debug('Basic block {} merged with block {}'.format(hex(id(bb)), hex(id(mergedBB))))
155 |                         assert self.bb_graph.degree(bb) == 0
156 |                         self.bb_graph.remove_node(bb)
157 |                         del bb
158 |                         bb_merged = True
159 |                         numMerged += 1
160 |                         break
161 |         logger.info('{} basic blocks merged.'.format(numMerged))
162 | 
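
The forwarder elimination pass can be illustrated without networkx. The sketch below works on a plain dictionary that maps each block name to its list of successors, and repeats until no forwarder is left. It is a deliberately simplified model: it ignores edge types, skips the implicit-edge check performed by `eliminate_forwarders`, and never removes the entry block. All names in it are hypothetical.

```python
def eliminate_forwarders(successors, forwarders, entry):
    """
    successors: dict mapping block name -> list of successor block names
    forwarders: set of block names that hold only an unconditional jump
    entry: name of the entry block, which is never removed in this sketch
    """
    graph = {node: list(succs) for node, succs in successors.items()}
    changed = True
    while changed:
        changed = False
        for node in list(graph):
            if node not in forwarders or node == entry:
                continue
            # A forwarder must have exactly one successor other than itself
            if len(graph[node]) != 1 or graph[node][0] == node:
                continue
            target = graph[node][0]
            # Redirect every edge that pointed at the forwarder to its target
            for succs in graph.values():
                succs[:] = [target if s == node else s for s in succs]
            del graph[node]
            changed = True
            break
    return graph


cfg = {'A': ['B', 'X'], 'B': ['C'], 'C': ['D'], 'X': ['D'], 'D': []}
simplified = eliminate_forwarders(cfg, forwarders={'B', 'C'}, entry='A')
print(simplified)
```

Restarting the scan after each removal mirrors the `break` in the original loop, which avoids iterating over a graph that has just been modified.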


--------------------------------------------------------------------------------
/utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/extremecoders-re/bytecode_simplifier/35467c62dd1292c8a727c4d2346f11ab35000ce2/utils/__init__.py


--------------------------------------------------------------------------------
/utils/rendergraph.py:
--------------------------------------------------------------------------------
 1 | import pydotplus
 2 | import networkx as nx
 3 | 
 4 | 
 5 | def render_bb(bb, is_entry, is_terminal):
 6 |     dot = hex(id(bb)) + ':\l\l'
 7 | 
 8 |     for ins in bb.instruction_iter():
 9 |         dot += ins.mnemonic.ljust(22)
10 | 
11 |         if ins.is_control_flow() or ins.has_xref():
12 |             dot += hex(id(ins.argval))
13 |         elif ins.arg is not None:
14 |             dot += str(ins.arg)
15 |         dot += '\l'
16 | 
17 |     if not is_entry and not is_terminal:
18 |         return pydotplus.Node(hex(id(bb)), label=dot, shape='box', fontname='Consolas')
19 | 
20 |     if is_entry:
21 |         return pydotplus.Node(hex(id(bb)), label=dot, shape='box', style='filled', color='cyan', fontname='Consolas')
22 | 
23 |     return pydotplus.Node(hex(id(bb)), label=dot, shape='box', style='filled', color='orange', fontname='Consolas')
24 | 
25 | 
26 | def render_graph(bb_graph, filename):
27 |     """
28 |     Renders a basic block graph to file
29 | 
30 |     :param bb_graph: The Graph to render
31 |     :type bb_graph: networkx.DiGraph
32 |     """
33 |     graph = pydotplus.Dot(graph_type='digraph', rankdir='TB')
34 |     entryblock = nx.get_node_attributes(bb_graph, 'isEntry').keys()[0]
35 |     returnblocks = nx.get_node_attributes(bb_graph, 'isTerminal').keys()
36 | 
37 |     nodedict = {}
38 | 
39 |     for bb in bb_graph.nodes():
40 |         node = render_bb(bb, bb == entryblock, bb in returnblocks)
41 |         if bb == entryblock:
42 |             sub = pydotplus.Subgraph('sub', rank='source')
43 |             sub.add_node(node)
44 |             graph.add_subgraph(sub)
45 |         else:
46 |             graph.add_node(node)
47 |         nodedict[bb] = node
48 | 
49 |     for edge in bb_graph.edges(data=True):
50 |         src = nodedict[edge[0]]
51 |         dest = nodedict[edge[1]]
52 |         e_style = 'dashed' if edge[2]['edge_type'] == 'implicit' else 'solid'
53 | 
54 |         graph.add_edge(pydotplus.Edge(src, dest, style=e_style))
55 |     # graph.set('splines', 'ortho')
56 |     # graph.set_prog('neato')
57 |     # graph.set('dpi', '100')
58 | 
59 |     graph.write(filename, format='svg')
60 | 


--------------------------------------------------------------------------------
/utils/striplineno.py:
--------------------------------------------------------------------------------
 1 | import sys
 2 | 
 3 | import marshal
 4 | import types
 5 | 
 6 | 
 7 | def process(co):
 8 |     co_constants = []
 9 |     for const in co.co_consts:
10 |         if isinstance(const, types.CodeType):
11 |             co_constants.append(process(const))
12 |         else:
13 |             co_constants.append(const)
14 | 
15 |     return types.CodeType(co.co_argcount, co.co_nlocals, co.co_stacksize, co.co_flags,
16 |                           co.co_code, tuple(co_constants), co.co_names, co.co_varnames,
17 |                           co.co_filename, co.co_name, 1, '', co.co_freevars, co.co_cellvars)
18 | 
19 | 
20 | def main():
21 |     print sys.argv[1]
22 |     fn = sys.argv[1]
23 |     inf = open(fn, 'rb')
24 |     header = inf.read(4)
25 |     assert header == '\x03\xf3\x0d\x0a'
26 |     inf.read(4)  # Discard the 4-byte modification timestamp
27 |     co = marshal.load(inf)
28 |     inf.close()
29 |     outf = open('noline.pyc', 'wb')
30 |     outf.write('\x03\xf3\x0d\x0a\0\0\0\0')
31 |     marshal.dump(process(co), outf)
32 |     outf.close()
33 | 
34 | 
35 | if __name__ == '__main__':
36 |     main()
37 | 
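
`process` rebuilds a code object recursively because nested functions and classes are stored as code objects inside the parent's `co_consts`. The sketch below isolates that traversal pattern. `iter_code_objects` is a hypothetical helper written for illustration, not a function from this repository, and it only reads the code objects instead of rebuilding them.

```python
import types


def iter_code_objects(co):
    """Yield co, then every code object nested in its constants, depth-first."""
    yield co
    for const in co.co_consts:
        if isinstance(const, types.CodeType):
            for nested in iter_code_objects(const):
                yield nested


if __name__ == '__main__':
    # Compile a small module containing a function nested inside another.
    source = 'def outer():\n    def inner():\n        pass\n'
    module = compile(source, '<demo>', 'exec')
    print([c.co_name for c in iter_code_objects(module)])
```

The same walk could be used to list every code object in a `.pyc` before and after stripping line numbers, as a quick check that no nested object was lost.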


--------------------------------------------------------------------------------
/verifier.py:
--------------------------------------------------------------------------------
 1 | import logging
 2 | import networkx as nx
 3 | 
 4 | logger = logging.getLogger(__name__)
 5 | 
 6 | 
 7 | def verify_graph(bb_graph):
 8 |     """
 9 |     Verify the graph for correctness
10 | 
11 |     :param bb_graph:
12 |     :type bb_graph: nx.DiGraph
13 |     """
14 |     try:
15 |         # There must exist exactly one entry point
16 |         numEntryPoint = len(nx.get_node_attributes(bb_graph, 'isEntry'))
17 |         if numEntryPoint != 1:
18 |             logger.error('Basic block graph has {} entrypoint(s)'.format(numEntryPoint))
19 |             raise Exception
20 | 
21 |         # The entry point must have an in-degree of zero
22 |         i_degree_entry = bb_graph.in_degree(nx.get_node_attributes(bb_graph, 'isEntry').keys()[0])
23 | 
24 |         if i_degree_entry != 0:
25 |             logger.error('The entry point basic block has an in degree of {}'.format(i_degree_entry))
26 |             raise Exception
27 | 
28 |         for bb in bb_graph.nodes():
29 |             o_degree = bb_graph.out_degree(bb)
30 |             # A basic block can have 0, 1 or 2 successors
31 |             if o_degree > 2:
32 |                 logger.error('Basic block {} has an out degree of {}'.format(hex(id(bb)), o_degree))
33 |                 raise Exception
34 | 
35 |             # A basic block with an out-degree of 0 must end with a RETURN_VALUE instruction
36 |             if o_degree == 0:
37 |                 if bb.instructions[-1].mnemonic != 'RETURN_VALUE':
38 |                     logger.error('Basic block {} has an out degree of zero, but does not end with RETURN_VALUE'.format(
39 |                         hex(id(bb))))
40 |                     raise Exception
41 | 
42 |             # A basic block with an out-degree of 2 cannot have both out-edges of the same type (explicit or implicit)
43 |             if o_degree == 2:
44 |                 o_edges = list(bb_graph.out_edges(bb, data=True))
45 |                 if o_edges[0][2]['edge_type'] == 'explicit' and o_edges[1][2]['edge_type'] == 'explicit':
46 |                     logger.error('Basic block {} has both out edges of explicit type'.format(hex(id(bb))))
47 |                     raise Exception
48 |                 if o_edges[0][2]['edge_type'] == 'implicit' and o_edges[1][2]['edge_type'] == 'implicit':
49 |                     logger.error('Basic block {} has both out edges of implicit type'.format(hex(id(bb))))
50 |                     raise Exception
51 | 
52 |             i_degree = bb_graph.in_degree(bb)
53 | 
54 |             # A basic block can have at most one implicit in-edge
55 |             if i_degree > 0:
56 |                 numImplicitEdges = 0
57 |                 for edge in bb_graph.in_edges(bb, data=True):
58 |                     if edge[2]['edge_type'] == 'implicit':
59 |                         numImplicitEdges += 1
60 | 
61 |                 if numImplicitEdges > 1:
62 |                     logger.error('Basic block {} has {} implicit in edges'.format(hex(id(bb)), numImplicitEdges))
63 |                     raise Exception
64 | 
65 |             if i_degree == o_degree == 0:
66 |                 logger.error('Orphaned block {} has no edges'.format(hex(id(bb))))
67 | 
68 |     except Exception as ex:
69 |         print ex
70 |         return False
71 |     return True
72 | 
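
The out-edge rules enforced by `verify_graph` can be tried out without `networkx`. The following is a rough, dependency-free sketch of two of them: at most two successors per block, and two successors must not share an edge type. `check_out_edges` and the `(src, dst, edge_type)` triples are assumptions made for this example. The real verifier works on a `networkx.DiGraph` of basic block objects.

```python
from collections import defaultdict


def check_out_edges(edges):
    """Return a list of violation messages for (src, dst, edge_type) triples."""
    # Collect the edge types leaving each source block.
    out_types = defaultdict(list)
    for src, _dst, edge_type in edges:
        out_types[src].append(edge_type)

    problems = []
    for src in sorted(out_types):
        kinds = out_types[src]
        if len(kinds) > 2:
            # More than two successors is never valid.
            problems.append('{} has an out-degree of {}'.format(src, len(kinds)))
        elif len(kinds) == 2 and kinds[0] == kinds[1]:
            # Two successors must be one explicit and one implicit edge.
            problems.append('{} has two {} out-edges'.format(src, kinds[0]))
    return problems


if __name__ == '__main__':
    # Hypothetical block 'a' with two explicit successors: one violation.
    print(check_out_edges([('a', 'b', 'explicit'), ('a', 'c', 'explicit')]))
```

Returning a list of messages rather than raising on the first failure is a design choice of this sketch. It makes it possible to see every violation in one pass, whereas `verify_graph` stops at the first one.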


--------------------------------------------------------------------------------