The dcc decompiler was developed by Cristina Cifuentes while a PhD student at
the Queensland University of Technology
(QUT), Australia, 1991-4, under the supervision of Professor
John Gough. Mike Van Emmerik developed the library signature
recognition algorithms while employed by QUT. The dcc distribution is made
available under the GPL. The readme
file provides information about the distribution. We do not provide
support for this decompiler; if you email, you'll get the standard
reply. However, we participate in the Boomerang
open source project, which aims to create a retargetable decompiler based on
some of the dcc and UQBT ideas, design, and/or implementations.
Notice
Decompilation is a technique that allows you to recover lost source code. It
is also needed in some cases for computer security, interoperability and error
correction. dcc, and any decompiler in general, should not be used for
"cracking" other programs, as programs are protected by copyright. Cracking
programs is not only illegal but also rides on others' creative effort. See the ethics of
decompilation for more information.

dcc
The dcc decompiler decompiles .exe files from the
(i386, DOS) platform to C programs. The final C program contains assembler code
for any subroutines that cannot be decompiled to a higher level
than assembler.

The analysis performed by dcc is based on traditional compiler
optimization techniques and graph theory. The former is capable of eliminating
registers and intermediate instructions to reconstruct high-level statements;
the latter is capable of determining the control structures in each subroutine.

Please note that at present, only C source is produced; dcc cannot (as
yet) produce C++ source.

The structure of a decompiler resembles that of a compiler: a front-,
middle-, and back-end which perform separate tasks. The front-end is a
machine-language dependent module that reads in machine code for a particular
machine and transforms it into an intermediate, machine-independent
representation of the program. The middle-end (aka the Universal Decompiling
Machine or UDM) is a machine and language independent module that performs the
core of the decompiling analysis: data flow and control flow analysis. Finally,
the back-end is high-level language dependent and generates code for the program
(C in the case of dcc).

In practice, several programs
are used with the decompiler to create the high-level program. These programs
aid in the detection of compiler and library signatures, hence augmenting the
readability of programs and eliminating compiler start-up and library routines
from the decompilation analysis.

Example of Decompilation
We illustrate the
decompilation of a Fibonacci program (see Figure 4). Figure 1 illustrates the
relevant machine code of this binary. No library or compiler start-up code is
included. Figure 2 presents the disassembly of the binary program. All calls to
library routines were detected by dccSign (the signature matcher), and thus not
included in the analysis. Figure 3 is the final output from dcc. This C program
can be compared with the original C program in Figure 4.

C Cifuentes. Reverse
Compilation Techniques, Queensland University of Technology, Department of
Computer Science, PhD thesis.
July 1994. (474 Kb compressed postscript file). Also available in compressed dvi
format (365 Kb).

ABSTRACT

Techniques for writing reverse compilers or decompilers are presented in this
thesis. These techniques are based on compiler and optimization theory, and are
applied to decompilation in a unique way; these techniques have never before
been published.

A decompiler is composed of several phases which are grouped into modules
dependent on language or machine features. The front-end is a machine dependent
module that parses the binary program, analyzes the semantics of the
instructions in the program, and generates an intermediate low-level
representation of the program, as well as a control flow graph of each
subroutine. The universal decompiling machine is a language and machine
independent module that analyzes the low-level intermediate code and transforms
it into a high-level representation available in any high-level language, and
analyzes the structure of the control flow graph(s) and transforms them into
graphs that make use of high-level control structures. Finally, the back-end is
a target language dependent module that generates code for the target language.

Decompilation is a process that involves the use of tools to load the binary
program into memory, parse or disassemble such a program, and decompile or
analyze the program to generate a high-level language program. This process
benefits from compiler and library signatures to recognize particular compilers
and library subroutines. Whenever a compiler signature is recognized in the
binary program, compiler start-up and library subroutines are not
decompiled: in the former case, the routines are eliminated from the final
target program and the entry point to the main program is used for the
decompiler analysis; in the latter case, the subroutines are replaced by their
library name.

The presented techniques were implemented in a prototype decompiler for the
Intel i80286 architecture running under the DOS operating system, dcc, which
produces target C programs for source .exe or .com files. Sample decompiled
programs, comparisons against the initial high-level language program, and an
analysis of results are presented in Chapter 9.

Chapter 1 gives an introduction to decompilation from a compiler point of
view, Chapter 2 gives an overview of the history of decompilation since its
appearance in the early 1960s, Chapter 3 presents the relations between the
static binary code of the source binary program and the actions performed at
run-time to implement the program, Chapter 4 describes the phases of the
front-end module, Chapter 5 defines data optimization techniques to analyze the
intermediate code and transform it into a higher-level representation, Chapter 6
defines control structure transformation techniques to analyze the structure of
the control flow graph and transform it into a graph of high-level control
structures, Chapter 7 describes the back-end module, Chapter 8 presents the
decompilation tool programs, Chapter 9 gives an overview of the implementation
of dcc and the results obtained, and Chapter 10 gives the conclusions and future
work of this research.

The techniques presented in this thesis expand on earlier work described in
the literature. Previous work in decompilation did not document the
interprocedural register analysis required to determine register arguments and
register return values, the analysis required to eliminate stack-related
instructions (i.e. push and pop), or the structuring of a generic set of control
structures. Innovative work done for this research is described in Chapters 5,
6, and 8. Chapter 5, Sections 5.2 and 5.4 illustrate and describe nine different
types of optimizations that transform the low-level intermediate code into a
high-level representation. These optimizations take into account condition
codes, subroutine calls (i.e. interprocedural analysis) and register spilling,
eliminating all low-level features of the intermediate instructions (such as
condition codes and registers) and introducing the high-level concept of
expressions into the intermediate representation. Chapter 6, Sections 6.2 and
6.6 illustrate and describe algorithms to structure different types of loops and
conditionals, including multi-way branch conditionals (e.g. case statements).
Previous work in this area has concentrated on the structuring of loops; few
papers attempt to structure 2-way conditional branches, and no work on multi-way
conditional branches is described in the literature. This thesis presents a
complete method for structuring all types of structures based on a
predetermined, generic set of high-level control structures. A criterion for
determining the generic set of control structures is given in Chapter 6, Section
6.4. Chapter 8 describes all tools used to decompile programs; the most
important tool is the signature generator (Section 8.2), which is used to
determine compiler and library signatures in architectures that have an
operating system that does not share libraries, such as the DOS operating system.

Future Work - A Retargetable Decompiler
A retargetable decompiler engine can be built based on ideas and code from the
UQBT project, by reusing the
frontend of that framework and writing a new backend that supports the RTL and
HRTL intermediate representation of the UQBT system. Please refer to the
open source project Boomerang.

dcc Distribution
The dcc source code distribution is
made available under the GNU General Public
License (GPL).
The dcc distribution is available in gzip tar format for Unix users, dcc.tar.gz
and dcc_oo.tar.gz, and in its individual .zip
files for PC users, dcc files pages. Read the readme file for a
description of what is included in the distribution and installation
instructions. If you do not have the tar and/or pkunzip programs, contact your
system administrator.

There is now a second version of the decompiler; mainly to distinguish it
from the first, we'll call it the "OO" version (it has the beginnings of Object
Orientation, but there is still much to be done). This version fixes a bug
which caused the output to be wrong some of the time (randomly; successive runs
would result in different output). The source for dcc has also been converted to
C++ (dcc itself still does not produce C++ source), so those users wishing to use a C
compiler without C++ facilities will have to stick to the original version. The
file dccsrcoo.zip has the source for the later version, and dcc_oo.tar.gz
has the whole distribution, with dccsrcoo.zip instead of
dccsrc.zip and dcc32.zip instead of dcc.zip. This
version has a better chance of working on PC compilers such as Microsoft Visual
C++ and Borland C++. There is no longer any use of the curses library; it
was found to be too much of a distribution hassle.

The OO version of dcc is the most recent, and has bug fixes that the
original does not. For most purposes, the OO version is the one to start working
with.

Support: Please note that the authors are not currently working on
this project and therefore cannot support any changes required on dcc. Source
code is provided "as is". Read the documentation first.

Likewise, please don't email the authors with requests for modifications to
dcc, or specific questions about its inner workings. If you do, you will just
get a reply with this
form letter.

Note: dcc has a fundamental
implementation flaw that limits it to about 30KB of input binary program, i.e.
it currently handles toy programs only! The problem is that pointers are kept in
many places; many of these pointers point to elements of arrays. The arrays are
all of variable size; the realloc library function can and will change the
virtual addresses of these arrays, thus invalidating the pointers. Because of
this, results are unpredictable as soon as one array is resized (however, a
segmentation fault is likely when this happens). The arrays are sized such that
they don't get reallocated for input binaries less than about 30KB.
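
The failure mode described above can be illustrated with a minimal C sketch. It is not taken from the dcc source: the names int_array_s, array_push and array_at are hypothetical, and the growth policy is an assumption. The point it shows is that an element pointer taken before a resize may dangle afterwards, whereas an element index stays valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable array; not a dcc data structure. */
struct int_array_s {
    int *data;
    size_t len;
    size_t cap;
};

/* Append a value and return its index, or (size_t)-1 if allocation fails. */
size_t array_push(struct int_array_s *a, int value)
{
    if (a->len == a->cap) {
        size_t new_cap = a->cap ? a->cap * 2 : 4;
        int *p = realloc(a->data, new_cap * sizeof *p);
        if (p == NULL)
            return (size_t)-1;
        /* The block may have moved: any int* obtained earlier from
           a->data is now invalid, but indices are unaffected. */
        a->data = p;
        a->cap = new_cap;
    }
    a->data[a->len] = value;
    return a->len++;
}

/* Turn an index into a pointer only at the moment of use. */
int *array_at(struct int_array_s *a, size_t index)
{
    assert(index < a->len);
    return &a->data[index];
}
```

Keeping the index returned by array_push, and calling array_at each time the element is needed, avoids holding a pointer across a resize.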
403 |
Before any serious work can be done with dcc, this implementation flaw has to
be corrected. As noted above, the authors do not have the time to correct this
error, or to offer any suggestions as to how to do this.

Last updated: 4th May 2002

This page: http://www.itee.uq.edu.au/~cristina/dcc.html
In this email, I argue that LLVM IR is a poor system for building a
33 | Platform, by which I mean any system where LLVM IR would be a
34 | format in which programs are stored or transmitted for subsequent
35 | use on multiple underlying architectures.
36 |
37 | LLVM IR initially seems like it would work well here. I myself was
38 | once attracted to this idea. I was even motivated to put a bunch of
39 | my own personal time into making some of LLVM's optimization passes
40 | more robust in the absence of TargetData a while ago, even with no
41 | specific project in mind. There are several things still missing,
42 | but one could easily imagine that this is just a matter of people
43 | writing some more code.
44 |
45 | However, there are several ways in which LLVM IR differs from actual
46 | platforms, both high-level VMs like Java or .NET and actual low-level
47 | ISAs like x86 or ARM.
48 |
49 | First, the boundaries of what capabilities LLVM provides are nebulous.
50 | LLVM IR contains:
51 |
52 | * Explicitly Target-specific features. These aren't secret;
53 | x86_fp80's reason for being is pretty clear.
54 |
55 | * Target-specific ABI code. In order to interoperate with native
56 | C ABIs, LLVM requires front-ends to emit target-specific IR.
57 | Pretty much everyone around here has run into this.
58 |
59 | * Implicitly Target-specific features. The most obvious examples of
60 | these are all the different Linkage kinds. These are all basically
61 | just gateways to features in real linkers, and real linkers vary
62 | quite a lot. LLVM has its own IR-level Linker, but it doesn't
63 | do all the stuff that native linkers do.
64 |
65 | * Target-specific limitations in seemingly portable features.
66 | How big can the alignment be on an alloca? Or a GlobalVariable?
67 | What's the widest supported integer type? LLVM's various backends
68 | all have different answers to questions like these.
69 |
70 | Even ignoring the fact that the quality of the backends in the
71 | LLVM source tree varies widely, the question of "What can LLVM IR do?"
72 | has numerous backend-specific facets. This can be problematic for
73 | producers as well as consumers.
74 |
75 | Second, and more fundamentally, LLVM IR is a fundamentally
76 | vague language. It has:
77 |
78 | * Undefined Behavior. LLVM is, at its heart, a C compiler, and
79 | Undefined Behavior is one of its cornerstones.
80 |
81 | High-level VMs typically raise predictable exceptions when they
82 | encounter program errors. Physical machines typically document
83 | their behavior very extensively. LLVM is fundamentally different
84 | from both: it presents a bunch of rules to follow and then offers
85 | no description of what happens if you break them.
86 |
87 | LLVM's optimizers are built on the assumption that the rules
88 | are never broken, so when rules do get broken, the code just
89 | goes off the rails and runs into whatever happens to be in
90 | the way. Sometimes it crashes loudly. Sometimes it silently
91 | corrupts data and keeps running.
92 |
93 | There are some tools that can help locate violations of the
94 | rules. Valgrind is a very useful tool. But they can't find
95 | everything. There are even some kinds of undefined behavior that
96 | I've never heard anyone even propose a method of detection for.
97 |
98 | * Intentional vagueness. There is a strong preference for defining
99 | LLVM IR semantics intuitively rather than formally. This is quite
100 | practical; formalizing a language is a lot of work, it reduces
101 | future flexibility, and it tends to draw attention to troublesome
102 | edge cases which could otherwise be largely ignored.
103 |
104 | I've done work to try to formalize parts of LLVM IR, and the
105 | results have been largely fruitless. I got bogged down in
106 | edge cases that no one is interested in fixing.
107 |
108 | * Floating-point arithmetic is not always consistent. Some backends
109 | don't fully implement IEEE-754 arithmetic rules even without
110 | -ffast-math and friends, to get better performance.
111 |
112 | If you're familiar with "write once, debug everywhere" in Java,
113 | consider the situation in LLVM IR, which is fundamentally opposed
114 | to even trying to provide that level of consistency. And if you allow
115 | the optimizer to do subtarget-specific optimizations, you increase
116 | the chances that some bit of undefined behavior or vagueness will be
117 | exposed.
118 |
119 | Third, LLVM is a low level system that doesn't represent high-level
120 | abstractions natively. It forces them to be chopped up into lots of
121 | small low-level instructions.
122 |
123 | * It makes LLVM's Interpreter really slow. The amount of work
124 | performed by each instruction is relatively small, so the interpreter
125 | has to execute a relatively large number of instructions to do simple
126 | tasks, such as virtual method calls. Languages built for interpretation
127 | do more with fewer instructions, and have lower per-instruction
128 | overhead.
129 |
130 | * Similarly, it makes really-fast JITing hard. LLVM is fast compared
131 | to some other static C compilers, but it's not fast compared to
132 | real JIT compilers. Compiling one LLVM IR level instruction at a
133 | time can be relatively simple, ignoring the weird stuff, but this
134 | approach generates comically bad code. Fixing this requires
135 | recognizing patterns in groups of instructions, and then emitting
136 | code for the patterns. This works, but it's more involved.
137 |
* Lowering high-level language features into low-level code locks
  in implementation details. This is less severe in native code,
  because a compiled blob is limited to a single hardware platform
  as well. But a platform which advertises architecture independence
  but still has all the ABI lock-in of HLL implementation details
  presents a much more frightening backwards compatibility specter.
144 |
* Apple has some LLVM IR transformations for Objective-C; however,
  the transformations have to reverse-engineer the high-level semantics
  out of the lowered code, which is awkward. Further, they're
  reasoning about high-level semantics in a way that isn't guaranteed
  to be safe by LLVM IR rules alone. It works for the kinds of code
  clang generates for Objective-C, but it wouldn't necessarily be
  correct if run on code produced by other front-ends. LLVM IR
  isn't capable of representing the necessary semantics for this
  unless we start embedding Objective-C into it.
154 |
155 |
156 | In conclusion, consider the task of writing an independent implementation
157 | of an LLVM IR Platform. The set of capabilities it provides depends on who
158 | you talk to. Semantic details are left to chance. There are features
159 | which require a bunch of complicated infrastructure to implement which
160 | are rarely used. And if you want light-weight execution, you'll
161 | probably need to translate it into something else better suited for it
162 | first. This all doesn't sound very appealing.
163 |
LLVM isn't actually a virtual machine. It's widely acknowledged that the
name "LLVM" is a historical artifact which doesn't reliably connote what
LLVM actually grew to be. LLVM IR is a compiler IR.
167 |
168 | Dan
--------------------------------------------------------------------------------
/llvm/lnq.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/llvm/lnq.pdf
--------------------------------------------------------------------------------
/llvm/x86-llvm-translator-chipounov_2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/llvm/x86-llvm-translator-chipounov_2.pdf
--------------------------------------------------------------------------------
/optimizing/optimizing_cpp.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/optimizing/optimizing_cpp.pdf
--------------------------------------------------------------------------------
/optimizing/test61.c:
--------------------------------------------------------------------------------
/* Test a condition that results in no branches when compiled with -O2 */
/* 0000000000000000 <test61>:
   0: 83 ff 01    cmp $0x1,%edi
   3: 19 c0       sbb %eax,%eax
   5: 83 e0 df    and $0xffffffdf,%eax
   8: 83 c0 61    add $0x61,%eax
   b: c3          retq
*/
/* Suggest re-write it to:
0000000000000000 <test61>:
   0: cmp $0x1,%edi     Node A
   1: jb 4:             Node A
   2: mov $0x61,%eax    Node B
   3: jmp 5:            Node B
   4: mov $0x40,%eax    Node C
   5: retq              Node D (Due to join point)
*/

int test61 ( unsigned value );

int test61 ( unsigned value ) {
    int ret;
    if (value < 1)
        ret = 0x40;
    else
        ret = 0x61;
    return ret;
}

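
As a cross-check on the suggested rewrite, the branch-free instruction sequence can be modelled directly in C. This is only a sketch: the function names test61_branch, test61_branchless and check_equivalence are hypothetical, and the model assumes 32-bit wrap-around arithmetic for %eax.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form, equivalent to test61(). */
int test61_branch(unsigned value)
{
    return (value < 1) ? 0x40 : 0x61;
}

/* Models the -O2 sequence one instruction at a time. */
int test61_branchless(unsigned value)
{
    /* cmp $0x1,%edi ; sbb %eax,%eax
       eax becomes 0xffffffff when value < 1 (carry set), else 0. */
    uint32_t eax = 0u - (uint32_t)(value < 1);
    eax &= 0xffffffdfu;   /* and $0xffffffdf,%eax */
    eax += 0x61u;         /* add $0x61,%eax (wraps modulo 2^32) */
    return (int)eax;
}

/* Asserts that both forms agree for every value below limit. */
void check_equivalence(unsigned limit)
{
    for (unsigned v = 0; v < limit; v++)
        assert(test61_branch(v) == test61_branchless(v));
}
```

For value == 0 the mask leaves 0xffffffdf, and adding 0x61 wraps to 0x40; for any other value the mask leaves 0 and the result is 0x61. Those are the two constants a decompiler would need to recover as Nodes C and B.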
--------------------------------------------------------------------------------
/rtl/rep.txt:
--------------------------------------------------------------------------------
The REP prefix.

The variables repz and repnz will each be either 1 or 0.
Any decoded instruction can use those variables to decide how to craft the RTL.

E.g. in amd64 att format:
rep movs rsi, rdi;
The RTL for this will be:
begin:
    cmp ecx, 0;
    jz next_instruction;
    mov rsi, rdi;
    add 4, rsi;
    add 4, rdi;
    sub 1, ecx;
    jmp begin;
next_instruction:

rep cmps rsi, rdi;
The RTL for this will be:
begin:
    repz cmp ecx, 0;
    jz next_instruction;
    cmp rsi, rdi;
    jz next_instruction;
    add 4, rsi;
    add 4, rdi;
    sub 1, ecx;
    jmp begin;
next_instruction:
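
The rep movs expansion can be sketched in C to make the loop structure concrete. This is a sketch under stated assumptions, not the RTL generator itself: it assumes 4-byte elements, the forward direction (DF = 0), and it includes the decrement of the count register that the REP prefix performs on each iteration. The names regs_s and rep_movs32 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file holding only what rep movs touches. */
struct regs_s {
    uint64_t rcx;        /* remaining element count */
    const uint8_t *rsi;  /* source pointer */
    uint8_t *rdi;        /* destination pointer */
};

/* Emulates "rep movs" for 4-byte elements with DF = 0. */
void rep_movs32(struct regs_s *r)
{
    assert(r != NULL);
    while (r->rcx != 0) {           /* cmp rcx, 0 ; jz next_instruction */
        memcpy(r->rdi, r->rsi, 4);  /* copy one element from [rsi] to [rdi] */
        r->rsi += 4;                /* add 4, rsi */
        r->rdi += 4;                /* add 4, rdi */
        r->rcx -= 1;                /* count decrement done by REP */
    }                               /* jmp begin */
    /* next_instruction: */
}
```

The loop has a single entry at the top and a single exit at the bottom, which matches the property that no jump targets the middle of the expansion.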
29 |
30 | The above code arrangement has the advantage that all jz and jmp instructions are to amd64 intruction boundarys and none are to specific rtl instructions. I.e. No Jumps to the middle of a set of rtl instructions decoded from a single amd64 instruction.
31 |
32 |
33 |
34 |
--------------------------------------------------------------------------------
/rtl/test.txt:
--------------------------------------------------------------------------------
Pseudo code for the TEST instruction.

TEMP = SRC1 AND SRC2;
SF = MSB(TEMP);
IF TEMP = 0
    ZF = 1;
ELSE
    ZF = 0;
FI;
PF = BitwiseXNOR(TEMP[0:7]);
CF = 0;
OF = 0;
(* AF is undefined *)
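
The pseudo code above translates fairly directly into C. The following is a sketch only: the struct test_flags_s and the function test_flags are hypothetical names, the operand size is passed explicitly as an assumption, and AF is simply not modelled because it is undefined.

```c
#include <assert.h>
#include <stdint.h>

/* Flags written by TEST; AF is undefined and therefore omitted. */
struct test_flags_s {
    int sf, zf, pf, cf, of;
};

/* Computes the flags for TEST src1, src2 at the given operand size. */
struct test_flags_s test_flags(uint64_t src1, uint64_t src2, unsigned size_bits)
{
    struct test_flags_s f;
    uint64_t mask, temp;
    unsigned low;

    assert(size_bits == 8 || size_bits == 16 ||
           size_bits == 32 || size_bits == 64);
    mask = (size_bits == 64) ? UINT64_MAX : ((UINT64_C(1) << size_bits) - 1);
    temp = (src1 & src2) & mask;               /* TEMP = SRC1 AND SRC2 */

    f.sf = (int)((temp >> (size_bits - 1)) & 1); /* SF = MSB(TEMP) */
    f.zf = (temp == 0);

    /* PF = XNOR of the low 8 bits: fold them down to one parity bit.
       PF is 1 when the low byte has an even number of set bits. */
    low = (unsigned)(temp & 0xff);
    low ^= low >> 4;
    low ^= low >> 2;
    low ^= low >> 1;
    f.pf = !(low & 1);

    f.cf = 0;
    f.of = 0;
    return f;
}
```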
14 |
--------------------------------------------------------------------------------
/second-write/Anand.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/second-write/Anand.pdf
--------------------------------------------------------------------------------
/second-write/readme.txt:
--------------------------------------------------------------------------------
1 | There does not appear to be any open source code available.
2 |
--------------------------------------------------------------------------------
/second-write/secondwrite.sec11.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/second-write/secondwrite.sec11.pdf
--------------------------------------------------------------------------------
/software-maintenance/10.1.1.89.2073-graphs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/software-maintenance/10.1.1.89.2073-graphs.pdf
--------------------------------------------------------------------------------
/source-source-translator/ROSE-Tutorial.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/source-source-translator/ROSE-Tutorial.pdf
--------------------------------------------------------------------------------
/type-detection/TIE - Principled Reverse Engineering of Types in Binary Programs.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jcdutton/reference/3c950860cc90a22e24a8669a0912fe65e1403395/type-detection/TIE - Principled Reverse Engineering of Types in Binary Programs.pdf
--------------------------------------------------------------------------------
/type-detection/data-type-determination.txt:
--------------------------------------------------------------------------------
Copyright (C) 2004-2017 James Courtier-Dutton

IDEAS regarding data type determination.

P1,P2 are pointers
I1,I2 are integers

OP1 OP2
P1 + P2 = Invalid
P1 - P2 = Integer
P1 * P2 = Invalid
P1 / P2 = Invalid

P1 + I1 = Pointer
P1 - I1 = Pointer
P1 * I1 = Invalid
P1 / I1 = Invalid

I1 + I2 = Integer
I1 - I2 = Integer
I1 * I2 = Integer
I1 / I2 = Integer

I1 + P1 = Pointer
I1 - P1 = Invalid
I1 * P1 = Invalid
I1 / P1 = Invalid

+ can be P1 + I1, I1 + P1, I1 + I2
- can be P1 - P2, P1 - I1, I1 - I2
* can be I1 * I2
/ can be I1 / I2
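
The table above is small enough to encode as a single function. This is an illustrative sketch rather than existing project code; the enum kind_e and the function result_kind are hypothetical names.

```c
#include <assert.h>

enum kind_e { KIND_INVALID = 0, KIND_INT, KIND_PTR };

/* Returns the kind of "lhs op rhs" according to the table above.
   op must be one of '+', '-', '*', '/'. */
enum kind_e result_kind(enum kind_e lhs, char op, enum kind_e rhs)
{
    assert(op == '+' || op == '-' || op == '*' || op == '/');

    if (lhs == KIND_INVALID || rhs == KIND_INVALID)
        return KIND_INVALID;
    if (lhs == KIND_INT && rhs == KIND_INT)
        return KIND_INT;                              /* I1 op I2 */
    if (lhs == KIND_PTR && rhs == KIND_PTR)
        return (op == '-') ? KIND_INT : KIND_INVALID; /* only P1 - P2 */
    if (lhs == KIND_PTR)                              /* P1 op I1 */
        return (op == '+' || op == '-') ? KIND_PTR : KIND_INVALID;
    return (op == '+') ? KIND_PTR : KIND_INVALID;     /* I1 op P1 */
}
```

Used in reverse, the same table narrows the operands: if the result of a '*' or '/' is valid, both operands must be integers.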

1) A variable is a pointer if it is used to load/store another variable in memory.
2) Signed/Unsigned integer can be determined from branch statements and other hints.
3) Integer or String can be determined depending on how it is used.
    E.g. If it is used as the format variable in a printf() it is a string.
    If it is used in the variable args of a printf() statement and the format string identifies it as %s, it is a string.

IDEAS and work in progress:
Create a graph for each value or label.
At the top of the graph have the instruction that directly determines the type of a value.
E.g. A STORE or LOAD determines that it is a pointer and the size of the value being stored.
The nodes of the graph are a combined key of value and instruction.

Each value has a list of instructions that use that value.
One of the instructions caused the creation of the SSA value.
The rest of the instructions used the value.

The problem is how to link these value-instruction nodes.
Investigate a rule that type inference inherits only from above.
Can you arrange the nodes so this works?

Rules:
1) If one valueA-instruction is a LOAD/STORE so the type is clear, all other valueA-instructions have the same type, so you can raise them to the top of the graph.
2) A sanity check needs to be made so that all valueA-instructions have the same type.
3) Links between nodes are based on instructions. So, an instruction with 3 params will have 1 node at the top, and 2 nodes below it, with links.
   But then the next nodes below those will be the previous nodes above them. So, do we then have triangle graph links?
4) If one of the nodes below is a top node, a horizontal or upwards link is used.
5) Also need another link type for different instructions of the same value.
6) Need to work out if order of instructions affects the type?
7) Is this a problem that can be solved with linear programming?
Each value has two properties:
1) Pointer or Int
2) size
Note: (1) Pointer or Int is wrong. We instead need the type of pointer.


Maybe use a scatter diagram.
Each node is a value-instruction.
Each link between the nodes infers inheritance in the link direction.
Sometimes, an inheritance is inferred only from a combination of more than one node.
So, a new node type is needed, a decision node, which combines more than one link into a single link.
Also, a node that is just the "value" with the links determining the type of the value.

IDEA 3: Type Inference Propagation
Alternatively, create a long list of "if (...) then ..." statements, and then try to solve them.
Need a way to resolve circular statements.

Maybe, if the type is clear, the inheritance should be clear, and thus the value can be removed from the model.
The "if (...) then ..." statements can be added to and removed from the context depending on whether the types of the params in the (...) are clear or not.
First pass, selects the ones in context, sorts them to move them to the top of the list. Some of them might have (...) being 1, i.e. always true.
Next, the ones selected are executed.
Based on the result, see if any others can be added to the context.
Next, these new ones are executed.
Repeat until no new ones can be added to the context.

If, when executing the rules, a type is changed, e.g. 32bit goes to 64bit, then need to modify the instruction stream with bitcasts etc. and re-run the analysis.
If a bitcast is done, retain the bitcast relationship so if a bitcast is needed later on, the same one can be utilised.
If a bitcast is added, the types on all the affected instructions could be reset to unknown, and then the rules can be run again on those instructions
in order to gain type information. This area is difficult because we don't know exactly how to unwind the previous induction steps.
So, we don't know which instructions have been used to make the type inductions, so we will not know if the changed instructions previously contributed to
the type induction.
Adding a bitcast instruction might modify the SSA labeling. A label would be split in 2. The label up until the new bitcast would be the same.
The label after the bitcast would be new.
The reason to add the bitcast instruction is due to the immediately following instruction needing it. So, with this pairing in mind, we can limit the number
of instructions that might have contributed to the induction.


The final check is to then run all the rules, and verify that they are all valid.
Each rule has several expressions.
One expression to decide to include the rule or not in the context. (e.g. If value1 and value2 defined, and value3 undefined.)
One expression to validate the input data. (e.g. if value1 == ptr and value2 == int)
One expression to update the type data, or use to check the type data. (e.g. value3 == ptr.)
Maybe add negative rules, i.e. for a MUL expression, if either of the values is a ptr, abort, or do ptr_to_int casts.
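
The "run rules until nothing new is learned" idea can be sketched as a small fixed-point loop. This is a sketch under assumptions, not project code: it handles a single hypothetical rule kind ("dst = a + b"), does not model bitcasts, and the names type_e, add_rule_s, apply_add_rule and solve_types are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum type_e { T_UNKNOWN = 0, T_INT, T_PTR };

/* One hypothetical rule for "dst = a + b"; fields index the types array. */
struct add_rule_s {
    int a, b, dst;
};

/* Returns 1 if the rule set a new type, 0 if it learned nothing,
   and -1 if the rule is violated. */
int apply_add_rule(const struct add_rule_s *r, enum type_e *types)
{
    enum type_e ta = types[r->a];
    enum type_e tb = types[r->b];
    enum type_e result;

    /* Context expression: both inputs must be known. */
    if (ta == T_UNKNOWN || tb == T_UNKNOWN)
        return 0;
    /* Validation expression: P1 + P2 is invalid. */
    if (ta == T_PTR && tb == T_PTR)
        return -1;
    /* Update or check expression. */
    result = (ta == T_PTR || tb == T_PTR) ? T_PTR : T_INT;
    if (types[r->dst] == T_UNKNOWN) {
        types[r->dst] = result;
        return 1;
    }
    return (types[r->dst] == result) ? 0 : -1;
}

/* Repeats passes over all rules until no rule learns anything new.
   Returns 0 on success, -1 if any rule is violated. */
int solve_types(const struct add_rule_s *rules, size_t n, enum type_e *types)
{
    int changed;

    assert(rules != NULL && types != NULL);
    do {
        changed = 0;
        for (size_t i = 0; i < n; i++) {
            int rc = apply_add_rule(&rules[i], types);
            if (rc < 0)
                return -1;
            changed |= rc;
        }
    } while (changed);
    return 0;
}
```

Each pass can only turn unknown types into known ones, so the loop terminates after at most one pass per value plus a final pass that finds nothing to do. The last pass also acts as the final check that every rule in context is still valid.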
107 |
108 | Implementation:
109 | 1) Add structure to the labal_s struct.
110 | add to the label_s structure a list/array containing all the instructions that reference this label.
111 | The list would be an array of structs. This can probably be filled in
112 | by adding functionality to the "register_label()" function.
113 | For each instruction in the list, we would have:
114 | struct tip_s {
115 | int valid; /* Is this entry valid? Mostly of use when we need to
116 | delete individual entries */
117 | int inst_number; /* Number of the inst_log entry */
118 | int operand; /* Which operand of the instruction? 1 = srcA/value1, 2 =
119 | srcB/value2, 3 = dstA/value3 */
120 | int lab_pointer_first; /* Is this a pointer? */
121 | int lab_pointer_inferred;
122 | int size_bits_first;
123 | int size_bits_inferred;
124 | int lab_pointer_size_first;
125 | int lab_pointer_size_inferred;
126 | };
127 |
128 | So, for each label we now have a list of each instruction it appears in.
129 | The contents of the structure are then filled with what type
130 | information can be gained from this instruction for this label.
131 |
132 | Each variable in the structure has one of these values:
133 | 0 = not known,
134 | 1 or more = valid value
135 |
136 | Whether "first" or "inferred" is used depends on how the type was discovered.
137 | "first" is used when we are 100% sure about the type.
138 | "inferred" is used when the type has been inferred based on the type
139 | of some other label.
140 |
141 | Once that array is set up for each label, the next stage is to fill
142 | the entries in.
143 | One would do a first pass of all the instructions, and fill in the
144 | "first" values where possible.
145 | One would then start doing passes to fill in the inferred ones.
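A minimal sketch of how a "first"/"inferred" pair might be read and filled in, using the convention above that 0 means not known. The names tip_value, tip_value_best and tip_value_infer are hypothetical; the real fields live directly in tip_s.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair mirroring one *_first / *_inferred field pair of tip_s.
 * 0 = not known, 1 or more = valid value. */
struct tip_value {
    int first;    /* set when we are 100% sure */
    int inferred; /* set when derived from some other label */
};

/* Prefer the certain value; fall back to the inferred one; 0 if neither. */
static int tip_value_best(const struct tip_value *v)
{
    assert(v != NULL);
    if (v->first != 0)
        return v->first;
    return v->inferred;
}

/* Record an inferred value. Returns 0 on success, or -1 if it contradicts
 * what is already known (a conflict that may later need a bitcast). */
static int tip_value_infer(struct tip_value *v, int value)
{
    int known;

    assert(v != NULL && value > 0);
    known = tip_value_best(v);
    if (known != 0 && known != value)
        return -1;
    if (v->first == 0)
        v->inferred = value;
    return 0;
}
```

The inference passes could then loop until no call to tip_value_infer() changes anything, collecting the -1 results as conflicts to resolve afterwards.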
146 |
147 | 2)
148 | Then build a dependency tree.
149 | So, instead of one label being linked to each of the instructions it is used in,
150 | create a dependency tree of just the dependency between the labels.
151 | I.e. label A's type depends in some way on label B.
152 | Each dependency link will have a rule.
153 | Rules can be:
154 | label A's type IS_THE_SAME_AS label B's type. IS_THE_SAME_AS could be bi-directional, but we need to get to a DAG, so make it uni-directional,
155 | but the direction to choose is difficult. The node furthest from the "first values" node depends on the closer one.
156 | i.e. the deepest one depends on the closer one.
157 | If they are equal depth, then which direction to choose?
158 | label A is type X IF label B is type X and label C is type X. This is tri-directional because it is the same as:
159 | label B is type X IF label A is type X and label C is type X.
160 | label C is type X IF label A is type X and label B is type X.
161 | We need some way to determine which dependencies to suppress in order to get to a DAG.
162 | Use deepest one depends on the closer one.
163 | If they are equal depth, then which ones of the 3 to choose ?
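The "deepest one depends on the closer one" idea for IS_THE_SAME_AS links could be sketched as below. This assumes "depth" means the number of links to the nearest label that has "first" values; the names same_as_link, compute_depths and orient_links are hypothetical. Equal-depth links are deliberately left undecided (-1), since the notes leave that question open.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical undirected IS_THE_SAME_AS link between two labels. */
struct same_as_link {
    int a;         /* one label index */
    int b;         /* the other label index */
    int dependent; /* filled in: label that depends on the other, or -1 */
    int source;    /* filled in: label it depends on, or -1 */
};

/* depth[i] = number of links from label i to the nearest label with
 * "first" values, or INT_MAX if none is reachable. */
static void compute_depths(const struct same_as_link *links, int n_links,
                           const int *has_first, int n_labels, int *depth)
{
    int changed = 1;

    assert(links != NULL && has_first != NULL && depth != NULL);
    for (int i = 0; i < n_labels; i++)
        depth[i] = has_first[i] ? 0 : INT_MAX;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_links; i++) {
            int a = links[i].a, b = links[i].b;
            if (depth[a] != INT_MAX && depth[a] + 1 < depth[b]) {
                depth[b] = depth[a] + 1;
                changed = 1;
            }
            if (depth[b] != INT_MAX && depth[b] + 1 < depth[a]) {
                depth[a] = depth[b] + 1;
                changed = 1;
            }
        }
    }
}

/* The deeper label depends on the closer one; equal depth stays undecided. */
static void orient_links(struct same_as_link *links, int n_links,
                         const int *depth)
{
    assert(links != NULL && depth != NULL);
    for (int i = 0; i < n_links; i++) {
        int da = depth[links[i].a], db = depth[links[i].b];
        links[i].dependent = -1;
        links[i].source = -1;
        if (da > db) {
            links[i].dependent = links[i].a;
            links[i].source = links[i].b;
        } else if (db > da) {
            links[i].dependent = links[i].b;
            links[i].source = links[i].a;
        }
    }
}
```

Because every oriented link goes from a strictly deeper label to a strictly shallower one, the oriented links on their own cannot form a cycle; only the undecided equal-depth links still need a policy.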
164 |
165 | We can have a first set of rules to determine if a label is a pointer or not.
166 | The second set would deal with the bit width of each label.
167 |
168 | 3)
169 | There exists a problem in that there might be conflicts in the types, with
170 | e.g. one dependency telling us it is 32bits and another telling us it is 64bits.
171 | These are resolved by adding bitcast, truncate instructions etc.
172 | The problem is that when an instruction is added, the labels change, and thus the dependencies change.
173 |
174 | 4a)
175 | There exists a problem in that there might be circular dependencies.
176 | We need some way to ensure we end up with a DAG. i.e. Acyclic.
177 | Can we tell if it is acyclic even before we have inferred the types?
178 |
179 | You can check for cycles in a connected component of a graph as follows.
180 | Find a node which has only outgoing edges. If there is no such node, then there is a cycle.
181 | Start a DFS at that node. When traversing each edge, check whether the edge points back to a node already on your stack. This indicates the existence of a cycle. If you find no such edge, there are no cycles in that connected component.
182 | Repeat for each node which has only outgoing edges.
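The check above can be sketched as a depth-first search that marks nodes currently on the stack. Note this is a slight variant: it starts a DFS from every not-yet-visited node rather than only from nodes with only outgoing edges, so the "no such node means a cycle" case needs no separate test. The dep_graph layout, MAX_LABELS and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LABELS 16

/* Hypothetical dependency graph: edge[a][b] != 0 means label a's type
 * depends on label b's type. */
struct dep_graph {
    int n_labels;
    unsigned char edge[MAX_LABELS][MAX_LABELS];
};

enum visit_state { NOT_VISITED = 0, ON_STACK = 1, DONE = 2 };

/* Returns 1 if a cycle is reachable from node, 0 otherwise. */
static int dfs_finds_cycle(const struct dep_graph *g, int node,
                           enum visit_state *state)
{
    state[node] = ON_STACK;
    for (int next = 0; next < g->n_labels; next++) {
        if (!g->edge[node][next])
            continue;
        if (state[next] == ON_STACK)
            return 1; /* edge points back to a node on the stack */
        if (state[next] == NOT_VISITED && dfs_finds_cycle(g, next, state))
            return 1;
    }
    state[node] = DONE;
    return 0;
}

/* Returns 1 if the dependency graph contains any cycle, 0 if it is a DAG. */
static int dep_graph_has_cycle(const struct dep_graph *g)
{
    enum visit_state state[MAX_LABELS];

    assert(g != NULL && g->n_labels >= 0 && g->n_labels <= MAX_LABELS);
    memset(state, 0, sizeof(state)); /* all NOT_VISITED */
    for (int start = 0; start < g->n_labels; start++) {
        if (state[start] == NOT_VISITED && dfs_finds_cycle(g, start, state))
            return 1;
    }
    return 0;
}
```

This only needs the dependency edges, not the types themselves, so it can run before any types have been inferred.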
183 |
184 | 4b)
185 | If cycles are found, what can be done to break them?
186 | a) If the dependency from both directions gives us the same type, then we can temporarily block one of the dependencies.
187 | b) If the two directions give different types, so the resulting type depends on which dependency we use,
188 | then new bitcast or truncate instructions need to be added, creating a new label, and the dependency graph adjusted to include them.
189 |
190 |
191 | NEW INFO:
192 | We need to store types:
193 | Type. e.g. Int
194 | Pointer to Type
195 | Pointer to Pointer
196 | Pointer to Pointer to int etc.
197 |
198 | Simplistically this can be, for now, taken as:
199 | Int
200 | Pointer to Int
201 | Pointer to Pointer.
202 |
203 |
204 | Then score direct, then score indirect (the Int bit of "pointer to Int").
205 |
206 | struct tip_s {
207 | int valid; /* Is this entry valid? More use for when we need to delete individual entries */
208 | int node; /* The node that the inst or phi is contained in */
209 | int inst_number; /* Number of the inst_log entry */
210 | int phi_number; /* Number of the phi */
211 | int operand; /* Which operand of the instruction? 1 = srcA/value1, 2 = srcB/value2, 3 = dstA/value3 */
212 | int lab_pointer_first; /* Is this a pointer? Determined from the LOAD or STORE command */
213 | int lab_pointed_to_size; /* Size of the pointed-to value */
214 | int lab_pointer_inferred; /* This has been inferred from another label. */
215 | struct tip_s *tip; /* Add a pointer to the pointed to type. */
216 | struct tip_s *tip_up; /* Add a pointer to the tip that pointed to this. */
217 | int lab_size_first; /* Bit width of the label */
218 | int lab_size_inferred;
219 | int lab_integer_first;
220 | int lab_unsigned_integer_first;
221 | int lab_signed_integer_first;
222 | };
223 |
224 | Type inference can get complicated.
225 | E.g. see the following code:
226 |
227 | ADD A + B = C
228 | STORE 2 in [C]
229 |
230 | We therefore know C is a Pointer.
231 | From the rules:
232 | P1 + P2 = Invalid
233 | P1 + I1 = Pointer
234 | This means, either:
235 | A = Pointer
236 | B = Integer
237 | or
238 | A = Integer
239 | B = Pointer
240 |
241 | How to represent this in the types?
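One possible representation, offered only as a sketch: give each label a set of kinds it may still be, stored as a bitmask, and let each ADD narrow the sets of its three labels. The MAY_BE_* flags and narrow_add() are hypothetical names. The rules used are the ones from these notes (P1 + P2 = Invalid, P1 + I1 = Pointer, I1 + I2 = Integer).

```c
#include <assert.h>
#include <stddef.h>

#define MAY_BE_INT 0x1u
#define MAY_BE_PTR 0x2u
#define MAY_BE_ANY (MAY_BE_INT | MAY_BE_PTR)

/* Narrow the possible kinds of a, b and c for "ADD a + b = c".
 * Returns 0 on success, or -1 if no combination is consistent. */
static int narrow_add(unsigned *a, unsigned *b, unsigned *c)
{
    static const unsigned kinds[2] = { MAY_BE_INT, MAY_BE_PTR };
    unsigned new_a = 0, new_b = 0, new_c = 0;

    assert(a != NULL && b != NULL && c != NULL);
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            unsigned ka = kinds[i], kb = kinds[j], kc;

            if (ka == MAY_BE_PTR && kb == MAY_BE_PTR)
                continue; /* P1 + P2 = Invalid */
            /* P1 + I1 = Pointer, I1 + I2 = Integer */
            kc = (ka == MAY_BE_PTR || kb == MAY_BE_PTR) ? MAY_BE_PTR
                                                        : MAY_BE_INT;
            if ((*a & ka) && (*b & kb) && (*c & kc)) {
                new_a |= ka;
                new_b |= kb;
                new_c |= kc;
            }
        }
    }
    if (new_a == 0)
        return -1; /* nothing consistent: conflict */
    *a = new_a;
    *b = new_b;
    *c = new_c;
    return 0;
}
```

With C known to be a pointer and nothing known about A or B, both A and B stay "may be either", so the bitmask alone does not record that exactly one of them is the pointer. As soon as some other instruction fixes A as an integer, running narrow_add() again forces B to be a pointer.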
242 |
243 | What identifies a pointer?
244 | A STORE or LOAD determines that it is a pointer and the size of the value being stored or loaded.
245 | What identifies an integer?
246 | Signed/Unsigned integer can be determined from branch statements and other hints.
247 | What are the other hints?
248 | I am not sure a branch can determine that. One could compare two pointers, to see if we had reached a malloc limit.
249 | A signed compare would probably force the type to be a signed int, and not a pointer.
250 |
251 | Integer or String can be determined depending on how it is used.
252 | E.g. if it is used as the format variable in a printf() it is a string.
253 | If it is used in the variable args of a printf() statement and the format string identifies it as %s, it is a string.
254 |
255 | What about code that has:
256 | value1 + value2 + value3
257 |
258 | E.g. Instructions could look like:
259 | result = value1
260 | result = result + value2
261 | result = result + value3
262 | return result
263 |
264 | Is result an integer or a pointer?
265 | Assume value1 is a pointer, what happens?
266 | value2 and value3 must be integers.
267 | Assume value1 is an integer, what happens?
268 | If value2 is a pointer, value3 must be an int.
269 | If value3 is a pointer, value2 must be an int.
270 | If value2 is an int, value3 may also be an int.
271 | Note: This is starting to become difficult!
272 |
273 | What can tell us that something is an int and not a pointer?
274 | 1) A signed calculation or compare means it is an integer.
275 | 2) The integer part of an addition, with the other part known to be a pointer.
276 | 3)
277 | P1 - P2 = Integer
278 | I1 + I2 = Integer
279 | I1 - I2 = Integer
280 | I1 * I2 = Integer
281 | I1 / I2 = Integer
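The forward rules collected in these notes could be gathered into one lookup function, sketched below. The enum and function names are hypothetical. Combinations that the notes do not list (for example pointer minus integer) return RES_NOT_LISTED rather than guessing, and pointer operands to MUL are treated as invalid following the "negative rules" suggestion earlier.

```c
#include <assert.h>

enum operand_kind { OPERAND_INT, OPERAND_PTR };
enum op_code { OP_ADD, OP_SUB, OP_MUL, OP_DIV };
enum result_kind { RES_NOT_LISTED = 0, RES_INTEGER, RES_POINTER, RES_INVALID };

/* Kind of the result of "a op b" according to the rules in the notes. */
static enum result_kind result_of(enum op_code op, enum operand_kind a,
                                  enum operand_kind b)
{
    int a_ptr = (a == OPERAND_PTR);
    int b_ptr = (b == OPERAND_PTR);

    assert(op <= OP_DIV);
    if (!a_ptr && !b_ptr)
        return RES_INTEGER; /* I1 op I2 = Integer for + - * / */
    switch (op) {
    case OP_ADD:
        /* P1 + P2 = Invalid, P1 + I1 = Pointer */
        return (a_ptr && b_ptr) ? RES_INVALID : RES_POINTER;
    case OP_SUB:
        /* P1 - P2 = Integer; mixed cases are not listed in the notes */
        return (a_ptr && b_ptr) ? RES_INTEGER : RES_NOT_LISTED;
    case OP_MUL:
        return RES_INVALID; /* suggested negative rule: abort or cast */
    default:
        return RES_NOT_LISTED;
    }
}
```

A result of RES_INTEGER is one of the things that can tell us a label is an int and not a pointer, which is what item 3) above is using these rules for.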
282 |
283 |
284 |
--------------------------------------------------------------------------------