One thing which I am always being asked about is a 'decompiler for C', so I thought I would put a few notes up to discuss decompilers. You may have seen working decompilers before - there are several available for Java, there are several available for QuakeC, there is even one available for Visual Basic 3 and Visual Basic 4. If you look around for a decompiler for C though then you will be disappointed. As far as I know the only decompilers which will decompile an executable into C are extremely restricted (see the decompilation page at the University of Queensland for more details). So surely it would be a great idea to write a decompiler for win32 executables into C code ? yes, it would, and if you want to do so then come back in ten years and let me know how you've got on ;)
Seriously though, the main reason that Java and QuakeC have been decompiled is because it is easy to backtrack from the compilation to the original code. In general these languages are not compiled directly into machine code, they are converted into a direct representation of the program as a series of numbers which is interpreted by a virtual machine. For example, you write 'x=1' in your language and the 'compiler' compiles this to [1 100 1] say, which has the direct meaning of [1=assign] [100=a variable number, in this case 'x'] [1=a value] When it comes to reversing this we find that [1] is always our assignment statement and is always followed by a location and a value. Hence we can reverse the statement easily. As far as Java and QuakeC are concerned there is 'practically' (although this is probably much oversimplifying the process) a one-one correspondence between the original program and the pseudo-code which is produced by the compiler.
Now consider C, or Pascal, or any other truely compiled language. By this I mean that the language is compiled into a machine code which is directly readable by the processor and which is not interpreted through a virtual machine (although the virtual machine could still be reading optimised pseudocode and be difficult to reverse). In general a single statement in a high level language could be compiled into a number of statements in the machine code, these might be optimised by the compiler when they are pieced together, and even the order may be rearranged when considering pentium class processor scheduling for example. The relationship between the code produced and the original program is not straightforward in the sense that the correspondence is not one-one. The correspondence is a many-many correspondence (it is not even one-many or many-one). In addition when you throw in the fact that data and code may be difficult to identify and separate (just create a jump table in an array in C and see what happens to the assembler, or look at the disassembly of a Delphi program which mixes code and data so much), and that the code itself may be self modifying, you will begin to find that the problem is no longer very simple.
So just how do you begin to write a decompiler to C then if its all so mixed up ? Well, the only possible approach, in my mind, is not to try and recover the original program but to try and produce an equivalent program. The approach is still fraught with difficulties of course. First of all you would have to assume that the code is not self modifying otherwise you will find yourself with all kinds of paradoxes and difficulties in decompiling. Next you will need to identify the flow of the program code (where are the control statements if..then..else..endif and while..repeat or however you want to label them), and which parts of the program are code and which parts are data. Having done this it is 'simply' a matter of replacing most of the assembly language with equivalent C statements and introducing labels (and how friendly this is going to be! label1, label2 , ..., label146546). Of course the real test will be when you try to recompile your masterpiece into a runnable program, but if you get this far then you have done well.
So am I going to write a decompiler ? Well, no, not at this point in time. I believe that the best decompiler there is is the human brain. Take an assembly listing and decompile it yourself, giving the variables names and so on. Of course you can't do it to a 20Mb executable but to small programs it is an exercise worth undertaking. What I will probably be doing though is to study the process of decompilation in more depth, analysing my approach to hand decompilation if you like, and producing 1/ some articles to go up here on my web pages and 2/ maybe adding some tools like codeflow analysis to Borg2 (after all Borg1 did it), probably to be switched into use by the user (so you would disassemble your program and when happy with the disassembly you could do a codeflow analysis). Anyone got any comments ?