Why can't code be uncompiled?

I see a lot about source code being leaked, and I'm wondering why you can't make something like an exact replica of Super Mario Bros without the source code, or why you can't take the finished product and run it back through the compilation software.

53 comments
  • The long answer involves a lot of technical jargon, but the short answer is that the compilation process turns high level source code into something that the machine can read, and that process usually drops a lot of unneeded data and does some low-level optimization to make things more efficient during actual processing.

    One can use a decompiler to take that machine code and attempt to turn it back into something human-readable, but the result will usually be missing variable names, function names, comments, etc., and will include compiler-added optimizations, which makes it nearly impossible to reconstruct the original code.

    It's sort of the code equivalent of putting a sentence into Google translate and then immediately translating it back to the original. You often end up with differences in word choice that give you a good general idea of intent, but it's impossible to know exactly which words were in the original sentence.

    • Thank you, sorry to push further but my understanding is that computers deal with binary so every language is compiled to machine code, which I took as binary.

      So if the language has elements being removed and the machine doesn't need them shouldn't you get back out exactly what is needed to do the task? Like if you compiled some code and then uncompiled it you would get the most efficient version of it because the computer took what it needed, discarded the rest and gave it back to you?

      • It depends on the specifics of how the language is compiled. I'll use C# as an example since that's what I'm currently working with, but the process is different between all of them.

        C#, when compiled, actually gets compiled down to what is known as an intermediate language (MSIL for C# specifically). This intermediate file is basically a set of generic instructions that are not tied to any specific CPU. This is useful because different CPUs require different instructions.

        Then, when the program is run, a second compiler known as the JIT (just-in-time) compiler takes the intermediate commands and translates them into something directly relevant to the CPU being used.

        When we decompile a C# dll, we're really converting from the intermediate language (generic CPU-agnostic instructions) and translating it back into source code.

        To your second point, you are correct that the decompiled version will be more efficient from a processing perspective, but that efficiency comes at the direct cost of being able to easily understand what is happening at a human level. :)
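As a loose analogy only (this is Python, not C# or MSIL), CPython also compiles source into a CPU-agnostic intermediate form before running it. The sketch below uses a made-up `area` function to show one thing that does not survive that step: the comment.

```python
# Hypothetical source text; the function and its comment are invented
# purely for illustration.
source = '''
def area(width, height):
    # Multiply the sides to get the area.
    return width * height
'''

# Compile the source into a code object (Python's intermediate form).
module_code = compile(source, "<example>", "exec")

# The function body is stored as a nested code object among the constants.
func_code = next(c for c in module_code.co_consts if hasattr(c, "co_code"))

# Look for the comment text in the compiled instructions and constants.
comment_survived = (
    b"Multiply" in func_code.co_code
    or any("Multiply" in repr(c) for c in func_code.co_consts)
)

# The compiled code still runs fine without the comment.
namespace = {}
exec(module_code, namespace)
result = namespace["area"](3, 4)

print(comment_survived, result)
```

The program behaves the same with or without the comment, which is why the compiler has no reason to keep it.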

      • The main issue is that to make code human-readable, we include a lot of conventions that computers don't need. We use specific formatting, name conventions, code structure, comments, etc. to help someone look at the code and understand its function.

        Let's say I write code, and I have a function named 'findUserName' that takes a variable 'text' and checks it against a global variable 'userName', to see if the user name is contained in the text, and returns 'true' if so. If I compile and decompile that, the result will be (for example) a function named 'function_002' that takes a variable 'var_local_000' and checks it against 'var_global_115'. Also, my comments will be gone, and finding where the function was called from will be difficult. Yes, you could look at that code and figure out that it's comparing the contents of two variables, but you wouldn't know that var_global_115 is a username, so you'd have to go find where that variable was set and try to puzzle out where it was coming from, and follow that rabbit hole backwards until you eventually find a request for user input which you'd have to use context clues to determine the purpose of. You also wouldn't have the context around what 'var_local_000' represented unless you found where the function was called, and followed a similar line backwards to find the origin of that variable.

        It's not that the code you get back from a decompiler is incorrect or inefficient, it's that it's very much not human-readable without a lot of extra investigatory work.
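The 'findUserName' example above might look roughly like this, sketched in Python for readability. The global value and the decompiler-style names are made up; a real decompiler would choose its own naming scheme.

```python
# What the author wrote (hypothetical original).
USER_NAME = "mario"  # invented value for illustration


def find_user_name(text):
    """Return True if the global user name appears in the text."""
    return USER_NAME in text


# Roughly what might come back after compiling and decompiling:
# identical logic, but every name is a placeholder and the docstring
# and comments are gone.
var_global_115 = "mario"


def function_002(var_local_000):
    return var_global_115 in var_local_000


print(find_user_name("hello mario"), function_002("hello mario"))
```

Both functions behave identically, but only the first one tells you what it is for.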

      • The implicit assumption with decompiling code is that the goal is either to inspect how the code works, or to try compiling for a different machine. I'll try to explain why the latter is quite difficult.

        As you said, compilation to machine code only keeps the details needed for the CPU to accomplish what was instructed. And indeed, that is supposed to be efficient to run on that CPU, by reason of being targeted exactly for that CPU. But when decompiling, the resulting code will reflect the specificity to that same CPU. If you then try to compile that code for a different CPU, it will likely work, but will likely be inefficient because the second CPU's unique advantages won't be leveraged.

        To use an example, consider how someone might divide two large numbers. Person A learned long division in school, and so takes each number and breaks it down into a series of smaller multiplications and subtractions. Person B learned to do division using a calculator, which just involves entering the two numbers and requesting that they be divided.

        Trying to do division by blindly giving Person B that series of multiplications and subtractions to do on the calculator is extremely inefficient because Person B knows how to do division easily. But Person B is following Person A's methods, without knowing that the whole point of this exercise is to just divide the two original numbers. Compilation loses context and intent, which cannot be recovered from decompilation, for non-trivial programs.

        Here is an example why source code is useful when it provides context: https://en.m.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code . Very few people would be able to figure out how this works from just the machine code.
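A minimal sketch of the Person A / Person B analogy, assuming a non-negative dividend and a positive divisor. Repeated subtraction stands in for long division here to keep the example short; both function names are invented.

```python
def divide_step_by_step(dividend, divisor):
    """Person A's method: build division out of simpler operations."""
    quotient = 0
    remainder = dividend
    # Keep subtracting the divisor until it no longer fits.
    while remainder >= divisor:
        remainder -= divisor
        quotient += 1
    return quotient


def divide_directly(dividend, divisor):
    """Person B's method: the machine already knows how to divide."""
    return dividend // divisor


print(divide_step_by_step(1234, 7), divide_directly(1234, 7))
```

Someone handed only the first function has to work out that the loop is "just division" before they can replace it with the one-line version. That recovered intent is exactly what decompilation does not give you.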

      • if you compiled some code and then uncompiled it you would get the most efficient version of it ... ?

        Sorta. An optimizing compiler will trim dead code that isn't needed, but it will also do things that are more efficient yet make the code harder to understand, like unrolling loops. For example, you might have some code that says "for numbers 1-100, call some function". The compiler can look at this and say "let's just go ahead and insert 100 calls to that function, each with its specific number", so instead of a small loop you'll see a big block of nearly identical function calls.

        Other optimizations will similarly obfuscate the original programmer's intent, and things like assertions are meant to be optimized out in production code, so those won't appear in the decompiled version of the sources.
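Loop unrolling as described above, sketched by hand in Python with 4 iterations instead of 100. The `record` function is a made-up stand-in for "some function", and a real compiler does this at the machine-code level rather than in source.

```python
def record(n, log):
    """Hypothetical stand-in for 'some function' called in the loop."""
    log.append(n * n)


def as_written():
    log = []
    # The compact loop the programmer wrote.
    for n in range(1, 5):
        record(n, log)
    return log


def as_unrolled():
    log = []
    # The same work with the loop unrolled: one call per iteration.
    record(1, log)
    record(2, log)
    record(3, log)
    record(4, log)
    return log


print(as_written() == as_unrolled())
```

A decompiler that sees the second form has no way of knowing whether the author wrote a loop or typed out each call.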

  • I actually work on a C++ compiler... I think I should weigh in. The general consensus here that things are lossy is correct but perhaps non-obvious if you're not familiar with the domain.

    When you compile a program, you're taking the source, turning it into a graph that represents every aspect of the program, and then generating some kind of IR that then gets turned into machine code.

    You lose things like code comments because the machine doesn't care about the comments right off the bat.

    Then you lose local variable and function parameter names because the machine doesn't care about those things.

    Then you lose your class structure ... because the machine really just cares about the total size of the thing it's passing around. You can recover some of this information by looking at the functions, but it's not always going to be straightforward because not every constructor initializes everything, and things like unions add further complexity ... and not every memory allocation uses a constructor. You won't get any names of any data members/fields though because ... again, the machine doesn't care.

    So what you're left with is basically the mangled names of functions and what you can derive from how instructions access memory.

    The mangled names normally tell you a lot: the namespace, the class (if any), and the argument count and types. Of course that's not guaranteed either; it's just because that's how we come up with unique, stable names for the various things in your program. It could function with a bunch of UUIDs if you set up a table on the compiler's side to associate everything.

    But wait! There's more! The optimizer can do some really wild things in the name of speed... Including combining functions. Those constructors? Gone, now they're just some more operations in the function bodies. That function you wrote to help improve readability of your code? Gone. That function you wrote to deduplicate code? Gone. That eloquent recursive logic you wrote? Gone, now it's the moral equivalent of a giant mess of goto statements. That template code that makes use of dozens of instantiated functions? Those functions are gone now too; instead it's all the instantiated logic puked out into one giant function. That piece of logic computing a value? Well the compiler figured out it's always 27, so the logic to compute it? Gone.

    Now all of that stuff doesn't happen every time, particularly not all of those things are always possible optimizations or good optimizations ... But you can see how incredibly difficult it is to reconstruct a program once it's been compiled and gone through optimization. There's a very low chance if you do reconstruct it, that it will look anything like what you started with.
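Two of the optimizations above, inlining a helper and replacing a computation with its known result, can be sketched by hand like this. It is written in Python for readability; the function names are invented, and a real C++ optimizer works on IR and machine code, not source.

```python
def cube(x):
    """Small helper the programmer wrote for readability."""
    return x * x * x


def magic_number():
    # As written: calls the helper with a fixed input.
    base = 3
    return cube(base)


def magic_number_after_optimization():
    # What effectively remains once the helper is inlined and the
    # arithmetic is evaluated ahead of time: just the answer.
    return 27


print(magic_number(), magic_number_after_optimization())
```

From the optimized form alone there is no trace that `cube` ever existed, or that 27 came from cubing 3.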

  • As I've read somewhere once: it's easy to make a burger out of a cow. Making a cow out of a burger is slightly harder.

    That means that compiling code is a lossy process - the original code is lost in the process and can never be recovered because it doesn't exist anywhere anymore.

    • This is the fundamental notion of nearly 95% of cyberpunk stories re: the human soul and yet everyone always is like “but I want my cool robot hand!”

  • Well, actually it can be. It just takes a lot more effort to decompile code than to compile it, depending on how accurate the result needs to be.

    Example: the Super Mario 64 Decompilation project. This was a project that used various debug data that was left in the ROM to decompile the game back to source code that compiles to a byte-accurate version of the ROM. This took about 3 years and a lot of skilled developers to accomplish.

    Side note: Super Mario Bros wasn’t built using a compiled language, but rather Assembly. So technically that would be a Disassembly not a Decompilation.

  • The general difference is that you lose out on metadata - names, comments and organization that helps the source code in whatever programming language make sense, but which is not needed to actually execute the desired behavior on your CPU. Usually stuff like sensible names for bits of your code - functions/reusable logic, storage locations for "health" or "armor" or "current powerup", movement states, types of objects etc.

    However, most of these are just another kind of number to the computer itself, so a lot of compilation processes strip a lot of this information. You could still reverse engineer it, but you're missing context (like all those names) from the original code and that makes the work potentially pretty difficult. Bear in mind that reading actual original source code is sometimes cryptic enough, then compare "if player is dead, show game over screen" to if (sdfdfgsdfg == jgdfg) { lkghku(); } because the "decompiler" has to invent some kind of name for everything that's missing. Now you have to deal with thousands of jfdsghklgs, and figure out what it all means.

  • You can certainly decompile things back up from machine code, but there could be gaps and things lost in translation between the programming language used to create the program and what you get when you take the machine code apart again.

    When you program, like actually write the code, you're using one language. When you compile it, you're passing it off to a translator that converts it into another language. There could be even more layers of this depending on what you're doing.

    Now think about what happens when you open a translator, enter some words, translate it to one language, and then another, and back to the original. It comes out all wrong; the same thing happens with code. There's nuance and flavor imparted by the language itself that isn't kept through the interpretation of that language to the language that actually is used by the computer to do its tasks.

  • You can get close depending on the language by using decompilers. Usually though, they're rough translations of what the decompiler thinks that the (compiled) machine code does. It's not a 1:1 deal.

    Basically, a compiler translates the human-readable code to machine code that can actually be recognized and executed by your computer. A decompiler attempts to do the opposite: it translates the machine code back into the original language. But like some "translators", it's not always correct. That's the hard part: once decompiled, you will likely have a lot of blanks to fill in and bugs to fix before anything will be compilable again. You'll likely never be able to get an exact copy of the original source code via a decompiler.
