7 comments

  • fizza_pizza 36 minutes ago
    The certification angle is the most interesting part to me. Regulated industries (aviation, medical devices) often can't use JIT for exactly this reason: the code that runs has to be the code that was certified. Static translation that produces a signable binary is a real unlock there, code bloat notwithstanding.
    • camillomiller 31 minutes ago
      I wonder: how relevant is this portion of the software industry? Because I'm guessing there is also no way they can apply LLMs at scale, which is never discussed in the larger AI-at-work narrative.
      • jy14898 16 minutes ago
        LLMs aren't relevant to aviation and medical devices
      • rvz 5 minutes ago
        It is completely relevant if you want the reliable software you use daily to continue running without a massive rewrite.

        Before suggesting that LLMs be used to completely rewrite this sort of software, consider that there is a reason compilers need to be certified to operate in safety-critical environments. Not everything needs LLMs as the solution to a problem.

        I would go as far as to say that using an LLM in this context is the wrong solution and is irrelevant to critical systems. Maybe some here see everything as tokens and feel compelled to solve everything with LLMs.

        Rewriting a toy web app from JavaScript to TypeScript using LLMs is great, but that approach isn't good enough for safety-critical systems.

  • da-x 58 minutes ago
    > Elevator achieves performance on par with or better than QEMU's user-mode JIT emulation.

    I am not sure what QEMU's JIT is doing (in its user-mode wrapper), but I think it has a lot of room for improvement.

    In 2013 I wrote an x86-64 to aarch64 JIT engine that was able to run what were then Fedora beta aarch64 binaries and rebuild almost the entire aarch64 port of Fedora on an x86_64 Linux host. I also made a reverse aarch64 to x86-64 JIT that worked in the same way, and for fun I showed the two JITs running each other in loopback fashion: x86-64 -> aarch64 -> x86_64 in the same process.

    The JIT I devised did a 1-to-many instruction and CPU state mapping, with overhead of roughly 2x to 5x over what would be expected of natively recompiled code. I later compared this with QEMU's JIT, which seemed more in the range of 10x to 50x slower.
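
    To make "1-to-many instruction and CPU state mapping" concrete, here is a minimal sketch of the general idea (purely illustrative; the struct, offsets, and register choices are made up here, not what the engine actually did):

        #include <stdint.h>

        /* Hypothetical guest CPU state kept in memory, with one host
         * register (say x28) permanently pointing at it. */
        struct guest_state {
            uint64_t gpr[16];   /* rax .. r15 */
            uint64_t rip;
            uint64_t rflags;
        };

        /* One guest instruction then expands into several host instructions,
         * e.g. "add rax, rbx" becomes roughly:
         *     ldr  x0, [x28, #GPR_RAX]
         *     ldr  x1, [x28, #GPR_RBX]
         *     adds x0, x0, x1            ; NZCV folded into rflags later
         *     str  x0, [x28, #GPR_RAX]
         */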

    Unfortunately this was not done under an open source license, so there is no code release to prove it... :(

    • pm215 7 minutes ago
      Yes, QEMU's JIT is a fairly easy target to beat. Notably, if you are happy to specialize the design to "only x86 to aarch64" and "only usermode", there's quite a lot of gain to be made. QEMU's usermode support is a kind of "this happens to work" appendix to its system emulation support, and the overall JIT architecture is a "guest to intermediate representation to host" one. That is great for supporting a dozen guest architectures and multiple host architectures, but it means you can't really take advantage of properties of a specific guest/host pair, like "x86 has fewer integer registers so we can hard allocate them" or "we know the fiddly floating point semantics always match if you put the aarch64 CPU into the right mode". Plus there's simply more time put into "emulate new architecture feature X" in QEMU development than into "look at optimization opportunities to make it faster", because that's what the people who pay for development work care more about.
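
      As a minimal sketch of what that per-pair specialization can buy (illustrative register numbers, not QEMU's TCG, which allocates registers dynamically through its IR):

          /* x86-64 has 16 integer GPRs and aarch64 has 31, so a dedicated
           * x86-to-aarch64 translator can pin every guest GPR to a fixed
           * host register and still have scratch registers left over. */
          static const int guest_gpr_to_host[16] = {
              /* rax rcx rdx rbx rsp rbp rsi rdi */
                 19, 20, 21, 22, 23, 24, 25, 26,
              /* r8  r9  r10 r11 r12 r13 r14 r15 */
                  3,  4,  5,  6,  7,  9, 10, 11,
          };
          /* With that pinning, "add rbx, rcx" can become a single
           * "add x22, x22, x20" instead of a load/op/store sequence
           * against an in-memory CPU state struct. */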
  • jonhohle 1 hour ago
    This is neat. I haven't looked into it, but I would think relative offsets could still be an issue; then again, there must be some translation layer/MMU anyway, since the generated code will be a different size. This would impact jump tables and internal branches, primarily.
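
    For instance (illustrative C, not from the paper), a switch with enough cases typically compiles into an indirect jump through a table of code addresses or offsets stored in a data section, and a static rewriter can't readily tell those table entries from ordinary data:

        int dispatch(int op, int a, int b) {
            /* Compilers commonly lower a dense switch like this into a
             * jump table: an indirect branch through an array of code
             * addresses kept in .rodata. */
            switch (op) {
            case 0: return a + b;
            case 1: return a - b;
            case 2: return a * b;
            case 3: return a & b;
            case 4: return a | b;
            default: return 0;
            }
        }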

    I mostly work on stuff from the 90s, where disassemblers make a lot of assumptions about where code starts and ends, and occasionally a binary blob is simply not discoverable unless you have some prior knowledge (a pointer at a fixed location to an entry point).

    I would think after a few passes you could refine the binary into areas that are definitely code.

  • Panzerschrek 1 hour ago
    Can it handle self-modifying code?

    Why only x86_64? It would make more sense to convert 32-bit programs, like many old games.

    • oinkt 1 hour ago
      Consider reading the linked article, where this is explicitly addressed:

      > Self Modifying and JIT-Compiled Code. Elevator, like all fully static binary rewriters, does not support self modifying or just-in-time-compiled code.

    • linkregister 16 minutes ago
      Why doesn't it clean my garage also? I've got some leaves to rake as well.
  • fguerraz 1 hour ago
    Does it mean I can finally run Slack on Asahi?
    • gobdovan 1 hour ago
      From the paper, Elevator currently supports only single-threaded binaries, does not support binaries that use exception handling, leaves some x64 extensions unsupported, and does not support self-modifying or JIT-compiled code. Slack is Electron-based, so it embeds Chromium and Node and depends on V8.

      Maybe try an emulator? There's also this project I found: https://github.com/andirsun/Slacky

  • dmitrygr 1 hour ago
    Cute, but Rice's theorem remains, and even though they translate every byte as code, no handling is possible for

       /* mov eax, 42 ; ret  -- machine code stored in a data array */
       char buf[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
       return ((int (*)(void))buf)();
    
    Static translation is only possible when you assume no adversarial code AND mostly assume compiler-produced binaries. Hand-rolled asm gets hard, and handling adversarial code in all cases is provably impossible.

    Still, pretty cool for cooperative binaries.

    • fsmv 1 hour ago
      I only read the abstract, but I got the impression that their solution to this is that they keep both. They translate all the data as if it were code; if it gets jumped into they use the translation, whereas if it gets read as data they use the original.

      Edit: I found this in the paper

      > Elevator sidesteps the code-versus-data determination altogether through an application of superset disassembly [6]: we simultaneously interpret every executable byte offset in the original binary as (i) data and (ii) the start of a potential instruction sequence beginning at that offset, and we build the superset control flow graph from every one of the resulting candidate decodes. Every potential target of indirect jumps, callbacks, or other runtime dispatch mechanisms that cannot be statically analyzed therefore has a corresponding landing point in the rewritten binary. These targets are resolved at runtime through a lookup table from original instruction addresses to translated code addresses that we embed in the final binary.
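
      A minimal sketch of what that lookup table amounts to at runtime (my reading of the quoted passage; the names and layout are made up, not from the paper):

          #include <stdint.h>

          /* One entry per executable byte offset of the original binary,
           * pointing at the translated code for the decode starting there. */
          extern uintptr_t translated_of[];
          extern uintptr_t orig_text_base;

          static inline uintptr_t redirect(uintptr_t orig_target) {
              return translated_of[orig_target - orig_text_base];
          }

          /* An original indirect call such as "call *%rax" is rewritten to
           * compute redirect(rax) first and then branch to the result. */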

    • tlb 1 hour ago
      But in fact no modern processor/OS executes this either. Pages are marked as executable or not, and static data is loaded as non-executable pages.
      • dmitrygr 1 hour ago
        That is why it was not "static const char buf[]" ;) it was not an accident.

        Executable stacks are still common (including on Windows with some settings), and sometimes they are required (e.g. for gcc nested functions).
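
        For reference, a minimal example of the gcc nested-function case (a GNU C extension; the names here are just illustrative). Taking the nested function's address makes gcc write a trampoline onto the stack, which is why that stack must be executable:

            static void apply(int *a, int n, void (*f)(int *)) {
                for (int i = 0; i < n; i++)
                    f(&a[i]);
            }

            void scale_all(int *a, int n, int factor) {
                /* Nested function capturing "factor" (GNU C extension). */
                void scale(int *p) { *p *= factor; }

                /* Passing its address forces a trampoline onto scale_all's
                 * stack frame, so the stack has to be executable here. */
                apply(a, n, scale);
            }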

        • diamondlovesyou 1 hour ago
          That won't be located on the stack either. The underlying buffer will be a TU local, i.e. static and not rx.
          • lisper 1 hour ago
            Good grief, what a useless argument. Isn't it obvious that this could trivially be converted to a non-static array if that's really what was needed?
    • userbinator 1 hour ago
      I read those bytes and immediately thought "mov eax, 42; ret".
    • genxy 1 hour ago
      It looks like their system would just generate return 42;
    • IshKebab 1 hour ago
      No, based on the abstract it can handle that code. What it can't handle is runtime code generation.