Turning stack-based IL into register-based ASM


Hello,

So for context: I've got my own IL/bytecode-based compiled programming language, which is stack-based (similar to Java's IL). I have a JIT-compiler which ATM does nothing more than translate all the VM operations into one blob of machine code (so that no dispatch loop is necessary, and constants need not be fetched from a bytecode stream but can be treated as actual constants). This is already multiple times faster than executing the bytecode in a VM; however, it is still slower than it could be, since it needs to use the VM's stack for passing operands. For example:

PushConst 4
LoadLocal 0
Mul
PushConst 1
Add

This is the equivalent of the operation (4 * x + 1). Ignoring any crazy optimizations, the same operation would look like this in machine code:

  push rbp
  mov rbp, rsp
  mov dword ptr [rbp - 4], edi   ; spill the argument x to the stack
  mov eax, dword ptr [rbp - 4]   ; reload x
  shl eax, 2                     ; eax = 4 * x
  add eax, 1                     ; eax = 4 * x + 1
  pop rbp
  ret

Obviously all the operands are either stored in a register or on the application's own stack.

What I'm curious about is: is there any general algorithm to turn such a stack-based IL into a full register-based variant, which can then be optimized? For a directly compiled language like C++, I suppose it's easy, as all those steps are performed based on the AST. But for an IL like Java's bytecode, there is no AST. Is the Java IL simply parsed back into an AST somehow, or is there another method?

For very simple examples I could come up with something, but for anything more complicated, including jumps and labels, I don't have any idea. I could do another primitive optimization by looking for patterns like “Add” preceded by “PushConst” and using registers for adjacent operations, but I'd rather have a solution that scales.


I've never worked with a stack-based language in a compilation context, but it sure looks like a fun project 🙂

To me, “AST” just means “machine-readable form”, as opposed to text that you have to parse. In the process we usually drop some syntactic sugar that a computer doesn't need, hence “abstract”. As such, the sequence of stack-based instructions looks like an AST to me, probably as a list of (instruction, argument) tuples.

The input language is much simpler (less syntax, less sugar), which means that source code and AST are much closer to each other.

A first thought for converting back to a register/variable-based language: drop the idea of a stack, and instead have a register/variable for each position in the stack. If the stack pointer is at index N and the instruction is to load 4, then you have var_N := 4. If the next instruction is to load a program variable x, you get var_(N+1) := x. Multiplying then does var_N := var_N * var_(N+1); and so on. In other words, simulate the stack operations and use a number of variables as registers.
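To make that concrete, a minimal sketch of the simulation in C++; the Instr type and opcode names are made up to match the example from the opening post, not anyone's actual engine:

#include <cstdio>
#include <vector>

enum class Op { PushConst, LoadLocal, Mul, Add };
struct Instr { Op op; int arg; };

// Walk the instructions once, tracking only the simulated stack height n,
// and emit one register/variable statement per stack instruction.
void toVariables(const std::vector<Instr>& code) {
    int n = 0; // current stack height = index of the next free slot
    for (const Instr& i : code) {
        switch (i.op) {
        case Op::PushConst:
            std::printf("var_%d := %d\n", n++, i.arg);
            break;
        case Op::LoadLocal:
            std::printf("var_%d := local_%d\n", n++, i.arg);
            break;
        case Op::Mul: // pops slots n-1 and n-2, stores the result in slot n-2
            --n;
            std::printf("var_%d := var_%d * var_%d\n", n - 1, n - 1, n);
            break;
        case Op::Add:
            --n;
            std::printf("var_%d := var_%d + var_%d\n", n - 1, n - 1, n);
            break;
        }
    }
}

Fed the PushConst/LoadLocal/Mul/PushConst/Add sequence from the opening post, this prints var_0 := 4, var_1 := local_0, var_0 := var_0 * var_1, var_1 := 1, var_0 := var_0 + var_1, which a register allocator can then map onto real registers.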

Likely the same code is run with different values of N at some point. The simplest solution I can think of is to do the above for every different value of N that you find. It does feel, however, as if there should be room for improvement there.

For labels and jumps, a standard tactic in compilers is to split the code into basic blocks. A basic block is a sequential piece of code, ending with a jump or by dropping into the next block. This gives a directed graph of blocks. From there you can do all kinds of analysis on the graph (deciding which variables are “live” after a block, finding common sub-expressions/code sequences, finding dead code, and so on). Analysis on such blocks is the core of how all compilers optimize and rewrite. If you want to know more about that, I'd recommend finding one or more books on compiler optimization. Some are introductions to the field, while others specialize in some specific topic, e.g. performance or parallelism.
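As a sketch of what such a graph could look like (field names are illustrative, not from any particular compiler):

#include <vector>

// A basic block: a run of straight-line instructions plus its edges in the
// control-flow graph. Instructions are referenced by index into the original
// instruction array, so this stays independent of the concrete IL.
struct BasicBlock {
    int first = 0;                 // index of the first instruction
    int last  = 0;                 // one past the last instruction
    std::vector<int> successors;   // blocks control can flow to from here
    std::vector<int> predecessors; // reverse edges, needed for backward analyses
};

using Cfg = std::vector<BasicBlock>; // by convention, block 0 is the entry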

Alberth said:
To me, “AST” just means “machine-readable form”

I do have a machine-readable format, where each instruction is represented as a pointer to an object. However, unlike what I meant by AST, those objects are disjoint: an AddIntInstruction does not by itself know which instructions produce its input. For most tasks, like formatting or generation of the bytecode stream, that information is not necessary. For optimization and transformation to registers, it is somewhat needed. I have a very primitive optimizer that scans just the adjacent instructions to match patterns, but even there it would be beneficial to be able to determine where an operation's values actually came from.

Alberth said:
A first thought for converting back to a register/variable-based language: drop the idea of a stack, and instead have a register/variable for each position in the stack. If the stack pointer is at index N and the instruction is to load 4, then you have var_N := 4. If the next instruction is to load a program variable x, you get var_(N+1) := x. Multiplying then does var_N := var_N * var_(N+1); and so on. In other words, simulate the stack operations and use a number of variables as registers.

So what you are suggesting is basically simulating an execution of the code, and tracing the stack pointer to assign each push and pop a register based on where on the stack we are? That actually sounds interesting. It's somewhat similar to what I envisioned for an algorithm to extend my optimizer's scanning, but doing it this way makes it more flexible and reusable.

Alberth said:
For labels and jumps, a standard tactic in compilers is to split the code into basic blocks. A basic block is a sequential piece of code, ending with a jump or by dropping into the next block. This gives a directed graph of blocks. From there you can do all kinds of analysis on the graph (deciding which variables are “live” after a block, finding common sub-expressions/code sequences, finding dead code, and so on). Analysis on such blocks is the core of how all compilers optimize and rewrite. If you want to know more about that, I'd recommend finding one or more books on compiler optimization. Some are introductions to the field, while others specialize in some specific topic, e.g. performance or parallelism.

I'll have to see if I fully understand that part yet. I think reading a book on optimizations would be a good idea in general. I still have a few optimizations that I'm quite unsure how to implement in practice (inlining etc.), so learning more would be useful. Thanks.

The graph of blocks captures how you jump around in the program. It is used to link the local per-block computations with each other, so that the information becomes valid for the entire program.

For example, dead variable analysis. A variable is dead if its value is not needed again in the remainder of the program. This information is useful for deciding whether you need to store an assigned value for future use. To be clear, this is different from variable usage: code may still assign a new value to the variable and use that subsequently. (But in that case you don't need the old value!)

This is a “backwards” analysis: code at the end of the program decides this property for code before it.

To compute it, for each block B you build a set from all the successor blocks: the variables whose values are still needed (everything not in the set is dead). Initially, this is the empty set for all blocks. Next you repeatedly iterate over all blocks to propagate successor-block information upwards:

  • If you have multiple successors, you merge their sets by union: a variable's value is needed if any successor needs it (equivalently, a variable is dead only if all successors don't need its value). Basically this computes what the block should “deliver” to its successor blocks.
  • Next you look at the code of block B and update the set. If the first thing that happens to a variable is reading its value, you add that variable to the set (block B clearly needs its value). If the first thing that happens is assignment of a new value, you remove the variable from the set (block B does not need its old value).

The computed set is the new value for the predecessor blocks. If a set differs from the previous iteration, all predecessor blocks must be computed again. Repeat until all sets are stable. A nice side result of this computation: if the computed set is not empty at the first block of the program, you know there is a code path that uses a variable without first initializing it!
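A minimal sketch of that fixpoint in C++, phrased as the usual live-variable formulation (a variable is dead exactly when it is not in the set); Block, use and def are illustrative names:

#include <cstddef>
#include <set>
#include <string>
#include <vector>

// A block summarized for the analysis: its successor edges plus, for each
// variable, whether its first occurrence in the block is a read or a write.
struct Block {
    std::vector<int> successors;
    std::set<std::string> use; // first occurrence is a read
    std::set<std::string> def; // first occurrence is a write
};

// Backward fixpoint: the set for block b is use[b], plus everything any
// successor still needs, minus what b overwrites first (def[b]).
std::vector<std::set<std::string>> liveness(const std::vector<Block>& blocks) {
    std::vector<std::set<std::string>> liveIn(blocks.size());
    bool changed = true;
    while (changed) { // repeat until all sets are stable
        changed = false;
        for (std::size_t b = 0; b < blocks.size(); ++b) {
            std::set<std::string> s; // live at the end of b: union over successors
            for (int succ : blocks[b].successors)
                s.insert(liveIn[succ].begin(), liveIn[succ].end());
            for (const auto& v : blocks[b].def) s.erase(v);        // killed by a write
            s.insert(blocks[b].use.begin(), blocks[b].use.end()); // needed by a read
            if (s != liveIn[b]) { liveIn[b] = std::move(s); changed = true; }
        }
    }
    return liveIn;
}

If liveIn of the entry block comes out non-empty, that is exactly the “used without initialization” diagnostic mentioned above.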

----

As for making blocks: I realized your stack language already has labels and jump instructions, so you could partition the stack-language code into blocks right away. The simplistic approach is to start a new block at each label (other blocks may jump to this point) and after each (possibly conditional) jump instruction. A more advanced approach is to first analyze whether labels are actually used as jump destinations.
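A sketch of that simplistic partitioning; Kind and the instruction layout are made up for illustration:

#include <algorithm>
#include <vector>

enum class Kind { Plain, Label, Jump, CondJump };
struct Instr { Kind kind; int target; }; // target: label id, for jumps only

// Returns the indices where new basic blocks start: at every label and
// right after every (possibly conditional) jump.
std::vector<int> blockStarts(const std::vector<Instr>& code) {
    std::vector<int> starts{0}; // the entry always starts a block
    for (int i = 0; i < (int)code.size(); ++i) {
        if (code[i].kind == Kind::Label)
            starts.push_back(i);
        else if ((code[i].kind == Kind::Jump || code[i].kind == Kind::CondJump)
                 && i + 1 < (int)code.size())
            starts.push_back(i + 1);
    }
    // A label directly after a jump yields the same index twice; deduplicate.
    std::sort(starts.begin(), starts.end());
    starts.erase(std::unique(starts.begin(), starts.end()), starts.end());
    return starts;
}

Consecutive start indices then delimit the blocks, and the jump targets give you the successor edges of the graph.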

----

I also had some thoughts about the varying stack height (different values of N). To eliminate multiple runs for different N values, one thing you can do is use a new set of variables for each block, starting the variable numbering at index 0 when generating register code. When you need a stack value pushed by a previous block, you introduce variables -1, -2, etc., and generate variable-copying statements between the previous block(s) and the block you are converting to register code, to fill the variables with negative indices.

This introduces new intermediate blocks that just do some variable copying, but I think in the end it may save a lot of duplication in the register code. Obviously, having more than one variable holding the same value is suboptimal, but that should be fixable by finding and eliminating common sub-expressions.
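A tiny sketch of that per-block numbering; the only state needed while converting a block is the height relative to the block entry, which may go negative:

// Stack height relative to the entry of the current block. Slots that a
// predecessor pushed show up as negative indices.
struct BlockStack {
    int height = 0;
    int push() { return height++; } // slot index the pushed value lives in
    int pop()  { return --height; } // may be negative: value from a predecessor
};

If Mul is the first instruction of a block, the two pops yield -1 and -2, giving var_-2 := var_-2 * var_-1, and each predecessor block gets copy statements that fill var_-1 and var_-2 from its own numbering.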


Alberth said:
As for making blocks: I realized your stack language already has labels and jump instructions, so you could partition the stack-language code into blocks right away. The simplistic approach is to start a new block at each label (other blocks may jump to this point) and after each (possibly conditional) jump instruction. A more advanced approach is to first analyze whether labels are actually used as jump destinations.

Ah, it's starting to make sense now. I was a bit confused about how to deal with conditional jumps, intermixed labels, etc., but if we divide at both of those points, then it works.

Alberth said:
For example, dead variable analysis. A variable is dead if its value is not needed again in the remainder of the program. This information is useful for deciding whether you need to store an assigned value for future use. To be clear, this is different from variable usage: code may still assign a new value to the variable and use that subsequently. (But in that case you don't need the old value!)

I already have that optimization, but it was based on linear scanning, pretty much considering any label or jump a dead end before which no assumptions can be made. That goes for all of the optimizations I'm doing right now, so being able to compute this information properly would be very helpful. The only thing that works 100% right now is dead-code elimination, but that is based on a specialized scan that traces reachability from all jumps and labels (without considering anything as a block or similar).

Alberth said:
I also had some thoughts about the varying stack height (different values of N). To eliminate multiple runs for different N values, one thing you can do is use a new set of variables for each block, starting the variable numbering at index 0 when generating register code. When you need a stack value pushed by a previous block, you introduce variables -1, -2, etc., and generate variable-copying statements between the previous block(s) and the block you are converting to register code, to fill the variables with negative indices.

The suggestion for the stack-based marking was very helpful already 🙂 I'm not entirely sure about the exact algorithm yet, but even just for generating the trace information for which instruction consumes which input, I can already see it working. I would only need to keep a simulated stack that, instead of the values, stores the source instruction each value came from. When an instruction needs to pop a value, it pops the source location instead, and thus I can generate a tree of dependencies.
I'm still a bit unsure about the blocks in practice, though I have only thought out the baseline algorithm so far. I suppose, if you have the block information, you simply invoke this algorithm per block, treating the label as the start (having to calculate what the stack input looks like based on all the possible ways you can reach that block), and at the instruction that ends the block you know which values are present. What I don't fully understand yet is what happens when the stacks differ, like if conditional jump A pushes another value onto the stack that wouldn't be there when reaching label A normally… but maybe that can't even happen.
I'll do it step by step: start with just the linear scanning, and then do the blocks. I also hope that I can share this algorithm between my optimizer and JIT-compiler, since both essentially need a similar information structure.
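Roughly, what I have in mind for the trace (a linear sketch, ignoring jumps/blocks; the pop/push counts would come from the operation descriptions the optimizer already has):

#include <vector>

struct OpDesc { int pops; int pushes; }; // stack behaviour per instruction

// For every instruction, record which earlier instructions produced its
// operands, by simulating the stack with producer indices instead of values.
std::vector<std::vector<int>> traceOperands(const std::vector<OpDesc>& code) {
    std::vector<std::vector<int>> deps(code.size());
    std::vector<int> stack; // holds producing instruction indices, not values
    for (int i = 0; i < (int)code.size(); ++i) {
        for (int p = 0; p < code[i].pops; ++p) {
            deps[i].push_back(stack.back()); // this operand was produced here
            stack.pop_back();
        }
        for (int p = 0; p < code[i].pushes; ++p)
            stack.push_back(i); // instruction i is the producer of this slot
    }
    return deps;
}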

In any case, this was all very helpful, thank you so much 🙂

There is a programming language called FORTH that uses a data stack in addition to the return stack; the machine that executes it is very efficient.

In this context there is a version called r3, where compiling simply generates a chunk of asm for every basic word. The simple version of this compiler, programmed in the language itself, is at

https://github.com/phreda4/r3/blob/main/r3/system/r3asm0.r3

This compiler always keeps the top of the stack in a register and tries to avoid pushing values; for example, folding a literal and a sum into a direct add:

add rax, 4 ; optimized "4 +"

A more optimized version of the compiler is not finished yet, but the idea is to do a static analysis of the stack and assign available registers to positions, like the previous poster says.

The static analysis of the code can detect inlinable words, variables that act like constants, and foldable operations:

4 1 + is transformed to 5 before compiling.
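A minimal sketch of that kind of compile-time folding (illustrative C++, not r3's actual code; r3's own compiler is written in r3, see the link above):

#include <cstddef>
#include <vector>

struct Slot { bool isLiteral; long value; }; // compile-time view of a stack slot

// Called when compiling '+': if both operands are known literals, fold them
// into one literal instead of emitting an add.
bool tryFoldAdd(std::vector<Slot>& stack) {
    std::size_t n = stack.size();
    if (n < 2 || !stack[n - 1].isLiteral || !stack[n - 2].isLiteral)
        return false; // operands not both known: emit the real add
    stack[n - 2].value += stack[n - 1].value;
    stack.pop_back();
    return true; // "4 1 +" now compiles as the single literal 5
}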

Is your project public? Do you have a GitHub?

juan5 said:
There is a programming language called FORTH that uses a data stack in addition to the return stack; the machine that executes it is very efficient. In this context there is a version called r3, where compiling simply generates a chunk of asm for every basic word. The simple version of this compiler, programmed in the language itself, is at https://github.com/phreda4/r3/blob/main/r3/system/r3asm0.r3

So if I understand this correctly:

| generate amd64 code
| BASIC GENERATOR
| - no stack memory, only look next code to generate optimizations
| PHREDA 2020

They are only looking at adjacent codes? That is what my previous optimizer implementation did as well, and I found it to be quite limited. As you mentioned, you could find constant adds like 4 + 1 and merge them into one constant, but anything more complex would usually not work. Note that my optimization pass is not there because the backend is too inefficient; it's for optimizing further.

So what could be optimized by my existing implementation is:

LoadLocal 0
CallFunction DoSomethingWithInt
PushConst 1
MulInt

This MulInt can be removed, since X * 1 = X, easily by looking at just the operation that comes directly before it (PushConst 1). However, if the order were reversed:

PushConst 1
LoadLocal 0
CallFunction DoSomethingWithInt
MulInt

Now, just “looking at the next code” won't allow me to optimize anything, as DoSomethingWithInt cannot be optimized on its own, and the 1 in “1 * X” is far away from the actual MulInt instruction.

juan5 said:
A more optimized version of the compiler is not finished yet, but the idea is to do a static analysis of the stack and assign available registers to positions, like the previous poster says.

That is what the implementation I'm currently working on does. I've got the basic version finished already thanks to @alberth. There is still a lot to be done, but my primitive idea of simulating the stack operations with instruction references being pushed did work, based on the information about stack push/pop sizes that the optimizer already has. This allows me to take my second example, look up which instructions are the actual operands 1 and 2 of the MulInt, regardless of where they are, and see that operand 2 is actually a “1”, thus allowing me to remove both instructions.

The same algorithm should also be reusable in the JIT-compiler, where I can now build an actual tree of instructions and potentially eliminate all stack pushes that are not needed.
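For illustration, with that trace data the “1 * X” case from above becomes a lookup instead of a neighbour check (a sketch; the names and opcode ids are illustrative, not my actual implementation):

#include <cstddef>
#include <vector>

struct Instr { int op; int arg; bool removed = false; };
constexpr int PushConst = 0, MulInt = 1; // opcode ids, made up for the sketch

// deps[i] lists, per operand of instruction i, the instruction that produced
// it (as computed by a stack simulation like the one sketched earlier).
void elideMulByOne(std::vector<Instr>& code,
                   const std::vector<std::vector<int>>& deps) {
    for (std::size_t i = 0; i < code.size(); ++i) {
        if (code[i].op != MulInt) continue;
        for (int producer : deps[i]) {
            if (code[producer].op == PushConst && code[producer].arg == 1) {
                code[producer].removed = true; // drop the push of the 1...
                code[i].removed = true;        // ...and the multiplication itself
                break; // the other operand stays on the stack as the result
            }
        }
    }
}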

juan5 said:
Is your project public? Do you have a GitHub?

Unfortunately not. The language is part of a larger game engine, which I have not made public. I lack the motivation to write proper documentation, so even if it were public, it would not be very easy for people to understand.

EDIT: I can show the current algorithm for calculating the stack-trace data: https://pastebin.com/k689W2mm
This takes an array of instructions and operation descriptions, and produces an array of dependencies for those instructions. It still doesn't deal with jumps/blocks, but it produces the necessary data for tracing any operation.

