Automata, Virtual Machines and Compilers

Published April 27, 2015 by Vilem Otte, posted by Vilem Otte
Do you see issues with this article? Let us know.
Advertisement

Greetings! I'm about to step into a bit larger topic that does not seem to be related to games directly, but it actually is. The whole topic is about automata, virtual machines and compilers (and not just about them!). The topic is large, that is why it will not be included whole inside a single article, anyways after each article I would like to provide code that one can run and that is easy to understand. What knowledge do you need to understand this article? You definitely need to know something about programming. Some knowledge of automata and assembler is also welcome. And of course you need your favourite programming language to work in. In the first article I would like to demonstrate the basics behind simple math expressions, very basic theory behind compilers, disassemblers and computation machines.

Formal Definitions

Language Definition

In the beginning we need to define our language. For start it is going to be fairly simple (hopefully we are going to extend it further) and our expressions will only consist of numbers (integer values), operators (for the beginning we're fine with +, -, * and /) and parentheses.

Backus-Naur Form

For language definition we are going to use so-called BNF (Backus-Naur Form) which is a method to describe context-free grammar. The form is actually set of simple rules written as:

<ident> ::= <expression>

Where the value on left side is a non-terminal symbol, and on the right side there is a set of terminal symbols (if any) followed by non-terminal symbols (if any). On the right side we can have multiple expressions separated using | (which means we can replace the non-terminal on the left with one of the expressions on the right).

Parts of the expression can be enclosed in '[' and ']' brackets followed by '*' (which means that this expression is repeated N times, where N >= 0), or '+' (which means that this expression is repeated N times, where N >= 1), or '^N' where N is a number (which means that this expression is repeated exactly N times). An example language is:

<integer> ::= [0..9]+ 
<expression> ::= <term> + <term> | <term> - <term>
<term> ::= <factor> * <factor> | <factor> / <factor>
<factor> ::= (<expression>) | <integer>
<program> ::= [<factor>]*

Automata

With this grammar, we can generate any possible expression that is valid in the language, which is great, but... we do not need to generate all valid inputs of our language, that will be left for the user, we need to accept the language and during that process it is needed to generate the code that is simple enough for some sort of virtual machine to execute.

As the language is context-free, it is possible to use a recursive descent parser to accept (e.g. it is valid expression) or reject (e.g. it is invalid expression) the input. I'm not going to write too much about parsers in the series - I challenge users to try writing them. Of course you are free to use Flex or any other generator (but I do not recommend this, unless you know how exactly you would parse something - these production tools are awesome, but only when you know how the actual parsing works).

What is a recursive descent parser? It is a top-down parser, built from mutually recursive procedures. Each of these procedures implements often one (and sometimes more) of productions of the grammar it recognizes. The exciting thing about this is that it closely resembles BNF!

Compilation and Execution

This is a process of producing machine code from some other language. Let me present a simple abstraction here, let us have a 'program':

"Compute three plus four" 

Such sentence is, well... not really useful, we need to translate it into something we can work with. One of such formal languages is math. If we translate such sentence into:

3+4 

We know a way to compute it (there are multiple computation models how to compute this), let me present one:

MOVE reg0 3 
MOVE reg1 4 
ADD reg0 reg1 
PRINT reg0 

Having a machine with 2 registers, allowing for 3 instructions, where:

  • Instruction MOVE takes 2nd argument (value) and moves it into 1st argument.
  • Instruction ADD takes value stored in register in 2nd argument and adds it to value stored in register in 1st argument
  • Instruction PRINT prints the value in 1st argument onto screen

Execution process of such program will work as following:

INSTRUCTION | REGISTERS [] | SCREEN [] 
[-, -] | [] 
MOVE reg0 3 [3, -] | [] 
MOVE reg1 4 [3, 4] | [] 
ADD reg0 reg1 [7, 4] | []
PRINT reg0 [7, 4] | [7] 

Such approach is known as imperative execution (we are giving orders to computing machine of what to do). This paradigm has one large advantage over other variants of execution: it is rather simple to make a machine that executes imperative language.

Simple calculator

Let us begin with something simple, and what is the most simple language we can imagine? It is a calculator of course! Our goal is to be able to handle 4 operators (+, -, * and /), parentheses and integer numbers. So, let us formally define a language in BNF.

<integer> ::= [0..9]+ 
<expression> ::= <term> + <term> | <term> - <term>
<term> ::= <factor> * <factor> | <factor> / <factor>
<factor> ::= (<expression>) | <integer>
<program> ::= [<factor>]*

Note: such language is exactly the one in the previous example.

Compiler

These rules are to be rewritten into the source, I'm going to provide pseudo-code here (you can see full working code in accompanying project).

Let us start with Integer, what we want to do when we hit an integer - well that is simple, we want to put its value into the register (e.g. somewhere where we can further work with it):

Note that in our assembly we move the right value or register into left one. Also after each operation, the result is stored in left register.

Integer() 
{ 
	printLine("mov.reg.i32 r0 %s", GetValue()); 
} 

Where GetValue is just a function that takes next token from stream, validates whether it is an integer (containing only 0..9). Such function can look like following:

GetValue() 
{ 
	string output; 
	int i = 0; 
	string token = GetNextToken(); // Gets current token and moves to next 
	do 
	{ 
		if (token in [0..9]) 
		{ 
			output += token; 
		} 
	} 
	while (i < token.length); 
	Match(); 
	return output; 
} 

Note. for '+' (e.g. at least one repetition) we use a do-while construct. I'm intentionally going to skip add_operation and mul_operation rules and jump over to Factor.

Factor() 
{ 
	if (NextToken is LEFT-PARENTHESIS) 
	{ 
		Match(LEFT-PARENTHESIS); // Eats current token, matches it against argument and goes to next token 
		Expression(); 
		Match(RIGHT-PARENTHESIS); // Eats current token, matches it against argument and goes to next token 
	} 
	else 
	{ 
		Integer(); 
	} 
} 

This was quite obvious - in parentheses we always have another expression, outside we have just an integer. The following two Term and Expression are the interesting ones:

Term() 
{ 
	Factor(); 
	while (NextToken is MULTIPLICATION | DIVISION) 
	{ 
		printLine("push.i32 r0"); 
		if (NextToken is MULTIPLICATION)) 
		{ 
			Match(MULTIPLICATION); // Eats current token, matches it against argument and goes to next token 
			Factor(); 
			printLine("pop.i32 r1"); 
			printLine("mul.i32 r0 r1"); 
		} 
		else if (NextToken is DIVISION) 
		{ 
			Match(DIVISION); // Eats current token, matches it against argument and goes to next token 
			Factor(); 
			printLine("pop.i32 r1"); 
			printLine("div.i32 r1 r0"); 
			printLine("mov.reg.reg r0 r1"); 
		} 
		else if (NextToken is NULL) // If there is no next token 
		{ 
			// Crash compilation and print error Expected("Expected multiplication or division operation"); 
		} 
	} 
} 

Expression() 
{ 
	Term(); 
	while (NextToken is ADDITION | SUBTRACTION) 
	{ 
		printLine("push.i32 r0"); 
		if (NextToken is ADDITION) 
		{ 
			Match(ADDITION); // Eats current token, matches it against argument and goes to next token 
			Term(); 
			printLine("pop.i32 r1"); 
			printLine("add.i32 r0 r1"); 
		} 
		else if (NextToken is SUBTRACTION) 
		{ 
			Match(SUBTRACTION); // Eats current token, matches it against argument and goes to next token 
			Term(); 
			printLine("pop.i32 r1"); 
			printLine("sub.i32 r0 r1"); 
			printLine("neg.i32 r0"); 
		} 
		else if (NextToken is NULL) // If there is no next token 
		{ // Crash compilation and print error 
			Expected("Expected addition or subtraction operation"); 
		} 
	} 
} 

What happens here? I know it can seem confusing at first - but we are doing exactly what BNF rule says to us. Note that we handled add_operation/mul_operation here in the while condition and based upon what operation we encountered we do different things. Actually we started working with operator precedence (that is why we always push the register value onto stack and pop it prior to working with that), the rest should be clear.

Addition and subtraction instructions are inside Expression because they are lower priority compared to multiplication and division handled in Term (in recursion, the deeper we are, the higher the operator precedence) - so resolving inside parentheses and actual values have the highest precedence, multiplication and division are in the middle and addition and subtraction have the lowest precedence.

I know this is not the clearest thing to understand, and I highly encourage you to try running the compiler in debug mode so you can actually see what is happening (and that precedence is solved correctly in this way).

When implementing the compiler like this and running on some input like:

(3 + 4 * (7 - 2)) / 2 

We obtain assembly:

mov.reg.i32 r0 3 
push.i32 r0 
mov.reg.i32 r0 4 
push.i32 r0 
mov.reg.i32 r0 7 
push.i32 r0 
mov.reg.i32 r0 2 
pop.i32 r1 
sub.i32 r0 r1 
neg.i32 r0 
pop.i32 r1 
mul.i32 r0 r1 
pop.i32 r1 
add.i32 r0 r1 
push.i32 r0 
mov.reg.i32 r0 2 
pop.i32 r1 
div.i32 r1 r0 
mov.reg.reg r0 r1 

Let us demonstrate what the execution would look like (by stack we mean LIFO stack):

mov.reg.i32 r0 3 // Registers [ 3, -] Stack [] 
push.i32 r0 // Registers [ 3, -] Stack [ 3] 
mov.reg.i32 r0 4 // Registers [ 4, -] Stack [ 3] 
push.i32 r0 // Registers [ 4, -] Stack [ 3, 4] 
mov.reg.i32 r0 7 // Registers [ 7, -] Stack [ 3, 4] 
push.i32 r0 // Registers [ 7, -] Stack [ 3, 4, 7] 
mov.reg.i32 r0 2 // Registers [ 2, -] Stack [ 3, 4, 7] 
pop.i32 r1 // Registers [ 2, 7] Stack [ 3, 4] 
sub.i32 r0 r1 // Registers [-5, 7] Stack [ 3, 4] 
neg.i32 r0 // Registers [ 5, 7] Stack [ 3, 4] 
pop.i32 r1 // Registers [ 5, 4] Stack [ 3] 
mul.i32 r0 r1 // Registers [20, 4] Stack [ 3] 
pop.i32 r1 // Registers [20, 3] Stack [] 
add.i32 r0 r1 // Registers [23, 3] Stack [] 
push.i32 r0 // Registers [23, 3] Stack [23] 
mov.reg.i32 r0 2 // Registers [ 2, 3] Stack [23] 
pop.i32 r1 // Registers [ 2, 23] Stack [] 
div.i32 r1 r0 // Registers [ 2, 11] Stack [] 
mov.reg.reg r0 r1 // Registers [11, 11] Stack [] 

I recommend to check a few more examples and hand-compute the registers and stack value on the run. It will definitely help to understand that operator precedence will work here and how the calculator works.

Disassembler

While I already mentioned how the computation works, I don't really recommend writing a virtual machine and processing strings. We can do better!

Performing very simple disassembly and storing everything in binary form is a far better way - all you need is to think of the opcodes for your instructions, assign IDs to your registers and store all these numbers into binary form.

Note that this way you can actually perform disassembly into x86 or x64 if you wish, but that is not necessary. Implementing a disassembler is very straight-forward and I recommend to look into accompanying source code, even though the implementation in there is not highly effective, it is easy to understand.

Virtual Machine

Up to here, we could only execute the code "by hand", but that ends now! What our virtual machine needs? It needs some memory where we can store stack and the actual code that is to be executed, and at least 4 registers - as our code is working with 2 (r0 and r1) and 2 more, one for stack pointer (as we need stack as temporary memory for our computation) and one for instruction pointer (so we know what we are executing right now).

Now, we read (in binary) our application to the memory, place instruction pointer to the address where there is beginning of our application and stack pointer right after the end of or application in memory. E.g.:

|O|U|R|_|A|P|P|L|I|C|A|T|I|O|N|_|S|O|U|R|C|E|_|C|O|D|E| | | | | | | | | | | | | | 
IP                                                     SP 

Where IP represents memory address where instruction pointer points to, and SP represents memory address were stack pointer points to. E.g. in this case our IP=0 and our SP=27. As each sub-computation stores result into R0 (register 0), the result of the whole computation will be also in register 0. Note that I'm actually printing both registers in the accompanying source code, once computation finishes.

Conclusion

The article showed basic principles in automata theory, formal languages and compilers. I know it wasn't too much and creating something useful out of this might take a lot more effort and time. Compilers aren't easy stuff, and there is not many recent articles about them. If the time allows, I'd be very glad to continue step-by-step adding more and more features into this little compiler, eventually making it useful. For next time, I promise to talk about types! Accompanying source code demonstrating sample implementation.

Look at the sample implementation here: https://github.com/Zgragselus/Compiler/tree/v1.0

Changelog:

29th September 2020 - Reformatted article, BNF was wrong; pointed out link to GitHub repository, where there is proper project with proper tag.
30th March 2020 - Reformatted article for new GameDev.net formatting

Notes:

Mirror posted on my site - https://www.otte.cz/new/page/articles/post/2

Cancel Save
0 Likes 6 Comments

Comments

mohessaid

This is a good articles and well demonstrated. I created a parser generator two years ago using the weak precedence algorithm to generate a syntax analyzer. I used C#.

The generator receive a weak precedence grammar as input and then verify it and generate the analyzer.

You can check it directly by typing your inputs in the checking area. or generate the source code of your new analyzer.

I didn't implement everything like yacc does but you can generate a full elevator for arithmetic operations using a weak precedence grammar.

you can check my parser: here

April 21, 2015 02:15 PM
Norman Barrows

i'd peer review it, but its been too long since i took "Intro to Formal Languages"! From what i recall, it sure looks right.

how about an article about the fun stuff like alternate architectures like no ram and all registers?

or the not fun stuff like implementing parameter passing to and from subroutines - stack frame pointers and all that jazz. that's where i usually give up on a pcode vm and settle for a macro compiler that compiles to c++ or whatever - its usually less work.

April 24, 2015 11:08 PM
Vilem Otte

#Norman Barrows - Actually I would like to extend this into at least like few articles that would in total cover actual computation (this lesson), types, local & global variables, control structures, procedures (and calling conventions), pointers, linking and maybe even structure definitions (from where it is possible to go up into OOP).

I wanted to go from the basics up to the fun stuff, keeping some solid connection to the theory.

#mohessaid - Thanks for the link. I wanted to add some simple parser into the code itself, but not going too deep into parsers. While I used flex, yacc, bison and such generators they always generated ugly code (but hey, it works), plus when you are concerned about speed it might be worth to try them (I bet they outperform the posted code in parsing).

The funny thing is, that when you get into more complicated thing I got with the production scripting language in the work, you realize that context-free language parsers are not enough (and so isn't BNF) ~ you start to become context sensitive and the actual fun in implementation begins. If there is time and readers, I'd like to get even this far into the topic.

April 25, 2015 10:09 AM
Alberth
Attaching the code generator directly in the parser (as in, perform 'print' while you read the input) makes the code nicely compact.

In my experience however, a much better approach is to split parsing and code generation into separate phases. First store parsed structures into an AST (abstract syntax tree), then generate code while traversing the stored AST data.

In this case, the AST would be a hierarchy of expression node classes.
August 16, 2015 06:47 AM
Aardvajk
Alberth - sorry for downvote. Meant to upvote as I agree re node trees but my finger slipped on phone and can't undo. Sorry.
November 14, 2016 04:40 PM
Vilem Otte

Aardvajk, I've fixed that by upvoting Alberth. The purpose was to make code compact and fairly easy to understand. Although you are right, the abstract syntax tree would make it more usable and extensible.

November 14, 2016 04:54 PM
You must log in to join the conversation.
Don't have a GameDev.net account? Sign up!

Have you ever wanted to find magic behind the actual computation inside your computer? How are compilers made? What is assembler and disassembler? How to make your own scripting language?Then the following article explains basics in theory and implementation behind this!

Advertisement
Advertisement
Advertisement