Code Generation Project Handbook
COP-3402
Table of Contents
1. Code generation projects
2. C++ and ANTLR usage
CodeGen.cpp
contains an ANTLR listener that will walk over the parse tree of SimpleIR input files and trigger enter
listener functions for each construct, such as enterFunction
and enterIfGoto
.
2.1. Writing to standard out
To write to the output using C++ streams, use cout
for standard out, e.g.,
cout << "hello, world!" << endl;
For instance, the following outputs an assembly comment:
cout << "# this is an assembly comment" << endl;
Chain together multiple values to emit. Use endl
to emit a line-ending, i.e., newline on unix-based OSes.
2.2. Retrieving ANTLR parse tree contents
A ctx
parameter is passed to each listener, which contains the SimpleIR construct subtree. The contents of the tree can be accessed using the naming given in the SimpleIR.g4
grammar file.
Accessing single terminal tokens
For instance, the grammar for a function construct is as follows:
function: 'function' functionName=NAME localVariables? parameters? statement* returnStatement end;
To access the name terminal, given by functionName
, use ctx->functionName->getText()
to get the tree node, then the string text of the tree node:
string name = ctx->functionName->getText();
->
accesses the field of data object via a pointer. This is just like C, where ->
accesses a struct field through a pointer to a struct. getText()
is an ANTLR function that gets the string value of a terminal token.
Consult the SimpleIR.g4
to find the names of other terminals in other constructs.
Accessing operands
Many operands in the SimpleIR grammar take either a number or a name, e.g., in the return statement's grammar:
returnStatement: 'return' operand=(NAME | NUM);
A helper function is provided in the CodeGen.cpp
template to convert this operand into the correct assembly operand, depending on whether it is a number or a name. The helper function is shown below as well:
static string operand_to_string(antlr4::Token *token) { stringstream operand; if (SimpleIRParser::NAME == token->getType()) { operand << symtab[token->getText()] << "(%rbp)"; } else if (SimpleIRParser::NUM == token->getType()) { operand << "$" << token->getText(); } else { assert(false); } return operand.str(); }
This function uses the ANTLR-generated SimpleIR token types to distinguish between name and number tokens, e.g., SimpleIRParser::NAME == token->getType()
. Name tokens represent variable names. As discussed in class, variables are allocated on the stack.
Accessing lists of tokens
ANTLR permits specifying a list of terminals or non-terminals, e.g., with NAME+
meaning, one or more NAME
terminals or NAME*
, meaning zero or more. For instance, take the grammar for parameters
:
parameters: 'parameters' formals+=NAME+;
This will create a C++ vector data structure for formals
. Like arrays and linked lists, vectors store a sequence of data of a given type. We can iterate over vectors and access elements similarly to an array, by initializing a loop variable, e.g., formal_i
, and iterating from 0 to the length of the vector, given by formals.size()
.
auto formals = ctx->formals; for (int formal_i = 0; formal_i < formals.size(); formal_i++) { cout << "# " << formals[formal_i]->getText() << endl; }
Vector elements can be access like arrays, with formals[formal_i]
. In this project, all vectors will be of tokens, so you can get the string text of one token in the vector formals[formal_i]->getText()
.
3. System V x86 64 ABI
- Parameter passing
- Return values
- Stack alignment
4. Debugging with GDB
One way to help trace the function call is to use gdb
. The following will rebuild the main
program with debugging symbols ;on, run gdb
, then step through each assembly instructions.
gcc -g -o main main.s func.s # compile with debugging symbols (-g) gdb main b main # setup breakpoint at main r # start running, breaks at main si # step instruction to see next instruction # hitting enter will repeat last command, e.g., si c # once done use c to continue running without stopping
4.1. Debugging tutorials
See these resources for more information on gdb
.
- Basic Assembler Debugging with GDB
- Chapter 2 of 21st Century C
5. Laying out the assembly file
5.1. enterUnit
virtual void enterUnit(SimpleIRParser::UnitContext * ctx) override { // filename cout << "\t.file \"" << filename << "\"" << endl; cout << "\t.section .note.GNU-stack,\"\",@progbits" << endl; cout << "\t.text" << endl; }
This function emits some assembly pseudo-ops expected by the toolchain and starts the text section where the program code will go.
.file "main.ir" .section .note.GNU-stack,"",@progbits .text
main.ir
should be the name of the input file or stdin
when read from standard input instead.
6. Variables
6.1. enterLocalVariables
This function generates assembly instructions that allocate space on the stack for all local variables. In SimpleIR, all variables are 64-bit integers.
virtual void enterLocalVariables(SimpleIRParser::LocalVariablesContext * ctx) override { auto variables = ctx->variables; // start from of an offset of 2x bytewidth from rbp, 1 to start after rbp // and another 1 to skip rbx which will have already been pushed int starting_offset = 2; for (int variable_i = 0; variable_i < variables.size(); variable_i++) { string variable = variables[variable_i]->getText(); int offset = (starting_offset + variable_i) * bytewidth; symtab[variable] = -1 * offset; } // compute size of the stack space needed int stackspace = variables.size() * bytewidth; // ceiling to 8 bytes stackoffset = std::ceil(stackspace / 8) * 8; // align to 16 bytes, accounting for rbx being pushed by adding 8 stackoffset += (stackoffset + 8) % 16; // emit the subtraction to allocate stack space cout << "\t# allocate stack space for locals" << endl; cout << "\tsub\t$" << stackoffset << ", %rsp" << endl; }
The CodeGen template has a symbol table to store the offsets of each declared variable:
map<string, int> symtab;
This function works by first collecting the set of variables to allocate space for from the syntax tree of the input program:
auto variables = ctx->variables;
Then it creates space on the stack for each local variable. The memory location of each local variable can be determined by an offset from the base pointer %rbp
per the System V AMD64 ABI. The compiler records each variables offset in a symbol table (symtab
) that maps the name of the variable to its offset from %rbp
, which the compiler will use to translate variable accesses and assignments whenever the variable is used in SimpleIR instructions. enterLocalVariables
function creates the symbol table by iterating over variables
and allocating the offsets:
int starting_offset = 2; for (int variable_i = 0; variable_i < variables.size(); variable_i++) { string variable = variables[variable_i]->getText(); int offset = (starting_offset + variable_i) * bytewidth; symtab[variable] = -1 * offset; }
We start offsetting 2 stack slots away from %rbp, because offset 0 stores the calling function's base pointer and offset 1 holds %rbx
which we pushed since it is a callee-saved registered according to the ABI. We start with the first local variable at offset 2 and store the subsequent locals at increasing offsets (starting_offset + variable_i)
.
symtab
will then contain a mapping from each local variable to its own distinct offset, which are multiples of the 8-byte width, e.g., -16, -24, -32, etc. (Do you remember why the offsets are negative in this ABI?). symtab
will tell you how to translate variable uses into offsets from the base pointer %rbp
for generating equivalent assembly instructions.
Finally, the functions computes the total amount of space required for %rbx and the local variables (number of variables times 8 bytes each), aligns the stack to 16 bytes (required by the ABI), and emits an assembly instruction that updates the stack pointer to allocate the space, i.e.,
// compute size of the stack space needed int stackspace = variables.size() * bytewidth; // ceiling to 8 bytes stackoffset = std::ceil(stackspace / 8) * 8; // align to 16 bytes, accounting for rbx being pushed by adding 8 stackoffset += (stackoffset + 8) % 16; // emit the subtraction to allocate stack space cout << "\t# allocate stack space for locals" << endl; cout << "\tsub\t$" << stackoffset << ", %rsp" << endl;
stackoffset
is a global variable that will hold the amount of stack space allocated. enterEnd
will use this in the epilogue.
6.2. enterAssign
This function generates assembly code corresponding to a variable assignment.
virtual void enterAssign(SimpleIRParser::AssignContext * ctx) override { string operand = operand_to_string(ctx->operand); cout << "\t# assign " << ctx->operand->getText() << " to " << ctx->variable->getText() << endl; cout << "\tmov\t" << operand << ", %rax" << endl; cout << "\tmov\t%rax, " << symtab[ctx->variable->getText()] << "(%rbp)" << endl; }
The syntax of the assignment statement is
assign: NAME ':=' operand=(NAME | NUM);
Notice that operand may either be a NAME
(which is the syntax of an identifier) or a NUM
(which is the syntax for an integer literal). An assignment takes either a constant integer literal or a variable identifier and copies that data to the left-hand side variable. This translates to either a mov immediate to a register (integer literal) or a load from memory to a register (variable name on right-hand side), followed by a store of that value to the memory location associated with the variable name.
If the right-hand side is a variable name, the compiler will generate a load from the memory location of the variable to a temporary register (we'll use %rax
for the temporary register). To determine the memory location, recall that our compiler assigns each local variable an offset from %rbp
that we stored in a offset table created by enterLocalVariables
. For example, if variable b has offset -24
from %rbp
, to load that memory location's value into %rax
, we use the following instruction:
mov -24(%rbp), %rax
-24(%rbp)
means, take the value of %rbp
(which is a pointer to the current stack) frame, add -24
to it, then dereference the address and load that value into %rax
.
If the operand is a NUM
such as 5
, then the immediate move will look like this, recalling that the $
prefix indicates an immediate value:
mov $5, %rax
The logic of creating this assembly operand is included in the operand_to_string
function, provided to you:
string operand = operand_to_string(ctx->operand);
When the operand is a NAME
, the compiler creates the %rbp
-indirect offset argument, e.g., -24(%rbp)
, i.e.,
// in operand_to_string operand << symtab[token->getText()] << "(%rbp)";
When the operand is a NUM
, we only need to use an immediate value argument, i.e.,
// in operand_to_string operand << "$" << token->getText();
Either way, the operand
variable now contains the corresponding assembly instruction argument for the right-hand side of the SimpleIR statement.
Finally, enterAssign
generates the copy of the right-hand side to the temporary register %rax
, then stores the value into the memory location of the left-hand side variable:
cout << "\tmov\t" << operand << ", %rax" << endl; cout << "\tmov\t%rax, " << symtab[ctx->variable->getText()] << "(%rbp)" << endl;
The left-hand size variable's memory location is found using the offset table (symtab
). The name of the variable is retrieved from syntax tree with ctx->variable->getText()
.
6.3. A simple example
function example_variables localVariables x y x := 1 y := x return 0 end function
.file "stdin" .section .note.GNU-stack,"",@progbits .text .globl example_variables .type example_variables, @function example_variables: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $24, %rsp # assign 1 to x mov $1, %rax mov %rax, -16(%rbp) # assign x to y mov -16(%rbp), %rax mov %rax, -24(%rbp) # set return value mov $0, %rax # epilogue add $24, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
7. Defining functions
7.1. enterFunction
To create the function, emit the assembly pseudo-ops .globl
and .type
with the name of the function (ctx->functionName->getText()
), as well as a label for the function. Then emit the prologue. Use whatever the name of the function is for all three, e.g., for the function factorial
the function creation and prologue would look like this:
.globl factorial .type factorial, @function factorial: # prologue pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved
factorial
should be replaced by whatever the name of the function is.
7.2. enterEnd
Emit the assembly function epilogue to restore the stack and base pointer registers, then issue the return instruction
# epilogue add $0, %rsp # restore old stack pointer pop %rbx # restore rbx pop %rbp # restore old base pointer ret
0
should be replaced by whatever was subtracted from the stack to allocate space for local variables. The stackoffset
variable holds this value. Feel free to omit the instructions when stackoffset
is 0, since this has no affect on the stack pointer.
7.3. enterReturnStatement
The return IR instruction does not issues the assembly ret
return instruction, which is instead done after the epilogue. Instead, the return IR instruction sets the %rax
register to the return value, following to the System V x86 64 ABI.
# set return value mov $10, %rax
To get the text of the operand, use the operand_to_string
function. Then emit a mov
instruction that copies that operand's value into %rax
.
7.4. A simple function example
The following SimpleIR program defines a main function that returns 0.
function main return 0 end function
This is a compiled version of SimpleIR in x86 64-bit assembly that follows the System V X86 64 ABI.
.file "stdin" .section .note.GNU-stack,"",@progbits .text .globl main .type main, @function main: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # set return value mov $0, %rax # epilogue add $0, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
8. Function parameters and calls
8.1. enterParameters
In SimpleIR, the input program first declares all local variables, e.g.,
localvars x a b c d e f g h
then it declares which of those local variables is a parameter, e.g.,
params a b c d e f g h
In the above example, all local variables, except for x
are also parameters. The set of parameters is a subset of the local variables (or the same set).
The enterLocalVariables
function emits code that creates space for all local variables, including parameters. enterParams
then copies the parameter values set by the caller to the corresponding parameter local variables. For instance, for the above set of params
, the following assembly code copies the caller's values to the callee's local variables:
# move register parameter a to local variable mov %rdi, -24(%rbp) # move register parameter b to local variable mov %rsi, -32(%rbp) # move register parameter c to local variable mov %rdx, -40(%rbp) # move register parameter d to local variable mov %rcx, -48(%rbp) # move register parameter e to local variable mov %r8, -56(%rbp) # move register parameter f to local variable mov %r9, -64(%rbp) # move stack parameter g to local variable mov 16(%rbp), %rax mov %rax, -72(%rbp) # move stack parameter h to local variable mov 24(%rbp), %rax mov %rax, -80(%rbp)
The caller passes parameters both via registers and the stack, the first six via registers and the rest via the stack. To copy these parameters to the memory locations for the callee's local variable, enterParams
generates mov
instructions for the first size parameters from the ABI-defined set of parameter registers, which are given in the code template:
static const vector<string> registers = { "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" };
enterParams
can retrieve the list of parameters names for the input program from the ANTLR syntax tree, i.e.,
auto formals = ctx->formals; for (int formal_i = 0; formal_i < formals.size(); formal_i++) { cout << "# " << formals[formal_i]->getText() << endl; }
For each of the register parameters, generate a mov
from the corresponding register to the local variable's memory location. The local variables' memory locations are offsets from the base pointer, as described in enterLocalVariables
, i.e., by passing the name of the variable to symtab
to get the offset. Then generate the mov
instruction from the register, e.g., from the example above
# move register parameter c to local variable mov %rdx, -32(%rbp)
This instruction stores the value in %rdx
, i.e., the third parameter, into -32(%rbp)
, which is the memory location of the local variable c
for the above program. See enterAssign
for more details on getting the memory location of local variables.
For the stack allocated parameters, mov
them from the stack to the local variable. Since this is a move between two memory locations, first generate code to copy the value to a temporary register (we'll use %rax
), resulting in two move instructions. The first instruction copies from the parameter's memory location on the stack to %rax
, e.g.,
mov 16(%rbp), %rax
Then the next instruction copies from %rax
to the location variable's memory location, given by symtab
, e.g.,
mov %rax, -72(%rbp)
The offsets for the parameters are positive from the %rbp
(why is this the case?). They start from 16, because %rbp
plus 0 is the old base pointer and %rbp
plus 8 is the return address, so %rbp
plus 16 is the 7th parameter (the first after the register parameters), %rbp
plus 24 is the 8th, etc.
8.2. enterCall
There are several steps to calling a function in assembly:
- Passing parameters
- Making the call
- Restoring the stack pointer
- Saving the return value
The following is an example of a SimpleIR function call. Note that this assumes that the variable names have already been declared (enterLocalVariables) and defined (enterAssign), both of which are already given to you.
x := call paramtest a b c d e f g h
This call can be implemented with the following assembly code:
# pass stack-allocated arguments in reverse order push -96(%rbp) push -88(%rbp) # pass the first six parameters via the pre-defined set of registers mov -80(%rbp), %r9 mov -72(%rbp), %r8 mov -64(%rbp), %rcx mov -56(%rbp), %rdx mov -48(%rbp), %rsi mov -40(%rbp), %rdi call paramtest add $16, %rsp mov %rax, -32(%rbp)
# pass the first size parameters via the pre-defined set of registers mov -40(%rbp), %rdi mov -48(%rbp), %rsi mov -56(%rbp), %rdx mov -64(%rbp), %rcx mov -72(%rbp), %r8 mov -80(%rbp), %r9 # pass the remaining arguments via the stack push -88(%rbp) push -96(%rbp) # make the call call paramtest # restore the stack pointer add $16, %rsp # save the return value mov %rax, -32(%rbp)
Since there are either parameters, we will need to use both registers (for the first six) and the stack to pass parameters, per the System V AMD64 ABI. There is one mov
instruction for each of the first six parameters. These instruction copy each of the local variable's value to one of the predefined registers. The local variables' memory locations are offsets from the base pointer, as described in enterLocalVariables
. The set of registers for the first size parameters is defined by the ABI. An array holding these register names is provided:
static const vector<string> registers = { "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" };
After the parameters have been passed, the paramtest
function is called with the call
assembly instruction. Once the function returns, the stack pointer is restored to its state before pushing the two parameters onto the stack (two 8-byte values, so 16 bytes are added). Finally, the return value (found in %rax
due to the ABI) is saved in x
, the left-hand size of the SimpleIR call statement.
The name of the variable to assign to the return value can be retrieved with
ctx->variable->getText()
The function name can be retrieved with
ctx->functionName->getText()
You can retrieve the left-hand side variable (to assign the return value to), the function name to call, and the parameters with the following python code:
The parameters can be accessed by iterating over the vector of name tokens:
auto actuals = ctx->actuals; for (int actual_i = actuals.size() - 1; actual_i >= 0; actual_i--) { // an illustration of retrieving each actual parameter string name cout << "# " << actuals[actual_i]->getText() << endl; }
For the register parameters, generate a mov
instruction that copies the local variable's value (from the stack frame, e.g., -32(%rbp)
) to the corresponding parameter register (listed in registers
above). Then for each stack parameters (if there are any), generate a push
instruction that pushes the local variables value.
After processing the parameters, generate the call to the function name, call paramtest
above. After the call, generate an instruction to remove any parameters from the stack (How do we know how much space was used on the stack?) This is done in the example above with the following:
add $16, %rsp
Finally, the return value is stored in %rax
according to the System V AMD64 ABI convention (which your functions will also follow). Recall how to access the memory location of a variable. Use a mov
to store the return value (%rax
) into the memory location of the left-hand side variable, e.g.,
mov %rax, -24(%rbp)
8.3. Function call examples
A single parameter
- SimpleIR
main.ir
cat << EOT > main.ir function main localVariables retval x x := 3 retval := call func x return retval end function EOT
func.ir
cat << EOT > func.ir function func localVariables a parameters a return a end function EOT
- Compile the IR to assembly
./CodeGen main.ir > main.s
Possible output:
.file "main.ir" .section .note.GNU-stack,"",@progbits .text .globl main .type main, @function main: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $24, %rsp # assign 3 to x mov $3, %rax mov %rax, -24(%rbp) mov -24(%rbp), %rdi call func add $0, %rsp mov %rax, -16(%rbp) # set return value mov -16(%rbp), %rax # epilogue add $24, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
./CodeGen func.ir > func.s
Possible output:
.file "func.ir" .section .note.GNU-stack,"",@progbits .text .globl func .type func, @function func: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $8, %rsp # move register parameter a to local variable mov %rdi, -16(%rbp) # set return value mov -16(%rbp), %rax # epilogue add $8, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
- Running
Assembly and link the generated assembly:
gcc -o main main.s func.s
Run main and inspect its exit code (which is set by the return from main):
./main echo $?
You should see
3
, which is what was passed tofunc
, returned tomain
, then ultimately returned frommain
.
Multiple parameters
- SimpleIR
cat << EOT > main.ir function main localVariables argc argv x a b c d e f g h parameters argc argv a := 1 b := 2 c := 3 d := 4 e := 5 f := 6 g := 7 h := 8 x := call paramtest a b c d e f g h return x end function EOT
cat << EOT > paramtest.ir function paramtest localVariables x a b c d e f g h parameters a b c d e f g h x := c return x end function EOT
- Compile the IR to assembly
./CodeGen main.ir > main.s
Possible output:
.file "stdin" .section .note.GNU-stack,"",@progbits .text .globl main .type main, @function main: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $88, %rsp # move register parameter argc to local variable mov %rdi, -16(%rbp) # move register parameter argv to local variable mov %rsi, -24(%rbp) # assign 1 to a mov $1, %rax mov %rax, -40(%rbp) # assign 2 to b mov $2, %rax mov %rax, -48(%rbp) # assign 3 to c mov $3, %rax mov %rax, -56(%rbp) # assign 4 to d mov $4, %rax mov %rax, -64(%rbp) # assign 5 to e mov $5, %rax mov %rax, -72(%rbp) # assign 6 to f mov $6, %rax mov %rax, -80(%rbp) # assign 7 to g mov $7, %rax mov %rax, -88(%rbp) # assign 8 to h mov $8, %rax mov %rax, -96(%rbp) push -96(%rbp) push -88(%rbp) mov -80(%rbp), %r9 mov -72(%rbp), %r8 mov -64(%rbp), %rcx mov -56(%rbp), %rdx mov -48(%rbp), %rsi mov -40(%rbp), %rdi call paramtest add $16, %rsp mov %rax, -32(%rbp) # set return value mov -32(%rbp), %rax # epilogue add $88, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
./CodeGen paramtest.ir > paramtest.s
Possible output:
.file "paramtest.ir" .section .note.GNU-stack,"",@progbits .text .globl paramtest .type paramtest, @function paramtest: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $72, %rsp # move register parameter a to local variable mov %rdi, -24(%rbp) # move register parameter b to local variable mov %rsi, -32(%rbp) # move register parameter c to local variable mov %rdx, -40(%rbp) # move register parameter d to local variable mov %rcx, -48(%rbp) # move register parameter e to local variable mov %r8, -56(%rbp) # move register parameter f to local variable mov %r9, -64(%rbp) # move stack parameter g to local variable mov 16(%rbp), %rax mov %rax, -72(%rbp) # move stack parameter h to local variable mov 24(%rbp), %rax mov %rax, -80(%rbp) # assign c to x mov -40(%rbp), %rax mov %rax, -16(%rbp) # set return value mov -16(%rbp), %rax # epilogue add $72, %rsp pop %rbx # restore old base pointer pop %rbp # restore old base pointer ret
- Running
Assembly and link the generated assembly:
gcc -o main main.s paramtest.s
Run main and inspect its exit code (which is set by the return from main):
./main echo $?
You should see
3
, which is what was passed tofunc
, returned tomain
, then ultimately returned frommain
.
9. Hard-coded Listeners for Functions
In order to compile executable programs that use arithmetic, branching, and pointers, your compiler needs to emit complete assembly programs. This depends on getting codegen1's listeners implemented correctly. To avoid penalizing you in codegen2 and codegen3 for mistakes in codegen1, you may instead hard code the assembly output for codegen1's functions using the implementations given in this section. These will hard-code the assembly output of your compile for the following SimpleIR template program, template.ir
:
function main # DO NOT CHANGE localVariables a b c d e f g result retval # DO NOT CHANGE # add your code to try out here retval := call print_int result # DO NOT CHANGE return 0 # DO NOT CHANGE end function # DO NOT CHANGE
You may use this SimpleIR program by adding your SimpleIR statements here and compiling. Note the lines that says "DO NOT CHANGE". These have been precompiled for you to assembly and provided in the hard-coded listeners below. As long as you do not change these and use the hard-coded listeners below, your implementations of codegen2 and codegen3's listeners will be gradeable.
print_int
is defined in iolib.c
, which is below:
#include <stdio.h> #include <stdint.h> #include <inttypes.h> int print_int(int64_t x) { printf("%" PRId64 "\n", x); return 0; } int64_t read_int() { int64_t x; scanf("%" PRId64 "d", &x); return x; }
You can simply link iolib.c
with the output of your compiler and your template assembly program will call print_int
, i.e.,
./CodeGen template.ir > template.s gcc -o template template.s iolib.c ./template
9.1. Hard-coded enterFunction
virtual void enterFunction(SimpleIRParser::FunctionContext * ctx) override { cout << "\t.globl main" << endl; cout << "\t.type main" << ", @function" << endl; cout << "main:" << endl; cout << "\t# prologue, update stack pointer" << endl; cout << "\tpushq\t%rbp # save old base ponter" << endl; cout << "\tmovq\t%rsp, %rbp # set new base pointer" << endl; cout << "\tpush\t%rbx # %rbx is callee-saved" << endl; }
9.2. Hard-coded enterEnd
virtual void enterEnd(SimpleIRParser::EndContext * ctx) override { cout << "\t# epilogue" << endl; cout << "\tadd\t$" << stackoffset << ", %rsp" << endl; cout << "\tpop\t%rbx # restore %rbx" << endl; cout << "\tpop\t%rbp # restore old base pointer" << endl; cout << "\tret" << endl; }
9.3. Hard-coded enterParameters
virtual void enterParameters(SimpleIRParser::ParametersContext * ctx) override { // empty for template.ir }
9.4. Hard-coded enterReturnStatement
virtual void enterReturnStatement(SimpleIRParser::ReturnStatementContext * ctx) override { cout << "\t# set return value" << endl; cout << "\tmov\t$0, %rax" << endl; }
9.5. Hard-coded enterCall
virtual void enterCall(SimpleIRParser::CallContext * ctx) override { cout << "\tmov\t-72(%rbp), %rdi" << endl; cout << "\tcall\tprint_int" << endl; cout << "\tadd\t$0, %rsp" << endl; cout << "\tmov\t%rax, -80(%rbp)" << endl; }
10. Arithmetic Operations
10.1. enterOperation
Arithmetic operations have the following syntax:
operation: variable=NAME ':=' operand1=(NAME | NUM) operatorKind=('+' | '-' | '*' | '/' | '%') operand2=(NAME | NUM);
There are four parts to arithmetic operations that are relevant to the compiler:
- The destination variable
ctx->variable->getText()
- The left operand
ctx.operand1
, may be an integer constantNUM
or variable nameNAME
. - The operator
ctx.operatorKind->getType()
, which may be one of five operator symbols. - The right operand
ctx.operand2
, may be an integer constantNUM
or variable nameNAME
.
The goal of the compiler is to generate assembly code that implements the arithmetic operation.
The compiler can generate code that
- Loads the operands into temporary registers
- Performs the arithmetic on the temporary registers
- Stores the result into the destination variable
For example, in the following SimpleIR,
function main localVariables x y x := y * 6 return 0 end function
the operation x := y * 6
compiles to
mov -24(%rbp), %rax mov $6, %rbx imul %rbx, %rax mov %rax, -16(%rbp)
Operands
The *
operator has two operands, y
and 6
. The y
operand translates to a load from the stack-allocated memory location for the y
, i.e., -24(%rbp)
. The 6
operand translates to an immediate value $6
. You can use the helper function, operand_to_string
to handle both; enterAssign
illustrates using this function. This code was translated by generating a mov
instruction for each operand to temporary registers, %rax
and %rbx
. This isn't the only way to translate the operation, but it makes for a simple compiler implementation.
Operator
Once we have both operands in registers, we can generate the assembly arithmetic operation corresponding to the symbol. In this case, we have *
, denoting multiplication, which can be implemented with imul
, i.e., integer multiplication. We multiply the values in the registers with imul %rbx, %rax
; remember that in the AT&T assembly syntax the destination is the second operand of the assembly instruction, which is also the left operand of the arithmetic operation. Multiplication and addition are commutative, but subtraction, division, and remainder are not, so this order matters. To check for the operator token type, compare the operator's type to token names provided in ANTLR-generated parser:
SimpleIRParser::PLUS
SimpleIRParser::MINUS
SimpleIRParser::STAR
SimpleIRParser::SLASH
SimpleIRParser::PERCENT
- Addition
add %rbx, %rax
is equivalent to
rax = rax + rbx
The left operand and the destination operand are the same register and are the second argument to the assembly instruction.
Remember, the destination is always the second argument to the AT&T assembly instruction.
- Subtraction
sub %rbx, %rax
This has the same argument ordering as the addition operation.
- Integer Multiplication
imul %rbx, %rax
This has the same argument ordering as the addition operation.
- Division and remainder
mov $9, %rax mov $5, %rbx # %rax holds the numerator cdq # now %rdx:%rax holds the sign-extended numerator idiv %rbx # divide %rdx:%rax by %rbx # quotient (result) now in %rax # remainder now in %rdx
The same operation,
idiv
is used for both division and remainder (both are computed together). The operation is fixed to using the%rax
register as the numerator. Strictly speaking%rdx
and%rax
form the numerator, but for 64-bit number, we can usecdq
to sign extend%rax
into%rdx
. Provide the denominator toidiv
. After the operation,%rax
holds the quotient (result of the division) and%rdx
holds the remainder.
Destination variable
The destination is a variable, so the compiler will need to generate an assembly instruction that stores the result of the operation into the memory location associated with this variable. The argument to the store operation will also be an offset from the base pointer, e.g., -16(%rbp)
for the x
variable.
Generate a mov
instruction that takes whatever the result of the arithmetic operation is (what register will it be in?) and stores it into this offset. For instance, if your compiler generates code that puts the result in %rax
and needs to store it into a variable that has offset -16
, it should generate the following:
mov %rax -16(%rbp)
11. Branching
There are two kinds of branching in SimpleIR, unconditional (gotoStatement
) and conditional (ifGoto
). Both use a label as a target of the branch.
11.1. enterLabel
The label syntax is as follow.
label: labelName=NAME ':';
Its syntax is very similar to assembly labels, so the compiler only needs to emit the label name ctx->labelName->getText()
followed by a colon.
11.2. enterGotoStatement
The uncondition branch has simple syntax:
gotoStatement: 'goto' labelName=NAME;
The compiler only needs to generate a jmp
instruction a provide the label name ctx->labelName->getText()
.
For instance, this SimpleIR code
loop: goto loop
generates the following assembly
loop: jmp loop
and causes an infinite loop.
11.3. enterIfGoto
Conditional branches have more complicated syntax:
ifGoto: 'if' operand1=(NAME | NUM) operatorKind=('=' | '!=' | '<' | '<=' | '>' | '>=') operand2=(NAME | NUM) 'goto' labelName=NAME;
Conditional branches compare two operands with one of several relational operators and branch to the label only if the comparison is true. They consistent of several parts
- A left operand
- A relational operator
- A right operand
- The label to branch to
Operands
The operands can be handled exactly as they are in arithmetic operations.
Operator
The operator tokens are handled similarly to those for arithmetic, but have different token types:
SimpleIRParser::EQ
SimpleIRParser::NEQ
SimpleIRParser::LT
SimpleIRParser::LTE
SimpleIRParser::GT
SimpleIRParser::GTE
The SimpleIR ifGoto
can be implemented by generating a comparison operation followed by a conditional jump. The comparison sets the internal EFLAGS register which the conditional jump will inspect to determine whether to make the jump.
The comparison operation is
cmp %rbx, %rax
where %rax is the left side of the comparison and the %rbx is the right (recall that AT&T syntax has operands reversed). While there is no destination register, you can think of cmp
as a subtraction that doesn't store the result; it only sets the EFLAGS
to record whether the result was zero (operands are equal) or negative (less than). Can the machine determine the other comparison operations with only these two flags?
Immediately after cmp
, generate the jump that corresponds to the relational operator, according to the following table:
ASM Op | SimpleIR Op | Jump if… |
---|---|---|
je | = | Equal |
jne | != | Not equal |
jl | < | Less than |
jle | <= | Less than or equal |
jg | > | Greater than |
jge | >= | Greater than or equal |
Example
function main localVariables x if x <= 10 goto falselabel x := 11 falselabel: x := x + 1 return 0 end function
The if x <= 10 goto falselabel
statement generates the following assembly:
mov -16(%rbp), %rax mov $10, %rbx cmp %rbx, %rax jle falselabel
12. Pointer operations
12.1. enterDereference
Given a derefence statement, e.g.,
t2 := *t1
where the variables have the following offsets
Variable | Offset |
---|---|
t1 | -40 |
t2 | -48 |
the following assembly code will load the value at the addressed stored in t1
and store that value into t2
mov -40(%rbp), %rax # value of the deref'ed variable (t1) mov (%rax), %rbx # the value in memory at address %rax (*t1) mov %rbx, -48(%rbp) # store the result in the assigned var (t2)
12.2. enterReference
Given a reference statement, e.g,.
t1 := &x
where the variables have the following offsets from the base pointer
Variable | Offset |
---|---|
x | -24 |
t1 | -40 |
the following assembly code will store the address of x
into t1
mov %rbp, %rax # start with stack frame address add $-24, %rax # add the offset of the referenced var (x) mov %rax, -40(%rbp) # store the result in the assigned var (t1)
12.3. enterAssignDereference
Given a dereference assignment statement
*t1 := 11
where t1
has offset -40
, the following assembly will store the value 11
into the address given by t1
mov -40(%rbp), %rax # get the value of ~t1~ mov $11, %rbx # get the right-hand-side value mov %rbx, (%rax) # store %rbx into the address in memory given by %rax
(%rax)
means store not to %rax
itself, but to the memory location given by the address in %rax
.