UP | HOME

Code Generation Project Handbook
COP-3402

Table of Contents

1. Code generation projects

2. C++ and ANTLR usage

CodeGen.cpp contains an ANTLR listener that will walk over the parse tree of SimpleIR input files and trigger enter listener functions for each construct, such as enterFunction and enterIfGoto.

2.1. Writing to standard out

To write to the output using C++ streams, use cout for standard out, e.g.,

cout << "hello, world!" << endl;

For instance, the following outputs an assembly comment:

cout << "# this is an assembly comment" << endl;

Chain together multiple values to emit. Use endl to emit a line-ending, i.e., newline on unix-based OSes.

2.2. Retrieving ANTLR parse tree contents

A ctx parameter is passed to each listener, which contains the SimpleIR construct subtree. The contents of the tree can be accessed using the naming given in the SimpleIR.g4 grammar file.

Accessing single terminal tokens

For instance, the grammar for a function construct is as follows:

function: 'function' functionName=NAME localVariables? parameters? statement* returnStatement end;

To access the name terminal, given by functionName, use ctx->functionName->getText() to get the tree node, then the string text of the tree node:

string name = ctx->functionName->getText();

-> accesses the field of data object via a pointer. This is just like C, where -> accesses a struct field through a pointer to a struct. getText() is an ANTLR function that gets the string value of a terminal token.

Consult the SimpleIR.g4 to find the names of other terminals in other constructs.

Accessing operands

Many operands in the SimpleIR grammar take either a number or a name, e.g., in the return statement's grammar:

returnStatement: 'return' operand=(NAME | NUM);

A helper function is provided in the CodeGen.cpp template to convert this operand into the correct assembly operand, depending on whether it is a number or a name. The helper function is shown below as well:

static string operand_to_string(antlr4::Token *token) {
  stringstream operand;
  if (SimpleIRParser::NAME == token->getType()) {
    operand << symtab[token->getText()] << "(%rbp)";
  } else if (SimpleIRParser::NUM == token->getType()) {
    operand << "$" << token->getText();
  } else {
    assert(false);
  }
  return operand.str();
}

This function uses the ANTLR-generated SimpleIR token types to distinguish between name and number tokens, e.g., SimpleIRParser::NAME == token->getType(). Name tokens represent variable names. As discussed in class, variables are allocated on the stack.

Accessing lists of tokens

ANTLR permits specifying a list of terminals or non-terminals, e.g., with NAME+ meaning, one or more NAME terminals or NAME*, meaning zero or more. For instance, take the grammar for parameters:

parameters: 'parameters' formals+=NAME+;

This will create a C++ vector data structure for formals. Like arrays and linked lists, vectors store a sequence of data of a given type. We can iterate over vectors and access elements similarly to an array, by initializing a loop variable, e.g., formal_i, and iterating from 0 to the length of the vector, given by formals.size().

auto formals = ctx->formals;
for (int formal_i = 0; formal_i < formals.size(); formal_i++) {
  cout << "# " << formals[formal_i]->getText() << endl;
}

Vector elements can be access like arrays, with formals[formal_i]. In this project, all vectors will be of tokens, so you can get the string text of one token in the vector formals[formal_i]->getText().

3. System V x86 64 ABI

  • Parameter passing
  • Return values
  • Stack alignment

4. Debugging with GDB

One way to help trace the function call is to use gdb. The following will rebuild the main program with debugging symbols ;on, run gdb, then step through each assembly instructions.

gcc -g -o main main.s func.s  # compile with debugging symbols (-g)
gdb main 
b main # setup breakpoint at main
r # start running, breaks at main
si # step instruction to see next instruction
# hitting enter will repeat last command, e.g., si
c # once done use c to continue running without stopping

4.1. Debugging tutorials

See these resources for more information on gdb.

5. Laying out the assembly file

5.1. enterUnit

virtual void enterUnit(SimpleIRParser::UnitContext * ctx) override {
  // filename
  cout << "\t.file \"" << filename << "\"" << endl;
  cout << "\t.section .note.GNU-stack,\"\",@progbits" << endl;
  cout << "\t.text" << endl;
}

This function emits some assembly pseudo-ops expected by the toolchain and starts the text section where the program code will go.

.file "main.ir"
.section .note.GNU-stack,"",@progbits
.text

main.ir should be the name of the input file or stdin when read from standard input instead.

6. Variables

6.1. enterLocalVariables

This function generates assembly instructions that allocate space on the stack for all local variables. In SimpleIR, all variables are 64-bit integers.

virtual void enterLocalVariables(SimpleIRParser::LocalVariablesContext * ctx) override {
  auto variables = ctx->variables;
  // start from of an offset of 2x bytewidth from rbp, 1 to start after rbp
  // and another 1 to skip rbx which will have already been pushed
  int starting_offset = 2;
  for (int variable_i = 0; variable_i < variables.size(); variable_i++) {
    string variable = variables[variable_i]->getText();
    int offset = (starting_offset + variable_i) * bytewidth;
    symtab[variable] = -1 * offset;
  }
  // compute size of the stack space needed
  int stackspace = variables.size() * bytewidth;
  // ceiling to 8 bytes
  stackoffset = std::ceil(stackspace / 8) * 8;
  // align to 16 bytes, accounting for rbx being pushed by adding 8
  stackoffset += (stackoffset + 8) % 16;
  // emit the subtraction to allocate stack space
  cout << "\t# allocate stack space for locals" << endl;
  cout << "\tsub\t$" << stackoffset << ", %rsp" << endl;
}

The CodeGen template has a symbol table to store the offsets of each declared variable:

map<string, int> symtab;

This function works by first collecting the set of variables to allocate space for from the syntax tree of the input program:

auto variables = ctx->variables;

Then it creates space on the stack for each local variable. The memory location of each local variable can be determined by an offset from the base pointer %rbp per the System V AMD64 ABI. The compiler records each variables offset in a symbol table (symtab) that maps the name of the variable to its offset from %rbp, which the compiler will use to translate variable accesses and assignments whenever the variable is used in SimpleIR instructions. enterLocalVariables function creates the symbol table by iterating over variables and allocating the offsets:

int starting_offset = 2;
for (int variable_i = 0; variable_i < variables.size(); variable_i++) {
  string variable = variables[variable_i]->getText();
  int offset = (starting_offset + variable_i) * bytewidth;
  symtab[variable] = -1 * offset;
}

We start offsetting 2 stack slots away from %rbp, because offset 0 stores the calling function's base pointer and offset 1 holds %rbx which we pushed since it is a callee-saved registered according to the ABI. We start with the first local variable at offset 2 and store the subsequent locals at increasing offsets (starting_offset + variable_i).

symtab will then contain a mapping from each local variable to its own distinct offset, which are multiples of the 8-byte width, e.g., -16, -24, -32, etc. (Do you remember why the offsets are negative in this ABI?). symtab will tell you how to translate variable uses into offsets from the base pointer %rbp for generating equivalent assembly instructions.

Finally, the functions computes the total amount of space required for %rbx and the local variables (number of variables times 8 bytes each), aligns the stack to 16 bytes (required by the ABI), and emits an assembly instruction that updates the stack pointer to allocate the space, i.e.,

// compute size of the stack space needed
int stackspace = variables.size() * bytewidth;
// ceiling to 8 bytes
stackoffset = std::ceil(stackspace / 8) * 8;
// align to 16 bytes, accounting for rbx being pushed by adding 8
stackoffset += (stackoffset + 8) % 16;
// emit the subtraction to allocate stack space
cout << "\t# allocate stack space for locals" << endl;
cout << "\tsub\t$" << stackoffset << ", %rsp" << endl;

stackoffset is a global variable that will hold the amount of stack space allocated. enterEnd will use this in the epilogue.

6.2. enterAssign

This function generates assembly code corresponding to a variable assignment.

virtual void enterAssign(SimpleIRParser::AssignContext * ctx) override {
  string operand = operand_to_string(ctx->operand);
  cout << "\t# assign " << ctx->operand->getText() << " to " << ctx->variable->getText() << endl;
  cout << "\tmov\t" << operand << ", %rax" << endl;
  cout << "\tmov\t%rax, " << symtab[ctx->variable->getText()] << "(%rbp)" << endl;
}

The syntax of the assignment statement is

assign: NAME ':=' operand=(NAME | NUM);

Notice that operand may either be a NAME (which is the syntax of an identifier) or a NUM (which is the syntax for an integer literal). An assignment takes either a constant integer literal or a variable identifier and copies that data to the left-hand side variable. This translates to either a mov immediate to a register (integer literal) or a load from memory to a register (variable name on right-hand side), followed by a store of that value to the memory location associated with the variable name.

If the right-hand side is a variable name, the compiler will generate a load from the memory location of the variable to a temporary register (we'll use %rax for the temporary register). To determine the memory location, recall that our compiler assigns each local variable an offset from %rbp that we stored in a offset table created by enterLocalVariables. For example, if variable b has offset -24 from %rbp, to load that memory location's value into %rax, we use the following instruction:

mov -24(%rbp), %rax

-24(%rbp) means, take the value of %rbp (which is a pointer to the current stack) frame, add -24 to it, then dereference the address and load that value into %rax.

If the operand is a NUM such as 5, then the immediate move will look like this, recalling that the $ prefix indicates an immediate value:

mov $5, %rax

The logic of creating this assembly operand is included in the operand_to_string function, provided to you:

string operand = operand_to_string(ctx->operand);

When the operand is a NAME, the compiler creates the %rbp-indirect offset argument, e.g., -24(%rbp), i.e.,

// in operand_to_string
operand << symtab[token->getText()] << "(%rbp)";

When the operand is a NUM, we only need to use an immediate value argument, i.e.,

// in operand_to_string
operand << "$" << token->getText();

Either way, the operand variable now contains the corresponding assembly instruction argument for the right-hand side of the SimpleIR statement.

Finally, enterAssign generates the copy of the right-hand side to the temporary register %rax, then stores the value into the memory location of the left-hand side variable:

cout << "\tmov\t" << operand << ", %rax" << endl;
cout << "\tmov\t%rax, " << symtab[ctx->variable->getText()] << "(%rbp)" << endl;

The left-hand size variable's memory location is found using the offset table (symtab). The name of the variable is retrieved from syntax tree with ctx->variable->getText().

6.3. A simple example

function example_variables
localVariables x y
x := 1
y := x
return 0
end function
  .file "stdin"
  .section .note.GNU-stack,"",@progbits
  .text
  .globl example_variables
  .type example_variables, @function
example_variables:
  # prologue, update stack pointer
  pushq	%rbp # save old base ponter
  movq	%rsp, %rbp # set new base pointer
  push	%rbx # %rbx is callee-saved
  # allocate stack space for locals
  sub	$24, %rsp
  # assign 1 to x
  mov	$1, %rax
  mov	%rax, -16(%rbp)
  # assign x to y
  mov	-16(%rbp), %rax
  mov	%rax, -24(%rbp)
  # set return value
  mov	$0, %rax
  # epilogue
  add	$24, %rsp
  pop	%rbx # restore old base pointer
  pop	%rbp # restore old base pointer
  ret

7. Defining functions

7.1. enterFunction

To create the function, emit the assembly pseudo-ops .globl and .type with the name of the function (ctx->functionName->getText()), as well as a label for the function. Then emit the prologue. Use whatever the name of the function is for all three, e.g., for the function factorial the function creation and prologue would look like this:

  .globl factorial
  .type factorial, @function
factorial:
  # prologue
  pushq	%rbp # save old base ponter
  movq	%rsp, %rbp # set new base pointer
  push	%rbx # %rbx is callee-saved

factorial should be replaced by whatever the name of the function is.

7.2. enterEnd

Emit the assembly function epilogue to restore the stack and base pointer registers, then issue the return instruction

# epilogue
add	$0, %rsp # restore old stack pointer
pop	%rbx # restore rbx
pop	%rbp # restore old base pointer
ret

0 should be replaced by whatever was subtracted from the stack to allocate space for local variables. The stackoffset variable holds this value. Feel free to omit the instructions when stackoffset is 0, since this has no affect on the stack pointer.

7.3. enterReturnStatement

The return IR instruction does not issues the assembly ret return instruction, which is instead done after the epilogue. Instead, the return IR instruction sets the %rax register to the return value, following to the System V x86 64 ABI.

# set return value
mov	$10, %rax

To get the text of the operand, use the operand_to_string function. Then emit a mov instruction that copies that operand's value into %rax.

7.4. A simple function example

The following SimpleIR program defines a main function that returns 0.

function main
return 0
end function

This is a compiled version of SimpleIR in x86 64-bit assembly that follows the System V X86 64 ABI.

  .file "stdin"
  .section .note.GNU-stack,"",@progbits
  .text
  .globl main
  .type main, @function
main:
  # prologue, update stack pointer
  pushq	%rbp # save old base ponter
  movq	%rsp, %rbp # set new base pointer
  push	%rbx # %rbx is callee-saved
  # set return value
  mov	$0, %rax
  # epilogue
  add	$0, %rsp
  pop	%rbx # restore old base pointer
  pop	%rbp # restore old base pointer
  ret

8. Function parameters and calls

8.1. enterParameters

In SimpleIR, the input program first declares all local variables, e.g.,

localvars x a b c d e f g h

then it declares which of those local variables is a parameter, e.g.,

params a b c d e f g h

In the above example, all local variables, except for x are also parameters. The set of parameters is a subset of the local variables (or the same set).

The enterLocalVariables function emits code that creates space for all local variables, including parameters. enterParams then copies the parameter values set by the caller to the corresponding parameter local variables. For instance, for the above set of params, the following assembly code copies the caller's values to the callee's local variables:

# move register parameter a to local variable
mov	%rdi, -24(%rbp)
# move register parameter b to local variable
mov	%rsi, -32(%rbp)
# move register parameter c to local variable
mov	%rdx, -40(%rbp)
# move register parameter d to local variable
mov	%rcx, -48(%rbp)
# move register parameter e to local variable
mov	%r8, -56(%rbp)
# move register parameter f to local variable
mov	%r9, -64(%rbp)
# move stack parameter g to local variable
mov	16(%rbp), %rax
mov	%rax, -72(%rbp)
# move stack parameter h to local variable
mov	24(%rbp), %rax
mov	%rax, -80(%rbp)

The caller passes parameters both via registers and the stack, the first six via registers and the rest via the stack. To copy these parameters to the memory locations for the callee's local variable, enterParams generates mov instructions for the first size parameters from the ABI-defined set of parameter registers, which are given in the code template:

static const vector<string> registers = { "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" };

enterParams can retrieve the list of parameters names for the input program from the ANTLR syntax tree, i.e.,

auto formals = ctx->formals;
for (int formal_i = 0; formal_i < formals.size(); formal_i++) {
  cout << "# " << formals[formal_i]->getText() << endl;
}

For each of the register parameters, generate a mov from the corresponding register to the local variable's memory location. The local variables' memory locations are offsets from the base pointer, as described in enterLocalVariables, i.e., by passing the name of the variable to symtab to get the offset. Then generate the mov instruction from the register, e.g., from the example above

# move register parameter c to local variable
mov	%rdx, -32(%rbp)

This instruction stores the value in %rdx, i.e., the third parameter, into -32(%rbp), which is the memory location of the local variable c for the above program. See enterAssign for more details on getting the memory location of local variables.

For the stack allocated parameters, mov them from the stack to the local variable. Since this is a move between two memory locations, first generate code to copy the value to a temporary register (we'll use %rax), resulting in two move instructions. The first instruction copies from the parameter's memory location on the stack to %rax, e.g.,

mov	16(%rbp), %rax

Then the next instruction copies from %rax to the location variable's memory location, given by symtab, e.g.,

mov	%rax, -72(%rbp)

The offsets for the parameters are positive from the %rbp (why is this the case?). They start from 16, because %rbp plus 0 is the old base pointer and %rbp plus 8 is the return address, so %rbp plus 16 is the 7th parameter (the first after the register parameters), %rbp plus 24 is the 8th, etc.

8.2. enterCall

There are several steps to calling a function in assembly:

  • Passing parameters
  • Making the call
  • Restoring the stack pointer
  • Saving the return value

The following is an example of a SimpleIR function call. Note that this assumes that the variable names have already been declared (enterLocalVariables) and defined (enterAssign), both of which are already given to you.

x := call paramtest a b c d e f g h

This call can be implemented with the following assembly code:

# pass stack-allocated arguments in reverse order
push	-96(%rbp)
push	-88(%rbp)
# pass the first six parameters via the pre-defined set of registers
mov	-80(%rbp), %r9
mov	-72(%rbp), %r8
mov	-64(%rbp), %rcx
mov	-56(%rbp), %rdx
mov	-48(%rbp), %rsi
mov	-40(%rbp), %rdi
call	paramtest
add	$16, %rsp
mov	%rax, -32(%rbp)
# pass the first size parameters via the pre-defined set of registers
mov	-40(%rbp), %rdi
mov	-48(%rbp), %rsi
mov	-56(%rbp), %rdx
mov	-64(%rbp), %rcx
mov	-72(%rbp), %r8
mov	-80(%rbp), %r9
# pass the remaining arguments via the stack
push	-88(%rbp)
push	-96(%rbp)
# make the call
call	paramtest
# restore the stack pointer
add	$16, %rsp
# save the return value
mov	%rax, -32(%rbp)

Since there are either parameters, we will need to use both registers (for the first six) and the stack to pass parameters, per the System V AMD64 ABI. There is one mov instruction for each of the first six parameters. These instruction copy each of the local variable's value to one of the predefined registers. The local variables' memory locations are offsets from the base pointer, as described in enterLocalVariables. The set of registers for the first size parameters is defined by the ABI. An array holding these register names is provided:

static const vector<string> registers = { "%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9" };

After the parameters have been passed, the paramtest function is called with the call assembly instruction. Once the function returns, the stack pointer is restored to its state before pushing the two parameters onto the stack (two 8-byte values, so 16 bytes are added). Finally, the return value (found in %rax due to the ABI) is saved in x, the left-hand size of the SimpleIR call statement.

The name of the variable to assign to the return value can be retrieved with

ctx->variable->getText()

The function name can be retrieved with

ctx->functionName->getText()

You can retrieve the left-hand side variable (to assign the return value to), the function name to call, and the parameters with the following python code:

The parameters can be accessed by iterating over the vector of name tokens:

auto actuals = ctx->actuals;
for (int actual_i = actuals.size() - 1; actual_i >= 0; actual_i--) {
  // an illustration of retrieving each actual parameter string name
  cout << "# " << actuals[actual_i]->getText() << endl;
}

For the register parameters, generate a mov instruction that copies the local variable's value (from the stack frame, e.g., -32(%rbp)) to the corresponding parameter register (listed in registers above). Then for each stack parameters (if there are any), generate a push instruction that pushes the local variables value.

After processing the parameters, generate the call to the function name, call paramtest above. After the call, generate an instruction to remove any parameters from the stack (How do we know how much space was used on the stack?) This is done in the example above with the following:

add $16, %rsp

Finally, the return value is stored in %rax according to the System V AMD64 ABI convention (which your functions will also follow). Recall how to access the memory location of a variable. Use a mov to store the return value (%rax) into the memory location of the left-hand side variable, e.g.,

mov	%rax, -24(%rbp)

8.3. Function call examples

A single parameter

  • SimpleIR

    main.ir

    cat << EOT > main.ir
    function main
    localVariables retval x
    x := 3
    retval := call func x
    return retval
    end function
    EOT
    

    func.ir

    cat << EOT > func.ir
    function func
    localVariables a
    parameters a
    return a
    end function
    EOT
    
  • Compile the IR to assembly
    ./CodeGen main.ir > main.s
    

    Possible output:

      .file "main.ir"
      .section .note.GNU-stack,"",@progbits
      .text
      .globl main
      .type main, @function
    main:
      # prologue, update stack pointer
      pushq	%rbp # save old base ponter
      movq	%rsp, %rbp # set new base pointer
      push	%rbx # %rbx is callee-saved
      # allocate stack space for locals
      sub	$24, %rsp
      # assign 3 to x
      mov	$3, %rax
      mov	%rax, -24(%rbp)
      mov	-24(%rbp), %rdi
      call	func
      add	$0, %rsp
      mov	%rax, -16(%rbp)
      # set return value
      mov	-16(%rbp), %rax
      # epilogue
      add	$24, %rsp
      pop	%rbx # restore old base pointer
      pop	%rbp # restore old base pointer
      ret
    
    ./CodeGen func.ir > func.s
    

    Possible output:

      .file "func.ir"
      .section .note.GNU-stack,"",@progbits
      .text
      .globl func
      .type func, @function
    func:
      # prologue, update stack pointer
      pushq	%rbp # save old base ponter
      movq	%rsp, %rbp # set new base pointer
      push	%rbx # %rbx is callee-saved
      # allocate stack space for locals
      sub	$8, %rsp
      # move register parameter a to local variable
      mov	%rdi, -16(%rbp)
      # set return value
      mov	-16(%rbp), %rax
      # epilogue
      add	$8, %rsp
      pop	%rbx # restore old base pointer
      pop	%rbp # restore old base pointer
      ret
    
  • Running

    Assembly and link the generated assembly:

    gcc -o main main.s func.s
    

    Run main and inspect its exit code (which is set by the return from main):

    ./main
    echo $?
    

    You should see 3, which is what was passed to func, returned to main, then ultimately returned from main.

Multiple parameters

  • SimpleIR
    cat << EOT > main.ir
    function main
    localVariables argc argv x a b c d e f g h
    parameters argc argv
    a := 1
    b := 2
    c := 3
    d := 4
    e := 5
    f := 6
    g := 7
    h := 8
    x := call paramtest a b c d e f g h
    return x
    end function
    EOT
    
    cat << EOT > paramtest.ir
    function paramtest
    localVariables x a b c d e f g h
    parameters a b c d e f g h
    x := c
    return x
    end function
    EOT
    
  • Compile the IR to assembly
    ./CodeGen main.ir > main.s
    

    Possible output:

      .file "stdin"
      .section .note.GNU-stack,"",@progbits
      .text
      .globl main
      .type main, @function
    main:
      # prologue, update stack pointer
      pushq	%rbp # save old base ponter
      movq	%rsp, %rbp # set new base pointer
      push	%rbx # %rbx is callee-saved
      # allocate stack space for locals
      sub	$88, %rsp
      # move register parameter argc to local variable
      mov	%rdi, -16(%rbp)
      # move register parameter argv to local variable
      mov	%rsi, -24(%rbp)
      # assign 1 to a
      mov	$1, %rax
      mov	%rax, -40(%rbp)
      # assign 2 to b
      mov	$2, %rax
      mov	%rax, -48(%rbp)
      # assign 3 to c
      mov	$3, %rax
      mov	%rax, -56(%rbp)
      # assign 4 to d
      mov	$4, %rax
      mov	%rax, -64(%rbp)
      # assign 5 to e
      mov	$5, %rax
      mov	%rax, -72(%rbp)
      # assign 6 to f
      mov	$6, %rax
      mov	%rax, -80(%rbp)
      # assign 7 to g
      mov	$7, %rax
      mov	%rax, -88(%rbp)
      # assign 8 to h
      mov	$8, %rax
      mov	%rax, -96(%rbp)
      push	-96(%rbp)
      push	-88(%rbp)
      mov	-80(%rbp), %r9
      mov	-72(%rbp), %r8
      mov	-64(%rbp), %rcx
      mov	-56(%rbp), %rdx
      mov	-48(%rbp), %rsi
      mov	-40(%rbp), %rdi
      call	paramtest
      add	$16, %rsp
      mov	%rax, -32(%rbp)
      # set return value
      mov	-32(%rbp), %rax
      # epilogue
      add	$88, %rsp
      pop	%rbx # restore old base pointer
      pop	%rbp # restore old base pointer
      ret
    
    ./CodeGen paramtest.ir > paramtest.s
    

    Possible output:

      .file "paramtest.ir"
      .section .note.GNU-stack,"",@progbits
      .text
      .globl paramtest
      .type paramtest, @function
    paramtest:
      # prologue, update stack pointer
      pushq	%rbp # save old base ponter
      movq	%rsp, %rbp # set new base pointer
      push	%rbx # %rbx is callee-saved
      # allocate stack space for locals
      sub	$72, %rsp
      # move register parameter a to local variable
      mov	%rdi, -24(%rbp)
      # move register parameter b to local variable
      mov	%rsi, -32(%rbp)
      # move register parameter c to local variable
      mov	%rdx, -40(%rbp)
      # move register parameter d to local variable
      mov	%rcx, -48(%rbp)
      # move register parameter e to local variable
      mov	%r8, -56(%rbp)
      # move register parameter f to local variable
      mov	%r9, -64(%rbp)
      # move stack parameter g to local variable
      mov	16(%rbp), %rax
      mov	%rax, -72(%rbp)
      # move stack parameter h to local variable
      mov	24(%rbp), %rax
      mov	%rax, -80(%rbp)
      # assign c to x
      mov	-40(%rbp), %rax
      mov	%rax, -16(%rbp)
      # set return value
      mov	-16(%rbp), %rax
      # epilogue
      add	$72, %rsp
      pop	%rbx # restore old base pointer
      pop	%rbp # restore old base pointer
      ret
    
  • Running

    Assembly and link the generated assembly:

    gcc -o main main.s paramtest.s
    

    Run main and inspect its exit code (which is set by the return from main):

    ./main
    echo $?
    

    You should see 3, which is what was passed to func, returned to main, then ultimately returned from main.

9. Hard-coded Listeners for Functions

In order to compile executable programs that use arithmetic, branching, and pointers, your compiler needs to emit complete assembly programs. This depends on getting codegen1's listeners implemented correctly. To avoid penalizing you in codegen2 and codegen3 for mistakes in codegen1, you may instead hard code the assembly output for codegen1's functions using the implementations given in this section. These will hard-code the assembly output of your compile for the following SimpleIR template program, template.ir:

function main # DO NOT CHANGE
localVariables a b c d e f g result retval # DO NOT CHANGE
# add your code to try out here
retval := call print_int result # DO NOT CHANGE
return 0 # DO NOT CHANGE
end function # DO NOT CHANGE

You may use this SimpleIR program by adding your SimpleIR statements here and compiling. Note the lines that says "DO NOT CHANGE". These have been precompiled for you to assembly and provided in the hard-coded listeners below. As long as you do not change these and use the hard-coded listeners below, your implementations of codegen2 and codegen3's listeners will be gradeable.

print_int is defined in iolib.c, which is below:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int print_int(int64_t x) {
  printf("%" PRId64 "\n", x);
  return 0;
}

int64_t read_int() {
  int64_t x;
  scanf("%" PRId64 "d", &x);
  return x;
}

You can simply link iolib.c with the output of your compiler and your template assembly program will call print_int, i.e.,

./CodeGen template.ir > template.s
gcc -o template template.s iolib.c
./template

9.1. Hard-coded enterFunction

virtual void enterFunction(SimpleIRParser::FunctionContext * ctx) override {
  cout << "\t.globl main" << endl;
  cout << "\t.type main" << ", @function" << endl;
  cout << "main:" << endl;
  cout << "\t# prologue, update stack pointer" << endl;
  cout << "\tpushq\t%rbp # save old base ponter" << endl;
  cout << "\tmovq\t%rsp, %rbp # set new base pointer" << endl;
  cout << "\tpush\t%rbx # %rbx is callee-saved" << endl;
}

9.2. Hard-coded enterEnd

virtual void enterEnd(SimpleIRParser::EndContext * ctx) override {
  cout << "\t# epilogue" << endl;
  cout << "\tadd\t$" << stackoffset << ", %rsp" << endl;
  cout << "\tpop\t%rbx # restore %rbx" << endl;
  cout << "\tpop\t%rbp # restore old base pointer" << endl;
  cout << "\tret" << endl;
}

9.3. Hard-coded enterParameters

virtual void enterParameters(SimpleIRParser::ParametersContext * ctx) override {
  // empty for template.ir
}

9.4. Hard-coded enterReturnStatement

virtual void enterReturnStatement(SimpleIRParser::ReturnStatementContext * ctx) override {
  cout << "\t# set return value" << endl;
  cout << "\tmov\t$0, %rax" << endl;
}

9.5. Hard-coded enterCall

virtual void enterCall(SimpleIRParser::CallContext * ctx) override {
  cout << "\tmov\t-72(%rbp), %rdi" << endl;
  cout << "\tcall\tprint_int" << endl;
  cout << "\tadd\t$0, %rsp" << endl;
  cout << "\tmov\t%rax, -80(%rbp)" << endl;
}

10. Arithmetic Operations

10.1. enterOperation

Arithmetic operations have the following syntax:

operation: variable=NAME ':=' operand1=(NAME | NUM) operatorKind=('+' | '-' | '*' | '/' | '%') operand2=(NAME | NUM);

There are four parts to arithmetic operations that are relevant to the compiler:

  • The destination variable ctx->variable->getText()
  • The left operand ctx.operand1, may be an integer constant NUM or variable name NAME.
  • The operator ctx.operatorKind->getType(), which may be one of five operator symbols.
  • The right operand ctx.operand2, may be an integer constant NUM or variable name NAME.

The goal of the compiler is to generate assembly code that implements the arithmetic operation.

The compiler can generate code that

  1. Loads the operands into temporary registers
  2. Performs the arithmetic on the temporary registers
  3. Stores the result into the destination variable

For example, in the following SimpleIR,

function main
localVariables x y
x := y * 6
return 0
end function

the operation x := y * 6 compiles to

mov	-24(%rbp), %rax
mov	$6, %rbx
imul	%rbx, %rax
mov	%rax, -16(%rbp)

Operands

The * operator has two operands, y and 6. The y operand translates to a load from the stack-allocated memory location for the y, i.e., -24(%rbp). The 6 operand translates to an immediate value $6. You can use the helper function, operand_to_string to handle both; enterAssign illustrates using this function. This code was translated by generating a mov instruction for each operand to temporary registers, %rax and %rbx. This isn't the only way to translate the operation, but it makes for a simple compiler implementation.

Operator

Once we have both operands in registers, we can generate the assembly arithmetic operation corresponding to the symbol. In this case, we have *, denoting multiplication, which can be implemented with imul, i.e., integer multiplication. We multiply the values in the registers with imul %rbx, %rax; remember that in the AT&T assembly syntax the destination is the second operand of the assembly instruction, which is also the left operand of the arithmetic operation. Multiplication and addition are commutative, but subtraction, division, and remainder are not, so this order matters. To check for the operator token type, compare the operator's type to token names provided in ANTLR-generated parser:

  • SimpleIRParser::PLUS
  • SimpleIRParser::MINUS
  • SimpleIRParser::STAR
  • SimpleIRParser::SLASH
  • SimpleIRParser::PERCENT
  • Addition
    add %rbx, %rax
    

    is equivalent to

    rax = rax + rbx
    

    The left operand and the destination operand are the same register and are the second argument to the assembly instruction.

    Remember, the destination is always the second argument to the AT&T assembly instruction.

  • Subtraction
    sub %rbx, %rax
    

    This has the same argument ordering as the addition operation.

  • Integer Multiplication
    imul %rbx, %rax
    

    This has the same argument ordering as the addition operation.

  • Division and remainder
    mov $9, %rax
    mov $5, %rbx
    # %rax holds the numerator
    cdq  # now %rdx:%rax holds the sign-extended numerator
    idiv %rbx # divide %rdx:%rax by %rbx
    # quotient (result) now in %rax
    # remainder now in %rdx
    

    The same operation, idiv is used for both division and remainder (both are computed together). The operation is fixed to using the %rax register as the numerator. Strictly speaking %rdx and %rax form the numerator, but for 64-bit number, we can use cdq to sign extend %rax into %rdx. Provide the denominator to idiv. After the operation, %rax holds the quotient (result of the division) and %rdx holds the remainder.

Destination variable

The destination is a variable, so the compiler will need to generate an assembly instruction that stores the result of the operation into the memory location associated with this variable. The argument to the store operation will also be an offset from the base pointer, e.g., -16(%rbp) for the x variable.

Generate a mov instruction that takes whatever the result of the arithmetic operation is (what register will it be in?) and stores it into this offset. For instance, if your compiler generates code that puts the result in %rax and needs to store it into a variable that has offset -16, it should generate the following:

mov %rax -16(%rbp)

11. Branching

There are two kinds of branching in SimpleIR, unconditional (gotoStatement) and conditional (ifGoto). Both use a label as a target of the branch.

11.1. enterLabel

The label syntax is as follow.

label: labelName=NAME ':';

Its syntax is very similar to assembly labels, so the compiler only needs to emit the label name ctx->labelName->getText() followed by a colon.

11.2. enterGotoStatement

The uncondition branch has simple syntax:

gotoStatement: 'goto' labelName=NAME;

The compiler only needs to generate a jmp instruction a provide the label name ctx->labelName->getText().

For instance, this SimpleIR code

loop:
goto loop

generates the following assembly

loop:
  jmp loop

and causes an infinite loop.

11.3. enterIfGoto

Conditional branches have more complicated syntax:

ifGoto: 'if' operand1=(NAME | NUM) operatorKind=('=' | '!=' | '<' | '<=' | '>' | '>=') operand2=(NAME | NUM) 'goto' labelName=NAME;

Conditional branches compare two operands with one of several relational operators and branch to the label only if the comparison is true. They consistent of several parts

  • A left operand
  • A relational operator
  • A right operand
  • The label to branch to

Operands

The operands can be handled exactly as they are in arithmetic operations.

Operator

The operator tokens are handled similarly to those for arithmetic, but have different token types:

  • SimpleIRParser::EQ
  • SimpleIRParser::NEQ
  • SimpleIRParser::LT
  • SimpleIRParser::LTE
  • SimpleIRParser::GT
  • SimpleIRParser::GTE

The SimpleIR ifGoto can be implemented by generating a comparison operation followed by a conditional jump. The comparison sets the internal EFLAGS register which the conditional jump will inspect to determine whether to make the jump.

The comparison operation is

cmp %rbx, %rax

where %rax is the left side of the comparison and the %rbx is the right (recall that AT&T syntax has operands reversed). While there is no destination register, you can think of cmp as a subtraction that doesn't store the result; it only sets the EFLAGS to record whether the result was zero (operands are equal) or negative (less than). Can the machine determine the other comparison operations with only these two flags?

Immediately after cmp, generate the jump that corresponds to the relational operator, according to the following table:

ASM Op SimpleIR Op Jump if…
je = Equal
jne != Not equal
jl < Less than
jle <= Less than or equal
jg > Greater than
jge >= Greater than or equal

Example

function main
localVariables x
if x <= 10 goto falselabel
x := 11
falselabel:
x := x + 1
return 0
end function

The if x <= 10 goto falselabel statement generates the following assembly:

mov	-16(%rbp), %rax
mov	$10, %rbx
cmp	%rbx, %rax
jle	falselabel

12. Pointer operations

12.1. enterDereference

Given a derefence statement, e.g.,

t2 := *t1

where the variables have the following offsets

Variable Offset
t1 -40
t2 -48

the following assembly code will load the value at the addressed stored in t1 and store that value into t2

mov -40(%rbp), %rax  # value of the deref'ed variable (t1)
mov (%rax), %rbx  # the value in memory at address %rax (*t1)
mov %rbx, -48(%rbp)  # store the result in the assigned var (t2)

12.2. enterReference

Given a reference statement, e.g,.

t1 := &x

where the variables have the following offsets from the base pointer

Variable Offset
x -24
t1 -40

the following assembly code will store the address of x into t1

mov %rbp, %rax  # start with stack frame address
add $-24, %rax  # add the offset of the referenced var (x)
mov %rax, -40(%rbp)  # store the result in the assigned var (t1)

12.3. enterAssignDereference

Given a dereference assignment statement

*t1 := 11

where t1 has offset -40, the following assembly will store the value 11 into the address given by t1

mov	-40(%rbp), %rax # get the value of ~t1~
mov	$11, %rbx # get the right-hand-side value
mov	%rbx, (%rax) # store %rbx into the address in memory given by %rax

(%rax) means store not to %rax itself, but to the memory location given by the address in %rax.

Author: Paul Gazzillo

Created: 2025-04-02 Wed 14:43

Validate