Code Generation 3
COP-3402
Table of Contents
Overview
In this project you will compile arithmetic and branching operations to x86 assembly. As with all programming projects, it will be submitted via git.
Setup the repo
ssh into eustis, replacing
NID
with your UCF NID.ssh NID@eustis.eecs.ucf.edu
Clone the compiler template to
codegen3
git clone https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/repos/compiler-template.git/ codegen3
Enter the repo
cd codegen3/
If this doesn't, double-check step (2) and make sure you put
codegen3
as the second argument toclone
.Add the URL of your personal remote repository, replacing
NID
with your UCF NID.git remote add submission gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen3
Synchronize your local repo with the remote eustis3 repo.
git push --set-upstream submission master
You only need to do this once. Use
git commit
andgit push
regularly to keep the remote repo up to date.
Setting up your development environment
Be sure you are in the repo directory.
cd ~/codegen3
If you receive a warning about a managed python installation, then double-check that you are on eustis.
Then create the development environment. This creates an "editable" installation of your project, so that you can modify its source and rerun without having to reinstall the project.
pipenv install -e ./
If you can't run pipenv but you've already installed it, trying logging out and back in again to eustis.
If you haven't installed pipenv yet, please review the
calc
project.Enter your pipenv development environment. Do this everytime you log in to eustis to work on your project.
pipenv shell
Double-check that you are in the environment. Your prompt should look something like this:
(compiler) NID@net1547:~/codegen3$
You can later exit the dev environment with
exit
. You do not need to enter the dev environment again if you have already entered it.
Get ANTLR and build the parser.
make -C grammar/
Compiler project structure
File | Description |
---|---|
Pipfile | pipenv settings |
compiler/CodeGen.py | The code generator that you will write. Not provided by the template repo. |
compiler/Interpreter.py | A SimpleIR interpreter for comparing output |
grammar/Makefile | A build file for the grammar |
grammar/SimpleIR.g4 | The SimpleIR grammar |
pyproject.toml | python project settings |
.gitignore
files are for git
and __init__.py
are for python modules.
Implementation
Skeleton code
Here is the start to compiler/CodeGen.py
import os import sys import math from textwrap import indent, dedent from antlr4 import * from grammar.SimpleIRLexer import SimpleIRLexer from grammar.SimpleIRParser import SimpleIRParser from grammar.SimpleIRListener import SimpleIRListener import logging logging.basicConfig(level=logging.DEBUG) # This class defines a complete listener for a parse tree produced by SimpleIRParser. class CodeGen(SimpleIRListener): def __init__(self, filename, outfile): self.filename = filename self.outfile = outfile self.symtab = {} self.bitwidth = 8 def enterLocalvars(self, ctx:SimpleIRParser.LocalvarsContext): """Allocates space for local variables, including input parameters""" # get list of local variables locals = [ local.getText() for local in ctx.NAME() ] # create a list of offsets of the bitwidth offsets = map(lambda x: (x+1)*-self.bitwidth, range(len(locals))) # create a dictionary mapping locals to offsets self.symtab = dict(zip(locals, offsets)) logging.debug(self.symtab) # compute the size of the stack space stackspace = len(self.symtab.keys()) * self.bitwidth logging.debug(stackspace) # ceiling to 8 bytes stackoffset = math.ceil(stackspace / 8) * 8 # align to 16 bytes by adding 8 to account for rbx if already aligned to 16 bytes stackoffset += (stackoffset + 8) % 16 logging.debug(stackoffset) # Emit an instruction to allocate stack space for locals self.outfile.write(indent(dedent(f'''\ # allocate stack space for locals sub\t${stackoffset}, %rsp '''), '\t')) def enterAssign(self, ctx:SimpleIRParser.AssignContext): """Assign value to a variable""" if SimpleIRParser.NAME == ctx.operand.type: operand = f"{self.symtab[ctx.operand.text]}(%rbp)" elif SimpleIRParser.NUM == ctx.operand.type: operand = f"${ctx.operand.text}" self.outfile.write(indent(dedent(f'''\ # assign {ctx.operand.text} to {ctx.NAME(0).getText()} mov\t{operand}, %rax mov\t%rax, {self.symtab[ctx.NAME(0).getText()]}(%rbp) '''), '\t')) def enterOperation(self, ctx:SimpleIRParser.OperationContext): """Arithmetic operation""" # TODO def enterLabel(self, ctx:SimpleIRParser.LabelContext): """Create a label""" # TODO def enterGoto(self, ctx:SimpleIRParser.GotoContext): """Unconditional goto""" # TODO def enterIfgoto(self, ctx:SimpleIRParser.IfgotoContext): """Conditional goto""" # TODO def main(): import sys if len(sys.argv) > 1: filepath = sys.argv[1] input_stream = FileStream(filepath) filename = os.path.basename(filepath) else: input_stream = StdinStream() filename = "stdin" lexer = SimpleIRLexer(input_stream) stream = CommonTokenStream(lexer) parser = SimpleIRParser(stream) tree = parser.unit() if parser.getNumberOfSyntaxErrors() > 0: print("syntax errors") exit(1) # print(tree.toStringTree()) walker = ParseTreeWalker() walker.walk(CodeGen(filename, sys.stdout), tree) if __name__ == '__main__': main()
Emitting assembly code
The CodeGen
class provides a self.outfile
file to write to. In python, write a string using
self.outfile.write("The string to emit")
Alternatively, you can use a format string to make creating templates easier, where anything inside curly braces is evaluated, e.g., the following prints a string followed by the contents of a variable called name
:
self.outfile.write(f"This is the what is in the name variable: {name}")
To retrive ANTLR parse tree contents, use the ctx
context parameter provided to each listener using the name of the token, e.g., the following will get the NAME
token from the syntax tree for a function production and store it in the name
python variable.
name = ctx.NAME()
Arithmetic Operations (enterOperation
)
Arithmetic operations have the following syntax:
operation: NAME ':=' operand1=(NAME | NUM) operator=('+' | '-' | '*' | '/' | '%') operand2=(NAME | NUM);
There are four parts to arithmetic operations that are relevant to the compiler:
- The destination variable
ctx.NAME(0).getText()
- The left operand
ctx.operand1
- The operator
ctx.operator
- The right operand
ctx.operand2
The goal of the compiler is to generate assembly code that implements the arithmetic operation.
Operands
The operands may either be integer constants NUM
or variable names NAME
. See how enterAssign
uses ANTLR facilities to distinguish between the two, i.e., by checking ctx.operand.type
to see whether it is a SimpleIRParser.NAME
or a SimplerIRParser.NUM
. Then depending on which it is, either converts the variable to a load from an offset from the base pointer
f"{self.symtab[ctx.operand1.text]}(%rbp)"
or to an immediate operand
f"${ctx.operand1.text}"
By doing this conversion for each operand, we will have the arguments needed for generating assembly instructions.
The compiler can generate code that
- Loads the operands into temporary registers
- Performs the arithmetic on the temporary registers
- Stores the result into the destination variable
For example, the following SimpleIR
x := x * 6
compiles to
mov -24(%rbp), %rax mov $6, %rbx imul %rbx, %rax mov %rax, -24(%rbp)
if x
has offset -24
.
Destination variable
The destination is a variable, so the compiler will need to generate an assembly instruction that stores the result of the operation into the memory location associated with this variable. The argument to the store operation will also be an offset from the base pointer, i.e.,
{self.symtab[ctx.NAME(0).getText()]}(%rbp)
Generate a mov
instruction that takes whatever the result of the arithmetic operation is and stores it into this offset. For instance, if your compiler generates code that puts the result in %rax
and needs to store it into a variable that has offset -16
, it should generate the following:
mov %rax -16(%rbp)
Operator
The operator is stored in ctx.operator
and can check checked for its type with the following template code. (Note that pass
is a python instruction that does no computation, akin to a no-op. It is just used as a placeholder for what you will implement.)
if SimpleIRParser.PLUS == ctx.operator.type: pass elif SimpleIRParser.MINUS == ctx.operator.type: pass elif SimpleIRParser.STAR == ctx.operator.type: pass elif SimpleIRParser.SLASH == ctx.operator.type: pass elif SimpleIRParser.PERCENT == ctx.operator.type: pass
This conditional block will allow you to generate different code depending on which operator the input program has specified.
See the lecture notes for an explanation of the arithmetic assembly instructions.
See an example of what the compiler can generate for arithmetic IR operations.
Branching
There are two kinds of branching in SimpleIR, unconditional (goto
) and conditional (ifgoto
). Both use a label as a target of the branch.
enterLabel
The label syntax is as follow.
label: NAME ':';
Its syntax is very similar to assembly labels, so the compiler only needs to emit the label name ctx.NAME()
followed by a colon.
enterGoto
The uncondition branch has simple syntax:
goto: 'goto' NAME;
The compiler only needs to generate a jmp
instruction a provide the label name ctx.NAME()
.
For instance, this SimpleIR code
loop: goto loop
generates the following assembly
loop: jmp loop
and causes an infinite loop.
enterIfgoto
Conditional branches have more complicated syntax:
ifgoto: 'if' operand1=(NAME | NUM) operator=('=' | '!=' | '<' | '<=' | '>' | '>=') operand2=(NAME | NUM) 'goto' labelname=NAME;
Conditional branches compare two operands with one of several relational operators and branch to the label only if the comparison is true. They consistent of several parts
- A left operand
- A relational operator
- A rigth operand
- The label to branch to
- Operands
The operands can be handled exactly as they are in arithmetic operations.
- Operator
The operator is handled similarly to those for arithmetic, but have different symbols. The following conditional block checks for each type of operator:
if SimpleIRParser.EQ == ctx.operator.type: pass if SimpleIRParser.NEQ == ctx.operator.type: pass elif SimpleIRParser.LT == ctx.operator.type: pass elif SimpleIRParser.LTE == ctx.operator.type: pass elif SimpleIRParser.GT == ctx.operator.type: pass elif SimpleIRParser.GTE == ctx.operator.type: pass
See the lecture notes on branching for an explanation of the branching assembly instructions.
See an example of what the compiler can generate for branching IR statements.
Debugging with GDB
One way to help trace the function call is to use gdb
. The following will rebuild the main
program with debugging symbols on, run gdb
, then step through each assembly instructions.
gcc -g -o main main.s func.s # compile with debugging symbols (-g) gdb main b main # setup breakpoint at main r # start running, breaks at main si # step instruction to see next instruction # hitting enter will repeat last command, e.g., si c # one done use c to continue running without stopping
Debugging tutorials
See these resources for more information on gdb
.
- Basic Assembler Debugging with GDB
- Chapter 2 of 21st Century C
Full examples
main.ir
function main localvars argc argv x params argc argv x := 1 + 2 # 3 x := x * 6 # 18 x := x - 1 # 17 x := x % 10 # 7 x := x / 3 # 2 return x # 2
main.s
Running
codegen main.ir > main.s
should produce similar assembly output. Note that#
denotes a comment..file "main.ir" .section .note.GNU-stack,"",@progbits .text .globl main .type main, @function main: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $24, %rsp # move register parameter argc to local variable mov %rdi, -8(%rbp) # move register parameter argv to local variable mov %rsi, -16(%rbp) # operation 1 2 to x mov $1, %rax mov $2, %rbx add %rbx, %rax mov %rax, -24(%rbp) # operation x 6 to x mov -24(%rbp), %rax mov $6, %rbx imul %rbx, %rax mov %rax, -24(%rbp) # operation x 1 to x mov -24(%rbp), %rax mov $1, %rbx sub %rbx, %rax mov %rax, -24(%rbp) # operation x 10 to x mov -24(%rbp), %rax mov $10, %rbx cdq idiv %rbx mov %rdx, %rax mov %rax, -24(%rbp) # operation x 3 to x mov -24(%rbp), %rax mov $3, %rbx cdq idiv %rbx mov %rax, -24(%rbp) # set return value mov -24(%rbp), %rax # epilogue pop %rbx # restore rbx for the caller mov %rbp, %rsp # restore old stack pointer pop %rbp # restore old base pointer ret
branching.ir
function main localvars a b x y t1 t2 params a b x := 9 y := 4 if x < 10 goto true1 # true t1 := x + y goto end1 true1: t1 := x - y # 5 end1: if x > 10 goto true2 # false t2 := x * y # 36 goto end2 true2: t2 := x / y end2: return x # 9
branching.s
.file "branching.ir" .section .note.GNU-stack,"",@progbits .text .globl main .type main, @function main: # prologue, update stack pointer pushq %rbp # save old base ponter movq %rsp, %rbp # set new base pointer push %rbx # %rbx is callee-saved # allocate stack space for locals sub $56, %rsp # move register parameter a to local variable mov %rdi, -8(%rbp) # move register parameter b to local variable mov %rsi, -16(%rbp) # assign 9 to x mov $9, %rax mov %rax, -24(%rbp) # assign 4 to y mov $4, %rax mov %rax, -32(%rbp) # ifgoto x 10 to true1 mov -24(%rbp), %rax mov $10, %rbx cmp %rbx, %rax jl true1 # operation x y to t1 mov -24(%rbp), %rax mov -32(%rbp), %rbx add %rbx, %rax mov %rax, -40(%rbp) # goto jmp end1 true1: # operation x y to t1 mov -24(%rbp), %rax mov -32(%rbp), %rbx sub %rbx, %rax mov %rax, -40(%rbp) end1: # ifgoto x 10 to true2 mov -24(%rbp), %rax mov $10, %rbx cmp %rbx, %rax jg true2 # operation x y to t2 mov -24(%rbp), %rax mov -32(%rbp), %rbx imul %rbx, %rax mov %rax, -48(%rbp) # goto jmp end2 true2: # operation x y to t2 mov -24(%rbp), %rax mov -32(%rbp), %rbx cdq idiv %rbx mov %rax, -48(%rbp) end2: # set return value mov -24(%rbp), %rax # epilogue pop %rbx # restore rbx for the caller mov %rbp, %rsp # restore old stack pointer pop %rbp # restore old base pointer ret
More full examples
cd ~/codegen3 wget https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/files/compiler-examples.tar tar -xvf compiler-examples.tar
Testing without codegen1
and codegen2
You can still test your project in isolation without relying on codegen1
and codegen2
's implementation of functions. The following will download a special wrapper for your assembly output. It will put your assembly output inside of a pre-defined assembly function so that you can still run it.
Download the wrapper
Download the wrapper here.
cd ~/codegen3 wget https://www.cs.ucf.edu/~gazzillo/teaching/cop3402fall24/files/compiler-wrapper.tar tar -xvf compiler-wrapper.tar
Files provided in the wrapper
File | Description |
---|---|
iolib.c | Integer read and print functions |
main.ir | Template program |
Makefile | Makefile to wrap your compiler output |
wrapper_end.s | Assembly template end wrapper |
wrapper_start.s | Assembly template start wrapper |
How to use the wrapper
Be sure your compiler only defines the functions given in the template (enterLocalvars, enterAssign, enterOperation, enterLabel, enterGoto, enterIfgoto). Do not define enterUnit, enterFunction, exitFunction, or enterCall and just omit them from your CodeeGen.py
, since code from these functions are pregenerated in the wrapper template for you.
Function | Define in CodeGen.py? |
---|---|
enterLocalvars | Yes |
enterAssign | Yes |
enterOperation | Yes |
enterLabel | Yes |
enterGoto | Yes |
enterIfgoto | Yes |
enterUnit | No |
enterFunction | No |
exitFunction | No |
enterCall | No |
Enter your code to compile inside of the given wrapper/main.ir
where it says "add your code to try out here". Do not change any other lines of the program.
function main # DO NOT CHANGE localvars a b c d e f g result retval # DO NOT CHANGE # add your code to try out here retval := call print_int result # DO NOT CHANGE return 0 # DO NOT CHANGE
Be sure to assign result
to your desired output, since this will be printed by print_int
. Here is an example use of the wrapper:
function main # DO NOT CHANGE localvars a b c d e f g result retval # DO NOT CHANGE # add your code to try out here a := 7 b := 4 result := 7 % 4 retval := call print_int result # DO NOT CHANGE return 0 # DO NOT CHANGE
Compile main.ir
with the wrapper using the following (be sure you have downloaded it already and that you are in your pipenv shell
):
cd ~/codegen3 make -C wrapper clean all
This will run codegen
and add the resulting assembly to a template assembly file, the assembly into a program called wrapper.elf
. If compilation fails you will see error output, such as
main.s: Assembler messages: main.s:16: Error: symbol `main' is already defined make: *** [Makefile:18: main.o] Error 1
Your error will vary depending on what went wrong.
If compilation and assembly were successful, you can run the program with
./wrapper/wrapper.elf
It will print whatever was set to result
in your main.ir
.
How the wrapper works
The wrapper contains the precompiled main.ir
split into wrapper_start.s
and wrapper_end.s
. These contains the assembly code produced by a complete implementation of codegen2
for the function, call, and return IR statements.
The Makefile
runs your codegen
and saves the output to main.s
, i.e.,
codegen main.ir | tee main.s.body
Then it "wraps" the output of your compiler with the start and end of the precompiled assembly file to produce a complete main.s
assembly program, i.e.,
cat wrapper_start.s main.s.body wrapper_end.s > main.s
Finally, it assembles main.s
as usual and links it with iolib.c
, which provides the print_int
function, into the executable wrapper.elf
:
gcc -c main.s gcc -o wrapper.elf main.o iolib.c
Submitting your project
Stage, commit, and push to the grading server
The only file you need to submit is compiler/CodeGen.py
.
Once you have set up the repo, all you need to do is use git add
, git commit
, and git push
to stage, commit, and sync your repo to the grading git server.
Self-check
Be sure you have already downloaded the test wrapper.
You can check that you've submitted correctly by cloning, building, and testing your repo.
cd ~ git clone gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen3 codegen3_new cd codegen3_new pipenv install -e ./ pipenv shell make -C grammar/ codegen << EOT > main.s.body function main # DO NOT CHANGE localvars a b c d e f g result retval # DO NOT CHANGE a := 1 + 2 a := a * 6 a := a - 1 a := a % 10 a := a / 3 result := a retval := call print_int result # DO NOT CHANGE return 0 # DO NOT CHANGE EOT cat wrapper/wrapper_start.s main.s.body wrapper/wrapper_end.s > main.s gcc -o main main.s wrapper/iolib.c ./main # you should see 2 printed out
Note that this is not comprehensive test of your compiler, but just a validation that the project has been submitted and runs for one test case.
(Only if instructed) Updating from the start repo
If the original repo gets updated after you have already started implementing your project, you can get those updates by pulling. Otherwise, you will never need to do this step. Be sure to commit any changes you have made before proceeding.
git pull origin master --rebase git push -f
If you encounter a conflict, it may be that you modified some files from the original repo that didn't need to be modified. Come to office hours if you need help resolving the conflict.
Troubleshooting
If you make a mistake in typing the URL, you can remove the submission remote and try the add step again:
git remote rm submission git remote add submission gitolite3@eustis3.eecs.ucf.edu:cop3402/NID/codegen3 # replace NID with yours
- Do not try creating a new repo if you make a mistake. You will not be able to push the new repo to gitolite3, since there already is one there. You can always make new changes and commit them to fix mistakes.
- If in self-check codegen3_new already exists, just use a fresh directory name.
The program must be run inside of the
pipenv
environment. You can see that you have successfully entered the environment because your prompt is prefixed with(compiler)
, e.g.,(compiler) NID@net1547:~/codegen3$
You can exit the environment with
exit
.
Grading schema
Criterion | Points |
---|---|
The git repo exists | 1 |
compiler/CodeGen.py exists |
1 |
codegen runs the given example correctly | 2 |
codegen runs new example inputs correctly | 2 |
TOTAL | 6 |