(3) Writing a Toy Compiler
COP-3402, Spring 2024
Table of Contents
- Overview
- Toy compiler for arithmetic expressions
- How would you write this compiler?
- Assignment
- Compiler overview
- Let's write a toy compiler together
- Toy compiler (informal) specification
- Lexer
- Parser
- Code generation
- Coding tips
- Submitting your project via GitHub
- (Live coding)
- Wrap-up
- Resources
- Makefile
Overview
- A compiler is a language translator
- Bridges human-friendlier languages and machine code
Toy compiler for arithmetic expressions
- Take binary arithmetic operations
- Addition, subtract, multiplication, division)
- Produce intel assembly
- Print result of each operation
Example
- Input program
1+5; 7-4;
- Output program
movl $1, %eax movl $5, %ebx addl %ebx, %eax movl $7, %eax movl $4, %ebx subl %ebx, %eax
# in scratch gcc -S -O0 test.c cat test.s
- lea uses instruction-pointer-relative addressing
- rodata (where strings are stored) is after text section
- .LC0 is made relative by linker
- by adding %rip (instruction pointer of next instruction), we get absolute address
- lea does this calculation and puts the address (not the contents) into parameter for printf
How would you write this compiler?
Assignment
- Practice using tooling by reimplementing the toy compiler.
- Follow-along with the implementation
- Lets you see compiler design first-hand
- Frees you from having to debug, come up with algorithms
- Gets you used the programming language and environment for the real project
to learn a new language, i would take publicly-available examples and write them out myself, just to get used to the syntax, idioms, and programming environment
Follow-along means type it out yourself without help from others.
- Submitting someone else's version is a violation of academic integrity.
- This is the only time in this course you are allowed to use or copy existing source code.
Compiler overview
- Input: program in source language
- Output: equivalent program in target language
- (Diagram)
Reminder first about high-level compiler view: file.c -> compiler.exe -> file.exe (which has it's own input and output).
Phases of a compiler
Dragon book
- Lexing
- Parsing
- Type-checking
- Intermediate language generation
- Optimization
- Machine code Generation
Lexing
- Takes a string of characters
- Produces a stream of words (called tokens)
- Examples: "1+2;" or "7-4;"
od -c
Parsing
- Takes a stream of tokens
- Produces a syntax tree
- Sentence diagramming
- (Diagram)
- Examples: 1+5;
Also show a nested if statement
Type-checking
- Takes a syntax tree
- Produces a symbol table
- Collects data type of each symbol
- Example (diagram)
Show an example of a C int plus a struct
Intermediate code generation
- Takes a syntax tree and symbol table
- Produces a machine-code-like language
- This enables compiler portability
LLVM IR Diagram
Optimization
- Takes intermediate/machine code
- Produces code that is "better"
- Faster, smaller, more energy efficient depending on design
- Example: constant propagation
x = 10; y = 1 + x; if (y > 7) { printf("%d\n", y); }
Machine code generation
- Takes intermediate code
- Produces machine code
- Separating this stage enables compiler portability
gcc -S file.c
Let's write a toy compiler together
- Takes a list of arithmetic operations
- Produces intel assembly that prints the results
- We can use gcc to assemble and link into an executable
Example
Input:
1+5; 8*3;
Output:
movl $1, %eax movl $5, %ebx addl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $7, %eax movl $4, %ebx subl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT
Toy compiler (informal) specification
What do we really mean by arithmetic expressions?
- What ASCII symbols should be used?
- Are spaces allowed?
- What do we mean by a new line?
- How do we represent numbers on the physical machine?
- How do we represent arithmetic operations on the physical machine?
od -c example.toy od -x example.toy man ascii
Informal syntax
- A program is a list of statements
- A list of statement contains one statement and another statement list
- A statement contains one expression followed by a semicolon
- An expression contains a number, then an operator, then a number
- A number is a single ASCII code for a digit
- An operator is an ASCII code for plus, minus, star, or forward slash
Note the hierarchical nature of this, where a program can be decomposed into a list of statements, and statements can be further broken down into expressions and semicolons, etc.
Informal semantics
- ASCII code for digits mean machine integers
- Each ASCII code for an operator means an arithmetic machine operation
- Plus sign for addition
- Minus sign for substract
- Star for multiplication
- Forward slash for divide
- Each complete statement computes and prints the result of the machine operation
Lexer
Text files are just bytes
- Each byte corresponds to graphical character by convention
- ASCII table
man ascii
od -x
Tokens
- NUMBER: single digit
- OPERATOR: plus, minus, multiply, divide
- SEMICOLON
- END: signifies the program is finished
Token constructors
- create_number_token()
- create_operator_token()
- create_semicolon_token()
- create_end_token()
lex() algorithm
- Loop over each character
- Match character
- Construct and return token object
lexer abstracts away file handling and individual characters and provides a token stream
Regular expressions and finite state automata
- Generalizes lexing
- Will be covered in lecture 6
Parser
Syntax
- statement_list: contains a statement and next list
- statement: contains an expression
- expression: contains two operands and an operator (all tokens)
Tree constructors
- create_statement_list
- create_statement
- create_expression
Interface between parser and lexer
- next_token() - calls lex() to get next token
- get_lookahead() - inspects current token
Context-free grammars and pushdown automata
- Generalizes parsing
- Will be Covered in lecture 7
Code generation
- Takes the parser's syntax tree
- Traverses tree
- Emits assembly equivalent of each construct
Tree walkers
- gencode_statement_list
- gencode_statement
- gencode_expression
Syntax of the language gives us a natural design for our compiler. We can process each language construct (statement, expression, etc.) in its own function.
Assembly code
- addl, subl, imull, idiv
Using the templates with generated code
cat template_start.s <(cat example.toy | ./toy) template_end.s > example.s
Turning your compiler's output into an exectuable and running it
gcc -o example example.s ./example
Diffing the output with the expected output
diff -w example.s.expected example.s
Coding tips
- Create separate components
- Reason about and test separately
- Make assumptions explicit
- Make implementation match specification
Breaking down the problem
- lexer.c
- parser.c
- codegen.c
- main.c
Architecture
- Lexer - lex() takes a file (sequence of characters), yields one token at a time
- Token - a sequence of characters, represents one word of the language
- Parser - parse() uses lex(), checks syntax, and produces an abstract syntax tree (AST)
- AST - a data-structure that represents the input program's syntax
- Code generator - traverses the AST, generating equivalent assembly code
Submitting your project via GitHub
- Use the toy compiler GitHub assignment
git add
andgit commit
source- ideally, small, well-defined commits with clear commit messages
- Double-check with
git status
and on the repo website - Push
- Confirm by cloning in a separate directory
git clone url dirname
- (Don't clone in your current repo's directory)
(Live coding)
setup the environment and git, walk through commands
go over makefile, autobuild
go over header file, just copies declarations
specification vs implementation
(Scroll through the toy compiler code)
Wrap-up
- We looked at
- The phases of a compiler
- A toy compiler for a simple language
- Syntax specification
- Tree creation
- Semantic specification
- Tree traversal
Resources
example.toy
1+5; 7-4; 8*3; 9/4;
cat example.toy | ./toy
movl $1, %eax movl $5, %ebx addl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $7, %eax movl $4, %ebx subl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $8, %eax movl $3, %ebx imull %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $9, %eax movl $4, %ebx cdq idiv %ebx movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT
cat template_start.s <(cat example.toy | ./toy) template_end.s > example.s
.file "toy" .text .section .rodata .LC0: .string "%d\n" .text .globl main .type main, @function main: .LFB0: .cfi_startproc endbr64 pushq %rbp .cfi_def_cfa_offset 16 .cfi_offset 6, -16 movq %rsp, %rbp .cfi_def_cfa_register 6 movl $1, %eax movl $5, %ebx addl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $7, %eax movl $4, %ebx subl %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $8, %eax movl $3, %ebx imull %ebx, %eax movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $9, %eax movl $4, %ebx cdq idiv %ebx movl %eax, %esi leaq .LC0(%rip), %rdi movl $0, %eax call printf@PLT movl $0, %eax popq %rbp .cfi_def_cfa 7, 8 ret .cfi_endproc .LFE0: .size main, .-main .ident "GCC: (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008" .section .note.GNU-stack,"",@progbits .section .note.gnu.property,"a" .align 8 .long 1f - 0f .long 4f - 1f .long 5 0: .string "GNU" 1: .align 8 .long 0xc0000002 .long 3f - 2f 2: .long 0x3 3: .align 8 4:
Makefile
SRC := \ lexer.c \ parser.c \ codegen.c \ main.c OBJ := $(SRC:%.c=%.o) PRG := toy .PHONY: all clean all: $(PRG) $(PRG): $(OBJ) $(CC) $(CFLAGS) -o $@ $^ %.o: %.c $(CC) $(CFLAGS) -c $< clean: rm -f $(OBJ) $(PRG)