UP | HOME

(3) Writing a Toy Compiler
COP-3402, Spring 2024

Table of Contents

Overview

  • A compiler is a language translator
  • Bridges human-friendlier languages and machine code

Toy compiler for arithmetic expressions

  • Take binary arithmetic operations
    • Addition, subtract, multiplication, division)
  • Produce intel assembly
  • Print result of each operation

Example

  • Input program
    1+5;
    7-4;
    
  • Output program
    movl $1, %eax
    movl $5, %ebx
    addl %ebx, %eax
    
    movl $7, %eax
    movl $4, %ebx
    subl %ebx, %eax
    
    # in scratch
    gcc -S -O0 test.c
    cat test.s
    
    • lea uses instruction-pointer-relative addressing
    • rodata (where strings are stored) is after text section
    • .LC0 is made relative by linker
    • by adding %rip (instruction pointer of next instruction), we get absolute address
    • lea does this calculation and puts the address (not the contents) into parameter for printf

How would you write this compiler?

Assignment

  • Practice using tooling by reimplementing the toy compiler.
  • Follow-along with the implementation
    • Lets you see compiler design first-hand
    • Frees you from having to debug, come up with algorithms
    • Gets you used the programming language and environment for the real project

to learn a new language, i would take publicly-available examples and write them out myself, just to get used to the syntax, idioms, and programming environment

Follow-along means type it out yourself without help from others.

  • Submitting someone else's version is a violation of academic integrity.
  • This is the only time in this course you are allowed to use or copy existing source code.

Compiler overview

  • Input: program in source language
  • Output: equivalent program in target language
  • (Diagram)

Reminder first about high-level compiler view: file.c -> compiler.exe -> file.exe (which has it's own input and output).

Phases of a compiler

compiler.jpg

Dragon book

  • Lexing
  • Parsing
  • Type-checking
  • Intermediate language generation
  • Optimization
  • Machine code Generation

Lexing

  • Takes a string of characters
  • Produces a stream of words (called tokens)
  • Examples: "1+2;" or "7-4;"
od -c

Parsing

  • Takes a stream of tokens
  • Produces a syntax tree
    • Sentence diagramming
    • (Diagram)
  • Examples: 1+5;

Also show a nested if statement

Type-checking

  • Takes a syntax tree
  • Produces a symbol table
    • Collects data type of each symbol
    • Example (diagram)

Show an example of a C int plus a struct

Intermediate code generation

  • Takes a syntax tree and symbol table
  • Produces a machine-code-like language
  • This enables compiler portability

LLVM IR Diagram

llvm_ir.webp

tmpa2017-vellvm-verifying-the-llvm-10-638.jpg?cb=1490261798

Optimization

  • Takes intermediate/machine code
  • Produces code that is "better"
    • Faster, smaller, more energy efficient depending on design
  • Example: constant propagation
x = 10;
y = 1 + x;
if (y > 7) {
  printf("%d\n", y);
}

Machine code generation

  • Takes intermediate code
  • Produces machine code
  • Separating this stage enables compiler portability
gcc -S file.c

Let's write a toy compiler together

  • Takes a list of arithmetic operations
  • Produces intel assembly that prints the results
  • We can use gcc to assemble and link into an executable

Example

Input:

1+5;
8*3;

Output:

movl $1, %eax
movl $5, %ebx
addl %ebx, %eax
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

movl $7, %eax
movl $4, %ebx
subl %ebx, %eax
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

Toy compiler (informal) specification

What do we really mean by arithmetic expressions?

  • What ASCII symbols should be used?
  • Are spaces allowed?
  • What do we mean by a new line?
  • How do we represent numbers on the physical machine?
  • How do we represent arithmetic operations on the physical machine?
od -c example.toy
od -x example.toy
man ascii

Informal syntax

  • A program is a list of statements
  • A list of statement contains one statement and another statement list
  • A statement contains one expression followed by a semicolon
  • An expression contains a number, then an operator, then a number
  • A number is a single ASCII code for a digit
  • An operator is an ASCII code for plus, minus, star, or forward slash
Note the hierarchical nature of this, where a program can be decomposed into a list of statements, and statements can be further broken down into expressions and semicolons, etc.

Informal semantics

  • ASCII code for digits mean machine integers
  • Each ASCII code for an operator means an arithmetic machine operation
    • Plus sign for addition
    • Minus sign for substract
    • Star for multiplication
    • Forward slash for divide
  • Each complete statement computes and prints the result of the machine operation

Lexer

Text files are just bytes

  • Each byte corresponds to graphical character by convention
  • ASCII table man ascii
  • od -x

Tokens

  • NUMBER: single digit
  • OPERATOR: plus, minus, multiply, divide
  • SEMICOLON
  • END: signifies the program is finished

Token constructors

  • create_number_token()
  • create_operator_token()
  • create_semicolon_token()
  • create_end_token()

lex() algorithm

  • Loop over each character
  • Match character
  • Construct and return token object

lexer abstracts away file handling and individual characters and provides a token stream

Regular expressions and finite state automata

  • Generalizes lexing
  • Will be covered in lecture 6

Parser

Syntax

  • statement_list: contains a statement and next list
  • statement: contains an expression
  • expression: contains two operands and an operator (all tokens)

Tree constructors

  • create_statement_list
  • create_statement
  • create_expression

Interface between parser and lexer

  • next_token() - calls lex() to get next token
  • get_lookahead() - inspects current token

Context-free grammars and pushdown automata

  • Generalizes parsing
  • Will be Covered in lecture 7

Code generation

  • Takes the parser's syntax tree
  • Traverses tree
  • Emits assembly equivalent of each construct

Tree walkers

  • gencode_statement_list
  • gencode_statement
  • gencode_expression

Syntax of the language gives us a natural design for our compiler. We can process each language construct (statement, expression, etc.) in its own function.

Assembly code

  • addl, subl, imull, idiv

Using the templates with generated code

cat template_start.s <(cat example.toy | ./toy) template_end.s > example.s

Turning your compiler's output into an exectuable and running it

gcc -o example example.s
./example

Diffing the output with the expected output

diff -w example.s.expected example.s

Coding tips

  • Create separate components
    • Reason about and test separately
  • Make assumptions explicit
  • Make implementation match specification

Breaking down the problem

  • lexer.c
  • parser.c
  • codegen.c
  • main.c

Architecture

  • Lexer - lex() takes a file (sequence of characters), yields one token at a time
    • Token - a sequence of characters, represents one word of the language
  • Parser - parse() uses lex(), checks syntax, and produces an abstract syntax tree (AST)
    • AST - a data-structure that represents the input program's syntax
  • Code generator - traverses the AST, generating equivalent assembly code

Submitting your project via GitHub

  • Use the toy compiler GitHub assignment
  • git add and git commit source
    • ideally, small, well-defined commits with clear commit messages
  • Double-check with git status and on the repo website
  • Push
  • Confirm by cloning in a separate directory
    • git clone url dirname
    • (Don't clone in your current repo's directory)

(Live coding)

setup the environment and git, walk through commands

go over makefile, autobuild

go over header file, just copies declarations

specification vs implementation

(Scroll through the toy compiler code)

Wrap-up

  • We looked at
    • The phases of a compiler
    • A toy compiler for a simple language
    • Syntax specification
      • Tree creation
    • Semantic specification
      • Tree traversal

Resources

example.toy

1+5;
7-4;
8*3;
9/4;

cat example.toy | ./toy

movl $1, %eax
movl $5, %ebx
addl %ebx, %eax
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

movl $7, %eax
movl $4, %ebx
subl %ebx, %eax
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

movl $8, %eax
movl $3, %ebx
imull %ebx, %eax
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

movl $9, %eax
movl $4, %ebx
cdq
idiv %ebx
movl  %eax, %esi
leaq    .LC0(%rip), %rdi
movl    $0, %eax
call    printf@PLT

cat template_start.s <(cat example.toy | ./toy) template_end.s > example.s

  .file	"toy"
  .text
  .section	.rodata
.LC0:
  .string	"%d\n"
  .text
  .globl	main
  .type	main, @function


main:
.LFB0:
  .cfi_startproc
  endbr64
  pushq	%rbp
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq	%rsp, %rbp
  .cfi_def_cfa_register 6

  movl $1, %eax
  movl $5, %ebx
  addl %ebx, %eax
  movl  %eax, %esi
  leaq	.LC0(%rip), %rdi
  movl	$0, %eax
  call	printf@PLT

  movl $7, %eax
  movl $4, %ebx
  subl %ebx, %eax
  movl  %eax, %esi
  leaq	.LC0(%rip), %rdi
  movl	$0, %eax
  call	printf@PLT

  movl $8, %eax
  movl $3, %ebx
  imull %ebx, %eax
  movl  %eax, %esi
  leaq	.LC0(%rip), %rdi
  movl	$0, %eax
  call	printf@PLT

  movl $9, %eax
  movl $4, %ebx
  cdq
  idiv %ebx
  movl  %eax, %esi
  leaq	.LC0(%rip), %rdi
  movl	$0, %eax
  call	printf@PLT
  movl	$0, %eax
  popq	%rbp
  .cfi_def_cfa 7, 8
  ret
  .cfi_endproc


.LFE0:
  .size	main, .-main
  .ident	"GCC: (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008"
  .section	.note.GNU-stack,"",@progbits
  .section	.note.gnu.property,"a"
  .align 8
  .long	 1f - 0f
  .long	 4f - 1f
  .long	 5
0:
  .string	 "GNU"
1:
  .align 8
  .long	 0xc0000002
  .long	 3f - 2f
2:
  .long	 0x3
3:
  .align 8
4:

Makefile

SRC := \
  lexer.c \
  parser.c \
  codegen.c \
  main.c

OBJ := $(SRC:%.c=%.o)

PRG := toy

.PHONY: all clean

all: $(PRG)

$(PRG): $(OBJ)
  $(CC) $(CFLAGS) -o $@ $^

%.o: %.c
  $(CC) $(CFLAGS) -c $<

clean:
  rm -f $(OBJ) $(PRG)

Author: Paul Gazzillo

Created: 2024-02-01 Thu 21:24