Compiler Basics
Compilers
COP-3402

Programming language implementation

Can your processor run C code?

gcc -o hello hello.c
./hello

Why are there two commands?

gcc takes source code and converts it to a runnable program.

Keep in mind that gcc itself is also a program and bash creates a process to exec the gcc program

Two main implementation techniques

Compilers and Interpreters

Compiler or interpreter, either way, we have another program that takes our C code and makes it run on our machine.

Two general techniques, though these are actually blended for real-world compiled or interpreted languages.

Roughly two ways to design your programming language implementation:

Compile or translate to the instructions that the machine does run, so that the machine can run an equivalent of the program
Interpret or simulate the actions of the instructions to run the program and produce the output

Compiler

The compiler translates your source instructions to machine code.

Diagram

Compiler

The machine code should have the same outputs that the source program would.

Functional equivalence is possible when programs are deterministic. How can programs, such as random number generators, create apparent non-determinism with a deterministic program? They can by using random (or hard to guess) inputs.

Probabilistic and quantum programming languages have probability distributions as outputs.

Interpreter

The interpreter simulates the actions of the source instructions.

Diagram

Emulators are a special kind of interpreter interpreters of machine code, which can be used to run code written for another processor.

Virtual machines are also interpreters of machine code, though sometimes virtualization is useful even for the same processor architecture.

Interpreter

The interpreted program should have the same outputs that the source program would.

Illustrating language implementation

Getting "1+2" to run on our machine.

We assume a syntax and semantics for arithmetic operations that is similar to what C-like languages have.

Compiler approach

Translate "1+2" to machine code.

// read the source code
char left = getchar();
char operator = getchar();
char right = getchar();

// generate assembly code that performs the addition
if ('+' == operator) {
  printf("  mov $%c, %%rax\n", left);
  printf("  mov $%c, %%rbx\n", right);
  printf("  add %%rbx, %%rax\n");  // print the addition instruction
}

Interpreter approach

Interpret the result of "1+2".

// read the source code
char left = getchar();
char operator = getchar();
char right = getchar();

// compute the result of the addition
if ('+' == operator) {
  int leftnum = left - '0';
  int rightnum = right - '0';
  int result = leftnum + rightnum;  // perform addition
  printf("%d\n", result);
}

Observe that it's pretty straightforward to interpet addition, because we have the same representation for it in C.

Key difference

Compiler program: prints out the machine code for addition
Interpreter program: perform the addition operation

Run and show the output of language.c.

Interpreting exponents, e.g., "2^8"?

No machine instruction or C instruction for exponentiation.

Interpreter pseudo-code

read left operand
read right operand
read operator

if operator == "^":
  result = 1
  for i = 1 to right operand:
    result = result * left operand
  print result

Full example for + and ^

language.c

/**
  gcc -o language language.c
  echo "1+2" | ./language
  echo "2^8" | ./language
*/

#include <stdio.h>
#include <stdlib.h>

#define CHAR2INT(c) ((c) - '0')

int main() {
  char left = getchar();
  char operator = getchar();
  char right = getchar();

  printf("input: %c%c%c\n", left, operator, right);

  printf("compiler\n");
  switch (operator) {
  case '+':
    printf("  mov $%c, %%rax\n", left);
    printf("  mov $%c, %%rbx\n", right);
    printf("  add %%rbx, %%rax\n");
    break;
  case '^':
    printf("  mov $%c, %%rbx\n", left);
    printf("  mov $%c, %%rcx\n", right);
    printf("  mov $1, %%rax\n");
    printf("loop:\n");
    printf("  cmp $0, %%rcx\n");
    printf("  jle end\n");
    printf("  imul %%rbx, %%rax\n");
    printf("  sub $1, %%rcx\n");
    printf("  jmp loop\n");
    printf("end:\n");
    break;
  }
  printf("\n");

  printf("interpreter\n");
  int leftnum = CHAR2INT(left);
  int rightnum = CHAR2INT(right);
  int result;
  switch (operator) {
  case '+':
    result = leftnum + rightnum;
    printf("%d\n", result);
    break;
  case '^':
    result = 1;
    for (; rightnum > 0; rightnum--) {
      result *= leftnum;
    }
    printf("%d\n", result);
    break;
  }
}

COP-5621 Compiler Construction

If you want to learn more about how programming languages are defined and implemented.

Compiler Internals

Front-end: process input language
Back-end: generate output language

front-end
- input language processing
- produces a syntax tree (will go into this later intuitively and formally with a grammar)
back-end
- output language generation
- produces assembly/machine code

(Diagram)

overall: input language -> compiler -> output
front/back ends:
- input -> front-end -> syntax tree -> back-end -> output

The "middle"-end

Many compilers use an intermediate language

Diagram

input -> front -> syntax tree -> middle-end
- IR generator -> IR -> optimizer -> IR
middle-end -> IR -> back-end -> output
IR often like a machine code, but without machine-like specifics (e.g., register-based, but infinite registers)

Why an intermediate representation?

Easier front-end development
Easier back-end development
Portability
Machine-independent optimization
Modularization

Multiple front-ends and back-ends

Classic phases of a compiler

Dragon book

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

Moon vs. finger pointing at the moon
"The map is not the territory"

symbols have no instrinsic meaning
the interpreter/compiler/human defines meaning
compiler defines input language in terms of output language
why does output language have meaning?

ASCII vs. machine code

od
gcc -S
ASCII table

Recall interpreter last time

Input language numbering?
Machine code numbering?

Syntax trees

Human language

"The boy went to the store"

Diagram this sentence

Syntax: groups of words

One sentence type:
- subject phrase, verb phrase, object phrase

Syntactic structure captures relationships between words

"The boy leaving behind his friend went to the store."

who/what went? what word goes with "went"
the boy, because nested structures allow us to go off on another clause and return to the parent clause

What does "store" mean?

"The store is closed"

"Squirrels store acorns for the winter".

Meaning depends on how it is used with other words:

Store as a noun: a shop that sells products
Store as a verb: to place for later use

Meaning depends on structure

Arithmetic expressions

9 + 3 - 2

Same idea as spoken language
Diagram this "sentence"
Operations are nested inside of each other
Do one operation at a time (even in your head)

What is the assembly code version of "9 + 3 - 2"?

Operations are one at a time
Order of operations depend on the syntax tree
Inuitively, you can walk the tree and compute the meaning (or generate equivalent assembly!)

Evaluating the tree

Post order traversal
Leaf nodes: value of the number symbol
Inner nodes: value is result of operation on child values

Ambiguity

1 + 2 * 3

What is the meaning, e.g., resulting value, of this expression?
What are the two typical ways this can be interpreted?
Draw the trees and perform the interpretation on each
Perform the compilation on one of the trees

Parsing

Problem: we don't say "noun phrase starts here", "subject is here"

How do we figure out syntactic structure?

Syntactic structure (word groupings) defines meaning
Spoken language and source code have no intrinsic structure

How is source code represented?

Just a sequence of ascii codes
No instrinsic structure
No instrinsic meaning (ascii numbers aren't machine integers)

Parsing infers structure

Parsing infers structure using word order and groupings

Define what we need by syntax

Formal grammars

Terminals

Nonterminals

Productions

Starting symbol

formal grammar
- two kinds of symbols
  - terminals (words)
  - nonterminals (structure names)
- productions (rules defining contents/children of structures)
- starting symbol (the top of the syntax tree)

Terminals

Words or tokens
What you see or hear, e.g., C source code

Nonterminals

Names of syntactic structures, e.g., verb phrase, expression, function definition
Not uttered explicitly

Productions

Rules mapping nonterminals to their contents
Defines legal groupings of symbols in the language
Written using an arrow nonterminal -> symbol1 symbol2 ...

Strictly speaking, context-free grammars are restricted to having a single nonterminal on the left-hand side of the arrow, but context-sensitive grammars can have more symbols on the left-hand side.

Starting symbol

The first nonterminal in the syntax tree

Constructs can be nested

For example, E -> E + num
- Expressions may contains other expressions
- E.g., 9 + 3 - 2 nests 9 + 3 inside of the subtraction operation

Example: arithmetic expressions

E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> N
N -> 0
N -> 1
N -> 2
N -> 3
N -> 4
N -> 5
N -> 6
N -> 7
N -> 8
N -> 9

The terminals, i.e., uttered words in the language, are the operators (+, -, *, /) and the digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

The nonterminals are the names of the grammar constructs, specifically E for expression, num for number, and op for operator.

The pipe symbol here means alternative, i.e., num -> 0 | 1 is equivalent to two rules, num -> 0 and num -> 1.

The starting symbol, by convetion, is the nonterminal on the left-hand side of the first rule.

Deriving a tree

Find a sequence of rule applications that
- Begins with the start symbol
- Ends with the string

Show derivation of the tree for 9 + 3 - 2

Defining an interpreter

Walk the tree
Evaluate each node of the tree
Define computation for each grammar production

Expression interpreter rules

Assume we have the following interpreter functions implemented
- add, sub, mult, div to perform addition, subtraction, multiplication, and division computations
- ascii2int to convert the ASCII character value to an integer represetation

Specifying the interpreter

Define computation for each grammar construct

E -> E + E  { E.value = add(E1.value, E2.value) }
E -> E - E  { E.value = sub(E1.value, E2.value) }
E -> E * E  { E.value = mult(E1.value, E2.value) }
E -> E / E  { E.value = div(E1.value, E2.value) }
E -> N      { E.value = ascii2int(N.value) }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Defining a compiler

Similar tree-walking approach
Generate code instead of compute result

Compile infix to postfix

Assume we have string concat function to join any number of strings.

E -> E + E  { E.value = concat(E1.value, E2.value, "+") }
E -> E - E  { E.value = concat(E1.value, E2.value, "-") }
E -> E * E  { E.value = concat(E1.value, E2.value, "*") }
E -> E / E  { E.value = concat(E1.value, E2.value, "/") }
E -> N      { E.value = N.value }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Note that we no longer need to convert ascii to a machine integer, since we can just print the output program in ascii as well.

Compiler Basics Compilers COP-3402

Programming language implementation

Compiler

Compiler

Interpreter

Interpreter

Illustrating language implementation

Compiler approach

Interpreter approach

Key difference

Interpreter pseudo-code

Full example for + and ^

COP-5621 Compiler Construction

Compiler Internals

The "middle"-end

Why an intermediate representation?

Multiple front-ends and back-ends

Classic phases of a compiler

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

ASCII vs. machine code

Syntax trees

Human language

Syntax: groups of words

What does "store" mean?

Meaning depends on structure

Arithmetic expressions

Evaluating the tree

Ambiguity

Parsing

Parsing infers structure

Formal grammars

Terminals

Nonterminals

Productions

Starting symbol

Constructs can be nested

Example: arithmetic expressions

Deriving a tree

Defining an interpreter

Expression interpreter rules

Specifying the interpreter

Defining a compiler

Compile infix to postfix

Compiler Basics
Compilers
COP-3402