UP | HOME

Compiler Basics
Compilers
COP-3402

Programming language implementation

Can your processor run C code?

gcc -o hello hello.c
./hello

Why are there two commands?

gcc takes source code and converts it to a runnable program.

Keep in mind that gcc itself is also a program and bash creates a process to exec the gcc program

Two main implementation techniques

Compilers and Interpreters

Compiler or interpreter, either way, we have another program that takes our C code and makes it run on our machine.

Two general techniques, though these are actually blended for real-world compiled or interpreted languages.

Roughly two ways to design your programming language implementation:

  • Compile or translate to the instructions that the machine does run, so that the machine can run an equivalent of the program
  • Interpret or simulate the actions of the instructions to run the program and produce the output

Compiler

The compiler translates your source instructions to machine code.

Diagram

Compiler

The machine code should have the same outputs that the source program would.

Functional equivalence is possible when programs are deterministic. How can programs, such as random number generators, create apparent non-determinism with a deterministic program? They can by using random (or hard to guess) inputs.

Probabilistic and quantum programming languages have probability distributions as outputs.

Interpreter

The interpreter simulates the actions of the source instructions.

Diagram

Emulators are a special kind of interpreter interpreters of machine code, which can be used to run code written for another processor.

Virtual machines are also interpreters of machine code, though sometimes virtualization is useful even for the same processor architecture.

Interpreter

The interpreted program should have the same outputs that the source program would.

Illustrating language implementation

Getting "1+2" to run on our machine.

We assume a syntax and semantics for arithmetic operations that is similar to what C-like languages have.

Compiler approach

Translate "1+2" to machine code.

// read the source code
char left = getchar();
char operator = getchar();
char right = getchar();

// generate assembly code that performs the addition
if ('+' == operator) {
  printf("  mov $%c, %%rax\n", left);
  printf("  mov $%c, %%rbx\n", right);
  printf("  add %%rbx, %%rax\n");  // print the addition instruction
}

Interpreter approach

Interpret the result of "1+2".

// read the source code
char left = getchar();
char operator = getchar();
char right = getchar();

// compute the result of the addition
if ('+' == operator) {
  int leftnum = left - '0';
  int rightnum = right - '0';
  int result = leftnum + rightnum;  // perform addition
  printf("%d\n", result);
}

Observe that it's pretty straightforward to interpet addition, because we have the same representation for it in C.

Key difference

  • Compiler program: prints out the machine code for addition
  • Interpreter program: perform the addition operation

Run and show the output of language.c.

Interpreting exponents, e.g., "2^8"?

No machine instruction or C instruction for exponentiation.

Interpreter pseudo-code

read left operand
read right operand
read operator

if operator == "^":
  result = 1
  for i = 1 to right operand:
    result = result * left operand
  print result

Full example for + and ^

language.c

/**
  gcc -o language language.c
  echo "1+2" | ./language
  echo "2^8" | ./language
*/

#include <stdio.h>
#include <stdlib.h>

#define CHAR2INT(c) ((c) - '0')

int main() {
  char left = getchar();
  char operator = getchar();
  char right = getchar();

  printf("input: %c%c%c\n", left, operator, right);

  printf("compiler\n");
  switch (operator) {
  case '+':
    printf("  mov $%c, %%rax\n", left);
    printf("  mov $%c, %%rbx\n", right);
    printf("  add %%rbx, %%rax\n");
    break;
  case '^':
    printf("  mov $%c, %%rbx\n", left);
    printf("  mov $%c, %%rcx\n", right);
    printf("  mov $1, %%rax\n");
    printf("loop:\n");
    printf("  cmp $0, %%rcx\n");
    printf("  jle end\n");
    printf("  imul %%rbx, %%rax\n");
    printf("  sub $1, %%rcx\n");
    printf("  jmp loop\n");
    printf("end:\n");
    break;
  }
  printf("\n");

  printf("interpreter\n");
  int leftnum = CHAR2INT(left);
  int rightnum = CHAR2INT(right);
  int result;
  switch (operator) {
  case '+':
    result = leftnum + rightnum;
    printf("%d\n", result);
    break;
  case '^':
    result = 1;
    for (; rightnum > 0; rightnum--) {
      result *= leftnum;
    }
    printf("%d\n", result);
    break;
  }
}

COP-5621 Compiler Construction

If you want to learn more about how programming languages are defined and implemented.

Compiler Internals

  • Front-end: process input language
  • Back-end: generate output language
  • front-end
    • input language processing
    • produces a syntax tree (will go into this later intuitively and formally with a grammar)
  • back-end
    • output language generation
    • produces assembly/machine code

(Diagram)

  • overall: input language -> compiler -> output
  • front/back ends:
    • input -> front-end -> syntax tree -> back-end -> output

The "middle"-end

Many compilers use an intermediate language

Diagram

  • input -> front -> syntax tree -> middle-end
    • IR generator -> IR -> optimizer -> IR
  • middle-end -> IR -> back-end -> output
  • IR often like a machine code, but without machine-like specifics (e.g., register-based, but infinite registers)

Why an intermediate representation?

  • Easier front-end development
  • Easier back-end development
  • Portability
  • Machine-independent optimization
  • Modularization

Multiple front-ends and back-ends

llvm_ir.webp

Classic phases of a compiler

compiler.jpg

Dragon book

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

  • symbols have no instrinsic meaning
  • the interpreter/compiler/human defines meaning
  • compiler defines input language in terms of output language
  • why does output language have meaning?

ASCII vs. machine code

Recall interpreter last time

  • Input language numbering?
  • Machine code numbering?

Syntax trees

Human language

"The boy went to the store"

Diagram this sentence

Syntax: groups of words

  • One sentence type:
    • subject phrase, verb phrase, object phrase

Syntactic structure captures relationships between words

"The boy leaving behind his friend went to the store."

  • who/what went? what word goes with "went"
  • the boy, because nested structures allow us to go off on another clause and return to the parent clause

What does "store" mean?

"The store is closed"

"Squirrels store acorns for the winter".

Meaning depends on how it is used with other words:

  • Store as a noun: a shop that sells products
  • Store as a verb: to place for later use

Meaning depends on structure

Arithmetic expressions

9 + 3 - 2

  • Same idea as spoken language
  • Diagram this "sentence"
  • Operations are nested inside of each other
  • Do one operation at a time (even in your head)

What is the assembly code version of "9 + 3 - 2"?

  • Operations are one at a time
  • Order of operations depend on the syntax tree
  • Inuitively, you can walk the tree and compute the meaning (or generate equivalent assembly!)

Evaluating the tree

  • Post order traversal
  • Leaf nodes: value of the number symbol
  • Inner nodes: value is result of operation on child values

Ambiguity

1 + 2 * 3

  • What is the meaning, e.g., resulting value, of this expression?
  • What are the two typical ways this can be interpreted?
  • Draw the trees and perform the interpretation on each
  • Perform the compilation on one of the trees

Parsing

Problem: we don't say "noun phrase starts here", "subject is here"

How do we figure out syntactic structure?

  • Syntactic structure (word groupings) defines meaning
  • Spoken language and source code have no intrinsic structure

How is source code represented?

  • Just a sequence of ascii codes
  • No instrinsic structure
  • No instrinsic meaning (ascii numbers aren't machine integers)

Parsing infers structure

Parsing infers structure using word order and groupings

Define what we need by syntax

Formal grammars

Terminals
Nonterminals
Productions
Starting symbol
  • formal grammar
    • two kinds of symbols
      • terminals (words)
      • nonterminals (structure names)
    • productions (rules defining contents/children of structures)
    • starting symbol (the top of the syntax tree)

Terminals

  • Words or tokens
  • What you see or hear, e.g., C source code

Nonterminals

  • Names of syntactic structures, e.g., verb phrase, expression, function definition
  • Not uttered explicitly

Productions

  • Rules mapping nonterminals to their contents
  • Defines legal groupings of symbols in the language
  • Written using an arrow nonterminal -> symbol1 symbol2 ...

Strictly speaking, context-free grammars are restricted to having a single nonterminal on the left-hand side of the arrow, but context-sensitive grammars can have more symbols on the left-hand side.

Starting symbol

  • The first nonterminal in the syntax tree

Constructs can be nested

  • For example, E -> E + num
    • Expressions may contains other expressions
    • E.g., 9 + 3 - 2 nests 9 + 3 inside of the subtraction operation

Example: arithmetic expressions

E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> N
N -> 0
N -> 1
N -> 2
N -> 3
N -> 4
N -> 5
N -> 6
N -> 7
N -> 8
N -> 9

The terminals, i.e., uttered words in the language, are the operators (+, -, *, /) and the digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9).

The nonterminals are the names of the grammar constructs, specifically E for expression, num for number, and op for operator.

The pipe symbol here means alternative, i.e., num -> 0 | 1 is equivalent to two rules, num -> 0 and num -> 1.

The starting symbol, by convetion, is the nonterminal on the left-hand side of the first rule.

Deriving a tree

  • Find a sequence of rule applications that
    • Begins with the start symbol
    • Ends with the string

Show derivation of the tree for 9 + 3 - 2

Defining an interpreter

  • Walk the tree
  • Evaluate each node of the tree
  • Define computation for each grammar production

Expression interpreter rules

  • Assume we have the following interpreter functions implemented
    • add, sub, mult, div to perform addition, subtraction, multiplication, and division computations
    • ascii2int to convert the ASCII character value to an integer represetation

Specifying the interpreter

Define computation for each grammar construct

E -> E + E  { E.value = add(E1.value, E2.value) }
E -> E - E  { E.value = sub(E1.value, E2.value) }
E -> E * E  { E.value = mult(E1.value, E2.value) }
E -> E / E  { E.value = div(E1.value, E2.value) }
E -> N      { E.value = ascii2int(N.value) }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Defining a compiler

  • Similar tree-walking approach
  • Generate code instead of compute result

Compile infix to postfix

Assume we have string concat function to join any number of strings.

E -> E + E  { E.value = concat(E1.value, E2.value, "+") }
E -> E - E  { E.value = concat(E1.value, E2.value, "-") }
E -> E * E  { E.value = concat(E1.value, E2.value, "*") }
E -> E / E  { E.value = concat(E1.value, E2.value, "/") }
E -> N      { E.value = N.value }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Note that we no longer need to convert ascii to a machine integer, since we can just print the output program in ascii as well.

Author: Paul Gazzillo

Created: 2025-03-07 Fri 10:21

Validate