UP | HOME

Parsing
COP-3402

Recall: how your C program gets run

source_to_process.svg

Compiler Internals

  • Front-end: process input language
  • Back-end: generate output language
  • front-end
    • input language processing
    • produces a syntax tree (will go into this later intuitively and formally with a grammar)
  • back-end
    • output language generation
    • produces assembly/machine code

(Diagram)

  • overall: input language -> compiler -> output
  • front/back ends:
    • input -> front-end -> syntax tree -> back-end -> output

The "middle"-end

Many compilers use an intermediate language

Diagram

  • input -> front -> syntax tree -> middle-end
    • IR generator -> IR -> optimizer -> IR
  • middle-end -> IR -> back-end -> output
  • IR often like a machine code, but without machine-like specifics (e.g., register-based, but infinite registers)

Why an intermediate representation?

  • Easier front-end development
  • Easier back-end development
  • Portability
  • Machine-independent optimization
  • Modularization

Multiple front-ends and back-ends

llvm_ir.webp

Classic phases of a compiler

compiler.jpg

Dragon book

Symbols vs. meaning

1 + 2 * 3

6 / 2(1 + 2)

一加二乘三

What gives symbols meaning?

  • symbols have no instrinsic meaning
  • the interpreter/compiler/human defines meaning
  • compiler defines input language in terms of output language
  • why does output language have meaning?

ASCII vs. machine code

Recall interpreter last time

  • Input language numbering?
  • Machine code numbering?

Syntax trees

Human language

"The boy went to the store"

Diagram this sentence

Syntax: groups of words

  • One sentence type:
    • subject phrase, verb phrase, object phrase

Syntactic structure captures relationships between words

"The boy leaving behind his friend went to the store."

  • who/what went? what word goes with "went"
  • the boy, because nested structures allow us to go off on another clause and return to the parent clause

What does "store" mean?

"The store is closed"

"Squirrels store acorns for the winter".

Meaning depends on how it is used with other words:

  • Store as a noun: a shop that sells products
  • Store as a verb: to place for later use

Meaning depends on structure

Arithmetic expressions

9 + 3 - 2

  • Same idea as spoken language
  • Diagram this "sentence"
  • Operations are nested inside of each other
  • Do one operation at a time (even in your head)

What is the assembly code version of "9 + 3 - 2"?

  • Operations are one at a time
  • Order of operations depend on the syntax tree
  • Inuitively, you can walk the tree and compute the meaning (or generate equivalent assembly!)

Evaluating the tree

  • Post order traversal
  • Leaf nodes: value of the number symbol
  • Inner nodes: value is result of operation on child values

Ambiguity

1 + 2 * 3

  • What is the meaning, e.g., resulting value, of this expression?
  • What are the two typical ways this can be interpreted?
  • Draw the trees and perform the interpretation on each
  • Perform the compilation on one of the trees

Parsing

Problem: we don't say "noun phrase starts here", "subject is here"

How do we figure out syntactic structure?

  • Syntactic structure (word groupings) defines meaning
  • Spoken language and source code have no intrinsic structure

How is source code represented?

  • Just a sequence of ascii codes
  • No instrinsic structure
  • No instrinsic meaning (ascii numbers aren't machine integers)

Parsing infers structure

Parsing infers structure using word order and groupings

Define what we need by syntax

Formal grammars

Terminals
Nonterminals
Productions
Starting symbol
  • formal grammar
    • two kinds of symbols
      • terminals (words)
      • nonterminals (structure names)
    • productions (rules defining contents/children of structures)
    • starting symbol (the top of the syntax tree)

Terminals

  • Words or tokens
  • What you see or hear, e.g., C source code

Nonterminals

  • Names of syntactic structures, e.g., verb phrase, expression, function definition
  • Not uttered explicitly

Productions

  • Rules mapping nonterminals to their contents
  • Defines legal groupings of symbols in the language
  • Written using an arrow nonterminal -> symbol1 symbol2 ...

Strictly speaking, context-free grammars are restricted to having a single nonterminal on the left-hand side of the arrow, but context-sensitive grammars can have more symbols on the left-hand side.

Starting symbol

  • The first nonterminal in the syntax tree

Constructs can be nested

  • For example, E -> E + num
    • Expressions may contains other expressions
    • E.g., 9 + 3 - 2 nests 9 + 3 inside of the subtraction operation

Example: arithmetic expressions

E -> E + E
E -> E - E
E -> E * E
E -> E / E
E -> N
N -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

The terminals, i.e., uttered words in the language, are the operators and the digits.

The nonterminals are the names of the grammar constructs, specifically E for expression, num for number, and op for operator.

There are four productions.

The pipe symbol here means alternative, i.e., num -> 0 | 1 is equivalent to two rules, num -> 0 and num -> 1.

The starting symbol, by convetion, is the nonterminal on the left-hand side of the first rule.

Deriving a tree for string

  • Find a sequence of rule applications that
    • Begins with the start symbol
    • Ends with the string

Show derivation of the tree for 9 + 3 - 2

Defining an interpreter

  • Walk the tree
  • Evaluate each node of the tree
  • Define computation for each grammar production

Expression interpreter rules

  • Assume we have the following interpreter functions implemented
    • add, sub, mult, div to perform addition, subtraction, multiplication, and division computations
    • ascii2int to convert the ASCII character value to an integer represetation

Specifying the interpreter

Define computation for each grammar construct

E -> E + E  { E.value = add(E1.value, E2.value) }
E -> E - E  { E.value = sub(E1.value, E2.value) }
E -> E * E  { E.value = mult(E1.value, E2.value) }
E -> E / E  { E.value = div(E1.value, E2.value) }
E -> N      { E.value = ascii2int(N.value) }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Defining a compiler

  • Similar tree-walking approach
  • Generate code instead of compute result

Compile infix to postfix

Assume we have string concat function to join any number of strings.

E -> E + E  { E.value = concat(E1.value, E2.value, "+") }
E -> E - E  { E.value = concat(E1.value, E2.value, "-") }
E -> E * E  { E.value = concat(E1.value, E2.value, "*") }
E -> E / E  { E.value = concat(E1.value, E2.value, "/") }
E -> N      { E.value = N.value }
N -> 0      { N.value = "0" }
N -> 1      { N.value = "1" }
N -> 2      { N.value = "2" }
...
N -> 9      { N.value = "9" }

Note that we no longer need to convert ascii to a machine integer, since we can just print the output program in ascii as well.

Calc project

Author: Paul Gazzillo

Created: 2024-10-23 Wed 00:07

Validate