Parsing
Compiler Foundations
COP-3402
Recall: how your C program gets run
Compiler Internals
- Front-end: process input language
- Back-end: generate output language
- front-end
- input language processing
- produces a syntax tree (will go into this later intuitively and formally with a grammar)
- back-end
- output language generation
- produces assembly/machine code
(Diagram)
- overall: input language -> compiler -> output
- front/back ends:
- input -> front-end -> syntax tree -> back-end -> output
The "middle"-end
Many compilers use an intermediate language
Diagram
- input -> front -> syntax tree -> middle-end
- IR generator -> IR -> optimizer -> IR
- middle-end -> IR -> back-end -> output
- IR often like a machine code, but without machine-like specifics (e.g., register-based, but infinite registers)
Why an intermediate representation?
- Easier front-end development
- Easier back-end development
- Portability
- Machine-independent optimization
- Modularization
Multiple front-ends and back-ends
Classic phases of a compiler
Dragon book
Symbols vs. meaning
1 + 2 * 3
6 / 2(1 + 2)
一加二乘三
What gives symbols meaning?
- Moon vs. finger pointing at the moon
- "The map is not the territory"
- symbols have no instrinsic meaning
- the interpreter/compiler/human defines meaning
- compiler defines input language in terms of output language
- why does output language have meaning?
ASCII vs. machine code
od
gcc -S
- ASCII table
Recall interpreter last time
- Input language numbering?
- Machine code numbering?
Syntax trees
Human language
"The boy went to the store"
Diagram this sentence
Syntax: groups of words
- One sentence type:
- subject phrase, verb phrase, object phrase
Syntactic structure captures relationships between words
"The boy leaving behind his friend went to the store."
- who/what went? what word goes with "went"
- the boy, because nested structures allow us to go off on another clause and return to the parent clause
What does "store" mean?
"The store is closed"
"Squirrels store acorns for the winter".
Meaning depends on how it is used with other words:
- Store as a noun: a shop that sells products
- Store as a verb: to place for later use
Meaning depends on structure
Arithmetic expressions
9 + 3 - 2
- Same idea as spoken language
- Diagram this "sentence"
- Operations are nested inside of each other
- Do one operation at a time (even in your head)
What is the assembly code version of "9 + 3 - 2"?
- Operations are one at a time
- Order of operations depend on the syntax tree
- Inuitively, you can walk the tree and compute the meaning (or generate equivalent assembly!)
Evaluating the tree
- Post order traversal
- Leaf nodes: value of the number symbol
- Inner nodes: value is result of operation on child values
Ambiguity
1 + 2 * 3
- What is the meaning, e.g., resulting value, of this expression?
- What are the two typical ways this can be interpreted?
- Draw the trees and perform the interpretation on each
- Perform the compilation on one of the trees
Parsing
Problem: we don't say "noun phrase starts here", "subject is here"
How do we figure out syntactic structure?
- Syntactic structure (word groupings) defines meaning
- Spoken language and source code have no intrinsic structure
How is source code represented?
- Just a sequence of ascii codes
- No instrinsic structure
- No instrinsic meaning (ascii numbers aren't machine integers)
Parsing infers structure
Parsing infers structure using word order and groupings
Define what we need by syntax
Formal grammars
Terminals |
Nonterminals |
Productions |
Starting symbol |
- formal grammar
- two kinds of symbols
- terminals (words)
- nonterminals (structure names)
- productions (rules defining contents/children of structures)
- starting symbol (the top of the syntax tree)
- two kinds of symbols
Terminals
- Words or tokens
- What you see or hear, e.g., C source code
Nonterminals
- Names of syntactic structures, e.g., verb phrase, expression, function definition
- Not uttered explicitly
Productions
- Rules mapping nonterminals to their contents
- Defines legal groupings of symbols in the language
- Written using an arrow
nonterminal -> symbol1 symbol2 ...
Strictly speaking, context-free grammars are restricted to having a single nonterminal on the left-hand side of the arrow, but context-sensitive grammars can have more symbols on the left-hand side.
Starting symbol
- The first nonterminal in the syntax tree
Constructs can be nested
- For example,
E -> E + num
- Expressions may contains other expressions
- E.g., 9 + 3 - 2 nests 9 + 3 inside of the subtraction operation
Example: arithmetic expressions
E -> E + E E -> E - E E -> E * E E -> E / E E -> N N -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The terminals, i.e., uttered words in the language, are the operators and the digits.
The nonterminals are the names of the grammar constructs, specifically E for expression, num for number, and op for operator.
There are four productions.
The pipe symbol here means alternative, i.e., num -> 0 | 1
is equivalent to two rules, num -> 0
and num -> 1
.
The starting symbol, by convetion, is the nonterminal on the left-hand side of the first rule.
Deriving a tree for string
- Find a sequence of rule applications that
- Begins with the start symbol
- Ends with the string
Show derivation of the tree for 9 + 3 - 2
Defining an interpreter
- Walk the tree
- Evaluate each node of the tree
- Define computation for each grammar production
Expression interpreter rules
- Assume we have the following interpreter functions implemented
add
,sub
,mult
,div
to perform addition, subtraction, multiplication, and division computationsascii2int
to convert the ASCII character value to an integer represetation
Specifying the interpreter
Define computation for each grammar construct
E -> E + E { E.value = add(E1.value, E2.value) } E -> E - E { E.value = sub(E1.value, E2.value) } E -> E * E { E.value = mult(E1.value, E2.value) } E -> E / E { E.value = div(E1.value, E2.value) } E -> N { E.value = ascii2int(N.value) } N -> 0 { N.value = "0" } N -> 1 { N.value = "1" } N -> 2 { N.value = "2" } ... N -> 9 { N.value = "9" }
Defining a compiler
- Similar tree-walking approach
- Generate code instead of compute result
Compile infix to postfix
Assume we have string concat
function to join any number of strings.
E -> E + E { E.value = concat(E1.value, E2.value, "+") } E -> E - E { E.value = concat(E1.value, E2.value, "-") } E -> E * E { E.value = concat(E1.value, E2.value, "*") } E -> E / E { E.value = concat(E1.value, E2.value, "/") } E -> N { E.value = N.value } N -> 0 { N.value = "0" } N -> 1 { N.value = "1" } N -> 2 { N.value = "2" } ... N -> 9 { N.value = "9" }
Note that we no longer need to convert ascii to a machine integer, since we can just print the output program in ascii as well.