Grammars and Parsing
Lecture 7
Table of Contents
Limitations of regular expressions
- Regexes match patterns in strings
- Can match infinite set of strings
- Don't support certain patterns
Regexes match an infinite set of strings with a finite expression
Regexes can't parse
Matching curly braces
- Curly braces make nested scopes (in C-like languages)
- Is there a regex to ensure matched braces?
{ { { } } }
We can write a regex that accepts every program with matched curly braces (one that accepts everything would do), but there is no regex that matches only the strings whose braces are balanced at arbitrary nesting depth.
Finite state automata "can't count"
- Has a finite number of states
- But nesting is arbitrary
- Need a new state for each level of nesting depth
- (Diagram)
Show how you need to keep adding states for each level of nesting you want to match: recognizing arbitrarily deep nesting would require an unbounded number of states, and a finite automaton only has finitely many.
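As a concrete contrast (my own sketch, not part of the slides): once we allow ourselves a single unbounded counter, checking matched braces is easy. The counter is exactly the "counting" that a fixed, finite set of states cannot do.

```python
def braces_balanced(text: str) -> bool:
    """Check that '{' and '}' nest properly, using one unbounded counter.

    A finite automaton would need a separate state for every possible
    nesting depth; the integer below plays that role all by itself.
    """
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:        # a '}' appeared with no matching '{'
                return False
    return depth == 0            # every '{' was eventually closed

print(braces_balanced("{ { { } } }"))   # True
print(braces_balanced("{ { }"))         # False
```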
Hierarchical syntax
- Natural language
- The person walks home
- The person I went to school with walks home
- Syntax: the valid arrangements of words (tokens)
The nesting structure we see in programming languages is just like that of natural language, although we use such structure so automatically that we may not even be aware of it unless we hear a particularly complicated or ambiguous sentence.
In these examples, a listener has no confusion about who is doing the walking, even though in the second sentence "walks" comes right after "school with".
John saw the man on the mountain with a telescope.
- John, using a telescope, saw a man on a mountain.
- John saw a man on a mountain which had a telescope on it.
- John saw a man on a mountain who had a telescope.
- John, on a mountain and using a telescope, saw a man.
- John, on a mountain, saw a man who had a telescope.
"The Purple People Eater" by Sheb Wooley
- A purple creature that eats people.
- A creature that eats purple people. (This interpretation is confirmed in the lyrics, although whether the creature itself is also purple is never made clear.)
British Left Waffles on Falkland Islands.
- The British party of the left rambles indecisively about Falkland Island policy.
- The British forces left behind waffles (the breakfast item) on the Falkland Islands.
Nesting, stacks, recursion
- The person { I went to school with } walks home
- Maintain state of sentence before entering the "I went to school with" clause.
- Tree walking: record state on call stack while processing children
We can think of this computationally as holding onto some state.
Where have we seen this kind of state saving in computer science?
Function calls, recursion, stacks.
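As a sketch (names and details are mine, not from the lecture): a recursive brace matcher saves exactly this kind of state on the call stack, one frame per level of nesting.

```python
def match_block(text: str, i: int) -> int:
    """Consume one '{ ... }' block starting at text[i] == '{'.

    Each nested block is handled by a recursive call, so the position we
    must come back to is saved on the call stack, one frame per level of
    nesting.  Returns the index just past the matching '}'.
    """
    assert text[i] == "{"
    i += 1
    while i < len(text):
        if text[i] == "{":
            i = match_block(text, i)   # state for this level waits on the stack
        elif text[i] == "}":
            return i + 1               # this level is closed
        else:
            i += 1                     # ordinary character
    raise ValueError("unclosed '{'")

print(match_block("{ { { } } }", 0))   # 11: consumed the whole 11-character string
```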
Parsers infer the hierarchical syntax from a list of words
- Nested structure is implicit in list of words
- Parser infers structure by knowing syntax rules
The nested structure is implicit in each utterance of the language.
The parser can infer this structure even though the input does not explicitly express it, because the parser has the syntactic rules.
Technically a recognizer checks whether a string matches the syntax (like a finite state machine checks whether a string matches a regular expression), while a parser is a recognizer that also produces a syntax tree.
Grammars describe syntax
- Grammars describe all possible sentences (strings) in a language
- with a finite set of rules
- Grammars make implicit structure explicit
- Language constructs have their own symbols
Even if the language contains an unbounded number of strings, the grammar describes them all with a finite set of rules.
Examples of grammar
- sentence → subject verb object
- subject → nounphrase
- nounphrase → "the" noun
- noun → "person"
- noun → "store"
- verb → "walks to"
- object → nounphrase
the store walks to the person
"the", "person", "store", "walks to" are all that we see explicitly in the language
sentence, subject, nounphrase, noun, verb, object are the language constructs that are unspoken, but implied.
Special symbols represent language constructs
- Unspoken, but implied by the structure of a language
- Project 1 makes these symbols explicit in a tree representation
- the person { some other clause } walks to the store
https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously
Alternatives
- noun had multiple rules
- Language constructs can have many variations
- e.g., if statements with and without an else branch
- Just like regular expressions
- bison even uses | for syntax alternatives
Context-free grammars
- Terminal symbols are the words, the spoken parts, e.g., "person", "the"
- Nonterminal symbols are the unspoken representations of structures, e.g., sentence, nounphrase
- Productions are rules
- They map a nonterminal to a sequence of other symbols (terminal or nonterminal)
- E.g., nounphrase → "the" noun
- Starting symbol is the top of the hierarchy
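As a purely illustrative encoding (my own variable names, not a standard representation), the example grammar from the earlier slide fits in a small table: the keys are the nonterminals, the values list each nonterminal's alternative productions, and any symbol that never appears as a key is a terminal.

```python
GRAMMAR = {
    "sentence":   [["subject", "verb", "object"]],
    "subject":    [["nounphrase"]],
    "object":     [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun":       [["person"], ["store"]],   # two alternative productions
    "verb":       [["walks to"]],
}
START_SYMBOL = "sentence"

NONTERMINALS = set(GRAMMAR)
TERMINALS = {sym for alts in GRAMMAR.values()
             for rhs in alts for sym in rhs} - NONTERMINALS
print(TERMINALS)   # 'the', 'person', 'store', 'walks to' (in some order)
```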
Derivations: generating a string from the grammar
- Start from the starting nonterminal
- Pick a production for the nonterminal and substitute the symbol with the right-hand-side symbols
- Repeatedly replace any new nonterminals according to production rules until only terminals remain
(Diagram)
the store walks to the person
While parsers infer structure from a string, a generator produces a string from the grammar.
Notice the recursive nature of this process?
This notion of a derivation is, as far as I know, where the terms nonterminal and terminal come from: nonterminals continue the derivation, while terminals stop the process.
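A derivation can be sketched as a short recursive generator over that kind of grammar table (this is my own sketch; the toy grammar is repeated so the block runs on its own, and alternatives are picked at random rather than deliberately):

```python
import random

# The toy grammar again, repeated so this sketch runs on its own.
GRAMMAR = {
    "sentence":   [["subject", "verb", "object"]],
    "subject":    [["nounphrase"]],
    "object":     [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun":       [["person"], ["store"]],
    "verb":       [["walks to"]],
}

def derive(symbol: str) -> str:
    """Expand a symbol until only terminals remain (a derivation)."""
    if symbol not in GRAMMAR:                    # terminal: derivation stops here
        return symbol
    production = random.choice(GRAMMAR[symbol])  # pick one alternative
    return " ".join(derive(s) for s in production)  # recursively expand the RHS

print(derive("sentence"))   # e.g. "the store walks to the person"
```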
Parsing: finding a derivation for the given string
- Recall: the string has no explicit syntax information
- Parser knows grammar rules
- Parser discovers derivation that produces the given string
- Proof that string is in language
- Recovery of explicit syntax with nonterminal symbols
If derivation is generating a string from the grammar, parsing is finding a derivation for some string. If there is a derivation, the string is in the language, otherwise it's not.
Correspondence between language and computation
Parsing Algorithms
Recursive descent
- Toy compiler's parser
- One function per nonterminal
- Performs derivation, guessing which alternative production to take
- Call starting nonterminal's function
- Try each alternative production for a nonterminal
- Backtrack if we chose the wrong one
- Stop after reading entire input (or no more alternatives, i.e., parse error)
- Predictive parsing avoids backtracking
- Works only on some grammars
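As a sketch of the shape of such a parser (not the toy compiler's actual code, and using the toy English grammar with "walks" and "to" treated as separate tokens): one function per nonterminal, each consuming tokens and returning the new position. This grammar is simple enough that the next token always predicts the right alternative, so no backtracking is needed here.

```python
class ParseError(Exception):
    pass

def expect(tokens, pos, word):
    """Consume one expected terminal or fail."""
    if pos < len(tokens) and tokens[pos] == word:
        return pos + 1
    raise ParseError(f"expected {word!r} at position {pos}")

# One function per nonterminal; each returns the position after what it parsed.
def parse_sentence(tokens, pos=0):            # sentence -> subject verb object
    pos = parse_subject(tokens, pos)
    pos = parse_verb(tokens, pos)
    return parse_object(tokens, pos)

def parse_subject(tokens, pos):               # subject -> nounphrase
    return parse_nounphrase(tokens, pos)

def parse_object(tokens, pos):                # object -> nounphrase
    return parse_nounphrase(tokens, pos)

def parse_nounphrase(tokens, pos):            # nounphrase -> "the" noun
    pos = expect(tokens, pos, "the")
    return parse_noun(tokens, pos)

def parse_noun(tokens, pos):                  # noun -> "person" | "store"
    # Two alternatives: the lookahead token predicts which one applies.
    if pos < len(tokens) and tokens[pos] in ("person", "store"):
        return pos + 1
    raise ParseError(f"expected a noun at position {pos}")

def parse_verb(tokens, pos):                  # verb -> "walks" "to"
    pos = expect(tokens, pos, "walks")
    return expect(tokens, pos, "to")

tokens = "the store walks to the person".split()
print(parse_sentence(tokens) == len(tokens))  # True: whole input consumed
```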
Table-driven
- Pushdown automaton: finite state automaton (FSA) plus a stack
- Recall: nested structure of language
- Production rules matched with an FSA
- Nested productions' states held in stack during parsing
This is what bison uses.
Semantic actions are similar to a post-order traversal of the parse tree.
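To make the post-order comparison concrete (a toy illustration of mine, not bison code): if the action at each node fires only after all of its children have been handled, the actions run bottom-up, in the same order a shift-reduce parser performs its reductions.

```python
# Hand-built parse tree for "the store walks to the person".
# Each node is (nonterminal, children); leaves are plain token strings.
tree = ("sentence", [
    ("subject", [("nounphrase", ["the", ("noun", ["store"])])]),
    ("verb", ["walks to"]),
    ("object", [("nounphrase", ["the", ("noun", ["person"])])]),
])

def post_order(node):
    """Process all children first, then run the parent's 'semantic action'."""
    if isinstance(node, str):      # terminal leaf: nothing to do
        return
    name, children = node
    for child in children:
        post_order(child)
    print("action for:", name)     # fires only after the children are done

post_order(tree)
# Prints: noun, nounphrase, subject, verb, noun, nounphrase, object, sentence
```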
Many variations
- LL(k), LL(*), SLR, LALR
- Trade-offs in efficiency vs classes of grammars supported
This course won't go into the many varieties. The Dragon Book, chapter 4, has a good discussion of some of these.
The LR parsing algorithm
- Classic parsing algorithm
- Efficient and supports many grammars used for programming languages
- Can be tricky to write grammars for
The algorithm
(From Charlie Hughes)
Homework overview
- derivation of ccd in class
- parsing of ccd in class
- start of parsing aacdbb in class
LR Parsing Algorithm Actions
- shift next_state
- Push the current lookahead token onto the stack
- Push the next state on top of the stack (current state, terminal symbol)
- reduce production_number
- Find the production by its number
- Pop off as many state/value pairs from the stack as there are symbols in the production's right-hand side
- Push the nonterminal symbol of the production
- Use the goto table to find the next state (current state, nonterminal symbol)
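Those actions translate almost directly into a driver loop. The sketch below is my own illustration: the ACTION/GOTO tables are made up for a trivial one-production grammar (production 1: S -> "a") just to show the control flow; they are not the tables from the homework grammar.

```python
PRODUCTIONS = {1: ("S", 1)}        # production 1: S -> "a"  (LHS, RHS length)

# Hypothetical tables for that tiny grammar, keyed by (state, symbol).
ACTION = {
    (0, "a"): ("shift", 2),
    (2, "$"): ("reduce", 1),
    (1, "$"): ("accept",),
}
GOTO = {(0, "S"): 1}

def lr_parse(tokens):
    tokens = tokens + ["$"]                    # end-of-input marker
    stack = [0]                                # start state, then symbol/state pairs
    pos = 0
    while True:
        state, lookahead = stack[-1], tokens[pos]
        action = ACTION.get((state, lookahead))
        if action is None:
            raise SyntaxError(f"parse error at token {pos}: {lookahead!r}")
        if action[0] == "shift":
            stack += [lookahead, action[1]]    # push token, then the next state
            pos += 1
        elif action[0] == "reduce":
            lhs, rhs_len = PRODUCTIONS[action[1]]
            del stack[len(stack) - 2 * rhs_len:]     # pop one pair per RHS symbol
            stack += [lhs, GOTO[(stack[-1], lhs)]]   # push nonterminal + goto state
        else:                                  # accept
            return True

print(lr_parse(["a"]))   # True
```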