Grammars
Lecture 6
Table of Contents
Overview
- Let's be more precise about syntax, e.g.,
- order of operations
- specifying numbers, identifiers, etc.
Formal grammars specify syntax
Review
Regular expressions
String pattern matching
- phone numbers
- email addresses
- programming language tokens
How would you write this program?
- Search for an email address in a text file?
Example
Sequence
- Area code is three digits
Alternation
- A digit is 0, 1, 2, ..., or 9
Parentheses are just used to make order of operations explicit, just like in arithmetic
Repetition
- Any number of characters before the @ sign
Optional elements
- E.g., country code
Wild cards
- Allow any character (or some subset of characters)
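The features above can be sketched with Python's re module. The exact phone and email formats here are illustrative assumptions, not a spec:

```python
import re

# Hypothetical US-style phone number:
#   optional:    (...)? for the country code
#   sequence:    digits in a fixed order
#   alternation: the character class [0-9] means 0 or 1 or ... or 9
#   repetition:  {3} means "exactly three digits"
phone = re.compile(r"(\+1[ -])?\(?[0-9]{3}\)?[ -][0-9]{3}-[0-9]{4}")

# A simplified email pattern: non-space, non-@ characters, an @ sign,
# then a dotted domain. (Real email syntax is far messier than this.)
email = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]+")

print(bool(phone.fullmatch("+1 (555) 123-4567")))              # True
print(bool(phone.fullmatch("555 123-4567")))                   # True
print(bool(email.search("write to alice@example.com today")))  # True
```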
More examples
Regular expression language
- concatenation, e.g.,
ab
- alternation, e.g.,
a|b
- Kleene closure, e.g.,
a*
The order of operations, from highest to lowest, is Kleene closure, concatenation, and alternation.
One way to remember the order of operations is that alternation is like addition (or logical or), concatenation is like multiplication (or logical and), and Kleene closure is like exponentiation.
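A quick precedence check using Python's re module (whose syntax extends these three core operators):

```python
import re

# ab|cd means (ab)|(cd), not a(b|c)d: concatenation binds tighter
# than alternation.
assert re.fullmatch(r"ab|cd", "ab")
assert re.fullmatch(r"ab|cd", "cd")
assert re.fullmatch(r"ab|cd", "abd") is None

# ab* means a(b*), not (ab)*: Kleene closure binds tighter
# than concatenation.
assert re.fullmatch(r"ab*", "abbb")
assert re.fullmatch(r"ab*", "abab") is None

# Parentheses override the defaults, as in arithmetic.
assert re.fullmatch(r"(ab)*", "abab")
```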
Regular expressions in practice
- Lots of syntactic sugar available
- python library (perl syntax)
Finite state automata
- AKA
- Finite state machines
- Finite automaton
- State machine
Pattern matching is equivalent to many automation tasks
- String pattern matching problem
- Capture each possible string prefix in a state
First abstract machine model
- Formal language: potentially infinite set of strings
- Each string drawn from a finite alphabet
- Each string element itself is finite
Here's our first abstract machine model
You'll see more in discrete 2
Other state machine applications
- Traffic lights
- Turnstiles
- Vending machines
Any machine with some predefined set of states and events that transition between states
Implementing finite state automata
- Graph
- If and while
- Table-based (diagram)
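A minimal sketch of the table-based style: states are rows, character classes are columns. This hypothetical machine (not from the lecture's diagram) accepts C-like identifiers, i.e., a letter or underscore followed by letters, digits, or underscores:

```python
def char_class(c):
    """Map an input character to a column of the transition table."""
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

# transition table: state -> {character class -> next state}
TABLE = {
    "start":  {"letter": "ident",  "digit": "reject", "other": "reject"},
    "ident":  {"letter": "ident",  "digit": "ident",  "other": "reject"},
    "reject": {"letter": "reject", "digit": "reject", "other": "reject"},
}
ACCEPTING = {"ident"}

def accepts(s):
    state = "start"
    for c in s:                      # one table lookup per character
        state = TABLE[state][char_class(c)]
    return state in ACCEPTING

print(accepts("x2"), accepts("_tmp"), accepts("2x"), accepts(""))
# True True False False
```

Generated scanners like flex's use the same structure: the loop is fixed, and only the table changes per language.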
Automatically generating automata from regular expressions
This is how the flex tool works under the hood.
Limitations of regular expressions
- Regexes match patterns in strings
- Can match infinite set of strings
- Don't support certain patterns
Regexes match an infinite set of strings with a finite expression
Matching curly braces
- Curly braces make nested scopes (in C-like languages)
- Is there a regex to ensure matched braces?
{ { { } } }
We can write a regex that accepts some programs with matched curly braces, but there is no regex that accepts exactly the strings whose braces are balanced at arbitrary nesting depth.
Finite state automata "can't count"
- Has a finite number of states
- But nesting is arbitrary
- Need a new state for each level of nesting depth
- (Diagram)
Show how you need to keep adding states for each level of nesting you want to match. Need an unbounded number of states for an unbounded language.
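A counter, the simplest possible stack, checks balanced braces in a few lines. This unbounded counter is exactly the memory a finite state machine lacks:

```python
def braces_matched(s):
    """Check that every '{' has a matching '}' at any nesting depth."""
    depth = 0
    for c in s:
        if c == "{":
            depth += 1
        elif c == "}":
            depth -= 1
            if depth < 0:        # a '}' with no matching '{'
                return False
    return depth == 0            # every '{' was closed

print(braces_matched("{ { { } } }"))  # True
print(braces_matched("{ { }"))        # False
print(braces_matched("} {"))          # False
```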
Hierarchical syntax
- Natural language
- The person walks home
- The person I went to school with walks home
- Syntax: the valid arrangements of words (tokens)
The nesting structure we see in programming languages is just like that of natural language. We use such structure so automatically that we may not even be aware of it unless we hear a particularly complicated or ambiguous sentence.
In these examples, a listener has no confusion about whether "with" or "school" is doing the walking.
Nesting, stacks, recursion
- The person { I went to school with } walks home
- Maintain state of sentence before entering the "I went to school with" clause.
- Tree walking: record state on call stack while processing children
We can think of this computationally as holding onto some state.
Where have we seen this kind of state saving in computer science?
Function calls, recursion, stacks.
Parsers infer the hierarchical syntax from a list of words
- Nested structure is implicit in list of words
- Parser infers structure by knowing syntax rules
The nested structure is implicit in each utterance of the language.
The parser can infer this structure even though the input does not explicitly express it, because the parser has the syntactic rules.
Technically a recognizer checks whether a string matches the syntax (like a finite state machine checks whether a string matches a regular expression), while a parser is a recognizer that also produces a syntax tree.
Grammars describe syntax
- Grammars describe all possible sentences (strings) in a language
- with a finite set of rules
- Grammars make implicit structure explicit
- Language constructs have their own symbols
Even if the language has an unbounded number of strings, the grammar describes them all with a finite set of rules.
Examples of grammar
- sentence → subject verb object
- subject → nounphrase
- nounphrase → "the" noun
- noun → "person"
- noun → "store"
- verb → "walks to"
- object → nounphrase
"the", "person", "store", "walks to" are all that we see explicitly in the language
sentence, subject, nounphrase, noun, verb, object are the language constructs that are unspoken, but implied.
Special symbols represent language constructs
- Unspoken, but implied by the structure of a language
- Project 1 makes these symbols explicit in a tree representation
- the person { some other clause } walks to the store
https://en.wikipedia.org/wiki/Colorless_green_ideas_sleep_furiously
Alternatives
- noun had multiple rules
- Language constructs can have many variations
- e.g., if statements with and without an else branch
- Just like regular expressions
- bison even uses | for syntax alternatives
Context-free grammars
- Terminal symbols are the words, the spoken parts, e.g., "person", "the"
- Nonterminal symbols are the unspoken representations of structures, e.g., sentence, nounphrase
- Productions are rules
- They map a nonterminal to a sequence of other symbols (terminal or nonterminal)
- E.g., nounphrase → "the" noun
- Starting symbol is the top of the hierarchy
Derivations: generating a string from the grammar
- Start from the starting nonterminal
- Pick a production for the nonterminal and substitute the symbol with the right-hand-side symbols
- Repeatedly replace any new nonterminals according to production rules until only terminals remain
(Diagram)
While parsers infer structure from a string, a generator produces a string from the grammar.
Notice the recursive nature of this process?
This notion of a derivation is, as far as I know, where the terms nonterminal and terminal come from: nonterminals continue the derivation, while terminals stop the process.
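The steps above can be sketched as a generator for the toy grammar from the earlier slide, encoded as a dict from nonterminal to its alternative right-hand sides:

```python
import random

# Nonterminals are dict keys; anything not in the dict is a terminal.
GRAMMAR = {
    "sentence":   [["subject", "verb", "object"]],
    "subject":    [["nounphrase"]],
    "nounphrase": [["the", "noun"]],
    "noun":       [["person"], ["store"]],
    "verb":       [["walks to"]],
    "object":     [["nounphrase"]],
}

def derive(symbol):
    """Expand nonterminals until only terminals remain (a derivation)."""
    if symbol not in GRAMMAR:             # terminal: derivation stops here
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])  # pick a production for the symbol
    words = []
    for sym in rhs:                       # recursively replace each symbol
        words.extend(derive(sym))
    return words

print(" ".join(derive("sentence")))
# e.g. "the person walks to the store"
```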
Parsing: finding a derivation for the given string
- Recall: the string has no explicit syntax information
- Parser knows grammar rules
- Parser discovers derivation that produces the given string
- Proof that string is in language
- Recovery of explicit syntax with nonterminal symbols
If derivation is generating a string from the grammar, parsing is finding a derivation for some string. If there is a derivation, the string is in the language, otherwise it's not.
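A minimal recursive-descent recognizer sketch for the same toy grammar: one function per nonterminal, each consuming word tokens from the input (the terminal "walks to" is split into two word tokens here). It answers yes/no; a full parser would also build the tree:

```python
def parse_nounphrase(tokens, pos):
    """Return the position after a nounphrase, or None on failure."""
    if tokens[pos:pos + 1] != ["the"]:
        return None
    if tokens[pos + 1:pos + 2] and tokens[pos + 1] in ("person", "store"):
        return pos + 2
    return None

def parse_sentence(tokens):
    pos = parse_nounphrase(tokens, 0)              # subject
    if pos is None or tokens[pos:pos + 2] != ["walks", "to"]:
        return False
    pos = parse_nounphrase(tokens, pos + 2)        # object
    return pos is not None and pos == len(tokens)  # consume all input

print(parse_sentence("the person walks to the store".split()))  # True
print(parse_sentence("the store walks to".split()))             # False
```

Succeeding means we found a derivation, which proves the string is in the language.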
Correspondence between language and computation
ANTLR setup
ANTLR is a parser generator
- Takes formal grammar, produces parser
- Parser takes any valid input and creates parse tree
Hand-writing parsers can be tricky
ANTLR grammar format
- Mix of context-free grammar with regular expressions
- Regex permitted in productions
ANTLR run-time
- Support for inspecting and visualizing parses
- Lots of infrastructure for language processing
We'll see more when we start type-checking and generating code from our ANTLR parse tree
Visitor pattern is common for abstracting away language tree traversals
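The visitor idea can be sketched by hand on a tiny expression tree (the class names here are made up for illustration, not ANTLR's generated classes): each node kind gets its own visit_* method, so the traversal logic lives in the visitor rather than in the tree:

```python
class Num:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class Evaluator:
    def visit(self, node):
        # dispatch to visit_Num or visit_Add based on the node's class
        return getattr(self, "visit_" + type(node).__name__)(node)

    def visit_Num(self, node):
        return node.value

    def visit_Add(self, node):
        return self.visit(node.left) + self.visit(node.right)

tree = Add(Num(1), Add(Num(2), Num(3)))   # 1 + (2 + 3)
print(Evaluator().visit(tree))            # 6
```

Swapping in a different visitor class (say, a pretty-printer) reuses the same tree untouched, which is why the pattern suits type-checking and code generation passes.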
LabeledExpr example
Homework review if needed
Compiler project
Hand-write your ANTLR parser using the screenshot of the SimpleC grammar and push SimpleC.g4 to your compiler repo. Be sure to test it on example programs that you write.