COP 3402 meeting -*- Outline -*-

* Context-free Parsing

  Based on material from Chapter 2 of the book
  "Modern Compiler Implementation in Java" by Andrew W. Appel
  with Jens Palsberg (Cambridge, 1998)

** Why context-free grammars and parsing?

   Context-free grammars are used to describe the syntax of
   nearly all programming languages.
   They seem to make sense to people.
   They have been used to describe natural languages too,
   although they aren't really powerful enough for that.

   (Some parentheses to balance the imbalance below: ((((

------------------------------------------
       WHY CONTEXT-FREE PARSING?

Can we define a regular expression
to make sure expressions have
balanced parentheses?

 e.g., recognize:  (34)  and  ((12)+(789))

       but not:    (567))+82))))

------------------------------------------
   ... No, it's not possible.
       Need to count or use a stack, but a finite automaton can't do that
       (since there can be an arbitrary number of left parentheses)

------------------------------------------
            WHAT IS NEEDED

Needed for checking balanced parentheses:

------------------------------------------
   ... recursion in the grammar, like
          <Exp> ::= <Number> | ( <Exp> ) | <Exp> + <Exp>
       and an unbounded stack for parsing.

   Q: How does this grammar
          <Exp> ::= <Number> | ( <Exp> ) | <Exp> + <Exp>
      ensure that parentheses are balanced?
       The only rule that introduces parentheses, <Exp> ::= ( <Exp> ),
       always adds a left and a right parenthesis together.

** Parsing with Context-free Grammars

------------------------------------------
                PARSING

Want to recognize language with

Goal:

------------------------------------------
   ... a context-free grammar and terminals that are tokens from the lexer
       For this a stack is needed, since the grammars may be recursive...
   ...
       Produce an *abstract syntax tree* that represents the parse,
       i.e., that reflects the program's structure
       (and can be used to generate code later)

*** context-free grammar

------------------------------------------
         CONTEXT-FREE GRAMMARS

def: a *context-free* grammar (CFG)
     (N, T, P, S)
 has start symbol S in N and
 each production (in P) has the form:

       <N> -> g

 where <N> is a Nonterminal symbol,
 and g in (N+T)*

Example:


def: For a CFG (N,T,P,S),
  g in (N+T)* *produces* g' in (N+T)*,
  written g =P=> g', iff
     g is e <N> f,
     g' is e h f,
     <N> is a nonterminal (in N),
     h in (N+T)*,
     and the rule <N> -> h is in P

Example:
     ( <Number> ) =P=>

------------------------------------------
   ... <Exp> -> <Number> | ( <Exp> )
       <Number> -> 1 | 2 | 3

   So g =P=> g' if the <N> in g can be replaced
   by the right hand side of a rule <N> -> h in P

   ... ( 1 ), where P is the rules from the grammar above

*** derivation

------------------------------------------
              DERIVATION

def: a *derivation* of a terminal string t
     from the rules P of G=(N,T,P,S),
     is a sequence (S, g1, g2, ..., gm),
     where gm = t, and
     for all i: gi in (N+T)*,
     S =P=> g1, and
     for all 1 <= j < m: gj =P=> g(j+1)

Example:

------------------------------------------
   A derivation is a trace of the production game
     - It shows how a terminal string is in the language of the grammar

   ... A derivation (S, g1, g2, ...) is usually written
           S -> g1 -> g2 ...
       So an example is:
           <Exp> -> ( <Exp> ) -> ( <Number> ) -> ( 1 )
       (more examples below)

------------------------------------------
          LEFTMOST DERIVATION

def: a *leftmost derivation* of a string t in T*
     from a CFG (N,T,P,S)
     is a derivation of t from (N,T,P,S)
       (g0, g1, ..., gm)
     such that g0 = S, gm = t, and
     for all 0 <= j < m:
       when gj =P=> g(j+1)
       and the nonterminal <N> is replaced in gj,
       then there are no nonterminals
       to the left of <N> in gj.

Example:

------------------------------------------
   Notes:
     - a leftmost derivation corresponds to a parse
       that processes input from left to right.
     - each parse tree has exactly one leftmost derivation, so for an
       unambiguous grammar the leftmost derivation of a string
       is unique (when it exists).
   ...
       for grammar:
           <Exp> -> <Number> | <Exp> - <Exp>
           <Number> -> 1 | 2 | 3
       To derive 3 - 2:
           <Exp> -> <Exp> - <Exp>
                 -> <Number> - <Exp>
                 -> 3 - <Exp>
                 -> 3 - <Number>
                 -> 3 - 2

   Q: What would a rightmost derivation be?
       ... like the above, but always replacing the rightmost
           nonterminal instead of the leftmost one.

*** parse trees

------------------------------------------
      PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG (N,T,P,S)
     represents a derivation, D, iff:

   - Each interior node in Tr is labeled by a nonterminal (in N)
     (the leaves are labeled by terminals)
   - the root of Tr is the start symbol S
   - an arc from <N> to h in (N+T) iff
        <N> -> ... h ... in P
   - the order of children of a node labeled <N>
     is the order in a production <N> -> ... in P
------------------------------------------

------------------------------------------
               EXAMPLES

GRAMMAR:

  <Exp> ::= <Number> | <Exp> + <Exp> | <Exp> * <Exp>

Derivations of 3*4+2

------------------------------------------
   ... (leftmost)
       <Exp> -> <Exp> + <Exp>
             -> <Exp> * <Exp> + <Exp>
             -> <Number> * <Exp> + <Exp>
             -> 3 * <Exp> + <Exp>
             -> 3 * <Number> + <Exp>
             -> 3 * 4 + <Exp>
             -> 3 * 4 + <Number>
             -> 3 * 4 + 2

   ... (rightmost)
       <Exp> -> <Exp> * <Exp>
             -> <Exp> * <Exp> + <Exp>
             -> <Exp> * <Exp> + <Number>
             -> <Exp> * <Exp> + 2
             -> <Exp> * <Number> + 2
             -> <Exp> * 4 + 2
             -> <Number> * 4 + 2
             -> 3 * 4 + 2

------------------------------------------
         EXAMPLE PARSE TREES

Corresponding to leftmost derivation:



Corresponding to rightmost derivation:



------------------------------------------
   ... (leftmost)
                    <Exp>
                  /   |   \
             <Exp>    +    <Exp>
            /  |  \          |
       <Exp>   *   <Exp>  <Number>
         |           |       |
      <Number>   <Number>    2
         |           |
         3           4

   ... (rightmost)
                    <Exp>
                  /   |   \
             <Exp>    *    <Exp>
               |          /  |  \
           <Number>  <Exp>   +   <Exp>
               |       |           |
               3   <Number>    <Number>
                       |           |
                       4           2

   Q: What does each parse tree mean?
       The first (leftmost) is (3*4) + 2 = 12+2 = 14
       The second (rightmost) is 3*(4+2) = 3*6 = 18

   This is an example of ...

*** ambiguity

------------------------------------------
              AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees for t
------------------------------------------

   Q: What's an example of an ambiguous grammar?
       The one in our example above!
   Q: Why do we want to avoid ambiguous grammars?
       So the meaning of each program is uniquely determined!
   Q: Is ambiguity a property of a language or its grammar?
       The grammar; there may be non-ambiguous grammars for the same language

**** Fixing ambiguous grammars

------------------------------------------
       FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to

Example:

------------------------------------------
   ... eliminate undesired parse trees

   ... Grammar for expressions
         <Exp> ::= <Exp> + <Term> | <Term>
         <Term> ::= <Term> * <Factor> | <Factor>
         <Factor> ::= <Number>

   This generates the same language as the ambiguous grammar above,
   but the only parse of 3*4 + 2 corresponds to the tree:

                     <Exp>
                   /   |   \
              <Exp>    +    <Term>
                |              |
             <Term>         <Factor>
            /   |   \          |
      <Term>    *  <Factor> <Number>
         |            |        |
     <Factor>     <Number>     2
         |            |
     <Number>         4
         |
         3

   Why is the tree

                    <Exp>
                  /   |   \
             <Exp>    *    <Exp>
               |          /  |  \
           <Number>  <Exp>   +   <Exp>
               |       |           |
               3   <Number>    <Number>
                       |           |
                       4           2

   not a parse tree for the grammar?
       Because there is no rule for <Exp> with the * (and no parentheses)
       on the right hand side; we have to use <Term> for that,
       and <Term> has no rule with + (and no parentheses)
       on the right hand side

   How to do that?
     There is a standard idea in parsing called "operator precedence":
       if op2 binds tighter than op1
       then write the grammar so that op2 can only occur in
            nonterminals produced by the nonterminal that produces op1
       Introduce new nonterminals to force op2 to have that relationship to op1

** Parsing techniques

*** recursive-descent parsing (LL parsing)

    This is regularly used in practice, because it can have
    excellent error messages
    (but you must customize the error messages for that to happen)

    There are automated tools that generate such parsers (ANTLR is one)

------------------------------------------
   RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:

  <N> ::= g1 | g2 | ... | gm

1. Write a

2. This function

------------------------------------------
   ... (recursive) function parseN
       (which needs no arguments, as it gets tokens from the lexer)

   ... decides between the alternatives g1, ..., gm
       by looking at the next (first) terminal (token)

   So the first terminal must provide enough information
   to decide which alternative to parse!
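
   A minimal sketch in C of the algorithm above, applied to an
   unambiguous expression grammar with one function per nonterminal.
   Assumptions that are not from the notes: the nonterminal names
   <Exp>, <Term>, <Factor> and the function names built from them,
   numbers that are single digits, tokens read from a string instead
   of a lexer, and functions that return the value computed.
   The left-recursive rules are written as loops, since a function that
   began by calling itself would never terminate.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lexer: the tokens are the characters
   of a string (single digits, '+', '*'), with blanks skipped. */
static const char *input;   /* rest of the input */
static bool ok;             /* false once a syntax error is seen */

/* Returns the next token without consuming it ('\0' at end of input). */
static char peek(void) {
    while (*input == ' ') { input++; }
    return *input;
}

/* Consumes the next token (does nothing at end of input). */
static void advance(void) { if (*input != '\0') { input++; } }

/* <Factor> ::= <Number>   (a single digit in this sketch) */
static int parseFactor(void) {
    char c = peek();
    if (isdigit((unsigned char)c)) {
        advance();
        return c - '0';
    }
    ok = false;             /* expected a number */
    return 0;
}

/* <Term> ::= <Term> * <Factor> | <Factor>
   written as: a <Factor> followed by zero or more "* <Factor>" */
static int parseTerm(void) {
    int v = parseFactor();
    while (ok && peek() == '*') {
        advance();
        v = v * parseFactor();
    }
    return v;
}

/* <Exp> ::= <Exp> + <Term> | <Term>, written as a loop in the same way */
static int parseExp(void) {
    int v = parseTerm();
    while (ok && peek() == '+') {
        advance();
        v = v + parseTerm();
    }
    return v;
}

/* Parses all of s; returns true and stores the value on success. */
bool evalExp(const char *s, int *result) {
    input = s;
    ok = true;
    *result = parseExp();
    return ok && peek() == '\0';
}
```

   For example, evalExp("3*4+2", &v) succeeds with v == 14, the value of
   the tree for (3*4) + 2, because * is only parsed inside parseTerm.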
**** example

------------------------------------------
    EXAMPLE RECURSIVE-DESCENT PARSER

<Stmt> ::= if <Cond> then <Stmt> else <Stmt>
        |  begin <Stmt> <List> end
        |  write <Number>
<List> ::= ; <Stmt> <List>
        |  <Empty>
<Empty> ::=
<Cond> ::= <Number> = <Number>

#include "pl0.tab.h"

/* normally declared by the lexer's header */
extern int yylex(void);

/* the current token; call advance() once
   before parsing to load the first token */
yytoken_kind_t tok;

void parseStmt();
void parseList();

void advance() { tok = yylex(); }

void eat(yytoken_kind_t tk) {
    if (tok == tk) {
        advance();
    } else {
        /* ... report error */
    }
}

void parseCond() {
    eat(numbersym); eat(eqsym);
    eat(numbersym);
}

void parseStmt() {
    switch (tok) {
    case ifsym:
        eat(ifsym); parseCond();
        eat(thensym); parseStmt();
        eat(elsesym); parseStmt();
        break;
    case beginsym:
        eat(beginsym); parseStmt();
        parseList();
        eat(endsym);  /* assumes endsym is the token for "end" */
        break;
    case writesym:
        eat(writesym); eat(numbersym);
        break;
    default:
        /* ... report error */
        break;
    }
}
------------------------------------------

   Q: What kind of errors can eat() report?
       That a certain kind of token was expected
       but a different token was seen instead
       (at a given source code location, from the token)
   Q: What kind of error message can the parser produce
      (in the default case)?
       Unexpected token; e.g., it was expecting a statement
       and we saw some token that cannot start a statement
   Q: How do errors get reported by parseCond()?
       When eat() reports them...

**** terminology: LL(1)

------------------------------------------
            LL(1) GRAMMARS

A recursive-descent parser must:

   - choose between alternatives
     (e.g., <A> ::= <B> | <C>)

def: A grammar is *LL(1)* iff

------------------------------------------
   ... based on the next token (in the input)
   ...
       it can be parsed left-to-right in one pass using
       one token of lookahead to decide between alternatives

   Thus an LL(1) grammar is necessary for a recursive-descent parser
   that looks only at the next token and never backtracks

   LL(1) stands for: left-to-right parse, leftmost derivation,
                     1-symbol lookahead

   A weakness is that LL(k) parsers must predict the production to use
   (based on k tokens)

**** terminology: LR(1)

   LR(1) stands for: left-to-right parse, rightmost derivation,
                     1-token lookahead

   LR(k) can postpone the decision of what production to use until
   it has seen the entire right-hand side of a production
   (and k tokens beyond that)
   So it's more powerful than an LL(k) parser

------------------------------------------
            LR(1) GRAMMARS

An LR(1) parser needs to decide when to:

   shift (push token on stack) or

   reduce

uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff

------------------------------------------
   ... based on the next token (in the input)

   ... it can be parsed left-to-right in one pass using
       the parse stack and one token of lookahead
       to decide between alternatives

   Most programming languages with non-ambiguous grammars
   can be parsed using LR(1).

***** LALR(1) parsers

------------------------------------------
          LALR(1) Parsing

Smaller tables than LR(1)
  - merges states of the DFA
    if they differ only in lookahead
------------------------------------------

   LALR(1) stands for lookahead LR(1)
   This is what Bison (and yacc) use.

**** ambiguity problems

------------------------------------------
          PROBLEM: AMBIGUITY

Consider:

<Stmt> ::= <Ident> := <Exp>
        |  if <Cond> then <Stmt> <ElsePart>
<ElsePart> ::= <Empty>
            |  else <Stmt>

and the statement:

   if b1 then if b2 then x := 2 else x := 3

Is this parsed as:

                  <Stmt>
        /     |     |      |        \
      if   <Cond>  then  <Stmt>   <ElsePart>
              |            |           |
             b1            |        <Empty>
         /       |     |      |        \
       if     <Cond>  then  <Stmt>   <ElsePart>
                 |            |        /    \
                b2         x := 2   else   <Stmt>
                                              |
                                           x := 3

or as:

                  <Stmt>
        /     |     |      |         \
      if   <Cond>  then  <Stmt>    <ElsePart>
              |            |         /    \
             b1            |      else   <Stmt>
                           |                |
                           |             x := 3
         /       |     |      |        \
       if     <Cond>  then  <Stmt>   <ElsePart>
                 |            |          |
                b2         x := 2     <Empty>

------------------------------------------

   That is, when is the assignment x := 3 executed?
------------------------------------------
           FIXES FOR AMBIGUITY

Change the language:
  a. Always have an else clause:
       <Stmt> ::= if <Cond> then <Stmt> else <Stmt>
     (use skip if you don't want to do anything)
  b. Use an end marker
       <Stmt> ::= if <Cond> then <Stmt> else <Stmt> fi
               |  if <Cond> then <Stmt> fi

Give precedence to one production:

   <Stmt> ::= if <Cond> then <Stmt> <ElsePart>
   <ElsePart> ::= else <Stmt>   // priority!
               |  <Empty>

So we only get the parse tree:

                  <Stmt>
        /     |     |      |        \
      if   <Cond>  then  <Stmt>   <ElsePart>
              |            |           |
             b1            |        <Empty>
         /       |     |      |        \
       if     <Cond>  then  <Stmt>   <ElsePart>
                 |            |        /    \
                b2         x := 2   else   <Stmt>
                                              |
                                           x := 3

------------------------------------------

   The idea of giving priority to one production
   is found in parsing expression grammars (PEGs)

   The idea of priority also works for expression parsing

   The better idea seems to be to change the language,
   as we want programmers to also be sure what is going on

   Note, no LR(1) grammar can be ambiguous!
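
   The priority fix can be sketched as a small recursive-descent
   parser in C: the function for the optional else part tries the
   "else" alternative first, so each else goes with the nearest if
   that lacks one. Assumptions that are not from the notes: the
   nonterminal name <ElsePart>, the function names, the token stream
   given as an array of words, and conditions and assignments that are
   skipped rather than parsed. The sketch only handles well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token stream: an array of words ended by NULL. */
static const char **toks;
static int pos;          /* index of the next token */
static int ifCount;      /* number of ifs seen so far */
static int elseOwner;    /* if (1 = first) that got the last else; 0 = none */

/* Is the next token the word w? */
static int at(const char *w) {
    return toks[pos] != NULL && strcmp(toks[pos], w) == 0;
}

/* Advances over up to n tokens, never past the end marker. */
static void skip(int n) {
    while (n-- > 0 && toks[pos] != NULL) { pos++; }
}

static void parseStmt(void);

/* <ElsePart> ::= else <Stmt>   // priority!
               |  <Empty> */
static void parseElsePart(int owner) {
    if (at("else")) {        /* the else alternative is tried first */
        skip(1);
        elseOwner = owner;
        parseStmt();
    }
    /* otherwise <Empty>: consume nothing */
}

/* <Stmt> ::= if <Cond> then <Stmt> <ElsePart>  |  <Ident> := <Exp> */
static void parseStmt(void) {
    if (at("if")) {
        int me = ++ifCount;
        skip(3);             /* if, a one-word condition, then */
        parseStmt();
        parseElsePart(me);
    } else {
        skip(3);             /* a three-word assignment */
    }
}

/* Returns the number of the if that the last else was attached to
   (1 = first if in the input), or 0 if there was no else. */
int elseOwnerOf(const char **tokens) {
    toks = tokens;
    pos = 0;
    ifCount = 0;
    elseOwner = 0;
    parseStmt();
    return elseOwner;
}
```

   For the tokens of  if b1 then if b2 then x := 2 else x := 3
   elseOwnerOf returns 2: the else belongs to the inner if,
   so x := 3 runs only when b1 is true and b2 is false.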