COP 3402 meeting      -*- Outline -*-

* Context-free Parsing

Based on material from Chapter 2 of the book
"Modern Compiler Implementation in Java"
by Andrew W. Appel with Jens Palsberg (Cambridge, 1998)

** Why context-free grammars and parsing?

Context-free grammars are used to describe all programming languages.
They seem to make sense to people.
They have been used to describe natural languages too,
although they aren't really powerful enough for that.

(Some parentheses to balance the imbalance below: ((((

------------------------------------------
WHY CONTEXT-FREE PARSING?

Can we define a regular expression to make sure
expressions have balanced parentheses?

e.g., recognize: (34) and ((12)+(789))
     but not:    (567))+82))))
------------------------------------------
... No, it's not possible

    Need to count or use a stack,
    but a finite automaton can't do that
    (since there can be an arbitrary number of parentheses)

------------------------------------------
WHAT IS NEEDED

Needed for checking balanced parentheses:

------------------------------------------
... recursion in the grammar, like

       <expr> ::= <number> | ( <expr> ) | <expr> + <expr>

    and an unbounded stack for parsing.

    But note that regular grammars don't allow such productions,
    and finite state machines can't simulate an unbounded stack

Q: How does this grammar

       <expr> ::= <number> | ( <expr> ) | <expr> + <expr>

   ensure that parentheses are balanced?

** Parsing with Context-free Grammars

------------------------------------------
PARSING

Want to recognize language with

Goal:

------------------------------------------
... a context-free grammar
    and terminals that are tokens from the lexer

    For this a stack is needed,
    since the grammars may be recursive...

... Produce an *abstract syntax tree* that represents the parse,
    i.e., that reflects the program's structure
    (and can be used to generate code later)

*** context-free grammar

------------------------------------------
CONTEXT-FREE GRAMMARS

def: a *context-free* grammar (CFG) (N, T, P, S) has
     a set (of nonterminals) N, a set (of terminals) T,
     a start symbol S in N,
     and each production (in P) has the form:

         <A> -> g

     where <A> is a Nonterminal symbol, and g in (N+T)*

Example:

def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g', iff
       g is e <A> f, e and f in (N+T)*,
       g' is e h f,
       <A> is a nonterminal (in N), h in (N+T)*,
       and the rule <A> -> h is in P

Example:  ( <number> ) =P=>

------------------------------------------
...    <expr> -> <number> | ( <expr> )
       <number> -> 1 | 2 | 3

    So g =P=> g' if the <A> in g can be replaced
    by the right hand side of a rule <A> -> h in P

... ( 1 ), where P is the rules from the grammar above

*** derivation

------------------------------------------
DERIVATION

def: a *derivation* of gm in T* from the rules P of G=(N,T,P,S),
     is a sequence (S, g1, g2, ..., gm), where
       for all i: gi in (N+T)*,
       S =P=> g1, and
       for all 1 <= j < m: gj =P=> g(j+1)

Example:

------------------------------------------
    A derivation is a trace of the production game
    - It shows how a terminal string is in the language of the grammar

... A derivation (S, g1, g2, ...) is usually written

        S -> g1 -> g2 ...

    So an example is:

        <expr> -> ( <expr> ) -> ( <number> ) -> ( 1 )

    (more examples below)
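An aside (not from the book or the slides): the production game can be
simulated by a small program.  The sketch below assumes the two-rule
grammar above; produce_step is a made-up helper that performs one
g =P=> g' step by replacing the first (leftmost) occurrence of a
nonterminal with the right-hand side of a rule.

    #include <stdio.h>
    #include <string.h>

    /* One step of the production game: given a sentential form g that
       contains the nonterminal lhs, build g' = e h f, where g = e lhs f
       and lhs -> h is the rule being used.  Replacing the leftmost
       occurrence means repeated calls trace out a leftmost derivation. */
    static void produce_step(const char *g, const char *lhs, const char *rhs,
                             char *out, size_t outsize)
    {
        const char *at = strstr(g, lhs);      /* leftmost lhs in g */
        if (at == NULL) {                     /* no such nonterminal: copy g */
            snprintf(out, outsize, "%s", g);
            return;
        }
        /* out = e ++ h ++ f */
        snprintf(out, outsize, "%.*s%s%s",
                 (int)(at - g), g, rhs, at + strlen(lhs));
    }

    int main(void)
    {
        /* derive ( 1 ) from <expr> using
             <expr> -> <number> | ( <expr> )
             <number> -> 1 | 2 | 3            */
        char g1[64], g2[64], g3[64];
        produce_step("<expr>", "<expr>", "( <expr> )", g1, sizeof g1);
        produce_step(g1, "<expr>", "<number>", g2, sizeof g2);
        produce_step(g2, "<number>", "1", g3, sizeof g3);
        printf("<expr> -> %s -> %s -> %s\n", g1, g2, g3);
        /* prints: <expr> -> ( <expr> ) -> ( <number> ) -> ( 1 ) */
        return 0;
    }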
------------------------------------------
LEFTMOST DERIVATION

def: a *leftmost derivation* of gm in T* from a CFG (N,T,P,S)
     is a derivation (S, g1, ..., gm) of gm from (N,T,P,S)
     such that for all 0 <= j < m:
       when gj =P=> g(j+1) and the nonterminal <A> is replaced in gj,
       then there are no nonterminals to the left of <A> in gj.

Example:

------------------------------------------
    Notes:
    - a leftmost derivation corresponds to a parse
      that processes input from left to right.
    - leftmost derivations are unique (when they exist).

... for grammar:

        <expr> -> <number> | <expr> - <expr>
        <number> -> 1 | 2 | 3

    To derive 3 - 2:

        <expr> -> <expr> - <expr>
               -> <number> - <expr>
               -> 3 - <expr>
               -> 3 - <number>
               -> 3 - 2

Q: What would a rightmost derivation be?
   ... like the above, but with right replacing left.

*** parse trees

------------------------------------------
PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG (N,T,P,S)
     represents a derivation, D, iff:
     - Each node in Tr is labeled by a nonterminal (in N)
     - the root of Tr is the start symbol S
     - there is an arc from <A> to h in (N+T)
       iff <A> -> ... h ... is in P
     - the order of children of a node labeled <A>
       is the order in a production <A> -> ... in P
------------------------------------------

------------------------------------------
EXAMPLES

GRAMMAR:

    <expr> ::= <number> | <expr> + <expr> | <expr> * <expr>

Derivations of 3*4+2

------------------------------------------
... (leftmost)

    <expr> -> <expr> + <expr>
           -> <expr> * <expr> + <expr>
           -> <number> * <expr> + <expr>
           -> 3 * <expr> + <expr>
           -> 3 * <number> + <expr>
           -> 3 * 4 + <expr>
           -> 3 * 4 + <number>
           -> 3 * 4 + 2

... (rightmost)

    <expr> -> <expr> * <expr>
           -> <expr> * <expr> + <expr>
           -> <expr> * <expr> + <number>
           -> <expr> * <expr> + 2
           -> <expr> * <number> + 2
           -> <expr> * 4 + 2
           -> <number> * 4 + 2
           -> 3 * 4 + 2

------------------------------------------
EXAMPLE PARSE TREES

Corresponding to leftmost derivation:

Corresponding to rightmost derivation:

------------------------------------------
... (leftmost)

                  <expr>
               /    |    \
          <expr>    +    <expr>
         /   |   \          |
    <expr>   *   <expr>  <number>
       |           |        |
   <number>    <number>     2
       |           |
       3           4

... (rightmost)

                  <expr>
               /    |    \
          <expr>    *    <expr>
             |          /   |   \
         <number>  <expr>   +   <expr>
             |        |            |
             3    <number>     <number>
                      |            |
                      4            2

Q: What does each parse tree mean?
   The first (leftmost) is (3*4) + 2 = 12+2 = 14
   The second (rightmost) is 3*(4+2) = 3*6 = 18

   This is an example of ...

*** ambiguity

------------------------------------------
AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees for t

------------------------------------------

Q: What's an example of an ambiguous grammar?
   The one in our example above!

Q: Why do we want to avoid ambiguous grammars?
   So the meaning of each program is uniquely determined!

Q: Is ambiguity a property of a language or its grammar?
   The grammar; there may be non-ambiguous grammars for the same language

**** Fixing ambiguous grammars

------------------------------------------
FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to

Example:

------------------------------------------
... eliminate undesired parse trees

... Grammar for expressions

        <expr> ::= <expr> + <term> | <term>
        <term> ::= <term> * <factor> | <factor>
        <factor> ::= <number>

    This generates the same language as the ambiguous grammar above,
    but the only parse of 3*4 + 2 corresponds to the tree:

                  <expr>
               /    |    \
          <expr>    +    <term>
             |              |
          <term>        <factor>
        /    |    \         |
   <term>    *   <factor>   2
      |             |
  <factor>          4
      |
      3

Q: Why is the tree

                  <expr>
               /    |    \
          <expr>    *    <expr>
             |          /   |   \
         <number>  <expr>   +   <expr>
             |        |            |
             3    <number>     <number>
                      |            |
                      4            2

   not a parse tree for this grammar?

   Because there is no rule for <expr> with the * (and no parentheses)
   on the right hand side; we have to use <term> for that,
   and <term> has no rule with + (and no parentheses)
   on the right hand side

How to create such a grammar?

   There is a standard idea in parsing called "operator precedence":
   if op2 binds tighter than op1,
   then write the grammar so that op2 can only occur in
   nonterminals produced by the nonterminal that produces op1.
   Introduce new nonterminals to force op2 to have that relationship to op1.

   E.g., we added <term> and <factor> to the grammar,
   and made the multiplication operator (*) only be produced by <term>,
   which is produced from <expr>, which produces the addition operator (+).
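To make the point about meaning concrete, here is a small sketch (not
from the book or the slides; the Expr type and eval function are made up
for illustration).  It builds the two parse trees for 3*4+2 as abstract
syntax trees and evaluates them, giving the two different answers above.

    #include <stdio.h>

    /* A tiny AST for expressions: either a number leaf or
       an application of + or * to two subtrees. */
    typedef struct Expr {
        char op;                     /* '+', '*', or 'n' for a number leaf */
        int value;                   /* used when op == 'n' */
        struct Expr *left, *right;   /* used when op is '+' or '*' */
    } Expr;

    static int eval(const Expr *e)
    {
        switch (e->op) {
        case 'n': return e->value;
        case '+': return eval(e->left) + eval(e->right);
        case '*': return eval(e->left) * eval(e->right);
        default:  return 0;   /* unreachable for well-formed trees */
        }
    }

    int main(void)
    {
        Expr three = {'n', 3, NULL, NULL};
        Expr four  = {'n', 4, NULL, NULL};
        Expr two   = {'n', 2, NULL, NULL};

        /* tree from the leftmost derivation: (3 * 4) + 2 */
        Expr mul = {'*', 0, &three, &four};
        Expr t1  = {'+', 0, &mul, &two};

        /* tree from the rightmost derivation: 3 * (4 + 2) */
        Expr add = {'+', 0, &four, &two};
        Expr t2  = {'*', 0, &three, &add};

        printf("(3*4)+2 = %d, 3*(4+2) = %d\n", eval(&t1), eval(&t2));
        /* prints: (3*4)+2 = 14, 3*(4+2) = 18 */
        return 0;
    }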
** Parsing techniques

*** recursive-descent parsing (LL parsing)

This is regularly used in practice,
because it can have excellent error messages
(but you must customize the error messages for that to happen)

There are automated tools that generate such parsers (ANTLR is one)

------------------------------------------
RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:

    <N> ::= g1 | g2 | ... | gm

1. Write a

2. This function

------------------------------------------
... (recursive) function parseN
    (which needs no arguments, as it gets tokens from the lexer)

... decides between the alternatives g1, ..., gm
    by looking at the next (first) terminal (token)

    So, the first terminal in each production must provide
    enough information to decide which alternative to parse!

**** example

------------------------------------------
EXAMPLE RECURSIVE-DESCENT RECOGNIZER

<stmt> ::= if <cond> then <stmts> else <stmts> end
         | begin <stmts> end
         | print <number>
<stmts> ::= <empty> | <stmt> <semi-stmts>
<semi-stmts> ::= ; <stmt> <semi-stmts>
<semi-stmts> ::= <empty>
<cond> ::= <number> == <number>

#include "spl.tab.h"

// the current token
yytoken_kind_t tok;

void parseStmt();  // forward declaration, since parseStmts calls it

void parser_initialize() { tok = yylex(); }

void advance() { tok = yylex(); }

void eat(yytoken_kind_t expected)
{
    if (tok == expected) {
        advance();
    } else {
        /* ... report error */
    }
}

void parseCond()
{
    eat(numbersym); eat(eqeqsym); eat(numbersym);
}

void parseStmts()
{
    // <stmts> ::= <empty> | <stmt> <semi-stmts>
    if (tok == ifsym || tok == beginsym || tok == printsym) {
        parseStmt();
        // <semi-stmts> ::= ; <stmt> <semi-stmts> | <empty>
        while (tok == semisym) {
            eat(semisym);
            parseStmt();
        }
    }
}

void parseStmt()
{
    switch (tok) {
    case ifsym:
        eat(ifsym); parseCond(); eat(thensym);
        parseStmts(); eat(elsesym); parseStmts(); eat(endsym);
        break;
    case beginsym:
        eat(beginsym); parseStmts(); eat(endsym);
        break;
    case printsym:
        eat(printsym); eat(numbersym);
        break;
    default:
        // report error: tok cannot start a statement
        break;
    }
}
------------------------------------------

Q: What kind of errors can eat() report?
   That a certain kind of token was expected, but a different one was
   seen instead... (at a given source code location, from the token)

Q: What kind of error message can the parser produce (in the default case)?
   Unexpected token; e.g., it was expecting a statement
   and we saw some token that cannot start a statement

Q: How do errors get reported by parseCond()?
   When eat reports them...

**** terminology: LL(1)

------------------------------------------
LL(1) GRAMMARS

A recursive-descent parser must:

- choose between alternatives
  (e.g., <stmts> ::= <empty> | <stmt> <semi-stmts>)

def: A grammar is *LL(1)* iff

------------------------------------------
... choose between alternatives based on the next input token

... it can be parsed left-to-right in one pass
    using one token of lookahead to decide between alternatives

Thus an LL(1) grammar is necessary when using a recursive-descent parser

LL(1) stands for: left-to-right parse, leftmost derivation,
1-symbol lookahead

A weakness is that LL(k) parsers must predict the production to use
(based only on k tokens)
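For example (an aside, not on the slides): in the unambiguous expression
grammar above, <expr> ::= <expr> + <term> | <term>, both alternatives
start with the same tokens, so one token of lookahead cannot pick the
production, and a recursive-descent parser that literally tried the
left-recursive alternative first would recurse forever.  A common
workaround in a hand-written parser is to parse the repetition with a
loop, as in this sketch (the token names plussym and timessym are
assumptions, in the style of the recognizer above):

    #include "spl.tab.h"   // assumed to declare yytoken_kind_t and the *sym tokens

    extern yytoken_kind_t tok;               // current token, as in the recognizer
    extern void eat(yytoken_kind_t expected);

    // <factor> ::= <number>
    void parseFactor()
    {
        eat(numbersym);
    }

    // <term> ::= <term> * <factor> | <factor>
    // parsed as: one <factor>, then zero or more "* <factor>"
    void parseTerm()
    {
        parseFactor();
        while (tok == timessym) {   // timessym: assumed token name for *
            eat(timessym);
            parseFactor();
        }
    }

    // <expr> ::= <expr> + <term> | <term>
    // parsed as: one <term>, then zero or more "+ <term>"
    void parseExpr()
    {
        parseTerm();
        while (tok == plussym) {    // plussym: assumed token name for +
            eat(plussym);
            parseTerm();
        }
    }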
**** terminology: LR(1)

LR(1) stands for: left-to-right parse, rightmost derivation,
1-token lookahead

LR(k) can postpone the decision of what production to use until:
it has seen the entire right-hand side of a production
(and k tokens beyond that)

So it's more powerful than an LL(k) parser,
as it can use the stack to make decisions (based on history)

------------------------------------------
LR(1) GRAMMARS

An LR(1) parser needs to decide when to:

   shift (push token on stack)

   or reduce

   uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff

------------------------------------------
... based on the next token (in the input)

... it can be parsed left-to-right in one pass
    using the parse stack and one token of lookahead
    to decide between alternatives

Note that the parse stack is more than an LL(1) parser can use.

Most non-ambiguous programming languages can be parsed using LR(1).

***** LALR(1) parsers

------------------------------------------
LALR(1) Parsing

Smaller tables than LR(1)
- merges states of the DFA if they only differ in lookahead

------------------------------------------
LALR(1) stands for lookahead LR(1)

The DFA decides what to do (shift or reduce)
based on the parse stack and the lookahead.

This is what Bison (and yacc) use.

**** ambiguity problems

------------------------------------------
PROBLEM: AMBIGUITY

Consider:

    <stmt> ::= <var> := <expr>
             | if <expr> then <stmt> <else-part>
    <else-part> ::= <empty> | else <stmt>

and the input statement:

    if b1 then if b2 then x := 2 else x := 3

Is this parsed as:

                 <stmt>
     /      /      |       \               \
   if    <expr>   then    <stmt>        <else-part>
           |            /  / | \  \          |
           b1          /  /  |  \  \      <empty>
                      /  /   |   \  \
                    if <expr> then <stmt> <else-part>
                         |           |      /    \
                         b2       x := 2  else  <stmt>
                                                   |
                                                x := 3

or as:

                 <stmt>
     /      /      |       \                    \
   if    <expr>   then    <stmt>             <else-part>
           |            /  / | \  \            /      \
           b1          /  /  |  \  \         else    <stmt>
                      /  /   |   \  \                   |
                    if <expr> then <stmt> <else-part>  x := 3
                         |           |        |
                         b2       x := 2   <empty>

------------------------------------------
That is, when is the assignment x := 3 executed?

------------------------------------------
FIXES FOR AMBIGUITY

Change the language:

a. Always have an else clause:

      <stmt> ::= if <expr> then <stmt> else <stmt>

   (use skip if don't want to do anything)

b. Use an end marker

      <stmt> ::= if <expr> then <stmt> else <stmt> fi
               | if <expr> then <stmt> fi

Give precedence to one production:

      <stmt> ::= if <expr> then <stmt> <else-part>
      <else-part> ::= else <stmt>   // priority!
                    | <empty>

So we only get the parse tree:

                 <stmt>
     /      /      |       \               \
   if    <expr>   then    <stmt>        <else-part>
           |            /  / | \  \          |
           b1          /  /  |  \  \      <empty>
                      /  /   |   \  \
                    if <expr> then <stmt> <else-part>
                         |           |      /    \
                         b2       x := 2  else  <stmt>
                                                   |
                                                x := 3
------------------------------------------

The second solution is adopted in SPL (our homework)

The idea of giving priority to one production
is found in parsing expression grammars (PEGs)
(a recursive-descent view of this priority is sketched
at the end of these notes)

The idea of priority also works for expression parsing

The better idea seems to be to change the language,
as we want programmers to also be sure what is going on

Note, no LR(1) grammar can be ambiguous!
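An aside (not on the slides): in a recursive-descent parser, the
priority fix above corresponds to a greedy choice in the function for
<else-part>: when the lookahead token is else, always take the
"else <stmt>" alternative, so each else binds to the nearest enclosing
if.  A minimal sketch, assuming tok, eat(), and elsesym as in the
recognizer earlier, and a parseStmt() written for this statement
grammar:

    #include "spl.tab.h"                     // as in the recognizer above

    extern yytoken_kind_t tok;               // current token
    extern void eat(yytoken_kind_t expected);
    extern void parseStmt();                 // assumed parser for <stmt>

    // <else-part> ::= else <stmt>   // priority!
    //               | <empty>
    void parseElsePart()
    {
        if (tok == elsesym) {   // greedy: prefer "else <stmt>" over <empty>
            eat(elsesym);
            parseStmt();
        }
        // otherwise <else-part> ::= <empty>: consume no tokens
    }

In an LR tool such as Bison, the same dangling-else grammar produces a
shift/reduce conflict at else, and the default resolution (shift) gives
this same nearest-if binding.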