WHY CONTEXT-FREE PARSING?

Can we define a regular expression to make sure
expressions have balanced parentheses?

  e.g., recognize:  (34) and ((12)+(789))
        but not:    (567))+82))))

WHAT IS NEEDED

Checking balanced parentheses needs either:
  - a stack for matching left & right parentheses, or
  - the ability to count

A regular expression can do neither, so the answer is no.

PARSING

Want to recognize languages with recursion in their grammars, like:

  <Exp> ::= <Number> | ( <Exp> ) | <Exp> + <Exp>

Goal: produce an abstract syntax tree that represents the parse
      and reflects the program's structure

CONTEXT-FREE GRAMMARS

def: a *context-free grammar* (CFG) (N, T, P, S)
     has start symbol S in N, and
     each production (in P) has the form:

        <N> -> g

     where <N> is a nonterminal symbol (in N), and g in (N+T)*

Example:

  <Exp>    ::= <Number> | ( <Exp> )
  <Number> ::= 1 | 2 | 3

Derivations:

  <Exp> -> <Number> -> 3

  <Exp> -> ( <Exp> ) -> ( <Number> ) -> ( 1 )

def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g', iff
       g is e <N> f, where e and f are in (N+T)*,
       g' is e h f,
       <N> is a nonterminal (in N), h in (N+T)*,
       and the rule <N> -> h is in P

Example: ( <Number> ) =P=> ( 1 )

DERIVATION

def: a *derivation* of a terminal string t from the rules P of G=(N,T,P,S)
     is a sequence (g0, g1, g2, ..., gm), where
       g0 = S, gm = t,
       for all i: gi in (N+T)*, and
       for all 0 <= j < m: gj =P=> g(j+1)
       (in particular, S =P=> g1)

Example: ( <Exp>, ( <Exp> ), ( <Number> ), ( 1 ) )

  usually written: <Exp> -> ( <Exp> ) -> ( <Number> ) -> ( 1 )

LEFTMOST DERIVATION

def: a *leftmost derivation* of a string t in T* from a CFG (N,T,P,S)
     is a derivation (g0, g1, ..., gm) of t from (N,T,P,S)
     such that g0 = S, gm = t, and for all 0 <= j < m:
       when gj =P=> g(j+1) and the nonterminal <N> is replaced in gj,
       then there are no nonterminals to the left of <N> in gj.

Example:

  <Exp>    ::= <Number> | <Exp> - <Exp>
  <Number> ::= 1 | 2 | 3

  <Exp> -> <Exp> - <Exp>
        -> <Number> - <Exp>
        -> 3 - <Exp>
        -> 3 - <Number>
        -> 3 - 2

RIGHTMOST DERIVATION

def: a *rightmost derivation* of a string t in T* from a CFG (N,T,P,S)
     is a derivation (g0, g1, ..., gm) of t from (N,T,P,S)
     such that g0 = S, gm = t, and for all 0 <= j < m:
       when gj =P=> g(j+1) and the nonterminal <N> is replaced in gj,
       then there are no nonterminals to the right of <N> in gj.
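
The leftmost and rightmost rules defined above can be tried out mechanically. The following is a minimal C sketch, not part of the course code. It assumes sentential forms are stored as strings in which E stands for an expression nonterminal, N for a number nonterminal, and any uppercase letter counts as a nonterminal. The names rewrite_step and derive are hypothetical. The sketch does not check that the chosen right-hand side really belongs to a production of the nonterminal being replaced.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Replace the leftmost (leftmost != 0) or rightmost (leftmost == 0)
   nonterminal in form with rhs, in place.
   Assumption of this sketch: any uppercase letter is a nonterminal.
   Returns 1 on success, 0 if form contains no nonterminal or the
   result would not fit in cap bytes. */
int rewrite_step(char *form, size_t cap, int leftmost, const char *rhs)
{
    assert(form != NULL && rhs != NULL);

    size_t len = strlen(form);
    size_t rlen = strlen(rhs);
    size_t pos = len;               /* len means "no nonterminal found" */

    for (size_t i = 0; i < len; i++) {
        if (isupper((unsigned char)form[i])) {
            pos = i;
            if (leftmost)
                break;              /* first hit is the leftmost one */
        }                           /* otherwise keep the last hit */
    }
    if (pos == len)
        return 0;
    if (len + rlen > cap)           /* new length is len - 1 + rlen, plus '\0' */
        return 0;

    /* Move the tail (including the '\0') and splice rhs into the gap. */
    memmove(form + pos + rlen, form + pos + 1, len - pos);
    memcpy(form + pos, rhs, rlen);
    return 1;
}

/* Start from the start symbol "E" and apply the right-hand sides
   rhs[0..n-1] in order. The final sentential form is left in out. */
int derive(char *out, size_t cap, int leftmost,
           const char *const rhs[], size_t n)
{
    if (cap < 2)
        return 0;
    strcpy(out, "E");
    for (size_t i = 0; i < n; i++) {
        if (!rewrite_step(out, cap, leftmost, rhs[i]))
            return 0;
    }
    return 1;
}
```

For a grammar with the rules E ::= N | E-E and N ::= 1 | 2 | 3, calling derive with leftmost = 1 and the right-hand sides "E-E", "N", "3", "N", "2" passes through E-E, N-E, 3-E, 3-N and ends at 3-2. With leftmost = 0 and the right-hand sides "E-E", "N", "2", "N", "3" it passes through E-E, E-N, E-2, N-2 and also ends at 3-2.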
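
Example (rightmost derivation of 3 - 2):

  <Exp> -> <Exp> - <Exp>
        -> <Exp> - <Number>
        -> <Exp> - 2
        -> <Number> - 2
        -> 3 - 2

PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG (N,T,P,S)
     represents a derivation, D, iff:
       - each interior node in Tr is labeled by a nonterminal (in N),
         and each leaf by a terminal (in T)
       - the root of Tr is labeled by the start symbol S
       - there is an arc from <N> to h in (N+T) iff <N> -> ... h ... in P
       - the order of children of a node labeled <N>
         is the order in a production <N> -> ... in P

EXAMPLES

GRAMMAR:

  <Exp>    ::= <Number> | <Exp> - <Exp>
  <Number> ::= 1 | 2 | 3

The leftmost derivation and the rightmost derivation of 3 - 2
both correspond to the same parse tree:

            <Exp>
          /   |   \
     <Exp>    -    <Exp>
       |             |
   <Number>      <Number>
       |             |
       3             2

GRAMMAR:

  <Exp> ::= <Number> | <Exp> + <Exp> | <Exp> * <Exp>

Derivations of 3*4+2

(leftmost)

  <Exp> -> <Exp> * <Exp>
        -> <Number> * <Exp>
        -> 3 * <Exp>
        -> 3 * <Exp> + <Exp>
        -> 3 * <Number> + <Exp>
        -> 3 * 4 + <Exp>
        -> 3 * 4 + <Number>
        -> 3 * 4 + 2

  like 3 * (4+2):

            <Exp>
          /   |   \
     <Exp>    *    <Exp>
       |         /   |   \
   <Number> <Exp>    +    <Exp>
       |      |             |
       3  <Number>      <Number>
              |             |
              4             2

(rightmost)

  <Exp> -> <Exp> + <Exp>
        -> <Exp> + <Number>
        -> <Exp> + 2
        -> <Exp> * <Exp> + 2
        -> <Exp> * <Number> + 2
        -> <Exp> * 4 + 2
        -> <Number> * 4 + 2
        -> 3 * 4 + 2

  like (3*4) + 2:

                   <Exp>
                 /   |   \
            <Exp>    +    <Exp>
          /   |   \         |
     <Exp>    *    <Exp> <Number>
       |             |      |
   <Number>      <Number>   2
       |             |
       3             4

EXAMPLE PARSE TREES

Corresponding to leftmost derivation:  the first tree above, like 3 * (4+2)

Corresponding to rightmost derivation: the second tree above, like (3*4) + 2

AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T*
     such that there are two different parse trees for t

Example: the grammar above is ambiguous, and 3*4+2 is a string t that shows it

FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite the grammar to eliminate the undesired parse trees

Example:

  <Exp>    ::= <Exp> + <Term> | <Term>
  <Term>   ::= <Term> * <Factor> | <Factor>
  <Factor> ::= <Number> | ( <Exp> )

RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:

  <N> ::= g1 | g2 | ... | gm

1. Write a function parseN()
2. This function decides between the alternatives (g1, ..., gm)
   by looking at the next token

EXAMPLE RECURSIVE-DESCENT PARSER

  <Stmt>  ::= if <Cond> then <Stmt> else <Stmt>
            | begin <Stmt> <List> end
            | write <Number>
  <List>  ::= ; <Stmt> <List>
            | <Empty>
  <Empty> ::=
  <Cond>  ::= <Number> = <Number>

    #include "pl0.tab.h"

    extern int yylex(void);   /* lexer function generated by flex */

    /* forward declarations: the parsing functions call each other */
    void parseStmt(void);
    void parseList(void);     /* parses <List>; not shown here */

    /* the next token in the input (one token of lookahead);
       call advance() once before calling parseStmt() */
    yytoken_kind_t tok;

    void advance(void) { tok = yylex(); }

    /* consume the next token if it is tk, otherwise report an error */
    void eat(yytoken_kind_t tk) {
        if (tok == tk) {
            advance();
        } else {
            /* ... report error */
        }
    }

    void parseCond(void) {
        eat(numbersym); eat(eqsym); eat(numbersym);
    }

    void parseStmt(void) {
        switch (tok) {
        case ifsym:
            eat(ifsym); parseCond();
            eat(thensym); parseStmt();
            eat(elsesym); parseStmt();
            break;
        case beginsym:
            eat(beginsym); parseStmt(); parseList();
            eat(endsym);   /* assumes the token for "end" is named endsym */
            break;
        case writesym:
            eat(writesym); eat(numbersym);
            break;
        default:
            /* report error */
            break;
        }
    }

LL(1) GRAMMARS

A recursive-descent parser must:
  - choose between alternatives (e.g., <N> ::= g1 | g2)
    based on the first token in the input

def: A grammar is *LL(1)* iff it can be parsed left-to-right in one pass
     using one token of lookahead to decide between alternatives

LR(1) GRAMMARS

An LR(1) parser needs to decide when to:
  - shift (push a token on the stack), or
  - reduce
based on the next token in the input

It uses a DFA based on the stack + lookahead

def: A grammar is *LR(1)* iff it can be parsed left-to-right in one pass
     using one token of lookahead and the parse stack
     to decide between alternatives.

LALR(1) PARSING

Smaller tables than LR(1):
  - merges states of the DFA if they differ only in lookahead

PROBLEM: AMBIGUITY

Consider:

  <Stmt>     ::= <Var> := <Exp>
               | if <Cond> then <Stmt> <ElsePart>
  <ElsePart> ::= <Empty>
               | else <Stmt>

and the statement:

  if b1 then if b2 then x := 2 else x := 3

Is this parsed as (else belongs to the inner if):

                      <Stmt>
     ____________________|_____________________
    /      |       |           |               \
   if   <Cond>   then        <Stmt>        <ElsePart>
           |        ___________|___________     |
          b1       /     |      |     |    \  (empty)
                  if  <Cond>  then <Stmt> <ElsePart>
                         |           |      /    \
                        b2        x := 2  else  <Stmt>
                                                  |
                                               x := 3

or as (else belongs to the outer if):

                      <Stmt>
     ____________________|___________________________
    /      |       |           |                     \
   if   <Cond>   then        <Stmt>              <ElsePart>
           |        ___________|___________        /    \
          b1       /     |      |     |    \     else  <Stmt>
                  if  <Cond>  then <Stmt> <ElsePart>     |
                         |           |        |        x := 3
                        b2        x := 2   (empty)

FIXES FOR AMBIGUITY

Change the language:

  a. Always have an else clause:

       <Stmt> ::= if <Cond> then <Stmt> else <Stmt>

     (use skip if you don't want to do anything)

  b. Use an end marker:

       <Stmt> ::= if <Cond> then <Stmt> else <Stmt> fi
                | if <Cond> then <Stmt> fi

Give precedence to one production:

  <Stmt>     ::= if <Cond> then <Stmt> <ElsePart>
  <ElsePart> ::= else <Stmt>       // priority!
               | <Empty>

So we only get the parse tree in which else belongs to the inner if:

                      <Stmt>
     ____________________|_____________________
    /      |       |           |               \
   if   <Cond>   then        <Stmt>        <ElsePart>
           |        ___________|___________     |
          b1       /     |      |     |    \  (empty)
                  if  <Cond>  then <Stmt> <ElsePart>
                         |           |      /    \
                        b2        x := 2  else  <Stmt>
                                                  |
                                               x := 3

THE BIG PICTURE

   source               tokens
   code   --> [ Lexer ] ------> [ Parser ]
                                    |
                                    |  abstract
                                    |  syntax
                                    v  trees
                           [ static analysis ]
                                    |
                                    v
                           [ code generator ]

BISON AND FLEX, GENERATING A PARSER

idea:

   ast.h (AST types)
     |
     v          bison
   pl0.y -----------------> pl0.tab.c     (yyparse function)
     |
     |          bison
     +--------------------> pl0.tab.h     (token defs.)
                               |
                               v   flex
   pl0_lexer.l --------------------------> pl0_lexer.c   (yylex function)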
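
To tie the pieces above together, here is a minimal hand-written sketch of recursive descent for the unambiguous expression grammar from FIXING AMBIGUOUS GRAMMARS. It is an illustration only: it is not the course's PL/0 parser and is not generated by bison or flex. Assumptions: the three levels are called Exp, Term and Factor, numbers may have several digits, the input is a plain C string rather than a token stream from yylex, and the function names (eval_exp and the parse_* helpers) are hypothetical. A recursive-descent function cannot call itself first without looping forever, so the left-recursive rules are swapped for loops that accept the same strings and still group to the left. Instead of building an abstract syntax tree, the sketch computes the value, which makes the grouping visible.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Parser state for this sketch: a cursor into the input and an error flag. */
static const char *cur;
static int failed;

static void skip_spaces(void) { while (*cur == ' ') cur++; }

static long parse_exp(void);

/* Factor ::= Number | ( Exp ) */
static long parse_factor(void)
{
    skip_spaces();
    if (*cur == '(') {
        cur++;
        long v = parse_exp();
        skip_spaces();
        if (*cur == ')') cur++;
        else failed = 1;            /* missing right parenthesis */
        return v;
    }
    if (isdigit((unsigned char)*cur)) {
        long v = 0;                 /* no overflow checking in this sketch */
        while (isdigit((unsigned char)*cur))
            v = v * 10 + (*cur++ - '0');
        return v;
    }
    failed = 1;                     /* neither a number nor '(' */
    return 0;
}

/* Term ::= Term * Factor | Factor, written as a loop */
static long parse_term(void)
{
    long v = parse_factor();
    skip_spaces();
    while (*cur == '*') {
        cur++;
        v *= parse_factor();
        skip_spaces();
    }
    return v;
}

/* Exp ::= Exp + Term | Term, written as a loop */
static long parse_exp(void)
{
    long v = parse_term();
    skip_spaces();
    while (*cur == '+') {
        cur++;
        v += parse_term();
        skip_spaces();
    }
    return v;
}

/* Returns 1 and stores the value in *value if s is a complete
   expression; returns 0 on a syntax error or leftover input. */
int eval_exp(const char *s, long *value)
{
    assert(s != NULL && value != NULL);
    cur = s;
    failed = 0;
    long v = parse_exp();
    skip_spaces();
    if (failed || *cur != '\0')
        return 0;
    *value = v;
    return 1;
}
```

With this sketch, eval_exp on "3*4+2" succeeds with the value 14, so the string is grouped like (3*4)+2 and the tree like 3*(4+2) is no longer possible. The unbalanced input "(567))+82))))" is rejected, because a right parenthesis is only accepted by the call that consumed the matching left parenthesis; the C call stack plays the role of the stack needed for matching.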