COP 3402 meeting      -*- Outline -*-

* Context-free Parsing

Based on material from Chapter 2 of the book
"Modern Compiler Implementation in Java"
by Andrew W. Appel with Jens Palsberg (Cambridge, 1998)

** Why context-free grammars and parsing?

Context-free grammars are used to describe all programming languages.
They seem to make sense to people.
They have been used to describe natural languages too,
although they aren't really powerful enough for that.

(Some parentheses to balance the imbalance below: ((((

------------------------------------------
WHY CONTEXT-FREE PARSING?

Can we define a regular expression to make sure
expressions have balanced parentheses?

e.g., recognize: (34) and ((12)+(789))
     but not:    (567))+82))))
------------------------------------------
... No, it's not possible

    Need to count or use a stack,
    but a finite automaton can't do that
    (since there can be an arbitrary number of parentheses)

------------------------------------------
WHAT IS NEEDED

Needed for checking balanced parentheses:

------------------------------------------
... recursion in the grammar, like

       <expr> ::= <number> | ( <expr> ) | <expr> + <expr>

    and an unbounded stack for parsing.

    But note that regular grammars don't allow such productions,
    and finite state machines can't simulate an unbounded stack

Q: How does this grammar

       <expr> ::= <number> | ( <expr> ) | <expr> + <expr>

   ensure that parentheses are balanced?

** Parsing with Context-free Grammars

------------------------------------------
PARSING

Want to recognize language with

Goal:

------------------------------------------
... a context-free grammar
    and terminals that are tokens from the lexer

    For this a stack is needed,
    since the grammars may be recursive...

... Produce an *abstract syntax tree* that represents the parse,
    i.e., that reflects the program's structure
    (and can be used to generate code later)

*** context-free grammar

------------------------------------------
CONTEXT-FREE GRAMMARS

def: a *context-free* grammar (CFG) (N, T, P, S) has
     a set (of nonterminals) N, a set (of terminals) T,
     a start symbol S in N,
     and each production (in P) has the form:

         <A> -> g

     where <A> is a Nonterminal symbol, and g in (N+T)*

Example:

def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g', iff
       g is e <A> f, e and f in (N+T)*,
       g' is e h f,
       <A> is a nonterminal (in N), h in (N+T)*,
       and the rule <A> -> h is in P

Example:  ( <number> ) =P=>

------------------------------------------
...    <expr> -> <number> | ( <expr> )
       <number> -> 1 | 2 | 3

    So g =P=> g' if the <A> in g can be replaced
    by the right hand side of a rule <A> -> h in P

... ( 1 ), where P is the rules from the grammar above

*** derivation

------------------------------------------
DERIVATION

def: a *derivation* of gm in T* from the rules P of G=(N,T,P,S),
     is a sequence (S, g1, g2, ..., gm), where
       for all i: gi in (N+T)*,
       S =P=> g1, and
       for all 1 <= j < m: gj =P=> g(j+1)

Example:

------------------------------------------
    A derivation is a trace of the production game
    - It shows how a terminal string is in the language of the grammar

... A derivation (S, g1, g2, ...) is usually written

        S -> g1 -> g2 ...

    So an example is:

        <expr> -> ( <expr> ) -> ( <number> ) -> ( 1 )

    (more examples below)
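An aside (not from the book or the slides): the production game can be
simulated by a small program.  The sketch below assumes the two-rule
grammar above; produce_step is a made-up helper that performs one
g =P=> g' step by replacing the first (leftmost) occurrence of a
nonterminal with the right-hand side of a rule.

    #include <stdio.h>
    #include <string.h>

    /* One step of the production game: given a sentential form g that
       contains the nonterminal lhs, build g' = e h f, where g = e lhs f
       and lhs -> h is the rule being used.  Replacing the leftmost
       occurrence means repeated calls trace out a leftmost derivation. */
    static void produce_step(const char *g, const char *lhs, const char *rhs,
                             char *out, size_t outsize)
    {
        const char *at = strstr(g, lhs);      /* leftmost lhs in g */
        if (at == NULL) {                     /* no such nonterminal: copy g */
            snprintf(out, outsize, "%s", g);
            return;
        }
        /* out = e ++ h ++ f */
        snprintf(out, outsize, "%.*s%s%s",
                 (int)(at - g), g, rhs, at + strlen(lhs));
    }

    int main(void)
    {
        /* derive ( 1 ) from <expr> using
             <expr> -> <number> | ( <expr> )
             <number> -> 1 | 2 | 3            */
        char g1[64], g2[64], g3[64];
        produce_step("<expr>", "<expr>", "( <expr> )", g1, sizeof g1);
        produce_step(g1, "<expr>", "<number>", g2, sizeof g2);
        produce_step(g2, "<number>", "1", g3, sizeof g3);
        printf("<expr> -> %s -> %s -> %s\n", g1, g2, g3);
        /* prints: <expr> -> ( <expr> ) -> ( <number> ) -> ( 1 ) */
        return 0;
    }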
------------------------------------------
LEFTMOST DERIVATION

def: a *leftmost derivation* of gm in T* from a CFG (N,T,P,S)
     is a derivation (S, g1, ..., gm) of gm from (N,T,P,S)
     such that for all 0 <= j < m:
       when gj =P=> g(j+1) and the nonterminal <A> is replaced in gj,
       then there are no nonterminals to the left of <A> in gj.

Example:

------------------------------------------
    Notes:
    - a leftmost derivation corresponds to a parse
      that processes input from left to right.
    - leftmost derivations are unique (when they exist).

... for grammar:

        <expr> -> <number> | <expr> - <expr>
        <number> -> 1 | 2 | 3

    To derive 3 - 2:

        <expr> -> <expr> - <expr>
               -> <number> - <expr>
               -> 3 - <expr>
               -> 3 - <number>
               -> 3 - 2

Q: What would a rightmost derivation be?
   ... like the above, but with right replacing left.

*** parse trees

------------------------------------------
PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG (N,T,P,S)
     represents a derivation, D, iff:
     - Each node in Tr is labeled by a nonterminal (in N)
     - the root of Tr is the start symbol S
     - there is an arc from <A> to h in (N+T)
       iff <A> -> ... h ... is in P
     - the order of children of a node labeled <A>
       is the order in a production <A> -> ... in P
------------------------------------------

------------------------------------------
EXAMPLES

GRAMMAR:

    <expr> ::= <number> | <expr> + <expr> | <expr> * <expr>

Derivations of 3*4+2

------------------------------------------
... (leftmost)

    <expr> -> <expr> + <expr>
           -> <expr> * <expr> + <expr>
           -> <number> * <expr> + <expr>
           -> 3 * <expr> + <expr>
           -> 3 * <number> + <expr>
           -> 3 * 4 + <expr>
           -> 3 * 4 + <number>
           -> 3 * 4 + 2

... (rightmost)

    <expr> -> <expr> * <expr>
           -> <expr> * <expr> + <expr>
           -> <expr> * <expr> + <number>
           -> <expr> * <expr> + 2
           -> <expr> * <number> + 2
           -> <expr> * 4 + 2
           -> <number> * 4 + 2
           -> 3 * 4 + 2

------------------------------------------
EXAMPLE PARSE TREES

Corresponding to leftmost derivation:

Corresponding to rightmost derivation:

------------------------------------------
... (leftmost)

                  <expr>
               /    |    \
          <expr>    +    <expr>
         /   |   \          |
    <expr>   *   <expr>  <number>
       |           |        |
   <number>    <number>     2
       |           |
       3           4

... (rightmost)

                  <expr>
               /    |    \
          <expr>    *    <expr>
             |          /   |   \
         <number>  <expr>   +   <expr>
             |        |            |
             3    <number>     <number>
                      |            |
                      4            2

Q: What does each parse tree mean?
   The first (leftmost) is (3*4) + 2 = 12+2 = 14
   The second (rightmost) is 3*(4+2) = 3*6 = 18

   This is an example of ...

*** ambiguity

------------------------------------------
AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees for t

------------------------------------------

Q: What's an example of an ambiguous grammar?
   The one in our example above!

Q: Why do we want to avoid ambiguous grammars?
   So the meaning of each program is uniquely determined!

Q: Is ambiguity a property of a language or its grammar?
   The grammar; there may be non-ambiguous grammars for the same language

**** Fixing ambiguous grammars

------------------------------------------
FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to

Example:

------------------------------------------
... eliminate undesired parse trees

... Grammar for expressions

        <expr> ::= <expr> + <term> | <term>
        <term> ::= <term> * <factor> | <factor>
        <factor> ::= <number>

    This generates the same language as the ambiguous grammar above,
    but the only parse of 3*4 + 2 corresponds to the tree:

                  <expr>
               /    |    \
          <expr>    +    <term>
             |              |
          <term>        <factor>
        /    |    \         |
   <term>    *   <factor>   2
      |             |
  <factor>          4
      |
      3

Q: Why is the tree

                  <expr>
               /    |    \
          <expr>    *    <expr>
             |          /   |   \
         <number>  <expr>   +   <expr>
             |        |            |
             3    <number>     <number>
                      |            |
                      4            2

   not a parse tree for this grammar?

   Because there is no rule for <expr> with the * (and no parentheses)
   on the right hand side; we have to use <term> for that,
   and <term> has no rule with + (and no parentheses)
   on the right hand side

How to create such a grammar?

   There is a standard idea in parsing called "operator precedence":
   if op2 binds tighter than op1,
   then write the grammar so that op2 can only occur in
   nonterminals produced by the nonterminal that produces op1.
   Introduce new nonterminals to force op2 to have that relationship to op1.

   E.g., we added <term> and <factor> to the grammar,
   and made the multiplication operator (*) only be produced by <term>,
   which is produced from <expr>, which produces the addition operator (+).
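To make the point about meaning concrete, here is a small sketch (not
from the book or the slides; the Expr type and eval function are made up
for illustration).  It builds the two parse trees for 3*4+2 as abstract
syntax trees and evaluates them, giving the two different answers above.

    #include <stdio.h>

    /* A tiny AST for expressions: either a number leaf or
       an application of + or * to two subtrees. */
    typedef struct Expr {
        char op;                     /* '+', '*', or 'n' for a number leaf */
        int value;                   /* used when op == 'n' */
        struct Expr *left, *right;   /* used when op is '+' or '*' */
    } Expr;

    static int eval(const Expr *e)
    {
        switch (e->op) {
        case 'n': return e->value;
        case '+': return eval(e->left) + eval(e->right);
        case '*': return eval(e->left) * eval(e->right);
        default:  return 0;   /* unreachable for well-formed trees */
        }
    }

    int main(void)
    {
        Expr three = {'n', 3, NULL, NULL};
        Expr four  = {'n', 4, NULL, NULL};
        Expr two   = {'n', 2, NULL, NULL};

        /* tree from the leftmost derivation: (3 * 4) + 2 */
        Expr mul = {'*', 0, &three, &four};
        Expr t1  = {'+', 0, &mul, &two};

        /* tree from the rightmost derivation: 3 * (4 + 2) */
        Expr add = {'+', 0, &four, &two};
        Expr t2  = {'*', 0, &three, &add};

        printf("(3*4)+2 = %d, 3*(4+2) = %d\n", eval(&t1), eval(&t2));
        /* prints: (3*4)+2 = 14, 3*(4+2) = 18 */
        return 0;
    }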
** Parsing techniques

*** recursive-descent parsing (LL parsing)

This is regularly used in practice,
because it can have excellent error messages
(but you must customize the error messages for that to happen)

There are automated tools that generate such parsers (ANTLR is one)

------------------------------------------
RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:

    <N> ::= g1 | g2 | ... | gm

1. Write a

2. This function

------------------------------------------
... (recursive) function parseN
    (which needs no arguments, as it gets tokens from the lexer)

... decides between the alternatives g1, ..., gm
    by looking at the next (first) terminal (token)

    So, the first terminal in each production must provide
    enough information to decide which alternative to parse!

**** example

------------------------------------------
EXAMPLE RECURSIVE-DESCENT RECOGNIZER

<stmt> ::= if <cond> then <stmts> else <stmts> end
         | begin <stmts> end
         | print <number>
<stmts> ::= <empty> | <stmt> <semi-stmts>
<semi-stmts> ::= ; <stmt> <semi-stmts>
<semi-stmts> ::= <empty>
<cond> ::= <number> == <number>

#include "spl.tab.h"

// the current token
yytoken_kind_t tok;

void parseStmt();  // forward declaration, since parseStmts calls it

void parser_initialize() { tok = yylex(); }

void advance() { tok = yylex(); }

void eat(yytoken_kind_t expected)
{
    if (tok == expected) {
        advance();
    } else {
        /* ... report error */
    }
}

void parseCond()
{
    eat(numbersym); eat(eqeqsym); eat(numbersym);
}

void parseStmts()
{
    // <stmts> ::= <empty> | <stmt> <semi-stmts>
    if (tok == ifsym || tok == beginsym || tok == printsym) {
        parseStmt();
        // <semi-stmts> ::= ; <stmt> <semi-stmts> | <empty>
        while (tok == semisym) {
            eat(semisym);
            parseStmt();
        }
    }
}

void parseStmt()
{
    switch (tok) {
    case ifsym:
        eat(ifsym); parseCond(); eat(thensym);
        parseStmts(); eat(elsesym); parseStmts(); eat(endsym);
        break;
    case beginsym:
        eat(beginsym); parseStmts(); eat(endsym);
        break;
    case printsym:
        eat(printsym); eat(numbersym);
        break;
    default:
        // report error: tok cannot start a statement
        break;
    }
}
------------------------------------------

Q: What kind of errors can eat() report?
   That a certain kind of token was expected, but a different one was
   seen instead... (at a given source code location, from the token)

Q: What kind of error message can the parser produce (in the default case)?
   Unexpected token; e.g., it was expecting a statement
   and we saw some token that cannot start a statement

Q: How do errors get reported by parseCond()?
   When eat reports them...

**** terminology: LL(1)

------------------------------------------
LL(1) GRAMMARS

A recursive-descent parser must:

- choose between alternatives
  (e.g., <stmts> ::= <empty> | <stmt> <semi-stmts>)

def: A grammar is *LL(1)* iff

------------------------------------------
... choose between alternatives based on the next input token

... it can be parsed left-to-right in one pass
    using one token of lookahead to decide between alternatives

Thus an LL(1) grammar is necessary when using a recursive-descent parser

LL(1) stands for: left-to-right parse, leftmost derivation,
1-symbol lookahead

A weakness is that LL(k) parsers must predict the production to use
(based only on k tokens)
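For example (an aside, not on the slides): in the unambiguous expression
grammar above, <expr> ::= <expr> + <term> | <term>, both alternatives
start with the same tokens, so one token of lookahead cannot pick the
production, and a recursive-descent parser that literally tried the
left-recursive alternative first would recurse forever.  A common
workaround in a hand-written parser is to parse the repetition with a
loop, as in this sketch (the token names plussym and timessym are
assumptions, in the style of the recognizer above):

    #include "spl.tab.h"   // assumed to declare yytoken_kind_t and the *sym tokens

    extern yytoken_kind_t tok;               // current token, as in the recognizer
    extern void eat(yytoken_kind_t expected);

    // <factor> ::= <number>
    void parseFactor()
    {
        eat(numbersym);
    }

    // <term> ::= <term> * <factor> | <factor>
    // parsed as: one <factor>, then zero or more "* <factor>"
    void parseTerm()
    {
        parseFactor();
        while (tok == timessym) {   // timessym: assumed token name for *
            eat(timessym);
            parseFactor();
        }
    }

    // <expr> ::= <expr> + <term> | <term>
    // parsed as: one <term>, then zero or more "+ <term>"
    void parseExpr()
    {
        parseTerm();
        while (tok == plussym) {    // plussym: assumed token name for +
            eat(plussym);
            parseTerm();
        }
    }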
**** terminology: LR(1)

LR(1) stands for: left-to-right parse, rightmost derivation,
1-token lookahead

LR(k) can postpone the decision of what production to use until:
it has seen the entire right-hand side of a production
(and k tokens beyond that)

So it's more powerful than an LL(k) parser,
as it can use the stack to make decisions (based on history)

------------------------------------------
LR(1) GRAMMARS

An LR(1) parser needs to decide when to:

   shift (push token on stack)

   or reduce

   uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff

------------------------------------------
... based on the next token (in the input)

... it can be parsed left-to-right in one pass
    using the parse stack and one token of lookahead
    to decide between alternatives

Note that the parse stack is more than an LL(1) parser can use.

Most non-ambiguous programming languages can be parsed using LR(1).

***** LALR(1) parsers

------------------------------------------
LALR(1) Parsing

Smaller tables than LR(1)
- merges states of the DFA if they only differ in lookahead

------------------------------------------
LALR(1) stands for lookahead LR(1)

The DFA decides what to do (shift or reduce)
based on the parse stack and the lookahead.

This is what Bison (and yacc) use.

**** ambiguity problems

------------------------------------------
PROBLEM: AMBIGUITY

Consider:

    <stmt> ::= <var> := <expr>
             | if <expr> then <stmt> <else-part>
    <else-part> ::= <empty> | else <stmt>

and the input statement:

    if b1 then if b2 then x := 2 else x := 3

Is this parsed as:

                 <stmt>
     /      /      |       \               \
   if    <expr>   then    <stmt>        <else-part>
           |            /  / | \  \          |
           b1          /  /  |  \  \      <empty>
                      /  /   |   \  \
                    if <expr> then <stmt> <else-part>
                         |           |      /    \
                         b2       x := 2  else  <stmt>
                                                   |
                                                x := 3

or as:

                 <stmt>
     /      /      |       \                    \
   if    <expr>   then    <stmt>             <else-part>
           |            /  / | \  \            /      \
           b1          /  /  |  \  \         else    <stmt>
                      /  /   |   \  \                   |
                    if <expr> then <stmt> <else-part>  x := 3
                         |           |        |
                         b2       x := 2   <empty>

------------------------------------------
That is, when is the assignment x := 3 executed?

------------------------------------------
FIXES FOR AMBIGUITY

Change the language:

a. Always have an else clause:

      <stmt> ::= if <expr> then <stmt> else <stmt>

   (use skip if don't want to do anything)

b. Use an end marker

      <stmt> ::= if <expr> then <stmt> else <stmt> fi
               | if <expr> then <stmt> fi

Give precedence to one production:

      <stmt> ::= if <expr> then <stmt> <else-part>
      <else-part> ::= else <stmt>   // priority!
                    | <empty>

So we only get the parse tree:

                 <stmt>
     /      /      |       \               \
   if    <expr>   then    <stmt>        <else-part>
           |            /  / | \  \          |
           b1          /  /  |  \  \      <empty>
                      /  /   |   \  \
                    if <expr> then <stmt> <else-part>
                         |           |      /    \
                         b2       x := 2  else  <stmt>
                                                   |
                                                x := 3
------------------------------------------

The second solution is adopted in SPL (our homework)

The idea of giving priority to one production
is found in parsing expression grammars (PEGs)
(a recursive-descent view of this priority is sketched
at the end of these notes)

The idea of priority also works for expression parsing

The better idea seems to be to change the language,
as we want programmers to also be sure what is going on

Note, no LR(1) grammar can be ambiguous!
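An aside (not on the slides): in a recursive-descent parser, the
priority fix above corresponds to a greedy choice in the function for
<else-part>: when the lookahead token is else, always take the
"else <stmt>" alternative, so each else binds to the nearest enclosing
if.  A minimal sketch, assuming tok, eat(), and elsesym as in the
recognizer earlier, and a parseStmt() written for this statement
grammar:

    #include "spl.tab.h"                     // as in the recognizer above

    extern yytoken_kind_t tok;               // current token
    extern void eat(yytoken_kind_t expected);
    extern void parseStmt();                 // assumed parser for <stmt>

    // <else-part> ::= else <stmt>   // priority!
    //               | <empty>
    void parseElsePart()
    {
        if (tok == elsesym) {   // greedy: prefer "else <stmt>" over <empty>
            eat(elsesym);
            parseStmt();
        }
        // otherwise <else-part> ::= <empty>: consume no tokens
    }

In an LR tool such as Bison, the same dangling-else grammar produces a
shift/reduce conflict at else, and the default resolution (shift) gives
this same nearest-if binding.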