LANGUAGES

def: A *language* is a set of strings
of characters from some alphabet.


          LANGUAGE CLASSES

Languages can be classified by
  the grammars used to generate/recognize
  them

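Venn diagram (for one programming language;
  each inner grammar accepts a subset of
  the strings the enclosing one accepts):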

 |--------------------------------------|
 | Strings recognized by                |
 |   the Regular Grammar                |
 |                                      |
 |  |--------------------------------|  |
 |  | Strings recognized by          |  |
 |  |  the Context-free Grammar      |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Strings recognized by    |  |  |
 |  |  |  the Context-sensitive   |  |  |
 |  |  |    Grammar               |  |  |
 |  |  |     (static checking)    |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  | Strings recognized |  |  |  |
 |  |  |  |  by the            |  |  |  |
 |  |  |  |   Type 0 Grammar   |  |  |  |
 |  |  |  |    (dynamic checks)|  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|


          PHASES OF A COMPILER

Programs allowed by a compiler's:

 |--------------------------------------|
 | Lexical Analysis (Lexer)             |
 |                                      |
 |  |--------------------------------|  |
 |  | Parser                         |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Static Analysis          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  |  Runtime checks    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|


       PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

Language
 Def.docx  -- coding --> parser.c (v. 1)

 Def2.docx -- coding --> parser.c (v. 2)
           -- coding --> parser.c (v. 3)

 Def3.docx -- coding --> parser.c (v. 4)
 Def4.docx -- coding --> parser.c (v. 5)
   ...                    ...
 DefN.docx -- coding --> parser.c (v. N)

Disadvantages:


   - can be slow to write
   - error prone
   - hard to check or verify
   - expensive
   - might not be very efficient


 COMPUTER SCIENCE SOLUTION: AUTOMATION

 high-level
 description  tool      generated code

lang_lexer.l -- flex--> lang_lexer.c
                          + lang_lexer.h
     lang.y  -- bison --> lang.tab.c
                          + lang.tab.h
     lang2.y -- bison --> lang2.tab.c
                          + lang2.tab.h
                ...

     langN.y -- bison --> langN.tab.c
                          + langN.tab.h

Advantages:

     - cycle time is faster
     - grammar is easier to check/verify
     - code can be very efficient


        GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions
 of languages/parsers

def: A *terminal* is a character
     (or string of characters)
     found in the language

     E.g., 123
           while
	   xyz
	   ,
	   (

def: A *nonterminal* is a meta-symbol
     that can be replaced by a string of
     terminals and nonterminals.

     E.g., <program>
           <expr>

def: a *grammar* consists of a finite set
     of rules (called "productions")
     and a start symbol (a nonterminal).

     Let V = nonterminals + terminals
     
     The rules have the form

            V+ -> V*

     where there is no symbol in both
        nonterminals and terminals


def: The language generated by a grammar G
     with set of productions P is:

     {w | w is in terminals* and S =>* w}
       where S is the start symbol of G
       gAd => gBd iff g in V* and d in V*
       and A -> B is a rule in P
       and g =>* h iff either h = g
                    or g => i and i =>* h
                       for some i in V*


         BNF NOTATION FOR GRAMMARS

::=  means "can be" or "can produce" i.e., ->

 |   means "or"

<X>  is a nonterminal named X


Example

 <BNumber> ::= <BDigit> <BNumber>
           |  <BDigit>

 <BDigit> ::= 0 | 1

E.g.,

   <BNumber> =>* 01001


         GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing
  two games:

   - A production game
      (Can you produce this string?)
   - A recognition/parsing game
      (Is this string in the language?)
      

            PRODUCTION GAME

Goal: produce a string in the language
        from the start symbol

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj> <Punc>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult
      <Punc> -> ? | .

   Can we produce "Johnny is good."?
     <Sentence> =>* Johnny is good.

   Yes; here are two derivations
   (leftmost first, then rightmost):


    <Sentence>
        => <Noun> <Verb> <Adj> <Punc>
        => Johnny <Verb> <Adj> <Punc>
        => Johnny is <Adj> <Punc>
        => Johnny is good <Punc>
        => Johnny is good .

    <Sentence>
        => <Noun> <Verb> <Adj> <Punc>
        => <Noun> <Verb> <Adj> .
        => <Noun> <Verb> good .
        => <Noun> is good .
        => Johnny is good .


        RECOGNITION OR PARSING GAME

Goal: determine if a string is in
      the language of the grammar

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Is "Johnny is good" in the language
      of this grammar?


         DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes
     connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G
     is a tree such that:
        - Every node has a label that is
          a symbol of G
        - The root is labeled by
          the start symbol of G
        - Every node with a
          direct descendant is labeled by
          a nonterminal
        - If the direct descendants of a node
          labeled by N have the following
          labels (in order):
             A, B, C, ..., K
          then G has a production of form
             N -> A B C ... K

        EXAMPLE DERIVATION TREE

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

String to parse: "Johnny is good"

         <Sentence>
         /    |   \
        /     |    \
       v      v     v
    <Noun> <Verb> <Adj>
       |      |     |
       v      v     v
    Johnny   is    good

        EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:

    { x }  means 0 or more repeats of x

   <N> ::= { <Z> }

   is equivalent to:

   <N> ::= <Z-seq>
   <Z-seq> ::= <empty> | <Z-seq> <Z>
   <empty> ::=

   {<Z>} is also written as:
       <Z>*  or  [ <Z> ] ...

One-or-more repeats:

     x+  means 1 or more repeats of x

   <N> ::= <X>+

   is equivalent to:

   <N> ::= <Xs>
   <Xs> ::= <X> | <Xs> <X>

   <X>+ is sometimes written as <X> ...
                   or <X> [ <X> ] ...

Optional element:

    [ x ]  means 0 or 1 occurrences of x

     <N> ::= [ <Y> ]

     is equivalent to:

     <N> ::= <Y-opt>
     <Y-opt> ::= <empty> | <Y>

    READING A BNF GRAMMAR

Example rules:

  <DecimalConstant> ::=
	<NonZeroDigit> <Digits>
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>
  <Digits> ::= <Digit> | <Digit> <Digits>


      EBNF GRAMMAR FOR (SUBSET OF) SPL

<program> ::= <block> .

<block> ::= begin <const-decls>
                  <var-decls>
                  <proc-decls>
                  <stmts>
            end

<const-decls> ::= {<const-decl>}
<const-decl> ::= const <const-def-list> ;
<const-def-list> ::= <const-def>
          | <const-def-list> , <const-def>
<const-def> ::= <ident> = <number>

<var-decls> ::= {<var-decl>}
<var-decl> ::= var <ident-list> ;
<ident-list> ::= <ident> {<comma-ident>}
<comma-ident> ::= , <ident>

<proc-decls> ::= {<proc-decl>}
<proc-decl> ::= proc <ident> <block> ;

<stmts> ::= <empty> | <stmt-list>
<empty> ::= 
<stmt-list> ::= <stmt> {<semi-stmt>}
<semi-stmt> ::= ; <stmt>
<stmt> ::= <ident> := <expr>
       | call <ident>
       | if <condition> then <stmts>
                        else <stmts> end
       | if <condition> then <stmts> end
       | while <condition> do <stmts> end
       | read <ident>
       | print <ident>
       | <block>

<condition> ::= <expr> <rel-op> <expr>


          EXAMPLES IN SPL

Shortest program:

  begin end.

Factorial program:

begin
  var n, res; # input and result
  proc fact
  begin
     read n;
     res := 1;
     while n != 0
     do
       res := res * n;
       n := n-1
     end;
     print res
  end;
  call fact
end.


           MOTIVATION

 # $Id$\n   .text start\nstart:\tADDI ...

Want to:
  break input string into tokens
  so that parser doesn't need to worry
  about individual characters

  Examples from the SSM assembler
    (in asm.tab.h file)

          INPUT    token names
           ===================
           .text   dottextsym
           start   identsym
	   :       colonsym
	   ADDI    addisym
	   

Approach:
  the parser only sees non-ignored tokens
     not comments, whitespace, ...


            LEXICAL ANALYSIS

Lexical means relating to the words
  of a language


       GOALS OF LEXICAL ANALYSIS

- Simplify the parser,
  so it need not handle:
    - white space and comments
    - details of tokens


- Recognize the longest match
  Why?
    e.g., == operator or <= 


- Handle every possible character of input
  Why?
     so communication is clear
       and programmer doesn't think a
        character means something
	  when it doesn't


         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers
 are tokens:

What tokens should "if8" match?


Fixing such situations:

   - tell programmers it's an identifier
      (will cause a syntax error)
     say whitespace is required to separate tokens
     
   - tell programmers it's two tokens


           WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?
   leqsym

If the input is "==", what token?
   eqeqsym

If the input is "<8", what token(s)?
   ltsym and then numbersym

If the input is "if", what token(s)?
   ifsym

If the input is "//", what token(s)?
   (if that starts a comment, then none and eat the comment)
   divsym and then divsym in SPL

Summary:
    favor the longest match
    but give priority to reserved words over identifiers


          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 / 
                                / abstract
                               /  syntax
                              v   trees (ASTs)
                            [ static
                              analysis ]
                             / 
                            /
                           v
                         [ code generator ]

For the Lexer we want to:
  - specify the tokens
    using regular expressions (REs)
  - convert REs to DFAs to execute them
     but easy conversions are:
        - REs to NFAs
        - NFAs to DFAs


        HOW PARSER WORKS WITH LEXER

Coroutine structure:

   Parser calls lexer:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */

   Lexer function
       remembers pointer to input stream

         returns next token (int code)

   Parser works...
   
   Parser calls lexer again:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */
     

 BISON AND FLEX, GENERATING A PARSER

idea:

  ast.h (AST types)
   |
   |        bison
   |        -----> spl.tab.c
   |       /        ^  yyparse() function
   v      / bison   |  
  spl.y file -----> spl.tab.h  
                    |  tokens
             flex   v  defs. 
  spl.l file -----> spl.c       
                     yylex() function


             DEFINITIONS

The lexical grammar of a language
   is kept regular, because regular
   languages can be recognized very quickly
   by a deterministic finite automaton (DFA)


def: a grammar is *regular* iff
     all its productions have one of these
     forms:

           <B> ::= c <D>
         or
           <B> ::= c

     where c is a terminal symbol,
       and <B> and <D> are nonterminals

def: a language is *regular* iff
     it can be defined using a
     regular grammar.

Thm: Every regular language can be
     recognized by a finite automaton.

Thm: Every regular language can be
     specified by a regular expression.

            REGULAR EXPRESSIONS

The language of regular expressions:

  <RE> ::= <char>
       | <RE> '|' <RE>
       | <RE> <RE>
       | emp            
       | <RE>*
       | ( <RE> )

 where <char> is a character

Examples:

  RE              meaning
====================================
  emp             the empty string

  2*              zero or more 2's   e.g., emp, 2, 22, 222, 2222, ...

  (0|1)*0         even binary numerals  e.g., 0, 10, 00, 100, 110, ...

 b*(abb*)*(a|emp) a's and b's without  e.g., emp, a, aba, abba, baba
                  consecutive a's
  

TYPICAL EXTENSIONS TO REGULAR EXPRESSIONS

[abcd]   means   a|b|c|d

[h-m]    means   h|i|j|k|l|m

 x?      means   x|emp

 y+      means   y(y*)
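
These REs can be tried out with the POSIX
regex functions (regcomp, regexec, regfree
from <regex.h>). The sketch below assumes a
POSIX system; full_match is a hypothetical
helper name. POSIX has no "emp", so (a|emp)
is written a? there:

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole string s matches the POSIX extended RE
   pattern, 0 if it does not, and -1 if the pattern is too long
   or cannot be compiled. */
int full_match(const char *pattern, const char *s)
{
    char anchored[256];
    regex_t re;
    /* wrap the pattern so that it must match all of s */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored) {
        return -1;
    }
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0) {
        return -1;
    }
    int matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

For example, full_match("(0|1)*0", "110") is 1
and full_match("b*(abb*)*a?", "baab") is 0.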


           FOR YOU TO DO

Write a regular expression that describes:

1. The keyword "if"

if

2. The set of all (positive)
   decimal numbers

[1-9]([0-9]*)    (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*


3. The set of all possible identifiers
   without underbars

[a-zA-Z][a-zA-Z0-9]*


(If you have time, write these as
 regular grammars also.)

<KW> ::= i <End>
<End> ::= f

<Number> ::= 1 <Digits> | 2 <Digits> | 3 <Digits> | ... | 9 <Digits>
           | 1 | 2 | 3 | ... | 9
<Digits> ::= 0 <Digits> | 1 <Digits> | 2 <Digits> | ... | 9 <Digits>
           | 0 | 1 | 2 | 3 | ... | 9

<Digits> ::= <Digit>*
<Digit> ::= 0 | 1 | 2 | 3 | ... | 9


   EXAMPLE REGULAR EXPRESSIONS FOR SPL

PATTERN         RE
=========================

DIGIT	        [0-9]
LETTER		[a-zA-Z]


...

              EXAMPLE

LETTERORDIGIT   ({LETTER}|{DIGIT})
IDENT           {LETTER}({LETTERORDIGIT}*)


State diagram:

             <         =
    -->[q0] --->[[q1]]---> [[q2]]


      PSEUDO CODE FOR THIS LEXER CASE

    char c;
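
One possible way to code the state diagram
above is shown below. It is only a sketch:
it reads from a string cursor instead of an
input stream, and the token_kind enum and
the lex_less function are hypothetical
names, not part of the generated lexer.

```c
/* Hypothetical token codes for this sketch only. */
typedef enum { ltsym, leqsym, errsym } token_kind;

/* Recognize "<" or "<=" at *input, preferring the longest match.
   On success, advance *input past the token. */
token_kind lex_less(const char **input)
{
    const char *p = *input;
    if (*p != '<') {
        return errsym;       /* q0 has no other transition */
    }
    p++;                     /* now in q1, which is accepting */
    if (*p == '=') {
        p++;                 /* now in q2, also accepting */
        *input = p;
        return leqsym;
    }
    *input = p;              /* stop in q1; next char is not consumed */
    return ltsym;
}
```

On the input "<8" this returns ltsym and
leaves the cursor at the 8, so the next call
to the lexer can return numbersym.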


        DEALING WITH COMMENTS

Suppose
  /  is used for division
  // starts a comment to end of line
      (unlike SPL!)

What state diagram?


How does whitespace fit in?


TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to parser

Can always get a non-ignored token:


Return "tokens" that include ignored stuff
  to a loop that ignores them


Giant DFA that goes back to start state
  on seeing something to ignore
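
The "loop that ignores them" tactic can be
sketched as follows, assuming (as in the
comment example above, and unlike SPL) that
// starts a comment. The raw_kind enum and
the functions raw_next and next_token are
hypothetical names; othersym stands in for
every other kind of token.

```c
#include <ctype.h>

/* Hypothetical token codes; wssym and commentsym are ignored kinds. */
typedef enum { eofsym, divsym, othersym, wssym, commentsym } raw_kind;

/* Recognize one raw token at *p, including the ignored kinds. */
static raw_kind raw_next(const char **p)
{
    const char *s = *p;
    raw_kind k;
    if (*s == '\0') {
        return eofsym;                       /* stay at the end */
    } else if (isspace((unsigned char)*s)) {
        while (isspace((unsigned char)*s)) {
            s++;
        }
        k = wssym;
    } else if (s[0] == '/' && s[1] == '/') {
        while (*s != '\0' && *s != '\n') {   /* comment runs to end of line */
            s++;
        }
        k = commentsym;
    } else if (*s == '/') {
        s++;
        k = divsym;
    } else {
        s++;
        k = othersym;
    }
    *p = s;
    return k;
}

/* What the parser calls: never returns an ignored token. */
raw_kind next_token(const char **p)
{
    raw_kind k;
    do {
        k = raw_next(p);
    } while (k == wssym || k == commentsym);
    return k;
}
```

The parser only ever calls next_token, so
it never sees wssym or commentsym.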


     NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton*
    (NFA) over an alphabet Sigma
    is a system (K, Sigma, delta, q0, F)
    where K is a finite set (of states),
          Sigma is a finite set
            (the input alphabet),
          delta is a map of type
             (K, Sigma) -> Sets(K)
          q0 in K is the initial state,
          F is a subset of K
            (the final/accepting states).


     TRANSITION FUNCTION AND ACCEPTANCE

  p in delta(q,x)
     means that in state q,
       on input x the next state
       can be p

  delta*(q,s)
     where s in Sigma*
      is defined by:
              delta*(q,emp) = {q}
         (i.e., q in delta*(q,emp))
              delta*(q,xa) = union of
                delta*(q2, a)
                for all q2 in delta(q,x)
            where x in Sigma and
                  a in Sigma*
         (i.e., p in delta*(q,xa) iff
          p in delta*(q2,a)
          for some q2 in delta(q,x))

Lemma: for all c in Sigma,
   delta*(q, c) = delta(q, c)

def: An NFA (K, Sigma, delta, q0, F)
     *accepts* a string s in Sigma* iff
     there is some q in delta*(q0,s)
                 such that q in F


                  EXAMPLE NFA

         0,1                     0,1
        /---\                   /---\
        \   /                   \   /
         | v   0           0     | v
    -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]]
         |
         | 1
         v
       [ q1 ]
         |
         | 1
         v
      [[ q2 ]]
         | ^
        /   \
        \---/
         0,1

 K = {q0,q1,q2,q3,q4}
 Sigma = {0,1}
 q0 is start state
 F = {q2,q4}

 delta(q0,0) = {q0,q3}
 delta(q1,0) = {}
 delta(q2,0) = {q2}
 delta(q3,0) = {q4}
 delta(q4,0) = {q4}
 delta(q0,1) = {q0,q1}
 delta(q1,1) = {q2}
 delta(q2,1) = {q2}
 delta(q3,1) = {}
 delta(q4,1) = {q4}


   EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to
   sets of states
     d(Q,x) = union of d(q,x)
                for all q in Q

     so d({},x) = {}
        d({q},x) = d(q,x)
        d({q1,q2},x) = d(q1,x)+d(q2,x)
        d({q1,q2,q3},x) =
               d(q1,x)+d(q2,x)+d(q3,x)
        etc.


Note that, for x in Sigma,
     delta*({q}, x)
       = delta*(q, x)
       = delta(q, x)


             EXAMPLE

delta*(q0,010011)
    = delta*(delta(q0,0),10011)
    = delta*({q0,q3},10011)
    = delta*(q0,10011)
      + delta*(q3,10011)
    = delta*(delta(q0,1),0011)
      + delta*(delta(q3,1),0011)
    = delta*({q0,q1},0011)
      + delta*({},0011)
    = delta*(delta(q0,0),011)
         + delta*(delta(q1,0),011)
      + {}
    = delta*({q0,q3},011)
         + delta*({},011)
      + {}
    = delta*({q0,q3},011)
    = delta*(delta(q0,0),11)
         + delta*(delta(q3,0),11)
    = delta*({q0,q3},11)
         + delta*({q4},11)
    = delta*(delta(q0,1),1)
         + delta*(delta(q3,1),1)
      + delta*(delta(q4,1),1)
    = delta*({q0,q1},1)
         + delta*({},1)
      + delta*({q4},1)
    = delta({q0,q1},1)
      + delta(q4,1)
    = {q0,q1,q2}
      + {q4}
    = {q0,q1,q2,q4}
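
The same calculation can be checked by
simulating the example NFA on sets of states.
In this sketch a set of states is a bit mask
(bit i stands for q_i); the names state_set,
step, delta_star, and accepts are assumptions
made for the example.

```c
enum { NSTATES = 5 };
typedef unsigned state_set;          /* bit i set means q_i is in the set */
#define Q(i) (1u << (i))

/* delta[q][c] for the example NFA, with c in {0, 1} */
static const state_set delta[NSTATES][2] = {
    /* q0 */ { Q(0) | Q(3), Q(0) | Q(1) },
    /* q1 */ { 0,           Q(2)        },
    /* q2 */ { Q(2),        Q(2)        },
    /* q3 */ { Q(4),        0           },
    /* q4 */ { Q(4),        Q(4)        },
};
static const state_set final_states = Q(2) | Q(4);

/* delta extended to a set of states: union of delta(q, c) for q in S */
state_set step(state_set S, int c)
{
    state_set result = 0;
    for (int q = 0; q < NSTATES; q++) {
        if (S & Q(q)) {
            result |= delta[q][c];
        }
    }
    return result;
}

/* delta* extended to a set of states; s holds the characters 0 and 1 */
state_set delta_star(state_set S, const char *s)
{
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1') {
            return 0;                /* not in Sigma: no transition */
        }
        S = step(S, *s - '0');
    }
    return S;
}

/* accepted iff some state reached from q0 is final */
int accepts(const char *s)
{
    return (delta_star(Q(0), s) & final_states) != 0;
}
```

Here delta_star(Q(0), "010011") yields the
mask for {q0,q1,q2,q4}, which contains the
final states q2 and q4, so the string
is accepted.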


     DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton*
     (DFA) is an NFA in which
     delta(q,c) is a singleton or empty
        for all q in K and c in Sigma.


          IMPLEMENTING DFAS
          
How would you represent states?


How would you implement a DFA?
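
One common answer, given here only as a
sketch: number the states with small
integers and store delta in a table indexed
by state and input character. The example
DFA is for (0|1)*0 (even binary numerals):
state 0 is the start state, and state 1
means "the last character read was 0". The
names next_state, is_final, and dfa_accepts
are assumptions.

```c
/* next_state[q][c] for c in {0, 1} */
static const int next_state[2][2] = {
    /* state 0 */ { 1, 0 },
    /* state 1 */ { 1, 0 },
};
static const int is_final[2] = { 0, 1 };

/* Run the DFA on s; return 1 if it ends in a final state. */
int dfa_accepts(const char *s)
{
    int q = 0;                               /* start state */
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1') {
            return 0;                        /* no transition: reject */
        }
        q = next_state[q][*s - '0'];
    }
    return is_final[q];
}
```

Each input character costs one table lookup,
which is why DFAs make fast lexers.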


           PROBLEM

We want to specify lexical grammar
   using regular expressions (REs)

So we need to convert regular expressions
   into DFAs 


         CONVERTING REs TO NFAs

Definition based on grammar
   of Regular Expressions:

Result of Convert(M) looks like
 this:  -->(M q)
    where the "tail", -->,
                 goes to the start state
          and q is the "head state"
          and M is converted recursively

 assume also Convert(N) is --->(N q')

                 c
  Convert(c) =  --->[ q ]

  Convert(M | N) =
                 /--->(M q)--\
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \             /
                 \--->(N q')-/

  Convert(M N) =

           --> (M q) --> (N q')

                  emp
  Convert(emp) = -----> [ q ]

                       emp
                   /--------------\
                  /         emp   v
  Convert(M*) = -/ /->(M q)---->[ q2 ]
                  /              /
                  \   emp       /
                   \-----------/

  Convert((M)) = Convert(M)

  After conversion, make the "head state"
   be a final state

        EXAMPLE OF CONVERSION TO NFA

Regular expression:  (i|j)*

               i
 Convert(i) = --->[ qi ]

               j
 Convert(j) = --->[ qj ]

 Convert(i|j) = 
                   i
                 /--->(qi)---\ emp
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \  j          /
                 \--->(qj)---/ emp

 Convert((i|j)*) =


       CONVERTING AN NFA INTO A DFA

Idea:
  Convert each reachable set of NFA states
  into a single DFA state


How?

  Use the emp-closure of each state q
     = set of states reachable from q
       using emp

 Closure wrt emp:
  closure(S) is the smallest set T
  such that
     T = S +
        union {delta(s, emp) | s in T}
  
  can compute closure(S) as

     T <- S;
     do T2 <- T
        T <- T2 +
           union {delta(s, emp)| s in T2}
     while (T != T2)

 DFA Transitions:
  Let S be a set of states, then
    DFAdelta(S, c) =
       closure(union {delta(s,c)|s in S})

  
     EXAMPLE CONVERSION OF NFA TO DFA

NFA for if|[a-z]([a-z]|[0-9])*

                 f
           [q2] ---> [[q3]]
            ^                    emp
         i /                   /------\
          / emp       a-z     /       v
   -->[q1]------>[q4]----->[q5]    [[q8]]
                                     ^ |
                                  emp| |
                                     | |
                                a-z  | |
                             /-----\ | |
                            /      v / |
                        |->[q6]   [q7] |
                        |   \  0-9 ^   |
                    emp |    \----/    /
                        |             / 
                        \    emp     /
                         \----------/


Converted to DFA:

                   f
        [2,5,6,8] ---> [[3,6,7,8]]
          ^               \
      i  /                 \ a-z
        / a-h j-z      a-z  \ 0-9  
  -->[1,4]---->[[5,6,8]]0-9  v    /-|
                    \----->[[6,7,8]]|a-z
                                ^   |0-9
                                 \  |
                                  \-|
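
The DFA state labels above can be recomputed
with the closure and DFAdelta definitions.
The sketch below hard-codes the edges of the
example NFA (states q1 to q8) and represents
a set of states as a bit mask; the names ST,
delta_emp, delta_chr, closure, and dfa_delta
are assumptions made for the example.

```c
typedef unsigned state_set;        /* bit q set means state q is in the set */
#define ST(q) (1u << (q))
enum { MAX_STATE = 8 };

/* emp-edges of the NFA for if|[a-z]([a-z]|[0-9])* */
static state_set delta_emp(int q)
{
    switch (q) {
    case 1: return ST(4);
    case 5: return ST(8);
    case 7: return ST(8);
    case 8: return ST(6);
    default: return 0;
    }
}

/* edges labeled by the character c */
static state_set delta_chr(int q, int c)
{
    int lower = (c >= 'a' && c <= 'z');
    int digit = (c >= '0' && c <= '9');
    switch (q) {
    case 1: return c == 'i' ? ST(2) : 0;
    case 2: return c == 'f' ? ST(3) : 0;
    case 4: return lower ? ST(5) : 0;
    case 6: return (lower || digit) ? ST(7) : 0;
    default: return 0;
    }
}

/* smallest T containing S that is closed under emp-edges */
state_set closure(state_set S)
{
    state_set T = S, T2;
    do {
        T2 = T;
        for (int q = 1; q <= MAX_STATE; q++) {
            if (T2 & ST(q)) {
                T |= delta_emp(q);
            }
        }
    } while (T != T2);
    return T;
}

/* DFAdelta(S, c) = closure(union {delta(s, c) | s in S}) */
state_set dfa_delta(state_set S, int c)
{
    state_set moved = 0;
    for (int q = 1; q <= MAX_STATE; q++) {
        if (S & ST(q)) {
            moved |= delta_chr(q, c);
        }
    }
    return closure(moved);
}
```

Starting from closure(ST(1)), which is the
DFA start state [1,4], a step on i gives
[2,5,6,8], and a step on f from there gives
[3,6,7,8], matching the diagram.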


  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SSM assembler (in hw1 zip file)

   High-level description in
 asm_lexer.l

  === flex ===>

   generates:
      asm_lexer.c
       + asm_lexer.h

   Wrapper for lexer:
        lexer.h
           declares functions
        lexer.c
           errors_noted variable +
           some utility functions
        asm_lexer.l
           defines functions
             e.g., lexer_print_token
                   lexer_output

   (ASTs defined in ast.h)

 asm.y is the Bison grammar description file

  == bison ==>

   - Declarations in asm.tab.h
        includes ast.h
                 machine_types.h
                 parser_types.h
                     declares YYSTYPE
                 lexer.h
        declares
              yytokentype
                eolsym = ...
                minussym = ...
                dottextsym = ...
                ...
                   
   - Definitions in asm.tab.c
        defines yyparse()
                YYSTYPE yylval;


      STRUCTURE OF FLEX INPUT FILE

  /* ... definitions section ... */

%%
  /* ... rules section ... */
%%

  /* ... user subroutines ... */


    SECTIONS IN FLEX INPUT (.l file)

Definitions section:


Rules section:


User subroutine section:


        WHY CONTEXT-FREE PARSING?

Can we define a regular expression
   to make sure expressions have
   balanced parentheses?
   e.g., recognize: (34) and ((12)+(789))
         but not: (567))+82))))
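
A finite automaton has a fixed number of
states, so it cannot keep track of
arbitrarily deep nesting. The sketch below
shows the kind of extra memory involved: a
counter that can grow without a fixed bound
(a stack would serve the same purpose). The
function name balanced is an assumption.

```c
/* Return 1 if the parentheses in s are balanced, else 0.
   All characters other than ( and ) are skipped. */
int balanced(const char *s)
{
    long depth = 0;                  /* number of currently open parentheses */
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0) {
                return 0;            /* a ) with no matching ( */
            }
            depth--;
        }
    }
    return depth == 0;               /* every ( must have been closed */
}
```

This accepts (34) and ((12)+(789)) and
rejects (567))+82)))).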


          WHAT IS NEEDED

Needed for checking balanced parentheses:


           PARSING

Want to recognize language
   with


Goal:


             CONTEXT-FREE GRAMMARS
             
def: a *context-free* grammar (CFG)
     (N, T, P, S) has 
     a set (of nonterminals) N,
     a set (of terminals) T,
     a start symbol S in N and each
     production (in P) has the form:
        <M> -> g
     where <M> is a Nonterminal symbol,
        and g in (N+T)*

Example:


def: For a CFG (N,T,P,S),
     g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g',
     iff g is e <X> f, e and f in (N+T)*,
         g' is e h f,
         <X> is a nonterminal (in N),
         h in (N+T)*, and the rule
         <X> -> h is in P

Example:
       ( <Number> ) =P=> 


              DERIVATION     

def: a *derivation* of gm in T*
     from the rules P of G=(N,T,P,S),
     is a sequence (S, g1, g2, ..., gm),
     where for all i: gi in (N+T)*,
           S =P=> g1, and
           for all 1 <= j < m:
                 gj =P=> g(j+1)

Example:


      LEFTMOST DERIVATION
     
def: a *leftmost derivation* of gm in T*
     from a CFG (N,T,P,S) is
     a derivation of gm from (N,T,P,S)
     (g0, g1, ..., gm), where g0 = S,
     such that for all 0 <= j < m:
         when gj =P=> g(j+1)
         and the nonterminal <X> is
            replaced in gj,
         then there are no
              nonterminals to the left of
              <X> in gj.

Example:


         PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG
     (N,T,P,S) represents a derivation, D,
     iff:
       - Each interior node in Tr is
         labeled by a nonterminal (in N),
         and each leaf by a symbol in N+T
       - the root of Tr is labeled by
         the start symbol S
       - an arc from <N0> to h in (N+T)
          iff <N0> -> ... h ... in P
       - the order of children of a node
         labeled <N0> is the order in
         a production <N0> -> ... in P
         

             EXAMPLES

GRAMMAR:

  <expr> ::= <number> | <expr> + <expr>
          |  <expr> * <expr>

Derivations of 3*4+2


         EXAMPLE PARSE TREES

Corresponding to leftmost derivation:


Corresponding to rightmost derivation:


             AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees
     for t
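
Ambiguity matters because different parse
trees can mean different things. As an
illustration, the sketch below builds the
two parse trees that the grammar
<expr> ::= <number> | <expr> + <expr>
| <expr> * <expr> allows for 3*4+2 and
evaluates each one. The expr struct and the
eval function are hypothetical and much
simpler than the course's AST types.

```c
#include <stddef.h>

typedef struct expr {
    char op;                          /* '+', '*', or 'n' for a number leaf */
    int value;                        /* used only when op is 'n' */
    const struct expr *left, *right;  /* used only for '+' and '*' */
} expr;

/* Evaluate the expression tree rooted at e. */
int eval(const expr *e)
{
    switch (e->op) {
    case '+': return eval(e->left) + eval(e->right);
    case '*': return eval(e->left) * eval(e->right);
    default:  return e->value;
    }
}

static const expr three = { 'n', 3, NULL, NULL };
static const expr four  = { 'n', 4, NULL, NULL };
static const expr two   = { 'n', 2, NULL, NULL };

/* tree 1: the + is at the root, so this is (3*4)+2 */
static const expr mul_inner = { '*', 0, &three, &four };
const expr tree_mul_first = { '+', 0, &mul_inner, &two };

/* tree 2: the * is at the root, so this is 3*(4+2) */
static const expr add_inner = { '+', 0, &four, &two };
const expr tree_add_first = { '*', 0, &three, &add_inner };
```

Evaluating tree_mul_first gives 14, but
evaluating tree_add_first gives 18, so the
grammar alone does not fix the meaning.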


          FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to


Example:


    RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:
   <N> ::= g1 | g2 | ... | gm

 1. Write a 


 2. This function 


   EXAMPLE RECURSIVE-DESCENT RECOGNIZER

<Stmt> ::= if <Cond>
           then <Stmts>
           else <Stmts> end
        | begin <Stmts> end
        | print <number>
<Stmts> ::= <Stmt> { ; <Stmt> }
<Cond> ::= <number> == <number>

 #include "spl.tab.h"
 extern int yylex(void);

 // the lookahead token; call advance()
 // once before parsing starts
 yytoken_kind_t tok;

 void advance() { tok = yylex(); }

 void eat(yytoken_kind_t expected) {
    if (tok == expected) {
        advance();
    } else { /* ... report error */ }
 }

 void parseStmt(); // defined below

 void parseCond()
 {     eat(numbersym);
       eat(eqeqsym);
       eat(numbersym);
 }

 void parseStmts()
 {
    parseStmt();
    while (tok == semisym)
    {
       eat(semisym);
       parseStmt();
    }
    // tok is now the first token
    // after the statements
 }

 void parseStmt()
 {    switch (tok) {
      case ifsym:
         eat(ifsym);
         parseCond();
         eat(thensym);
         parseStmts();
         eat(elsesym);
         parseStmts();
         eat(endsym);
         break;
      case beginsym:
         eat(beginsym);
         parseStmts();
         eat(endsym);
         break;
      case printsym:
         eat(printsym);
         eat(numbersym);
         break;
      default:
         // report error
         break;
      }
 }


           LL(1) GRAMMARS

A recursive-descent parser must:


  - choose between alternatives
    (e.g., <N0> ::= <A> | <B>)

def: A grammar is *LL(1)* iff


           LR(1) GRAMMARS

An LR(1) parser needs to decide
  when to: shift (push token on stack) or
           reduce

  uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff


          LALR(1) Parsing

Smaller tables than LR(1)
  - merges states of the DFA
      if they only differ in lookahead


           PROBLEM: AMBIGUITY

Consider:

<S> ::= <ident> := <number>
     | if <E> then <S> <ME>
<ME> ::= <empty> | else <S>

and the input statement:

   if b1 then 
      if b2 then x := 2
    else x := 3

Is this parsed as:

     <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3
or as:

     <S>
    / /  \-----|--------\
  if <E> then <S>      <ME> 
      |        |\     /   \       
      b1      /| \   else <S> 
             / |  \      / | \    
             |  |  |      ...
             |  |  |     x := 3
             |  |  |
            if <E> then <S> <ME>
                |       /|\   \
                b2      ...   <empty>
                       x := 2

        FIXES FOR AMBIGUITY

Change the language:

 a. Always have an else clause:

   <S> ::= if <E> then <S> else <S>

   (use skip if you don't want to do anything)

 b. Use an end marker

   <S> ::= if <E> then <S> else <S> fi
        |  if <E> then <S> fi

Give precedence to one production:

   <S> ::= if <E> then <S> <ME>
   <ME> ::= else <S>  // priority!
         | <empty>

So we only get the parse tree:

       <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3
            

     BISON: A LALR(1) PARSER GENERATOR

Input:
    spl.y
     |
     |  bison spl.y
     v

Output:
    spl.tab.c   and   spl.tab.h

    yyparse (def.)    token decls.
    parse tables      extern decls.
    yylval               yyparse()
    (user code)          yylval

          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 /
                                / ASTs
                               /
                              |
                              v
               symbol <---- [ static
               table  ---->   analysis ]
                             /
                            /
                           v
                        [ code generator ]


 BISON AND FLEX, GENERATING A PARSER

idea:

                  ast.h (AST types)
                     |
             bison   v
             -----> spl.tab.c
            /        ^  yyparse (def.)
           / bison   |
  spl.y ----------> spl.tab.h
                     |  token enum.
                     |  (decl.)
              flex   v
  spl_lexer.l-----> spl_lexer.c
                        yylex (def.)


     HOW IT ALL FITS TOGETHER IN HW3

machine_types.h:

// ... 
typedef unsigned int address_type;
typedef unsigned char byte_type;
typedef int word_type;

file_location.h:

// location in a source file
typedef struct {
    const char *filename;
    unsigned int line; // of first token
} file_location;

ast.h:
// ...
#include "machine_types.h"
#include "file_location.h"

// types of ASTs (type tags)
typedef enum { block_ast, /* ... */
    token_ast
} AST_type;

// typedefs for the types N_t follow,
// where N is a nonterminal

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    void *next; // for lists
} generic_t;

typedef struct ident_s {
    file_location *file_loc;
    AST_type type_tag;
    struct ident_s *next; // for lists
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *text;
    word_type value;
} number_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *text;
    int code;
} token_t;

// ...

typedef struct block_s {
    file_location *file_loc;
    AST_type type_tag;
    const_decls_t const_decls;
    var_decls_t var_decls;
    proc_decls_t proc_decls;
    stmts_t stmts;
} block_t;

// ...
typedef union AST_u {
    generic_t generic;
    block_t block;
    const_decls_t const_decls;
    const_decl_t const_decl;
    const_def_list_t const_def_list;
    const_def_t const_def;
    var_decls_t var_decls;
    var_decl_t var_decl;
    ident_list_t ident_list;
    // ...
    expr_t expr;
    binary_op_expr_t binary_op_expr;
    token_t token;
    number_t number;
    ident_t ident;
    empty_t empty;
} AST;

// Return the file location from an AST
extern file_location *ast_file_loc(AST t);

// ...

// Return a pointer to a fresh copy of t
// that has been allocated on the heap
extern AST *ast_heap_copy(AST t);

// ...

extern block_t ast_block(
  token_t begin_tok,
  const_decls_t const_decls,
  var_decls_t var_decls,
  proc_decls_t proc_decls,
  stmts_t stmts);

// ...
extern ident_t ast_ident(
  file_location *file_loc,
  const char *name);

extern expr_t ast_expr_number(
  number_t e);

extern empty_t ast_empty(
  file_location *file_loc);
// ...
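
Every N_t struct above starts with the same
file_loc and type_tag fields as generic_t, so
those two fields of any AST can be read through
the generic member. The sketch below shows the
idea on a cut-down AST union with two members.
The function bodies are assumptions about how
ast_file_loc and ast_heap_copy could be written;
they are not taken from the HW3 sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

// location in a source file
typedef struct {
    const char *filename;
    unsigned int line;
} file_location;

// cut-down set of type tags, for illustration only
typedef enum { ident_ast, number_ast } AST_type;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
} generic_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *text;
    int value;
} number_t;

typedef union AST_u {
    generic_t generic;
    ident_t ident;
    number_t number;
} AST;

// Check at compile time that the shared prefix lines up
static_assert(offsetof(ident_t, type_tag)
              == offsetof(generic_t, type_tag),
              "ident_t must share generic_t's prefix");
static_assert(offsetof(number_t, type_tag)
              == offsetof(generic_t, type_tag),
              "number_t must share generic_t's prefix");

// Return the file location from an AST,
// whichever member of the union is in use
file_location *ast_file_loc(AST t) {
    return t.generic.file_loc;
}

// Return a pointer to a fresh copy of t
// that has been allocated on the heap
AST *ast_heap_copy(AST t) {
    AST *ret = malloc(sizeof(AST));
    if (ret == NULL) {
        fprintf(stderr, "ast_heap_copy: out of memory\n");
        exit(EXIT_FAILURE);
    }
    *ret = t;  // copies the whole union
    return ret;
}
```

A caller that fills in t.ident can then pass t to
ast_file_loc without knowing which kind of AST it
holds; the caller owns (and must free) the result
of ast_heap_copy.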

parser_types.h:

#include "ast.h"
typedef AST YYSTYPE;

spl.y (also spl.tab.h):

#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"

// more below

    CONNECTING THE PARSER AND THE ASTs

Parser model

   A stack of (terminals + nonterminals)
   A parallel stack of ASTs
   1 token of lookahead

   Steps in parsing (DFA decides to):

   - shift:
     1. push lookahead on parse stack
     2. push yylval on AST stack

OR

   - reduce using a rule written:

        nt : a b c { $$ = f($1,$2,$3); };

     1. take a,b,c off parse stack
     2. take aval,bval,cval off AST stack
           ntval = f(aval,bval,cval)
     3. push nt on parse stack
     4. push result (ntval) on AST stack
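
The shift and reduce steps above can be sketched
in C as follows. This is a simplified model, not
bison's generated code: the generated parser
stacks DFA states rather than grammar symbols,
and its values are ASTs rather than ints. The
names parse_stack, ast_stack, top, and parse_abc
are made up for this illustration, and f simply
adds its arguments.

```c
#include <assert.h>

#define STACK_MAX 16

// symbols of the rule  nt : a b c
typedef enum { sym_a, sym_b, sym_c, sym_nt } symbol;

// stand-in for YYSTYPE (an AST in the real parser)
typedef int value_t;

symbol parse_stack[STACK_MAX];  // terminals + nonterminals
value_t ast_stack[STACK_MAX];   // parallel stack of values
int top = 0;                    // entries on both stacks

// shift: push the lookahead and its value (yylval)
void shift(symbol lookahead, value_t yylval) {
    assert(top < STACK_MAX);
    parse_stack[top] = lookahead;
    ast_stack[top] = yylval;
    top++;
}

// stand-in for f in the action { $$ = f($1,$2,$3); }
value_t f(value_t aval, value_t bval, value_t cval) {
    return aval + bval + cval;
}

// reduce using the rule  nt : a b c
void reduce_nt(void) {
    assert(top >= 3);
    assert(parse_stack[top - 3] == sym_a);
    assert(parse_stack[top - 2] == sym_b);
    assert(parse_stack[top - 1] == sym_c);
    // steps 1 and 2: take a,b,c and their values off
    value_t aval = ast_stack[top - 3];
    value_t bval = ast_stack[top - 2];
    value_t cval = ast_stack[top - 1];
    top -= 3;
    value_t ntval = f(aval, bval, cval);
    // steps 3 and 4: push nt and ntval
    parse_stack[top] = sym_nt;
    ast_stack[top] = ntval;
    top++;
}

// Parse the input "a b c" with the given token values
// and return the value left on the AST stack for nt
value_t parse_abc(value_t aval, value_t bval, value_t cval) {
    top = 0;
    shift(sym_a, aval);
    shift(sym_b, bval);
    shift(sym_c, cval);
    reduce_nt();
    return ast_stack[top - 1];
}
```

After parse_abc returns, both stacks hold exactly
one entry: nt on the parse stack and ntval on the
AST stack.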


  CONNECTION WITH ASTs IN GRAMMAR FILE

 /* $Id: spl.y ... */

%code requires {
#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"
  /* ... */
}
 /* ... */
%token <ident> identsym
%token <number> numbersym
%token <token> plussym    "+"
%token <token> minussym   "-"
%token <token> multsym    "*"
%token <token> divsym     "/"
%token <token> periodsym  "."
%token <token> semisym    ";"
%token <token> eqsym      "="
%token <token> commasym   ","
%token <token> becomessym ":="
%token <token> lparensym  "("
%token <token> rparensym  ")"
%token <token> constsym   "const"
%token <token> varsym     "var"
 /* ... */

%type <block> program
%type <block> block
%type <const_decls> constDecls
%type <const_def> constDef
%type <var_decls> varDecls
%type <var_decl> varDecl
%type <idents> idents
%type <proc_decls> procDecls
%type <empty> empty
 /* ... */
%type <expr> expr
%type <expr> term
%type <expr> factor

%start program
/* ... */

  PUTTING A TOKEN VALUE ON THE AST STACK

To put yylval on the AST Stack
  when the field name is "token"

in the .y file have:
  %token <token> somesym "some"    

the generated parser has:

#include "ast.h"
#include "parser_types.h" 
   // typedef AST YYSTYPE;
   
  YYSTYPE yyvsa[];  // the AST stack

  yyvsa[yyi].token = yylval.token;
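
As a sketch of both sides of this hand-off: the
lexer fills in the token field of yylval, the
shift copies yylval onto the AST stack, and an
action reads the same field back. The helper
names set_token_value, push_yylval, and top_token
are hypothetical, and the AST union is cut down
to two members; yyvsa and yyi follow the naming
used above rather than bison's real variables.

```c
#include <assert.h>

typedef struct {
    const char *filename;
    unsigned int line;
} file_location;

typedef struct {
    file_location *file_loc;
    const char *text;
    int code;
} token_t;

typedef struct {
    file_location *file_loc;
    const char *name;
} ident_t;

typedef union AST_u {
    token_t token;
    ident_t ident;
} AST;

typedef AST YYSTYPE;

#define VSA_MAX 16

YYSTYPE yylval;          // set by the lexer
YYSTYPE yyvsa[VSA_MAX];  // the AST stack
int yyi = 0;             // next free slot in yyvsa

// Lexer side (hypothetical helper): record the value
// of a token declared with  %token <token> ...
void set_token_value(int code, const char *text,
                     file_location *loc) {
    yylval.token.file_loc = loc;
    yylval.token.text = text;
    yylval.token.code = code;
}

// Parser side: what a shift does with the value
void push_yylval(void) {
    assert(yyi < VSA_MAX);
    yyvsa[yyi].token = yylval.token;
    yyi++;
}

// Parser side: how an action reads $n for a symbol
// whose declared field name is "token"
token_t top_token(void) {
    assert(yyi > 0);
    return yyvsa[yyi - 1].token;
}
```

The field name in angle brackets only selects
which member of the union is read or written; the
stack itself always holds whole YYSTYPE values.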


PUSHING AST FOR A NONTERMINAL ON AST STACK

To put the value computed for $$ on the
  AST stack when the field name is "const_def"

in the .y file have:
  %type <const_def> constDef

the generated parser has:

#include "ast.h"
#include "parser_types.h"
   // typedef AST YYSTYPE;

  YYSTYPE yyvsa[];  // the AST stack
  YYSTYPE yyval;    // holds $$ during a reduce

  yyvsa[yyi].const_def = yyval.const_def;


    THE CONST LANGUAGE

programs all look like:

    const ident = 3402

      ASTs FOR THE CONST LANGUAGE

/* $Id: ast.h ... */
/* ... */

// types of ASTs (type tags)
typedef enum {
    const_def_ast, token_ast,
    number_ast, ident_ast
} AST_type;

typedef struct {
    file_location *file_loc;
} generic_t;

typedef struct ident_s {
    file_location *file_loc;
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    word_type value;
} number_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    int code;
} token_t;

typedef struct const_def_s {
    file_location *file_loc;
    ident_t ident;
    number_t number;
} const_def_t;

typedef union AST_u {
    generic_t generic;
    const_def_t const_def;
    token_t token;
    number_t number;
    ident_t ident;
} AST;

extern const_def_t ast_const_def(
  ident_t ident, number_t number);

// ...


    THE CONST.Y FILE FOR CONST LANGUAGE

 /* ... */
%code requires {
#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"
 /* ...*/
}
 /* ...*/
%token <token> constsym   "const"
%token <ident> identsym
%token <token> eqsym      "="
%token <number> numbersym

%type <const_def> program
%type <const_def> constDef

%start program

%code {
extern int yylex(void);
const_def_t progast;
extern void setProgAST(const_def_t t);
}

%%

program : constDef { setProgAST($1); } ;

constDef : "const" identsym "=" numbersym
           { $$ = ast_const_def($2,$4); };

%%

// Set the program's ast to be t
void setProgAST(const_def_t t) {
  progast = t;
}
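
The action { $$ = ast_const_def($2,$4); } relies
on a definition of ast_const_def that is only
declared in ast.h above. One way it could be
written is sketched below, using the const
language types. The body is an assumption (in
particular, taking the location from the
identifier is a guess), and build_example is a
hypothetical stand-in for what the two grammar
actions do on the input "const ident = 3402".

```c
#include <assert.h>
#include <stddef.h>

typedef int word_type;

typedef struct {
    const char *filename;
    unsigned int line;
} file_location;

typedef struct ident_s {
    file_location *file_loc;
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    word_type value;
} number_t;

typedef struct const_def_s {
    file_location *file_loc;
    ident_t ident;
    number_t number;
} const_def_t;

// Return an AST for a const definition
// (assumed body: location comes from the identifier)
const_def_t ast_const_def(ident_t ident, number_t number) {
    const_def_t ret;
    ret.file_loc = ident.file_loc;
    ret.ident = ident;
    ret.number = number;
    return ret;
}

const_def_t progast;

// Set the program's ast to be t
void setProgAST(const_def_t t) {
    progast = t;
}

// By hand, what the parser's actions compute
// for the input:  const ident = 3402
void build_example(file_location *loc) {
    assert(loc != NULL);
    ident_t id = { loc, "ident" };         // $2
    number_t num = { loc, "3402", 3402 };  // $4
    setProgAST(ast_const_def(id, num));
}
```

After build_example runs, progast.ident.name is
"ident" and progast.number.value is 3402, which
is the information later compiler phases need
from this program.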