LANGUAGES

def: A *language* is
   a set of strings of characters
   (from some alphabet)

Characters are:
 For a human language:
   alphabetical characters
     (a, b, c, ...)
 For a computer (language):
   codes in some character set
   (ASCII characters or Unicode)


          LANGUAGE CLASSES

Languages can be classified by
   the kind of grammar needed
   to recognize them

Venn Diagram:

 |--------------------------------------|
 | Type 0 Languages                     |
 |                                      |
 |                                      |
 |  |--------------------------------|  |
 |  | Context-sensitive Languages    |  |
 |  |                                |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Context-free             |  |  |
 |  |  |   Languages              |  |  |
 |  |  |                          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  |  Regular Languages |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|


          PHASES OF A COMPILER

Programs allowed by a compiler's:

 |--------------------------------------|
 | Lexical Analysis (Lexer)             |
 |    uses a regular grammar            |
 |                                      |
 |  |--------------------------------|  |
 |  | Parser                         |  |
 |  |   uses a context-free grammar  |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Static Analysis          |  |  |
 |  |  |   uses techniques        |  |  |
 |  |  |   equivalent to context- |  |  |
 |  |  |    sensitive grammars    |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  |  Runtime checks    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|

    3*4+2
       (3*4) + 2

       3  * (4+2)


       PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

Language
 Def.docx  -- coding --> parser.c (v. 1)

 Def2.docx -- coding --> parser.c (v. 2)
           -- coding --> parser.c (v. 3)

 Def3.docx -- coding --> parser.c (v. 4)
 Def4.docx -- coding --> parser.c (v. 5)
   ...                    ...
 DefN.docx -- coding --> parser.c (v. N)

Disadvantages:
  - can be slow to write the parser
  - can be error prone
  - hard to verify
  - experiments with language
     are costly
  - code might not be very efficient


 COMPUTER SCIENCE SOLUTION: AUTOMATION

 high-level
 description  tool      generated code

   lang.y  -- bison --> lang.tab.c
                        + lang.tab.h
   lang2.y -- bison --> lang.tab.c
                        + lang.tab.h
              ...
   langN.y -- bison --> lang.tab.c
                        + lang.tab.h

Advantages:
  - faster cycle time (better for humans)
  - grammar is easier to check/verify
  - experiments/changes are easy/cheap
  - code generated is very efficient


        GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions
 of languages/parsers

def: a *grammar* consists of a finite set
     of rules (called "productions")
     and a start symbol (a nonterminal).

     Let V = nonterminals + terminals
     
     The rules have the form

            V+ -> V*

     where there is no symbol in both
        nonterminals and terminals


def: The language generated by a grammar G
     with set of productions P is:

     {w | w is in terminals* and S =>* w}
       where S is the start symbol of G,
       gAd => gBd iff g in V* and d in V*
                      and A -> B is a rule in P,
       and j =>* h iff either h = j
                    or j => i and i =>* h
                       for some i in V*


         BNF NOTATION FOR GRAMMARS

::=  means what -> means in grammars,
           "can become"

 |   means "or"

<X>  is a nonterminal symbol

     (terminal symbols don't have angle brackets)


Example

 <BNumber> ::= <BDigit> <BNumber>
           |  <BDigit>

 <BDigit> ::= 0 | 1

Example:
    <BNumber> =>* 1101

    <BNumber> -> <BDigit> <BNumber>
              -> 1 <BNumber>
	      -> 1 <BDigit> <BNumber>
	      -> 1 1 <BNumber>
	      -> 1 1 <BDigit> <BNumber>
	      -> 1 1 0 <BNumber>
              -> 1 1 0 <BDigit>
              -> 1 1 0 1
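
A sketch in C of a recognizer for the <BNumber>
grammar above. The function names is_bdigit and
is_bnumber are made up for this illustration; each
branch mirrors one alternative of a rule.

```c
#include <stdbool.h>

/* <BDigit> ::= 0 | 1 */
static bool is_bdigit(char c) {
    return c == '0' || c == '1';
}

/* <BNumber> ::= <BDigit> <BNumber> | <BDigit>
 * Returns true iff s can be derived from <BNumber>. */
bool is_bnumber(const char *s) {
    if (!is_bdigit(s[0])) {
        return false;           /* both alternatives start with <BDigit> */
    }
    if (s[1] == '\0') {
        return true;            /* second alternative: a single <BDigit> */
    }
    return is_bnumber(s + 1);   /* first alternative: <BDigit> <BNumber> */
}
```

For example, is_bnumber("1101") follows the same
steps as the derivation of 1101 shown above.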


         GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing
  two games:

   - A production game
      (Can you produce this string?)
   - A recognition/parsing game
      (Is this string in the language?)
      

            PRODUCTION GAME

Goal: produce a string in the language
        from the start symbol

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Can we produce "Johnny is good"?

     <Sentence> -> <Noun> <Verb> <Adj>
                -> Johnny <Verb> <Adj>
		-> Johnny is <Adj>
		-> Johnny is good


        RECOGNITION OR PARSING GAME

Goal: determine if a string is in
      the language of the grammar

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Is "Johnny is good" in the language of this grammar?

		Johnny is good
		<- Johnny is <Adj>
                <- Johnny <Verb> <Adj>
                <- <Noun> <Verb> <Adj>
                <- <Sentence>

      "Charlie can be difficult"
            <- Charlie can be <Adj>
	    <- Charlie <Verb> <Adj>
	    <- <Noun> <Verb> <Adj>
	    <- <Sentence>


         DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes
     connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G
     is a tree such that:
        - Every node has a label that is
          a symbol of G
        - The root is labeled by
          the start symbol of G
        - Every node with a
          direct descendant is labeled by
          a nonterminal
        - If the descendants of a node
          labeled by N have the following
          labels (in order):
             A, B, C, ..., K
          then G has a production of form
             N -> A B C ... K

        EXAMPLE DERIVATION TREE

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

String to parse: "Johnny is good"

         <Sentence>
         /    |   \
        /     |    \
       v      v     v
    <Noun> <Verb> <Adj>
      |       |     |
      |       |     |
      v       v     v
    Johnny   is    good

        EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:

    { x }  means 0 or more repeats of x

   <N> ::= { <Z> }

   is equivalent to:

   <N> ::= <Z-seq>
   <Z-seq> ::= <empty> | <Z-seq> <Z>
   <empty> ::= 

   {<Z>} is also written as:
       <Z>*  or  [ <Z> ] ...

One-or-more repeats:

     x+  means 1 or more repeats of x

   <N> ::= <X>+

   is equivalent to:

   <N> ::= <Xs>
   <Xs> ::= <X> | <Xs> <X>

   <X>+ is sometimes written as <X> ...
                   or <X> [ <X> ] ...

Optional element:

    [ x ]  means 0 or 1 occurrences of x

     <N> ::= [ <Y> ]

     is equivalent to:

     <N> ::= <Y-opt>
     <Y-opt> ::= <empty> | <Y>

    READING A BNF GRAMMAR

Example rules:

  <DecimalConstant> ::=
	<NonZeroDigit> <Digits>
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>
  <Digits> ::= <Digit> | <Digit> <Digits>

Using EBNF:

  <DecimalConstant> ::= <NonZeroDigit> <Digit>+
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>

Better (allowing single digit <DecimalConstant>s):

  <DecimalConstant> ::= <NonZeroDigit> {<Digit>}
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>
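
The EBNF rule with {<Digit>} maps directly onto a
loop. A sketch in C, with hypothetical function
names chosen for this example:

```c
#include <stdbool.h>

/* <NonZeroDigit> ::= 1 | 2 | ... | 9 */
static bool is_nonzero_digit(char c) {
    return c >= '1' && c <= '9';
}

/* <Digit> ::= 0 | <NonZeroDigit> */
static bool is_any_digit(char c) {
    return c == '0' || is_nonzero_digit(c);
}

/* <DecimalConstant> ::= <NonZeroDigit> {<Digit>} */
bool is_decimal_constant(const char *s) {
    if (!is_nonzero_digit(*s)) {
        return false;
    }
    s++;
    while (is_any_digit(*s)) {  /* {<Digit>}: zero or more repeats */
        s++;
    }
    return *s == '\0';          /* the whole string must be used up */
}
```

With this rule "7" and "1200" are accepted, while
"012" is rejected because of its leading zero.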


      EBNF GRAMMAR FOR (SUBSET OF) PL/0

<program> ::= <block> .
<block> ::= <const-decls>
            <var-decls>
            <proc-decls>
            <stmt>
<const-decls> ::= {<const-decl>}
<const-decl> ::= const <const-def>
                   {<comma-const-def>} ;
<const-def> ::= <ident> = <number>
<comma-const-def> ::= , <const-def>
<var-decls> ::= {<var-decl>}
<var-decl> ::= var <idents> ;
<idents> ::= <ident> {<comma-ident>}
<comma-ident> ::= , <ident>

<proc-decls> ::= {<proc-decl>}
<proc-decl> ::= procedure <ident> ; <block> ;
<stmt> ::= <ident> := <expr>
       | call <ident>
       | begin <stmt> {<semi-stmt>} end
       | if <condition> then <stmt> <else-opt>
       | while <condition> do <stmt>
       | read <ident>
       | write <ident>
       | skip
<semi-stmt> ::= ; <stmt>
<else-opt> ::= <empty> | else <stmt>
<empty> ::=
<condition> ::= odd <expr>
           | <expr> <rel-op> <expr>


          EXAMPLES IN PL/0

Shortest program:
  skip.

Factorial program:

  var n, res; # input and result
  procedure fact;
  begin
     read n;
     res := 1;
     while (n <> 0) do
       begin
          res := res * n;
          n := n-1
       end;
     write res
  end;
  call fact.


       LEXICAL ANALYSIS
           MOTIVATION

 # $Id$\n   .text start\nstart:\tADDI ...

Want to:
   - write a parser at a high level
      (not individual characters)
   - ignore comments, whitespace
   - have parser be efficient
   - deal with bad characters sensibly

Approach:
   - break the input (stream of chars)
      into tokens
   tokens = chunks of input

    enum name    characters
    
   dottextsym    .text
   identsym      start
   identsym      start
   colonsym      :
   addiopsym     ADDI
   ...
   

            LEXICAL ANALYSIS

Lexical means relating to the words
  of a language


       GOALS OF LEXICAL ANALYSIS

- Simplify the parser,
  so it need not handle:
    - white space and comments
    - details of tokens

- Recognize the longest match
  Why?
     if8 = b;


- Handle every possible character of input
  Why?
   - so that we completely check a program
   - don't crash on any input


         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers
 are tokens:
       ifsym
       numbersym
   and input
       if8

What tokens should "if8" match?
   want it to be an ident


Fixing such situations:
   tell programmers to use whitespace or punctuation
      to separate tokens
   otherwise use longest match

    elseif


           WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?
          one token (lessequalsym)

If the input is "<8", what token(s)?
          two tokens (lesssym, numbersym)

If the input is "if", what token(s)?
          one token (ifsym, not an identsym)

If the input is "//", what token(s)?
          (nothing, start of a comment)

Summary:
   favor the longest string we can make into a token
   favor the reserved words over identifiers

    IF IF=THEN     /* keyword at the beginning */
    THEN THEN=ELSE
    ELSE IF=THEN=ELSE
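
The two summary rules (longest match, then reserved
words over identifiers) can be sketched in C as
follows. This is only an illustration: scan_word and
the token name noword are assumptions, and only the
reserved word "if" is handled.

```c
#include <ctype.h>
#include <string.h>

typedef enum { ifsym, identsym, noword } word_token;

/* Scans one word at the start of s and stores the number of
 * characters it uses in *len (0 if s does not start a word). */
word_token scan_word(const char *s, size_t *len) {
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        /* longest match: keep going while the word can be extended */
        while (isalnum((unsigned char)s[n]) || s[n] == '_') {
            n++;
        }
    }
    *len = n;
    if (n == 0) {
        return noword;
    }
    /* only now compare against reserved words, so "if8" stays an ident */
    if (n == 2 && strncmp(s, "if", 2) == 0) {
        return ifsym;
    }
    return identsym;
}
```

Checking for reserved words after finding the longest
match is what makes "if8" an identsym and "if" an ifsym.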


          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 / 
                           AST  / abstract
                               /  syntax
                              v   trees
                            [ static
                              analysis ]
                             / 
                            /  IR
                           v
                         [ code generator ]

For the Lexer we want to:
  - specify the tokens
    using regular expressions (REs)
  - convert REs to DFAs to execute them
     but easy conversions are:
        - REs to NFAs
        - NFAs to DFAs


        HOW PARSER WORKS WITH LEXER

Coroutine structure:

   Parser calls lexer:
   
          tok = yytoken(); // call lexer
           /* ... use yylval ... */

   Lexer function (yytoken)
       remembers pointer to input stream

         returns next token (int code)

   Parser works...
   
   Parser calls lexer again:
   
          tok = yytoken(); // call lexer
           /* ... use yylval ... */
     

 BISON AND FLEX, GENERATING A PARSER

idea:

   ast.h (AST types)
    |
    |        bison
    |        -----> g.tab.c
    |       /        ^  yyparse function
    v      / bison   |  
    g.y file -----> g.tab.h  
                     |  tokens
              flex   v  defs. 
    g.l file -----> g.c       
                      yytoken function


             DEFINITIONS

The lexical grammar of a language
   is designed to be regular, because
    regular languages can be recognized
     very quickly/efficiently

def: a grammar is *regular* iff
     all its productions have one of these
     forms:

           <B> ::= c <D>
         or
           <B> ::= c

     where c is a terminal symbol,
       and <B> and <D> are nonterminals

def: a language is *regular* iff
     it can be defined using a
     regular grammar.

Thm: Every regular language can be
     recognized by a finite automaton.

Thm: Every regular language can be
     specified by a regular expression.

            REGULAR EXPRESSIONS

The language of regular expressions:

  <RE> ::= <char>
       | <RE> '|' <RE>
       | <RE> <RE>
       | emp            
       | <RE>*
       | ( <RE> )

 where <char> is a character

Examples:

  RE              meaning (language)
====================================
  emp             the empty string

  (0|1)*0         even binary numerals

 b*(abb*)*(a|emp) a's and b's without
                  consecutive a's

   abba
   b
   bbbb
   ababababbba


     EXTENSIONS TO REGULAR EXPRESSIONS

[abcd]   means   a|b|c|d

[h-m]    means   h|i|j|k|l|m

      so [a-z] is all lower case letters in ASCII
      
 x?      means   x|emp

 y+      means   y(y*)


           FOR YOU TO DO

Write a regular expression that describes:

1. The reserved word "if"
     
    if

2. The set of all (positive)
   decimal numbers

    [0-9][0-9]*
    [0-9]+

    [1-9][0-9]*
    [1-9]([0-9]*)

3. The set of all possible identifiers
   with underbars (_)

    [a-zA-Z0-9_]+   
      FOO
      F_B
      I10
      897G

   Identifiers that must start with a letter or underscore:
      [_a-zA-Z][a-zA-Z0-9_]*

   Identifiers that start with c, i, or j:
     [cij][a-zA-Z0-9_]*   c
                          court
			  c85
			  k9

 4. The reserved word "int" in C

    int

 5. Signed integer literals as in C

    [-+]?(([1-9][0-9]*)|(0[0-7]*)|(0[xX][0-9a-fA-F]+))

    6 -5

(If you have time, write these as
 regular grammars also.)


   EXAMPLE REGULAR EXPRESSIONS FOR PL/0

PATTERN         RE
=========================
DECDIGIT	[0-9]
LETTER		[_a-zA-Z]
LETTERORDIGIT   {LETTER}|{DECDIGIT}
IDENT           {LETTER}{LETTERORDIGIT}*

...

        FINITE AUTOMATA
            EXAMPLE

State diagram:

             <         =
    -->[q0] --->[[q1]]---> [[q2]]
                  |    
                  | >
                  v
                [[q3]]


      PSEUDO CODE FOR THIS LEXER CASE

    char c;
    get next char into c;
    if (c is '<')
    then
       get the next char into c
       if (c is '=')
       then
           return leqsym
       else if (c is '>')
            then
               return neqsym
            else
	       unget the char c
	       return lesssym
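
The pseudo code above could be written in C roughly
as follows. To keep the sketch self-contained it
reads from a string through a pointer instead of an
input stream, so "unget" becomes "do not advance".
The names scan_less and norelop are assumptions.

```c
typedef enum { lesssym, leqsym, neqsym, norelop } rel_token;

/* Reads one token starting at *pp and advances *pp past
 * exactly the characters that belong to that token. */
rel_token scan_less(const char **pp) {
    const char *p = *pp;
    rel_token result;
    if (*p != '<') {
        return norelop;         /* no '<': nothing is consumed */
    }
    p++;                        /* now in state q1 */
    if (*p == '=') {
        p++;
        result = leqsym;        /* state q2 */
    } else if (*p == '>') {
        p++;
        result = neqsym;        /* state q3 */
    } else {
        result = lesssym;       /* "unget": leave *p for the next token */
    }
    *pp = p;
    return result;
}
```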


        DEALING WITH COMMENTS

Suppose
  /  is used for division
  // starts a comment to end of line
      (unlike PL/0!)

What state diagram?
                                [^\n]
                                 __
             '/'           '/'   \v
    --> [q0] ---> [[qdiv]] ---> [q2]
          ^                      /
          \---------------------/
                  \n


How does whitespace fit in?
    read those chars and go back to state q0 (start state)
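
One way to code this part of the state diagram in C
is sketched below. The function name skip_ignored is
hypothetical, and the input is assumed to be a
NUL-terminated string rather than a stream.

```c
#include <ctype.h>

/* Returns a pointer to the first character of s that is not
 * whitespace and not inside a "//" comment. */
const char *skip_ignored(const char *s) {
    for (;;) {
        if (isspace((unsigned char)*s)) {
            s++;                        /* whitespace: stay in q0 */
        } else if (s[0] == '/' && s[1] == '/') {
            s += 2;                     /* q0 -> qdiv -> q2 */
            while (*s != '\0' && *s != '\n') {
                s++;                    /* loop in q2 on [^\n] */
            }
        } else {
            return s;   /* a lone '/' is left alone: it is division */
        }
    }
}
```

The newline that ends a comment is consumed as
ordinary whitespace on the next trip through the loop.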


TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to parser

Can always get a non-ignored token:
    have lexer keep going until it finds a
      non-ignored token

Return "tokens" that include ignored stuff
  to a loop that ignores them


Giant DFA that goes back to start state
  on seeing something to ignore


     NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton*
    (NFA) over an alphabet Sigma
    is a system (K, Sigma, delta, q0, F)
    where K is a finite set (of states),
          Sigma is a finite set
            (the input alphabet),
          delta is a map of type
             (K, Sigma) -> Sets(K)
          q0 in K is the initial state, &
          F is a subset of K
            (the final/accepting states).


     TRANSITION FUNCTION AND ACCEPTANCE

  p in delta(q,x)
     means that in state q,
       on input x the next state
       can be p

  p in delta*(q,s)
     where s in Sigma*
      is defined by:
              delta*(q,emp) = {q}
         (i.e., q in delta*(q,emp))
         delta*(q,xa) =
            union of delta*(q2, a)
              for all q2 in delta(q,x)
            where x in Sigma and
                  a in Sigma*

Lemma: for all c in Sigma,
   delta*(q, c) = delta(q, c)

def: An NFA (K, Sigma, delta, q0, F)
     *accepts* a string s in Sigma* iff
     there is some q in delta*(q0,s)
                 such that q in F


                  EXAMPLE NFA

         0,1                     0,1
        /---\                   /---\
        \   /                   \   /
         | v   0           0     | v
    -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]]
         |
         | 1
         v
       [ q1 ]
         |
         | 1
         v
      [[ q2 ]]
         | ^
        /   \
        \---/
         0,1

 K = {q0,q1,q2,q3,q4}
 Sigma = {0,1}
 q0 is start state
 F = {q2,q4}

 delta(q0,0) = {q0,q3}
 delta(q1,0) = {}
 delta(q2,0) = {q2}
 delta(q3,0) = {q4}
 delta(q4,0) = {q4}
 delta(q0,1) = {q0,q1}
 delta(q1,1) = {q2}
 delta(q2,1) = {q2}
 delta(q3,1) = {}
 delta(q4,1) = {q4}


   EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to
   sets of states
     d(Q,x) = union of d(q,x)
                for all q in Q

     so d({},x) = {}
        d({q},x) = d(q,x)
        d({q1,q2},x) = d(q1,x)+d(q2,x)
        d({q1,q2,q3},x) =
               d(q1,x)+d(q2,x)+d(q3,x)
        etc.


Note that delta*({q}, x)
       = delta*(q, x)
       = delta(q, x)


             EXAMPLE

delta*(q0,010011)
    = delta*(delta(q0,0),10011)
    = delta*({q0,q3},10011)
    = delta*(q0,10011)
      + delta*(q3,10011)
    = delta*(delta(q0,1),0011)
      + delta*(delta(q3,1),0011)
    = delta*({q0,q1},0011)
      + delta*({},0011)
    = delta*({q0,q1},0011)
    = delta*(delta(q0,0),011)
      + delta*(delta(q1,0),011)
    = delta*({q0,q3},011)
      + delta*({},011)
    = delta*({q0,q3},011)
    = delta*(delta(q0,0),11)
      + delta*(delta(q3,0),11)
    = delta*({q0,q3},11)
      + delta*({q4},11)
    = delta*({q0,q3,q4},11)
    = delta*(delta(q0,1),1)
      + delta*(delta(q3,1),1)
      + delta*(delta(q4,1),1)
    = delta*({q0,q1},1)
      + delta*({},1)
      + delta*({q4},1)
    = delta*({q0,q1,q4},1)
    = delta(q0,1)
      + delta(q1,1)
      + delta(q4,1)
    = {q0,q1}
      + {q2}
      + {q4}
    = {q0,q1,q2,q4}
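
The same computation can be done mechanically by
keeping the current set of states as a bit mask.
The C sketch below encodes the example NFA; the
names stateset, Q, nfa_step, nfa_run and nfa_accepts
are invented for this illustration.

```c
#include <stdbool.h>

enum { NFA_STATES = 5 };        /* q0 .. q4 */
typedef unsigned stateset;      /* bit i set means q_i is in the set */

#define Q(i) (1u << (i))

/* nfa_delta[q][c] for c in {0,1}, copied from the example NFA */
static const stateset nfa_delta[NFA_STATES][2] = {
    /* q0 */ { Q(0) | Q(3), Q(0) | Q(1) },
    /* q1 */ { 0,           Q(2)        },
    /* q2 */ { Q(2),        Q(2)        },
    /* q3 */ { Q(4),        0           },
    /* q4 */ { Q(4),        Q(4)        },
};
static const stateset nfa_final = Q(2) | Q(4);

/* delta extended to sets: union of delta(q, c) for all q in set */
static stateset nfa_step(stateset set, int c) {
    stateset next = 0;
    for (int q = 0; q < NFA_STATES; q++) {
        if (set & Q(q)) {
            next |= nfa_delta[q][c];
        }
    }
    return next;
}

/* delta*(q0, s); s must contain only the characters '0' and '1' */
stateset nfa_run(const char *s) {
    stateset cur = Q(0);
    for (; *s != '\0'; s++) {
        cur = nfa_step(cur, *s - '0');
    }
    return cur;
}

/* accepts iff some state in delta*(q0, s) is final */
bool nfa_accepts(const char *s) {
    return (nfa_run(s) & nfa_final) != 0;
}
```

For the input 010011 this yields the set
{q0,q1,q2,q4}, which contains final states,
so the string is accepted.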


     DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton*
     (DFA) is an NFA in which
     delta(q,c) is a singleton or empty
        for all q in K and c in Sigma.


          IMPLEMENTING DFAS
          
How would you represent states?

    ints

How would you implement a DFA?

    could use a switch
       or a 2D array (state x input char)
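
A sketch of the 2D-array approach in C, using a DFA
for the earlier RE (0|1)*0 (even binary numerals).
The state and class names are assumptions, and a dead
state is added so every input character is handled.

```c
#include <stdbool.h>

enum { S_ODD = 0, S_EVEN = 1, S_DEAD = 2, DFA_STATES = 3 };
enum { C_ZERO = 0, C_ONE = 1, C_OTHER = 2, DFA_CLASSES = 3 };

/* dfa_next[state][input class] */
static const int dfa_next[DFA_STATES][DFA_CLASSES] = {
    /* S_ODD  */ { S_EVEN, S_ODD,  S_DEAD },
    /* S_EVEN */ { S_EVEN, S_ODD,  S_DEAD },
    /* S_DEAD */ { S_DEAD, S_DEAD, S_DEAD },
};

/* Maps a character to its column in the table. */
static int char_class(char c) {
    if (c == '0') return C_ZERO;
    if (c == '1') return C_ONE;
    return C_OTHER;
}

bool is_even_binary(const char *s) {
    int state = S_ODD;          /* start state: last char was not '0' */
    for (; *s != '\0'; s++) {
        state = dfa_next[state][char_class(*s)];
    }
    return state == S_EVEN;     /* S_EVEN is the only accepting state */
}
```

Mapping characters to a few classes first keeps the
table small; indexing by raw character code would
also work but needs one column per character.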


           PROBLEM

We want to specify lexical grammar
   using regular expressions

So we need to convert regular expressions
   into DFAs to run them


         CONVERTING REs TO NFAs

Definition based on grammar
   of Regular Expressions:

Result of Convert(M) looks like
 this:  -->(M q)
    where the "tail", -->,
                 goes to the start state
          and q is the "head state"
 assume also Convert(N) is --->(N q')

                 c
  Convert(c) =  --->[ q ]

  Convert(M | N) =
                 /--->(M q)--\
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \             /
                 \--->(N q')-/

  Convert(M N) =

           --> (M q)-->(N q')

                  emp
  Convert(emp) = -----> [ q ]

                       emp
                   /--------------\
                  /         emp   v
  Convert(M*) = -/ /->(M q)---->[ q2 ]
                  /              /
                  \   emp       /
                   \-----------/

  Convert((M)) = Convert(M)

  After conversion, make the "head state"
   be a final state

        EXAMPLE OF CONVERSION TO NFA

Regular expression:  (i|j)*

               i
 Convert(i) = --->[ qi ]

               j
 Convert(j) = --->[ qj ]

 Convert(i|j) = 
                   i
                 /--->(qi)---\ emp
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \  j          /
                 \--->(qj)---/ emp

 Convert((i|j)*) =


       CONVERTING AN NFA INTO A DFA

Idea:
  Convert each reachable set of NFA states
  into a single DFA state


How?

  Use the emp-closure of each state q
     = set of states reachable from q
       using emp

 Closure wrt emp:
  closure(S) is the smallest set T
  such that
     T = S +
        union {delta(s, emp) | s in T}
  
  can compute closure(S) as

     T <- S;
     do T2 <- T
        T <- T2 +
           union {delta(s, emp)| s in T2}
     while (T != T2)

 DFA Transitions:
  Let S be a set of states, then
    DFAdelta(S, c) =
       closure(union {delta(s,c)|s in S})
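
The closure loop and DFAdelta can be sketched in C
with bit masks for sets of states. The NFA used here
is a made-up four-state machine for (i|j)*, not one
of the diagrams in these notes:
s0 -emp-> s1, s0 -emp-> s3, s1 -i,j-> s2,
s2 -emp-> s3, s3 -emp-> s0.
All names are assumptions.

```c
enum { EX_STATES = 4 };         /* s0 .. s3 */
typedef unsigned nfa_set;       /* bit k set means s_k is in the set */

#define BIT(k) (1u << (k))

/* emp-moves of the example NFA, indexed by state */
static const nfa_set emp_move[EX_STATES] = {
    BIT(1) | BIT(3), 0, BIT(3), BIT(0)
};

/* character moves: only s1 has any, on 'i' or 'j' */
static nfa_set char_move(int state, char c) {
    if (state == 1 && (c == 'i' || c == 'j')) {
        return BIT(2);
    }
    return 0;
}

/* closure(S): add emp-successors until nothing changes */
nfa_set closure(nfa_set s) {
    nfa_set t = s, t2;
    do {
        t2 = t;
        for (int k = 0; k < EX_STATES; k++) {
            if (t2 & BIT(k)) {
                t |= emp_move[k];
            }
        }
    } while (t != t2);
    return t;
}

/* DFAdelta(S, c) = closure(union {delta(s,c) | s in S}) */
nfa_set dfa_delta(nfa_set s, char c) {
    nfa_set moved = 0;
    for (int k = 0; k < EX_STATES; k++) {
        if (s & BIT(k)) {
            moved |= char_move(k, c);
        }
    }
    return closure(moved);
}
```

The DFA start state is closure({s0}); here that is
{s0,s1,s3}, and an empty result set plays the role
of a dead state.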

  
     EXAMPLE CONVERSION OF NFA TO DFA

NFA for if|[a-z]([a-z]|[0-9])*

                 f
           [q2] ---> [[q3]]
            ^                    emp
         i /                   /------\
          / emp       a-z     /       v
   -->[q1]------>[q4]----->[q5]    [[q8]]
                                     ^ |
                                  emp| |
                                     | |
                                a-z  | |
                             /-----\ | |
                            /      v / |
                        |->[q6]   [q7] |
                        |   \  0-9 ^   |
                    emp |    \----/    /
                        |             / 
                        \    emp     /
                         \----------/


Converted to DFA:

                   f
        [2,5,6,8] ---> [3,6,7,8]
          ^               \
      i  /                 \ a-z
        / a-h j-z      a-z  \ 0-9  
  -->[1,4]---->[5,6,8] 0-9   v    /-|
                    \----->[6,7,8]  |a-z
                                ^   |0-9
                                 \  |
                                  \-|


  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SRM assembler

   High-level description in
       asm_lexer.l

   Generated lexer asm_lexer.c
                   + asm_lexer.h

   Wrapper for lexer:
        lexer.h
           declares functions
        lexer.c
           defines functions
             e.g., lexer_print_token

   ASTs defined in ast.h

   asm.y is the Bison grammar description file

   == bison ==>

   - Declarations in asm.tab.h
        includes ast.h
                 machine_types.h
                 parser_types.h
                     declares YYSTYPE
                 lexer.h
        declares
                 yytokentype
                   eolsym = ...
                   minussym = ...
                   dottextsym = ...
                   ...
                   
   - Definitions in asm.tab.c
        defines yyparse()
                YYSTYPE yylval;


      STRUCTURE OF FLEX INPUT FILE

  /* ... definitions section ... */

%%
  /* ... rules section ... */
%%

  /* ... user subroutines ... */


    SECTIONS IN FLEX INPUT (.l file)

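Definitions section:
   - C declarations and #includes
       between %{ and %}
   - options (e.g., %option noyywrap)
   - named patterns (e.g., DECDIGIT [0-9])

Rules section:
   - pairs of a pattern (an RE)
       and an action (C code)
       run when the pattern matches

User subroutine section:
   - C code copied verbatim
       to the end of the generated lexer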


        WHY CONTEXT-FREE PARSING?

Can we define a regular expression
   to make sure expressions have
   balanced parentheses?
   e.g., recognize: (34) and ((12)+(789))
         but not: (567))+82))))
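
For comparison, checking balance by hand takes only a
counter, but one that can grow without bound, which a
finite automaton with a fixed number of states cannot
keep. A sketch in C (the name balanced is an
assumption):

```c
#include <stdbool.h>

/* True iff the parentheses in s are balanced.
 * Characters other than '(' and ')' are ignored. */
bool balanced(const char *s) {
    long depth = 0;             /* number of '(' still open */
    for (; *s != '\0'; s++) {
        if (*s == '(') {
            depth++;
        } else if (*s == ')') {
            if (depth == 0) {
                return false;   /* ')' with nothing to close */
            }
            depth--;
        }
    }
    return depth == 0;          /* everything opened was closed */
}
```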


          WHAT IS NEEDED

Needed for checking balanced parentheses:
   unbounded counting (a counter or a stack),
   which a finite automaton does not have


           PARSING

Want to recognize language
   with


Goal:


             CONTEXT-FREE GRAMMARS
             
def: a *context-free* grammar (CFG)
     (N, T, P, S) has 
     start symbol S in N and each
     production (in P) has the form:
        <M> -> g
     where <M> is a Nonterminal symbol,
        and g in (N+T)*

Example:


def: For a CFG (N,T,P,S),
     g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g',
     iff g is e <X> f,
         g' is e h f,
         <X> is a nonterminal (in N),
         h in (N+T)*, and the rule
         <X> -> h is in P

Example:
       ( <Number> ) =P=> 


              DERIVATION     

def: a *derivation* of a terminal string t
     from the rules P of a CFG (N,T,P,S),
     is a sequence (S, g1, g2, ..., gm),
     where gm = t, and
           for all i: gi in (N+T)*, and
           for all 0 <= j < m:
                 gj =P=> g(j+1)
                 (where g0 is S)

Example:


      LEFTMOST DERIVATION
     
def: a *leftmost derivation* of a string
     t in T* from a CFG (N,T,P,S) is
     a derivation of t from (N,T,P,S)
     (S, g1, ..., gm) such that gm = t and
        for all 0 <= j < m:
            when gj =P=> g(j+1)
            and the nonterminal <X> is
             replaced in gj,
              then there are no
               nonterminals to the left of
               <X> in gj.

Example:


         PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG
     (N,T,P,S) represents a derivation, D,
     iff:
       - Each node in Tr is labeled by
         a symbol in N+T (interior nodes
         by nonterminals in N)
       - the root of Tr is labeled by
         the start symbol S
       - an arc from <N0> to h in (N+T)
          iff <N0> -> ... h ... in P
       - the order of children of a node
         labeled <N0> is the order in
         a production <N0> -> ... in P
         

             EXAMPLES

GRAMMAR:

  <expr> ::= <number> | <expr> + <expr>
          |  <expr> * <expr>

Derivations of 3*4+2


         EXAMPLE PARSE TREES

Corresponding to leftmost derivation:


Corresponding to rightmost derivation:


             AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees
     for t


          FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to


Example:


    RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:
   <N> ::= g1 | g2 | ... | gm

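 1. Write a function (e.g., parseN)
    that recognizes strings derived from <N>

 2. This function uses the next token
    to choose among g1, ..., gm; then,
    for each symbol of the chosen gi (in order),
    it eats the token (for a terminal) or
    calls that symbol's parse function
    (for a nonterminal)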


      EXAMPLE RECURSIVE-DESCENT PARSER

<Stmt> ::= if <Cond>
           then <Stmt>
           else <Stmt>
        | begin <Stmt> <List> end
        | write <number>
<List> ::= ; <Stmt> <List> | <Empty>
<Empty> ::=
<Cond> ::= <number> = <number>


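 token tok; /* current token; call advance() once before parsing */

 void advance() { tok = yytoken(); }

 void eat(token_type tt) {
    if (tok.typ == tt) {
        advance();
    } else { /* ... report error */ }
 }

 void parseStmt(); /* declared early: parseList calls it */

 /* <Cond> ::= <number> = <number> */
 void parseCond()
 {     eat(numbersym);
       eat(eqsym);
       eat(numbersym);
 }

 /* <List> ::= ; <Stmt> <List> | <Empty>
    (semisym is an assumed name for the ";" token) */
 void parseList()
 {    if (tok.typ == semisym) {
         eat(semisym);
         parseStmt();
         parseList();
      } /* otherwise <Empty>: consume nothing */
 }

 void parseStmt()
 {    switch (tok.typ) {
      case ifsym:
         eat(ifsym);
         parseCond();
         eat(thensym);
         parseStmt();
         eat(elsesym);
         parseStmt();
         break;
      case beginsym:
         eat(beginsym);
         parseStmt();
         parseList();
         eat(endsym);
         break;
      case writesym:
         eat(writesym);
         eat(numbersym);
         break;
      default:
         /* report error */
         break;
      }
 }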
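
A self-contained variant of this parser that can be
compiled and run on its own is sketched below. It is
an illustration under stated assumptions: tokens come
from an array ending in T_EOF instead of from a lexer,
errors just set a flag, and all names (tok_kind,
parse_program, and so on) are hypothetical.

```c
#include <stdbool.h>

typedef enum { T_IF, T_THEN, T_ELSE, T_BEGIN, T_END, T_WRITE,
               T_NUMBER, T_EQ, T_SEMI, T_EOF } tok_kind;

static const tok_kind *input;   /* remaining tokens, ends with T_EOF */
static bool ok;                 /* set to false on the first error */

/* Consume the expected token, or record an error. */
static void eat(tok_kind k) {
    if (*input == k) {
        if (k != T_EOF) input++;    /* never move past T_EOF */
    } else {
        ok = false;
    }
}

static void parse_stmt(void);

/* <Cond> ::= <number> = <number> */
static void parse_cond(void) {
    eat(T_NUMBER);
    eat(T_EQ);
    eat(T_NUMBER);
}

/* <List> ::= ; <Stmt> <List> | <Empty>   (recursion written as a loop) */
static void parse_list(void) {
    while (*input == T_SEMI) {
        eat(T_SEMI);
        parse_stmt();
    }
}

/* <Stmt>: the next token selects the alternative */
static void parse_stmt(void) {
    switch (*input) {
    case T_IF:
        eat(T_IF);   parse_cond();
        eat(T_THEN); parse_stmt();
        eat(T_ELSE); parse_stmt();
        break;
    case T_BEGIN:
        eat(T_BEGIN); parse_stmt(); parse_list(); eat(T_END);
        break;
    case T_WRITE:
        eat(T_WRITE); eat(T_NUMBER);
        break;
    default:
        ok = false;                 /* no alternative starts this way */
    }
}

/* True iff tokens (terminated by T_EOF) form exactly one <Stmt>. */
bool parse_program(const tok_kind *tokens) {
    input = tokens;
    ok = true;
    parse_stmt();
    eat(T_EOF);
    return ok;
}
```

For example, the tokens of
"begin write 1 ; if 2 = 2 then write 3 else write 4 end"
are accepted, while an if without an else is rejected.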


           LL(1) GRAMMARS

A recursive-descent parser must:

  - choose between alternatives
    (e.g., <N0> ::= <A> | <B>)

def: A grammar is *LL(1)* iff


           LR(1) GRAMMARS

An LR(1) parser needs to decide
  when to: shift (push token on stack) or
           reduce

  uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff


          LALR(1) Parsing

Smaller tables than LR(1)
  - merges states of the DFA
      if only differ in lookahead

            PROBLEM: AMBIGUITY

Consider:

<S> ::= <ident> := <number>
     | if <E> then <S> <ME>
<ME> ::= <empty> | else <S>

and the statement:

   if b1 then 
      if b2 then x := 2
   else x := 3

Is this parsed as:

       <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3
or as:

       <S>
   /  /   \----|--------\
  if <E> then <S>      <ME> 
      |        |\     /   \       
      b1      /| \   else <S> 
             / |  \      / | \    
             |  |  |      ...
             |  |  |     x := 3
             |  |  |
            if <E> then <S> <ME>
                |       /|\   \
                b2      ...   <empty>
                       x := 2

        FIXES FOR AMBIGUITY

Change the language:

 a. Always have an else clause:

   <S> ::= if <E> then <S> else <S>

   (use skip if you don't want to do anything)

 b. Use an end marker

   <S> ::= if <E> then <S> else <S> fi
        |  if <E> then <S> fi

Give precedence to one production:

   <S> ::= if <E> then <S> <ME>
   <ME> ::= else <S>  // priority!
         | <empty>

So we only get the parse tree:

       <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3