I. Overview
 A. theory
  1. languages
------------------------------------------
                LANGUAGES

def: A *language* is


------------------------------------------
            For a written natural language, what are the characters?
            For a spoken natural language, what are the characters?
  2. hierarchy of language classes
------------------------------------------
          LANGUAGE CLASSES

Languages can be classified by


Venn Diagram:

 |--------------------------------------|
 | Strings recognized by                |
 |   the Regular Grammar                |
 |                                      |
 |  |--------------------------------|  |
 |  | Strings recognized by          |  |
 |  |  the Contex-free Grammar       |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Strings recognized by    |  |  |
 |  |  |  the Context-sensitive   |  |  |
 |  |  |    Grammar               |  |  |
 |  |  |     (static checking)    |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  | Strings recognized |  |  |  |
 |  |  |  |  by the            |  |  |  |
 |  |  |  |   Type 0 Grammar   |  |  |  |
 |  |  |  |    (dynamic checks)|  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|

------------------------------------------
   a. relation of language classes to parts of a compiler
------------------------------------------
          PHASES OF A COMPILER

Programs allowed by a compiler's:

 |--------------------------------------|
 | Lexical Analysis (Lexer)             |
 |                                      |
 |  |--------------------------------|  |
 |  | Parser                         |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Static Analysis          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  |  Runtime checks    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|

------------------------------------------
        Is there a relationship to the previous diagram
        (about langauge classes)?
  3. automation of syntax analysis
------------------------------------------
       PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

Language
 Def.docx  -- coding --> parser.c (v. 1)

 Def2.docx -- coding --> parser.c (v. 2)
           -- coding --> parser.c (v. 3)

 Def3.docx -- coding --> parser.c (v. 4)
 Def4.docx -- coding --> parser.c (v. 5)
   ...                    ...
 DefN.docx -- coding --> parser.c (v. N)

Disadvantages:


------------------------------------------
------------------------------------------
 COMPUTER SCIENCE SOLUTION: AUTOMATION

 high-level
 description  tool      generated code

lang_lexer.l -- flex--> lang_lexer.c
                          + lang_lexer.h
     lang.y  -- bison --> lang.tab.c
                          + lang.tab.h
     lang2.y -- bison --> lang.tab.c
                          + lang.tab.h
                ...

     langN.y -- bison --> lang.tab.c
                          + lang.tab.h

Advantages:


------------------------------------------
  4. grammars
   a. definitions
------------------------------------------
        GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions
 of languages/parsers

def: A *terminal* is a character
     (or string of characters)
     found in the language

     E.g., 

def: A *nonterminal* is a meta-notation
     that can be replaced with other
     terminals or nonterminals.

def: a *grammar* consists of a finite set
     of rules (called "productions")
     and a start symbol (a nonterminal).

     Let V = nonterminals + terminals
     
     The rules have the form

            V+ -> V*

     where there is no symbol in both
        nonterminals and terminals


def: The language generated by a grammar G
     with set of productions P is:

     {w | w is in terminals* and S =>* w}
       where S is the start symbol of G
       gAd => gBd iff g in V* and d in V*
       and A -> B is a rule in P
       and g =>* h iff either h = g
                    or g -> i and i =>* h

------------------------------------------
   b. BNF notation
------------------------------------------
         BNF NOTATION FOR GRAMMARS

::=  means

 |   means

<X>  is a


Example

 <BNumber> ::= <BDigit> <BNumber>
           |  <BDigit>

 <BDigit> ::= 0 | 1

------------------------------------------
   c. grammars as games
------------------------------------------
         GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing
  two games:

   - A production game
      (Can you produce this string?)
   - A recognition/parsing game
      (Is this string in the language?)
      
------------------------------------------
    i. production game
------------------------------------------
            PRODUCTION GAME

Goal: produce a string in the language
        from the start symbol

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Can we produce "Johnny is good"?


------------------------------------------
        Could you win the production game and produce
           the string "Johnny is good"?
         Could you produce the sentence "Charlie can be difficult"?
    ii. recognition or parsing game
------------------------------------------
        RECOGNITION OR PARSING GAME

Goal: determine if a string is in
      the language of the grammar

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Is "Johnny is good" in this grammar?


------------------------------------------
       Could you win the recognition game
           on the string "Johnny is good"?
        What does this have to do with parsing in a compiler?
   d. derivation trees
------------------------------------------
         DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes
     connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G
     is a tree such that:
        - Every node has a label that is
          a symbol of G
        - The root is labeled by
          the start symbol of G
        - Every node with a
          direct descendent, is labeled by
          a nonterminal
        - If the descendents of a node
          labeled by N have the following
          labels (in order):
             A, B, C, ..., K
          then G has a production of form
             N -> A B C ... K
------------------------------------------
------------------------------------------
        EXAMPLE DERIVATION TREE

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

String to parse: "Johnny is good"

         <Sentence>
         /    |   \
        /     |    \
       v      v     v
    <Noun> <Verb> <Adj>


    Johnny   is    good
------------------------------------------
        Why does the order of nodes matter?
   e. extensions to BNF (EBNF)
------------------------------------------
        EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:

    { x }  means 0 or more repeats of x

   <N> :: = { <Z> }

   is equivalent to:

   <N> ::= <Z-seq>
   <Z-seq> ::= <empty> | <Z-Seq> <Z>
   <empty> ::=

   {<Z>} is also written as:
       <Z>*  or  [ <Z> ] ...

One-or-more repeats:

     x+  means 1 or more repeats of x

   <N> :: = <X>+

   is equivalent to:

   <N> ::= <Xs>
   <Xs> ::= <X> | <Xs> <X>

   <X>+ is sometimes written as <X> ...
                   or <X> [ <X> ] ...

Optional element:

    [ x ]  means 0 or 1 occurences of x

     <N> ::= [ <Y> ]

     is equivalent to:

     <N> ::= <Y-opt>
     <Y-opt> ::= <empty> | <Y>
------------------------------------------
        What is EBNF notation like that you may have seen before?
   f. examples
------------------------------------------
    READING A BNF GRAMMAR

Example rules:

  <DecimalConstant> ::=
	<NonZeroDigit> <Digits>
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>
  <Digits> ::= <Digit> | <Digit> <Digits>

------------------------------------------
	can you give an example of a <DecimalConstant>?
------------------------------------------
      EBNF GRAMMAR FOR (SUBSET OF) SPL

<program> ::= <block> .

<block> ::= begin <const-decls>
                  <var-decls>
                  <proc-decls>
                  <stmts>
            end

<const-decls> ::= {<const-decl>}
<const-decl> ::= const <const-def-list> ;
<const-def-list> ::= <const-def>
          | <const-def-list> , <const-def>
<const-def> ::= <ident> = <number>

<var-decls> ::= {<var-decl>}
<var-decl> ::= var <ident-list> ;
<ident-list> ::= <ident> {<comma-ident>}
<comma-ident> ::= , <ident>

<proc-decls> ::= {<proc-decl>}
<proc-decl> ::= proc <ident> <block> ;

<stmts> ::= <empty> | <stmt-list>
<empty> ::= 
<stmt-list> ::= <stmt> {<semi-stmt>}
<semi-stmt> ::= ; <stmt>
<stmt> ::= <ident> := <expr>
       | call <ident>
       | if <condition> then <stmts>
                        else <stmts> end
       | if <condition> then <stmts> end
       | while <condition> do <stmts> end
       | read <ident>
       | print <ident>
       | block

<condition> ::= <expr> <rel-op> <expr>

------------------------------------------
------------------------------------------
          EXAMPLES IN SPL

Shortest program:

  begin end.

Factorial program:

begin
  var n, res; # input and result
  proc fact
  begin
     read n;
     res := 1;
     while n != 0
     do
       res := res * n;
       n := n-1
     end;
     print res
  end;
  call fact
end.

------------------------------------------
        Does the factorial program parse correctly?
        Does SPL use semicolons to end statements or to separate them?
        How does C use semicolons?
II. Lexical Analysis
------------------------------------------
           MOTIVATION

 # $Id$\n   .text start\nstart:\tADDI ...

Want to:


Approach:


------------------------------------------
------------------------------------------
            LEXICAL ANALYSIS

Lexical means relating to the words
  of a language


------------------------------------------
 A. Goals of Lexical Analysis
------------------------------------------
       GOALS OF LEXICAL ANALYSIS

- Simplify the parser,
  so it need not handle:


- Recognize the longest match
  Why?


- Handle every possible character of input
  Why?


------------------------------------------
     Why recognize the longest match?
     Why handle all possible characters?
------------------------------------------
         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers
 are tokens:

What tokens should "if8" match?


Fixing such situations:


------------------------------------------
        What tokens should "<=" match?
------------------------------------------
           WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?


If the input is "==", what token?


If the input is "<8", what token(s)?


If the input is "if", what token(s)?


If the input is "//", what token(s)?


Summary:


------------------------------------------
        How would you program the longest match idea?
        How would you ensure that reserved words
           are favored over identifiers?  (e.g., "if" is not an identifier)
           After finding an identifier characters,
           check to see if it's a reserved word,
           and if so, then return the reserved word token

        ... In sum,
              favor is longest match,
              but give priority to reserved words over identifiers

  1. Overview
------------------------------------------
          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 / 
                                / abstract
                               /  syntax
                              v   trees
                            [ static
                              analysis ]
                             / 
                            /
                           v
                         [ code generator ]

For the Lexer we want to:
  - specify the tokens
    using regular expressions (REs)
  - convert REs to DFAs to execute them
     but easy conversions are:
        - REs to NFAs
        - NFAs to DFAs

------------------------------------------
        Explain what an the abbreviations mean:
           - NFA = nondeterministic finite state automaton
           - DFA = deterministic finite state automaton

------------------------------------------
        HOW PARSER WORKS WITH LEXER

Couroutine structure:

   Parser calls lexer:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */

   Lexer function
       remembers pointer to input stream

         returns next token (int code)

   Parser works...
   
   Parser calls lexer again:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */
     
------------------------------------------

        It's a coroutine,
        because the lexer picks up where it left off
        when called again

   a. Use of Automated Tools
------------------------------------------
 BISON AND FLEX, GENERATING A PARSER

idea:

  ast.h (AST types)
   ^
   |        bison
   |        -----> spl.tab.c
   |       /        ^   yyparse()
   |      / bison   |  
   |   spl.y        |
   |      \------> spl.tab.h  
   |                      \  tokens
   |               flex    v  defs. 
  spl_lexer.l file --------> spl_lexer.c       
                              yylex()
  spl.y file -----> spl.tab.h  
                    \  tokens
             flex    v  defs. 
  spl_lexer.l file -----> spl_lexer.c       
                     yylex() function

------------------------------------------
   ...
     explain all of this:

      The context-free grammar
           (the .y file) is central to this,

      The .y file records:
            - grammar and
            - names/types of the tokens
               (that the parser needs,
                and the lexer produces)

      Bison is a parser generator
          it generates the function yyparse()
          in files spl.tab.h and spl.tab.c

      The ASTs are structs that record the parse

      Flex is a lexical analyzer generator

      the spl_lexer.l file records
         - the lexical grammar (using REs)
         - how tokens produce ASTs (in yylval)

------------------------------------------
     FOR HOMEWORK 2 (LEXICAL ANALYSIS)


  ast.h (AST types)
   ^
   |       
   | 
   | 
   | 
   |                  spl.tab.h  
   |                      \  tokens
   |               flex    v  defs. 
  spl_lexer.l file --------> spl_lexer.c       
                              yylex()

------------------------------------------

         The lexer needs to record tokens as token ASTs,
             for later use by the parser,
         but otherwise the ASTs are not important for homework 2

 B. Using Flex
  1. Files and coordination with parser

------------------------------------------
  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SSM assembler (in hw1 zip file)

   High-level description in file
 asm_lexer.l

  === flex ===>

   generates:

      asm_lexer.c + asm_lexer.h

   Wrapper for lexer:
        lexer.h
           declares functions
        lexer.c
           errors_noted variable +
           some utility functions
        asm_lexer.l
           defines functions
             e.g., lexer_print_token
                   lexer_output

   (ASTs defined in ast.h)

 asm.y is Bison description file grammar

  == bison ==>

   - Declarations in asm.tab.h
        #includes ast.h
                  machine_types.h
                  parser_types.h
                     (declares YYSTYPE)
                  lexer.h

        declares
              yytokentype
                eolsym = ...
                minussym = ...
                dottextsym = ...
                ...
                   
   - Definitions in asm.tab.c
        defines yyparser()
                YYSTYPE yylval;

------------------------------------------
       There are options in flex for naming the .h and .c files generated.

       The type YYSTYPE is the type of the ASTs

       Since the connection between flex and bison is tricky,
       for homework 2, we give you the spl.tab.h file,
          from that file you will need the token types (yytokentype)
              and a minimal spl.tab.c that defines yylval.

  2. structure of flex input
------------------------------------------
      STRUCTURE OF FLEX INPUT FILE

  /* ... definitions section ... */

ARRGH r
EEEE eeee
RE   {ARRGH}{EEEE}

%%
  /* ... rules section ... */

re   { tok2ast(resym); return resym; }
{RE} { tok2ast(re4sym); return re4sym; }

%%

  /* ... user subroutines ... */


------------------------------------------
        Explain the difference between re and {ARRGH}{EEEE}

        The regular expression
          - must start in column 1 (otherwise it's taken as C code)
          - if it uses defined 

------------------------------------------
    SECTIONS IN FLEX INPUT (.l file)

Definitions section:


Rules section:


User subroutine section:


------------------------------------------

     ...
       options,
       (C) #includes,
       (C) declarations of names used in rules
       (flex) definitions of named REs,
       (flex) declarations of
                start conditions (states)
       
     ...
       pairs of REs and actions (code)

     ...
       definitions of functions used

Go over code in the assembler's lexer_main.c

          show output
     lexer.h,
     lexer.c
     asm_lexer.l
       and generated code asm_lexer.h
                          asm_lexer.c
COP 3402 meeting -*- Outline -*-

III. Theory of Regular Languages and Regular Expressions

  Based on Chapter 3 of the book
      "Formal Languages and their Relation to Automata"
      by Hopcroft and Ullman (Addison-Wesley, 1969).

   There is a strong connection between theory and practice in
   syntax analysis,
   with some of the first and most useful results in CS theory!

 A. Definitions
------------------------------------------
             DEFINITIONS

The lexical grammar of a language
   is regular, because


def: a grammar is *regular* iff
     all its productions have one of these
     forms:

           <B> ::= c <D>
         or
           <B> ::= c

     where c is a terminal symbol,
       and <B> and <D> are nonterminals

def: a language is *regular* iff
     it can be defined using a
     regular grammar.

Thm: Every regular language can be
     recognized by a finite automaton.

Thm: Every regular language can be
     specified by a regular expression.
------------------------------------------
     ... regular grammars can be parsed/recognized quickly
         (using finite automata)

         Regular languages are also called "type 3" languages
            (a name from the Chomsky hierarchy)
         A regular language is also called a *regular set*.

         Q: Does it follow that every regular expression
            can be recognized using a finite automaton?
  1. Regular Expressions
------------------------------------------
            REGULAR EXPRESSIONS

The language of regular expressions:

  <RE> ::= <char>
       | <RE> '|' <RE>
       | <RE> <RE>
       | emp            
       | <RE>*
       | ( <RE> )

 where <char> is a character

Examples:

  RE              meaning
====================================
  emp             the empty string

  2*              zero or more 2's

  (0|1)*0         even binary numerals

 b*(abb*)*(a|emp) a's and b's without
                  consecutive a's
  
------------------------------------------
  2. extensions to regular expressions
------------------------------------------
TYPICAL EXTENSIONS TO REGULAR EXPRESSIONS

[abcd]   means   a|b|c|d

[h-m]    means   h|i|j|k|l|m

 x?      means   x|emp

 y+      means   y(y*)

------------------------------------------
  3. examples of regular expressions
------------------------------------------
           FOR YOU TO DO

Write a regular expression that describes:

1. The keyword "if"


2. The set of all (positive)
   decimal numbers


3. The set of all possible identifiers
   without underbars


(If you have time, write these as
 regular grammars also.)


------------------------------------------
        What would be the regular expression for identifiers?
  4. Examples using regular expressions
------------------------------------------
   EXAMPLE REGULAR EXPRESSION DEFINITIONS

PATTERN         RE
=========================
DIGIT	        [0-9]
LETTER		[a-zA-Z]


...
------------------------------------------
        What would be the regular expression for identifiers?
 B. finite automata (omit)
  1. example
------------------------------------------
              EXAMPLE

State diagram:

             <         =
    -->[q0] --->[[q1]]---> [[q2]]

------------------------------------------
        What token should be returned at state q2?
        What should be done at state q1?
------------------------------------------
      PSEUDO CODE FOR THIS LEXER CASE

    char c;


------------------------------------------
------------------------------------------
        DEALING WITH COMMENTS

Suppose
  /  is used for division
  // starts a comment to end of line
      (unlike SPL!)

What state diagram?


How does whitespace fit in?


------------------------------------------
       How does whitespace fit in?
------------------------------------------
TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to parser

Can always get a non-ignored token:


Return "tokens" that include ignored stuff
  to a loop that ignores them


Giant DFA that goes back to start state
  on seeing something to ignore


------------------------------------------
  2. definitions
------------------------------------------
     NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton*
    (NFA) over an alphabet Sigma
    is a system (K, Sigma, delta, q0, F)
    where K is a finite set (of states),
          Sigma is a finite set
            (the input alphabet),
          delta: is a map of type
             (K, Sigma) -> Sets(K)
          q0 in K is the initial state,
          F is a subset of K
            (the final/accepting states).

------------------------------------------
------------------------------------------
     TRANSITION FUNCTION AND ACCEPTANCE

  p in delta(q,x)
     means that in state q,
       on input x the next state
       can be p

  p in delta*(q,s)
     where s in Sigma*
      is defined by:
              delta*(q,emp) = {q}
         (i.e., q in delta*(q,emp))
         p in delta*(q,xa) =
                delta*(q2, a)
            where x in Sigma,
                  a in Sigma*,
                  q2 in delta(q,x)

Lemma: for all c in Sigma,
   p in delta*(q, c) = delta(q, c)

def: An NFA (K, Sigma, delta, q0, F)
     *accepts* a string s in Sigma* iff
     there is some q in delta*(q0,s)
                 such that q in F

------------------------------------------
         If each final state means to return a token,
            what token should be returned if there are many final states?
  3. example
------------------------------------------
                  EXAMPLE NFA

         0,1                     0,1
        /---\                   /---\
        \   /                   \   /
         | v   0           0     | v
    -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]]
         |
         | 1
         v
       [ q1 ]
         |
         | 1
         v
      [[ q2 ]]
         | ^
        /   \
        \---/
         0,1

 K = {q0,q1,q2,q3,q4}
 Sigma = {0,1}
 q0 is start state
 F = {q2,q4}

 delta(q0,0) = {q0,q3}
 delta(q1,0) = {}
 delta(q2,0) = {q2}
 delta(q3,0) = {q4}
 delta(q4,0) = {q4}
 delta(q0,1) = {q0,q1}
 delta(q1,1) = {q2}
 delta(q2,1) = {q2}
 delta(q3,1) = {}
 delta(q4,1) = {q4}

------------------------------------------
------------------------------------------
   EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to
   sets of states
     d(Q,x) = union of d(q,x)
                for all q in Q

     so d({},x) = {}
        d({q},x) = d(q,x)
        d({q1,q2},x) = d(q1,x)+d(q2,x)
        d({q1,q2,q3},x) =
               d(q1,x)+d(q2,x)+d(q3,x)
        etc.


Note that delta*({q}, x)
       = delta*(q, x)
       = delta(q, x)

------------------------------------------
------------------------------------------
             EXAMPLE

delta*(q0,010011)
    = delta*(delta(q0,0),10011)
    = delta*({q0,q3},10011)
    = delta*(q0,10011)
      + delta*(q3,10011)
    = delta*(delta(q0,1),0011)
      + delta*(delta(q3,1),0011)
    = delta*({q0,q1},0011)
      + delta*({},0011)
    = delta*(delta(q0,0),011)
         + delta*(delta(q1,0),011)
      + {}
    = delta*({q0,q3},011)
         + delta*({},011)
      + delta*({q2},011)
    = delta*({q0,q3},011)
         + {}
      + delta*({q2},011)
    = delta*({q0,q3},011)
      + delta*({q2},011)
    = delta*(delta(q0,0),11)
         + delta(q3,0),11)
      + delta*(delta(q2,0),11)
    = delta*({q0,q3},11)
         + delta*({q4},11)
      + delta*({q2},11)
    = delta*({q0,q3,q4},1)
         + delta*({q4},1)
      + delta*({q2},1)
    = delta({q0,q1,q4},1)
         + delta(q4,1)
      + delta(q2,1)
    = {q0,q1,q2,q4}
         + {q4}
      + {q2}
    = {q0,q1,q2,q4}


------------------------------------------
        So does this machine accept 010011?
        What strings does this machine accept?
  4. deterministic finite automata        
------------------------------------------
     DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton*
     (DFA) is an NFA in which
     delta(q,c) is a singleton or empty
        for all q in K and c in Sigma.


------------------------------------------
------------------------------------------
          IMPLEMENTING DFAS
          
How would you represent states?


How would you implement a DFA?


------------------------------------------
  5. converting NFA to DFA
------------------------------------------
           PROBLEM

We want to specify lexical grammar
   using

So we need to convert regular expressions
   into DFAs 


------------------------------------------
       Why DFAs?
   a. Converting Regular Expressions to NFAs
------------------------------------------
         CONVERTING REs TO NFAs

Definition based on grammar
   of Regular Expressions:

Result of Convert(M) looks like
 this:  -->(M q)
    where the "tail", -->,
                 goes to the start state
          and q is the "head state"
          and M is converted recursively

 assume also Convert(N) is --->(N q')

                 c
  Convert(c) =  --->[ q ]

  Convert(M | N) =
                 /--->(M q)--\
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \             /
                 \--->(N q')-/

  Convert(M N) =

           --> (M q) --> (N q')

                  emp
  Convert(emp) = -----> [ q ]

                       emp
                   /--------------\
                  /         emp   v
  Convert(M*) = -/ /->(M q)---->[ q2 ]
                  /              /
                  \   emp       /
                   \-----------/

  Convert((M)) = Convert(M)

  After conversion, make the "head state"
   be a final state
------------------------------------------
------------------------------------------
        EXAMPLE OF CONVERSION TO NFA

Regular expression:  (i|j)*

               i
 Convert(i) = --->[ qi ]

               j
 Convert(j) = --->[ qj ]

 Convert(i|j) = 
                   i
                 /--->(qi)---\ emp
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \  j          /
                 \--->(qj)---/ emp

 Convert((i|j)*) =


------------------------------------------
         In programming terms, do we need to use an NFA to
            recognize "if" and other reserved words?
   b. Converting NFAs to DFAs
------------------------------------------
       CONVERTING AN NFA INTO A DFA

Idea:
  Convert each reachable set of NFA states
  into


How?

  Use the emp-closure of each state q
     = set of states reachable from q
       using emp

 Closure wrt emp:
  closure(S) is the smallest set T
  such that
     T = S +
        union {delta(s, emp) | s in S}
  
  can compute closure(S) as

     T <- S;
     do T2 <- T
        T <- T2 +
           union {delta(s, emp)| s in T2}
     while (T != T2)

 DFA Transitions:
  Let S be a set of states, then
    DFAdelta(S, c) =
       closure(union {delta(s,c)|s in S})

  
------------------------------------------
        Why do we need to take the closure wrt emp?
------------------------------------------
     EXAMPLE CONVERSION OF NFA TO DFA

NFA for if|[a-z]([a-z]|[0-9])*

                 f
           [q2] ---> [[q3]]
            ^                    emp
         i /                   /------\
          / emp       a-z     /       v
   -->[q1]------>[q4]----->[q5]    [[q8]]
                                     ^ |
                                  emp| |
                                     | |
                                a-z  | |
                             /-----\ | |
                            /      v / |
                        |->[q6]   [q7] |
                        |   \  0-9 ^   |
                    emp |    \----/    /
                        |             / 
                        \    emp     /
                         \----------/


Converted to DFA:

                   f
        [2,5,6,8] ---> [[3,6,7,8]]
          ^               \
      i  /                 \ a-z
        / a-h j-z      a-z  \ 0-9  
  -->[1,4]---->[[5,6,8]]0-9  v    /-|
                    \----->[[6,7,8]]|a-z
                                ^   |0-9
                                 \  |
                                  \-|


------------------------------------------
        Which states are final in the DFA?
        Which tokens would be returned in each state?
IV. Context-free Parsing
 A. Why context-free grammars and parsing?
------------------------------------------
        WHY CONTEXT-FREE PARSING?

Can we define a regular expression
   to make sure expressions have
   balanced parentheses?
   e.g., recognize: (34) and ((12)+(789))
         but not: (567))+82))))


------------------------------------------
------------------------------------------
          WHAT IS NEEDED

Needed for checking balanced parentheses:


------------------------------------------
        How does this grammar
            <expr> ::= <number> | ( <expr> )
                    | <number> + <expr>
           ensure that parentheses are balanced?
 B. Parsing with Context-free Grammars
------------------------------------------
           PARSING

Want to recognize language
   with


Goal:


------------------------------------------
  1. context-free grammar
------------------------------------------
             CONTEXT-FREE GRAMMARS
             
def: a *context-free* grammar (CFG)
     (N, T, P, S) has 
     a set (of nonterminals) N,
     a set (of terminals) T,
     a start symbol S in N and each
     production (in P) has the form:
        <M> -> g
     where <M> is a Nonterminal symbol,
        and g in (N+T)*

Example:


def: For a CFG (N,T,P,S),
     g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g',
     iff g is e <X> f, e and f in (N+T)*,
         g' is e h f,
         <X> is a nonterminal (in N),
         h in (N+T)*, and the rule
         <X> -> h is in P

Example:
       ( <Number> ) =P=> 


------------------------------------------
  2. derivation
------------------------------------------
              DERIVATION     

def: a *derivation* of gm in T*
     from the rules P of G=(N,T,P,S),
     is a sequence (S, g1, g2, ..., gm),
     where for all i: gi in (N+T)*,
           S =P=> g1, and
           for all 1 <= j < m:
                 gj =P=> g(j+1)

Example:


------------------------------------------
------------------------------------------
      LEFTMOST DERIVATION
     
def: a *leftmost derivation* of gm in T*
     from a CFG (N,T,P,S) is
     a derivation of t from (N,T,P,S)
     (S, g1, ..., gm) such that
     for all 0 <= j < m:
         when gj =P=> g(j+1)
         and the nonterminal <X> is
            replaced in gj,
         then there are no
              nonterminals to the left of
              <X> in gj.

Example:


------------------------------------------
      Would a rightmost derivation be?
  3. parse trees
------------------------------------------
         PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG
     (N,T,P,S) represents a derivation, D,
     iff:
       - Each node in Tr is labeled by
         a nonterminal (in N)
       - the root of Tr is
         the start symbol S
       - an arc from <N0> to h in (N+T)
          iff <N0> -> ... h ... in P
       - the order of children of a node
         labeled <N0> is the order in
         a production <N0> -> ... in P
         
------------------------------------------
------------------------------------------
             EXAMPLES

GRAMMAR:

  <expr> ::= <number> | <expr> + <expr>
          |  <expr> * <expr>

Derivations of 3*4+2


------------------------------------------
------------------------------------------
         EXAMPLE PARSE TREES

Corresponding to leftmost derivation:


Corresponding to rightmost derivation:


------------------------------------------
          What does each parse tree mean?
  4. ambiguity
------------------------------------------
             AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff
     there is some t in T* such that
     there are two different parse trees
     for t


------------------------------------------
        What's an example of an ambiguous grammar?
        Why do we want to avoid ambiguous grammars?
        Is ambiguity a property of a language or its grammar?
   a. Fixing ambiguous grammars
------------------------------------------
          FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite grammar to


Example:


------------------------------------------
       Why is the tree

           <expr>
          /  |   \
      <expr> *  <expr>
       /        /   | \
   <number>  <expr> + <expr>
    |          |        |
    3      <number>   <number>
              |         |
              4         2

        not a parse tree for this grammar?
 C. Parsing techniques
  1. recursive-descent parsing (LL parsing)
------------------------------------------
    RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of form:
   <N> ::= g1 | g2 | ... | gm

 1. Write a 


 2. This function 


------------------------------------------
   a. example
------------------------------------------
   EXAMPLE RECURSIVE-DESCENT RECOGNIZER

<Stmt> ::= if <Cond>
           then <Stmts>
           else <Stmts> end
        | begin <Stmts> end
        | print <number>
<Stmts> ::= <Empty> | <StmtList>
<StmtList> ::= <Stmt> ; <StmtList>
<Empty> ::=
<Cond> ::= <number> == <number>

 #include "spl.tab.h"

 // the current token
 yytoken_kind_t tok;

 void parser_initialize() {
    tok = yylex();
 }

 void advance() { tok = yylex(); }

 void eat(yytoken_kind_t expected) {
    if (tok == expected) {
        advance();
    } else { /* ... report error */ }
 }

 void parseCond()
 {     eat(numbersym);
       eat(eqeqsym);
       eat(numbersym);
 }

 void parseStmts()
 {
    while (tok.toknum == ifsym
           || tok.toknum == beginsym
           || tok.toknum == printsym
           || tok.toknum == semisym) {
      if (tok.toknum == semisym) {
         eat(semisym);
         parseStmt();
      }
    }
 }

 void parseStmt()
 {    switch (tok.toknum) {
      case ifsym:
         eat(ifsym);
         parseCond();
         eat(thensym);
         parseStmts();
         eat(elsesym);
         parsetStmts();
         eat(endsym);
         break;
      case beginsym:
         eat(beginsym);
         parseStmts();
         eat(endsym);
         break;
      case printsym:
         eat(printsym);
         eat(numbersym);
         break;
      default:
         // report error
      }
 }

------------------------------------------
  What kind of errors can eat() report?
   What kind of error message can the parser produce
       (in the default case)?
  How do errors get reported by parseExp()?
   b. terminology: LL(1)
------------------------------------------
           LL(1) GRAMMARS

A recursive-descent parser must:


  - choose between alternatives
    (e.g., <N0> ::= <A> | <B>)

def: A grammar is *LL(1)* iff


------------------------------------------
   c. terminology: LR(1)
------------------------------------------
           LR(1) GRAMMARS

An LR(1) parser needs to decide
  when to: shift (push token on stack) or
           reduce

  uses a DFA based on stack + lookhead

def: A grammar is *LR(1)* iff


------------------------------------------
    i. LALR(1) parsers
------------------------------------------
          LALR(1) Parsing

Smaller tables than LR(1)
  - merges states of the DFA
      if they only differ in lookahead

------------------------------------------
   d. ambiguity problems
------------------------------------------
           PROBLEM: AMBIGUITY

Consider:

<S> ::= <ident> := <number>
     | if <E> then <S> <ME>
<ME> ::= <empty> | else <S>

and the input statement:

   if b1 then 
      if b2 then x := 2
    else x := 3

Is this parsed as:

     <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3
or as:

     <S>
    / /  \-----|--------\
  if <E> then <S>      <ME> 
      |        |\     /   \       
      b1      /| \   else <S> 
             / |  \      / | \    
             |  |  |      ...
             |  |  |     x := 3
             |  |  |
            if <E> then <S> <ME>
                |       /|\   \
                b2      ...   <empty>
                       x := 2
------------------------------------------
------------------------------------------
        FIXES FOR AMBIGUITY

Change the language:

 a. Always have an else clause:

   <S> ::= if <E> then <S> else <S>

   (use skip if don't want to do anything)

 b. Use an end marker

   <S> ::= if <E> then <S> else <S> fi
        |  if <E> then <S> fi

Give precedence to one production:

   <S> ::= if <E> then <S> <ME>
   <ME> ::= else <S>  // priority!
         | <empty>

So we only get the parse tree:

       <S>
   / /  \ -----|---\
  if <E> then  |  <ME>
      |        |      \
      |        |    <empty>
      |       <S>     
             / | \--------\
      |     /  |  \    \   \
      b1   if <E> then <S> <ME>
                      / | \ \   \    
                       ...   |   \
                      x := 2 else <S>
                                 / | \
                                  ...
                                 x := 3
            
------------------------------------------
V. Using the bison parser generator
------------------------------------------
     BISON: A LALR(1) PARSER GENERATOR

Input:
    spl.y
     |
     |  bison spl.y
     v

Output:
    spl.tab.c   and   spl.tab.h

    yyparse{def.}     token decls.
    parse tables      extern decls.
    yylval               yyparse(),
    (+ user code)        yylval
------------------------------------------
 A. big picture
------------------------------------------
          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 /
                                / ASTs
                               /
                              |
                              v
               symbol <---- [ static
               table  ---->   analysis ]
                             /
                            /
                           v
                        [ code generator ]

------------------------------------------
 B. using bison and flex
------------------------------------------
 BISON AND FLEX, GENERATING A PARSER

idea:

                  ast.h (AST types)
                     |
             bison   v
             -----> spl.tab.c
            /        ^  yyparse (def.)
           / bison   |
  spl.y ----------> spl.tab.h
                     |  token enum.
                     |  (decl.)
              flex   v
  spl_lexer.l-----> spl_lexer.c
                        yylex (def.)


------------------------------------------
  1. file connections for hw3
------------------------------------------
     HOW IT ALL FITS TOGETHER IN HW3

// in file machine_types.h: ======

// ... 
typedef unsigned int address_type;
typedef unsigned char byte_type;
typedef int word_type;

// in file file_location.h: ======

// location in a source file
typedef struct {
    const char *filename;
    unsigned int line; // of first token
} file_location;

// in file ast.h: ================
// ...
#include "machine_types.h"
#include "file_location.h"

// types of ASTs (type tags)
typedef enum { block_ast, /* ... */
    token_ast
} AST_type;

// typedefs for types N_t, follow,
// where N is a nonterminal

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    void *next; // for lists
} generic_t;

typedef struct ident_s {
    file_location *file_loc;
    AST_type type_tag;
    struct ident_s *next; // for lists
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *text;
    word_type value;
} number_t;

typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    const char *text;
    int code;
} token_t;

// ...

typedef struct block_s {
    file_location *file_loc;
    AST_type type_tag;
    const_decls_t const_decls;
    var_decls_t var_decls;
    proc_decls_t proc_decls;
    stmts_t stmts;
} block_t;

// ...
typedef union AST_u {
    generic_t generic;
    block_t block;
    const_decls_t const_decls;
    const_decl_t const_decl;
    const_def_list_t const_def_list;
    const_def_t const_def;
    var_decls_t var_decls;
    var_decl_t var_decl;
    ident_list_t ident_list;
    // ...
    expr_t expr;
    binary_op_expr_t binary_op_expr;
    token_t token;
    number_t number;
    ident_t ident;
    empty_t empty;
} AST;

// Return the file location from an AST
extern file_location *ast_file_loc(AST t);

// ...

// Return a pointer to a fresh copy of t
// that has been allocated on the heap
extern AST *ast_heap_copy(AST t);

// ...

extern block_t ast_block(
  token_t begin_tok,
  const_decls_t const_decls,
  var_decls_t var_decls,
  proc_decls_t proc_decls,
  stmts_t stmts);

// ...
extern ident_t ast_ident(
  file_location *file_loc,
  const char *name);

extern expr_t ast_expr_number(
  number_t e);

extern empty_t ast_empty(
  file_location *file_loc);
// ...

parser_types.h:

#include "ast.h"
typedef AST YYSTYPE;

spl.y (also spl.tab.h):

#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"

// more below
------------------------------------------
  2. abstract syntax trees and their use in parsing
   a. background: how the parser works
------------------------------------------
    CONNECTING THE PARSER AND THE ASTs

Parser model

   A stack of (terminals + nonterminals)
   A parallel stack of ASTs
   1 token of lookahead

   Steps in parsing (DFA decides to):

   - shift:
     1. push lookahead on parse stack
     2. push its yylval on AST stack

OR

   - reduce using a rule written:

        nt : a b c { $$ = f($1,$2,$3); };

     1. take a,b,c off parse stack
     2. take their AST values,
          aval,bval,cval, off the AST stack
        and compute
          ntval = f(aval,bval,cval)
     3. push nt on parse stack
     4. push result (ntval) on AST stack

------------------------------------------
   b. use of ASTs in the grammar (.y) file
------------------------------------------
  CONNECTION WITH ASTs IN GRAMMAR FILE

 /* $Id: spl.y ... */

%code requires {
#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"
  /* ... */
}
 /* ... */
%token <ident> identsym
%token <number> numbersym
%token <token> plussym    "+"
%token <token> minussym   "-"
%token <token> multsym    "*"
%token <token> divsym     "/"
%token <token> periodsym  "."
%token <token> semisym    ";"
%token <token> eqsym      "="
%token <token> commasym   ","
%token <token> becomessym ":="
%token <token> lparensym  "("
%token <token> rparensym  ")"
%token <token> constsym   "const"
%token <token> varsym     "var"
 /* ... */

%type <block> program
%type <block> block
%type <const_decls> constDecls
%type <const_def> constDef
%type <var_decls> varDecls
%type <var_decl> varDecl
%type <idents> idents
%type <proc_decls> procDecls
%type <empty> empty
 /* ... */
%type <expr> expr
%type <expr> term
%type <expr> factor

%start program
/* ... */
------------------------------------------
------------------------------------------
  PUTTING A TOKEN VALUE ON THE AST STACK
               (DETAILS)
               
To put yylval on the AST Stack
  when the field name is "token"

in the .y file have:
  %token <token> somesym "some"    

the generated parser has:

#include "ast.h"
#include "parser_types.h" 
   // typedef AST YYSTYPE;
   
  YYSTYPE yyvsa[];  // the AST stack

  yyvsa[yyi].token = yylval;

------------------------------------------
------------------------------------------
PUSHING AST FOR A NONTERMINAL ON AST STACK
             (DETAILS)

To put yylval on the AST Stack
  when the field name is "const_def"

in the .y file have:
  %type <const_def> constDef

the generated parser has:

#include "ast.h"
#include "parser_types.h" 
   // in file parser_types.h: =====
   // typedef AST YYSTYPE;

// in the generated parser file: ==

  YYSTYPE yyvsa[];  // the AST stack

  yyvsa[yyi].const_def = yylval;

------------------------------------------
   c. Example, the const language
------------------------------------------
    THE CONST LANGUAGE

programs all look like:

    const ident = 3402
------------------------------------------
------------------------------------------
      ASTs FOR THE CONST LANGUAGE

/* $Id: ast.h ... */
/* ... */

typedef struct {
    file_location *file_loc;
} generic_t;

typedef struct ident_s {
    file_location *file_loc;
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    word_type value;
} number_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    int code;
} token_t;

typedef struct const_def_s {
    file_location *file_loc;
    ident_t ident;
    number_t number;
} const_def_t;

typedef union AST_u {
    generic_t generic;
    const_def_t const_def;
    token_t token;
    number_t number;
    ident_t ident;
} AST;

extern const_def_t ast_const_def(
  ident_t ident, number_t number);

// ...

------------------------------------------
------------------------------------------
    THE CONST.Y FILE FOR CONST LANGUAGE

 /* ... */
%code requires {
#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"
 /* ...*/
}
 /* ...*/
%token <token> constsym   "const"
%token <ident> identsym
%token <token> eqsym      "="
%token <number> numbersym

%type <const_defs> program
%type <const_def> constDef
%type <const_defs> constDefs
%type <empty> empty

%start program

%code {
extern int yylex(void);
const_def_t progast;
extern void setProgAST(const_def_t t);
}

%%

program : constDef { setProgAST($1); } ;

constDef : "const" identsym "=" numbersym
           { $$ = ast_const_def($2,$4); };

%%

// Set the program's ast to be t
void setProgAST(const_def_t t) {
  progast = t;
}
------------------------------------------
          Do we really need 2?
          When are line numbers recorded?
   d. example of changing the grammar
        How would we make the grammar be <program> ::= <constDefs>
            with <constDefs> ::= { <constDef> } ?
    i. Change the grammar
    ii. Add to the ast module