COP 3402 meeting -*- Outline -*-

* Lexical Analysis

  Based on material from Chapter 2 of the book
  "Modern Compiler Implementation in Java"
  by Andrew W. Appel with Jens Palsberg
  (Cambridge, 1998)

  and the theory also uses Chapter 3 of the book
      "Formal Languages and their Relation to Automata"
      by Hopcroft and Ullman (Addison-Wesley, 1969).

------------------------------------------
           MOTIVATION

 # $Id$\n   .text start\nstart:\tADDI ...

Want to:


Approach:


------------------------------------------
   ...
    - write parser at a high level
       not individual characters
    - ignore comments, whitespace, ...
    - have the parser be efficient
    - deal with bad characters sensibly

  ...
   break input stream of characters
    into tokens:

    textsym    .text
    identsym   start
    identsym   start
    colonsym   :
    addisym    ADDI
    ...

    The parser only sees the non-ignored parts
        not the comments, whitespace, ...

------------------------------------------
            LEXICAL ANALYSIS

Lexical means relating to the words
  of a language


------------------------------------------
    This is the basis of the word "lexicon"

** Goals of Lexical Analysis
------------------------------------------
       GOALS OF LEXICAL ANALYSIS

- Simplify the parser,
  so it need not handle:


- Recognize the longest match
  Why?


- Handle every possible character of input
  Why?


------------------------------------------
     ... - white space and comments
         - details of tokens
             (numbers, identifiers, ...)

     Q: Why recognize the longest match?
        so that "ident" is a single token instead of "i", "d", etc.

     Q: Why handle all possible characters?
        So that it completely checks the program input
           and cannot crash on any input

------------------------------------------
         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers
 are tokens:

What tokens should "if8" match?


Fixing such situations:


------------------------------------------
        ... longest match would favor an identifier "if8"
            but programmer might want to recognize "if" and "8"

       ... Several possible answers:
             - tell programmers that whitespace or punctuation needed
                 to end identifiers, so "if8" is an identifier,
                 but "if 8" is two tokens, "if" and "8"
               (equivalent to always favoring the longest match)
             - give some rules priority
                 (e.g., reserved words more important than identifiers,
                        so "if8" is "if" and "8")

        Q: What tokens should "<=" match?
           leqsym (for less-than-or-equal-to), with longest match

------------------------------------------
           WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?


If the input is "<8", what token(s)?


If the input is "if", what token(s)?


If the input is "//", what token(s)?

Summary:


------------------------------------------

        Q: How would you program the longest match idea?
           Keep building a token until the next char can't be added to it.
           e.g., see < then add = to the token if = follows,
           otherwise return the < token (and unget the next character)

        Q: How would you ensure that reserved words
           are favored over identifiers?  (e.g., "if" is not an identifier)
           After finding an identifier, check to see if it's a reserved word
           and if so, then return the reserved word token

        ... In sum,
              favor is longest match,
              but give priority to reserved words over identifiers


*** Overview
------------------------------------------
          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 / 
                                / abstract
                               /  syntax
                              v   trees
                            [ static
                              analysis ]
                             / 
                            /
                           v
                         [ code generator ]

For the Lexer we want to:
  - specify the tokens
    using regular expressions (REs)
  - convert REs to DFAs to execute them
     but easy conversions are:
        - REs to NFAs
        - NFAs to DFAs

------------------------------------------
        Explain what an the abbreviations mean:
           - NFA = nondeterministic finite state automaton
           - DFA = deterministic finite state automaton

------------------------------------------
        HOW PARSER WORKS WITH LEXER

Couroutine structure:

   Parser calls lexer:
   
          tok = yylex(); // call lexer
           /* ... use yylavl ... */

   Lexer function
       remembers pointer to input stream

         returns next token (int code)

   Parser works...
   
   Parser calls lexer again:
   
          tok = yylex(); // call lexer
           /* ... use yylavl ... */
     
------------------------------------------

        It's a coroutine, because the lexer picks up where it left off
        when called again

**** Use of Automated Tools
------------------------------------------
 BISON AND FLEX, GENERATING A PARSER

idea:

   ast.h (AST types)
    |
    |        bison
    |        -----> g.tab.c
    |       /        ^  yyparse function
    v      / bison   |  
    g.y file -----> g.tab.h  
                     |  tokens
              flex   v  defs. 
    g.l file -----> g.c       
                      yytoken function


------------------------------------------
   ...
     explain all of this:

      The context-free grammar
           (the .y file) is central to this,

      The .y file records:
            - grammar and
            - names/types of the tokens
               (that the parser needs)

      Bison is a parser generator
          it generated the unction yyparse
          in files *.tab.h and *.tab.c

      The ASTs are how the parse is recorded

      Flex is a lexical analyzer generator

      the .l file records
         - the lexical grammar (as REs)
         - how tokens produce ASTs (in yylval)

** Theory of Regular Languages, Regular Expressions
      There is a strong connection between theory and practice in
      syntax analysis,
      with some of the first and most useful results in CS theory!

*** Definitions
------------------------------------------
             DEFINITIONS

The lexical grammar of a language
   is regular, because


def: a grammar is *regular* iff
     all its productions have one of these
     forms:

           <B> ::= c <D>
         or
           <B> ::= c

     where c is a terminal symbol,
       and <B> and <D> are nonterminals

def: a language is *regular* iff
     it can be defined using a
     regular grammar.

Thm: Every regular language can be
     recognized by a finite automaton.

Thm: Every regular language can be
     specified by a regular expression.
------------------------------------------
     ... regular grammars can be parsed/recognized quickly
         using finite automata

         Regular languages are also called "type 3" languages
            (a name from the Chomsky hierarchy)
         A regular language is also called a *regular set*.

         Q: Does it follow that every regular expression
            can be recognized using a finite automaton?
            Yes, indeed!

         Formal language theory studied such theorems...

*** Regular Expressions
    Regular expressions are a convenient and commonly used way to
    specify regular languages and lexical analyzers
------------------------------------------
            REGULAR EXPRESSIONS

The language of regular expressions:

  <RE> ::= <char>
       | <RE> '|' <RE>
       | <RE> <RE>
       | emp            
       | <RE>*
       | ( <RE> )

 where <char> is a character

Examples:

  RE              meaning
====================================
  emp             the empty string

  (0|1)*0         even binary numerals

 b*(abb*)*(a|emp) a's and b's without
                  consecutive a's
  
------------------------------------------
        emp is a way of writing the epsilon character
           which is used to denote the empty string

**** extensions to regular expressions
------------------------------------------
     EXTENSIONS TO REGULAR EXPRESSIONS

[abcd]   means   a|b|c|d

[h-m]    means   h|i|j|k|l|m

 x?      means   x|emp

 y+      means   y(y*)

------------------------------------------

**** examples of regular expressions
------------------------------------------
           FOR YOU TO DO

Write a regular expression that describes:

1. The keyword "if"


2. The set of all (positive)
   decimal numbers


3. The set of all possible identifiers
   with underbars (_)


(If you have time, write these as
 regular grammars also.)


------------------------------------------
        Q: What would be the regular expression for identifiers?
           [_a-zA-Z][_a-zA-Z0-9]*


**** Examples using regular expressions
------------------------------------------
   EXAMPLE REGULAR EXPRESSIONS FOR PL/0

PATTERN         RE
=========================
DECDIGIT	[0-9]
LETTER		[_a-zA-Z]


...
------------------------------------------

        Q: What would be the regular expression for identifiers?
           {LETTER}({LETTERORDIGIT}*)
           
*** finite automata

**** example
------------------------------------------
              EXAMPLE

State diagram:

             <         =
    -->[q0] --->[[q1]]---> [[q2]]
                  |    
                  | >
                  v
                [[q3]]

------------------------------------------
        Explain the notation:
            qi are states
            initial arrow shows start state
            double circles (boxes here) are final states

        Q: What token should be returned at state q2?
           leqsym ("<=")
        Q: What token should be returned at state q3?
           neqsym ("<>")

        Q: What should be done at state q1?
           look at the next character,
           if it's a < or =, go to that state
           if it's neither, return lesssym ("<"),
              but need to save the character for future use...

------------------------------------------
      PSEUDO CODE FOR THIS LEXER CASE

    char c;


------------------------------------------
    ...
    get next char into c
    if (c is '<')
    then
        get next char into c
        if (c is '=')
        then
            return leqsym
        else if (c is '>')
             then
                 return neqsym
             else
                 unget the char c
                 return lesssym

------------------------------------------
        DEALING WITH COMMENTS

Suppose
  /  is used for division
  // starts a comment to end of line
      (unlike PL/0!)

What state diagram?


How does whitespace fit in?


------------------------------------------

                           non-nl 
                            /--\
              '/'       '/' \  v
      -->[q0] --->[[q1]]--->[q2]----\
          ^                         |
           \------------------------/
                     '\n'

       Q: How does whitespace fit in?
           from q0 go back to q0
           from q1, return divsym ("/")
           
------------------------------------------
TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to parser

Can always get a non-ignored token:


Return "tokens" that include ignored stuff
  to a loop that ignores them


Giant DFA that goes back to start state
  on seeing something to ignore


------------------------------------------
     ...- recognizer consumes ignored stuff
          before each token
          invariant: next char is not start of whitespace or comment
                     (this works when a single char starts a comment)

     ...- recognizer returns "tokens" (or something similar) for ignored stuff,
          but a loop keeps calling recognizer until it
          gets a token that is not ignored
                     (simplifies comments if start with multiple characters)

     ... - recognizer hard to code, not modular
     
**** definitions
    Remember the big picture:
       We can translate REs to NFAs and then to DFAs that we can execute.

    The following is from Hopcroft and Ullman,
    chapter 3, adapted
------------------------------------------
     NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton*
    (NFA) over an alphabet Sigma
    is a system (K, Sigma, delta, q0, F)
    where K is a finite set (of states),
          Sigma is a finite set set
            (the input alphabet),
          delta: is a map of type
             (K, Sigma) -> Sets(K)
          q0 in K is the initial state, &
          F is a subset of K
            (the final/accepting states).

------------------------------------------
        So delta(q,x) returns a set of states
            (The type Sets(K) is the set of all subsets of K,
             i.e., Powerset(K))

        Since computers aren't good at guessing,
         we can simulate an NFA by
         having the machine track all possible states 

------------------------------------------
     TRANSITION FUNCTION AND ACCEPTANCE

  p in delta(q,x)
     means that in state q,
       on input x the next state
       can be p

  p in delta*(q,s)
     where s in Sigma*
      is defined by:
              delta*(q,emp) = {q}
         (i.e., q in delta*(q,emp))
         p in delta*(q,xa) =
                delta*(q2, a)
            where x in Sigma,
                  a in Sigma*,
                  q2 in delta(q,x)

Lemma: for all c in Sigma,
   p in delta*(q, c) = delta(q, c)

def: An NFA (K, Sigma, delta, q0, F)
     *accepts* a string s in Sigma* iff
     there is some q in delta*(q0,s)
                 such that q in F

------------------------------------------
         delta* extends delta pointwise (to strings from Sigma*)

         Proof of the lemma:  (omit)
             p in delta*(q, c)
         iff   (by c = c emp)
             p in delta*(q, c emp)
         iff   (by def)
             p in delta(q,c)

         Q: If each final state means to return a token,
            what token should be returned if there are many final states?
            Have to decide based on longest match or priority

**** example
------------------------------------------
                  EXAMPLE NFA

         0,1                     0,1
        /---\                   /---\
        \   /                   \   /
         | v   0           0     | v
    -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]]
         |
         | 1
         v
       [ q1 ]
         |
         | 1
         v
      [[ q2 ]]
         | ^
        /   \
        \---/
         0,1

 K = {q0,q1,q2,q3,q4}
 Sigma = {0,1}
 q0 is start state
 F = {q2,q4}

 delta(q0,0) = {q0,q3}
 delta(q1,0) = {}
 delta(q2,0) = {q2}
 delta(q3,0) = {q4}
 delta(q4,0) = {q4}
 delta(q0,1) = {q0,q1}
 delta(q1,1) = {q2}
 delta(q2,1) = {q2}
 delta(q3,1) = {}
 delta(q4,1) = {q4}

------------------------------------------
        Note: double brackets ([[ and ]]) enclose final states (in F)
        
------------------------------------------
   EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to
   sets of states
     d(Q,x) = union of d(q,x)
                for all q in Q

     so d({},x) = {}
        d{{q},x) = d(q,x)
        d({q1,q2},x) = d(q1,x)+d(q2,x)
        d({q1,q2,q3},x) =
               d(q1,x)+d(q2,x)+d(q3,x)
        etc.


Note that delta*({q}, x)
       = delta*(q, x)
       = delta(q, x)

------------------------------------------
        I'm using + to stand for "set union" in these notes

------------------------------------------
             EXAMPLE

delta*(q0,010011)
    = delta*(delta(q0,0),10011)
    = delta*({q0,q3},10011)
    = delta*(q0,10011)
      + delta*(q3,10011)
    = delta*(delta(q0,1),0011)
      + delta*(delta(q3,1),0011)
    = delta*({q0,q1},0011)
      + delta*({},0011)
    = delta*(delta(q0,0),011)
         + delta*(delta(q1,0),011)
      + {}
    = delta*({q0,q3},011)
         + delta*({},011)
      + delta*({q2},011)
    = delta*({q0,q3},011)
         + {}
      + delta*({q2},011)
    = delta*({q0,q3},011)
      + delta*({q2},011)
    = delta*(delta(q0,0),11)
         + delta(q3,0),11)
      + delta*(delta(q2,0),11)
    = delta*({q0,q3},11)
         + delta*({q4},11)
      + delta*({q2},11)
    = delta*({q0,q3,q4},1)
         + delta*({q4},1)
      + delta*({q2},1)
    = delta({q0,q1,q4},1)
         + delta(q4,1)
      + delta(q2,1)
    = {q0,q1,q2,q4}
         + {q4}
      + {q2}
    = {q0,q1,q2,q4}


------------------------------------------
        Do this example using the diagram first.
        
        Q: So does this machine accept 010011?
        Yes, as both q2 and q4 are in F

        Q: What strings does this machine accept?
        binary strings with either
           two consecutive 0's or
           two consecutive 1's

**** deterministic finite automata        
------------------------------------------
     DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton*
     (DFA) is an NFA in which
     delta(q,c) is a singleton or empty
        for all q in K and c in Sigma.


------------------------------------------
      So, in a DFA the transition function
          is a (partial) function


------------------------------------------
          IMPLEMENTING DFAS
          
How would you represent states?


How would you implement a DFA?


------------------------------------------

      ... by (unsigned) integers

      ... for each state,
            use a switch statement (with a case on each char)
            to determine the next state

          or

             use a 2D array
                with indexes of state and a char,
                each entry of which is a state


*** converting NFA to DFA
------------------------------------------
           PROBLEM

We want to specify lexical grammar
   using

So we need to convert regular expressions
   into DFAs 


------------------------------------------
       ... regular expressions

       Q: Why DFAs?
       Because they are easy to implement
          But from an RE, easiter to transform into an NFA,
          so we need to do the following (in genteral):

**** Converting Regular Expressions to NFAs
------------------------------------------
         CONVERTING REs TO NFAs

Definition based on grammar
   of Regular Expressions:

Result of Convert(M) looks like
 this:  -->(M q)
    where the "tail", -->,
                 goes to the start state
          and q is the "head state"
 assume also Convert(N) is --->(N q')

                 c
  Convert(c) =  --->[ q ]

  Convert(M | N) =
                 /--->(M q)--\
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \             /
                 \--->(N q')-/

  Convert(M N) =

           --> (M q)-->(N q')

                  emp
  Convert(emp) = -----> [ q ]

                       emp
                   /--------------\
                  /         emp   v
  Convert(M*) = -/ /->(M q)---->[ q2 ]
                  /              /
                  \   emp       /
                   \-----------/

  Convert((M)) = Convert(M)

  After conversion, make the "head state"
   be a final state
------------------------------------------

------------------------------------------
        EXAMPLE OF CONVERSION TO NFA

Regular expression:  (i|j)*

               i
 Convert(i) = --->[ qi ]

               j
 Convert(j) = --->[ qj ]

 Convert(i|j) = 
                   i
                 /--->(qi)---\ emp
       emp      /             \
      ---->[ q ]               -> [ q2 ]
                \  j          /
                 \--->(qj)---/ emp

 Convert((i|j)*) =


------------------------------------------
        ... draw the NFA for Convert((i|j)*)

            and mark the resulting head state as final

                      emp
                   /-------------\
                  /   i       emp v
Convert((i|j)*)=-/ /-q/----/->[ q2]
                  /  j\-qj/      /
                  \   emp       /
                   \-----------/


         Q: In programming terms, do we need to use an NFA to
            recognize "if" and other reserved words?
         We will use REs which are the same as NFAs, so we could,
         But if programming by hand, then just focus on identifiers,
                                     and check for reserved words

**** Converting NFAs to DFAs
     You don't have to go through this process
     if you can see how to write the DFA to begin with,
     but it's useful to have a process in order to embed it in a program.

------------------------------------------
       CONVERTING AN NFA INTO A DFA

Idea:
  Convert each reachable set of NFA states
  into


How?

  Use the emp-closure of each state q
     = set of states reachable from q
       using emp

 Closure wrt emp:
  closure(S) is the smallest set T
  such that
     T = S +
        union {delta(s, emp) | s in T}
  
  can compute closure(S) as

     T <- S;
     do T2 <- T
        T <- T2 +
           union {delta(s, emp)| s in T2}
     while (T != T2)

 DFA Transitions:
  Let S be a set of states, then
    DFAdelta(S, c) =
       closure(union {delta(s,c)|s in S})

  
------------------------------------------
        ... a single state of the DFA
            (and the conversion is done before using the DFA,
            it's pre-computed!)

        Q: Why do we need to take the closure wrt emp?
        Because the NFA can always follow emp transitions without input
        
        In these formula delta is the NFA's transition function,
           and "union" takes a set of sets and unions them all
           together

------------------------------------------
     EXAMPLE CONVERSION OF NFA TO DFA

NFA for if|[a-z]([a-z]|[0-9])*

                 f
           [q2] ---> [[q3]]
            ^                    emp
         i /                   /------\
          / emp       a-z     /       v
   -->[q1]------>[q4]----->[q5]    [[q8]]
                                     ^ |
                                  emp| |
                                     | |
                                a-z  | |
                             /-----\ | |
                            /      v / |
                        |->[q6]   [q7] |
                        |   \  0-9 ^   |
                    emp |    \----/    /
                        |             / 
                        \    emp     /
                         \----------/


Converted to DFA:

                   f
        [2,5,6,8] ---> [3,6,7,8]
          ^               \
      i  /                 \ a-z
        / a-h j-z      a-z  \ 0-9  
  -->[1,4]---->[5,6,8] 0-9   v    /-|
                    \----->[6,7,8]  |a-z
                                ^   |0-9
                                 \  |
                                  \-|


------------------------------------------
        Q: Which states are final in the DFA?
            (all of them that include either 3 or 8,
              so all but [1,4])

        Q: Which tokens would be returned in each state?
            ("if" from 3,6,7,8, ident from the other final states)

** Using Flex
*** Files and coordination with parser

------------------------------------------
  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SRM assembler

   High-level description in
       asm_lexer.l

   Generated lexer asm_lexer.c
                   + asm_lexer.h

   Wrapper for lexer:
        lexer.h
           declares functions
        lexer.c
           does nothing
        asm_lexer.l
           defines functions
             e.g., lexer_print_token

   ASTs defined in ast.h

   asm.y is Bison description file grammar

   == bison ==>

   - Declarations in asm.tab.h
        includes ast.h
                 machine_types.h
                 parser_types.h
                     declares YYSTYPE
                 lexer.h
        declares
                 yytokentype
                   eolsym = ...
                   minussym = ...
                   dottextsym = ...
                   ...
                   
   - Definitions in asm.tab.c
        defines yyparser()
                YYSTYPE yylval;

------------------------------------------
       There are options in flex for naming the .h and .c files generated.

       So for homework 2, we'll give you the pl0.tab.h file,
          since we'll need the token types (yytokentype)
          and a minimal pl0.tab.c that defines yylval.

*** structure of flex input
------------------------------------------
      STRUCTURE OF FLEX INPUT FILE

  /* ... definitions section ... */

%%
  /* ... rules section ... */
%%

  /* ... user subroutines ... */


------------------------------------------

------------------------------------------
    SECTIONS IN FLEX INPUT (.l file)

Definitions section:


Rules section:


User subroutine section:


------------------------------------------

     ...
       options,
       #includes,
       declarations of names used in rules
       definitions of named REs,
       declarations of
           start conditions (states)

       
     ...
       pairs of REs and actions (code)

     ...
       definitions of functions used

Go over code in lexer_main.c
          show output
     lexer.h,
     lexer.c
     asm_lexer.l
       and generated code asm_lexer.h
                          asm_lexer.c