COP 3402 meeting -*- Outline -*-

* Lexical Analysis

  Based on material from Chapter 2 of the book
  "Modern Compiler Implementation in Java"
  by Andrew W. Appel with Jens Palsberg
  (Cambridge, 1998)

------------------------------------------
           MOTIVATION

 # $Id$\n   .text start\nstart:\tADDI ...

Want to:


Approach:


------------------------------------------
   ...
    - write parser at a high level
       not individual characters
    - ignore comments, whitespace, ...
    - have the parser be efficient
    - deal with bad characters sensibly

  ...
   break input stream of characters
    into tokens:

    textsym    .text
    identsym   start
    identsym   start
    colonsym   :
    addisym    ADDI
    ...

    The parser only sees the non-ignored parts
        not the comments, whitespace, ...

------------------------------------------
            LEXICAL ANALYSIS

Lexical means relating to the words
  of a language


------------------------------------------
    This is the basis of the word "lexicon"

** Goals of Lexical Analysis
------------------------------------------
       GOALS OF LEXICAL ANALYSIS

- Simplify the parser,
  so it need not handle:


- Recognize the longest match
  Why?


- Handle every possible character of input
  Why?


------------------------------------------
     ... - white space and comments
         - details of tokens
             (numbers, identifiers, escape sequences, etc.)

     Q: Why recognize the longest match?
        so that "ident" is a single token instead of "i", "d", etc.
        Also want == to be one token, not = and =.

     Q: Why handle all possible characters?
        So that it completely checks the program input
           and cannot crash on any input

------------------------------------------
         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers
 are tokens:

What tokens should "if8" match?


Fixing such situations:


------------------------------------------
        ... longest match would favor an identifier "if8"
            but programmer might want to recognize "if" and "8"

       ... Several possible answers:
             - tell programmers that whitespace or punctuation needed
                 to end identifiers, so "if8" is an identifier,
                 and so the programmer needs to fix the program

             - give some rules priority
                 (e.g., reserved words more important than identifiers,
                        so "if8" is "if" and "8")

        Q: What tokens should "<=" match?
           leqsym (for less-than-or-equal-to), with longest match

------------------------------------------
           WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?


If the input is "==", what token?


If the input is "<8", what token(s)?


If the input is "if", what token(s)?


If the input is "//", what token(s)?


Summary:


------------------------------------------
        ... leqsym
        ... eqeqsym
        ... ltsym followed by numbersym
        ... ifsym
        ... divsym followed by divsym in SPL;
            in a language like C or C++, no token at all, since // starts a comment

        Q: How would you program the longest match idea?
           Keep building a token until the next char can't be added to it.
           e.g., see < then add = to the token if = follows,
           otherwise return the < token (and unget the next character)
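
           A minimal sketch of that idea in C, under stated assumptions:
           the input is an in-memory string rather than a file, and the
           names struct cursor, next_char, unget_char, and next_op are
           hypothetical (they are not part of the course code). The token
           names ltsym, leqsym, and eqeqsym follow the slides; the others
           are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; real ones would come from yytokentype. */
enum tok { ltsym, leqsym, eqsym, eqeqsym, errsym, eofsym };

/* A cursor over an in-memory input string (an assumption for brevity). */
struct cursor { const char *text; size_t pos; };

/* Return the next character, or '\0' at the end of the input. */
static char next_char(struct cursor *c) {
    char ch = c->text[c->pos];
    if (ch != '\0') c->pos++;
    return ch;
}

/* Push back the most recently read character (nothing to do at end of input). */
static void unget_char(struct cursor *c, char ch) {
    if (ch != '\0') {
        assert(c->pos > 0);
        c->pos--;
    }
}

/* Longest match for "<", "<=", "=", and "==". */
enum tok next_op(struct cursor *c) {
    assert(c != NULL && c->text != NULL);
    char first = next_char(c);
    if (first == '\0') return eofsym;
    if (first != '<' && first != '=') return errsym;
    char second = next_char(c);
    if (second == '=')              /* the token can be extended */
        return first == '<' ? leqsym : eqeqsym;
    unget_char(c, second);          /* it cannot: give the character back */
    return first == '<' ? ltsym : eqsym;
}
```

           On the input "<8" this returns ltsym and leaves the 8 unread,
           so the next call can start a number token there.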

        Q: How would you ensure that reserved words
           are favored over identifiers?  (e.g., "if" is not an identifier)
           After finding an identifier's characters,
           check to see if it's a reserved word,
           and if so, then return the reserved word token
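
           One way this check might look in C. This is only a sketch:
           the table of reserved words is an illustrative subset, whilesym
           is a made-up token name, and the function names ident_length
           and classify_ident are hypothetical. It assumes identifiers are
           a letter followed by letters or digits.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical token codes; real ones would come from yytokentype. */
enum tok { identsym, ifsym, whilesym };

/* Reserved words and their token codes (an illustrative subset). */
static const struct { const char *word; enum tok code; } reserved[] = {
    { "if", ifsym },
    { "while", whilesym },
};

/* Length of the identifier at the start of s (0 if there is none),
   found by longest match: a letter, then letters or digits. */
size_t ident_length(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    if (isalpha((unsigned char)s[0])) {
        n = 1;
        while (isalnum((unsigned char)s[n])) n++;
    }
    return n;
}

/* Classify an n-character lexeme: reserved word if it is in the table,
   otherwise an ordinary identifier. */
enum tok classify_ident(const char *lexeme, size_t n) {
    assert(lexeme != NULL);
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++) {
        if (strlen(reserved[i].word) == n
            && strncmp(reserved[i].word, lexeme, n) == 0)
            return reserved[i].code;
    }
    return identsym;
}
```

           Because the whole identifier is scanned first, "if8" has length 3
           and is classified as identsym, while "if" followed by a space
           is classified as ifsym.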

        ... In sum,
              favor the longest match,
              but give priority to reserved words over identifiers

*** Overview
------------------------------------------
          THE BIG PICTURE

                       tokens
 - source --> [ Lexer] ------> [ Parser]
    code                          /
                                 / 
                                / abstract
                               /  syntax
                              v   trees
                            [ static
                              analysis ]
                             / 
                            /
                           v
                         [ code generator ]

For the Lexer we want to:
  - specify the tokens
    using regular expressions (REs)
  - convert REs to DFAs to execute them
     but easy conversions are:
        - REs to NFAs
        - NFAs to DFAs

------------------------------------------
        Explain what the abbreviations mean:
           - NFA = nondeterministic finite state automaton
           - DFA = deterministic finite state automaton
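
        To make "executing a DFA" concrete, here is a small hand-built
        DFA in C. It is a sketch, not what flex generates: it assumes
        just two token kinds, identifiers ([a-zA-Z][a-zA-Z0-9]*) and
        numbers ([0-9]+), and all names in it (delta, classify,
        dfa_longest_match, the state names) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum cls { LETTER, DIGIT, OTHER };               /* character classes */
enum state { START, IN_IDENT, IN_NUMBER, DEAD }; /* DFA states */

/* Transition table: delta[state][character class] is the next state. */
static const enum state delta[4][3] = {
    /* START     */ { IN_IDENT, IN_NUMBER, DEAD },
    /* IN_IDENT  */ { IN_IDENT, IN_IDENT,  DEAD },
    /* IN_NUMBER */ { DEAD,     IN_NUMBER, DEAD },
    /* DEAD      */ { DEAD,     DEAD,      DEAD },
};

static enum cls classify(char ch) {
    if (isalpha((unsigned char)ch)) return LETTER;
    if (isdigit((unsigned char)ch)) return DIGIT;
    return OTHER;
}

/* Run the DFA from the start of s. Returns the length of the longest
   accepted prefix (0 if none) and stores the accepting state reached
   in *final (DEAD if none). */
size_t dfa_longest_match(const char *s, enum state *final) {
    assert(s != NULL && final != NULL);
    enum state st = START;
    size_t len = 0;
    *final = DEAD;
    for (size_t i = 0; s[i] != '\0'; i++) {
        st = delta[st][classify(s[i])];
        if (st == DEAD) break;
        /* IN_IDENT and IN_NUMBER are both accepting states. */
        len = i + 1;
        *final = st;
    }
    return len;
}
```

        The state in which the DFA stops tells the lexer which kind of
        token it found; the returned length gives the longest match.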

------------------------------------------
        HOW PARSER WORKS WITH LEXER

Coroutine structure:

   Parser calls lexer:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */

   Lexer function
       remembers pointer to input stream

         returns next token (int code)

   Parser works...
   
   Parser calls lexer again:
   
          yychar = yylex(); // call lexer
           /* ... use yychar ... */
     
------------------------------------------

        It's a coroutine,
        because the lexer picks up where it left off
        when called again
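
        A toy illustration of "picks up where it left off", written in C.
        This is a sketch under assumptions: the input is a string, the
        names toy_set_input, toy_lex, and the TOY_ token codes are
        hypothetical, and only the convention of returning 0 at the end
        of input is taken from yylex().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes; 0 marks the end of input, as with yylex(). */
enum { TOY_EOF = 0, TOY_NUMBER, TOY_IDENT, TOY_OTHER };

/* The lexer's saved position in the input; it persists between calls. */
static const char *input_pos = NULL;

/* Tell the lexer which text to read. */
void toy_set_input(const char *text) {
    assert(text != NULL);
    input_pos = text;
}

/* Return the code of the next token, resuming where the previous
   call stopped. */
int toy_lex(void) {
    assert(input_pos != NULL);
    while (isspace((unsigned char)*input_pos)) input_pos++; /* ignore whitespace */
    if (*input_pos == '\0') return TOY_EOF;
    if (isdigit((unsigned char)*input_pos)) {
        while (isdigit((unsigned char)*input_pos)) input_pos++;
        return TOY_NUMBER;
    }
    if (isalpha((unsigned char)*input_pos)) {
        while (isalnum((unsigned char)*input_pos)) input_pos++;
        return TOY_IDENT;
    }
    input_pos++;   /* any other single character */
    return TOY_OTHER;
}
```

        After toy_set_input("start: 8"), successive calls to toy_lex()
        return TOY_IDENT, TOY_OTHER, TOY_NUMBER, and then TOY_EOF; the
        caller never passes a position, since the lexer keeps it.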

**** Use of Automated Tools
------------------------------------------
 BISON AND FLEX, GENERATING A PARSER

idea:

  ast.h (AST types)
   ^
   |        bison
   |        -----> spl.tab.c
   |       /        ^   yyparse()
   |      / bison   |  
   |   spl.y        |
   |      \------> spl.tab.h  
   |                      \  tokens
   |               flex    v  defs. 
  spl_lexer.l file --------> spl_lexer.c       
                              yylex()

------------------------------------------
   ...
     explain all of this:

      The context-free grammar
           (the .y file) is central to this,

      The .y file records:
            - grammar and
            - names/types of the tokens
               (that the parser needs,
                and the lexer produces)

      Bison is a parser generator
          it generates the function yyparse()
          in files spl.tab.h and spl.tab.c

      The ASTs are structs that record the parse

      Flex is a lexical analyzer generator

      the spl_lexer.l file records
         - the lexical grammar (using REs)
         - how tokens produce ASTs (in yylval)

------------------------------------------
     FOR HOMEWORK 2 (LEXICAL ANALYSIS)


  ast.h (AST types)
   ^
   |       
   | 
   | 
   | 
   |                  spl.tab.h  
   |                      \  tokens
   |               flex    v  defs. 
  spl_lexer.l file --------> spl_lexer.c       
                              yylex()

------------------------------------------

         The lexer needs to record tokens as token ASTs,
             for later use by the parser,
         but otherwise the ASTs are not important for homework 2

** Using Flex
*** Files and coordination with parser

------------------------------------------
  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SSM assembler (in hw1 zip file)

   High-level description in file
 asm_lexer.l

  === flex ===>

   generates:

      asm_lexer.c + asm_lexer.h

   Wrapper for lexer:
        lexer.h
           declares functions
        lexer.c
           errors_noted variable +
           some utility functions
        asm_lexer.l
           defines functions
             e.g., lexer_print_token
                   lexer_output

   (ASTs defined in ast.h)

 asm.y is the Bison grammar description file

  == bison ==>

   - Declarations in asm.tab.h
        #includes ast.h
                  machine_types.h
                  parser_types.h
                     (declares YYSTYPE)
                  lexer.h

        declares
              yytokentype
                eolsym = ...
                minussym = ...
                dottextsym = ...
                ...
                   
   - Definitions in asm.tab.c
        defines yyparse()
                YYSTYPE yylval;

------------------------------------------
       There are options in flex for naming the .h and .c files generated.

       The type YYSTYPE is the type of the ASTs

       Since the connection between flex and bison is tricky,
       for homework 2 we give you:
          - the spl.tab.h file, from which you will need
              the token types (yytokentype), and
          - a minimal spl.tab.c that defines yylval.

*** structure of flex input
------------------------------------------
      STRUCTURE OF FLEX INPUT FILE

  /* ... definitions section ... */

ARRGH r
EEEE eeee
RE   {ARRGH}{EEEE}

%%
  /* ... rules section ... */

re   { tok2ast(resym); return resym; }
{RE} { tok2ast(re4sym); return re4sym; }

%%

  /* ... user subroutines ... */


------------------------------------------
        Explain the difference between re and {ARRGH}{EEEE}

        The regular expression
          - must start in column 1 (otherwise it's taken as C code)
          - if it uses a defined name, the name must be written in braces, e.g., {RE}
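
        The difference can be imitated by hand in C. In this sketch the
        rule "re" matches the two literal characters r and e, while {RE}
        expands to {ARRGH}{EEEE}, that is, r followed by eeee. When both
        rules match, the longer match wins. The names match_literal,
        match_rules, and nosym are hypothetical; resym and re4sym come
        from the slide.

```c
#include <assert.h>
#include <string.h>

/* nosym is a made-up code meaning "neither rule matched". */
enum tok { nosym, resym, re4sym };

/* Length of the match of the literal pattern at the start of s,
   or 0 if s does not start with it. */
static size_t match_literal(const char *s, const char *pattern) {
    size_t n = strlen(pattern);
    return strncmp(s, pattern, n) == 0 ? n : 0;
}

/* Try both rules at the start of s and pick the longest match.
   Stores the length of the chosen match in *len. */
enum tok match_rules(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    size_t plain = match_literal(s, "re");     /* rule: re               */
    size_t named = match_literal(s, "reeee");  /* rule: {RE}, expanded   */
    if (named > plain) { *len = named; return re4sym; }
    if (plain > 0)     { *len = plain; return resym; }
    *len = 0;
    return nosym;
}
```

        So the input "reeee" yields re4sym (5 characters), while "reee"
        yields resym (2 characters), leaving "ee" for later rules.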

------------------------------------------
    SECTIONS IN FLEX INPUT (.l file)

Definitions section:


Rules section:


User subroutine section:


------------------------------------------

     ...
       options,
       (C) #includes,
       (C) declarations of names used in rules
       (flex) definitions of named REs,
       (flex) declarations of
                start conditions (states)
       
     ...
       pairs of REs and actions (code)

     ...
       definitions of functions used

Go over code in the assembler's lexer_main.c

          show output
     lexer.h,
     lexer.c
     asm_lexer.l
       and generated code asm_lexer.h
                          asm_lexer.c