COP 3402 meeting -*- Outline -*-

* Lexical Analysis

Based on material from Chapter 2 of the book
"Modern Compiler Implementation in Java" by Andrew W. Appel
with Jens Palsberg (Cambridge, 1998)

------------------------------------------
               MOTIVATION

# $Id$\n .text start\nstart:\tADDI ...

Want to:

Approach:

------------------------------------------
        ... - write parser at a high level,
              not individual characters
            - ignore comments, whitespace, ...
            - have the parser be efficient
            - deal with bad characters sensibly

        ... break input stream of characters into tokens:
              textsym   .text
              identsym  start
              identsym  start
              colonsym  :
              addisym   ADDI
              ...
            The parser only sees the non-ignored parts,
            not the comments, whitespace, ...

------------------------------------------
            LEXICAL ANALYSIS

Lexical means relating to the words of a language

------------------------------------------
        The word "lexicon" comes from the same root

** Goals of Lexical Analysis

------------------------------------------
       GOALS OF LEXICAL ANALYSIS

- Simplify the parser, so it need not handle:

- Recognize the longest match
     Why?

- Handle every possible character of input
     Why?

------------------------------------------
        ... - white space and comments
            - details of tokens
              (numbers, identifiers, escape sequences, etc.)

        Q: Why recognize the longest match?
           so that "ident" is a single token instead of "i", "d", etc.
           Also want == to be one token, not = and =.

        Q: Why handle all possible characters?
           So that the lexer completely checks the program input
           and cannot crash on any input

------------------------------------------
         CONFLICT BETWEEN RULES

Suppose that both "if" and numbers are tokens:

What tokens should "if8" match?

Fixing such situations:

------------------------------------------
        ... longest match would favor an identifier "if8"
            but programmer might want to recognize "if" and "8"
        ...
        Several possible answers:
          - tell programmers that whitespace or punctuation
            is needed to end identifiers,
            so "if8" is an identifier, and
            so the programmer needs to fix the program
          - give some rules priority
            (e.g., reserved words more important than identifiers,
             so "if8" is "if" and "8")

        Q: What tokens should "<=" match?
           leqsym (for less-than-or-equal-to), with longest match

------------------------------------------
         WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?

If the input is "==", what token?

If the input is "<8", what token(s)?

If the input is "if", what token(s)?

If the input is "//", what token(s)?

Summary:

------------------------------------------
        ... leqsym
        ... eqeqsym
        ... ltsym followed by numbersym
        ... ifsym
        ... divsym followed by divsym in SPL;
            for a language like C or C++, nothing, as it starts a comment

        Q: How would you program the longest match idea?
           Keep building a token until the next char can't be added to it.
           e.g., see < then add = to the token if = follows,
                 otherwise return the < token
                 (and unget the next character)

        Q: How would you ensure that reserved words are favored
           over identifiers? (e.g., "if" is not an identifier)
           After finding an identifier's characters,
           check to see if it's a reserved word,
           and if so, then return the reserved word token
        ...
        In sum, favor the longest match,
        but give priority to reserved words over identifiers

*** Overview

------------------------------------------
            THE BIG PICTURE

                      tokens
 source --> [ Lexer ] ------> [ Parser ]
  code                          /
                               /
                              /  abstract
                             /   syntax
                            v    trees
                [ static analysis ]
                     /
                    /
                   v
          [ code generator ]

For the Lexer we want to:
  - specify the tokens using
    regular expressions (REs)
  - convert REs to DFAs to execute them
    but easy conversions are:
      - REs to NFAs
      - NFAs to DFAs
------------------------------------------
        Explain what the abbreviations mean:
        - NFA = nondeterministic finite state automaton
        - DFA = deterministic finite state automaton

------------------------------------------
      HOW PARSER WORKS WITH LEXER

Coroutine structure:

Parser calls lexer:
    yychar = yylex();  // call lexer
    /* ... use yychar ... */
                           Lexer function
                             remembers pointer to input stream
                             returns next token (int code)
Parser works...
Parser calls lexer again:
    yychar = yylex();  // call lexer
    /* ... use yychar ... */
------------------------------------------
        It's a coroutine, because the lexer picks up
        where it left off when called again

**** Use of Automated Tools

------------------------------------------
   BISON AND FLEX, GENERATING A PARSER

idea:
           ast.h (AST types)
             ^
             |             bison
             |            -----> spl.tab.c
             |           /         ^   yyparse()
             |          /  bison   |
             |      spl.y          |
             |          \------> spl.tab.h
             |                       \   tokens
             |            flex        v  defs.
     spl_lexer.l file --------> spl_lexer.c
                                    yylex()

     spl.y file -----> spl.tab.h
                           \   tokens
                  flex      v  defs.
     spl_lexer.l file -----> spl_lexer.c
                              yylex() function
------------------------------------------
        ...
        explain all of this:
        The context-free grammar (the .y file) is central to this.
        The .y file records:
          - grammar and
          - names/types of the tokens
            (that the parser needs, and the lexer produces)
        Bison is a parser generator
          it generates the function yyparse()
          in files spl.tab.h and spl.tab.c
        The ASTs are structs that record the parse
        Flex is a lexical analyzer generator
          the spl_lexer.l file records
            - the lexical grammar (using REs)
            - how tokens produce ASTs (in yylval)

------------------------------------------
   FOR HOMEWORK 2 (LEXICAL ANALYSIS)

           ast.h (AST types)
             ^
             |
             |
             |
             |
             |                   spl.tab.h
             |                       \   tokens
             |            flex        v  defs.
     spl_lexer.l file --------> spl_lexer.c
                                    yylex()
------------------------------------------
        The lexer needs to record tokens as token ASTs,
        for later use by the parser,
        but otherwise the ASTs are not important for homework 2

** Using Flex

*** Files and coordination with parser

------------------------------------------
  USING THE FLEX TOOL TO GENERATE LEXERS

Example: SSM assembler (in hw1 zip file)

High-level description in file asm_lexer.l
      === flex ===>
  generates: asm_lexer.c + asm_lexer.h

Wrapper for lexer:
  lexer.h declares functions
  lexer.c errors_noted variable
          + some utility functions
  asm_lexer.l defines functions
          e.g., lexer_print_token
                lexer_output
  (ASTs defined in ast.h)

asm.y is Bison description file
  grammar
      == bison ==>
  - Declarations in asm.tab.h
      #includes ast.h
                machine_types.h
                parser_types.h (declares YYSTYPE)
                lexer.h
      declares yytokentype
                eolsym = ...
                minussym = ...
                dottextsym = ...
                ...
  - Definitions in asm.tab.c
      defines yyparse()
              YYSTYPE yylval;
------------------------------------------
        There are options in flex for naming
        the .h and .c files generated.

        The type YYSTYPE is the type of the ASTs

        Since the connection between flex and bison is tricky,
        for homework 2 we give you:
          - the spl.tab.h file
            (from that file you will need the token types, yytokentype)
          - a minimal spl.tab.c that defines yylval.
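
        The pieces named above (token codes like yytokentype, a value
        variable like yylval, and a lexer function that the parser calls
        repeatedly) can be sketched by hand in C, without flex or bison.
        Everything prefixed with toy_ is hypothetical and made up for
        this illustration; it is not the homework's interface, and the
        token values are arbitrary. The toy value type holds only an int,
        whereas the real YYSTYPE holds ASTs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for a yytokentype-style enum (hypothetical names and values). */
enum toy_tokentype { toy_eof = 0, toy_numbersym = 258, toy_plussym, toy_badsym };

/* Stand-in for YYSTYPE and yylval: the lexer leaves token details here. */
typedef struct { int number; } toy_stype;
toy_stype toy_lval;

/* The lexer's remembered position in the input, kept between calls. */
static const char *toy_cursor = NULL;

void toy_set_input(const char *text)
{
    assert(text != NULL);
    toy_cursor = text;
}

/* Return the code of the next token, picking up where the last call stopped. */
int toy_lex(void)
{
    assert(toy_cursor != NULL);
    while (isspace((unsigned char)*toy_cursor)) {
        toy_cursor++;                      /* whitespace is ignored */
    }
    if (*toy_cursor == '\0') {
        return toy_eof;
    }
    if (isdigit((unsigned char)*toy_cursor)) {
        int value = 0;
        while (isdigit((unsigned char)*toy_cursor)) {
            value = value * 10 + (*toy_cursor - '0');
            toy_cursor++;
        }
        toy_lval.number = value;           /* details go in the value variable */
        return toy_numbersym;
    }
    if (*toy_cursor == '+') {
        toy_cursor++;
        return toy_plussym;
    }
    toy_cursor++;                          /* consume the bad character */
    return toy_badsym;
}

/* A tiny "parser" for  number { + number } ; returns the sum, or -1 on error. */
int toy_parse_sum(const char *text)
{
    toy_set_input(text);
    int total = 0;
    int tok = toy_lex();                   /* parser calls lexer */
    for (;;) {
        if (tok != toy_numbersym) return -1;
        total += toy_lval.number;
        tok = toy_lex();                   /* parser calls lexer again */
        if (tok == toy_eof) return total;
        if (tok != toy_plussym) return -1;
        tok = toy_lex();
    }
}
```

        Note that the parser never looks at characters or whitespace;
        it only sees int token codes and, for numbers, the value that
        the lexer left in toy_lval.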
*** structure of flex input

------------------------------------------
     STRUCTURE OF FLEX INPUT FILE

 /* ... definitions section ... */
ARRGH   r
EEEE    eeee
RE      {ARRGH}{EEEE}

%%
 /* ... rules section ... */
re      { tok2ast(resym); return resym; }
{RE}    { tok2ast(re4sym); return re4sym; }

%%
 /* ... user subroutines ... */

------------------------------------------
        Explain the difference between re and {ARRGH}{EEEE}:
          re matches the literal characters "re";
          {ARRGH}{EEEE} expands the named definitions,
          so it matches "reeee"

        The regular expression
          - must start in column 1
            (otherwise it's taken as C code)
          - if it uses defined names,
            must write each name in braces, as in {RE}

------------------------------------------
    SECTIONS IN FLEX INPUT (.l file)

Definitions section:

Rules section:

User subroutine section:

------------------------------------------
        ... options,
            (C) #includes,
            (C) declarations of names used in rules
            (flex) definitions of named REs,
            (flex) declarations of start conditions (states)
        ... pairs of REs and actions (code)
        ... definitions of functions used

        Go over code in the assembler's
          lexer_main.c
          show output
          lexer.h, lexer.c
          asm_lexer.l
          and generated code asm_lexer.h asm_lexer.c
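
        Flex applies the policy from the goals above on its own: it
        takes the longest match and, for ties, the rule listed first.
        To see what that amounts to, here is a minimal hand-written
        sketch in C of longest match plus reserved-word priority. The
        names scanner and next_token and the numeric token values are
        assumptions made for this sketch (the token names echo the
        notes); this is not the code flex generates. Peeking at the
        next character without advancing plays the role of "unget".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; the names follow the notes, the values are arbitrary. */
typedef enum {
    eofsym, ltsym, leqsym, eqsym, eqeqsym, ifsym, identsym, numbersym, badsym
} token_kind;

typedef struct {
    const char *input;  /* the whole NUL-terminated input */
    size_t pos;         /* index of the next unread character */
} scanner;

/* Return the kind of the next token and copy its text into text[0..cap-1]. */
token_kind next_token(scanner *s, char *text, size_t cap)
{
    assert(s != NULL && s->input != NULL && text != NULL && cap >= 1);
    const char *in = s->input;
    while (isspace((unsigned char)in[s->pos])) {
        s->pos++;                              /* ignore whitespace */
    }
    size_t start = s->pos;
    unsigned char c = (unsigned char)in[s->pos];
    token_kind kind;
    if (c == '\0') {
        kind = eofsym;
    } else if (c == '<' || c == '=') {
        s->pos++;
        if (in[s->pos] == '=') {               /* longest match: add the '=' */
            s->pos++;
            kind = (c == '<') ? leqsym : eqeqsym;
        } else {                               /* next char stays unread */
            kind = (c == '<') ? ltsym : eqsym;
        }
    } else if (isalpha(c)) {
        while (isalnum((unsigned char)in[s->pos])) {
            s->pos++;                          /* keep building the identifier */
        }
        kind = identsym;
    } else if (isdigit(c)) {
        while (isdigit((unsigned char)in[s->pos])) {
            s->pos++;
        }
        kind = numbersym;
    } else {
        s->pos++;                              /* consume one bad character */
        kind = badsym;
    }
    size_t len = s->pos - start;
    /* Reserved words win over identifiers, but only for the whole lexeme. */
    if (kind == identsym && len == 2 && memcmp(in + start, "if", 2) == 0) {
        kind = ifsym;
    }
    size_t ncopy = (len < cap) ? len : cap - 1;  /* truncate text if needed */
    memcpy(text, in + start, ncopy);
    text[ncopy] = '\0';
    return kind;
}
```

        With this policy "if8" comes back as one identsym (the first
        answer to the "if8" question), "<=" as leqsym, "<8" as ltsym
        then numbersym, and "if" alone as ifsym.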