COP 3402 meeting -*- Outline -*- * Lexical Analysis Based on material from Chapter 2 of the book "Modern Compiler Implementation in Java" by Andrew W. Appel with Jens Palsberg (Cambridge, 1998) and the theory also uses Chapter 3 of the book "Formal Languages and their Relation to Automata" by Hopcroft and Ullman (Addison-Wesley, 1969). ------------------------------------------ MOTIVATION # $Id$\n .text start\nstart:\tADDI ... Want to: Approach: ------------------------------------------ ... - write parser at a high level not individual characters - ignore comments, whitespace, ... - have the parser be efficient - deal with bad characters sensibly ... break input stream of characters into tokens: textsym .text identsym start identsym start colonsym : addisym ADDI ... The parser only sees the non-ignored parts not the comments, whitespace, ... ------------------------------------------ LEXICAL ANALYSIS Lexical means relating to the words of a language ------------------------------------------ This is the basis of the word "lexicon" ** Goals of Lexical Analysis ------------------------------------------ GOALS OF LEXICAL ANALYSIS - Simplify the parser, so it need not handle: - Recognize the longest match Why? - Handle every possible character of input Why? ------------------------------------------ ... - white space and comments - details of tokens (numbers, identifiers, ...) Q: Why recognize the longest match? so that "ident" is a single token instead of "i", "d", etc. Q: Why handle all possible characters? So that it completely checks the program input and cannot crash on any input ------------------------------------------ CONFLICT BETWEEN RULES Suppose that both "if" and numbers are tokens: What tokens should "if8" match? Fixing such situations: ------------------------------------------ ... longest match would favor an identifier "if8" but programmer might want to recognize "if" and "8" ... Several possible answers: - tell programmers that whitespace or punctuation needed to end identifiers, so "if8" is an identifier, but "if 8" is two tokens, "if" and "8" (equivalent to always favoring the longest match) - give some rules priority (e.g., reserved words more important than identifiers, so "if8" is "if" and "8") Q: What tokens should "<=" match? leqsym (for less-than-or-equal-to), with longest match ------------------------------------------ WHICH TOKEN TO RETURN? If the input is "<=", what token(s)? If the input is "<8", what token(s)? If the input is "if", what token(s)? If the input is "//", what token(s)? Summary: ------------------------------------------ Q: How would you program the longest match idea? Keep building a token until the next char can't be added to it. e.g., see < then add = to the token if = follows, otherwise return the < token (and unget the next character) Q: How would you ensure that reserved words are favored over identifiers? (e.g., "if" is not an identifier) After finding an identifier, check to see if it's a reserved word and if so, then return the reserved word token ... In sum, favor is longest match, but give priority to reserved words over identifiers *** Overview ------------------------------------------ THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / abstract / syntax v trees [ static analysis ] / / v [ code generator ] For the Lexer we want to: - specify the tokens using regular expressions (REs) - convert REs to DFAs to execute them but easy conversions are: - REs to NFAs - NFAs to DFAs ------------------------------------------ Explain what an the abbreviations mean: - NFA = nondeterministic finite state automaton - DFA = deterministic finite state automaton ------------------------------------------ HOW PARSER WORKS WITH LEXER Couroutine structure: Parser calls lexer: tok = yylex(); // call lexer /* ... use yylavl ... */ Lexer function remembers pointer to input stream returns next token (int code) Parser works... Parser calls lexer again: tok = yylex(); // call lexer /* ... use yylavl ... */ ------------------------------------------ It's a coroutine, because the lexer picks up where it left off when called again **** Use of Automated Tools ------------------------------------------ BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) | | bison | -----> g.tab.c | / ^ yyparse function v / bison | g.y file -----> g.tab.h | tokens flex v defs. g.l file -----> g.c yytoken function ------------------------------------------ ... explain all of this: The context-free grammar (the .y file) is central to this, The .y file records: - grammar and - names/types of the tokens (that the parser needs) Bison is a parser generator it generated the unction yyparse in files *.tab.h and *.tab.c The ASTs are how the parse is recorded Flex is a lexical analyzer generator the .l file records - the lexical grammar (as REs) - how tokens produce ASTs (in yylval) ** Theory of Regular Languages, Regular Expressions There is a strong connection between theory and practice in syntax analysis, with some of the first and most useful results in CS theory! *** Definitions ------------------------------------------ DEFINITIONS The lexical grammar of a language is regular, because def: a grammar is *regular* iff all its productions have one of these forms: ::= c or ::= c where c is a terminal symbol, and and are nonterminals def: a language is *regular* iff it can be defined using a regular grammar. Thm: Every regular language can be recognized by a finite automaton. Thm: Every regular language can be specified by a regular expression. ------------------------------------------ ... regular grammars can be parsed/recognized quickly using finite automata Regular languages are also called "type 3" languages (a name from the Chomsky hierarchy) A regular language is also called a *regular set*. Q: Does it follow that every regular expression can be recognized using a finite automaton? Yes, indeed! Formal language theory studied such theorems... *** Regular Expressions Regular expressions are a convenient and commonly used way to specify regular languages and lexical analyzers ------------------------------------------ REGULAR EXPRESSIONS The language of regular expressions: ::= | '|' | | emp | * | ( ) where is a character Examples: RE meaning ==================================== emp the empty string (0|1)*0 even binary numerals b*(abb*)*(a|emp) a's and b's without consecutive a's ------------------------------------------ emp is a way of writing the epsilon character which is used to denote the empty string **** extensions to regular expressions ------------------------------------------ EXTENSIONS TO REGULAR EXPRESSIONS [abcd] means a|b|c|d [h-m] means h|i|j|k|l|m x? means x|emp y+ means y(y*) ------------------------------------------ **** examples of regular expressions ------------------------------------------ FOR YOU TO DO Write a regular expression that describes: 1. The keyword "if" 2. The set of all (positive) decimal numbers 3. The set of all possible identifiers with underbars (_) (If you have time, write these as regular grammars also.) ------------------------------------------ Q: What would be the regular expression for identifiers? [_a-zA-Z][_a-zA-Z0-9]* **** Examples using regular expressions ------------------------------------------ EXAMPLE REGULAR EXPRESSIONS FOR PL/0 PATTERN RE ========================= DECDIGIT [0-9] LETTER [_a-zA-Z] ... ------------------------------------------ Q: What would be the regular expression for identifiers? {LETTER}({LETTERORDIGIT}*) *** finite automata **** example ------------------------------------------ EXAMPLE State diagram: < = -->[q0] --->[[q1]]---> [[q2]] | | > v [[q3]] ------------------------------------------ Explain the notation: qi are states initial arrow shows start state double circles (boxes here) are final states Q: What token should be returned at state q2? leqsym ("<=") Q: What token should be returned at state q3? neqsym ("<>") Q: What should be done at state q1? look at the next character, if it's a < or =, go to that state if it's neither, return lesssym ("<"), but need to save the character for future use... ------------------------------------------ PSEUDO CODE FOR THIS LEXER CASE char c; ------------------------------------------ ... get next char into c if (c is '<') then get next char into c if (c is '=') then return leqsym else if (c is '>') then return neqsym else unget the char c return lesssym ------------------------------------------ DEALING WITH COMMENTS Suppose / is used for division // starts a comment to end of line (unlike PL/0!) What state diagram? How does whitespace fit in? ------------------------------------------ non-nl /--\ '/' '/' \ v -->[q0] --->[[q1]]--->[q2]----\ ^ | \------------------------/ '\n' Q: How does whitespace fit in? from q0 go back to q0 from q1, return divsym ("/") ------------------------------------------ TACTICS FOR IGNORING WHITESPACE, COMMENTS Goal: do not send ignored tokens to parser Can always get a non-ignored token: Return "tokens" that include ignored stuff to a loop that ignores them Giant DFA that goes back to start state on seeing something to ignore ------------------------------------------ ...- recognizer consumes ignored stuff before each token invariant: next char is not start of whitespace or comment (this works when a single char starts a comment) ...- recognizer returns "tokens" (or something similar) for ignored stuff, but a loop keeps calling recognizer until it gets a token that is not ignored (simplifies comments if start with multiple characters) ... - recognizer hard to code, not modular **** definitions Remember the big picture: We can translate REs to NFAs and then to DFAs that we can execute. The following is from Hopcroft and Ullman, chapter 3, adapted ------------------------------------------ NONDETERMINISTIC FINITE AUTOMATA def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma is a system (K, Sigma, delta, q0, F) where K is a finite set (of states), Sigma is a finite set set (the input alphabet), delta: is a map of type (K, Sigma) -> Sets(K) q0 in K is the initial state, & F is a subset of K (the final/accepting states). ------------------------------------------ So delta(q,x) returns a set of states (The type Sets(K) is the set of all subsets of K, i.e., Powerset(K)) Since computers aren't good at guessing, we can simulate an NFA by having the machine track all possible states ------------------------------------------ TRANSITION FUNCTION AND ACCEPTANCE p in delta(q,x) means that in state q, on input x the next state can be p p in delta*(q,s) where s in Sigma* is defined by: delta*(q,emp) = {q} (i.e., q in delta*(q,emp)) p in delta*(q,xa) = delta*(q2, a) where x in Sigma, a in Sigma*, q2 in delta(q,x) Lemma: for all c in Sigma, p in delta*(q, c) = delta(q, c) def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff there is some q in delta*(q0,s) such that q in F ------------------------------------------ delta* extends delta pointwise (to strings from Sigma*) Proof of the lemma: (omit) p in delta*(q, c) iff (by c = c emp) p in delta*(q, c emp) iff (by def) p in delta(q,c) Q: If each final state means to return a token, what token should be returned if there are many final states? Have to decide based on longest match or priority **** example ------------------------------------------ EXAMPLE NFA 0,1 0,1 /---\ /---\ \ / \ / | v 0 0 | v -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]] | | 1 v [ q1 ] | | 1 v [[ q2 ]] | ^ / \ \---/ 0,1 K = {q0,q1,q2,q3,q4} Sigma = {0,1} q0 is start state F = {q2,q4} delta(q0,0) = {q0,q3} delta(q1,0) = {} delta(q2,0) = {q2} delta(q3,0) = {q4} delta(q4,0) = {q4} delta(q0,1) = {q0,q1} delta(q1,1) = {q2} delta(q2,1) = {q2} delta(q3,1) = {} delta(q4,1) = {q4} ------------------------------------------ Note: double brackets ([[ and ]]) enclose final states (in F) ------------------------------------------ EXTENDING FUNCTIONS TO SETS OF STATES Notation extending delta and delta* to sets of states d(Q,x) = union of d(q,x) for all q in Q so d({},x) = {} d{{q},x) = d(q,x) d({q1,q2},x) = d(q1,x)+d(q2,x) d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x) etc. Note that delta*({q}, x) = delta*(q, x) = delta(q, x) ------------------------------------------ I'm using + to stand for "set union" in these notes ------------------------------------------ EXAMPLE delta*(q0,010011) = delta*(delta(q0,0),10011) = delta*({q0,q3},10011) = delta*(q0,10011) + delta*(q3,10011) = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011) = delta*({q0,q1},0011) + delta*({},0011) = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {} = delta*({q0,q3},011) + delta*({},011) + delta*({q2},011) = delta*({q0,q3},011) + {} + delta*({q2},011) = delta*({q0,q3},011) + delta*({q2},011) = delta*(delta(q0,0),11) + delta(q3,0),11) + delta*(delta(q2,0),11) = delta*({q0,q3},11) + delta*({q4},11) + delta*({q2},11) = delta*({q0,q3,q4},1) + delta*({q4},1) + delta*({q2},1) = delta({q0,q1,q4},1) + delta(q4,1) + delta(q2,1) = {q0,q1,q2,q4} + {q4} + {q2} = {q0,q1,q2,q4} ------------------------------------------ Do this example using the diagram first. Q: So does this machine accept 010011? Yes, as both q2 and q4 are in F Q: What strings does this machine accept? binary strings with either two consecutive 0's or two consecutive 1's **** deterministic finite automata ------------------------------------------ DETERMINISTIC FINITE AUTOMATA def: a *deterministic finite automaton* (DFA) is an NFA in which delta(q,c) is a singleton or empty for all q in K and c in Sigma. ------------------------------------------ So, in a DFA the transition function is a (partial) function ------------------------------------------ IMPLEMENTING DFAS How would you represent states? How would you implement a DFA? ------------------------------------------ ... by (unsigned) integers ... for each state, use a switch statement (with a case on each char) to determine the next state or use a 2D array with indexes of state and a char, each entry of which is a state *** converting NFA to DFA ------------------------------------------ PROBLEM We want to specify lexical grammar using So we need to convert regular expressions into DFAs ------------------------------------------ ... regular expressions Q: Why DFAs? Because they are easy to implement But from an RE, easiter to transform into an NFA, so we need to do the following (in genteral): **** Converting Regular Expressions to NFAs ------------------------------------------ CONVERTING REs TO NFAs Definition based on grammar of Regular Expressions: Result of Convert(M) looks like this: -->(M q) where the "tail", -->, goes to the start state and q is the "head state" assume also Convert(N) is --->(N q') c Convert(c) = --->[ q ] Convert(M | N) = /--->(M q)--\ emp / \ ---->[ q ] -> [ q2 ] \ / \--->(N q')-/ Convert(M N) = --> (M q)-->(N q') emp Convert(emp) = -----> [ q ] emp /--------------\ / emp v Convert(M*) = -/ /->(M q)---->[ q2 ] / / \ emp / \-----------/ Convert((M)) = Convert(M) After conversion, make the "head state" be a final state ------------------------------------------ ------------------------------------------ EXAMPLE OF CONVERSION TO NFA Regular expression: (i|j)* i Convert(i) = --->[ qi ] j Convert(j) = --->[ qj ] Convert(i|j) = i /--->(qi)---\ emp emp / \ ---->[ q ] -> [ q2 ] \ j / \--->(qj)---/ emp Convert((i|j)*) = ------------------------------------------ ... draw the NFA for Convert((i|j)*) and mark the resulting head state as final emp /-------------\ / i emp v Convert((i|j)*)=-/ /-q/----/->[ q2] / j\-qj/ / \ emp / \-----------/ Q: In programming terms, do we need to use an NFA to recognize "if" and other reserved words? We will use REs which are the same as NFAs, so we could, But if programming by hand, then just focus on identifiers, and check for reserved words **** Converting NFAs to DFAs You don't have to go through this process if you can see how to write the DFA to begin with, but it's useful to have a process in order to embed it in a program. ------------------------------------------ CONVERTING AN NFA INTO A DFA Idea: Convert each reachable set of NFA states into How? Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in T} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) ------------------------------------------ ... a single state of the DFA (and the conversion is done before using the DFA, it's pre-computed!) Q: Why do we need to take the closure wrt emp? Because the NFA can always follow emp transitions without input In these formula delta is the NFA's transition function, and "union" takes a set of sets and unions them all together ------------------------------------------ EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [3,6,7,8] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[5,6,8] 0-9 v /-| \----->[6,7,8] |a-z ^ |0-9 \ | \-| ------------------------------------------ Q: Which states are final in the DFA? (all of them that include either 3 or 8, so all but [1,4]) Q: Which tokens would be returned in each state? ("if" from 3,6,7,8, ident from the other final states) ** Using Flex *** Files and coordination with parser ------------------------------------------ USING THE FLEX TOOL TO GENERATE LEXERS Example: SRM assembler High-level description in asm_lexer.l Generated lexer asm_lexer.c + asm_lexer.h Wrapper for lexer: lexer.h declares functions lexer.c does nothing asm_lexer.l defines functions e.g., lexer_print_token ASTs defined in ast.h asm.y is Bison description file grammar == bison ==> - Declarations in asm.tab.h includes ast.h machine_types.h parser_types.h declares YYSTYPE lexer.h declares yytokentype eolsym = ... minussym = ... dottextsym = ... ... - Definitions in asm.tab.c defines yyparser() YYSTYPE yylval; ------------------------------------------ There are options in flex for naming the .h and .c files generated. So for homework 2, we'll give you the pl0.tab.h file, since we'll need the token types (yytokentype) and a minimal pl0.tab.c that defines yylval. *** structure of flex input ------------------------------------------ STRUCTURE OF FLEX INPUT FILE /* ... definitions section ... */ %% /* ... rules section ... */ %% /* ... user subroutines ... */ ------------------------------------------ ------------------------------------------ SECTIONS IN FLEX INPUT (.l file) Definitions section: Rules section: User subroutine section: ------------------------------------------ ... options, #includes, declarations of names used in rules definitions of named REs, declarations of start conditions (states) ... pairs of REs and actions (code) ... definitions of functions used Go over code in lexer_main.c show output lexer.h, lexer.c asm_lexer.l and generated code asm_lexer.h asm_lexer.c