COP 3402 meeting -*- Outline -*- * Theory of Regular Languages and Regular Expressions Based on Chapter 3 of the book "Formal Languages and their Relation to Automata" by Hopcroft and Ullman (Addison-Wesley, 1969). There is a strong connection between theory and practice in syntax analysis, with some of the first and most useful results in CS theory! ** Definitions ------------------------------------------ DEFINITIONS The lexical grammar of a language is regular, because def: a grammar is *regular* iff all its productions have one of these forms: ::= c or ::= c where c is a terminal symbol, and and are nonterminals def: a language is *regular* iff it can be defined using a regular grammar. Thm: Every regular language can be recognized by a finite automaton. Thm: Every regular language can be specified by a regular expression. ------------------------------------------ ... regular grammars can be parsed/recognized quickly (using finite automata) Regular languages are also called "type 3" languages (a name from the Chomsky hierarchy) A regular language is also called a *regular set*. Q: Does it follow that every regular expression can be recognized using a finite automaton? Yes, indeed! (That's a corollary.) Formal language theory studied such theorems... *** Regular Expressions Regular expressions are a convenient and commonly used way to specify regular languages and lexical analyzers ------------------------------------------ REGULAR EXPRESSIONS The language of regular expressions: ::= | '|' | | emp | * | ( ) where is a character Examples: RE meaning ==================================== emp the empty string 2* zero or more 2's (0|1)*0 even binary numerals b*(abb*)*(a|emp) a's and b's without consecutive a's ------------------------------------------ emp is a way of writing the epsilon character which is used to denote the empty string *** extensions to regular expressions ------------------------------------------ TYPICAL EXTENSIONS TO REGULAR EXPRESSIONS [abcd] means a|b|c|d [h-m] means h|i|j|k|l|m x? means x|emp y+ means y(y*) ------------------------------------------ *** examples of regular expressions ------------------------------------------ FOR YOU TO DO Write a regular expression that describes: 1. The keyword "if" 2. The set of all (positive) decimal numbers 3. The set of all possible identifiers without underbars (If you have time, write these as regular grammars also.) ------------------------------------------ Q: What would be the regular expression for identifiers? [a-zA-Z][a-zA-Z0-9]* *** Examples using regular expressions ------------------------------------------ EXAMPLE REGULAR EXPRESSION DEFINITIONS PATTERN RE ========================= DIGIT [0-9] LETTER [a-zA-Z] ... ------------------------------------------ ... LETTERORDIGIT ({LETTER}|{DECDIGIT}) Q: What would be the regular expression for identifiers? {LETTER}({LETTERORDIGIT}*) Note that the names being defined must start in column 1 and occur before the first %% in the .l file. ** finite automata (omit) *** example ------------------------------------------ EXAMPLE State diagram: < = -->[q0] --->[[q1]]---> [[q2]] ------------------------------------------ Explain the notation: qi are states initial arrow shows start state double circles (boxes here) are final states Q: What token should be returned at state q2? leqsym ("<=") Q: What should be done at state q1? look at the next character, if it's a =, go to state q2 (makes longest match), and return leqsym if it's not, return lesssym ("<"), but need to save the character read for future use... ------------------------------------------ PSEUDO CODE FOR THIS LEXER CASE char c; ------------------------------------------ ... get next char into c if (c is '<') then get next char into c if (c is '=') then return leqsym else unget the char c return lesssym ------------------------------------------ DEALING WITH COMMENTS Suppose / is used for division // starts a comment to end of line (unlike SPL!) What state diagram? How does whitespace fit in? ------------------------------------------ non-nl /--\ '/' '/' \ v -->[q0] --->[[q1]]--->[q2]----\ ^ | \------------------------/ '\n' Q: How does whitespace fit in? from q0 go back to q0 from q1, return divsym ("/") ------------------------------------------ TACTICS FOR IGNORING WHITESPACE, COMMENTS Goal: do not send ignored tokens to parser Can always get a non-ignored token: Return "tokens" that include ignored stuff to a loop that ignores them Giant DFA that goes back to start state on seeing something to ignore ------------------------------------------ ... recognizer consumes ignored stuff before each token invariant: next char is not start of whitespace or comment (this works when a single char starts a comment) ... recognizer returns "tokens" (or something similar) for ignored stuff, but a loop keeps calling recognizer until it gets a token that is not ignored (simplifies comments if start with multiple characters) ... recognizer hard to code, not modular *** definitions Remember the big picture: We can translate REs to NFAs and then to DFAs that we can execute. The following is from Hopcroft and Ullman, chapter 3, adapted ------------------------------------------ NONDETERMINISTIC FINITE AUTOMATA def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma is a system (K, Sigma, delta, q0, F) where K is a finite set (of states), Sigma is a finite set (the input alphabet), delta: is a map of type (K, Sigma) -> Sets(K) q0 in K is the initial state, F is a subset of K (the final/accepting states). ------------------------------------------ So delta(q,x) returns a set of states (The type Sets(K) is the set of all subsets of K, i.e., Powerset(K)) Since computers aren't good at guessing, we can simulate an NFA by having the machine track all possible states ------------------------------------------ TRANSITION FUNCTION AND ACCEPTANCE p in delta(q,x) means that in state q, on input x the next state can be p p in delta*(q,s) where s in Sigma* is defined by: delta*(q,emp) = {q} (i.e., q in delta*(q,emp)) p in delta*(q,xa) = delta*(q2, a) where x in Sigma, a in Sigma*, q2 in delta(q,x) Lemma: for all c in Sigma, p in delta*(q, c) = delta(q, c) def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff there is some q in delta*(q0,s) such that q in F ------------------------------------------ delta* extends delta pointwise (to strings from Sigma*) Proof of the lemma: (omit) p in delta*(q, c) iff (by c = c emp) p in delta*(q, c emp) iff (by def) p in delta(q,c) Q: If each final state means to return a token, what token should be returned if there are many final states? Have to decide based on longest match or priority *** example ------------------------------------------ EXAMPLE NFA 0,1 0,1 /---\ /---\ \ / \ / | v 0 0 | v -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]] | | 1 v [ q1 ] | | 1 v [[ q2 ]] | ^ / \ \---/ 0,1 K = {q0,q1,q2,q3,q4} Sigma = {0,1} q0 is start state F = {q2,q4} delta(q0,0) = {q0,q3} delta(q1,0) = {} delta(q2,0) = {q2} delta(q3,0) = {q4} delta(q4,0) = {q4} delta(q0,1) = {q0,q1} delta(q1,1) = {q2} delta(q2,1) = {q2} delta(q3,1) = {} delta(q4,1) = {q4} ------------------------------------------ Note: double brackets ([[ and ]]) enclose final states (in F) ------------------------------------------ EXTENDING FUNCTIONS TO SETS OF STATES Notation extending delta and delta* to sets of states d(Q,x) = union of d(q,x) for all q in Q so d({},x) = {} d({q},x) = d(q,x) d({q1,q2},x) = d(q1,x)+d(q2,x) d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x) etc. Note that delta*({q}, x) = delta*(q, x) = delta(q, x) ------------------------------------------ I'm using + to stand for "set union" in these notes ------------------------------------------ EXAMPLE delta*(q0,010011) = delta*(delta(q0,0),10011) = delta*({q0,q3},10011) = delta*(q0,10011) + delta*(q3,10011) = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011) = delta*({q0,q1},0011) + delta*({},0011) = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {} = delta*({q0,q3},011) + delta*({},011) + delta*({q2},011) = delta*({q0,q3},011) + {} + delta*({q2},011) = delta*({q0,q3},011) + delta*({q2},011) = delta*(delta(q0,0),11) + delta(q3,0),11) + delta*(delta(q2,0),11) = delta*({q0,q3},11) + delta*({q4},11) + delta*({q2},11) = delta*({q0,q3,q4},1) + delta*({q4},1) + delta*({q2},1) = delta({q0,q1,q4},1) + delta(q4,1) + delta(q2,1) = {q0,q1,q2,q4} + {q4} + {q2} = {q0,q1,q2,q4} ------------------------------------------ Do this example using the diagram first. Q: So does this machine accept 010011? Yes, as both q2 and q4 are in F Q: What strings does this machine accept? binary strings with either two consecutive 0's or two consecutive 1's *** deterministic finite automata ------------------------------------------ DETERMINISTIC FINITE AUTOMATA def: a *deterministic finite automaton* (DFA) is an NFA in which delta(q,c) is a singleton or empty for all q in K and c in Sigma. ------------------------------------------ So, in a DFA the transition function is a (partial) function ------------------------------------------ IMPLEMENTING DFAS How would you represent states? How would you implement a DFA? ------------------------------------------ ... by (unsigned) integers ... for each state, use a switch statement (with a case on each char) to determine the next state or use a 2D array with indexes of state and a char, each entry of which is a state *** converting NFA to DFA ------------------------------------------ PROBLEM We want to specify lexical grammar using So we need to convert regular expressions into DFAs ------------------------------------------ ... regular expressions Q: Why DFAs? Because they are easy to implement But from an RE, easier to transform into an NFA, so we need to do the following (in general): **** Converting Regular Expressions to NFAs ------------------------------------------ CONVERTING REs TO NFAs Definition based on grammar of Regular Expressions: Result of Convert(M) looks like this: -->(M q) where the "tail", -->, goes to the start state and q is the "head state" and M is converted recursively assume also Convert(N) is --->(N q') c Convert(c) = --->[ q ] Convert(M | N) = /--->(M q)--\ emp / \ ---->[ q ] -> [ q2 ] \ / \--->(N q')-/ Convert(M N) = --> (M q) --> (N q') emp Convert(emp) = -----> [ q ] emp /--------------\ / emp v Convert(M*) = -/ /->(M q)---->[ q2 ] / / \ emp / \-----------/ Convert((M)) = Convert(M) After conversion, make the "head state" be a final state ------------------------------------------ ------------------------------------------ EXAMPLE OF CONVERSION TO NFA Regular expression: (i|j)* i Convert(i) = --->[ qi ] j Convert(j) = --->[ qj ] Convert(i|j) = i /--->(qi)---\ emp emp / \ ---->[ q ] -> [ q2 ] \ j / \--->(qj)---/ emp Convert((i|j)*) = ------------------------------------------ ... draw the NFA for Convert((i|j)*) and mark the resulting head state as final emp /------------------\ / \ / emp v Convert((i|j)*)=-/ /-->((i|j) q2)------>[[q2]] / / | / \ emp / \-----------------/ substituting for ((i|j) q2) in the above: emp /------------------\ / i \ / /-->[qi]-\emp v Convert((i|j)*)=-/ /->[q] ----->[[q2]] / \--->[qj]-/emp / | j / \ emp / \-----------------/ Q: In programming terms, do we need to use an NFA to recognize "if" and other reserved words? We will use REs which are the same as NFAs, so we could, But if programming by hand, then just focus on identifiers, and check for reserved words **** Converting NFAs to DFAs You don't have to mplement (or go through) this algorithm if you can see how to write the DFA to begin with, but it's useful to have an algorithm in order to embed it in a program. ------------------------------------------ CONVERTING AN NFA INTO A DFA Idea: Convert each reachable set of NFA states into How? Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in S} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) ------------------------------------------ ... a single state of the DFA (and the conversion is done before using the DFA, it's pre-computed!) Q: Why do we need to take the closure wrt emp? Because the NFA can always follow emp transitions without input In these formula delta is the NFA's transition function, and "union" takes a set of sets and unions them all together ------------------------------------------ EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [[3,6,7,8]] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[[5,6,8]]0-9 v /-| \----->[[6,7,8]]|a-z ^ |0-9 \ | \-| ------------------------------------------ Q: Which states are final in the DFA? (all of them that include either 3 or 8, so all but [1,4]) Q: Which tokens would be returned in each state? ("if" from 3,6,7,8, ident from the other final states)