LANGUAGES

def: A *language* is


LANGUAGE CLASSES

Languages can be classified by Venn diagram:

  |----------------------------------------------|
  | Type 0 Languages                             |
  |                                              |
  |   |--------------------------------------|   |
  |   | Context-sensitive Languages          |   |
  |   |                                      |   |
  |   |   |------------------------------|   |   |
  |   |   | Context-free Languages       |   |   |
  |   |   |                              |   |   |
  |   |   |   |----------------------|   |   |   |
  |   |   |   | Regular Languages    |   |   |   |
  |   |   |   |----------------------|   |   |   |
  |   |   |                              |   |   |
  |   |   |------------------------------|   |   |
  |   |                                      |   |
  |   |--------------------------------------|   |
  |                                              |
  |----------------------------------------------|


PHASES OF A COMPILER

Programs allowed by a compiler's:

  |----------------------------------------------|
  | Lexical Analysis (Lexer)                     |
  |                                              |
  |   |--------------------------------------|   |
  |   | Parser                               |   |
  |   |                                      |   |
  |   |   |------------------------------|   |   |
  |   |   | Static Analysis              |   |   |
  |   |   |                              |   |   |
  |   |   |   |----------------------|   |   |   |
  |   |   |   | Runtime checks       |   |   |   |
  |   |   |   |----------------------|   |   |   |
  |   |   |                              |   |   |
  |   |   |------------------------------|   |   |
  |   |                                      |   |
  |   |--------------------------------------|   |
  |                                              |
  |----------------------------------------------|


PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

   Language Def.docx  -- coding --> parser.c (v. 1)
            Def2.docx -- coding --> parser.c (v. 2)
                       -- coding --> parser.c (v. 3)
            Def3.docx -- coding --> parser.c (v. 4)
            Def4.docx -- coding --> parser.c (v. 5)
             ...                    ...
            DefN.docx -- coding --> parser.c (v. N)

Disadvantages:


COMPUTER SCIENCE SOLUTION: AUTOMATION

   high-level description    tool        generated code

   lang.y                 -- bison -->   lang.tab.c + lang.tab.h
   lang2.y                -- bison -->   lang2.tab.c + lang2.tab.h
    ...
   langN.y                -- bison -->   langN.tab.c + langN.tab.h

Advantages:


GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions of languages/parsers

def: a *grammar* consists of a finite set of rules
     (called "productions") and a start symbol (a nonterminal).
     Let V = nonterminals + terminals.
     The rules have the form

         V+ -> V*

     where no symbol is in both nonterminals and terminals.

def: The language generated by a grammar G with set of productions P is

         {w | w is in terminals* and S =>* w}

     where S is the start symbol of G,

         gAd => gBd  iff  g is in V*, d is in V*,
                          and A -> B is a rule in P,

     and

         g =>* h  iff  either h = g, or g => i and i =>* h


BNF NOTATION FOR GRAMMARS

   ::=      means
   |        means
   <name>   is a

Example:

   <binary-numeral> ::= <bit> | <bit> <binary-numeral>
   <bit> ::= 0 | 1


GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing two games:

 - A production game (Can you produce this string?)

 - A recognition/parsing game (Is this string in the language?)


PRODUCTION GAME

Goal: produce a string in the language from the start symbol

Example Grammar:

   <sentence> -> <noun> <verb> <adjective>
   <noun> -> Johnny | Sue | Charlie
   <verb> -> is | can be
   <adjective> -> good | difficult

Can we produce "Johnny is good"?


RECOGNITION OR PARSING GAME

Goal: determine if a string is in the language of the grammar

Example Grammar:

   <sentence> -> <noun> <verb> <adjective>
   <noun> -> Johnny | Sue | Charlie
   <verb> -> is | can be
   <adjective> -> good | difficult

Is "Johnny is good" in this grammar?
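To make the recognition game concrete, here is a minimal sketch in C
(a sketch only; the helper names one_of and is_sentence are invented for
this illustration, not part of any course code): it checks whether a
sequence of words can be produced from <sentence> in the toy grammar above.

   #include <stdio.h>
   #include <string.h>

   /* is the word w one of the n given alternatives? */
   static int one_of(const char *w, const char *alts[], int n) {
       for (int i = 0; i < n; i++)
           if (strcmp(w, alts[i]) == 0)
               return 1;
       return 0;
   }

   /* recognize <sentence> -> <noun> <verb> <adjective> over an array of
      words; returns 1 iff the words are in the language */
   static int is_sentence(const char *words[], int nwords) {
       const char *nouns[] = {"Johnny", "Sue", "Charlie"};
       const char *adjectives[] = {"good", "difficult"};
       int i = 0;
       if (nwords < 3 || !one_of(words[i++], nouns, 3))
           return 0;
       /* <verb> -> is | can be   ("can be" is two words) */
       if (strcmp(words[i], "is") == 0) {
           i++;
       } else if (i + 1 < nwords && strcmp(words[i], "can") == 0
                  && strcmp(words[i+1], "be") == 0) {
           i += 2;
       } else {
           return 0;
       }
       return i < nwords && one_of(words[i], adjectives, 2) && i + 1 == nwords;
   }

   int main(void) {
       const char *w[] = {"Johnny", "is", "good"};
       printf("%d\n", is_sentence(w, 3));   /* prints 1 */
       return 0;
   }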
DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G is a tree such that:

   - Every node has a label that is a symbol of G
   - The root is labeled by the start symbol of G
   - Every node with a direct descendant is labeled by a nonterminal
   - If the descendants of a node labeled by N have the following labels
     (in order): A, B, C, ..., K
     then G has a production of the form N -> A B C ... K


EXAMPLE DERIVATION TREE

Example Grammar:

   <sentence> -> <noun> <verb> <adjective>
   <noun> -> Johnny | Sue | Charlie
   <verb> -> is | can be
   <adjective> -> good | difficult

String to parse: "Johnny is good"

              <sentence>
              /    |    \
             /     |     \
            v      v      v
        <noun>  <verb>  <adjective>
           |       |        |
           v       v        v
        Johnny    is       good


EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:

   { x }   means 0 or more repeats of x

   <xs> ::= { <x> }

      is equivalent to:

   <xs> ::= <empty> | <x> <xs>
   <empty> ::=

   { <x> } is also written as:  <x>*  or  [ <x> ] ...

One-or-more repeats:

   x+      means 1 or more repeats of x

   <xs> ::= <x> +

      is equivalent to:

   <xs> ::= <x> | <x> <xs>

   (<x>+ is sometimes written as <x> ...)

Optional element:

   [ x ]   means 0 or 1 occurrences of x

   <opt-x> ::= [ <x> ]

      is equivalent to:

   <opt-x> ::= <empty> | <x>


READING A BNF GRAMMAR

Example rules:

   <number> ::= <digit-seq>
   <nonzero-digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
   <digit> ::= 0 | <nonzero-digit>
   <digit-seq> ::= <digit> | <digit> <digit-seq>


EBNF GRAMMAR FOR (SUBSET OF) PL/0

   <program> ::= <block> .
   <block> ::= <const-decls> <var-decls> <proc-decls> <stmt>
   <const-decls> ::= { <const-decl> }
   <const-decl> ::= const <const-def> { <comma-const-def> } ;
   <const-def> ::= <ident> = <number>
   <comma-const-def> ::= , <const-def>
   <var-decls> ::= { <var-decl> }
   <var-decl> ::= var <idents> ;
   <idents> ::= <ident> { <comma-ident> }
   <comma-ident> ::= , <ident>
   <proc-decls> ::= { <proc-decl> }
   <proc-decl> ::= procedure <ident> ; <block> ;
   <stmt> ::= <ident> := <expr>
            | call <ident>
            | begin <stmt> { <semi-stmt> } end
            | if <condition> then <stmt> <else-part>
            | while <condition> do <stmt>
            | read <ident>
            | write <expr>
            | skip
   <semi-stmt> ::= ; <stmt>
   <else-part> ::= <empty> | else <stmt>
   <empty> ::=
   <condition> ::= odd <expr> | <expr> <rel-op> <expr>


EXAMPLES IN PL/0

Shortest program:

   skip.

Factorial program:

   var n, res;   # input and result
   procedure fact;
     begin
       read n;
       res := 1;
       while n <> 0 do
         begin
           res := res * n;
           n := n-1
         end;
       write res
     end;
   call fact.


MOTIVATION

   # $Id$\n    .text start\nstart:\tADDI ...

Want to:

Approach:


LEXICAL ANALYSIS

Lexical means relating to the words of a language


GOALS OF LEXICAL ANALYSIS

 - Simplify the parser, so it need not handle:

 - Recognize the longest match

     Why?

 - Handle every possible character of input

     Why?


CONFLICT BETWEEN RULES

Suppose that both "if" and numbers are tokens:

   What tokens should "if8" match?

Fixing such situations:


WHICH TOKEN TO RETURN?

   If the input is "<=", what token(s)?

   If the input is "<8", what token(s)?

   If the input is "if", what token(s)?

   If the input is "//", what token(s)?

Summary:


THE BIG PICTURE

                      tokens
   source --> [ Lexer ] ------> [ Parser ]
    code                            /
                                   /  abstract
                                  /   syntax
                                 v    trees
                       [ static analysis ]
                               /
                              /
                             v
                     [ code generator ]

For the Lexer we want to:

 - specify the tokens using regular expressions (REs)
 - convert REs to DFAs to execute them

   but easy conversions are:
   - REs to NFAs
   - NFAs to DFAs


HOW PARSER WORKS WITH LEXER

Coroutine structure:

   Parser calls lexer:

      tok = yylex();   // call lexer
      /* ... use yylval ... */

   Lexer function
      remembers a pointer to the input stream,
      returns the next token (an int code)

   Parser works...

   Parser calls lexer again:

      tok = yylex();   // call lexer
      /* ... use yylval ... */


BISON AND FLEX, GENERATING A PARSER

idea:

   ast.h (AST types)
      |
      |              bison
      |         /-----------> g.tab.c     yyparse function
      v        /
   g.y file --+
               \     bison
                \-----------> g.tab.h     tokens defs.
                                 ^
                                 |
                      flex       |
   g.l file ----------------> g.c         yylex function


DEFINITIONS

The lexical grammar of a language is regular, because

def: a grammar is *regular* iff all its productions
     have one of these forms:

        <A> ::= c      or      <A> ::= c <B>

     where c is a terminal symbol,
     and <A> and <B> are nonterminals

def: a language is *regular* iff it can be defined using a regular grammar.

Thm: Every regular language can be recognized by a finite automaton.

Thm: Every regular language can be specified by a regular expression.
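To make the first theorem concrete, here is a minimal hand-coded finite
automaton in C (the function name is_identifier is invented for this
sketch); it recognizes the regular language of identifiers with underbars,
described by the regular expression [_a-zA-Z][_a-zA-Z0-9]*.

   #include <ctype.h>
   #include <stdio.h>

   static int is_identifier(const char *s) {
       /* IN_ID is the only accepting state */
       enum { START, IN_ID, DEAD } state = START;
       for (; *s != '\0'; s++) {
           switch (state) {
           case START:   /* first character must be a letter or underbar */
               state = (isalpha((unsigned char)*s) || *s == '_') ? IN_ID : DEAD;
               break;
           case IN_ID:   /* later characters may also be digits */
               state = (isalnum((unsigned char)*s) || *s == '_') ? IN_ID : DEAD;
               break;
           case DEAD:    /* no transitions leave the dead state */
               return 0;
           }
       }
       return state == IN_ID;
   }

   int main(void) {
       printf("%d %d\n", is_identifier("_if8"), is_identifier("8if")); /* 1 0 */
       return 0;
   }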
REGULAR EXPRESSIONS

The language of regular expressions:

   <regexp> ::= <char>
              | <regexp> '|' <regexp>
              | <regexp> <regexp>
              | emp
              | <regexp> *
              | ( <regexp> )

   where <char> is a character

Examples:

   RE                    meaning
   ====================================================
   emp                   the empty string
   (0|1)*0               even binary numerals
   b*(abb*)*(a|emp)      a's and b's without consecutive a's


EXTENSIONS TO REGULAR EXPRESSIONS

   [abcd]   means a|b|c|d
   [h-m]    means h|i|j|k|l|m
   x?       means x|emp
   y+       means y(y*)


FOR YOU TO DO

Write a regular expression that describes:

 1. The keyword "if"

 2. The set of all (positive) decimal numbers

 3. The set of all possible identifiers with underbars (_)

(If you have time, write these as regular grammars also.)


EXAMPLE REGULAR EXPRESSIONS FOR PL/0

   PATTERN     RE
   =========================
   DECDIGIT    [0-9]
   LETTER      [_a-zA-Z]
   ...


EXAMPLE

State diagram:

              <             =
   -->[q0] ------>[[q1]]------> [[q2]]
        |
        | >
        v
     [[q3]]


PSEUDO CODE FOR THIS LEXER

   char c;
   /* case analysis on the character c ... */


DEALING WITH COMMENTS

Suppose
   /  is used for division
   // starts a comment to end of line (unlike PL/0!)

What state diagram?

How does whitespace fit in?


TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to the parser

Can always get a non-ignored token:

 - Return "tokens" that include ignored stuff
   to a loop that ignores them

 - Giant DFA that goes back to the start state
   on seeing something to ignore


NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma
     is a system (K, Sigma, delta, q0, F) where

        K is a finite set (of states),
        Sigma is a finite set (the input alphabet),
        delta is a map of type (K, Sigma) -> Sets(K),
        q0 in K is the initial state, &
        F is a subset of K (the final/accepting states).


TRANSITION FUNCTION AND ACCEPTANCE

p in delta(q,x) means that in state q, on input x,
                the next state can be p

p in delta*(q,s), where s in Sigma*, is defined by:

   delta*(q,emp) = {q}      (i.e., q in delta*(q,emp))

   p in delta*(q,xa)  iff  p in delta*(q2,a) for some q2 in delta(q,x),
                           where x in Sigma and a in Sigma*

Lemma: for all c in Sigma, delta*(q,c) = delta(q,c)

def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff
     there is some q in delta*(q0,s) such that q in F


EXAMPLE NFA

        0,1                          0,1
       /---\                        /---\
       \   /                        \   /
        | v       0             0    | v
   -->[ q0 ] --------> [ q3 ] ----> [[ q4 ]]
         |
         | 1
         v
      [ q1 ]
         |
         | 1
         v
     [[ q2 ]]
        | ^
        \ /
        \-/  0,1

   K = {q0,q1,q2,q3,q4}
   Sigma = {0,1}
   q0 is the start state
   F = {q2,q4}

   delta(q0,0) = {q0,q3}      delta(q0,1) = {q0,q1}
   delta(q1,0) = {}           delta(q1,1) = {q2}
   delta(q2,0) = {q2}         delta(q2,1) = {q2}
   delta(q3,0) = {q4}         delta(q3,1) = {}
   delta(q4,0) = {q4}         delta(q4,1) = {q4}


EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to sets of states:

   d(Q,x) = union of d(q,x) for all q in Q

so

   d({},x) = {}
   d({q},x) = d(q,x)
   d({q1,q2},x) = d(q1,x) + d(q2,x)
   d({q1,q2,q3},x) = d(q1,x) + d(q2,x) + d(q3,x)
   etc.
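Here is a minimal sketch in C of this set-of-states notation, using the
example NFA above and representing a set of states as a bit mask (the
names delta_set and delta_star are invented for this sketch); the worked
example that follows traces the same input string, 010011, by hand.

   #include <stdio.h>

   /* one bit per state: bit i set means qi is in the set */
   enum { Q0 = 1<<0, Q1 = 1<<1, Q2 = 1<<2, Q3 = 1<<3, Q4 = 1<<4 };

   /* delta(q, x) for a single state q (given as its bit) and input x */
   static unsigned delta(unsigned q, char x) {
       switch (q) {
       case Q0: return x == '0' ? (Q0|Q3) : (Q0|Q1);
       case Q1: return x == '1' ? Q2 : 0;
       case Q2: return Q2;                    /* on both 0 and 1 */
       case Q3: return x == '0' ? Q4 : 0;
       case Q4: return Q4;                    /* on both 0 and 1 */
       default: return 0;
       }
   }

   /* d(Q, x): union of delta(q, x) over all q in the set Q */
   static unsigned delta_set(unsigned Q, char x) {
       unsigned result = 0;
       for (unsigned bit = 1; bit <= Q4; bit <<= 1)
           if (Q & bit)
               result |= delta(bit, x);
       return result;
   }

   /* delta*(Q, s): run the NFA on the whole string s */
   static unsigned delta_star(unsigned Q, const char *s) {
       for (; *s != '\0'; s++)
           Q = delta_set(Q, *s);
       return Q;
   }

   int main(void) {
       unsigned final = Q2 | Q4;              /* F = {q2, q4} */
       unsigned reached = delta_star(Q0, "010011");
       printf("accepted? %s\n", (reached & final) ? "yes" : "no");
       return 0;
   }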
Note that delta*({q}, x) = delta*(q, x),
and, by the Lemma, for a single character x, delta*(q, x) = delta(q, x)


EXAMPLE

   delta*(q0,010011)
   = delta*(delta(q0,0),10011)
   = delta*({q0,q3},10011)
   = delta*(q0,10011) + delta*(q3,10011)
   = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011)
   = delta*({q0,q1},0011) + delta*({},0011)
   = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {}
   = delta*({q0,q3},011) + delta*({},011) + {}
   = delta*({q0,q3},011)
   = delta*(delta(q0,0),11) + delta*(delta(q3,0),11)
   = delta*({q0,q3},11) + delta*({q4},11)
   = delta*(delta(q0,1),1) + delta*(delta(q3,1),1) + delta*(delta(q4,1),1)
   = delta*({q0,q1},1) + delta*({},1) + delta*({q4},1)
   = delta(q0,1) + delta(q1,1) + {} + delta(q4,1)
   = {q0,q1} + {q2} + {} + {q4}
   = {q0,q1,q2,q4}


DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton* (DFA) is an NFA in which
     delta(q,c) is a singleton or empty for all q in K and c in Sigma.


IMPLEMENTING DFAS

How would you represent states?

How would you implement a DFA?


PROBLEM

We want to specify the lexical grammar using

So we need to convert regular expressions into DFAs


CONVERTING REs TO NFAs

Definition based on the grammar of Regular Expressions:

Result of Convert(M) looks like this:

   -->(M q)

where the "tail", -->, goes to the start state
and q is the "head state";
assume also that Convert(N) is  --->(N q')

                        c
   Convert(c)    =  --->[ q ]

                        emp              emp
                       /--->(M q)----\
   Convert(M | N) = -->[ q ]          -->[ q2 ]
                       \--->(N q')---/
                        emp              emp

   Convert(M N)  =  -->(M q)-->(N q')
                             emp

                        emp
   Convert(emp)  =  ------> [ q ]

                             emp
                        /--------------------\
                       /              emp     v
   Convert(M*)   = ---/   /-->(M q)-------->[ q2 ]
                          ^        |
                          |  emp   |
                          \--------/

   Convert((M))  =  Convert(M)

After conversion, make the "head state" be a final state


EXAMPLE OF CONVERSION TO NFA

Regular expression:  (i|j)*

                      i
   Convert(i) = --->[ qi ]

                      j
   Convert(j) = --->[ qj ]

                         i
                        /--->(qi)---\
                    emp/             \emp
   Convert(i|j) = -->[ q ]            -->[ q2 ]
                       \             /
                    emp \--->(qj)---/ emp
                         j

                              emp
                         /--------------------\
                        /              emp     v
   Convert((i|j)*) = --/   /-->(i|j)-------->[ q2 ]
                           ^        |
                           |  emp   |
                           \--------/


CONVERTING AN NFA INTO A DFA

Idea: Convert each reachable set of NFA states
      into a single state of the DFA

How?  Use the emp-closure of each state q
      = the set of states reachable from q using emp

Closure wrt emp:

   closure(S) is the smallest set T such that

      T = S + union {delta(s, emp) | s in T}

   can compute closure(S) as

      T <- S;
      do
         T2 <- T
         T <- T2 + union {delta(s, emp) | s in T2}
      while (T != T2)

DFA Transitions:

   Let S be a set of states, then

      DFAdelta(S, c) = closure(union {delta(s,c) | s in S})


EXAMPLE CONVERSION OF NFA TO DFA

NFA for if|[a-z]([a-z]|[0-9])*

           i           f
      /-->[q2] ----> [[q3]]
     /
  -->[q1]
      \  emp         a-z          emp          emp
       \-->[q4] ----------> [q5] ------> [q6] ------> [[q8]]
                                          |  ^
                                    a-z   |  | emp
                                    0-9   v  |
                                          [q7]

Converted to DFA:

                      f
    [2,5,6,8] --------------> [3,6,7,8]
        ^                         |
      i |                         | a-z
        |                         | 0-9
        |       a-h j-z           v
  -->[1,4] --------------> [5,6,8] ----------> [6,7,8] --\
                                     a-z          ^      | a-z
                                     0-9          |      | 0-9
                                                  \------/


USING THE FLEX TOOL TO GENERATE LEXERS

Example: SRM assembler

   High-level description in asm_lexer.l
   Generated lexer: asm_lexer.c + asm_lexer.h

   Wrapper for lexer:
      lexer.h declares functions
      lexer.c does nothing
      asm_lexer.l defines functions
         e.g., lexer_print_token

   ASTs defined in ast.h

   asm.y is the Bison description file

      grammar == bison ==>

      - Declarations in asm.tab.h
           includes ast.h, machine_types.h, parser_types.h
           declares YYSTYPE
        lexer.h
           declares yytokentype
              eolsym = ...
              minussym = ...
              dottextsym = ...
              ...

      - Definitions in asm.tab.c
           defines yyparse()
           YYSTYPE yylval;
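A flex-generated lexer can also be exercised by itself with a small driver.
Here is a hypothetical sketch in C (not part of the SRM assembler sources):
it calls the generated yylex() repeatedly and prints each token code until
yylex() returns 0, which flex uses to signal end of input.

   #include <stdio.h>

   extern int yylex(void);   /* generated by flex from the .l file */

   int main(void) {
       int tok;
       while ((tok = yylex()) != 0) {   /* 0 means end of input */
           printf("token code: %d\n", tok);
       }
       return 0;
   }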
STRUCTURE OF FLEX INPUT FILE

   /* ... definitions section ... */
   %%
   /* ... rules section ... */
   %%
   /* ... user subroutines ... */


SECTIONS IN FLEX INPUT (.l file)

Definitions section:

Rules section:

User subroutine section:


WHY CONTEXT-FREE PARSING?

Can we define a regular expression to make sure
expressions have balanced parentheses?

   e.g., recognize:  (34)  and  ((12)+(789))
         but not:    (567))+82))))


WHAT IS NEEDED

Needed for checking balanced parentheses:


PARSING

Want to recognize language with

Goal:


CONTEXT-FREE GRAMMARS

def: a *context-free* grammar (CFG) (N, T, P, S) has start symbol S in N
     and each production (in P) has the form:

        <A> -> g

     where <A> is a Nonterminal symbol and g is in (N+T)*

Example:

def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*,
     written g =P=> g', iff
        g is e <A> f,  g' is e h f,
        <A> is a nonterminal (in N),  h is in (N+T)*,
        and the rule <A> -> h is in P

Example:   ( <expr> )  =P=>


DERIVATION

def: a *derivation* of a terminal string t from the rules P
     of a CFG (N,T,P,S) is a sequence (g0, g1, g2, ..., gm),
     where g0 = S, gm = t,
     for all i: gi is in (N+T)*,
     and for all 0 <= j < m: gj =P=> g(j+1)

Example:


LEFTMOST DERIVATION

def: a *leftmost derivation* of a string t in T* from a CFG (N,T,P,S)
     is a derivation (g0, g1, ..., gm) of t from (N,T,P,S)
     such that gm = t and for all 0 <= j < m:
     when gj =P=> g(j+1) and the nonterminal <A> is replaced in gj,
     then there are no nonterminals to the left of <A> in gj.

Example:


PARSE TREES AND DERIVATIONS

def: A *parse tree*, Tr, for a CFG (N,T,P,S) represents a derivation, D, iff:

   - Each node in Tr with children is labeled by a nonterminal (in N)
   - the root of Tr is the start symbol S
   - there is an arc from <A> to h in (N+T) iff <A> -> ... h ... is in P
   - the order of the children of a node labeled <A>
     is the order in a production <A> -> ... in P


EXAMPLES

GRAMMAR:

   <expr> ::= <number> | <expr> + <expr> | <expr> * <expr>

Derivations of 3*4+2:


EXAMPLE PARSE TREES

Corresponding to the leftmost derivation:

Corresponding to the rightmost derivation:


AMBIGUITY

def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T*
     such that there are two different parse trees for t


FIXING AMBIGUOUS GRAMMARS

Idea: Rewrite the grammar to

Example:


RECURSIVE DESCENT PARSING ALGORITHM

For each production rule, of the form:

   <N> ::= g1 | g2 | ... | gm

 1. Write a

 2. This function


EXAMPLE RECURSIVE-DESCENT PARSER

   <stmt> ::= if <cond> then <stmt> else <stmt>
            | begin <stmt> <stmt-list> end
            | write <number>
   <stmt-list> ::= ; <stmt> <stmt-list> | <empty>
   <empty> ::=
   <cond> ::= <number> = <number>

   token tok = lexer_next();

   void advance() { tok = lexer_next(); }

   void eat(token_type tt) {
       if (tok.typ == tt) {
           advance();
       } else {
           /* ... report error */
       }
   }

   void parseCond() {
       eat(numbersym); eat(eqsym); eat(numbersym);
   }

   void parseStmt() {
       switch (tok.typ) {
       case ifsym:
           eat(ifsym); parseCond(); eat(thensym);
           parseStmt(); eat(elsesym); parseStmt();
           break;
       case beginsym:
           eat(beginsym); parseStmt(); parseList(); eat(endsym);
           break;
       case writesym:
           eat(writesym); eat(numbersym);
           break;
       default:
           /* report error */
           break;
       }
   }


LL(1) GRAMMARS

A recursive-descent parser must:

 - choose between alternatives (e.g., <A> ::= g1 | g2)

def: A grammar is *LL(1)* iff


LR(1) GRAMMARS

An LR(1) parser needs to decide when to:
   shift (push token on stack) or
   reduce
 - uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff


LALR(1) PARSING

Smaller tables than LR(1)
 - merges states of the DFA if they only differ in lookahead


PROBLEM: AMBIGUITY

Consider:

   <stmt> ::= <ident> := <expr>
            | if <condition> then <stmt> <else-part>
   <else-part> ::= <empty> | else <stmt>

and the statement:

   if b1 then if b2 then x := 2 else x := 3

Is this parsed as:

                 <stmt>
        /     |      \       \
      if  <condition> then  <stmt>
              |                \
              b1                \
                       /     |     \      \      \       \
                     if <condition> then <stmt>  else  <stmt>
                             |              |             |
                             b2          x := 2        x := 3

or as:

                 <stmt>
        /     |      \       \         \       \
      if  <condition> then  <stmt>     else   <stmt>
              |                \                  |
              b1                \              x := 3
                       /     |     \      \
                     if <condition> then <stmt>
                             |              |
                             b2          x := 2


FIXES FOR AMBIGUITY

Change the language:
 a. Always have an else clause:

       <stmt> ::= if <condition> then <stmt> else <stmt>

    (use skip if you don't want to do anything)

 b. Use an end marker:

       <stmt> ::= if <condition> then <stmt> else <stmt> fi
                | if <condition> then <stmt> fi

Give precedence to one production:

   <stmt> ::= if <condition> then <stmt> <else-part>
   <else-part> ::= else <stmt>    // priority!
                 | <empty>

So we only get the parse tree:

                 <stmt>
        /     |      \       \
      if  <condition> then  <stmt>
              |                \
              b1                \
                       /     |     \      \      \       \
                     if <condition> then <stmt>  else  <stmt>
                             |              |             |
                             b2          x := 2        x := 3
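In a recursive-descent parser, this priority amounts to consuming an else
as soon as one is seen after the then branch, so the else binds to the
nearest enclosing if. Here is a sketch in the style of the earlier
recursive-descent example (tok, eat, parseCond, and parseStmt are assumed
from there; parseIfStmt and parseElsePart are invented names):

   /* <else-part> ::= else <stmt>   // priority!
                    | <empty>                      */
   void parseElsePart() {
       if (tok.typ == elsesym) {   /* take the else if it is there */
           eat(elsesym);
           parseStmt();
       }
       /* otherwise <else-part> ::= <empty>: consume nothing */
   }

   /* <stmt> ::= if <condition> then <stmt> <else-part> */
   void parseIfStmt() {
       eat(ifsym);
       parseCond();
       eat(thensym);
       parseStmt();      /* a nested if here grabs a following else first */
       parseElsePart();
   }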