LANGUAGES

   def: A *language* is a set of strings of characters
        (from some alphabet)

   Characters are:
      For a human language: alphabetical characters (a, b, c, ...)
      For a computer language: codes in some character set
         (ASCII characters or Unicode)

LANGUAGE CLASSES

   Languages can be classified by the kind of grammar
   needed to recognize them

   Venn Diagram (each class contains the classes nested inside it):

   |--------------------------------------|
   | Type 0 Languages                     |
   | |----------------------------------| |
   | | Context-sensitive Languages      | |
   | | |------------------------------| | |
   | | | Context-free Languages       | | |
   | | | |--------------------------| | | |
   | | | | Regular Languages        | | | |
   | | | |--------------------------| | | |
   | | |------------------------------| | |
   | |----------------------------------| |
   |--------------------------------------|

PHASES OF A COMPILER

   Programs allowed by a compiler's:

   |--------------------------------------|
   | Lexical Analysis (Lexer)             |
   |   uses a regular grammar             |
   | |----------------------------------| |
   | | Parser                           | |
   | |   uses a context-free grammar    | |
   | | |------------------------------| | |
   | | | Static Analysis              | | |
   | | |   uses techniques equivalent | | |
   | | |   to context-sensitive       | | |
   | | |   grammars                   | | |
   | | | |--------------------------| | | |
   | | | | Runtime checks           | | | |
   | | | |--------------------------| | | |
   | | |------------------------------| | |
   | |----------------------------------| |
   |--------------------------------------|

   3*4+2      (3*4) + 2      3 * (4+2)

PROBLEM WRITING SYNTAX ANALYSIS

   Naive way to write a compiler, etc.

   Language
   Def.docx  -- coding --> parser.c (v. 1)
   Def2.docx -- coding --> parser.c (v. 2)
             -- coding --> parser.c (v. 3)
   Def3.docx -- coding --> parser.c (v. 4)
   Def4.docx -- coding --> parser.c (v. 5)
   ...                     ...
   DefN.docx -- coding --> parser.c (v. N)

   Disadvantages:
      - can be slow to write the parser
      - can be error prone
      - hard to verify
      - experiments with language are costly
      - code might not be very efficient

COMPUTER SCIENCE SOLUTION: AUTOMATION

   high-level
   description      tool        generated code
   lang.y       -- bison -->    lang.tab.c + lang.tab.h
   lang2.y      -- bison -->    lang2.tab.c + lang2.tab.h
   ...
   langN.y      -- bison -->    langN.tab.c + langN.tab.h

   Advantages:
      - faster cycle time (better for humans)
      - grammar is easier to check/verify
      - experiments/changes are easy/cheap
      - code generated is very efficient

GRAMMARS DESCRIBE LANGUAGES

   Grammars are high-level descriptions of languages/parsers

   def: a *grammar* consists of a finite set of rules
        (called "productions") and a start symbol (a nonterminal).

   Let V = nonterminals + terminals,
   where there is no symbol in both nonterminals and terminals

   The rules have the form V+ -> V*

   def: The language generated by a grammar G
        with set of productions P is:
           {w | w is in terminals* and S =>* w}
        where S is the start symbol of G,

           g A d => g B d  iff  g in V* and d in V*
                                and A -> B is a rule in P
        and
           j =>* h  iff  either h = j
                         or j => i and i =>* h for some i in V*

BNF NOTATION FOR GRAMMARS

   ::=      means what -> means in grammars, "can become"
   |        means "or"
   <name>   is a nonterminal symbol
            (terminal symbols don't have angle brackets)

   Example
      <binary> ::= <bit> | <bit> <binary>
      <bit> ::= 0 | 1

   Example: <binary> =>* 1101
      <binary> -> <bit> <binary>
               -> 1 <binary>
               -> 1 <bit> <binary>
               -> 1 1 <binary>
               -> 1 1 <bit> <binary>
               -> 1 1 0 <binary>
               -> 1 1 0 <bit>
               -> 1 1 0 1

GRAMMARS AS RULES OF GAMES

   A grammar can be seen as describing two games:
      - A production game
        (Can you produce this string?)
      - A recognition/parsing game
        (Is this string in the language?)

PRODUCTION GAME

   Goal: produce a string in the language from the start symbol

   Example Grammar:
      <sentence> -> <noun> <verb> <adjective>
      <noun> -> Johnny | Sue | Charlie
      <verb> -> is | can be
      <adjective> -> good | difficult

   Can we produce "Johnny is good"?
      <sentence> -> <noun> <verb> <adjective>
                 -> Johnny <verb> <adjective>
                 -> Johnny is <adjective>
                 -> Johnny is good

RECOGNITION OR PARSING GAME

   Goal: determine if a string is in the language of the grammar

   Example Grammar:
      <sentence> -> <noun> <verb> <adjective>
      <noun> -> Johnny | Sue | Charlie
      <verb> -> is | can be
      <adjective> -> good | difficult

   Is "Johnny is good" in the language of this grammar?

      Johnny is good
      <- Johnny is <adjective>
      <- Johnny <verb> <adjective>
      <- <noun> <verb> <adjective>
      <- <sentence>

      "Charlie can be difficult"
      <- Charlie can be <adjective>
      <- Charlie <verb> <adjective>
      <- <noun> <verb> <adjective>
      <- <sentence>

DERIVATION (OR PARSE) TREES

   def: a *tree* is a finite set of nodes connected by
        directed edges that is connected and has no cycles

   def: a *derivation tree* for grammar G is a tree such that:
      - Every node has a label that is a symbol of G
      - The root is labeled by the start symbol of G
      - Every node with a direct descendant
        is labeled by a nonterminal
      - If the direct descendants of a node labeled by N
        have the following labels (in order): A, B, C, ..., K
        then G has a production of form N -> A B C ... K

EXAMPLE DERIVATION TREE

   Example Grammar:
      <sentence> -> <noun> <verb> <adjective>
      <noun> -> Johnny | Sue | Charlie
      <verb> -> is | can be
      <adjective> -> good | difficult

   String to parse: "Johnny is good"

                  <sentence>
                 /    |     \
                /     |      \
               v      v       v
          <noun>   <verb>   <adjective>
             |        |         |
             v        v         v
          Johnny      is       good

EXTENSIONS TO BNF (EBNF)

   Arbitrary number of repeats:
      { x } means 0 or more repeats of x

      ::= { }
      is equivalent to:
      ::=
      ::= |
      ::=

      {} is also written as: * or [ ] ...

   One-or-more repeats:
      x+ means 1 or more repeats of x

      ::= +
      is equivalent to:
      ::=
      ::= |

      + is sometimes written as ... or [ ] ...

   Optional element:
      [ x ] means 0 or 1 occurrences of x

      ::= [ ]
      is equivalent to:
      ::=
      ::= |

READING A BNF GRAMMAR

   Example rules:
      ::=
      ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      ::= 0 |
      ::= |

   Using EBNF:
      ::= +
      ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      ::= 0 |

   Better (allowing single digit s):
      ::= {}
      ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      ::= 0 |

EBNF GRAMMAR FOR (SUBSET OF) PL/0

   ::= . ::= ::= {} ::= const {} ; ::= = ::= , ::= {} ::= var ;
   ::= {} ::= , ::= {} ::= procedure ; ;
   ::= := | call | begin {} end | if then | while do
     | read | write | skip
   ::= ; ::= | else ::= ::= odd |

EXAMPLES IN PL/0

   Shortest program:

      skip.
   Factorial program:

      var n, res;  # input and result
      procedure fact;
        begin
          read n;
          res := 1;
          while (n <> 0) do
            begin
              res := res * n;
              n := n-1
            end;
          write res
        end;
      call fact.

LEXICAL ANALYSIS MOTIVATION

   # $Id$\n .text start\nstart:\tADDI ...

   Want to:
      - write a parser at a high level (not individual characters)
      - ignore comments, whitespace
      - have parser be efficient
      - deal with bad characters sensibly

   Approach:
      - break the input (stream of chars) into tokens
        tokens = chunks of input

        enum name     characters
        dottextsym    .text
        identsym      start
        identsym      start
        colonsym      :
        addiopsym     ADDI
        ...

LEXICAL ANALYSIS

   Lexical means relating to the words of a language

GOALS OF LEXICAL ANALYSIS

   - Simplify the parser, so it need not handle:
        - white space and comments
        - details of tokens

   - Recognize the longest match
     Why?
        if8 = b;

   - Handle every possible character of input
     Why?
        - so that we completely check a program
        - don't crash on any input

CONFLICT BETWEEN RULES

   Suppose that both "if" and numbers are tokens:
      ifsym
      numbersym
   and the input is
      if8

   What tokens should "if8" match?
      want it to be an ident

   Fixing such situations:
      tell programmers to use whitespace or punctuation
         to separate tokens
      otherwise use longest match
         elseif

WHICH TOKEN TO RETURN?

   If the input is "<=", what token(s)?
      one token (lessequalsym)

   If the input is "<8", what token(s)?
      two tokens (lesssym, numbersym)

   If the input is "if", what token(s)?
      one token (ifsym, not an identsym)

   If the input is "//", what token(s)?
      (nothing, start of a comment)

   Summary:
      favor the longest string we can make into a token
      favor the reserved words over identifiers

      IF IF=THEN /* keyword at the beginning */
         THEN THEN=ELSE
         ELSE IF=THEN=ELSE

THE BIG PICTURE

   source         tokens
   code --> [ Lexer ] ------> [ Parser ]
                                  |
                                  | AST (abstract syntax trees)
                                  v
                         [ static analysis ]
                                  |
                                  | IR
                                  v
                         [ code generator ]

   For the Lexer we want to:
      - specify the tokens using regular expressions (REs)
      - convert REs to DFAs to execute them
   but the easy conversions are:
      - REs to NFAs
      - NFAs to DFAs

HOW PARSER WORKS WITH LEXER

   Coroutine structure:

   Parser calls lexer:
      tok = yytoken(); // call lexer
      /* ... use yylval ... */

         Lexer function (yytoken)
            remembers pointer to input stream
            returns next token (int code)

   Parser works...

   Parser calls lexer again:
      tok = yytoken(); // call lexer
      /* ... use yylval ... */

BISON AND FLEX, GENERATING A PARSER

   idea:

   ast.h (AST types)
     |
     |           bison
     |         -----> g.tab.c   (yyparse function)
     |        /          ^
     v       /  bison    |
   g.y file ---------> g.tab.h
                         |  token
                flex     v  defs.
   g.l file ---------> g.c      (yytoken function)

DEFINITIONS

   The lexical grammar of a language is usually regular,
   because regular languages can be recognized
   very quickly/efficiently

   def: a grammar is *regular* iff all its productions
        have one of these forms:
           <A> ::= c
        or
           <A> ::= c <B>
        where c is a terminal symbol,
        and <A> and <B> are nonterminals

   def: a language is *regular* iff it can be defined
        using a regular grammar.

   Thm: Every regular language can be recognized
        by a finite automaton.

   Thm: Every regular language can be specified
        by a regular expression.

REGULAR EXPRESSIONS

   The language of regular expressions:

      <re> ::= <c>
             | <re> '|' <re>
             | <re> <re>
             | emp
             | <re>*
             | ( <re> )
      where <c> is a character

   Examples:

      RE                  meaning (language)
      ====================================
      emp                 the empty string
      (0|1)*0             even binary numerals
      b*(abb*)*(a|emp)    a's and b's without consecutive a's
                             abba
                             b
                             bbbb
                             ababababbba

EXTENSIONS TO REGULAR EXPRESSIONS

   [abcd]   means a|b|c|d
   [h-m]    means h|i|j|k|l|m
            so [a-z] is all lower case letters in ASCII
   x?
            means x|emp
   y+       means y(y*)

FOR YOU TO DO

   Write a regular expression that describes:

   1. The reserved word "if"
         if

   2. The set of all (positive) decimal numbers
         [0-9][0-9]*
         [0-9]+
         [1-9][0-9]*
         [1-9]([0-9]*)

   3. The set of all possible identifiers with underbars (_)
         [a-zA-Z0-9_]+
            FOO
            F_B
            I10
            897G

      Identifiers that must start with a letter or underscore:
         [_a-zA-Z][a-zA-Z0-9_]*

      Identifiers that start with c, i, or j:
         [cij][a-zA-Z0-9_]*
            c
            court
            c85
            (but not k9)

   4. The reserved word "int" in C
         int

   5. Signed integer literals as in C
         [-+]?(([1-9][0-9]*)|(0[0-7]*)|(0x[0-9a-fA-F]+))
            6
            -5

   (If you have time, write these as regular grammars also.)

EXAMPLE REGULAR EXPRESSIONS FOR PL/0

   PATTERN         RE
   =========================
   DECDIGIT        [0-9]
   LETTER          [_a-zA-Z]
   LETTERORDIGIT   {LETTER}|{DECDIGIT}
   IDENT           {LETTER}{LETTERORDIGIT}*
   ...

FINITE AUTOMATA EXAMPLE

   State diagram:

               <            =
      -->[q0] ---> [[q1]] ---> [[q2]]
                     |
                     | >
                     v
                   [[q3]]

PSEUDO CODE FOR THIS LEXER CASE

   char c;
   get next char into c;
   if (c is '<') then
      get the next char into c
      if (c is '=') then
         return leqsym
      else if (c is '>') then
         return neqsym
      else
         unget the char c
         return lesssym

DEALING WITH COMMENTS

   Suppose / is used for division
           // starts a comment to end of line
           (unlike PL/0!)

   What state diagram?

                                        [^\n]
                                        /--\
               '/'             '/'      \  v
      --> [q0] ----> [[qdiv]] ----> [q2]
            ^                         |
            \-------------------------/
                       \n

   How does whitespace fit in?
      read those chars and go back to state q0 (start state)

TACTICS FOR IGNORING WHITESPACE, COMMENTS

   Goal: do not send ignored tokens to parser

   Can always get a non-ignored token:
      have lexer keep going until it finds a non-ignored token

   Return "tokens" that include ignored stuff
      to a loop that ignores them

   Giant DFA that goes back to start state on seeing
      something to ignore

NONDETERMINISTIC FINITE AUTOMATA

   def: A *nondeterministic finite automaton* (NFA)
        over an alphabet Sigma is a system
           (K, Sigma, delta, q0, F)
        where
           K is a finite set (of states),
           Sigma is a finite set (the input alphabet),
           delta is a map of type (K, Sigma) -> Sets(K),
           q0 in K is the initial state, &
           F is a subset of K (the final/accepting states).

TRANSITION FUNCTION AND ACCEPTANCE

   p in delta(q,x) means that in state q, on input x
      the next state can be p

   delta*(q,s), where s in Sigma*, is defined by:
      delta*(q,emp) = {q}     (i.e., q in delta*(q,emp))
      p in delta*(q,xa) iff p in delta*(q2,a)
         for some q2 in delta(q,x),
         where x in Sigma and a in Sigma*

   Lemma: for all c in Sigma, delta*(q,c) = delta(q,c)

   def: An NFA (K, Sigma, delta, q0, F) *accepts*
        a string s in Sigma* iff
        there is some q in delta*(q0,s) such that q in F

EXAMPLE NFA

       0,1                            0,1
      /---\                          /---\
      \   /                          \   /
       | v       0            0       | v
   -->[ q0 ] ------> [ q3 ] ------> [[ q4 ]]
        |
        | 1
        v
      [ q1 ]
        |
        | 1
        v
     [[ q2 ]]
       | ^
      /   \
      \---/
       0,1

   K = {q0,q1,q2,q3,q4}
   Sigma = {0,1}
   q0 is start state
   F = {q2,q4}

   delta(q0,0) = {q0,q3}      delta(q0,1) = {q0,q1}
   delta(q1,0) = {}           delta(q1,1) = {q2}
   delta(q2,0) = {q2}         delta(q2,1) = {q2}
   delta(q3,0) = {q4}         delta(q3,1) = {}
   delta(q4,0) = {q4}         delta(q4,1) = {q4}

EXTENDING FUNCTIONS TO SETS OF STATES

   Notation extending delta and delta* to sets of states

   d(Q,x) = union of d(q,x) for all q in Q

   so
      d({},x) = {}
      d({q},x) = d(q,x)
      d({q1,q2},x) = d(q1,x)+d(q2,x)
      d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x)
      etc.
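   The set notation above can be sketched in C for the example NFA.
   This is a minimal illustration, not part of the course code: the
   names StateSet, Q, step, delta_star, and accepts are hypothetical,
   and a set of states is represented as a bit mask (bit i stands for
   q_i). The table is copied from the delta listing for the example NFA.

```c
enum { NSTATES = 5 };               /* states q0..q4 of the example NFA */
typedef unsigned StateSet;          /* bit i set means q_i is in the set */
#define Q(i) (1u << (i))

/* delta[q][x] for input symbol x in {0,1} (hypothetical table layout) */
static const StateSet delta[NSTATES][2] = {
    { Q(0) | Q(3), Q(0) | Q(1) },   /* q0 */
    { 0,           Q(2)        },   /* q1 */
    { Q(2),        Q(2)        },   /* q2 */
    { Q(4),        0           },   /* q3 */
    { Q(4),        Q(4)        },   /* q4 */
};
static const StateSet FINAL = Q(2) | Q(4);   /* F = {q2,q4} */

/* d(Q,x): the union of delta(q,x) for all q in the set */
static StateSet step(StateSet set, int x) {
    StateSet next = 0;
    for (int q = 0; q < NSTATES; q++)
        if (set & Q(q))
            next |= delta[q][x];
    return next;
}

/* delta*({q0}, s) for a string of '0' and '1' characters;
   any other character yields the empty set */
static StateSet delta_star(const char *s) {
    StateSet cur = Q(0);
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return 0;
        cur = step(cur, *s - '0');
    }
    return cur;
}

/* the NFA accepts s iff some state in delta*(q0,s) is final */
static int accepts(const char *s) {
    return (delta_star(s) & FINAL) != 0;
}
```

   With these definitions, delta_star("010011") is the mask for
   {q0,q1,q2,q4}, so accepts("010011") is true, while accepts("0101")
   is false because neither 00 nor 11 occurs in that input.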
   Note that, for x in Sigma,
      delta*({q}, x) = delta*(q, x) = delta(q, x)

EXAMPLE

   delta*(q0,010011)
   = delta*(delta(q0,0),10011)
   = delta*({q0,q3},10011)
   = delta*(q0,10011) + delta*(q3,10011)
   = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011)
   = delta*({q0,q1},0011) + delta*({},0011)
   = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {}
   = delta*({q0,q3},011) + delta*({},011)
   = delta*({q0,q3},011)
   = delta*(delta(q0,0),11) + delta*(delta(q3,0),11)
   = delta*({q0,q3},11) + delta*({q4},11)
   = delta*({q0,q3,q4},11)
   = delta*(delta({q0,q3,q4},1),1)
   = delta*({q0,q1,q4},1)
   = delta({q0,q1,q4},1)
   = delta(q0,1) + delta(q1,1) + delta(q4,1)
   = {q0,q1} + {q2} + {q4}
   = {q0,q1,q2,q4}

DETERMINISTIC FINITE AUTOMATA

   def: a *deterministic finite automaton* (DFA) is
        an NFA in which delta(q,c) is a singleton or empty
        for all q in K and c in Sigma.

IMPLEMENTING DFAS

   How would you represent states?
      ints

   How would you implement a DFA?
      could use a switch
      or a 2D array (state x input char)

PROBLEM

   We want to specify lexical grammar using regular expressions

   So we need to convert regular expressions into DFAs to run them

CONVERTING REs TO NFAs

   Definition based on grammar of Regular Expressions:

   Result of Convert(M) looks like this:

      -->(M q)

   where the "tail", -->, goes to the start state
   and q is the "head state"

   assume also Convert(N) is --->(N q')

                   c
   Convert(c) =  --->[ q ]

   Convert(M | N) =
                     /--->(M q)--\
               emp  /             \
             ---->[ q ]            -> [ q2 ]
                    \             /
                     \--->(N q')-/

   Convert(M N) = --> (M q)-->(N q')

                    emp
   Convert(emp) = -----> [ q ]

                      emp
                   /--------------\
                  /        emp     v
   Convert(M*) = -/  /->(M q)---->[ q2 ]
                    /                /
                    \      emp      /
                     \-------------/

   Convert((M)) = Convert(M)

   After conversion, make the "head state" be a final state

EXAMPLE OF CONVERSION TO NFA

   Regular expression: (i|j)*

                   i
   Convert(i) =  --->[ qi ]

                   j
   Convert(j) =  --->[ qj ]

   Convert(i|j) =
                       i
                    /--->(qi)---\ emp
              emp  /             \
            ---->[ q ]            -> [ q2 ]
                   \   j         /
                    \--->(qj)---/ emp

   Convert((i|j)*) =

CONVERTING AN NFA INTO A DFA
   Idea: Convert each reachable set of NFA states into
         a single DFA state

   How? Use the emp-closure of each state q
        = set of states reachable from q using emp

   Closure wrt emp:
      closure(S) is the smallest set T such that
         T = S + union {delta(s, emp) | s in T}

      can compute closure(S) as
         T <- S;
         do
            T2 <- T
            T <- T2 + union {delta(s, emp) | s in T2}
         while (T != T2)

   DFA Transitions: Let S be a set of states, then
      DFAdelta(S, c) = closure(union {delta(s,c) | s in S})

EXAMPLE CONVERSION OF NFA TO DFA

   NFA for if|[a-z]([a-z]|[0-9])*

   (state diagram; q1 is the start state, q3 and q8 are final)
      q1 --i--> q2            q2 --f--> [[q3]]
      q1 --emp--> q4          q4 --a-z--> q5
      q5 --emp--> [[q8]]      q5 --emp--> q6
      q6 --a-z--> q7          q6 --0-9--> q7
      q7 --emp--> [[q8]]      q7 --emp--> q6

   Converted to DFA:

   (state diagram; [1,4] is the start state)
      [1,4]     --i-->             [2,5,6,8]
      [1,4]     --a-h,j-z-->       [5,6,8]
      [2,5,6,8] --f-->             [3,6,7,8]
      [2,5,6,8] --a-e,g-z,0-9-->   [6,7,8]
      [3,6,7,8] --a-z,0-9-->       [6,7,8]
      [5,6,8]   --a-z,0-9-->       [6,7,8]
      [6,7,8]   --a-z,0-9-->       [6,7,8]

USING THE FLEX TOOL TO GENERATE LEXERS

   Example: SRM assembler

   High-level description in asm_lexer.l

   Generated lexer
      asm_lexer.c + asm_lexer.h

   Wrapper for lexer:
      lexer.h declares functions
      lexer.c defines functions
         e.g., lexer_print_token

   ASTs defined in ast.h

   asm.y is Bison description file
      grammar

      == bison ==>

      - Declarations in asm.tab.h
           includes ast.h
                    machine_types.h
                    parser_types.h   declares YYSTYPE
                    lexer.h
           declares yytokentype
                    eolsym = ...
                    minussym = ...
                    dottextsym = ...
                    ...

      - Definitions in asm.tab.c
           defines yyparse()
                   YYSTYPE yylval;

STRUCTURE OF FLEX INPUT FILE

   /* ... definitions section ... */
   %%
   /* ... rules section ... */
   %%
   /* ... user subroutines ... */

SECTIONS IN FLEX INPUT (.l file)

   Definitions section:

   Rules section:

   User subroutine section:

WHY CONTEXT-FREE PARSING?

   Can we define a regular expression to make sure
   expressions have balanced parentheses?
      e.g., recognize: (34) and ((12)+(789))
            but not: (567))+82))))

WHAT IS NEEDED

   Needed for checking balanced parentheses:

PARSING

   Want to recognize language with

   Goal:

CONTEXT-FREE GRAMMARS

   def: a *context-free grammar* (CFG) (N, T, P, S)
        has start symbol S in N
        and each production (in P) has the form:
           <A> -> g
        where <A> is a nonterminal symbol (in N),
        and g in (N+T)*

   Example:

   def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*,
        written g =P=> g', iff
           g is e <A> f, g' is e h f,
           <A> is a nonterminal (in N),
           e, f, and h are in (N+T)*,
           and the rule <A> -> h is in P

   Example: ( ) =P=>

DERIVATION

   def: a *derivation* of a terminal string t
        from the rules P of a CFG (N,T,P,S),
        is a sequence (g0, g1, g2, ..., gm),
        where g0 = S, gm = t,
        and for all i: gi in (N+T)*,
        and for all 0 <= j < m: gj =P=> g(j+1)

   Example:

LEFTMOST DERIVATION

   def: a *leftmost derivation* of a string t in T*
        from a CFG (N,T,P,S)
        is a derivation of t from (N,T,P,S)
           (S, g1, ..., gm)
        such that gm = t and for all 0 <= j < m:
        when gj =P=> g(j+1) and the nonterminal <A>
        is replaced in gj, then
        there are no nonterminals to the left of <A> in gj.

   Example:

PARSE TREES AND DERIVATIONS

   def: A *parse tree*, Tr, for a CFG (N,T,P,S)
        represents a derivation, D, iff:
        - Each interior node in Tr is labeled by a nonterminal (in N)
        - the root of Tr is the start symbol S
        - an arc from <A> to h in (N+T)
          iff <A> -> ... h ... in P
        - the order of children of a node labeled <A>
          is the order in a production <A> -> ... in P

EXAMPLES

   GRAMMAR:
      <exp> ::= <number>
              | <exp> + <exp>
              | <exp> * <exp>

   Derivations of 3*4+2

EXAMPLE PARSE TREES

   Corresponding to leftmost derivation:

   Corresponding to rightmost derivation:

AMBIGUITY

   def: a CFG (N,T,P,S) is *ambiguous* iff
        there is some t in T* such that
        there are two different parse trees for t

FIXING AMBIGUOUS GRAMMARS

   Idea: Rewrite grammar to

   Example:

RECURSIVE DESCENT PARSING ALGORITHM

   For each production rule, of form:
      <A> ::= g1 | g2 | ... | gm

   1. Write a

   2.
      This function

EXAMPLE RECURSIVE-DESCENT PARSER

   <stmt> ::= if <cond> then <stmt> else <stmt>
            | begin <stmt> <list> end
            | write <number>
   <list> ::= ; <stmt> <list>
            | <empty>
   <empty> ::=
   <cond> ::= <number> = <number>

   token tok;    /* current token; call advance() once before parsing */

   void parseStmt();
   void parseList();   /* parses <list>; not shown */

   /* read the next token from the lexer into tok */
   void advance() { tok = yytoken(); }

   /* consume a token of type tt, or report an error */
   void eat(token_type tt) {
       if (tok.typ == tt) {
           advance();
       } else {
           /* ... report error */
       }
   }

   void parseCond() {
       eat(numbersym); eat(eqsym); eat(numbersym);
   }

   void parseStmt() {
       switch (tok.typ) {
       case ifsym:
           eat(ifsym); parseCond();
           eat(thensym); parseStmt();
           eat(elsesym); parseStmt();
           break;
       case beginsym:
           eat(beginsym); parseStmt();
           parseList(); eat(endsym);
           break;
       case writesym:
           eat(writesym); eat(numbersym);
           break;
       default:
           /* ... report error */
           break;
       }
   }

LL(1) GRAMMARS

   A recursive-descent parser must:
      - choose between alternatives
        (e.g., <a> ::= <b> | <c>)

   def: A grammar is *LL(1)* iff

LR(1) GRAMMARS

   An LR(1) parser needs to decide when to:
      shift (push token on stack) or
      reduce

   uses a DFA based on stack + lookahead

   def: A grammar is *LR(1)* iff

LALR(1) Parsing

   Smaller tables than LR(1)
      - merges states of the DFA
        if they only differ in lookahead

PROBLEM: AMBIGUITY

   Consider:

      <stmt> ::= <ident> := <exp>
               | if <cond> then <stmt> <else-part>
      <else-part> ::= <empty>
                    | else <stmt>

   and the statement:

      if b1 then if b2 then x := 2 else x := 3

   Is this parsed as (else belongs to the inner if):

      <stmt>
        if
        <cond>        b1
        then
        <stmt>
          if
          <cond>      ...
          then
          <stmt>      x := 2
          <else-part>
            else
            <stmt>    x := 3

   or as (else belongs to the outer if):

      <stmt>
        if
        <cond>        b1
        then
        <stmt>
          if
          <cond>      b2
          then
          <stmt>      x := 2
        <else-part>
          else
          <stmt>      x := 3

FIXES FOR AMBIGUITY

   Change the language:

      a. Always have an else clause:
            <stmt> ::= if <cond> then <stmt> else <stmt>
         (use skip if don't want to do anything)

      b. Use an end marker
            <stmt> ::= if <cond> then <stmt> else <stmt> fi
                     | if <cond> then <stmt> fi

   Give precedence to one production:

      <stmt> ::= if <cond> then <stmt> <else-part>
      <else-part> ::= else <stmt>      // priority!
                    | <empty>

   So we only get the parse tree in which the else
   belongs to the inner (nearest) if:

      <stmt>
        if
        <cond>        b1
        then
        <stmt>
          if
          <cond>      ...
          then
          <stmt>      x := 2
          <else-part>
            else
            <stmt>    x := 3
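
   In a recursive-descent parser, the priority rule above falls out
   naturally if the function for if-statements takes an "else" whenever
   it sees one. The following is a minimal sketch under simplifying
   assumptions, not the course's parser: tokens are words separated by
   single spaces, a condition is one word, an assignment is
   "ident := number", and the names parse, parse_stmt, at, and emit are
   hypothetical. The result is a bracketed string that shows which if
   each else was attached to.

```c
#include <string.h>

#define MAXTOKS 64
#define MAXLEN 256

static char text[MAXLEN];           /* private copy of the input */
static const char *toks[MAXTOKS];   /* the space-separated words */
static int ntoks, pos;              /* token count, current position */
static char out[MAXLEN];            /* bracketed parse result */

/* is the current token the given word? */
static int at(const char *word) {
    return pos < ntoks && strcmp(toks[pos], word) == 0;
}

/* append s to the result, never overflowing the buffer */
static void emit(const char *s) {
    strncat(out, s, sizeof out - strlen(out) - 1);
}

/* <stmt> ::= if <cond> then <stmt> [ else <stmt> ] | ident := number */
static int parse_stmt(void) {
    if (at("if")) {
        pos++;
        if (pos >= ntoks) return 0;
        emit("(if ");
        emit(toks[pos++]);          /* the one-word condition */
        if (!at("then")) return 0;
        pos++;
        emit(" ");
        if (!parse_stmt()) return 0;
        /* priority: take the else now, so it binds to the nearest if */
        if (at("else")) {
            pos++;
            emit(" else ");
            if (!parse_stmt()) return 0;
        }
        emit(")");
        return 1;
    }
    if (pos + 2 < ntoks && strcmp(toks[pos + 1], ":=") == 0) {
        emit("(");
        emit(toks[pos]);
        emit(" := ");
        emit(toks[pos + 2]);
        emit(")");
        pos += 3;
        return 1;
    }
    return 0;                       /* not a statement */
}

/* returns the bracketed parse of src, or NULL on a syntax error */
const char *parse(const char *src) {
    strncpy(text, src, sizeof text - 1);
    text[sizeof text - 1] = '\0';
    ntoks = pos = 0;
    out[0] = '\0';
    for (char *w = strtok(text, " "); w != NULL && ntoks < MAXTOKS;
         w = strtok(NULL, " "))
        toks[ntoks++] = w;
    return (parse_stmt() && pos == ntoks) ? out : NULL;
}
```

   For the statement discussed above,
   parse("if b1 then if b2 then x := 2 else x := 3") gives
   "(if b1 (if b2 (x := 2) else (x := 3)))": the else is inside the
   brackets of the inner if, matching the single parse tree that the
   priority rule allows.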