I. Overview A. theory 1. languages ------------------------------------------ LANGUAGES def: A *language* is ------------------------------------------ For a written natural language, what are the characters? For a spoken natural language, what are the characters? 2. hierarchy of language classes ------------------------------------------ LANGUAGE CLASSES Languages can be classified by Venn Diagram: |--------------------------------------| | Regular Languages | | | | | | |--------------------------------| | | | Contex-free Languages | | | | | | | | | | | | |--------------------------| | | | | | Context-sensitive | | | | | | Languages | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | Type 0 Languages | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ a. relation of language classes to parts of a compiler ------------------------------------------ PHASES OF A COMPILER Programs allowed by a compiler's: |--------------------------------------| | Lexical Analysis (Lexer) | | | | | | |--------------------------------| | | | Parser | | | | | | | | | | | | |--------------------------| | | | | | Static Analysis | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | Runtime checks | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ Is there a relationship to the previous diagram (about langauge classes)? 3. grammars a. definitions ------------------------------------------ GRAMMARS DESCRIBE LANGUAGES def: a *grammar* consists of a finite set of rules (called "productions") and a start symbol (a nonterminal). Let V = nonterminals + terminals The rules have the form V+ -> V* where there is no symbol in both nonterminals and terminals def: The language generated by a grammar G with set of productions P is: {w | w is in terminals* and S =>* w} where S is the start symbol of G gAd => gBd iff g in V* and d in V* and A -> B is a rule in P and g =>* h iff either h = g or g -> i and i =>* h ------------------------------------------ b. BNF notation ------------------------------------------ BNF NOTATION FOR GRAMMARS ::= means | means is a Example ::= | ::= 0 | 1 ------------------------------------------ c. grammars as games ------------------------------------------ GRAMMARS AS RULES OF GAMES A grammar can be seen as describing two games: - A production game (Can you produce this string?) - A recognition/parsing game (Is this string in the language?) ------------------------------------------ i. production game ------------------------------------------ PRODUCTION GAME Goal: produce a string in the language from the start symbol Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Can we produce "Johnny is good"? ------------------------------------------ Could you win the production game and produce the string "Johnny is good"? Could you produce the sentence "Charlie can be difficult"? ii. recognition or parsing game ------------------------------------------ RECOGNITION OR PARSING GAME Goal: determine if a string is in the language of the grammar Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Is "Johnny is good" in this grammar? ------------------------------------------ Could you win the recognition game on the string "Johnny is good"? What does this have to do with parsing in a compiler? d. derivation trees ------------------------------------------ DERIVATION (OR PARSE) TREES def: a *tree* is a finite set of nodes connected by directed edges that is connected and has no cycles def: a *derivation tree* for grammar G is a tree such that: - Every node has a label that is a symbol of G - The root is labeled by the start symbol of G - Every node with a direct descendent, is labeled by a nonterminal - If the descendents of a node labeled by N have the following labels (in order): A, B, C, ..., K then G has a production of form N -> A B C ... K ------------------------------------------ ------------------------------------------ EXAMPLE DERIVATION TREE Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult String to parse: "Johnny is good" / | \ / | \ v v v Johnny is good ------------------------------------------ Why does the order of nodes matter? e. extensions to BNF (EBNF) ------------------------------------------ EXTENSIONS TO BNF (EBNF) Arbitrary number of repeats: { x } means 0 or more repeats of x :: = { } is equivalent to: ::=

::= |

::= {} is also written as: * or [ ] ... One-or-more repeats: x+ means 1 or more repeats of x :: = + is equivalent to: ::= ::= | + is sometimes written as ... Optional element: [ x ] means 0 or 1 occurences of x ::= [ ] is equivalent to: ::=

::= | ------------------------------------------ What is EBNF notation like that you may have seen before? f. examples ------------------------------------------ READING A BNF GRAMMAR Example rules: ::= ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= 0 | ::= | ------------------------------------------ can you give an example of a ? ------------------------------------------ EBNF GRAMMAR FOR (SUBSET OF) PL/0 ::= . ::=

::= {} ::= const {} ; ::= =

::= ,

::= {} ::= var ; ::= {} ::= ,

::= {} ::= procedure ; ; ::= := | call | begin {} end | if then

| while do | read | write | skip ::= ;

::= | else ::= ::= odd |

------------------------------------------ ------------------------------------------ EXAMPLES IN PL/0 Shortest program: skip. Factorial program: var n, res; # input and result procedure fact; begin read n; res := 1; while (n <> 0) begin res := res * n; n := n-1 end; write res end; call fact. ------------------------------------------ Does the factorial program parse correctly? Does PL/0 use semicolons to end statements or to separate them? How does C use semicolons? II. Lexical Analysis ------------------------------------------ LEXICAL ANALYSIS Lexical means relating to the words of a language ------------------------------------------ A. Goals of Lexical Analysis ------------------------------------------ GOALS OF LEXICAL ANALYSIS - Simplify the parser, so it need not handle: - Recognize the longest match Why? - Handle every possible character of input Why? ------------------------------------------ Why handle all possible characters? Why recognize the longest match? ------------------------------------------ CONFLICT BETWEEN RULES Suppose that both "if" and numbers are tokens: What tokens should "if8" match? Fixing such situations: ------------------------------------------ What tokens should "<=" match? ------------------------------------------ WHICH TOKEN TO RETURN? If the input is "<=", what token(s)? If the input is "<8", what token(s)? If the input is "if", what token(s)? If the input is "//", what token(s)? Summary: ------------------------------------------ How would you program the longest match idea? How would you ensure that reserved words are favored over identifiers? (e.g., "if" is not an identifier) After finding an identifier, check to see if it's a reserved word and if so, then return the reserved word token ... In sum, favor is longest match, but give priority to reserved words over identifiers B. Theory of Regular Languages, Regular Expressions There is a strong connection between theory and practice in syntax analysis, with some of the first and most useful results in CS theory! 1. Overview ------------------------------------------ THE BIG PICTURE tokens - source -- [ Lexer] ------> [ Parser] code | | trees v ... For the Lexer we want to: - specify the tokens using regular expressions (REs) - convert REs to DFAs to execute them but easy conversions are: - REs to NFAs - NFAs to DFAs ------------------------------------------ Explain what an the abbreviations mean: - NFA = nondeterministic finite state automaton - DFA = deterministic finite state automaton 2. Definitions ------------------------------------------ DEFINITIONS The lexical grammar of a language is regular, because def: a grammar is *regular* iff all its productions have one of these forms: ::= c or ::= c where c is a terminal symbol, and and are nonterminals def: a language is *regular* iff it can be defined using a regular grammar. Thm: Every regular language can be recognized by a finite automaton. Thm: Every regular language can be specified by a regular expression. ------------------------------------------ ... regular grammars can be parsed/recognized quickly using finite automata Regular languages are also called "type 3" languages (a name from the Chomsky hierarchy) A regular language is also called a *regular set*. Q: Does it follow that every regular expression can be recognized using a finite automaton? 3. Regular Expressions ------------------------------------------ REGULAR EXPRESSIONS The language of regular expressions: ::= | '|' | | emp | * | ( ) where is a character Examples: RE meaning ==================================== emp the empty string (0|1)*0 even binary numerals b*(abb*)*(a|emp) a's and b's without consecutive a's ------------------------------------------ a. extensions to regular expressions ------------------------------------------ EXTENSIONS TO REGULAR EXPRESSIONS [abcd] means a|b|c|d [h-m] means h|i|j|k|l|m x? means x|emp y+ means y(y*) ------------------------------------------ b. examples of regular expressions ------------------------------------------ FOR YOU TO DO Write a regular expression that describes: 1. The keyword "if" 2. The set of all (positive) decimal numbers 3. The set of all possible identifiers without underbars (_) (If you have time, write these as regular grammars also.) ------------------------------------------ c. Example using regular expressions ------------------------------------------ EXAMPLE REGULAR EXPRESSIONS FOR PL/0 RE Token ========================= <= leqsym < lesssym <> neqsym >= geqsym > gtrsym [0-9][0-9]* numbersym ... ------------------------------------------ What would be the regular expression for identifiers for PL/0? 4. finite automata a. example ------------------------------------------ EXAMPLE State diagram: < = -->[q0] --->[[q1]]---> [[q2]] | | > v [[q3]] ------------------------------------------ What token should be returned at state q2? What token should be returned at state q3? What should be done at state q1? ------------------------------------------ PSEUDO CODE FOR THIS LEXER CASE char c; ------------------------------------------ ------------------------------------------ DEALING WITH COMMENTS Suppose / is used for division // starts a comment to end of line (unlike PL/0!) What state diagram? How does whitespace fit in? ------------------------------------------ How does whitespace fit in? ------------------------------------------ TACTICS FOR IGNORING WHITESPACE, COMMENTS Goal: do not send ignored tokens to parser Can always get a non-ignored token: Return "tokens" that include ignored stuff to a loop that ignores them Giant DFA that goes back to start state on seeing something to ignore ------------------------------------------ b. definitions ------------------------------------------ NONDETERMINISTIC FINITE AUTOMATA def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma is a system (K, Sigma, delta, q0, F) where K is a finite set (of states), Sigma is a finite set set (the input alphabet), delta: is a map of type (K, Sigma) -> Sets(K) q0 in K is the initial state, & F is a subset of K (the final/accepting states). ------------------------------------------ ------------------------------------------ TRANSITION FUNCTION AND ACCEPTANCE p in delta(q,x) means that in state q, on input x the next state can be p p in delta*(q,s) where s in Sigma* is defined by: delta*(q,emp) = {q} (i.e., q in delta*(q,emp)) p in delta*(q,xa) = delta*(q2, a) where x in Sigma, a in Sigma*, q2 in delta(q,x) Lemma: for all c in Sigma, p in delta*(q, c) = delta(q, c) def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff there is some q in delta*(q0,s) such that q in F ------------------------------------------ If each final state means to return a token, what token should be returned if there are many final states? c. example ------------------------------------------ EXAMPLE NFA 0,1 0,1 /---\ /---\ \ / \ / | v 0 0 | v -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]] | | 1 v [ q1 ] | | 1 v [[ q2 ]] | ^ / \ \---/ 0,1 K = {q0,q1,q2,q3,q4} Sigma = {0,1} q0 is start state F = {q2,q4} delta(q0,0) = {q0,q3} delta(q1,0) = {} delta(q2,0) = {q2} delta(q3,0) = {q4} delta(q4,0) = {q4} delta(q0,1) = {q0,q1} delta(q1,1) = {q2} delta(q2,1) = {q2} delta(q3,1) = {} delta(q4,1) = {q4} ------------------------------------------ ------------------------------------------ EXTENDING FUNCTIONS TO SETS OF STATES Notation extending delta and delta* to sets of states d(Q,x) = union of d(q,x) for all q in Q so d({},x) = {} d{{q},x) = d(q,x) d({q1,q2},x) = d(q1,x)+d(q2,x) d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x) etc. Note that delta*({q}, x) = delta*(q, x) = delta(q, x) ------------------------------------------ ------------------------------------------ EXAMPLE delta*(q0,010011) = delta*(delta(q0,0),10011) = delta*({q0,q3},10011) = delta*(q0,10011) + delta*(q3,10011) = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011) = delta*({q0,q1},0011) + delta*({},0011) = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {} = delta*({q0,q3},011) + delta*({},011) + delta*({q2},011) = delta*({q0,q3},011) + {} + delta*({q2},011) = delta*({q0,q3},011) + delta*({q2},011) = delta*(delta(q0,0),11) + delta(q3,0),11) + delta*(delta(q2,0),11) = delta*({q0,q3},11) + delta*({q4},11) + delta*({q2},11) = delta*({q0,q3,q4},1) + delta*({q4},1) + delta*({q2},1) = delta({q0,q1,q4},1) + delta(q4,1) + delta(q2,1) = {q0,q1,q2,q4} + {q4} + {q2} = {q0,q1,q2,q4} ------------------------------------------ So does this machine accept 010011? What strings does this machine accept? d. deterministic finite automata ------------------------------------------ DETERMINISTIC FINITE AUTOMATA def: a *deterministic finite automaton* (DFA) is an NFA in which delta(q,c) is a singleton or empty for all q in K and c in Sigma. ------------------------------------------ ------------------------------------------ IMPLEMENTING DFAS How would you represent states? How would you implement a DFA? ------------------------------------------ 5. converting NFA to DFA ------------------------------------------ PROBLEM We want to specify lexical grammar using So we need to convert regular expressions into DFAs ------------------------------------------ Why DFAs? a. Converting Regular Expressions to NFAs ------------------------------------------ CONVERTING REs TO NFAs Definition based on grammar of Regular Expressions: Result of Convert(M) looks like this: -->(M q) where the "tail", -->, goes to the start state and q is the "head state" assume also Convert(N) is --->(N q') c Convert(c) = --->[ q ] Convert(M | N) = /--->(M q)--\ emp / \ ---->[ q ] -> [ q2 ] \ / \--->(N q')-/ Convert(M N) = --> (M q)-->(N q') emp Convert(emp) = -----> [ q ] emp /--------------\ / emp v Convert(M*) = -/ /->(M q)---->[ q2 ] / / \ emp / \-----------/ Convert((M)) = Convert(M) After conversion, make the "head state" be a final state ------------------------------------------ ------------------------------------------ EXAMPLE OF CONVERSION TO NFA Regular expression: (i|j)* i Convert(i) = --->[ qi ] j Convert(j) = --->[ qj ] Convert(i|j) = i /--->(qi)---\ emp emp / \ ---->[ q ] -> [ q2 ] \ j / \--->(qj)---/ emp Convert((i|j)*) = ------------------------------------------ In programming terms, do we need to use an NFA to recognize "if" and other reserved words? b. Converting NFAs to DFAs ------------------------------------------ CONVERTING AN NFA INTO A DFA Idea: Convert each reachable set of NFA states into How? Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in T} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) ------------------------------------------ Why do we need to take the closure wrt emp? ------------------------------------------ EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [3,6,7,8] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[5,6,8] 0-9 v /-| \----->[6,7,8] |a-z ^ |0-9 \ | \-| ------------------------------------------ Which states are final in the DFA? Which tokens would be returned in each state? III. Context-free Parsing A. Why context-free grammars and parsing? ------------------------------------------ WHY CONTEXT-FREE PARSING Can we define a regular expression to make sure expressions have balanced parentheses? e.g., recognize: (34) and ((12)(789)) but not: (567))82)))) ------------------------------------------ ------------------------------------------ WHAT IS NEEDED Needed for checking balanced parentheses: ------------------------------------------ B. Parsing with Context-free Grammars ------------------------------------------ PARSING Want to recognize language with Goal: ------------------------------------------ 1. context-free grammar ------------------------------------------ CONTEXT-FREE GRAMMARS def: a *context-free* grammar (CFG) (N, T, P, S) has start symbol S in N and each production (in P) has the form: -> g where is a Nonterminal symbol, and g in (N+T)* Example: def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*, written g =P=> g', iff g is e f, g' is e h f, is a nonterminal (in N), h in (N+T)*, and the rule -> h is in P Example: ( ) =P=> ------------------------------------------ 2. derivation ------------------------------------------ DERIVATION def: a *derivation* of a terminal string t from the rules P of a CFG (N,T,P,S), is a sequence (S, g1, g2, ..., gm), where gm = t, and for all i: gi in (N+T)*, and for all 0 <= j < m: gj =P=> g(j+1) Example: ------------------------------------------ ------------------------------------------ LEFTMOST DERIVATION def: a *leftmost derivation* of a string t in T* from a CFG (N,T,P,S) is a derivation of t from (N,T,P,S) (S, g1, ..., gm) such that gm = t and for all 0 <= j < m: when gj =P=> g(j+1) and the nonterminal is replaced in gj, then there are no nonterminals to the left of in gj. Example: ------------------------------------------ 3. parse trees ------------------------------------------ PARSE TREES AND DERIVATIONS def: A *parse tree*, Tr, for a CFG (N,T,P,S) represents a derivation, D, iff: - Each node in Tr is labeled by a nonterminal (in N) - the root of Tr is the start symbol S - an arc from to h in (N+T) iff -> ... h ... in P - the order of children of a node labeled is the order in a production -> ... in P ------------------------------------------ ------------------------------------------ EXAMPLES GRAMMAR: ::= | + | * Derivations of 3*4+2 ------------------------------------------ ------------------------------------------ EXAMPLE PARSE TREES Corresponding to leftmost derivation: Corresponding to rightmost derivation: ------------------------------------------ What does each parse tree mean? 4. ambiguity ------------------------------------------ AMBIGUITY def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T* such that there are two different parse tress for t ------------------------------------------ What's an example of an ambiguous grammar? Why do we want to avoid ambiguous grammars? Is ambiguity a property of a language or its grammar? a. Fixing ambiguous grammars ------------------------------------------ FIXING AMBIGUOUS GRAMMARS Idea: Rewrite grammar to Example: ------------------------------------------ 5. recursive-descent parsing ------------------------------------------ RECURSIVE DESCENT PARSING ALGORITHM For each production rule, of form: ::= g1 | g2 | ... | gm 1. Write a 2. This function ------------------------------------------ a. example ------------------------------------------ EXAMPLE RECURSIVE-DESCENT PARSER ::= if then else | begin S end | write ::= ; | ::= ::= = token tok = lexer_next(); void advance() { tok = lexer_next(); } void eat(token_type tt) { if (tok.typ == tt) { advance(); } else { /* ... report error */ } } void parseCond() { void parseStmt() { ------------------------------------------ What kind of errors can eat() report? What kind of error message can we produce? How would we write parseList()? ------------------------------------------ PARSELIST // ::= ; | void parseList() { ------------------------------------------ b. terminology: LL(1) ------------------------------------------ LL(1) GRAMMARS A recursive-descent parser must: - choose between alternatives (e.g., ::= | ) def: A grammar is *LL(1)* iff ------------------------------------------ c. problems that can occur in recursive-descent parsing i. left-recursion ------------------------------------------ PROBLEM: LEFT RECURSION Consider: ::= + | Parsing function for : void parseExp() { parseExp(); eat(plussym); eat(numbersym); } Is that a problem? ------------------------------------------ ------------------------------------------ ELIMINATING LEFT RECURSION Technique: change to right recursion, using a new nonterminal E.g., change ::= + | to ------------------------------------------ How do you know you haven't changed the language? ii. left factoring ------------------------------------------ PROBLEM: AMBIGUOUS START OF RULES Example: ::= if then else | if then If we see an if token, which one to parse? SOLUTION: LEFT FACTORING Tactic: - Put common start in 1 nonterminal - Put differences in another nonterminal Example: Why is this better? ------------------------------------------ iii. ambiguity problems ------------------------------------------ PROBLEM: AMBIGUITY Consider: ::= := | if then ::= | else and the statement: if b1 then if b2 then x := 2 else x := 3 Is this parsed as: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 or as: / / \----|--------\ if then | |\ / \ b1 /| \ else / | \ / | \ | | | ... | | | x := 3 | | | if then | /|\ \ b2 ... x := 2 ------------------------------------------ ------------------------------------------ FIXES FOR AMBIGUITY Change the language: a. Always have an else clause: ::= if then else (use skip if don't want to do anything) b. Use an end marker ::= if then else fi | if then fi Give precedence to one production ::= if then ::= else // priority! | So we only get the parse tree: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 ------------------------------------------ d. case study: PL/0 expressions ------------------------------------------ CASE STUDY: PARSING PL/0 EXPRESSIONS Grammar for in PL/0: ::= { } ::= ::= | ::= { } ::= ::= |
::= | | | ( ) ::= | Examples in the language of : +3402 -3402 + 4020 3*4+2 3*4+(0+(1+1)) ------------------------------------------ Why isn't the grammar like ::= ? How is 3*4+2 parsed? ------------------------------------------ PARSE TREE WITH THIS GRAMMAR 3 * 4 + 2 ------------------------------------------ Could we have made a different parse tree? What are the tokens here? How do we start writing this? ------------------------------------------ CODE FOR PARSING EXPRESSIONS // assume advance() and eat(), as before void parseExpr() { void parseSign() { ------------------------------------------