I. Overview A. theory 1. languages ------------------------------------------ LANGUAGES def: A *language* is ------------------------------------------ For a written natural language, what are the characters? For a spoken natural language, what are the characters? 2. hierarchy of language classes ------------------------------------------ LANGUAGE CLASSES Languages can be classified by Venn Diagram: |--------------------------------------| | Strings recognized by | | the Regular Grammar | | | | |--------------------------------| | | | Strings recognized by | | | | the Contex-free Grammar | | | | | | | | |--------------------------| | | | | | Strings recognized by | | | | | | the Context-sensitive | | | | | | Grammar | | | | | | (static checking) | | | | | | | | | | | | |--------------------| | | | | | | | Strings recognized | | | | | | | | by the | | | | | | | | Type 0 Grammar | | | | | | | | (dynamic checks)| | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ a. relation of language classes to parts of a compiler ------------------------------------------ PHASES OF A COMPILER Programs allowed by a compiler's: |--------------------------------------| | Lexical Analysis (Lexer) | | | | |--------------------------------| | | | Parser | | | | | | | | |--------------------------| | | | | | Static Analysis | | | | | | | | | | | | |--------------------| | | | | | | | Runtime checks | | | | | | | | | | | | | | | | | | | | | | | |--------------------| | | | | | | | | | | | |--------------------------| | | | | | | | |--------------------------------| | | | |--------------------------------------| ------------------------------------------ Is there a relationship to the previous diagram (about langauge classes)? 3. automation of syntax analysis ------------------------------------------ PROBLEM WRITING SYNTAX ANALYSIS Naive way to write a compiler, etc. Language Def.docx -- coding --> parser.c (v. 1) Def2.docx -- coding --> parser.c (v. 2) -- coding --> parser.c (v. 3) Def3.docx -- coding --> parser.c (v. 4) Def4.docx -- coding --> parser.c (v. 5) ... ... DefN.docx -- coding --> parser.c (v. N) Disadvantages: ------------------------------------------ ------------------------------------------ COMPUTER SCIENCE SOLUTION: AUTOMATION high-level description tool generated code lang_lexer.l -- flex--> lang_lexer.c + lang_lexer.h lang.y -- bison --> lang.tab.c + lang.tab.h lang2.y -- bison --> lang.tab.c + lang.tab.h ... langN.y -- bison --> lang.tab.c + lang.tab.h Advantages: ------------------------------------------ 4. grammars a. definitions ------------------------------------------ GRAMMARS DESCRIBE LANGUAGES Grammars are high-level descriptions of languages/parsers def: A *terminal* is a character (or string of characters) found in the language E.g., def: A *nonterminal* is a meta-notation that can be replaced with other terminals or nonterminals. def: a *grammar* consists of a finite set of rules (called "productions") and a start symbol (a nonterminal). Let V = nonterminals + terminals The rules have the form V+ -> V* where there is no symbol in both nonterminals and terminals def: The language generated by a grammar G with set of productions P is: {w | w is in terminals* and S =>* w} where S is the start symbol of G gAd => gBd iff g in V* and d in V* and A -> B is a rule in P and g =>* h iff either h = g or g -> i and i =>* h ------------------------------------------ b. BNF notation ------------------------------------------ BNF NOTATION FOR GRAMMARS ::= means | means is a Example ::= | ::= 0 | 1 ------------------------------------------ c. grammars as games ------------------------------------------ GRAMMARS AS RULES OF GAMES A grammar can be seen as describing two games: - A production game (Can you produce this string?) - A recognition/parsing game (Is this string in the language?) ------------------------------------------ i. production game ------------------------------------------ PRODUCTION GAME Goal: produce a string in the language from the start symbol Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Can we produce "Johnny is good"? ------------------------------------------ Could you win the production game and produce the string "Johnny is good"? Could you produce the sentence "Charlie can be difficult"? ii. recognition or parsing game ------------------------------------------ RECOGNITION OR PARSING GAME Goal: determine if a string is in the language of the grammar Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Is "Johnny is good" in this grammar? ------------------------------------------ Could you win the recognition game on the string "Johnny is good"? What does this have to do with parsing in a compiler? d. derivation trees ------------------------------------------ DERIVATION (OR PARSE) TREES def: a *tree* is a finite set of nodes connected by directed edges that is connected and has no cycles def: a *derivation tree* for grammar G is a tree such that: - Every node has a label that is a symbol of G - The root is labeled by the start symbol of G - Every node with a direct descendent, is labeled by a nonterminal - If the descendents of a node labeled by N have the following labels (in order): A, B, C, ..., K then G has a production of form N -> A B C ... K ------------------------------------------ ------------------------------------------ EXAMPLE DERIVATION TREE Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult String to parse: "Johnny is good" / | \ / | \ v v v Johnny is good ------------------------------------------ Why does the order of nodes matter? e. extensions to BNF (EBNF) ------------------------------------------ EXTENSIONS TO BNF (EBNF) Arbitrary number of repeats: { x } means 0 or more repeats of x :: = { } is equivalent to: ::= ::= | ::= {} is also written as: * or [ ] ... One-or-more repeats: x+ means 1 or more repeats of x :: = + is equivalent to: ::= ::= | + is sometimes written as ... or [ ] ... Optional element: [ x ] means 0 or 1 occurences of x ::= [ ] is equivalent to: ::= ::= | ------------------------------------------ What is EBNF notation like that you may have seen before? f. examples ------------------------------------------ READING A BNF GRAMMAR Example rules: ::= ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= 0 | ::= | ------------------------------------------ can you give an example of a ? ------------------------------------------ EBNF GRAMMAR FOR (SUBSET OF) SPL ::= . ::= begin end ::= {} ::= const ; ::= | , ::= = ::= {} ::= var ; ::= {} ::= , ::= {} ::= proc ; ::= | ::= ::= {} ::= ; ::= := | call | if then else end | if then end | while do end | read | print | block ::= ------------------------------------------ ------------------------------------------ EXAMPLES IN SPL Shortest program: begin end. Factorial program: begin var n, res; # input and result proc fact begin read n; res := 1; while n != 0 do res := res * n; n := n-1 end; print res end; call fact end. ------------------------------------------ Does the factorial program parse correctly? Does SPL use semicolons to end statements or to separate them? How does C use semicolons? II. Lexical Analysis ------------------------------------------ MOTIVATION # $Id$\n .text start\nstart:\tADDI ... Want to: Approach: ------------------------------------------ ------------------------------------------ LEXICAL ANALYSIS Lexical means relating to the words of a language ------------------------------------------ A. Goals of Lexical Analysis ------------------------------------------ GOALS OF LEXICAL ANALYSIS - Simplify the parser, so it need not handle: - Recognize the longest match Why? - Handle every possible character of input Why? ------------------------------------------ Why recognize the longest match? Why handle all possible characters? ------------------------------------------ CONFLICT BETWEEN RULES Suppose that both "if" and numbers are tokens: What tokens should "if8" match? Fixing such situations: ------------------------------------------ What tokens should "<=" match? ------------------------------------------ WHICH TOKEN TO RETURN? If the input is "<=", what token(s)? If the input is "==", what token? If the input is "<8", what token(s)? If the input is "if", what token(s)? If the input is "//", what token(s)? Summary: ------------------------------------------ How would you program the longest match idea? How would you ensure that reserved words are favored over identifiers? (e.g., "if" is not an identifier) After finding an identifier characters, check to see if it's a reserved word, and if so, then return the reserved word token ... In sum, favor is longest match, but give priority to reserved words over identifiers 1. Overview ------------------------------------------ THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / abstract / syntax v trees [ static analysis ] / / v [ code generator ] For the Lexer we want to: - specify the tokens using regular expressions (REs) - convert REs to DFAs to execute them but easy conversions are: - REs to NFAs - NFAs to DFAs ------------------------------------------ Explain what an the abbreviations mean: - NFA = nondeterministic finite state automaton - DFA = deterministic finite state automaton ------------------------------------------ HOW PARSER WORKS WITH LEXER Couroutine structure: Parser calls lexer: yychar = yylex(); // call lexer /* ... use yychar ... */ Lexer function remembers pointer to input stream returns next token (int code) Parser works... Parser calls lexer again: yychar = yylex(); // call lexer /* ... use yychar ... */ ------------------------------------------ It's a coroutine, because the lexer picks up where it left off when called again a. Use of Automated Tools ------------------------------------------ BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) ^ | bison | -----> spl.tab.c | / ^ yyparse() | / bison | | spl.y | | \------> spl.tab.h | \ tokens | flex v defs. spl_lexer.l file --------> spl_lexer.c yylex() spl.y file -----> spl.tab.h \ tokens flex v defs. spl_lexer.l file -----> spl_lexer.c yylex() function ------------------------------------------ ... explain all of this: The context-free grammar (the .y file) is central to this, The .y file records: - grammar and - names/types of the tokens (that the parser needs, and the lexer produces) Bison is a parser generator it generates the function yyparse() in files spl.tab.h and spl.tab.c The ASTs are structs that record the parse Flex is a lexical analyzer generator the spl_lexer.l file records - the lexical grammar (using REs) - how tokens produce ASTs (in yylval) ------------------------------------------ FOR HOMEWORK 2 (LEXICAL ANALYSIS) ast.h (AST types) ^ | | | | | spl.tab.h | \ tokens | flex v defs. spl_lexer.l file --------> spl_lexer.c yylex() ------------------------------------------ The lexer needs to record tokens as token ASTs, for later use by the parser, but otherwise the ASTs are not important for homework 2 B. Using Flex 1. Files and coordination with parser ------------------------------------------ USING THE FLEX TOOL TO GENERATE LEXERS Example: SSM assembler (in hw1 zip file) High-level description in file asm_lexer.l === flex ===> generates: asm_lexer.c + asm_lexer.h Wrapper for lexer: lexer.h declares functions lexer.c errors_noted variable + some utility functions asm_lexer.l defines functions e.g., lexer_print_token lexer_output (ASTs defined in ast.h) asm.y is Bison description file grammar == bison ==> - Declarations in asm.tab.h #includes ast.h machine_types.h parser_types.h (declares YYSTYPE) lexer.h declares yytokentype eolsym = ... minussym = ... dottextsym = ... ... - Definitions in asm.tab.c defines yyparser() YYSTYPE yylval; ------------------------------------------ There are options in flex for naming the .h and .c files generated. The type YYSTYPE is the type of the ASTs Since the connection between flex and bison is tricky, for homework 2, we give you the spl.tab.h file, from that file you will need the token types (yytokentype) and a minimal spl.tab.c that defines yylval. 2. structure of flex input ------------------------------------------ STRUCTURE OF FLEX INPUT FILE /* ... definitions section ... */ ARRGH r EEEE eeee RE {ARRGH}{EEEE} %% /* ... rules section ... */ re { tok2ast(resym); return resym; } {RE} { tok2ast(re4sym); return re4sym; } %% /* ... user subroutines ... */ ------------------------------------------ Explain the difference between re and {ARRGH}{EEEE} The regular expression - must start in column 1 (otherwise it's taken as C code) - if it uses defined ------------------------------------------ SECTIONS IN FLEX INPUT (.l file) Definitions section: Rules section: User subroutine section: ------------------------------------------ ... options, (C) #includes, (C) declarations of names used in rules (flex) definitions of named REs, (flex) declarations of start conditions (states) ... pairs of REs and actions (code) ... definitions of functions used Go over code in the assembler's lexer_main.c show output lexer.h, lexer.c asm_lexer.l and generated code asm_lexer.h asm_lexer.c COP 3402 meeting -*- Outline -*- III. Theory of Regular Languages and Regular Expressions Based on Chapter 3 of the book "Formal Languages and their Relation to Automata" by Hopcroft and Ullman (Addison-Wesley, 1969). There is a strong connection between theory and practice in syntax analysis, with some of the first and most useful results in CS theory! A. Definitions ------------------------------------------ DEFINITIONS The lexical grammar of a language is regular, because def: a grammar is *regular* iff all its productions have one of these forms: ::= c or ::= c where c is a terminal symbol, and and are nonterminals def: a language is *regular* iff it can be defined using a regular grammar. Thm: Every regular language can be recognized by a finite automaton. Thm: Every regular language can be specified by a regular expression. ------------------------------------------ ... regular grammars can be parsed/recognized quickly (using finite automata) Regular languages are also called "type 3" languages (a name from the Chomsky hierarchy) A regular language is also called a *regular set*. Q: Does it follow that every regular expression can be recognized using a finite automaton? 1. Regular Expressions ------------------------------------------ REGULAR EXPRESSIONS The language of regular expressions: ::= | '|' | | emp | * | ( ) where is a character Examples: RE meaning ==================================== emp the empty string 2* zero or more 2's (0|1)*0 even binary numerals b*(abb*)*(a|emp) a's and b's without consecutive a's ------------------------------------------ 2. extensions to regular expressions ------------------------------------------ TYPICAL EXTENSIONS TO REGULAR EXPRESSIONS [abcd] means a|b|c|d [h-m] means h|i|j|k|l|m x? means x|emp y+ means y(y*) ------------------------------------------ 3. examples of regular expressions ------------------------------------------ FOR YOU TO DO Write a regular expression that describes: 1. The keyword "if" 2. The set of all (positive) decimal numbers 3. The set of all possible identifiers without underbars (If you have time, write these as regular grammars also.) ------------------------------------------ What would be the regular expression for identifiers? 4. Examples using regular expressions ------------------------------------------ EXAMPLE REGULAR EXPRESSION DEFINITIONS PATTERN RE ========================= DIGIT [0-9] LETTER [a-zA-Z] ... ------------------------------------------ What would be the regular expression for identifiers? B. finite automata (omit) 1. example ------------------------------------------ EXAMPLE State diagram: < = -->[q0] --->[[q1]]---> [[q2]] ------------------------------------------ What token should be returned at state q2? What should be done at state q1? ------------------------------------------ PSEUDO CODE FOR THIS LEXER CASE char c; ------------------------------------------ ------------------------------------------ DEALING WITH COMMENTS Suppose / is used for division // starts a comment to end of line (unlike SPL!) What state diagram? How does whitespace fit in? ------------------------------------------ How does whitespace fit in? ------------------------------------------ TACTICS FOR IGNORING WHITESPACE, COMMENTS Goal: do not send ignored tokens to parser Can always get a non-ignored token: Return "tokens" that include ignored stuff to a loop that ignores them Giant DFA that goes back to start state on seeing something to ignore ------------------------------------------ 2. definitions ------------------------------------------ NONDETERMINISTIC FINITE AUTOMATA def: A *nondeterministic finite automaton* (NFA) over an alphabet Sigma is a system (K, Sigma, delta, q0, F) where K is a finite set (of states), Sigma is a finite set (the input alphabet), delta: is a map of type (K, Sigma) -> Sets(K) q0 in K is the initial state, F is a subset of K (the final/accepting states). ------------------------------------------ ------------------------------------------ TRANSITION FUNCTION AND ACCEPTANCE p in delta(q,x) means that in state q, on input x the next state can be p p in delta*(q,s) where s in Sigma* is defined by: delta*(q,emp) = {q} (i.e., q in delta*(q,emp)) p in delta*(q,xa) = delta*(q2, a) where x in Sigma, a in Sigma*, q2 in delta(q,x) Lemma: for all c in Sigma, p in delta*(q, c) = delta(q, c) def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma* iff there is some q in delta*(q0,s) such that q in F ------------------------------------------ If each final state means to return a token, what token should be returned if there are many final states? 3. example ------------------------------------------ EXAMPLE NFA 0,1 0,1 /---\ /---\ \ / \ / | v 0 0 | v -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]] | | 1 v [ q1 ] | | 1 v [[ q2 ]] | ^ / \ \---/ 0,1 K = {q0,q1,q2,q3,q4} Sigma = {0,1} q0 is start state F = {q2,q4} delta(q0,0) = {q0,q3} delta(q1,0) = {} delta(q2,0) = {q2} delta(q3,0) = {q4} delta(q4,0) = {q4} delta(q0,1) = {q0,q1} delta(q1,1) = {q2} delta(q2,1) = {q2} delta(q3,1) = {} delta(q4,1) = {q4} ------------------------------------------ ------------------------------------------ EXTENDING FUNCTIONS TO SETS OF STATES Notation extending delta and delta* to sets of states d(Q,x) = union of d(q,x) for all q in Q so d({},x) = {} d({q},x) = d(q,x) d({q1,q2},x) = d(q1,x)+d(q2,x) d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x) etc. Note that delta*({q}, x) = delta*(q, x) = delta(q, x) ------------------------------------------ ------------------------------------------ EXAMPLE delta*(q0,010011) = delta*(delta(q0,0),10011) = delta*({q0,q3},10011) = delta*(q0,10011) + delta*(q3,10011) = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011) = delta*({q0,q1},0011) + delta*({},0011) = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {} = delta*({q0,q3},011) + delta*({},011) + delta*({q2},011) = delta*({q0,q3},011) + {} + delta*({q2},011) = delta*({q0,q3},011) + delta*({q2},011) = delta*(delta(q0,0),11) + delta(q3,0),11) + delta*(delta(q2,0),11) = delta*({q0,q3},11) + delta*({q4},11) + delta*({q2},11) = delta*({q0,q3,q4},1) + delta*({q4},1) + delta*({q2},1) = delta({q0,q1,q4},1) + delta(q4,1) + delta(q2,1) = {q0,q1,q2,q4} + {q4} + {q2} = {q0,q1,q2,q4} ------------------------------------------ So does this machine accept 010011? What strings does this machine accept? 4. deterministic finite automata ------------------------------------------ DETERMINISTIC FINITE AUTOMATA def: a *deterministic finite automaton* (DFA) is an NFA in which delta(q,c) is a singleton or empty for all q in K and c in Sigma. ------------------------------------------ ------------------------------------------ IMPLEMENTING DFAS How would you represent states? How would you implement a DFA? ------------------------------------------ 5. converting NFA to DFA ------------------------------------------ PROBLEM We want to specify lexical grammar using So we need to convert regular expressions into DFAs ------------------------------------------ Why DFAs? a. Converting Regular Expressions to NFAs ------------------------------------------ CONVERTING REs TO NFAs Definition based on grammar of Regular Expressions: Result of Convert(M) looks like this: -->(M q) where the "tail", -->, goes to the start state and q is the "head state" and M is converted recursively assume also Convert(N) is --->(N q') c Convert(c) = --->[ q ] Convert(M | N) = /--->(M q)--\ emp / \ ---->[ q ] -> [ q2 ] \ / \--->(N q')-/ Convert(M N) = --> (M q) --> (N q') emp Convert(emp) = -----> [ q ] emp /--------------\ / emp v Convert(M*) = -/ /->(M q)---->[ q2 ] / / \ emp / \-----------/ Convert((M)) = Convert(M) After conversion, make the "head state" be a final state ------------------------------------------ ------------------------------------------ EXAMPLE OF CONVERSION TO NFA Regular expression: (i|j)* i Convert(i) = --->[ qi ] j Convert(j) = --->[ qj ] Convert(i|j) = i /--->(qi)---\ emp emp / \ ---->[ q ] -> [ q2 ] \ j / \--->(qj)---/ emp Convert((i|j)*) = ------------------------------------------ In programming terms, do we need to use an NFA to recognize "if" and other reserved words? b. Converting NFAs to DFAs ------------------------------------------ CONVERTING AN NFA INTO A DFA Idea: Convert each reachable set of NFA states into How? Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in S} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) ------------------------------------------ Why do we need to take the closure wrt emp? ------------------------------------------ EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [[3,6,7,8]] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[[5,6,8]]0-9 v /-| \----->[[6,7,8]]|a-z ^ |0-9 \ | \-| ------------------------------------------ Which states are final in the DFA? Which tokens would be returned in each state? IV. Context-free Parsing A. Why context-free grammars and parsing? ------------------------------------------ WHY CONTEXT-FREE PARSING? Can we define a regular expression to make sure expressions have balanced parentheses? e.g., recognize: (34) and ((12)+(789)) but not: (567))+82)))) ------------------------------------------ ------------------------------------------ WHAT IS NEEDED Needed for checking balanced parentheses: ------------------------------------------ How does this grammar ::= | ( ) | + ensure that parentheses are balanced? B. Parsing with Context-free Grammars ------------------------------------------ PARSING Want to recognize language with Goal: ------------------------------------------ 1. context-free grammar ------------------------------------------ CONTEXT-FREE GRAMMARS def: a *context-free* grammar (CFG) (N, T, P, S) has a set (of nonterminals) N, a set (of terminals) T, a start symbol S in N and each production (in P) has the form: -> g where is a Nonterminal symbol, and g in (N+T)* Example: def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*, written g =P=> g', iff g is e f, e and f in (N+T)*, g' is e h f, is a nonterminal (in N), h in (N+T)*, and the rule -> h is in P Example: ( ) =P=> ------------------------------------------ 2. derivation ------------------------------------------ DERIVATION def: a *derivation* of gm in T* from the rules P of G=(N,T,P,S), is a sequence (S, g1, g2, ..., gm), where for all i: gi in (N+T)*, S =P=> g1, and for all 1 <= j < m: gj =P=> g(j+1) Example: ------------------------------------------ ------------------------------------------ LEFTMOST DERIVATION def: a *leftmost derivation* of gm in T* from a CFG (N,T,P,S) is a derivation of t from (N,T,P,S) (S, g1, ..., gm) such that for all 0 <= j < m: when gj =P=> g(j+1) and the nonterminal is replaced in gj, then there are no nonterminals to the left of in gj. Example: ------------------------------------------ Would a rightmost derivation be? 3. parse trees ------------------------------------------ PARSE TREES AND DERIVATIONS def: A *parse tree*, Tr, for a CFG (N,T,P,S) represents a derivation, D, iff: - Each node in Tr is labeled by a nonterminal (in N) - the root of Tr is the start symbol S - an arc from to h in (N+T) iff -> ... h ... in P - the order of children of a node labeled is the order in a production -> ... in P ------------------------------------------ ------------------------------------------ EXAMPLES GRAMMAR: ::= | + | * Derivations of 3*4+2 ------------------------------------------ ------------------------------------------ EXAMPLE PARSE TREES Corresponding to leftmost derivation: Corresponding to rightmost derivation: ------------------------------------------ What does each parse tree mean? 4. ambiguity ------------------------------------------ AMBIGUITY def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T* such that there are two different parse trees for t ------------------------------------------ What's an example of an ambiguous grammar? Why do we want to avoid ambiguous grammars? Is ambiguity a property of a language or its grammar? a. Fixing ambiguous grammars ------------------------------------------ FIXING AMBIGUOUS GRAMMARS Idea: Rewrite grammar to Example: ------------------------------------------ Why is the tree / | \ * / / | \ + | | | 3 | | 4 2 not a parse tree for this grammar? C. Parsing techniques 1. recursive-descent parsing (LL parsing) ------------------------------------------ RECURSIVE DESCENT PARSING ALGORITHM For each production rule, of form: ::= g1 | g2 | ... | gm 1. Write a 2. This function ------------------------------------------ a. example ------------------------------------------ EXAMPLE RECURSIVE-DESCENT RECOGNIZER ::= if then else end | begin end | print ::= | ::= ; ::= ::= == #include "spl.tab.h" // the current token yytoken_kind_t tok; void parser_initialize() { tok = yylex(); } void advance() { tok = yylex(); } void eat(yytoken_kind_t expected) { if (tok == expected) { advance(); } else { /* ... report error */ } } void parseCond() { eat(numbersym); eat(eqeqsym); eat(numbersym); } void parseStmts() { while (tok.toknum == ifsym || tok.toknum == beginsym || tok.toknum == printsym || tok.toknum == semisym) { if (tok.toknum == semisym) { eat(semisym); parseStmt(); } } } void parseStmt() { switch (tok.toknum) { case ifsym: eat(ifsym); parseCond(); eat(thensym); parseStmts(); eat(elsesym); parsetStmts(); eat(endsym); break; case beginsym: eat(beginsym); parseStmts(); eat(endsym); break; case printsym: eat(printsym); eat(numbersym); break; default: // report error } } ------------------------------------------ What kind of errors can eat() report? What kind of error message can the parser produce (in the default case)? How do errors get reported by parseExp()? b. terminology: LL(1) ------------------------------------------ LL(1) GRAMMARS A recursive-descent parser must: - choose between alternatives (e.g., ::= | ) def: A grammar is *LL(1)* iff ------------------------------------------ c. terminology: LR(1) ------------------------------------------ LR(1) GRAMMARS An LR(1) parser needs to decide when to: shift (push token on stack) or reduce uses a DFA based on stack + lookhead def: A grammar is *LR(1)* iff ------------------------------------------ i. LALR(1) parsers ------------------------------------------ LALR(1) Parsing Smaller tables than LR(1) - merges states of the DFA if they only differ in lookahead ------------------------------------------ d. ambiguity problems ------------------------------------------ PROBLEM: AMBIGUITY Consider: ::= := | if then ::= | else and the input statement: if b1 then if b2 then x := 2 else x := 3 Is this parsed as: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 or as: / / \-----|--------\ if then | |\ / \ b1 /| \ else / | \ / | \ | | | ... | | | x := 3 | | | if then | /|\ \ b2 ... x := 2 ------------------------------------------ ------------------------------------------ FIXES FOR AMBIGUITY Change the language: a. Always have an else clause: ::= if then else (use skip if don't want to do anything) b. Use an end marker ::= if then else fi | if then fi Give precedence to one production: ::= if then ::= else // priority! | So we only get the parse tree: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 ------------------------------------------ V. Using the bison parser generator ------------------------------------------ BISON: A LALR(1) PARSER GENERATOR Input: spl.y | | bison spl.y v Output: spl.tab.c and spl.tab.h yyparse{def.} token decls. parse tables extern decls. yylval yyparse(), (+ user code) yylval ------------------------------------------ A. big picture ------------------------------------------ THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / ASTs / | v symbol <---- [ static table ----> analysis ] / / v [ code generator ] ------------------------------------------ B. using bison and flex ------------------------------------------ BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) | bison v -----> spl.tab.c / ^ yyparse (def.) / bison | spl.y ----------> spl.tab.h | token enum. | (decl.) flex v spl_lexer.l-----> spl_lexer.c yylex (def.) ------------------------------------------ 1. file connections for hw3 ------------------------------------------ HOW IT ALL FITS TOGETHER IN HW3 // in file machine_types.h: ====== // ... typedef unsigned int address_type; typedef unsigned char byte_type; typedef int word_type; // in file file_location.h: ====== // location in a source file typedef struct { const char *filename; unsigned int line; // of first token } file_location; // in file ast.h: ================ // ... #include "machine_types.h" #include "file_location.h" // types of ASTs (type tags) typedef enum { block_ast, /* ... */ token_ast } AST_type; // typedefs for types N_t, follow, // where N is a nonterminal typedef struct { file_location *file_loc; AST_type type_tag; void *next; // for lists } generic_t; typedef struct ident_s { file_location *file_loc; AST_type type_tag; struct ident_s *next; // for lists const char *name; } ident_t; typedef struct { file_location *file_loc; AST_type type_tag; const char *text; word_type value; } number_t; typedef struct { file_location *file_loc; AST_type type_tag; const char *text; int code; } token_t; // ... typedef struct block_s { file_location *file_loc; AST_type type_tag; const_decls_t const_decls; var_decls_t var_decls; proc_decls_t proc_decls; stmts_t stmts; } block_t; // ... typedef union AST_u { generic_t generic; block_t block; const_decls_t const_decls; const_decl_t const_decl; const_def_list_t const_def_list; const_def_t const_def; var_decls_t var_decls; var_decl_t var_decl; ident_list_t ident_list; // ... expr_t expr; binary_op_expr_t binary_op_expr; token_t token; number_t number; ident_t ident; empty_t empty; } AST; // Return the file location from an AST extern file_location *ast_file_loc(AST t); // ... // Return a pointer to a fresh copy of t // that has been allocated on the heap extern AST *ast_heap_copy(AST t); // ... extern block_t ast_block( token_t begin_tok, const_decls_t const_decls, var_decls_t var_decls, proc_decls_t proc_decls, stmts_t stmts); // ... extern ident_t ast_ident( file_location *file_loc, const char *name); extern expr_t ast_expr_number( number_t e); extern empty_t ast_empty( file_location *file_loc); // ... parser_types.h: #include "ast.h" typedef AST YYSTYPE; spl.y (also spl.tab.h): #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" // more below ------------------------------------------ 2. abstract syntax trees and their use in parsing a. background: how the parser works ------------------------------------------ CONNECTING THE PARSER AND THE ASTs Parser model A stack of (terminals + nonterminals) A parallel stack of ASTs 1 token of lookahead Steps in parsing (DFA decides to): - shift: 1. push lookahead on parse stack 2. push its yylval on AST stack OR - reduce using a rule written: nt : a b c { $$ = f($1,$2,$3); }; 1. take a,b,c off parse stack 2. take their AST values, aval,bval,cval, off the AST stack and compute ntval = f(aval,bval,cval) 3. push nt on parse stack 4. push result (ntval) on AST stack ------------------------------------------ b. use of ASTs in the grammar (.y) file ------------------------------------------ CONNECTION WITH ASTs IN GRAMMAR FILE /* $Id: spl.y ... */ %code requires { #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" /* ... */ } /* ... */ %token identsym %token numbersym %token plussym "+" %token minussym "-" %token multsym "*" %token divsym "/" %token periodsym "." %token semisym ";" %token eqsym "=" %token commasym "," %token becomessym ":=" %token lparensym "(" %token rparensym ")" %token constsym "const" %token varsym "var" /* ... */ %type program %type block %type constDecls %type constDef %type varDecls %type varDecl %type idents %type procDecls %type empty /* ... */ %type expr %type term %type factor %start program /* ... */ ------------------------------------------ ------------------------------------------ PUTTING A TOKEN VALUE ON THE AST STACK (DETAILS) To put yylval on the AST Stack when the field name is "token" in the .y file have: %token somesym "some" the generated parser has: #include "ast.h" #include "parser_types.h" // typedef AST YYSTYPE; YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].token = yylval; ------------------------------------------ ------------------------------------------ PUSHING AST FOR A NONTERMINAL ON AST STACK (DETAILS) To put yylval on the AST Stack when the field name is "const_def" in the .y file have: %type constDef the generated parser has: #include "ast.h" #include "parser_types.h" // in file parser_types.h: ===== // typedef AST YYSTYPE; // in the generated parser file: == YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].const_def = yylval; ------------------------------------------ c. Example, the const language ------------------------------------------ THE CONST LANGUAGE programs all look like: const ident = 3402 ------------------------------------------ ------------------------------------------ ASTs FOR THE CONST LANGUAGE /* $Id: ast.h ... */ /* ... */ typedef struct { file_location *file_loc; } generic_t; typedef struct ident_s { file_location *file_loc; const char *name; } ident_t; typedef struct { file_location *file_loc; const char *text; word_type value; } number_t; typedef struct { file_location *file_loc; const char *text; int code; } token_t; typedef struct const_def_s { file_location *file_loc; ident_t ident; number_t number; } const_def_t; typedef union AST_u { generic_t generic; const_def_t const_def; token_t token; number_t number; ident_t ident; } AST; extern const_def_t ast_const_def( ident_t ident, number_t number); // ... ------------------------------------------ ------------------------------------------ THE CONST.Y FILE FOR CONST LANGUAGE /* ... */ %code requires { #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" /* ...*/ } /* ...*/ %token constsym "const" %token identsym %token eqsym "=" %token numbersym %type program %type constDef %type constDefs %type empty %start program %code { extern int yylex(void); const_def_t progast; extern void setProgAST(const_def_t t); } %% program : constDef { setProgAST($1); } ; constDef : "const" identsym "=" numbersym { $$ = ast_const_def($2,$4); }; %% // Set the program's ast to be t void setProgAST(const_def_t t) { progast = t; } ------------------------------------------ Do we really need 2? When are line numbers recorded? d. example of changing the grammar How would we make the grammar be ::= with ::= { } ? i. Change the grammar ii. Add to the ast module