LANGUAGES

def: A *language* is a set of strings of characters
     from some alphabet.

LANGUAGE CLASSES

Languages can be classified by the grammars
used to generate/recognize them

Venn Diagram:

|--------------------------------------|
| Strings recognized by                |
| the Regular Grammar                  |
|                                      |
|  |--------------------------------|  |
|  | Strings recognized by          |  |
|  | the Context-free Grammar       |  |
|  |                                |  |
|  |  |--------------------------|  |  |
|  |  | Strings recognized by    |  |  |
|  |  | the Context-sensitive    |  |  |
|  |  | Grammar                  |  |  |
|  |  | (static checking)        |  |  |
|  |  |                          |  |  |
|  |  |  |--------------------|  |  |  |
|  |  |  | Strings recognized |  |  |  |
|  |  |  | by the             |  |  |  |
|  |  |  | Type 0 Grammar     |  |  |  |
|  |  |  | (dynamic checks)   |  |  |  |
|  |  |  |                    |  |  |  |
|  |  |  |--------------------|  |  |  |
|  |  |                          |  |  |
|  |  |--------------------------|  |  |
|  |                                |  |
|  |--------------------------------|  |
|                                      |
|--------------------------------------|

PHASES OF A COMPILER

Programs allowed by a compiler's:

|--------------------------------------|
| Lexical Analysis (Lexer)             |
|                                      |
|  |--------------------------------|  |
|  | Parser                         |  |
|  |                                |  |
|  |  |--------------------------|  |  |
|  |  | Static Analysis          |  |  |
|  |  |                          |  |  |
|  |  |  |--------------------|  |  |  |
|  |  |  | Runtime checks     |  |  |  |
|  |  |  |                    |  |  |  |
|  |  |  |--------------------|  |  |  |
|  |  |                          |  |  |
|  |  |--------------------------|  |  |
|  |                                |  |
|  |--------------------------------|  |
|                                      |
|--------------------------------------|

PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

   Language Def.docx -- coding --> parser.c (v. 1)
   Def2.docx         -- coding --> parser.c (v. 2)
                     -- coding --> parser.c (v. 3)
   Def3.docx         -- coding --> parser.c (v. 4)
   Def4.docx         -- coding --> parser.c (v. 5)
   ...                             ...
   DefN.docx         -- coding --> parser.c (v. N)

Disadvantages:
  - can be slow to write
  - error prone
  - hard to check or verify
  - expensive
  - might not be very efficient

COMPUTER SCIENCE SOLUTION: AUTOMATION

   high-level description   tool         generated code
   lang_lexer.l          -- flex  -->    lang_lexer.c + lang_lexer.h
   lang.y                -- bison -->    lang.tab.c + lang.tab.h
   lang2.y               -- bison -->    lang.tab.c + lang.tab.h
   ...
   langN.y               -- bison -->    lang.tab.c + lang.tab.h

Advantages:
  - cycle time is faster
  - grammar is easier to check/verify
  - code can be very efficient

GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions of languages/parsers

def: A *terminal* is a character (or string of characters)
     found in the language
     E.g., 123  while  xyz  ,  (

def: A *nonterminal* is a meta-notation that can be replaced
     with other terminals or nonterminals.
     E.g., <X> (a name written in angle brackets)

def: a *grammar* consists of a finite set of rules
     (called "productions") and a start symbol (a nonterminal).

     Let V = nonterminals + terminals
     The rules have the form   V+ -> V*
     where there is no symbol in both nonterminals and terminals

def: The language generated by a grammar G
     with set of productions P is:

        {w | w is in terminals* and S =>* w}

     where S is the start symbol of G,

        gAd => gBd  iff g in V* and d in V*
                        and A -> B is a rule in P
     and
        g =>* h     iff either h = g
                        or g => i and i =>* h

BNF NOTATION FOR GRAMMARS

   ::=  means "can be" or "can produce", i.e., ->
   |    means "or"
   <X>  is a nonterminal named X

Example

   <bits> ::= <bit> | <bit> <bits>
   <bit>  ::= 0 | 1

E.g., <bits> =>* 01001

GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing two games:
  - A production game (Can you produce this string?)
  - A recognition/parsing game (Is this string in the language?)

PRODUCTION GAME

Goal: produce a string in the language from the start symbol

Example Grammar:

   <sentence>  -> <noun> <verb> <adjective> <punct>
   <noun>      -> Johnny | Sue | Charlie
   <verb>      -> is | can be
   <adjective> -> good | difficult
   <punct>     -> ? | .

Can we produce "Johnny is good."?

   <sentence> =>* Johnny is good.

by replacing the leftmost nonterminal at each step:

   <sentence> -> <noun> <verb> <adjective> <punct>
              -> Johnny <verb> <adjective> <punct>
              -> Johnny is <adjective> <punct>
              -> Johnny is good <punct>
              -> Johnny is good .

or by replacing the rightmost nonterminal at each step:

   <sentence> -> <noun> <verb> <adjective> <punct>
              -> <noun> <verb> <adjective> .
              -> <noun> <verb> good .
              -> <noun> is good .
              -> Johnny is good.

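The production game for this small grammar can be played exhaustively by a program, since the grammar has no recursion and so produces only finitely many sentences. The following is a minimal sketch in C, not part of the course code: the array names (NOUNS, VERBS, ADJECTIVES, ENDINGS) and the function can_produce are hypothetical, and the sketch assumes that words are separated by single spaces with the final punctuation attached to the last word.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Alternatives of the example grammar; the array names are hypothetical. */
static const char *const NOUNS[]      = { "Johnny", "Sue", "Charlie" };
static const char *const VERBS[]      = { "is", "can be" };
static const char *const ADJECTIVES[] = { "good", "difficult" };
static const char *const ENDINGS[]    = { "?", "." };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Play the production game exhaustively: build every sentence that
   the grammar can produce and report whether target is one of them.
   If total is not NULL, the number of sentences built is stored there. */
int can_produce(const char *target, int *total)
{
    assert(target != NULL);
    char sentence[64];
    int found = 0;
    int count = 0;
    for (size_t n = 0; n < COUNT(NOUNS); n++) {
        for (size_t v = 0; v < COUNT(VERBS); v++) {
            for (size_t a = 0; a < COUNT(ADJECTIVES); a++) {
                for (size_t e = 0; e < COUNT(ENDINGS); e++) {
                    /* one choice of alternative per nonterminal */
                    snprintf(sentence, sizeof sentence, "%s %s %s%s",
                             NOUNS[n], VERBS[v], ADJECTIVES[a], ENDINGS[e]);
                    count++;
                    if (strcmp(sentence, target) == 0) {
                        found = 1;
                    }
                }
            }
        }
    }
    if (total != NULL) {
        *total = count;
    }
    return found;
}
```

Calling can_produce("Johnny is good.", NULL) plays the game for the string above. Enumeration only works because this language is finite; grammars with recursive rules need a parser instead.
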
RECOGNITION OR PARSING GAME Goal: determine if a string is in the language of the grammar Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult Is "Johnny is good" in this grammar? DERIVATION (OR PARSE) TREES def: a *tree* is a finite set of nodes connected by directed edges that is connected and has no cycles def: a *derivation tree* for grammar G is a tree such that: - Every node has a label that is a symbol of G - The root is labeled by the start symbol of G - Every node with a direct descendent, is labeled by a nonterminal - If the descendents of a node labeled by N have the following labels (in order): A, B, C, ..., K then G has a production of form N -> A B C ... K EXAMPLE DERIVATION TREE Example Grammar: -> -> Johnny | Sue | Charlie -> is | can be -> good | difficult String to parse: "Johnny is good" / | \ / | \ v v v Johnny is good EXTENSIONS TO BNF (EBNF) Arbitrary number of repeats: { x } means 0 or more repeats of x :: = { } is equivalent to: ::=

::= | ::= {} is also written as: * or [ ] ... One-or-more repeats: x+ means 1 or more repeats of x :: = + is equivalent to: ::= ::= | + is sometimes written as ... or [ ] ... Optional element: [ x ] means 0 or 1 occurences of x ::= [ ] is equivalent to: ::=

::= | READING A BNF GRAMMAR Example rules: ::= ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ::= 0 | ::= | EBNF GRAMMAR FOR (SUBSET OF) SPL ::= . ::= begin

end ::= {} ::= const ; ::= | ,

::= =

::= {} ::= var ; ::= {} ::= ,

::= {} ::= proc ; ::= | ::= ::= {} ::= ; ::= := | call | if then else end | if then end | while do end | read | print | block ::=

EXAMPLES IN SPL

Shortest program:

begin end.

Factorial program:

begin
  var n, res;  # input and result
  proc fact
  begin
    read n;
    res := 1;
    while n != 0
    do
      res := res * n;
      n := n-1
    end;
    print res
  end;
  call fact
end.

MOTIVATION

   # $Id$\n .text start\nstart:\tADDI ...

Want to: break input string into tokens
         so that parser doesn't need to worry about individual characters

Examples from the SSM assembler (in asm.tab.h file)

   INPUT     token names
   ===================
   .text     textsym
   start     identsym
   :         colonsym
   ADDI      addisym

Approach: the parser only sees non-ignored tokens,
          not comments, whitespace, ...

LEXICAL ANALYSIS

Lexical means relating to the words of a language

GOALS OF LEXICAL ANALYSIS

- Simplify the parser, so it need not handle:
    - white space and comments
    - details of tokens
- Recognize the longest match
    Why? e.g., == operator or <=
- Handle every possible character of input
    Why? so communication is clear and the programmer doesn't think
         a character means something when it doesn't

CONFLICT BETWEEN RULES

Suppose that both "if" and numbers are tokens:

What tokens should "if8" match?

Fixing such situations:
  - tell programmers it's an identifier
  - (will cause a syntax error)
    say whitespace is required to separate tokens
  - tell programmers it's two tokens

WHICH TOKEN TO RETURN?

If the input is "<=", what token(s)?
   leqsym
If the input is "==", what token?
   eqeqsym
If the input is "<8", what token(s)?
   ltsym and then numbersym
If the input is "if", what token(s)?
   ifsym
If the input is "//", what token(s)?
   (if that starts a comment, then none and eat the comment)
   divsym and then divsym in SPL

Summary: favor the longest match
         but give priority to reserved words over identifiers

THE BIG PICTURE

                      tokens
   source --> [ Lexer ] ------> [ Parser ]
   code                           /
                                 /
                                /  abstract
                               /   syntax
                              v    trees (ASTs)
                   [ static analysis ]
                          /
                         /
                        v
                 [ code generator ]

For the Lexer we want to:
  - specify the tokens using regular expressions (REs)
  - convert REs to DFAs to execute them
but easy conversions are:
  - REs to NFAs
  - NFAs to DFAs

HOW PARSER WORKS WITH LEXER

Coroutine structure:

Parser calls lexer:
   yychar = yylex(); // call lexer
   /* ... use yychar ... */

        Lexer function remembers pointer to input stream,
        returns next token (int code)

Parser works...
Parser calls lexer again:
   yychar = yylex(); // call lexer
   /* ... use yychar ... */

BISON AND FLEX, GENERATING A PARSER

idea:
   ast.h (AST types)
     |
     |         bison
     |        -----> spl.tab.c
     |       /         ^  yyparse() function
     v      /  bison   |
   spl.y file -----> spl.tab.h
                       |  tokens
              flex     v  defs.
   spl_lexer.l file -----> spl_lexer.c
                           yylex() function

DEFINITIONS

The lexical grammar of a language is usually regular, because regular
languages can be recognized very quickly
using a deterministic finite automaton (DFA)

def: a grammar is *regular* iff all its productions have one of these forms:
        <A> ::= c
     or
        <A> ::= c <B>
     where c is a terminal symbol, and <A> and <B> are nonterminals

def: a language is *regular* iff it can be defined
     using a regular grammar.

Thm: Every regular language can be recognized by a finite automaton.

Thm: Every regular language can be specified by a regular expression.

REGULAR EXPRESSIONS

The language of regular expressions:

   <RE> ::= <c> | <RE> '|' <RE> | <RE> <RE> | emp | <RE> * | ( <RE> )

   where <c> is a character

Examples:

   RE                 meaning
   ====================================
   emp                the empty string
   2*                 zero or more 2's
                        e.g., emp, 2, 22, 222, 2222, ...
   (0|1)*0            even binary numerals
                        e.g., 0, 10, 00, 100, 110,
   b*(abb*)*(a|emp)   a's and b's without consecutive a's
                        e.g., emp, a, aba, abba, baba

TYPICAL EXTENSIONS TO REGULAR EXPRESSIONS

   [abcd]  means a|b|c|d
   [h-m]   means h|i|j|k|l|m
   x?      means x|emp
   y+      means y(y*)

FOR YOU TO DO

Write a regular expression that describes:

1. The keyword "if"

      if

2. The set of all (positive) decimal numbers

      [1-9]([0-9]*)
      (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

3. The set of all possible identifiers without underbars

      [a-zA-Z][a-zA-Z0-9]*

(If you have time, write these as regular grammars also.)

   ::= i
   ::= f
   ::= 1 | 2 | 3 | ... | 9 | 0 | 1 | 2 | 3 | ... | 9
   ::= 0 | 1 | 3 | ... | 9 | 0 | 1 | 2 | 3 | ... | 9
   ::= *
   ::= 0 | 1 | 2 | 3 | ... | 9

EXAMPLE REGULAR EXPRESSIONS FOR SPL

   PATTERN         RE
   =========================
   DIGIT           [0-9]
   LETTER          [a-zA-Z]
   ...

EXAMPLE

   LETTERORDIGIT   ({LETTER}|{DIGIT})
   IDENT           {LETTER}({LETTERORDIGIT}*)

State diagram:

              <           =
   -->[q0] --->[[q1]]---> [[q2]]

PSEUDO CODE FOR THIS LEXER CASE

   char c;

DEALING WITH COMMENTS

Suppose / is used for division
        // starts a comment to end of line (unlike SPL!)

What state diagram?

How does whitespace fit in?

TACTICS FOR IGNORING WHITESPACE, COMMENTS

Goal: do not send ignored tokens to parser

Can always get a non-ignored token:
  - Return "tokens" that include ignored stuff
    to a loop that ignores them
  - Giant DFA that goes back to start state
    on seeing something to ignore

NONDETERMINISTIC FINITE AUTOMATA

def: A *nondeterministic finite automaton* (NFA)
     over an alphabet Sigma is a system

        (K, Sigma, delta, q0, F)

     where
        K is a finite set (of states),
        Sigma is a finite set (the input alphabet),
        delta is a map of type (K, Sigma) -> Sets(K),
        q0 in K is the initial state,
        F is a subset of K (the final/accepting states).

TRANSITION FUNCTION AND ACCEPTANCE

p in delta(q,x) means that in state q, on input x
                the next state can be p

delta*(q,s), where s in Sigma*, is defined by:

   delta*(q,emp) = {q}     (i.e., q in delta*(q,emp))

   p in delta*(q,xa)  iff  p in delta*(q2,a) for some q2 in delta(q,x)
                           where x in Sigma, a in Sigma*

Lemma: for all c in Sigma, delta*(q,c) = delta(q,c)

def: An NFA (K, Sigma, delta, q0, F) *accepts* a string s in Sigma*
     iff there is some q in delta*(q0,s) such that q in F

EXAMPLE NFA

        0,1                      0,1
       /---\                    /---\
       \   /                    \   /
        | v     0          0     | v
   -->[ q0 ] ---> [ q3 ] ---> [[ q4 ]]
        |
        | 1
        v
      [ q1 ]
        |
        | 1
        v
     [[ q2 ]]
        | ^
       /   \
       \---/
        0,1

K = {q0,q1,q2,q3,q4}
Sigma = {0,1}
q0 is start state
F = {q2,q4}

   delta(q0,0) = {q0,q3}      delta(q0,1) = {q0,q1}
   delta(q1,0) = {}           delta(q1,1) = {q2}
   delta(q2,0) = {q2}         delta(q2,1) = {q2}
   delta(q3,0) = {q4}         delta(q3,1) = {}
   delta(q4,0) = {q4}         delta(q4,1) = {q4}

EXTENDING FUNCTIONS TO SETS OF STATES

Notation extending delta and delta* to sets of states

   d(Q,x) = union of d(q,x) for all q in Q

so
   d({},x) = {}
   d({q},x) = d(q,x)
   d({q1,q2},x) = d(q1,x)+d(q2,x)
   d({q1,q2,q3},x) = d(q1,x)+d(q2,x)+d(q3,x)
   etc.

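One way to make these definitions concrete is to simulate the example NFA directly, representing a set of states as a bit mask. The following is a minimal sketch in C, not code from the course: the names state_set, ST, DELTA, delta_set, delta_star, and nfa_accepts are all hypothetical. The table is copied from the delta table of the example NFA. The sketch processes the input left to right on a set of states, which computes the same set as the recursive definition of delta* extended to sets.

```c
#include <assert.h>

enum { NUM_STATES = 5 };            /* the states q0 .. q4 */

typedef unsigned int state_set;     /* bit i is set iff qi is in the set */
#define ST(i) (1u << (i))

/* DELTA[q][c] is delta(q, c) for the example NFA, where c is 0 or 1 */
static const state_set DELTA[NUM_STATES][2] = {
    /* q0 */ { ST(0) | ST(3), ST(0) | ST(1) },
    /* q1 */ { 0,             ST(2)         },
    /* q2 */ { ST(2),         ST(2)         },
    /* q3 */ { ST(4),         0             },
    /* q4 */ { ST(4),         ST(4)         },
};

static const state_set FINAL = ST(2) | ST(4);   /* F = {q2, q4} */

/* delta extended to a set of states:
   the union of delta(q, c) for all q in from */
state_set delta_set(state_set from, int c)
{
    assert(c == 0 || c == 1);
    state_set to = 0;
    for (int q = 0; q < NUM_STATES; q++) {
        if (from & ST(q)) {
            to |= DELTA[q][c];
        }
    }
    return to;
}

/* delta*(q0, s), where s is a string of '0' and '1' characters */
state_set delta_star(const char *s)
{
    state_set current = ST(0);      /* delta*(q0, emp) = {q0} */
    for (; *s != '\0'; s++) {
        current = delta_set(current, *s - '0');
    }
    return current;
}

/* the NFA accepts s iff some state in delta*(q0, s) is final */
int nfa_accepts(const char *s)
{
    return (delta_star(s) & FINAL) != 0;
}
```

With this table, delta_star("010011") yields the mask for {q0,q1,q2,q4}, so nfa_accepts("010011") is true because q2 and q4 are final. A bit mask is enough here because K is small; a larger automaton would need a different set representation.
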
Note that, for x in Sigma,
   delta*({q}, x) = delta*(q, x) = delta(q, x)

EXAMPLE

     delta*(q0,010011)
   = delta*(delta(q0,0),10011)
   = delta*({q0,q3},10011)
   = delta*(q0,10011) + delta*(q3,10011)
   = delta*(delta(q0,1),0011) + delta*(delta(q3,1),0011)
   = delta*({q0,q1},0011) + delta*({},0011)
   = delta*(delta(q0,0),011) + delta*(delta(q1,0),011) + {}
   = delta*({q0,q3},011) + delta*({},011)
   = delta*({q0,q3},011) + {}
   = delta*({q0,q3},011)
   = delta*(delta(q0,0),11) + delta*(delta(q3,0),11)
   = delta*({q0,q3},11) + delta*({q4},11)
   = delta*({q0,q3,q4},11)
   = delta*(delta({q0,q3,q4},1),1)
   = delta*({q0,q1,q4},1)
   = delta({q0,q1,q4},1)
   = delta(q0,1) + delta(q1,1) + delta(q4,1)
   = {q0,q1} + {q2} + {q4}
   = {q0,q1,q2,q4}

DETERMINISTIC FINITE AUTOMATA

def: a *deterministic finite automaton* (DFA) is an NFA
     in which delta(q,c) is a singleton or empty
     for all q in K and c in Sigma.

IMPLEMENTING DFAS

How would you represent states?

How would you implement a DFA?

PROBLEM

We want to specify lexical grammar using regular expressions

So we need to convert regular expressions into DFAs

CONVERTING REs TO NFAs

Definition based on grammar of Regular Expressions:

Result of Convert(M) looks like this:

   -->(M  q)

where the "tail", -->, goes to the start state
and q is the "head state"
and M is converted recursively

assume also Convert(N) is --->(N  q')

                 c
Convert(c)  =  --->[ q ]

Convert(M | N) =
                   /--->(M  q)--\
             emp  /              \
            ---->[ q ]            -> [ q2 ]
                  \              /
                   \--->(N  q')-/

Convert(M N) =  --> (M  q) --> (N  q')

                   emp
Convert(emp) =  -----> [ q ]

                        emp
                  /--------------\
                 /         emp    v
Convert(M*) =  -/  /->(M  q)---->[ q2 ]
                  /              /
                  \     emp     /
                   \-----------/

Convert((M)) = Convert(M)

After conversion, make the "head state" be a final state

EXAMPLE OF CONVERSION TO NFA

Regular expression: (i|j)*

                 i
Convert(i)  =  --->[ qi ]

                 j
Convert(j)  =  --->[ qj ]

Convert(i|j) =
                     i
                  /--->(qi)---\ emp
            emp  /             \
           ---->[ q ]           -> [ q2 ]
                 \   j         /
                  \--->(qj)---/ emp

Convert((i|j)*) =

CONVERTING AN NFA INTO A DFA

Idea: Convert each reachable set of NFA states into a DFA state

How?
Use the emp-closure of each state q = set of states reachable from q using emp Closure wrt emp: closure(S) is the smallest set T such that T = S + union {delta(s, emp) | s in S} can compute closure(S) as T <- S; do T2 <- T T <- T2 + union {delta(s, emp)| s in T2} while (T != T2) DFA Transitions: Let S be a set of states, then DFAdelta(S, c) = closure(union {delta(s,c)|s in S}) EXAMPLE CONVERSION OF NFA TO DFA NFA for if|[a-z]([a-z]|[0-9])* f [q2] ---> [[q3]] ^ emp i / /------\ / emp a-z / v -->[q1]------>[q4]----->[q5] [[q8]] ^ | emp| | | | a-z | | /-----\ | | / v / | |->[q6] [q7] | | \ 0-9 ^ | emp | \----/ / | / \ emp / \----------/ Converted to DFA: f [2,5,6,8] ---> [[3,6,7,8]] ^ \ i / \ a-z / a-h j-z a-z \ 0-9 -->[1,4]---->[[5,6,8]]0-9 v /-| \----->[[6,7,8]]|a-z ^ |0-9 \ | \-| USING THE FLEX TOOL TO GENERATE LEXERS Example: SSM assembler (in hw1 zip file) High-level description in asm_lexer.l === flex ===> generates: asm_lexer.c + asm_lexer.h Wrapper for lexer: lexer.h declares functions lexer.c errors_noted variable + some utility functions asm_lexer.l defines functions e.g., lexer_print_token lexer_output (ASTs defined in ast.h) asm.y is Bison description file grammar == bison ==> - Declarations in asm.tab.h includes ast.h machine_types.h parser_types.h declares YYSTYPE lexer.h declares yytokentype eolsym = ... minussym = ... dottextsym = ... ... - Definitions in asm.tab.c defines yyparser() YYSTYPE yylval; STRUCTURE OF FLEX INPUT FILE /* ... definitions section ... */ %% /* ... rules section ... */ %% /* ... user subroutines ... */ SECTIONS IN FLEX INPUT (.l file) Definitions section: options in %{ ... %} #includes declarations of names used in rules definitions of named REs declarations of states (%s) Rules section: pairs of REs and actions (C code) User subroutine section: definitions of C code subroutines used WHY CONTEXT-FREE PARSING? Can we define a regular expression to make sure expressions have balanced parentheses? 
e.g., recognize: (34) and ((12)+(789)) but not: (567))+82)))) WHAT IS NEEDED Needed for checking balanced parentheses: an infinite number of states, we will use an (unbounded) stack recursion in the grammar, like: ::= | ( ) | + PARSING Want to recognize language syntax with a context-free grammar and terminals that are tokens from the lexer Goal: to produce an abstract syntax tree (AST) that represents the parse CONTEXT-FREE GRAMMARS def: a *context-free* grammar (CFG) (N, T, P, S) has a set (of nonterminals) N, a set (of terminals) T, a start symbol S in N and each production (in P) has the form: -> g where is a Nonterminal symbol, and g in (N+T)* Example: -> | ( ) -> 1 | 2 | 3 Examples in that language: 1 ((2)) ((((3)))) def: For a CFG (N,T,P,S), g in (N+T)* *produces* g' in (N+T)*, written g =P=> g', iff g is e f, e and f in (N+T)*, g' is e h f, is a nonterminal (in N), h in (N+T)*, and the rule -> h is in P Example: ( ) =P=> ( 1 ) DERIVATION def: a *derivation* of gm in T* from the rules P of G=(N,T,P,S), is a sequence (S, g1, g2, ..., gm), where for all i: gi in (N+T)*, S =P=> g1, and for all 1 <= j < m: gj =P=> g(j+1) Example: (, (), (), (1)) usually written: -> ( ) -> ( ) -> ( 1 ) LEFTMOST DERIVATION def: a *leftmost derivation* of gm in T* from a CFG (N,T,P,S) is a derivation of t from (N,T,P,S) (S, g1, ..., gm) such that for all 0 <= j < m: when gj =P=> g(j+1) and the nonterminal is replaced in gj, then there are no nonterminals to the left of in gj. Example: for the grammar: -> | - -> 1 | 2 | 3 -> - -> - -> 3 - -> 3 - -> 3 - 2 A rightmost derivation follows: -> - -> - -> - 2 -> - 2 -> 3 - 2 PARSE TREES AND DERIVATIONS def: A *parse tree*, Tr, for a CFG (N,T,P,S) represents a derivation, D, iff: - Each node in Tr is labeled by a nonterminal (in N) - the root of Tr is the start symbol S - an arc from to h in (N+T) iff -> ... h ... in P - the order of children of a node labeled is the order in a production -> ... 
in P EXAMPLES GRAMMAR: ::= | + | * ::= 1 | 2 | 3 | 4 Derivations of 3*4+2 leftmost -> * -> * -> 3 * -> 3 * + -> 3 * + -> 3 * 4 + -> 3 * 4 + -> 3 * 4 + 2 / | \ * | / \ \ + | | | 3 | | 4 2 rightmost derivation follows: -> + -> + -> + 2 -> * + 2 -> * + 2 -> * 4 + 2 -> * 4 + 2 -> 3 * 4 +2 / \ \ + / \ | | * | | | 2 | | 3 4 AMBIGUITY def: a CFG (N,T,P,S) is *ambiguous* iff there is some t in T* such that there are two different parse trees for t FIXING AMBIGUOUS GRAMMARS Idea: Rewrite grammar to eliminate unwanted parse trees Example: ::= | + ::= | * ::= ::= 1 | 2 | 3 | 4 to parse 3*4+2 / | \ + | | / | \ | * | | | 2 | | 4 | 3 RECURSIVE DESCENT PARSING ALGORITHM For each production rule, of form: ::= g1 | g2 | ... | gm 1. Write a (recursive) function parseN 2. This function decides between the alternatives g1, g2, ..., gm by looking at the first terminal (token) in the input EXAMPLE RECURSIVE-DESCENT RECOGNIZER ::= if then else end | begin end | print ::= | ::= | ; ::= ::= == #include "spl.tab.h" // the current token yytoken_kind_t tok; void parser_initialize() { tok = yylex(); } void advance() { tok = yylex(); } void eat(yytoken_kind_t expected) { if (tok == expected) { advance(); } else { /* ... 
report error */ } }

// forward declarations: the parsing functions call each other
void parseStmts(void);
void parseStmtList(void);
void parseStmt(void);
void parseEmpty(void);

void parseCond() {
    eat(numbersym);
    eat(eqeqsym);
    eat(numbersym);
}

void parseStmts() {
    if (tok == ifsym || tok == beginsym || tok == printsym) {
        parseStmtList();
    } else {
        parseEmpty();
    }
}

void parseEmpty() {}

void parseStmtList() {
    parseStmt();
    if (tok == semisym) {
        eat(semisym);
        parseStmtList();
    }
}

void parseStmt() {
    switch (tok) {
    case ifsym:
        eat(ifsym); parseCond();
        eat(thensym); parseStmts();
        eat(elsesym); parseStmts();
        eat(endsym);
        break;
    case beginsym:
        eat(beginsym); parseStmts(); eat(endsym);
        break;
    case printsym:
        eat(printsym); eat(numbersym);
        break;
    default:
        // report error
        break;
    }
}

LL(1) GRAMMARS

A recursive-descent parser must be able to decide what to do
based on the next token:
  - choose between alternatives
    (e.g., parseStmts chooses between a statement list and empty)

def: A grammar is *LL(1)* iff it can be parsed left-to-right
     in one pass using 1 token of lookahead
     (to decide between alternatives)

LR(1) GRAMMARS

An LR(1) parser needs to decide when to:
   shift (push token on stack) or reduce
based on the next token;
it uses a DFA based on stack + lookahead

def: A grammar is *LR(1)* iff it can be parsed left-to-right
     in one pass using the parse stack and 1 token of lookahead
     to decide whether to shift or reduce

LALR(1) Parsing

Smaller tables than LR(1)
  - merges states of the DFA if they only differ in lookahead

PROBLEM: AMBIGUITY

Consider:

   <stmt> ::= <id> := <exp>
            | if <exp> then <stmt> <else-part>
   <else-part> ::= <empty> | else <stmt>

and the input statement:

   if b1 then if b2 then x := 2 else x := 3

Is this parsed as (the else belongs to the inner if):

   if b1 then
      if b2 then x := 2
      else x := 3

or as (the else belongs to the outer if):

   if b1 then
      if b2 then x := 2
   else x := 3

FIXES FOR AMBIGUITY

Change the language:

a. Always have an else clause:

      <stmt> ::= if <exp> then <stmt> else <stmt>

   (use skip if you don't want to do anything)

b. Use an end marker

      <stmt> ::= if <exp> then <stmt> else <stmt> fi
               | if <exp> then <stmt> fi

Give precedence to one production:

   <stmt> ::= if <exp> then <stmt> <else-part>
   <else-part> ::= else <stmt>    // priority!
| So we only get the parse tree: / / \ -----|---\ if then | | | \ | | | / | \--------\ | / | \ \ \ b1 if then / | \ \ \ ... | \ x := 2 else / | \ ... x := 3 BISON: A LALR(1) PARSER GENERATOR Input: spl.y | | bison spl.y v Output: spl.tab.c and spl.tab.h yyparse{def.} token decls. parse tables extern decls. yylval yyparse(), (+ user code) yylval THE BIG PICTURE tokens - source --> [ Lexer] ------> [ Parser] code / / / ASTs / | v symbol <---- [ static table ----> analysis ] / / v [ code generator ] BISON AND FLEX, GENERATING A PARSER idea: ast.h (AST types) | bison v -----> spl.tab.c / ^ yyparse (def.) / bison | spl.y ----------> spl.tab.h | token enum. | (decl.) flex v spl_lexer.l-----> spl_lexer.c yylex (def.) HOW IT ALL FITS TOGETHER IN HW3 // in file machine_types.h: ====== // ... typedef unsigned int address_type; typedef unsigned char byte_type; typedef int word_type; // in file file_location.h: ====== // location in a source file typedef struct { const char *filename; unsigned int line; // of first token } file_location; // in file ast.h: ================ // ... #include "machine_types.h" #include "file_location.h" // types of ASTs (type tags) typedef enum { block_ast, /* ... */ token_ast } AST_type; // typedefs for types N_t, follow, // where N is a nonterminal typedef struct { file_location *file_loc; AST_type type_tag; void *next; // for lists } generic_t; typedef struct ident_s { file_location *file_loc; AST_type type_tag; struct ident_s *next; // for lists const char *name; } ident_t; typedef struct { file_location *file_loc; AST_type type_tag; const char *text; word_type value; } number_t; typedef struct { file_location *file_loc; AST_type type_tag; const char *text; int code; } token_t; // ... typedef struct block_s { file_location *file_loc; AST_type type_tag; const_decls_t const_decls; var_decls_t var_decls; proc_decls_t proc_decls; stmts_t stmts; } block_t; // ... 
typedef union AST_u { generic_t generic; block_t block; const_decls_t const_decls; const_decl_t const_decl; const_def_list_t const_def_list; const_def_t const_def; var_decls_t var_decls; var_decl_t var_decl; ident_list_t ident_list; // ... expr_t expr; binary_op_expr_t binary_op_expr; token_t token; number_t number; ident_t ident; empty_t empty; } AST; // Return the file location from an AST extern file_location *ast_file_loc(AST t); // ... // Return a pointer to a fresh copy of t // that has been allocated on the heap extern AST *ast_heap_copy(AST t); // ... extern block_t ast_block( token_t begin_tok, const_decls_t const_decls, var_decls_t var_decls, proc_decls_t proc_decls, stmts_t stmts); // ... extern ident_t ast_ident( file_location *file_loc, const char *name); extern expr_t ast_expr_number( number_t e); extern empty_t ast_empty( file_location *file_loc); // ... parser_types.h: =========== #include "ast.h" typedef AST YYSTYPE; spl.y (also spl.tab.h): #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" // more below CONNECTING THE PARSER AND THE ASTs Parser model A stack of (terminals + nonterminals) A parallel stack of ASTs 1 token of lookahead Steps in parsing (DFA decides to): - shift: 1. push lookahead on parse stack 2. push its yylval on AST stack OR - reduce using a rule written: nt : a b c { $$ = f($1,$2,$3); } | a' b' c' { $$ = f'($1,$2,$3) }; if the parse stack has a b c in it: 1. take a,b,c off parse stack 2. take their AST values, aval,bval,cval, off the AST stack and compute ntval = f(aval,bval,cval) 3. push nt on parse stack 4. push result (ntval) on AST stack CONNECTION WITH ASTs IN GRAMMAR FILE /* $Id: spl.y ... */ %code requires { #include "ast.h" #include "machine_types.h" #include "parser_types.h" #include "lexer.h" /* ... */ } /* ... */ %token identsym %token numbersym %token plussym "+" %token minussym "-" %token multsym "*" %token divsym "/" %token periodsym "." 
%token semisym ";" %token eqsym "=" %token commasym "," %token becomessym ":=" %token lparensym "(" %token rparensym ")" %token constsym "const" %token varsym "var" /* ... */ %type program %type block %type constDecls %type constDef %type varDecls %type varDecl %type idents %type procDecls %type empty /* ... */ %type expr %type term %type factor %start program /* ... */ PUTTING A TOKEN VALUE ON THE AST STACK (DETAILS) To put yylval on the AST Stack when the field name is "token" in the .y file have: %token somesym "some" the generated parser has: #include "ast.h" #include "parser_types.h" // typedef AST YYSTYPE; YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].token = yylval; PUSHING AST FOR A NONTERMINAL ON AST STACK (DETAILS) To put yylval on the AST Stack when the field name is "const_def" in the .y file have: %type constDef the generated parser has: #include "ast.h" #include "parser_types.h" // in file parser_types.h: ===== // typedef AST YYSTYPE; // in the generated parser file: == YYSTYPE yyvsa[]; // the AST stack yyvsa[yyi].const_def = yylval; THE CONST LANGUAGE programs all look like: const ident = 3402 ASTs FOR THE CONST LANGUAGE /* $Id: ast.h ... */ /* ... */ typedef struct { file_location *file_loc; } generic_t; typedef struct ident_s { file_location *file_loc; const char *name; } ident_t; typedef struct { file_location *file_loc; const char *text; word_type value; } number_t; typedef struct { file_location *file_loc; const char *text; int code; } token_t; typedef struct const_def_s { file_location *file_loc; ident_t ident; number_t number; } const_def_t; typedef union AST_u { generic_t generic; const_def_t const_def; token_t token; number_t number; ident_t ident; } AST; extern const_def_t ast_const_def( ident_t ident, number_t number); // ... THE CONST.Y FILE FOR CONST LANGUAGE /* ... 
*/

%code requires {
#include "ast.h"
#include "machine_types.h"
#include "parser_types.h"
#include "lexer.h"
    /* ...*/
}

/* ...*/

%token constsym "const"
%token identsym
%token eqsym "="
%token numbersym

%type program
%type constDef
%type empty

%start program

%code {
extern int yylex(void);
const_def_t progast;
extern void setProgAST(const_def_t t);
}

%%

program : constDef { setProgAST($1); } ;

constDef : "const" identsym "=" numbersym
           { $$ = ast_const_def($2,$4); };

%%

// Set the program's ast to be t
void setProgAST(const_def_t t) { progast = t; }
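
The action for constDef relies on ast_const_def, which ast.h only declares. The following is a minimal sketch in C of what such a constructor might look like, under loudly stated assumptions: the body is hypothetical (the course's ast.c is not shown in these notes), and the choice to take the file location from the identifier is a guess. The type definitions follow the CONST language declarations in ast.h so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Types following the CONST language's ast.h, repeated here
   so that this sketch is self-contained. */
typedef int word_type;

typedef struct {
    const char *filename;
    unsigned int line;   /* of first token */
} file_location;

typedef struct ident_s {
    file_location *file_loc;
    const char *name;
} ident_t;

typedef struct {
    file_location *file_loc;
    const char *text;
    word_type value;
} number_t;

typedef struct const_def_s {
    file_location *file_loc;
    ident_t ident;
    number_t number;
} const_def_t;

/* Hypothetical body for ast_const_def: build the AST for
   "const ident = number" from the ASTs of its two children. */
const_def_t ast_const_def(ident_t ident, number_t number)
{
    assert(ident.name != NULL);
    const_def_t ret;
    /* assumption: the definition is located where its identifier is */
    ret.file_loc = ident.file_loc;
    ret.ident = ident;
    ret.number = number;
    return ret;
}
```

In the grammar file, $2 and $4 would supply the ident_t and number_t values that the lexer stored in yylval when it returned identsym and numbersym.
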