COP 3402 meeting -*- Outline -*-

* Overview

** theory

*** languages

------------------------------------------
                LANGUAGES

def: A *language* is

------------------------------------------
        ... a set of strings of characters from some alphabet
        (such a string is sometimes called a "sentence",
         but in a programming language these are called "phrases")

        Q: For a written natural language, what are the characters?
           Letters; on a computer these are characters in some code system,
           like ASCII or Unicode
        Q: For a spoken natural language, what are the characters?
           Phonemes

*** hierarchy of language classes

------------------------------------------
            LANGUAGE CLASSES

Languages can be classified by

Venn Diagram:

 |--------------------------------------|
 | Regular Languages                    |
 |                                      |
 |  |--------------------------------|  |
 |  | Context-free Languages         |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Context-sensitive        |  |  |
 |  |  |   Languages              |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  | Type 0 Languages   |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|
------------------------------------------
        ...
        the kind of grammar needed to recognize/generate them

        Note: read the nesting as successive descriptions of one language:
              the outer (regular) description lets the most strings through
              and is the fastest to parse; the inner ones are more precise
              but harder to parse.
              As classes of languages the containment runs the other way:
              every regular language is context-free, every context-free
              language is context-sensitive, and every context-sensitive
              language is Type 0.

**** relation of language classes to parts of a compiler

------------------------------------------
          PHASES OF A COMPILER

Programs allowed by a compiler's:

 |--------------------------------------|
 | Lexical Analysis (Lexer)             |
 |                                      |
 |  |--------------------------------|  |
 |  | Parser                         |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Static Analysis          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  | Runtime checks     |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|
------------------------------------------
        These programs are the strings of the language

        Q: Is there a relationship to the previous diagram
           (about language classes)?
           Yes indeed: the lexer handles a regular language,
           the parser a context-free one, and the later checks handle
           what a context-free grammar cannot express.
           The cheap checks come first because parsing a regular language
           is faster and easier than parsing a context-free language, etc.

*** automation of syntax analysis

------------------------------------------
     PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

 Language Def.docx -- coding --> parser.c (v. 1)
         Def2.docx -- coding --> parser.c (v. 2)
                   -- coding --> parser.c (v. 3)
         Def3.docx -- coding --> parser.c (v. 4)
         Def4.docx -- coding --> parser.c (v. 5)
         ...                     ...
         DefN.docx -- coding --> parser.c (v. N)

Disadvantages:

------------------------------------------
        ...
        - can be slow to write
        - error prone
        - hard to verify
        - experiments with language and coding are costly
        - code might not be efficient (focus on correctness)

------------------------------------------
  COMPUTER SCIENCE SOLUTION: AUTOMATION

 high-level
 description    tool      generated code

 lang.y  -- bison --> lang.tab.c + lang.tab.h
 lang2.y -- bison --> lang2.tab.c + lang2.tab.h
 ...
 langN.y -- bison --> langN.tab.c + langN.tab.h

Advantages:

------------------------------------------
        ...
        - cycle time is faster
        - grammar is easy to check/verify
        - experiments/changes easy/cheap
        - code very efficient

*** grammars

        We use a grammar as a high-level description

**** definitions

------------------------------------------
      GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions
of languages/parsers

def: a *grammar* consists of a finite set
     of rules (called "productions")
     and a start symbol (a nonterminal).

     Let V = nonterminals + terminals
     The rules have the form V+ -> V*
     where there is no symbol in both
     nonterminals and terminals

def: The language generated by a grammar G
     with set of productions P is:

        {w | w is in terminals* and S =>* w}

     where S is the start symbol of G

        gAd => gBd iff g in V* and d in V*
                       and A -> B is a rule in P
     and
        g =>* h iff either h = g
                    or g => i and i =>* h
------------------------------------------
        Note that A => B just when there is a production A -> B in P
        also A =>* B just when there is a chain of (zero or more)
        productions that from A can produce B

        A language only consists of strings of *terminal* symbols

        The start symbol is usually listed first
        in the presentation of a grammar.

        BNF is often used for programming languages,
        to write regular and context-free grammars

**** BNF notation

------------------------------------------
      BNF NOTATION FOR GRAMMARS

 ::=   means

 |     means

 <c>   is a

 Example

   <bits> ::= <bit> | <bit> <bits>
   <bit> ::= 0 | 1
------------------------------------------
        (BNF stands for Backus-Naur Form; Backus and Naur were
         two people on the Algol 60 committee.)

        ... (BNF)
        "::=" means "produces" or "can become",
              written "->" in modern grammar books
        "|" means "or", which separates alternatives
        <c> is a nonterminal
        c not inside < and > is a terminal symbol

**** grammars as games

------------------------------------------
      GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing two games:

 - A production game
   (Can you produce this string?)
 - A recognition/parsing game
   (Is this string in the language?)
------------------------------------------

***** production game

------------------------------------------
            PRODUCTION GAME

Goal: produce a string in the language
      from the start symbol

Example Grammar:

  <sentence> -> <noun> <verb> <adjective>
  <noun> -> Johnny | Sue | Charlie
  <verb> -> is | can be
  <adjective> -> good | difficult

Can we produce "Johnny is good"?
------------------------------------------
        Here the nonterminals are in angle brackets

        Q: Could you win the production game and
           produce the string "Johnny is good"?
           Yes
           Trace of the game:
              <sentence>
              -> <noun> <verb> <adjective>
              -> Johnny <verb> <adjective>
              -> Johnny is <adjective>
              -> Johnny is good

        Q: Could you produce the sentence "Charlie can be difficult"?
           Yes!

***** recognition or parsing game

------------------------------------------
       RECOGNITION OR PARSING GAME

Goal: determine if a string is in
      the language of the grammar

Example Grammar:

  <sentence> -> <noun> <verb> <adjective>
  <noun> -> Johnny | Sue | Charlie
  <verb> -> is | can be
  <adjective> -> good | difficult

Is "Johnny is good" in the language of this grammar?
------------------------------------------
        Q: Could you win the recognition game on
           the string "Johnny is good"?
           Yes
           Trace of the game:
              Johnny is good
              <- <noun> is good
              <- <noun> <verb> good
              <- <noun> <verb> <adjective>
              <- <sentence>

           Why are those arrows backwards?
           Each step runs a production in reverse, replacing
           its right-hand side by its left-hand side

        Q: What does this have to do with parsing in a compiler?
           The parser determines if the program is in the language

**** derivation trees

------------------------------------------
      DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes
     connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G
     is a tree such that:
     - Every node has a label
       that is a symbol of G
     - The root is labeled by
       the start symbol of G
     - Every node with a direct descendant
       is labeled by a nonterminal
     - If the direct descendants of a node
       labeled by N have the following labels
       (in order): A, B, C, ..., K
       then G has a production of the form
                N -> A B C ...
                               K
------------------------------------------
        These definitions are from the book
        "Formal Languages and their Relation to Automata"
        by Hopcroft and Ullman (Addison-Wesley, 1969).

------------------------------------------
        EXAMPLE DERIVATION TREE

Example Grammar:

  <sentence> -> <noun> <verb> <adjective>
  <noun> -> Johnny | Sue | Charlie
  <verb> -> is | can be
  <adjective> -> good | difficult

String to parse: "Johnny is good"

               <sentence>
              /    |     \
             /     |      \
       <noun>   <verb>   <adjective>
          v        v          v
       Johnny     is        good
------------------------------------------
        Q: Why does the order of nodes matter?
           Because the order in the string matters...

        The ASTs we use in a compiler will be
        representations of these parse trees

**** extensions to BNF (EBNF)

------------------------------------------
       EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:
   { x } means 0 or more repeats of x

   <a> ::= { <b> }
 is equivalent to:
   <a> ::= <b-list>
   <b-list> ::= <empty> | <b> <b-list>
   <empty> ::=

 {<b>} is also written as: <b>* or [ <b> ] ...

One-or-more repeats:
   x+ means 1 or more repeats of x

   <a> ::= <b>+
 is equivalent to:
   <a> ::= <b-list>
   <b-list> ::= <b> | <b> <b-list>

 <b>+ is sometimes written as <b> ... or <b> [ <b> ] ...

Optional element:
   [ x ] means 0 or 1 occurrences of x

   <a> ::= [ <b> ]
 is equivalent to:
   <a> ::= <opt-b>
   <opt-b> ::= <empty> | <b>
------------------------------------------
        Note, it's good practice to use an explicit production
        for the empty string, such as <empty> ::= ,
        but some use a special symbol, like \epsilon,
        for the empty string of symbols

        Q: What is EBNF notation like that you may have seen before?
           Regular expressions combined with BNF (esp. *, +)

**** examples

------------------------------------------
         READING A BNF GRAMMAR

Example rules:

  <number> ::= <non-zero-digit> <digits>
  <non-zero-digit> ::= 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  <digit> ::= 0 | <non-zero-digit>
  <digits> ::= <empty> | <digit> <digits>
------------------------------------------
        Q: can you give an example of a <number>?
           3 and 3402 are examples

        Caveat: the above grammar is lexical, that is,
        the terminals are characters instead of tokens

------------------------------------------
   EBNF GRAMMAR FOR (SUBSET OF) PL/0

<program> ::= <block> .
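<block> ::= <const-decls> <var-decls> <proc-decls> <stmt>
<const-decls> ::= {<const-decl>}
<const-decl> ::= const <const-def> {<comma-const-def>} ;
<const-def> ::= <ident> = <number>
<comma-const-def> ::= , <const-def>
<var-decls> ::= {<var-decl>}
<var-decl> ::= var <idents> ;
<idents> ::= <ident> {<comma-ident>}
<comma-ident> ::= , <ident>
<proc-decls> ::= {<proc-decl>}
<proc-decl> ::= procedure <ident> ; <block> ;
<stmt> ::= <ident> := <expr>
    | call <ident>
    | begin <stmt> {<semi-stmt>} end
    | if <condition> then <stmt> <else-part>
    | while <condition> do <stmt>
    | read <ident>
    | write <expr>
    | skip
<semi-stmt> ::= ; <stmt>
<else-part> ::= <empty> | else <stmt>
<empty> ::=
<condition> ::= odd <expr>
    | <expr> <rel-op> <expr>
------------------------------------------
        Note: comments are from a # to the end of the line

------------------------------------------
           EXAMPLES IN PL/0

Shortest program:

skip.

Factorial program:

var n, res;  # input and result
procedure fact;
  begin
    read n;
    res := 1;
    while (n <> 0) do
      begin
        res := res * n;
        n := n-1
      end;
    write res
  end;
call fact.
------------------------------------------
        Q: Does the factorial program parse correctly?
           I think so...

        Q: Does PL/0 use semicolons to end statements
           or to separate them?
           To separate them!
           How can you tell?
           see the grammar for begin-end statements
           and atomic statements

        Q: How does C use semicolons?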
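
        The separator question can be tried out with a small recognizer.
        What follows is only a sketch in Python, not course code: it plays
        the recognition game for a fragment of the statement grammar
        (skip, call, read, and begin-end, followed by the final period),
        and leaves out declarations, expressions, if, while, and write.
        The names TOKEN, KEYWORDS, tokenize, Recognizer, and recognizes
        are hypothetical, chosen for this illustration.

```python
import re

# Hypothetical token pattern for this sketch: words (keywords, identifiers)
# plus the punctuation ';' and '.'.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[;.]")
KEYWORDS = {"begin", "end", "call", "read", "skip"}


def tokenize(text):
    """Split text into tokens; a comment runs from '#' to the end of the line."""
    tokens = []
    for line in text.splitlines():
        code = line.split("#", 1)[0]
        # Anything left after removing all tokens must be whitespace.
        if TOKEN.sub("", code).strip():
            raise SyntaxError(f"unexpected character in {code!r}")
        tokens.extend(TOKEN.findall(code))
    return tokens


class Recognizer:
    """Recursive-descent recognizer for the statement fragment."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def ident(self):
        tok = self.peek()
        if tok is None or tok in KEYWORDS or not tok[0].isalpha():
            raise SyntaxError(f"expected an identifier, found {tok!r}")
        self.pos += 1

    def stmt(self):
        tok = self.peek()
        if tok == "skip":
            self.pos += 1
        elif tok in ("call", "read"):
            self.pos += 1
            self.ident()
        elif tok == "begin":
            self.pos += 1
            self.stmt()
            # Zero or more "; statement" pairs: every ';' must be
            # followed by another statement.
            while self.peek() == ";":
                self.pos += 1
                self.stmt()
            self.expect("end")
        else:
            raise SyntaxError(f"expected a statement, found {tok!r}")


def recognizes(text):
    """Return True if text is one statement followed by '.', else False."""
    try:
        r = Recognizer(tokenize(text))
        r.stmt()
        r.expect(".")
        return r.peek() is None
    except SyntaxError:
        return False


print(recognizes("begin skip; call fact end."))   # True
print(recognizes("begin skip; call fact; end."))  # False
```

        The second call is rejected because, after the last ";", the
        begin-end rule demands another statement before "end". That is
        what it means for ";" to separate statements instead of ending them.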