COP 3402 meeting -*- Outline -*-

* Overview

** theory
*** languages
------------------------------------------
                LANGUAGES

def: A *language* is


------------------------------------------
        ... a set of strings of characters from some alphabet
            (such a string is sometimes called a "sentence"
             but in a programming language these are called "phrases")
            Q: For a written natural language, what are the characters?
            Letters,
            on a computer these are characters in some code system,
               like ASCII or Unicode
            Q: For a spoken natural language, what are the characters?
            Phonemes

*** hierarchy of language classes
------------------------------------------
          LANGUAGE CLASSES

Languages can be classified by


Venn Diagram:

 |--------------------------------------|
 | Type 0 Languages                     |
 |                                      |
 |                                      |
 |  |--------------------------------|  |
 |  | Context-sensitive Languages    |  |
 |  |                                |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Context-free             |  |  |
 |  |  |   Languages              |  |  |
 |  |  |                          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  | Regular Languages  |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|

------------------------------------------
        ... the kind of grammar needed to
            recognize/generate them

        Note: languages in the smaller (inner) classes are faster
              and easier to parse; the larger (outer) classes need
              more powerful grammars and are harder to parse

**** relation of language classes to parts of a compiler
------------------------------------------
          PHASES OF A COMPILER

Programs allowed by a compiler's:

 |--------------------------------------|
 | Lexical Analysis (Lexer)             |
 |                                      |
 |                                      |
 |  |--------------------------------|  |
 |  | Parser                         |  |
 |  |                                |  |
 |  |                                |  |
 |  |  |--------------------------|  |  |
 |  |  | Static Analysis          |  |  |
 |  |  |                          |  |  |
 |  |  |                          |  |  |
 |  |  |                          |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |  |  Runtime checks    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |                    |  |  |  |
 |  |  |  |--------------------|  |  |  |
 |  |  |                          |  |  |
 |  |  |--------------------------|  |  |
 |  |                                |  |
 |  |--------------------------------|  |
 |                                      |
 |--------------------------------------|

------------------------------------------
        These programs are the strings of the language

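        Q: Is there a relationship to the previous diagram
        (about language classes)?
        Yes indeed, each phase deals with a more powerful language class:
           the lexer recognizes a regular language, the parser a
           context-free language, and static analysis checks
           context-sensitive properties.
           The work is split this way because parsing a regular language is
           faster and easier than parsing a context-free language, etc.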
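
        A minimal sketch in Python of the lexer phase (not course code).
        The token classes below are hypothetical; each is a regular
        expression, so the lexer as a whole recognizes a regular language.
        Note that it accepts any sequence of legal tokens, which is why
        the lexer admits more programs than the parser does.

```python
import re

# Hypothetical token classes for a small PL/0-like language.
TOKEN_SPEC = [
    ("number", r"[0-9]+"),
    ("ident", r"[A-Za-z][A-Za-z0-9]*"),
    ("becomes", r":="),
    ("punct", r"[;.,]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise ValueError on a bad character."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":  # whitespace separates tokens but is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

        For example, tokenize("x := 42;") yields an ident, a becomes,
        a number, and a punct token; tokenize(";; x") also succeeds,
        even though no parser would accept that token sequence.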

*** automation of syntax analysis
------------------------------------------
       PROBLEM WRITING SYNTAX ANALYSIS

Naive way to write a compiler, etc.

Language
 Def.docx  -- coding --> parser.c (v. 1)

 Def2.docx -- coding --> parser.c (v. 2)
           -- coding --> parser.c (v. 3)

 Def3.docx -- coding --> parser.c (v. 4)
 Def4.docx -- coding --> parser.c (v. 5)
   ...                    ...
 DefN.docx -- coding --> parser.c (v. N)

Disadvantages:


------------------------------------------
        ... - can be slow to write
            - error prone,
            - hard to verify
            - experiments with language
               and coding are costly
            - code might not be efficient
               (focus on correctness)
            
------------------------------------------
 COMPUTER SCIENCE SOLUTION: AUTOMATION

 high-level
 description  tool      generated code

   lang.y  -- bison --> lang.tab.c
                        + lang.tab.h
   lang2.y -- bison --> lang2.tab.c
                        + lang2.tab.h
              ...
   langN.y -- bison --> langN.tab.c
                        + langN.tab.h

Advantages:


------------------------------------------
     ...
      - cycle time is faster
      - grammar is easy to check/verify
      - experiments/changes easy/cheap
      - code very efficient

*** grammars
     We use a grammar as a high-level description
**** definitions
------------------------------------------
        GRAMMARS DESCRIBE LANGUAGES

Grammars are high-level descriptions
 of languages/parsers

def: a *grammar* consists of a finite set
     of rules (called "productions")
     and a start symbol (a nonterminal).

     Let V = nonterminals + terminals,
     where no symbol is both a
     nonterminal and a terminal.

     The rules have the form

            V+ -> V*


def: The language generated by a grammar G
     with set of productions P is:

     {w | w is in terminals* and S =>* w}
       where S is the start symbol of G
       and gAd => gBd iff g in V* and d in V*
           and A -> B is a rule in P,
       and g =>* h iff either h = g
                    or g => i and i =>* h

------------------------------------------
     Note that g => h just when h is obtained from g by applying
          one production of P to some part of g;
          also g =>* h just when there is a chain of (zero or more)
          such steps that from g can produce h

     A language only consists of strings of *terminal* symbols

     The start symbol is usually listed first in the presentation of a grammar.
     BNF is often used in programming languages for regular and
     context-free grammars
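
     The definitions of => and =>* can be tried out with a small
     Python sketch (not course code). The grammar S -> a S b | a b and
     the names one_step and derives are hypothetical. Sentential forms
     are tuples of symbols. The search discards forms longer than
     max_len so that it terminates; that is only complete when no
     production shrinks a form.

```python
from collections import deque


def one_step(form, productions):
    """Yield every form2 such that form => form2 (one production applied once)."""
    for lhs, rhs in productions:
        n = len(lhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == lhs:
                yield form[:i] + rhs + form[i + n:]


def derives(start, target, productions, max_len=None):
    """Decide start =>* target by breadth-first search over sentential forms."""
    if max_len is None:
        max_len = len(target)
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:  # covers the zero-step case h = g
            return True
        for nxt in one_step(form, productions):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical grammar: S -> a S b | a b
PRODUCTIONS = [
    (("S",), ("a", "S", "b")),
    (("S",), ("a", "b")),
]
```

     With this grammar, derives(("S",), ("a", "a", "b", "b"), PRODUCTIONS)
     holds, but ("a", "b", "a") cannot be derived from ("S",).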

**** BNF notation
------------------------------------------
         BNF NOTATION FOR GRAMMARS

::=  means

 |   means

<X>  is a


Example

 <BNumber> ::= <BDigit> <BNumber>
           |  <BDigit>

 <BDigit> ::= 0 | 1

------------------------------------------
     (BNF stands for Backus-Naur Form;
     Backus and Naur were two people on the Algol 60 committee.)

     ... (BNF)
        "::=" means "produces" or "can become", 
              written "->" in modern grammar books
         "|" means "or", which separates alternatives
        <X> is a non-terminal
         c  not inside < and > is a terminal symbol
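
     As a rough illustration of reading BNF, each rule of the
     <BNumber> example can be turned into one Python function, with
     "|" becoming a choice between cases. This is a sketch, not course
     code; the function names are made up.

```python
def is_bdigit(ch):
    """<BDigit> ::= 0 | 1"""
    return ch in ("0", "1")


def is_bnumber(text):
    """<BNumber> ::= <BDigit> <BNumber> | <BDigit>"""
    if len(text) == 0:
        return False  # neither alternative produces the empty string
    if len(text) == 1:
        return is_bdigit(text)  # second alternative: a single <BDigit>
    # first alternative: a <BDigit> followed by a shorter <BNumber>
    return is_bdigit(text[0]) and is_bnumber(text[1:])
```

     So is_bnumber("1011") holds, while "" and "102" are rejected.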

**** grammars as games
------------------------------------------
         GRAMMARS AS RULES OF GAMES

A grammar can be seen as describing
  two games:

   - A production game
      (Can you produce this string?)
   - A recognition/parsing game
      (Is this string in the language?)
      
------------------------------------------

***** production game
------------------------------------------
            PRODUCTION GAME

Goal: produce a string in the language
        from the start symbol

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Can we produce "Johnny is good"?


------------------------------------------
        Here the nonterminals are in angle brackets
        Q: Could you win the production game and produce
           the string "Johnny is good"?
           Yes

           Trace of the game:
              <Sentence>
                -> <Noun> <Verb> <Adj>
                -> Johnny <Verb> <Adj>
                -> Johnny is <Adj>
                -> Johnny is good

         Q: Could you produce the sentence "Charlie can be difficult"?
          Yes!
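
         Since this grammar has no recursion, every way of winning the
         production game can be listed. The following Python sketch
         (not course code) does that; it assumes the terminal "can be"
         is treated as the two words "can" and "be".

```python
from itertools import product

# The example grammar: each nonterminal maps to its alternatives,
# and each alternative is a list of symbols.
GRAMMAR = {
    "<Sentence>": [["<Noun>", "<Verb>", "<Adj>"]],
    "<Noun>": [["Johnny"], ["Sue"], ["Charlie"]],
    "<Verb>": [["is"], ["can", "be"]],
    "<Adj>": [["good"], ["difficult"]],
}


def produce(symbol):
    """Return every list of words that symbol can produce.

    Terminates only because this grammar is not recursive.
    """
    if symbol not in GRAMMAR:
        return [[symbol]]  # a terminal produces only itself
    results = []
    for alternative in GRAMMAR[symbol]:
        # Pick one result for each symbol of the alternative, in order.
        for parts in product(*(produce(s) for s in alternative)):
            results.append([word for part in parts for word in part])
    return results


SENTENCES = {" ".join(words) for words in produce("<Sentence>")}
```

         SENTENCES then holds the 3 * 2 * 2 = 12 strings of the language,
         including "Johnny is good" and "Charlie can be difficult".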

***** recognition or parsing game
------------------------------------------
        RECOGNITION OR PARSING GAME

Goal: determine if a string is in
      the language of the grammar

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

   Is "Johnny is good" in the language
   of this grammar?


------------------------------------------
       Q: Could you win the recognition game
           on the string "Johnny is good"?
           Yes

           Trace of the game:
                  Johnny is good
               <- <Noun> is good
               <- <Noun> <Verb> good
               <- <Noun> <Verb> <Adj>
               <- <Sentence>
        Q: Why are those arrows backwards?
        Each step applies a production in reverse, replacing
        a right-hand side by its left-hand side

        Q: What does this have to do with parsing in a compiler?
        The parser determines if the program is in the language
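
        The recognition game can be played mechanically by trying
        reductions until only the start symbol is left. A Python sketch
        under the same assumption that "can be" is two words (the names
        RULES, reductions, and recognize are made up; real parsers avoid
        this kind of exhaustive search):

```python
RULES = [
    ("<Sentence>", ["<Noun>", "<Verb>", "<Adj>"]),
    ("<Noun>", ["Johnny"]), ("<Noun>", ["Sue"]), ("<Noun>", ["Charlie"]),
    ("<Verb>", ["is"]), ("<Verb>", ["can", "be"]),
    ("<Adj>", ["good"]), ("<Adj>", ["difficult"]),
]


def reductions(form):
    """Yield each form made by replacing one right-hand side with its left-hand side."""
    for lhs, rhs in RULES:
        n = len(rhs)
        for i in range(len(form) - n + 1):
            if form[i:i + n] == rhs:
                yield form[:i] + [lhs] + form[i + n:]


def recognize(form, trace=None):
    """Return the forms from the input back to <Sentence>, or None if the game is lost."""
    trace = (trace or []) + [form]
    if form == ["<Sentence>"]:
        return trace
    for reduced in reductions(form):
        result = recognize(reduced, trace)
        if result is not None:
            return result
    return None
```

        recognize("Johnny is good".split()) returns a five-step trace
        ending in ["<Sentence>"]; recognize("good is Johnny".split())
        returns None.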

**** derivation trees
------------------------------------------
         DERIVATION (OR PARSE) TREES

def: a *tree* is a finite set of nodes
     connected by directed edges
     that is connected and has no cycles

def: a *derivation tree* for grammar G
     is a tree such that:
        - Every node has a label that is
          a symbol of G
        - The root is labeled by
          the start symbol of G
        - Every node with a
          direct descendant is labeled by
          a nonterminal
        - If the direct descendants of a node
          labeled by N have the following
          labels (in order):
             A, B, C, ..., K
          then G has a production of form
             N -> A B C ... K
------------------------------------------
        These definitions are from the book
        "Formal Languages and their Relation
         to Automata" by Hopcroft and Ullman
         (Addison-Wesley, 1969).

------------------------------------------
        EXAMPLE DERIVATION TREE

Example Grammar:
      <Sentence> -> <Noun> <Verb> <Adj>
      <Noun> -> Johnny | Sue | Charlie
      <Verb> -> is | can be
      <Adj> -> good | difficult

String to parse: "Johnny is good"

         <Sentence>
         /    |   \
        /     |    \
       v      v     v
    <Noun> <Verb> <Adj>
       |      |     |
       v      v     v
    Johnny   is    good
------------------------------------------
        Q: Why does the order of nodes matter?
        Because the order in the string matters...

        The ASTs we use in a compiler will be representations of these
        parse trees
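
        The conditions in the definition can be checked directly.
        A Python sketch (not course code), assuming a node is a pair
        (label, children) and that "can be" counts as two symbols;
        all function names are hypothetical:

```python
GRAMMAR = {
    "<Sentence>": [["<Noun>", "<Verb>", "<Adj>"]],
    "<Noun>": [["Johnny"], ["Sue"], ["Charlie"]],
    "<Verb>": [["is"], ["can", "be"]],
    "<Adj>": [["good"], ["difficult"]],
}

# The example tree for "Johnny is good"; a leaf has no children.
TREE = ("<Sentence>", [
    ("<Noun>", [("Johnny", [])]),
    ("<Verb>", [("is", [])]),
    ("<Adj>", [("good", [])]),
])


def symbols(grammar):
    """All nonterminals and terminals mentioned in the grammar."""
    found = set(grammar)
    for alternatives in grammar.values():
        for alternative in alternatives:
            found.update(alternative)
    return found


def well_formed(node, grammar, known):
    label, children = node
    if label not in known:
        return False  # every label must be a symbol of G
    if not children:
        return True
    # The children's labels, in order, must be the right-hand side of a
    # production for this label (so the label must be a nonterminal).
    labels = [child[0] for child in children]
    if labels not in grammar.get(label, []):
        return False
    return all(well_formed(child, grammar, known) for child in children)


def is_derivation_tree(tree, grammar, start):
    return tree[0] == start and well_formed(tree, grammar, symbols(grammar))


def frontier(node):
    """The leaf labels, read left to right."""
    label, children = node
    if not children:
        return [label]
    return [leaf for child in children for leaf in frontier(child)]
```

        Here frontier(TREE) gives back the words of the parsed string in
        order, and swapping two children of the root makes the check fail.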

**** extensions to BNF (EBNF)
------------------------------------------
        EXTENSIONS TO BNF (EBNF)

Arbitrary number of repeats:

    { x }  means 0 or more repeats of x

   <N> ::= { <Z> }

   is equivalent to:

   <N> ::= <Z-seq>
   <Z-seq> ::= <empty> | <Z-seq> <Z>
   <empty> ::=

   {<Z>} is also written as:
       <Z>*  or  [ <Z> ] ...

One-or-more repeats:

     x+  means 1 or more repeats of x

   <N> ::= <X>+

   is equivalent to:

   <N> ::= <Xs>
   <Xs> ::= <X> | <Xs> <X>

   <X>+ is sometimes written as <X> ...
                   or <X> [ <X> ] ...

Optional element:

    [ x ]  means 0 or 1 occurrences of x

     <N> ::= [ <Y> ]

     is equivalent to:

     <N> ::= <Y-opt>
     <Y-opt> ::= <empty> | <Y>
------------------------------------------
        Note, it's good practice to use an explicit production <empty>,
        but some use a special symbol, like \epsilon,
        for the empty string of symbols

        Q: What is EBNF notation like that you may have seen before?
        Regular expressions combined with BNF (esp. *, +)
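
        One reason EBNF is convenient: in a hand-written parser,
        { x } tends to become a loop and [ x ] an if. A Python sketch
        (not course code), assuming x is a single token and that the
        input is a list of tokens; the function names are made up:

```python
def repeat(tokens, pos, item):
    """{ item }: consume zero or more copies of item; return the new position."""
    while pos < len(tokens) and tokens[pos] == item:
        pos += 1
    return pos


def one_or_more(tokens, pos, item):
    """item+: consume one or more copies of item; return None if there is none."""
    if pos < len(tokens) and tokens[pos] == item:
        return repeat(tokens, pos + 1, item)
    return None


def optional(tokens, pos, item):
    """[ item ]: consume item if it comes next; otherwise consume nothing."""
    if pos < len(tokens) and tokens[pos] == item:
        return pos + 1
    return pos
```

        For example, repeat(list("zzzx"), 0, "z") stops at position 3,
        and one_or_more(list("x"), 0, "z") reports failure with None.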

**** examples

------------------------------------------
    READING A BNF GRAMMAR

Example rules:

  <DecimalConstant> ::=
	<NonZeroDigit> <Digits>
  <NonZeroDigit> ::= 1 | 2 | 3 | 4 | 5
                    | 6 | 7 | 8 | 9
  <Digit> ::= 0 | <NonZeroDigit>
  <Digits> ::= <Digit> | <Digit> <Digits>

------------------------------------------

	Q: can you give an example of a <DecimalConstant>?
           30 and 3402
           are examples
           (but not 3 by itself, since <Digits> must produce
            at least one <Digit>)

	Caveat: the above grammar is lexical, that is, the terminals
		are characters instead of tokens
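
	Because this grammar is lexical (and regular), it can be
	restated as a regular expression. A Python sketch (not course
	code) that follows the rules exactly as written above, so
	<Digits> stands for one or more digits:

```python
import re

# <NonZeroDigit> is [1-9]; <Digit> is [0-9]; <Digits> is one or more <Digit>.
DECIMAL_CONSTANT = re.compile(r"[1-9][0-9]+")


def is_decimal_constant(text):
    """True when the whole of text is a <DecimalConstant> under the rules as written."""
    return DECIMAL_CONSTANT.fullmatch(text) is not None
```

	Under these rules "3402" and "30" are accepted, while "0", "0123",
	and the single digit "3" are not.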


------------------------------------------
      EBNF GRAMMAR FOR (SUBSET OF) PL/0

<program> ::= <block> .
<block> ::= <const-decls>
            <var-decls>
            <proc-decls>
            <stmt>
<const-decls> ::= {<const-decl>}
<const-decl> ::= const <const-def>
                   {<comma-const-def>} ;
<const-def> ::= <ident> = <number>
<comma-const-def> ::= , <const-def>
<var-decls> ::= {<var-decl>}
<var-decl> ::= var <idents> ;
<idents> ::= <ident> {<comma-ident>}
<comma-ident> ::= , <ident>

<proc-decls> ::= {<proc-decl>}
<proc-decl> ::= procedure <ident> ; <block> ;
<stmt> ::= <ident> := <expr>
       | call <ident>
       | begin <stmt> {<semi-stmt>} end
       | if <condition> then <stmt> <else-opt>
       | while <condition> do <stmt>
       | read <ident>
       | write <ident>
       | skip
<semi-stmt> ::= ; <stmt>
<else-opt> ::= <empty> | else <stmt>
<empty> ::=
<condition> ::= odd <expr>
           | <expr> <rel-op> <expr>

------------------------------------------
        Note: comments are from a # to the end of the line

------------------------------------------
          EXAMPLES IN PL/0

Shortest program:
  skip.

Factorial program:

  var n, res; # input and result
  procedure fact;
  begin
     read n;
     res := 1;
     while n <> 0 do
       begin
          res := res * n;
          n := n-1
       end;
     write res
  end;
  call fact.

------------------------------------------
        Q: Does the factorial program parse correctly?
        I think so...
        Q: Does PL/0 use semicolons to end statements or to separate them?
        To separate them!
        Q: How can you tell?
        See the grammar for begin-end statements: <semi-stmt> puts a ;
        before each statement after the first, and no <stmt> alternative
        ends with a ;
        Q: How does C use semicolons?
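
        The separator reading of PL/0 semicolons can be checked with a
        small recursive-descent recognizer. This Python sketch (not
        course code) covers only a subset of <stmt>: skip, read, write,
        call, and begin-end. The whitespace-based tokenizer and all
        function names are simplifying assumptions.

```python
KEYWORDS = {"begin", "end", "skip", "read", "write", "call"}


def tokenize(text):
    """Split on whitespace, treating each ';' as its own token."""
    return text.replace(";", " ; ").split()


def parse_stmt(tokens, pos):
    """Return the position just after one <stmt>, or raise SyntaxError."""
    if pos >= len(tokens):
        raise SyntaxError("statement expected at end of input")
    tok = tokens[pos]
    if tok == "skip":
        return pos + 1
    if tok in ("read", "write", "call"):
        nxt = tokens[pos + 1] if pos + 1 < len(tokens) else ""
        if nxt.isalpha() and nxt not in KEYWORDS:
            return pos + 2
        raise SyntaxError(f"identifier expected after {tok}")
    if tok == "begin":
        pos = parse_stmt(tokens, pos + 1)
        # {<semi-stmt>}: every ';' must be followed by another statement
        while pos < len(tokens) and tokens[pos] == ";":
            pos = parse_stmt(tokens, pos + 1)
        if pos < len(tokens) and tokens[pos] == "end":
            return pos + 1
        raise SyntaxError("'end' expected")
    raise SyntaxError(f"unexpected token {tok!r}")


def is_stmt(text):
    """True when text is exactly one statement of the subset."""
    tokens = tokenize(text)
    try:
        return parse_stmt(tokens, 0) == len(tokens)
    except SyntaxError:
        return False
```

        is_stmt("begin skip; skip end") holds, but
        is_stmt("begin skip; skip; end") does not: the last ; has no
        statement after it, so ; separates statements here.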