COP 3402 meeting -*- Outline -*-

* architecture

** Parser, Symbol Table, and Code Generation

*** tasks needed
------------------------------------------
      WHAT A COMPILER NEEDS TO DO

- static analysis
    e.g., type checking needs:
    - each name's type


- code generation needs:
   - each var or const name's lexical addr
   - each procedure's starting address


------------------------------------------
        Note that types of names are the base case for type checking
              (e.g., need the types of a and i to type check a[i]),
           along with the types of operators (+, etc.)

        Q: Must we always type check statically?
            No, but different kinds of data
               will need different opcodes (e.g., int vs. float +, *, -, ...)
            Can make everything the same type (ints),
               but will still have some names
                   that can't be used in some places
                         (e.g., - if p1 and p2 are procedures,
                                  then p1 + p2 is an error, as is p1 = p2,
                                - can't assign to constants)

        Q: What information does the VM need to load from and store to variables?
              the lexical address,
                i.e., how many static links to follow and offset in AR
        Q: What information does the VM need for procedure calls?
              address of where the code for the procedure starts
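
        As a rough illustration of the lexical address idea, here is a
        minimal C sketch of how a VM might find a variable. Everything in
        it is an assumption made for the example, not the course VM:
        the names lexical_addr, ar_base, load_var, and store_var are
        hypothetical, memory is a small int array, and word 0 of each AR
        is assumed to hold the static link (the base of the enclosing AR).

```c
#include <assert.h>

/* Hypothetical lexical address (names are illustrative only). */
typedef struct {
    unsigned int levels_outward; /* how many static links to follow */
    unsigned int offset;         /* offset within the AR found that way */
} lexical_addr;

enum { MEM_SIZE = 64 };

/* Assumed VM data memory; word 0 of each AR holds its static link. */
int memory[MEM_SIZE];

/* Follow `levels` static links, starting from the current AR's base. */
int ar_base(int current_base, unsigned int levels)
{
    int base = current_base;
    while (levels > 0) {
        assert(0 <= base && base < MEM_SIZE);
        base = memory[base]; /* static link: base of the enclosing AR */
        levels--;
    }
    return base;
}

/* Load the value of the variable at lexical address la. */
int load_var(int current_base, lexical_addr la)
{
    int addr = ar_base(current_base, la.levels_outward) + (int)la.offset;
    assert(0 <= addr && addr < MEM_SIZE);
    return memory[addr];
}

/* Store value into the variable at lexical address la. */
void store_var(int current_base, lexical_addr la, int value)
{
    int addr = ar_base(current_base, la.levels_outward) + (int)la.offset;
    assert(0 <= addr && addr < MEM_SIZE);
    memory[addr] = value;
}
```

        For example, with an inner AR at base 10 whose static link is 0,
        the address {1, 3} refers to memory[3]
        and {0, 2} refers to memory[12].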

*** data structures needed
**** symbol table
------------------------------------------
          SYMBOL TABLE

Gives information about each name:


def: a *symbol table* maps


------------------------------------------
  ... - kind of name (const, var, procedure)
      - type (or at least size)
      - offset in its AR

  ... each (declared) name
      to its attributes.
      (An attribute is a property important to the compiler,
      such as kind of name, type, or size)

      Q: What kind of data structure would work to store this information?
        A mapping of some sort (names to attributes):
            - array with linear search
            - binary search tree (with tree search)
            - hash table with hashing
            
**** storage layout
------------------------------------------
          STORAGE LAYOUT

Compiler needs to track:
   memory layout within a scope:
      - at what offset can the next
        variable or constant be stored?


   instruction layout:
      - what address to jump to?
         - for if-statements, while-loops

         
------------------------------------------
        Q: What kind of data structure could be used to store this layout?
           Need at least a counter (int variable) to store the offset
              and need to make sure it is updated properly
                  (so operations of an ADT would be useful)

        Q: How would an if-then-else be processed in a VM?
           evaluate the test/condition 
           jmp conditionally to the else part if it's false
           at the end of the if-part, jump to just past the else part

        Q: When generating code for an if-then-else,
           how can the compiler know the target addresses to jump to?
           It has to know how big each part is (how many instructions),
           so it can conditionally jump around the then block
           and have the end of the then block jump around the else
           block (unconditionally)
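
        The size-based calculation described above can be sketched in C
        as follows. This is only one possible arrangement, under stated
        assumptions: the opcodes OP_NOP, OP_JMP, and OP_JMPF, the types
        instr and code_seq, and the function names are all hypothetical,
        and jump arguments are assumed to be offsets relative to the
        address of the jump instruction itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; OP_JMPF jumps when the tested condition is false. */
typedef enum { OP_NOP, OP_JMP, OP_JMPF } opcode;

typedef struct {
    opcode op;
    int arg; /* for jumps: offset relative to the jump's own address */
} instr;

enum { MAX_CODE = 64 };

/* A piece of generated code whose size (len) is known. */
typedef struct {
    instr code[MAX_CODE];
    size_t len;
} code_seq;

void seq_init(code_seq *s) { s->len = 0; }

/* Add one instruction to the end of s. */
void seq_add(code_seq *s, opcode op, int arg)
{
    assert(s->len < MAX_CODE);
    s->code[s->len].op = op;
    s->code[s->len].arg = arg;
    s->len++;
}

/* Append all of src's instructions to dst. */
void seq_append(code_seq *dst, const code_seq *src)
{
    for (size_t i = 0; i < src->len; i++) {
        seq_add(dst, src->code[i].op, src->code[i].arg);
    }
}

/* Layout: cond; JMPF to else-part; then-part; JMP past else-part; else-part.
   The jump offsets come directly from the sizes of the parts. */
void gen_if_then_else(code_seq *out, const code_seq *cond,
                      const code_seq *then_part, const code_seq *else_part)
{
    seq_append(out, cond);
    /* skip the then-part and the JMP that ends it */
    seq_add(out, OP_JMPF, (int)then_part->len + 2);
    seq_append(out, then_part);
    /* skip the else-part */
    seq_add(out, OP_JMP, (int)else_part->len + 1);
    seq_append(out, else_part);
}
```

        Generating each part into its own sequence first is what makes
        the sizes available before the jumps are emitted.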

**** summary
------------------------------------------
         SUMMARY OF DATA NEEDS

Symbol Table:
   maps names -> attributes
                  (kind, size, etc.)

Code Manager:
  - data allocated in a frame
  - instruction counts for pieces of code


------------------------------------------
        Q: For the symbol table, what operations are needed?
            - look up the attributes of a name
            - check if a name has a mapping (avoid duplicates)
            - add a mapping (name -> attributes) to the table
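
        A minimal C sketch of these three operations, using the simplest
        option (an array with linear search). The type and function
        names (id_kind, id_attrs, symtab, symtab_lookup, symtab_defined,
        symtab_add) and the fixed capacity are assumptions made for the
        example; it also handles only a single scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical attribute record: kind of name, size, offset in its AR. */
typedef enum { KIND_CONST, KIND_VAR, KIND_PROC } id_kind;

typedef struct {
    id_kind kind;
    unsigned int size;
    unsigned int offset;
} id_attrs;

enum { MAX_SYMBOLS = 128, MAX_NAME_LEN = 32 };

typedef struct {
    char name[MAX_NAME_LEN];
    id_attrs attrs;
} symtab_entry;

/* Single-scope symbol table stored as an array. */
typedef struct {
    symtab_entry entries[MAX_SYMBOLS];
    size_t count;
} symtab;

void symtab_init(symtab *st) { st->count = 0; }

/* Look up the attributes of name; returns NULL if there is no mapping. */
const id_attrs *symtab_lookup(const symtab *st, const char *name)
{
    for (size_t i = 0; i < st->count; i++) {
        if (strcmp(st->entries[i].name, name) == 0) {
            return &st->entries[i].attrs;
        }
    }
    return NULL;
}

/* Does name already have a mapping? */
bool symtab_defined(const symtab *st, const char *name)
{
    return symtab_lookup(st, name) != NULL;
}

/* Add name -> attrs; returns false (and adds nothing) on a duplicate. */
bool symtab_add(symtab *st, const char *name, id_attrs attrs)
{
    assert(st->count < MAX_SYMBOLS);
    assert(strlen(name) < MAX_NAME_LEN);
    if (symtab_defined(st, name)) {
        return false;
    }
    strcpy(st->entries[st->count].name, name);
    st->entries[st->count].attrs = attrs;
    st->count++;
    return true;
}
```

        Because callers only see the three operations, the array could
        later be swapped for a search tree or hash table without
        changing the rest of the compiler.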
            
        Q: For the code manager, what operations are needed?
            - Allocate a data slot (for a given type)
            - Return total allocation (next offset for an allocation)

            - Allocate an instruction (of a given format)
            - Return size of instructions in a piece of code
                  (offsets for jumps)
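
        A minimal C sketch of such a code manager, as two counters
        wrapped in ADT operations. The names (code_mgr, cm_alloc_data,
        cm_total_data, cm_alloc_instr, cm_next_instr_addr, cm_code_size)
        are hypothetical, and it assumes every instruction format
        occupies one address, which may not hold for a real VM.

```c
#include <assert.h>

/* Hypothetical code manager: tracks data layout and instruction layout. */
typedef struct {
    unsigned int next_offset; /* next free data offset in the current frame */
    unsigned int next_instr;  /* address of the next instruction to allocate */
} code_mgr;

void cm_init(code_mgr *cm)
{
    cm->next_offset = 0;
    cm->next_instr = 0;
}

/* Allocate a data slot of the given size; returns its offset. */
unsigned int cm_alloc_data(code_mgr *cm, unsigned int size)
{
    assert(size > 0);
    unsigned int offset = cm->next_offset;
    cm->next_offset += size;
    return offset;
}

/* Total data allocated so far (also the next offset for an allocation). */
unsigned int cm_total_data(const code_mgr *cm)
{
    return cm->next_offset;
}

/* Allocate one instruction; returns its address.
   Assumption: every instruction format takes exactly one address. */
unsigned int cm_alloc_instr(code_mgr *cm)
{
    return cm->next_instr++;
}

/* Address that the next allocated instruction will get. */
unsigned int cm_next_instr_addr(const code_mgr *cm)
{
    return cm->next_instr;
}

/* Number of instructions allocated since start_addr (used for jump offsets). */
unsigned int cm_code_size(const code_mgr *cm, unsigned int start_addr)
{
    assert(start_addr <= cm->next_instr);
    return cm->next_instr - start_addr;
}
```

        Keeping the counters behind these operations is what ensures
        they are updated consistently on every allocation.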

** Basic architectural issues for static analysis and code generation
------------------------------------------
     HOW SHOULD PARSER COMMUNICATE?

Strategies:

 [Action-based]
   During parsing:
      - for nonterminal <N>,
        action:
          - adds to symbol table
          - checks for errors
          - allocates and tracks storage
          - emits code

 [Tree-based]
   During parsing:
     - for nonterminal <N>,
       action:
          - creates and returns an AST
   Walk tree to:
     - build symbol table
     - check for errors
     - allocate and track storage
     - (improve the tree's computation
         e.g., eliminate unneeded
               computation steps)
     - emit code
------------------------------------------

        Q: If we use a tree-based approach,
           then when and how does the symbol table get built?
        There would be a tree walk that would construct the symbol table
            (or several such, either to handle mutual recursion
                              or to decorate the AST)

        Hybrid strategies are also possible,
        such as having the parser
            build the symbol table and return an AST for later processing
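
        To make the tree-based strategy concrete, here is a small C
        sketch with two separate tree walks over a hand-built AST.
        The AST shape (var_decl, ident_use, block_ast) and the walk
        functions are hypothetical, and as a simplification the
        decorated declaration list stands in for the symbol table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical AST node for a variable declaration. */
typedef struct var_decl {
    const char *name;
    int offset;            /* decoration set by walk_decls; -1 if rejected */
    struct var_decl *next;
} var_decl;

/* Hypothetical AST node for a use of an identifier. */
typedef struct ident_use {
    const char *name;
    const var_decl *decl;  /* decoration set by walk_uses; NULL if undeclared */
    struct ident_use *next;
} ident_use;

typedef struct {
    var_decl *decls;
    ident_use *uses;
} block_ast;

/* Walk 1: give each declaration the next offset and reject duplicates.
   Returns the number of duplicate declarations found. */
int walk_decls(block_ast *b)
{
    assert(b != NULL);
    int next_offset = 0;
    int errors = 0;
    for (var_decl *d = b->decls; d != NULL; d = d->next) {
        int duplicate = 0;
        for (const var_decl *p = b->decls; p != d; p = p->next) {
            if (strcmp(p->name, d->name) == 0) {
                duplicate = 1;
            }
        }
        if (duplicate) {
            d->offset = -1;
            errors++;
        } else {
            d->offset = next_offset++;
        }
    }
    return errors;
}

/* Walk 2: link each use to its (accepted) declaration.
   Returns the number of uses of undeclared names. */
int walk_uses(block_ast *b)
{
    assert(b != NULL);
    int errors = 0;
    for (ident_use *u = b->uses; u != NULL; u = u->next) {
        u->decl = NULL;
        for (const var_decl *d = b->decls; d != NULL; d = d->next) {
            if (d->offset >= 0 && strcmp(d->name, u->name) == 0) {
                u->decl = d;
                break;
            }
        }
        if (u->decl == NULL) {
            errors++;
        }
    }
    return errors;
}
```

        Since each walk takes an AST as input, it can be exercised on
        a tree built by hand, without running the parser at all.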

------------------------------------------
        ADVANTAGES AND DISADVANTAGES

Action-based architecture:


Tree-based:


------------------------------------------
  ... [Action-based]
      + easier to have parsing and static analysis influence lexical analysis
               (e.g., parsing declarations in C or C++
                      needs to know what names are types)
      + easier to start coding
        but harder to split up work, because:
         - many different tasks interwoven with parsing
         - so harder to maintain and debug (a big problem!)
           (i.e., it's less modular!)

      - development is more ad hoc
      
  ... [Tree-based]
      - harder to have parsing and static analysis influence lexical analysis
           (so harder to parse some languages)
      + phases can be separated,
          since data for testing each phase can be constructed manually
      + code not all interwoven
      + so easier to maintain and debug
         (i.e., it's more modular!)

      + more principled development (some theory here)