COP 3402 meeting -*- Outline -*-

* Overview

------------------------------------------
            SYSTEMS SOFTWARE

Helping people run programs


  Human     -----> [ Translator ]
  Readable         /    |
  Input           /     |
  (Prog. text)   /      |
                /       v
               v     Object
           Library   Code
           (Binary)  (Binary)
             \          |
              \         |
               \        v
                \->[ Linker/ Loader ]
                        |
                        |
                        v
                      Process
                        |
                        |
                        v
                    [ OS + Computer ]
                        |
                        |
                        v
                      Computation

------------------------------------------
        Q: How would a debugger fit into this picture?
        It would interact with the process and OS+Computer

        Q: What job or jobs does a translator have?
        Two:
           - translating the source language into machine code
           - giving the user (programmer) good feedback (error messages)

        Q: What jobs would an IDE (like Visual Studio) do?
           - help edit the program text,
           - run the translator (usually incrementally)
           - link and load the entire program and report errors
           - run the program, with debugging support

** Compiler vs. Interpreter
------------------------------------------
          RECALL DEFINITIONS

def: An *assembler* translates
     a low-level language that is close
     to machine code into machine code.


def: A *compiler* translates a
     (high-level) language into a form
     that can be (more easily)
     executed by a computer.


def: An *interpreter* executes a
     programming language directly,
     often without translating it
     into object code first


------------------------------------------
        Q: Which of these is most like what a VM does?
           an interpreter, as it executes instructions directly

        Q: Does the VM you are writing do loading?
           yes, it loads the program into memory to execute it.

        Assembly language uses human-readable names for instructions
        and macros. Typically there is a one-to-one correspondence
        between instructions in assembly language and instructions
        in the translated machine code.

        Modern interpreters often do some translation into an
        internal format (e.g., a virtual machine language),
        and use a VM to run the program (for portability)
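
        A minimal sketch of the fetch-decode-execute loop at the
        heart of such a VM. The opcodes, the instr_t type, and
        vm_run are hypothetical names made up for illustration;
        this is not the instruction set of the VM used in the course.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a tiny stack machine (illustration only). */
typedef enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT } opcode_t;

typedef struct {
    opcode_t op;
    int arg;            /* used only by OP_PUSH */
} instr_t;

#define STACK_MAX 64

/* Run the program until OP_HALT; return the value on top of the stack. */
int vm_run(const instr_t *program)
{
    int stack[STACK_MAX];
    size_t sp = 0;      /* number of values currently on the stack */
    size_t pc = 0;      /* index of the next instruction */

    assert(program != NULL);
    for (;;) {
        instr_t in = program[pc++];     /* fetch */
        switch (in.op) {                /* decode and execute */
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = in.arg;
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        }
    }
}
```

        For example, the program PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT
        computes (2 + 3) * 4. Each VM instruction costs a trip around
        the loop, which is the indirection that makes interpretation
        slower than running machine code directly.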

        Q: What are some examples of languages with interpreters?
        Python, Java (to some extent), C#, (also: Lisp, Scheme, ...)

        Modern languages, like Smalltalk, Java, and C#,
        often combine interpreters and compilers,
        using a "just-in-time compiler" that only compiles
        programs into machine code as needed, and interprets seldom
        used code. Such systems may be called *hybrids* of compilers
        and interpreters.

------------------------------------------
            COMPILER PICTURE

  Source -----> [ Lexical Analyzer ]
                        |
                        v
                 Token stream
                        |
                        v
                [     Parser       ]
                        |
                        v
                  Parse Tree (AST)
                   + Symbol Table
                        |
                        v
                 [ Static Analysis ]
                        |
                        v
                  Parse Tree (AST)
                   + Symbol Table
                        |
                        v
                 [ Code Generator ]
                        |
                        v
                   Object Code
                        |
                        v
                 [ Linker/ Loader ]
                        |
                        v
                      Process
                        |
                        v
                 [  OS + Computer ]

------------------------------------------
        Point out the front end (lexical analysis and parsing), back
        end (code generator) and "middle end" (static analysis)

------------------------------------------
             INTERPRETER PICTURE

  Source -----> [ Lexical Analyzer ]
                        |
                        v
                 Token stream
                        |
                        v
                [     Parser       ]
                        |
                        v
                  Parse Tree (AST)
                   + Symbol Table
                        |
                        v
                [ Interpreter     ]
                        |
                        v
                 [  OS + Computer ]

------------------------------------------
        The interpreter is a process running on the computer;
        it executes the instructions, perhaps using the AST.
        An interpreter may also do static analysis.
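
        A minimal sketch of an interpreter that executes directly
        from the AST, assuming a language of integer arithmetic
        expressions only. The ast_t type and ast_eval are hypothetical
        names for illustration, not the homework's data structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AST for integer arithmetic expressions. */
typedef enum { AST_NUM, AST_ADD, AST_SUB, AST_MUL } ast_kind_t;

typedef struct ast {
    ast_kind_t kind;
    int value;                  /* used when kind == AST_NUM */
    const struct ast *left;     /* operands for the binary kinds */
    const struct ast *right;
} ast_t;

/* Evaluate the tree by recursion over its structure. */
int ast_eval(const ast_t *t)
{
    assert(t != NULL);
    switch (t->kind) {
    case AST_NUM: return t->value;
    case AST_ADD: return ast_eval(t->left) + ast_eval(t->right);
    case AST_SUB: return ast_eval(t->left) - ast_eval(t->right);
    case AST_MUL: return ast_eval(t->left) * ast_eval(t->right);
    }
    assert(0 && "unknown AST kind");
    return 0;
}
```

        No object code is produced here: the tree is re-walked
        every time the expression is evaluated.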

    Note: we will be doing something like this in the homework
          HW2: Lexical analysis
          HW3: Parsing
          HW4: Code generation and execution

*** advantages and disadvantages

------------------------------------------
       ADVANTAGES AND DISADVANTAGES

Compiler
  Advantages:


  Disadvantages:


Interpreters
  Advantages:


  Disadvantages:


------------------------------------------
  ...
     Compilers
      Advantages:
          - often produce code that executes faster (perhaps 10x faster)
             why?
               - optimization (algorithms to make code run faster)
               - direct execution of object code on machine
                  (vs. indirect execution of a VM, so little overhead)
          - separates preparation (compile time) from running (runtime),
            so there is time (before runtime) to:
             - carefully analyze code for problems
             - improve (optimize) code
             - carry out complex transformations
                (e.g., expanding macros, expanding generics)
          - compiled (object) code is hard to read
                so can help protect IP (but this protection keeps diminishing)

      Disadvantages:
         - compiled (object) code can be quite different from source,
                so can be hard to debug
         - compilation time can slow development
               (long waits for the compiler cause breaks between
                writing code and testing it)
         - object code can be hard to check
                so can lead to security vulnerabilities
         - object code may be large, so slow to send over network
         - compilers are bigger programs, so harder to write

     Interpreters
      Advantages:
         - little time needed to start running a program
               so better feedback for initial development efforts
         - intermediate representation can be very concise
               so faster to send over network (Java's initial pitch)
               (often VM code is optimized to use very little space)
         - source code is easier to understand than object code
               so easier to:
                  - debug
                  - check or verify
                  - protect against security vulnerabilities
         - smaller program than a compiler,
               so
                 - better for prototyping language ideas
                 - better for creating very high-level languages
                      (e.g., SPSS and R, Matlab, ...)
                 - easier to write
                 - may be more secure

      Disadvantages:
         - execution generally slower (rule of thumb about 10x slower!)
              since not much time can be spent on analysis and optimization

*** hybrids of compilers and interpreters
------------------------------------------
   HYBRID COMPILER/INTERPRETER PICTURE

  Source -----> [ Lexical Analyzer ]
                        |
                        v
                 Token stream
                        |
                        v
                [     Parser       ]
                        |
                        v
                  Parse Tree (AST)
                   + Symbol Table
                        |
                        v
                 [ Static Analysis ]
                        |
                        v
                  Parse Tree (AST)
                   + Symbol Table
                        |
                        v
                 [ Code Generator ]
                        |
                        v
                      VM Code
                        |
                        v
       /------->[ VM + JIT Compiler ]
       |statistics      |
       |                v
       \--------[  OS + Computer ]

------------------------------------------
        The VM is also a process running on the computer.
         It often contains a compiler,
         the Just-in-Time (JIT) compiler,
         which compiles frequently used VM code into machine code
           to improve runtime efficiency

------------------------------------------
   HYBRID ADVANTAGES AND DISADVANTAGES

Advantages:


Disadvantages:


------------------------------------------
  ...
     Advantages:
        - faster compilation than a full compiler,
            since the JIT only compiles what needs to be compiled,
            so less waiting
        - faster execution speed than interpreter
        - can have better optimization than compiler
            since has runtime execution information
        - VM code smaller than object code
            so faster to send over network
        - VM code easier to read
            so easier to: debug, check
        - Can build full system incrementally
            (adding JIT compiler later)

     Disadvantages:
        - VM code is easier to read, so doesn't help protect IP
        - Slower at runtime than compiled code (especially at first), because:
            - more to do, execute and also compile
            - indirect at first
            - compiling code
                takes time away from executing code
            - VM is generally larger than compiled code's runtime system
                (so more I/O to get it started)
        - hybrid is harder to write than either compiler or interpreter
            (it includes both!)

** compiler structure
------------------------------------------
        STANDARD COMPILER ARCHITECTURE
  (Based on Appel's book "Modern Compiler
                         Implementation")

  Source code (text)
         |
         v
  [    Lexer        ]
         |
         v
   Stream of tokens
         |
         v
  [    Parser       ]
         |
         v
        AST -------------\
         |                \
         v                 v
  [ Static Analyzer ] <- Symbol Table
         |                    /
         v                   /
   Intermediate Rep.        /
         |                 /
  [  HL Optimizer   ]     /
         |               /
         v              /
   Intermediate Rep.   /
         |            /
         v           v
  [ Code Generator  ]
         |
         v
    Instruction Rep.
         |
         v
  [  LL Optimizer   ]
         |
         v
    Machine Code

------------------------------------------

   Notes:
     Each part may generate error messages,
         we will stop compilation if there are any errors,
         but it's better to do error recovery and continue

     Error recovery is only useful if it doesn't make things worse

     The optimizers can be omitted (or added later)

     Lexical analyzer (lexer)
       conceptually translates stream of chars
          to stream of tokens
       but often communicates via global variables
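
       A minimal sketch of the chars-to-tokens translation,
       assuming a hand-written lexer with only a few token kinds.
       The names tok_kind_t, token_t, and next_token are hypothetical;
       a generated lexer would look different and would typically pass
       token values through global variables, whereas this sketch
       returns them explicitly.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum {
    TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_EQUAL, TOK_ERROR
} tok_kind_t;

typedef struct {
    tok_kind_t kind;
    const char *start;   /* first character of the token's text */
    size_t length;       /* number of characters in the token */
} token_t;

/* Return the next token starting at *cursor; advance *cursor past it. */
token_t next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    while (isspace((unsigned char)*p)) {
        p++;                                  /* skip white space */
    }

    token_t tok = { TOK_ERROR, p, 1 };        /* default: 1 bad char */
    if (*p == '\0') {
        tok.kind = TOK_EOF;
        tok.length = 0;
    } else if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)p[tok.length])) {
            tok.length++;
        }
        tok.kind = TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)p[tok.length])) {
            tok.length++;
        }
        tok.kind = TOK_NUMBER;
    } else if (*p == '+') {
        tok.kind = TOK_PLUS;
    } else if (*p == '=') {
        tok.kind = TOK_EQUAL;
    }
    *cursor = p + tok.length;
    return tok;
}
```

       Calling next_token repeatedly on "x = +24" yields an
       identifier, an equal sign, a plus sign, a number, and then
       end of file. The parser only ever sees this token stream,
       never the individual characters.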

    Symbol table
       conceptually a map from identifiers to attributes
          such as: lexical address,
                   type (or size),
                   source info (line, column)
                   initial value (or definition)
                   (other static analysis information)

    AST = Abstract Syntax Tree
        represents the parse
        (hierarchical structure according to the grammar)
        of the input

    The intermediate representation varies:
        Some compilers use 3-address code (like a register machine),
        Many (today) use a single-assignment language
              (where each variable is assigned only once,
               e.g., SSA form or A-normal form)
        Some use a custom representation that works for their optimizer
        We will use annotated ASTs, and manipulate those
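
        To make "3-address code" concrete, here is a minimal sketch
        that walks an expression tree and emits one instruction per
        operator, each with at most two operands and one result.
        The names expr_t, tac_out_t, and tac_gen, and the textual
        instruction format, are assumptions for illustration only;
        this is not the representation used in the course.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef struct expr {
    char op;                    /* '+', '*', ..., or 0 for a leaf */
    const char *name;           /* leaf only: a short variable name */
    const struct expr *left;    /* operator only: operands */
    const struct expr *right;
} expr_t;

typedef struct {
    char text[256];     /* generated code, one instruction per line */
    int next_temp;      /* number of the next fresh temporary */
} tac_out_t;

/* Append code for e to out; write the name holding e's value
   into result (names longer than the buffer are truncated). */
void tac_gen(const expr_t *e, tac_out_t *out,
             char *result, size_t result_size)
{
    assert(e != NULL && out != NULL && result != NULL);
    if (e->op == 0) {
        snprintf(result, result_size, "%s", e->name);
        return;
    }
    char l[16];
    char r[16];
    tac_gen(e->left, out, l, sizeof l);
    tac_gen(e->right, out, r, sizeof r);
    /* each temporary is assigned exactly once */
    snprintf(result, result_size, "t%d", out->next_temp++);
    size_t used = strlen(out->text);
    snprintf(out->text + used, sizeof out->text - used,
             "%s = %s %c %s\n", result, l, e->op, r);
}
```

        For the tree of a + b * c, starting with next_temp at 1,
        this produces
            t1 = b * c
            t2 = a + t1
        Since every temporary is written only once, the output also
        hints at what a single-assignment form looks like.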

*** tokens
**** definitions
------------------------------------------
          TOKENS

Represent distinct symbols in the input


    including punctuation and operators


Typically, comments are ignored

Reserved words:


Keywords:


White space delimits identifiers:
    aNumber   vs.  a Number
------------------------------------------
   ... e.g.,  3.14
              ident
              "a string"

   ... e.g., ;
             <=
             >
             :=
             =
             ==
             (
             )

   ... strings that would be identifiers
   but always represent a reserved word
   e.g.,    if
            else
            while
            begin
            end

   ... strings that would be identifiers
       but only play a special role
       in some contexts
          e.g., in PL/I can write:
             if if=then
             then then=else
             else if=then=else

    Reserved words make lexical analysis easier
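
    One reason: with reserved words, the lexer can scan anything
    identifier-shaped and then decide its token kind with a plain
    table lookup, without asking the parser for context.
    A minimal sketch, where word_kind_t, the reserved table, and
    classify_word are hypothetical names for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { T_IDENT, T_IF, T_ELSE, T_WHILE, T_BEGIN, T_END } word_kind_t;

/* The reserved words from the example list above. */
static const struct {
    const char *text;
    word_kind_t kind;
} reserved[] = {
    { "if", T_IF },       { "else", T_ELSE }, { "while", T_WHILE },
    { "begin", T_BEGIN }, { "end", T_END },
};

/* Classify identifier-shaped text: a reserved word gets its own
   token kind; anything else is an ordinary identifier. */
word_kind_t classify_word(const char *text)
{
    assert(text != NULL);
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++) {
        if (strcmp(text, reserved[i].text) == 0) {
            return reserved[i].kind;
        }
    }
    return T_IDENT;
}
```

    With keywords in the PL/I style, this lookup alone would not be
    enough: whether "if" is a keyword or an identifier depends on
    where it appears, so the lexer would need help from the parser.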

**** data structures
***** token types
------------------------------------------
          TOKENS

Defined in *.tab.h file
  produced by bison

Example from the SSM assembler (asm.tab.h)

#ifndef YYTOKENTYPE
# define YYTOKENTYPE
  enum yytokentype
  {
    YYEMPTY = -2,
    YYEOF = 0,        /* "end of file" */
    YYerror = 256,    /* error  */
    YYUNDEF = 257,    /* "invalid token" */
    eolsym = 258,     /* eolsym */
    identsym = 259,   /* identsym  */
    unsignednumsym = 260,
    plussym = 261,     /* "+"  */
    minussym = 262,    /* "-"  */
    commasym = 263,    /* ","  */
    dottextsym = 264,  /* ".text"  */
    dotdatasym = 265,  /* ".data"  */
    dotstacksym = 266, /* ".stack"  */
    dotendsym = 267,   /* ".end"  */
    colonsym = 268,    /* ":"  */
    lbracketsym = 269, /* "["  */
    rbracketsym = 270, /* "]"  */
    equalsym = 271,    /* "="  */
    noopsym = 272,     /* "NOP"  */
    addopsym = 273,    /* "ADD"  */
    subopsym = 274,    /* "SUB"  */
    /* ... */
    wordsym = 318,     /* "WORD"  */
    charsym = 319,     /* "CHAR"  */
    stringsym = 320,   /* "STRING"  */
    charliteralsym = 321,
    stringliteralsym = 322
  };
  typedef enum yytokentype yytoken_kind_t;
#endif


Examples:

    Input      yytokentype
=======================================
    ident      identsym
    34         unsignednumsym
    +          plussym


------------------------------------------
        Q: What would be the token types for the input
              WORD x = +24
             wordsym identsym equalsym plussym unsignednumsym

        Q: How are reserved words represented?
           Each has its own yytokentype value (e.g., WORD is wordsym)

*** symbols and symbol table

    A symbol table is a mapping (e.g., a hash table) from
    identifiers (strings) to their attributes
    (such as file location, type, lexical address, etc.)

       Q: What part of a compiler should populate the symbol table?
          The parser, unless there is no scoping (as in assembly language)
          (It's done in pass1 in the SSM assembler.)

       Q: Why should that be the tool?

       - Only the parser knows about nesting
             lexer doesn't know scope of variables
         However, for some languages, like C, it is necessary to know
             the kind of name for further parsing
             (e.g., may need to know what names are types).
             That kind of feedback may require multiple passes by the parser,
             or some interaction during lexical analysis ...

**** symbols
------------------------------------------
               SYMBOLS

What information should be remembered
  for identifiers?


------------------------------------------
        - text (name)
        - lexical address (of declaration)
            (lexical level and offset)
        - kind of name
            (if there are different namespaces,
             e.g., constants, variables, routines, types)
        - size (e.g., for arrays)
        - filename of declaration
        - line number of declaration
        - column number of declaration (but we aren't tracking this)

        Some other information might be useful,
         e.g.:
           for managing the symbol table (a delete mark), or
           for code generation may want to have a variable to
                 mark if space has been allocated (for a variable)

        How should we track lexical levels?
            count from outermost (which is 0) inwards

        Q: What would be a suitable data structure for a symbol?
           A hash table mapping names to attributes
           or an array of mappings (each is a struct)
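
        A minimal sketch of a struct holding the information listed
        above for one symbol. All type and field names here
        (symbol_t, lexical_address_t, name_kind_t, symbol_create)
        are hypothetical, chosen for illustration; the column number
        is left out since it is not being tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kind of name, for languages with several namespaces. */
typedef enum {
    KIND_CONSTANT, KIND_VARIABLE, KIND_ROUTINE, KIND_TYPE
} name_kind_t;

typedef struct {
    unsigned int level;     /* lexical level; outermost is 0 */
    unsigned int offset;    /* offset within that level */
} lexical_address_t;

typedef struct {
    const char *name;           /* text of the identifier */
    lexical_address_t address;  /* where it was declared */
    name_kind_t kind;
    unsigned int size;          /* e.g., element count for arrays */
    const char *filename;       /* file of the declaration */
    unsigned int line;          /* line of the declaration */
    bool allocated;             /* has space been allocated yet? */
} symbol_t;

/* Build a symbol for a fresh declaration; no space allocated yet. */
symbol_t symbol_create(const char *name, name_kind_t kind,
                       lexical_address_t address, unsigned int size,
                       const char *filename, unsigned int line)
{
    assert(name != NULL);
    symbol_t s;
    s.name = name;
    s.address = address;
    s.kind = kind;
    s.size = size;
    s.filename = filename;
    s.line = line;
    s.allocated = false;    /* a code generator could set this later */
    return s;
}
```

        The struct stores pointers to the name and filename rather
        than copies, so the caller must keep those strings alive for
        as long as the symbol is in use.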

**** symbol table
------------------------------------------
          SYMBOL TABLE

A *symbol table*


Does each scope have its own symbol table?


What operations would a symbol table need?


What data structure would be good?


------------------------------------------
      ... maps identifiers to their attributes

      ... yes, esp. if you can redeclare names in inner scopes

      ... lookup/fetch, add,
             If the symbol table is mutable,
             then need push scope, pop scope
             (or can use a delete operation)
             if it's not mutable, then add needs to produce a new table

      ... a hash table or stack of hash tables
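
      A minimal sketch of a mutable symbol table with the operations
      named above (lookup, add, push scope, pop scope). To stay
      short it uses a stack of small arrays searched linearly where
      a real implementation would likely use hash tables, and a
      single offset stands in for the full set of attributes.
      All names (symtab_t, symtab_add, ...) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SCOPES 16
#define MAX_NAMES_PER_SCOPE 32

typedef struct {
    const char *name;
    int offset;         /* stand-in for the symbol's attributes */
} entry_t;

typedef struct {
    entry_t entries[MAX_NAMES_PER_SCOPE];
    size_t count;
} scope_t;

typedef struct {
    scope_t scopes[MAX_SCOPES];
    size_t depth;       /* number of open scopes */
} symtab_t;

void symtab_init(symtab_t *st) { st->depth = 0; }

void symtab_push_scope(symtab_t *st)
{
    assert(st->depth < MAX_SCOPES);
    st->scopes[st->depth++].count = 0;
}

void symtab_pop_scope(symtab_t *st)
{
    assert(st->depth > 0);
    st->depth--;
}

/* Search a single scope; NULL if name is not declared in it. */
static const entry_t *scope_find(const scope_t *s, const char *name)
{
    for (size_t i = 0; i < s->count; i++) {
        if (strcmp(s->entries[i].name, name) == 0) {
            return &s->entries[i];
        }
    }
    return NULL;
}

/* Add to the innermost scope; false if already declared there. */
bool symtab_add(symtab_t *st, const char *name, int offset)
{
    assert(st->depth > 0);
    scope_t *s = &st->scopes[st->depth - 1];
    if (scope_find(s, name) != NULL) {
        return false;
    }
    assert(s->count < MAX_NAMES_PER_SCOPE);
    s->entries[s->count].name = name;
    s->entries[s->count].offset = offset;
    s->count++;
    return true;
}

/* Search from the innermost scope outward; NULL if undeclared. */
const entry_t *symtab_lookup(const symtab_t *st, const char *name)
{
    for (size_t d = st->depth; d > 0; d--) {
        const entry_t *e = scope_find(&st->scopes[d - 1], name);
        if (e != NULL) {
            return e;
        }
    }
    return NULL;
}
```

      Because lookup starts at the innermost scope, a redeclaration
      in an inner scope hides the outer one until that scope is
      popped, while a second declaration in the same scope is
      rejected by add.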