I. Code Generation
 A. overview
------------------------------------------
         OVERVIEW OF CODE GENERATION

 .. ASTs...-> [ Static Analysis ]
                     |
                     | IR
                     v
              [ Code Generation ]
                     |
                     | Machine Code
                     |
                     v
               SSM Virtual Machine
                  Execution

The IR (= Intermediate Representation)
    records


------------------------------------------
  1. IR (Intermediate Representation)
         What kind of information is needed from a name's use
            in order to generate code?
         Should the parser create the lexical address of a name's use
            during parsing?
         Is the symbol table unchanging (immutable)?
------------------------------------------
            IR TREES

An IR is a tree structure,


Helps in modularizing compilers
and code generation

   WITHOUT IR               WITH IR

Java ------>x86       Java        ->x86
     \/ |||                \     /
   C ------>MIPS         C \\   /-->MIPS
     \\/ ||                 ->IR
 C++ ------>Sparc      C++ //   \-->Sparc
      \\\/                 /     \
  C# ------>A1          C#/       \>A1

------------------------------------------
------------------------------------------
          OUR CHOICES FOR AN IR

To keep things simple, we will use
   a modified AST type as an IR

Parser:
   - records
   - provides


Static analysis:
   - records


------------------------------------------
  2. General strategy
------------------------------------------
   GENERAL STRATEGY FOR CODE GENERATION

Don't try to optimize!


Follow the grammar of


Think about the invariants!


Trust the recursion!

------------------------------------------
------------------------------------------
       FOLLOWING THE GRAMMAR

Code resembles the grammar that


When


------------------------------------------
     How does this relate to the parser?
     Why is this useful?
 B. Translation target: code sequences
------------------------------------------
         TARGET: CODE SEQUENCES

Need lists of machine code

Why?


------------------------------------------
------------------------------------------
           THE CODE TYPE

// file code.h
#include "instruction.h"

// machine code instructions
typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
} code;

// Code creation functions below
// with the named mnemonic and parameters
extern code *code_nop();
extern code *code_add(reg_num_type t,
                      offset_type ot,
                      reg_num_type s,
                      offset_type os);
extern code *code_sub(reg_num_type t,
                      offset_type ot,
                      reg_num_type s,
                      offset_type os);
// ...

------------------------------------------
        Why use that instead of just bin_instr_t?
------------------------------------------
     REPRESENTING CODE SEQUENCES IN C

// file code_seq.h
#include "code.h"

// code sequences
typedef code *code_seq;

extern code_seq code_seq_empty();

extern code_seq code_seq_singleton(
                                 code *c);

extern bool code_seq_is_empty(
                            code_seq seq);

// Requires: !code_seq_is_empty(seq)
// Return the first element...
extern code *code_seq_first(code_seq seq);

// Requires: !code_seq_is_empty(seq)
// Return the rest of the given sequence
extern code_seq code_seq_rest(
                            code_seq seq);

// Return the size (number of words)
extern unsigned int code_seq_size(
                            code_seq seq);

// ...

// Requires: c != NULL && seq != NULL
// Modify seq by adding the given
//   code *c to its end
extern void code_seq_add_to_end(
                 code_seq *seq, code *c);

// Requires: s1 != NULL && s2 != NULL
// Modifies s1 to be the concatenation
// of s1 followed by s2
extern void code_seq_concat(code_seq *s1,
                            code_seq s2);

// ...
------------------------------------------
        Why are code sequences needed?
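A minimal sketch of how some of the declarations in code_seq.h might be implemented as a singly linked list. The bin_instr_t stand-in and the helper code_make are assumptions made only so the sketch is self-contained; the real instruction.h and code.h are not shown in these notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// Stand-in for bin_instr_t from instruction.h (assumption:
// an instruction is treated as one opaque machine word)
typedef struct { int word; } bin_instr_t;

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
} code;

typedef code *code_seq;

// Hypothetical helper: allocate a list cell holding the given word
code *code_make(int word)
{
    code *c = malloc(sizeof(code));
    assert(c != NULL);
    c->next = NULL;
    c->instr.word = word;
    return c;
}

code_seq code_seq_empty(void) { return NULL; }

// Requires: c != NULL
code_seq code_seq_singleton(code *c)
{
    assert(c != NULL);
    c->next = NULL;
    return c;
}

bool code_seq_is_empty(code_seq seq) { return seq == NULL; }

// Return the size (number of words) of seq
unsigned int code_seq_size(code_seq seq)
{
    unsigned int n = 0;
    for (code *p = seq; p != NULL; p = p->next) {
        n++;
    }
    return n;
}

// Requires: s1 != NULL
// Modifies *s1 to be the concatenation of *s1 followed by s2
void code_seq_concat(code_seq *s1, code_seq s2)
{
    assert(s1 != NULL);
    if (*s1 == NULL) {
        *s1 = s2;
        return;
    }
    code *last = *s1;
    while (last->next != NULL) {
        last = last->next;
    }
    last->next = s2;
}

// Requires: seq != NULL && c != NULL
// Modifies *seq by adding c to its end
void code_seq_add_to_end(code_seq *seq, code *c)
{
    assert(seq != NULL && c != NULL);
    c->next = NULL;
    code_seq_concat(seq, c);
}
```

In this sketch the empty sequence is the NULL pointer, so building up code for a construct is a series of code_seq_concat calls on the sequences returned for its parts.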
 C. Designing Code Sequences
  1. Overall strategies
------------------------------------------
 STRATEGIES FOR DESIGNING CODE SEQUENCES

Work backwards


------------------------------------------
------------------------------------------
       EXAMPLE: EXPRESSION EVALUATION

Example: (E1 + E2) - (E3 / E4).


Constraints:
 - Expressions have a result value
 - Binary operations (+, -, *, /)
   in the SSM


Where should the result be stored?

  Can it be a register?


------------------------------------------
    Can expressions have (side-)effects in the programming language?
    Can the order of evaluating expressions matter for the language?
    What would a compiler do for a register-based ISA?
------------------------------------------
         ADDRESSING VARIABLES

Consider an expression

          x

where x is a variable

How to get x's value on the top of stack?


------------------------------------------
         Would this be different for constants?
  2. use of registers (omit)
------------------------------------------
 USE OF REGISTERS IN A REGISTER-BASED ISA

For a register-based ISA:

What if the target register
is already in use?
   e.g., in   x := y + z

Strategies:
 - use a different register


 - save and restore


------------------------------------------
  3. strategy for expression evaluation
------------------------------------------
    GENERAL STRATEGY FOR EXPRESSIONS

Each expression's value goes


To operate on an expression's value


------------------------------------------
  4. Background on SSM instructions
------------------------------------------
      BACKGROUND: SSM INSTRUCTIONS

 ADD t,ot,s,os
            "M[GPR[t]+ot]
              = M[GPR[SP]] + M[GPR[s]+os]"

 SUB t,ot,s,os
            "M[GPR[t]+ot]
              = M[GPR[SP]] - M[GPR[s]+os]"

 MUL s,o    "(HI,LO)
             = M[GPR[SP]] * M[GPR[s]+ o]"

 DIV s,o    "HI
             = M[GPR[SP]] % M[GPR[s] + o]"
            and
            "LO
             = M[GPR[SP]] / M[GPR[s] + o]"

 CPW t,ot,s,os
            "M[GPR[t]+ot] = M[GPR[s]+os]"

 CPR t,s    "GPR[t] = GPR[s]"

 ADDI r,o,i "M[GPR[r]+o]
             = M[GPR[r]+o] + sgnExt(i)"

What limitations on immediate operands?


What if the literal doesn't fit?


------------------------------------------
        What does M[GPR[SP]] mean?
        If the numbers are small enough, where is the result of a
           multiplication as a 32 bit integer located: HI or LO?
        How would you copy the value into location GPR[t]+ot
                from location GPR[s]+os?
        Are there limitations on the immediate operands for ADDI?
        What can you do if you want a constant value that doesn't fit?
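One way to check your reading of the instruction descriptions is a tiny C model of them. This is only a sketch under assumptions: the memory size, the number of registers, and the register number chosen for SP are arbitrary, and the function names (ssm_add, ssm_sub, ...) are hypothetical, not part of the SSM.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64  // assumed memory size for this sketch
#define NUM_GPRS 8    // assumed number of registers
#define SP 1          // assumed register number for SP

int32_t M[MEM_WORDS];
int32_t GPR[NUM_GPRS];
int32_t HI, LO;

// Return a pointer to the word M[GPR[reg] + offset]
static int32_t *word_at(int reg, int offset)
{
    int32_t addr = GPR[reg] + offset;
    assert(0 <= addr && addr < MEM_WORDS);
    return &M[addr];
}

// ADD t,ot,s,os: M[GPR[t]+ot] = M[GPR[SP]] + M[GPR[s]+os]
void ssm_add(int t, int ot, int s, int os)
{
    *word_at(t, ot) = *word_at(SP, 0) + *word_at(s, os);
}

// SUB t,ot,s,os: M[GPR[t]+ot] = M[GPR[SP]] - M[GPR[s]+os]
void ssm_sub(int t, int ot, int s, int os)
{
    *word_at(t, ot) = *word_at(SP, 0) - *word_at(s, os);
}

// MUL s,o: (HI,LO) = M[GPR[SP]] * M[GPR[s]+o], as a 64-bit product
void ssm_mul(int s, int o)
{
    int64_t product = (int64_t)*word_at(SP, 0) * (int64_t)*word_at(s, o);
    HI = (int32_t)(uint32_t)((uint64_t)product >> 32);
    LO = (int32_t)(uint32_t)((uint64_t)product & 0xFFFFFFFFu);
}

// DIV s,o: HI = remainder, LO = quotient
void ssm_div(int s, int o)
{
    int32_t divisor = *word_at(s, o);
    assert(divisor != 0);
    HI = *word_at(SP, 0) % divisor;
    LO = *word_at(SP, 0) / divisor;
}

// CPW t,ot,s,os: M[GPR[t]+ot] = M[GPR[s]+os]
void ssm_cpw(int t, int ot, int s, int os)
{
    *word_at(t, ot) = *word_at(s, os);
}
```

Note that the first operand of ADD, SUB, MUL, and DIV is always the word on top of the stack, M[GPR[SP]]; only the second operand and the destination are addressed explicitly.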
 D. Literal Table
------------------------------------------
            LITERAL TABLE IDEA

- Store literal values in


- Keep mapping from


- Initialize


------------------------------------------
------------------------------------------
  LITERAL TABLE IN EXPRESSION EVALUATION

Idea for code for numeric expression, N:

    1. Look up N in literal table,
    2. Receive N's 


    3. generate a copy (CPW) instruction
         to copy that to 


------------------------------------------
     What's our goal for expression code?
------------------------------------------
    LITERAL TABLE AND BOF DATA SECTION

How to get the literals into memory
   with the assumed offsets?


------------------------------------------
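A minimal sketch of the literal table idea, assuming a fixed-capacity array where a literal's offset is simply its index. The names literal_table_lookup, literal_table_size, and literal_table_write are hypothetical, as is the capacity; a real table might also store the literal's source text.

```c
#include <assert.h>

#define LITERAL_TABLE_MAX 256  // assumed capacity for this sketch

static int literal_values[LITERAL_TABLE_MAX];
static unsigned int literal_count = 0;

// Return the word offset assigned to value,
// adding value to the table if it is not already there
unsigned int literal_table_lookup(int value)
{
    for (unsigned int i = 0; i < literal_count; i++) {
        if (literal_values[i] == value) {
            return i;
        }
    }
    assert(literal_count < LITERAL_TABLE_MAX);
    literal_values[literal_count] = value;
    return literal_count++;
}

// Return the number of distinct literals recorded so far
unsigned int literal_table_size(void)
{
    return literal_count;
}

// Requires: data has room for literal_table_size() words
// Copy the literals into data in offset order, so that
// data[off] holds the literal that was assigned offset off
void literal_table_write(int *data)
{
    for (unsigned int i = 0; i < literal_count; i++) {
        data[i] = literal_values[i];
    }
}
```

Writing the values out in offset order is what makes the offsets handed out during code generation agree with where the literals end up in memory.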
 E. Activation Record (AR) Layout
    Where should constants and variables for a block be stored?
------------------------------------------
   LAYOUT OF AN ACTIVATION RECORD

Must save SP, FP, static link, RA
   and register $r3

Can't have the static link
    at a varying offset from FP

Layout 1:
                                   offset
  FP --> [  saved     SP          ] 0
         [  registers FP          ]-1
         [            static link ]-2
         [            RA          ]-3
         [ local constants        ]-4
         [      ...               ]
         [ local variables        ]
         [      ...               ]
         [ temporary storage      ]
   SP -->[      ...               ]


Layout 2:
                                   offset
         [      ...               ]
         [ local variables        ]
         [      ...               ]
   FP -->[ local constants        ] 0
         [  saved     SP          ]-1
         [  registers FP          ]-2
         [            static link ]-3
         [            RA          ]-4
         [ temporary storage      ]
   SP -->[      ...               ]

Layout 3:
                                   offset
         [ saved     SP           ] 4
         [ registers FP           ] 3
         [           static link  ] 2
         [           RA           ] 1
   FP -->[ local constants        ] 0
         [      ...               ]
         [ local variables        ]
         [      ...               ]
         [ temporary storage      ]
   SP -->[      ...               ]

Advantages of layout 1:


Advantages of layout 2:


Advantages of layout 3:


------------------------------------------
      Why can't we put the constants and variables in front of
         the saved registers as an alternative to layout 1?
      What does the FP register address?
      What are the advantages of layout 1?
      What are the advantages of layout 2?
      Any disadvantages?
      What are the advantages of layout 3?
      Any disadvantages?
      Which layout should we use?
 F. Declarations
   Where are constants and variables stored?
------------------------------------------
 TRANSLATION SCHEME FOR SPL DECLARATIONS

   const c = n;
   var x;

When do blocks start executing?


What should be done then?


How do we know how much space to allocate?


How to initialize constants?


How to initialize variables?


------------------------------------------
      When are blocks executed in SPL?
      When starting to execute a block, what should be done?
      Which should be allocated first: constants or variables?
      How do we know how much space to allocate?
      How to initialize constants?
      How to compute the value of the constant?
      How to initialize variables?
 G. Compiling Expressions
  1. deciding where to start
    What are the simplest cases for expressions?
  2. numeric literals
------------------------------------------
 TRANSLATION SCHEME FOR NUMERIC LITERALS


------------------------------------------
  3. variables as expressions
------------------------------------------
  TRANSLATION SCHEME FOR VARIABLE NAMES
           (AND CONSTANTS)


------------------------------------------
    How to load FP into $r3?
    How would you generate code
       to repeat the loading of the next static link levelsOut times?
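One possible shape for the loop, as a sketch that writes assembly text into a buffer instead of building a code_seq. CPR is from the instruction list in these notes. LWR is used here on the assumption that the SSM has a load-word-into-register instruction of the form "LWR t,s,o" meaning GPR[t] = M[GPR[s]+o]; check the SSM documentation. The function name gen_frame_base is hypothetical, and the static link offset is a parameter because it depends on the AR layout chosen.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Write into buf the text of instructions that leave in $r3 the
// frame pointer of the AR that is levelsOut static links away.
// static_link_offset is the offset of the static link from FP.
// Returns the number of instructions written.
unsigned int gen_frame_base(char *buf, size_t bufsize,
                            unsigned int levelsOut,
                            int static_link_offset)
{
    size_t used = 0;

    // start from the current frame: GPR[$r3] = GPR[$fp]
    int w = snprintf(buf, bufsize, "CPR $r3, $fp\n");
    assert(w > 0 && (size_t)w < bufsize);
    used += (size_t)w;
    unsigned int count = 1;

    // one load per level (LWR is an assumed instruction):
    // GPR[$r3] = M[GPR[$r3] + static_link_offset]
    for (unsigned int i = 0; i < levelsOut; i++) {
        w = snprintf(buf + used, bufsize - used,
                     "LWR $r3, $r3, %d\n", static_link_offset);
        assert(w > 0 && (size_t)w < bufsize - used);
        used += (size_t)w;
        count++;
    }
    return count;
}
```

The loop runs in the compiler, not in the generated code: for a use that is 2 levels out, the compiler emits the load twice.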
  4. binary operator expressions
------------------------------------------
        TRANSLATING EXPRESSIONS

Abstract syntax of expressions in SPL

  E ::= E1 o E2 | x | n
  o ::= + | - | * | /


Simplest cases are:


------------------------------------------
    So, for E1 - E2 what needs to be done?
    Why can we evaluate E2 first?
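A sketch of "follow the grammar, trust the recursion" for E ::= E1 o E2 | n. To stay self-contained it targets a made-up list of abstract stack operations (OP_PUSH, OP_ADD, ...) rather than real SSM instructions, and it leaves out variables; all the type and function names are hypothetical. The invariant is that the code generated for any expression leaves exactly one more value on the stack.

```c
#include <assert.h>
#include <stddef.h>

// Simplified AST for  E ::= E1 o E2 | n
typedef enum { EXPR_NUM, EXPR_BIN } expr_kind;
typedef struct expr_s {
    expr_kind kind;
    int num;                  // used when kind == EXPR_NUM
    char op;                  // '+', '-', '*', or '/' for EXPR_BIN
    const struct expr_s *e1;  // left operand
    const struct expr_s *e2;  // right operand
} expr;

// Abstract stack operations (not real SSM instructions)
typedef enum { OP_PUSH, OP_ADD, OP_SUB, OP_MUL, OP_DIV } op_kind;
typedef struct { op_kind kind; int arg; } op;

#define MAX_OPS 64

// Append the code for e to out, updating *n.
// One case per alternative in the grammar.
void gen_expr(const expr *e, op *out, size_t *n)
{
    if (e->kind == EXPR_NUM) {
        assert(*n < MAX_OPS);
        out[(*n)++] = (op){OP_PUSH, e->num};
        return;
    }
    gen_expr(e->e2, out, n);  // E2 first, so that ...
    gen_expr(e->e1, out, n);  // ... E1 ends up on top
    op_kind k = OP_ADD;
    switch (e->op) {
    case '+': k = OP_ADD; break;
    case '-': k = OP_SUB; break;
    case '*': k = OP_MUL; break;
    case '/': k = OP_DIV; break;
    default: assert(0 && "unknown operator");
    }
    assert(*n < MAX_OPS);
    out[(*n)++] = (op){k, 0};
}

// Run the abstract code and return the single value left on the stack
int run_ops(const op *code, size_t n)
{
    int stack[MAX_OPS];
    size_t sp = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].kind == OP_PUSH) {
            stack[sp++] = code[i].arg;
            continue;
        }
        assert(sp >= 2);
        int v1 = stack[--sp];  // value of E1 (was on top)
        int v2 = stack[--sp];  // value of E2 (was below it)
        int r = 0;
        switch (code[i].kind) {
        case OP_ADD: r = v1 + v2; break;
        case OP_SUB: r = v1 - v2; break;
        case OP_MUL: r = v1 * v2; break;
        case OP_DIV: r = v1 / v2; break;
        default: assert(0);
        }
        stack[sp++] = r;
    }
    assert(sp == 1);
    return stack[0];
}
```

For (8 + 2) - (6 / 3) the generator emits four pushes and three operators, and the recursive calls never need to know what surrounds them.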
 H. Statements
  1. Basic Statements
   What are the base cases in the grammar for statements?
------------------------------------------
 TRANSLATION SCHEME FOR BASIC STATEMENTS


    begin end


    x := E


    read x


    print E


------------------------------------------
            For testing, want to know: What are the simplest cases?
            In general, can the "levels outwards" part of
                the lexical address be determined when the variable
                is declared?
            Does the same thing work for constants?
            Should we write a character with code E or the digits of E?
 I. Conditions
  1. Overall conditions
------------------------------------------
        GRAMMAR FOR CONDITIONS

<condition> ::= divisible <expr> by <expr>
              | <expr> <rel-op> <expr>
<rel-op> ::= == | != | < | <= | > | >=

So the recursion structure of the code is?


Code looks like:


------------------------------------------
         What should these functions return?
   a. Relational operator conditions
------------------------------------------
     RELATIONAL OPERATOR CONDITIONS

<condition> ::= <expr> <rel-op> <expr>

A design for rel-op conditions:

 Goal: put true or false on top of stack
       for the value of the condition

 One case for each condition:

 Consider case op is !=

  [Evaluate E2 to top of stack]
  [Evaluate E1 to top of stack]
  # What does the stack look like? (1)
  # jump ahead 3 instrs,
  # if memory[GPR[$sp]]
  #              != memory[GPR[$sp]+1]
  BNE $sp, 1, 3
  # put 0 (false) at SP+1
  LIT $sp, 1, 0
  # jump over next instr
  JREL 2
  # put 1 (true) at SP+1
  LIT $sp, 1, 1
  # What does the stack look like (2)?
  # deallocate one word from stack
  ARI $sp, 1
  # now top of stack has truth value

 Consider E1 >= E2
  [Evaluate E2 to top of stack]
  [Evaluate E1 to top of stack]
  # What does the stack look like? (3)
  SUB $sp, 0, $sp, 1  # M[SP] = E1 - E2
  # jump ahead 3 instrs, if geq
  BGEZ $sp, 0, 3      # skip 2 instrs
  # put 0 (false) at SP+1
  LIT $sp, 1, 0
  # jump over next instr
  JREL 2
  # put 1 (true) at SP+1
  LIT $sp, 1, 1
  # What does the stack look like (4)?
  # deallocate one word from stack
  ARI $sp, 1
  # now top of stack has truth value
------------------------------------------
          What would work for ==?
          What would you do for < ?
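To trace the stack in the E1 >= E2 design, here is a small C model of that instruction sequence. It is a sketch under assumptions: the stack grows toward lower addresses (so SP+1 is the word under the top), the branch tests the word where SUB stored the difference, and the function name geq_sequence is hypothetical.

```c
#include <assert.h>

#define STACK_WORDS 16  // assumed stack size for this sketch

// Model the E1 >= E2 sequence on a small stack and
// return the truth value that ends up on top of the stack
int geq_sequence(int e1, int e2)
{
    int M[STACK_WORDS] = {0};
    int sp = STACK_WORDS;      // empty stack

    M[--sp] = e2;              // [Evaluate E2 to top of stack]
    M[--sp] = e1;              // [Evaluate E1 to top of stack]
    // (3): M[sp] == e1 and M[sp+1] == e2

    M[sp + 0] = M[sp] - M[sp + 1];  // SUB $sp, 0, $sp, 1
    if (M[sp + 0] >= 0) {           // BGEZ taken: jump to the LIT 1
        M[sp + 1] = 1;              // LIT $sp, 1, 1
    } else {                        // BGEZ not taken: fall through
        M[sp + 1] = 0;              // LIT $sp, 1, 0, then JREL 2
    }
    // (4): M[sp] == e1 - e2 and M[sp+1] == truth value

    sp += 1;                        // ARI $sp, 1
    assert(sp == STACK_WORDS - 1);  // net effect: one value pushed
    return M[sp];
}
```

The truth value is written under the top of the stack, over E2's slot, so that deallocating one word leaves it on top.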
------------------------------------------
     CODE FOR BINARY RELOP CONDITIONS

// file ast.h
typedef struct {
    file_location *file_loc;
    AST_type type_tag;
    expr_t expr1;
    token_t rel_op;
    expr_t expr2;
} rel_op_condition_t;


// file gen_code.c

// Generate code for cond,
// putting its truth value
// on top of the runtime stack
// May also modify SP, HI, LO, and $r3
code_seq gen_code_rel_op_condition(
              rel_op_condition_t cond)
{


}

------------------------------------------
 J. Control Flow Statements (Compound Statements)
     Why is it useful to write the base cases first?
------------------------------------------
 ABSTRACT SYNTAX FOR COMPOUND STATEMENTS

S ::= begin S*
    | if C S1* S2*
    | while C S*

So what is the code structure?


Source and generated code look like:


  begin S1 S2 ... end


  if C S1


  if C S1 S2


  while C S


------------------------------------------
           Why deallocate the truth value in the loop and at the end?
II. Code Generation for Procedures
------------------------------------------
      SUPPORTING PROCEDURES AND CALLS

Main issues:
   - storing their code
     Why?


   - knowing exactly where each starts
     Why?


Another issue:
   - sending the right static link

------------------------------------------
        What static link does a called procedure need?
 A. Where to store code for procedures?
   Where do we put the code sequences for each procedure?
------------------------------------------
       WHERE TO PUT PROCEDURE CODE?

Possible layouts in VM's code array:


------------------------------------------
       How would you implement each?
       Which layout makes the most sense?
 B. how to find each procedure's starting address?
------------------------------------------
      NESTED PROCEDURES ARE A PROBLEM

begin
  proc A
  begin
    proc B
    begin
      # B's body code...
      call A # ...
      # ...
    end;
    # A's body code
    call B # ...
    # ...
  end;
  call A
end.

If we lay out the code as

   [ code for A ]
   [ code for B ]

How do we know the address of B
    to compile the call to B in A?


What about the other direction?


------------------------------------------
------------------------------------------
   RECURSIVE PROCEDURES, SIMILAR PROBLEM

begin
  proc R
  begin
    # R's body code ...
    call R
    # ...
  end;
  # ...
  call R
  # ...
end.

Before storing code for R,
  how do we know where it starts?


------------------------------------------
------------------------------------------
  MUTUAL RECURSION (NOT IN OUR LANGUAGE)
        
begin
  proc O
  begin # O's body code...
    call E
    # ...
  end;

  proc E
  begin
    # E's body code ...
    call O
    # ...
  end;

  # ...
  call O;
  call E
  # ...
end.

One of these must come before the other in
  the code area of the VM...


------------------------------------------
      No matter which of O or E is put first,
         how is the call to the second one to know
         where the second one starts?
  1. solutions
------------------------------------------
       SOLUTION STRATEGIES FOR CALLS

[Multiple passes]:
  1. Generate code for each procedure
     (+ store offsets in symbol table,
      + layout procedure code in memory
        with placeholders for calls)
  2. Gather table of addresses
     (map from names to addresses,
      using offsets and beginning address)
  3. Patch up code addresses for calls
     (+ output code)

[Lazy evaluation, labels]:
  1. Generate code for each procedure
     with calls to "labels"
     (+ store or update
        labels in symbol table)
  (+ output code)
------------------------------------------
 C. Multiple Passes Solution
------------------------------------------
    GENERAL SOLUTION: MULTIPLE PASSES

Problem: where does each procedure start?

Passes over the IR:
  1. Compile all procedure code
     (now know how big each procedure is)
  2. Lay out procedure code in memory
     (now know where each starts)
  3. Change each call instruction


------------------------------------------
        What would a program need to do to change all the call instructions?
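A sketch of passes 2 and 3, assuming pass 1 has already produced the size in words of each procedure's code and a record of every call site. The call_site type and the function names are hypothetical; a real compiler would patch the address field of the actual call instruction.

```c
#include <assert.h>
#include <stddef.h>

// A call recorded during pass 1 (hypothetical representation)
typedef struct {
    size_t callee;        // index of the procedure being called
    unsigned int target;  // address field of the call, set in pass 3
} call_site;

// Pass 2: lay out the procedures one after another, starting at
// base, and record where each one starts
void layout_procedures(const unsigned int *sizes, size_t n,
                       unsigned int base, unsigned int *starts)
{
    unsigned int addr = base;
    for (size_t i = 0; i < n; i++) {
        starts[i] = addr;
        addr += sizes[i];
    }
}

// Pass 3: change each call to use its callee's starting address
void patch_calls(call_site *calls, size_t ncalls,
                 const unsigned int *starts, size_t nprocs)
{
    for (size_t i = 0; i < ncalls; i++) {
        assert(calls[i].callee < nprocs);
        calls[i].target = starts[calls[i].callee];
    }
}
```

Because the sizes are all known before any address is assigned, it does not matter whether a call refers to a procedure laid out earlier or later.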
 D. Labels Solution
------------------------------------------
        GENERAL SOLUTION: LABELS

Use "labels" to allow


Term "label" is from assembly language

    ;  ...
    jmp L
    ; ...
    L: ; ...

------------------------------------------
------------------------------------------
        APPROACHES TO FIXING LABELS

Problem: convert labels to addresses

 (1) Use multiple passes
       a. Generate code with labels
       b. Lay out memory for procedures
          (determine starting addresses)
       c. Change labels to addresses

     advantages:


     disadvantages:


 (2) Use shared mutable data (lazy eval.)
       a. labels are unique placeholders,
          shared by all uses (calls)
       b. when address is determined,
          update the placeholder
          (and all uses are updated)

     advantages:


     disadvantages:


------------------------------------------
  1. label data structure
   a. Creating and propagating labels
------------------------------------------
       LABEL DATA STRUCTURE

// file label.h
// ...
#include "machine_types.h"

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label that is not set
extern label *label_create();

// Requires: lab != NULL
// Set the address in the label
extern void label_set(label *lab,
               unsigned int word_offset);

// Is the given label set?
extern bool label_is_set(label *lab);

// Requires: label_is_set(lab)
// Return the word offset in lab
extern
unsigned int label_read(label *lab);
------------------------------------------
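A minimal sketch of how the functions declared in label.h might be implemented. The bodies are assumptions; only the struct and the function signatures come from the header above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label that is not set (NULL if out of memory)
label *label_create(void)
{
    label *ret = malloc(sizeof(label));
    if (ret == NULL) {
        return NULL;
    }
    ret->is_set = false;
    ret->word_offset = 0;
    return ret;
}

// Requires: lab != NULL
// Set the address in the label
void label_set(label *lab, unsigned int word_offset)
{
    assert(lab != NULL);
    lab->word_offset = word_offset;
    lab->is_set = true;
}

// Requires: lab != NULL
// Is the given label set?
bool label_is_set(label *lab)
{
    assert(lab != NULL);
    return lab->is_set;
}

// Requires: label_is_set(lab)
// Return the word offset in lab
unsigned int label_read(label *lab)
{
    assert(label_is_set(lab));
    return lab->word_offset;
}
```

Since label_create returns a pointer, every holder of that pointer sees the same struct: setting the label once makes the address visible through all the copies of the pointer.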
------------------------------------------
            CREATING LABELS

Labels created for all procedures
when creating proc_decl_t ASTs.

// file ast.h
// ...
#include "label.h"

// ...

typedef struct proc_decl_s {
    file_location *file_loc;
    AST_type type_tag;
    struct proc_decl_s *next; // for lists
    const char *name;
    struct block_s *block;
    label *lab; // for code generation
} proc_decl_t;


// file ast.c

// Return an AST for a proc_decl
proc_decl_t ast_proc_decl(ident_t ident,
                          block_t block)
{
    proc_decl_t ret;
    ret.file_loc = // ...
    ret.type_tag = proc_decl_ast;
    ret.next = NULL;
    ret.name = ident.name;
    block_t *p = // ...
    ret.block = p;

    // this is the source of the labels
    ret.lab = label_create();
    assert(ret.lab != NULL);

    return ret;
}
------------------------------------------
------------------------------------------
     PROPAGATING POINTERS TO LABELS (1)

Labels added to attributes
of procedure names

// file id_attrs.h
//...
#include "label.h"

typedef struct {
    file_location file_loc;
    id_kind kind;  // kind of identifier
    unsigned int offset_count;
    // for a procedure, its label
    label *lab;
} id_attrs;

------------------------------------------
------------------------------------------
     PROPAGATING POINTERS TO LABELS (2)

Make call statement ASTs point to the
label of the procedure being called

// file scope_check.c

// check the statement to make sure that
// the procedure has been declared
// (if not, then produce an error).
// Modifies the given AST
// to have appropriate id_use pointers.
void scope_check_callStmt(
                        call_stmt_t *stmt)
{


}

------------------------------------------
------------------------------------------
     PROPAGATING POINTERS TO LABELS (3)

Associate labels with
each call instruction in code structures

// file code.h
// ...
#include "label.h"

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    // labels for call instructions
    label *lab; 
} code;

// ...

// Requires: lab != NULL
// Create and return a fresh instruction
// with the named mnemonic and parameters
extern code *code_call(address_type a,
                       label *lab);

// ...

------------------------------------------
        Where should the label passed to the code_call function
           come from?
        So what has been achieved?
   b. Using labels to fill in addresses of procedures
    i. Putting addresses of Procedures in Labels
------------------------------------------
    SETTING LABELS IN PROCEDURES (1)

// file label.h

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;


------------------------------------------
      Where is the address of a procedure known?
------------------------------------------
    SETTING LABELS IN PROCEDURES (2)

// file gen_code.c

// ...

void gen_code_proc_decl(proc_decl_t pd)
{


}

------------------------------------------
    What makes code for a procedure?
    At the end of code generation, what has this achieved?
    ii. Fixing up the Call Instructions
    Can some call instructions not have their labels set
       at the end of code generation?
------------------------------------------
  PUTTING ADDRESSES IN CALL INSTRUCTIONS

Write a function to fix
all call instructions in a code_seq

extern
void code_seq_fix_labels(code_seq cs);


------------------------------------------
   How would you write such a function?
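One way such a function might look, as a sketch. The bin_instr_t here is a stand-in with a single addr field (an assumption; the real instruction type is not shown in these notes), and the sketch assumes that only call instructions have a non-NULL lab.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Stand-in for bin_instr_t (assumption: only the address
// field of a call instruction matters for this sketch)
typedef struct { unsigned int addr; } bin_instr_t;

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    // labels for call instructions (NULL otherwise)
    label *lab;
} code;

typedef code *code_seq;

// Requires: every label used by a call in cs has been set
// Modify each call instruction in cs to hold
// the address recorded in its label
void code_seq_fix_labels(code_seq cs)
{
    for (code *p = cs; p != NULL; p = p->next) {
        if (p->lab != NULL) {
            assert(p->lab->is_set);
            p->instr.addr = p->lab->word_offset;
        }
    }
}
```

This needs only one walk over the sequence, but it must run after all procedure code has been generated, since that is when every label is guaranteed to be set.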
 E. testing the solution
   How would you test a solution?
 F. exercise