COP 3402 meeting -*- Outline -*-

* Code Generation for Procedures
------------------------------------------
      SUPPORTING PROCEDURES AND CALLS

Main issues:
   - storing their code
     Why?


   - knowing exactly where each starts
     Why?


Another issue:
   - sending the right static link

------------------------------------------
        ... we only want to execute their code when called,
            and they can be called from anywhere (their name is visible)
               (in some languages also from where they can be accessed
                from data structures)

        ... because machines don't support symbolic names,
            need (absolute) address for the call instruction

            e.g., in the SSM, need an address for the CALL instruction

        Q: What static link does a called procedure need?
           The one for the scope in which it is defined;
           this will be the link for the AR given by the number of
           levels outward where the procedure name was declared.

           For example, suppose we are compiling code for procedure P,
               and it (i.e., P's body) calls procedure Q.
               There are several cases:

               Q was declared in P's body,
                  so the call and Q are in the same scope,
                  so levelsOutward == 0 for Q (from the call);
                  thus want to pass FP as the static link.

               Q was declared in the same block as P (but before P):
                  so levelsOutward == 1 for Q
                  thus want to pass P's static link as the static link

               Q was declared in a scope surrounding where P was declared
                  so levelsOutward == N > 1 for Q,
                  so want to pass static link for N levels outward
                  
               Thus it's always the levels outward of the name Q
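
               The cases above can be sketched in C as a loop that follows
               static links. This is only an illustration under assumptions:
               the ar_t struct and the name static_link_for_call are made up
               for this sketch, and on the SSM the ARs live on the runtime
               stack rather than in C structs.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical model of an activation record (AR);
// only the static link matters for this sketch.
typedef struct ar_s {
    struct ar_s *static_link; // AR of the surrounding scope
} ar_t;

// Requires: fp != NULL and the static chain from fp
// has at least levelsOutward links.
// Return the AR to pass as the static link for a call
// to a procedure whose name was declared levelsOutward
// scopes outward from the call.
ar_t *static_link_for_call(ar_t *fp, unsigned int levelsOutward)
{
    ar_t *ret = fp;
    for (unsigned int i = 0; i < levelsOutward; i++) {
        assert(ret != NULL);
        ret = ret->static_link;
    }
    return ret;
}
```

               With levelsOutward == 0 this returns FP itself, and with
               levelsOutward == 1 it returns the caller's own static link,
               matching the first two cases.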

** Where to store code for procedures?
    We can't just put the code for procedures in the main program's
    code sequence
    Why?
      We don't want them executed when the program starts running,
      only when called!
      So they have to be stored somewhere...

   A fundamental issue is this:
   Q: Where do we put the code sequences for each procedure?
      Because we have to know where they are in the memory of the VM
      so we can call them with their address

      Note that in the SSM, the BOF loading process only allows
      code/instructions at the beginning of memory,
      but does allow the BOF to specify where execution should start
------------------------------------------
       WHERE TO PUT PROCEDURE CODE?

Possible layouts in VM's code array:


------------------------------------------
   ... (1) % store main program first, then procedures:
            [code to set up the program's AR]
            [code for main program's block]
            [(optional: code to tear down the program's AR)]
            EXIT 0
            [code for each procedure...]

   ... (2) % store procedures first, then main program:
            JMPA, size(all-proc-code)+1 # jump around the procedures
            [code for each procedure...]
            [code to set up the program's AR]
            [code for main program's block]
            [(optional: code to tear down the program's AR)]
            EXIT 0

           Note that the BOF file can set the PC to a start address
             other than 0, without a jump at the beginning,
             so the first jump isn't really needed!

    ... (3) store procedures under programmer control:
            Many languages (like SPL) don't allow procedure expressions
             and so procedures can't be values and stored in data
            But some languages (like Scheme, functional languages)
             have expressions that denote function values (closures).
            In such a language, the compiler can just:
            - generate the code for the function as an expression value
            - let the program do what it wants with the value

       Q: How would you implement each?
          Main ideas:
           A. track start address of each procedure declaration
               (e.g., as an attribute)
           B. procedure code is written out (to the BOF)
               after code generation,
               so can adjust starting addresses and call instructions,
               at that time

          Scheme (1): (program first, then procedures)
          need to store code for each procedure somewhere in the compiler,
          track offset of each procedure,
            when generating code to call a procedure p,
                 can find p's offset in symbol table
          when done with the main program's block, know its length,
            then add that length to offset when writing call instrs into BOF

          Scheme (2):
          Store the main program's code sequence somewhere in the
          compiler (but that is probably done anyway, since its length
          is needed for the BOF header),
          track start address of each procedure,
            when generating code to call a procedure p,
                 can find p's start address in symbol table
            (no changes to call instructions needed when writing to BOF)

          Scheme (1) has the advantage that it
              works for code without procedures,
                 making initial testing easier
              
          Scheme (2) has the advantage that it
              doesn't need to fix call instructions when writing to
              the BOF, if the language has true lazy evaluation.
              However, without lazy evaluation, such a pass is needed.

          Scheme (3) is usually required only if the programming
              language needs it (for procedures/functions as
              expressions), but SPL doesn't need it

       Q: Which layout makes the most sense?
          Either scheme (1) or scheme (2) could work for SPL.
          Since scheme (2) boils down to scheme (1) if there are no
          procedures, and since scheme (2) uses offsets in a less
          complex way than scheme (1), we recommend using scheme (2)
          and setting the start address of the main program's code in
          the BOF.
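
          A minimal sketch of the address arithmetic for scheme (2),
          assuming addresses are counted in instruction words and that
          the size of each procedure's code is already known. The name
          layout_procs_first and its parameters are hypothetical, not
          part of the SSM or of the provided compiler code.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sketch of layout scheme (2):
// all procedure code first, then the main program.
// sizes[i] is the number of instructions in procedure i
// (including its return instruction).
// Fill starts[i] with the start address of procedure i
// and return the start address of the main program's code,
// which is what would be recorded as the start address in the BOF.
unsigned int layout_procs_first(const unsigned int sizes[],
                                unsigned int starts[],
                                unsigned int n)
{
    assert(sizes != NULL && starts != NULL);
    unsigned int next = 0;
    for (unsigned int i = 0; i < n; i++) {
        starts[i] = next;
        next += sizes[i];
    }
    return next;
}
```

          With no procedures (n == 0) the main program starts at
          address 0, which is the sense in which scheme (2) boils down
          to scheme (1). Under scheme (1) the same offsets would instead
          each be shifted by the length of the main program's code.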

** how to find each procedure's starting address?

   Most machines only support calls to absolute addresses
   so the compiler needs to know exactly where each procedure starts
   (its code address) to put in the call instruction...

   However, nesting still requires patching up call instructions in some way
   Consider:

------------------------------------------
      NESTED PROCEDURES ARE A PROBLEM

begin
  proc A
  begin
    proc B
    begin
      # B's body code...
      call A # ...
      # ...
    end;
    # A's body code
    call B # ...
    # ...
  end;
  call A
end.

If we lay out the code as

   [ code for A ]
   [ code for B ]

How do we know the address of B
    to compile the call to B in A?


What about the other direction?


------------------------------------------
       Note that these calls are all to previously declared procedures
       
       ... in the layout shown,
              we don't know B's start address
                 until we know how big A's body is
       ... in the other direction (code for B before code for A),
              we don't know A's starting address,
                 until we know how big B's body is

       So we'll need some mechanism for filling in these addresses...

------------------------------------------
   RECURSIVE PROCEDURES, SIMILAR PROBLEM

begin
  proc R
  begin
    # R's body code ...
    call R
    # ...
  end;
  # ...
  call R
  # ...
end.

Before storing code for R,
  how do we know where it starts?


------------------------------------------

    ... we can know R's offset when we start walking R's AST,
         so we can put in a call to be adjusted later

------------------------------------------
  MUTUAL RECURSION (NOT IN OUR LANGUAGE)
        
begin
  proc O
  begin # O's body code...
    call E
    # ...
  end;

  proc E
  begin
    # E's body code ...
    call O
    # ...
  end;

  # ...
  call O;
  call E
  # ...
end.

One of these must come before the other in
  the code area of the VM...


------------------------------------------
      If the language, like SPL, requires procedures to be declared
       before use in calls,
       then examples with mutual recursion, like the O and E example,
       are illegal

      However, if the language allows such mutually recursive calls,
         perhaps with forward declarations, as in C,
         then this is a problem.

      Q: No matter which of O or E is put first,
         how is the call to the second one to know
         where the second one starts?
          There's no good way to do that.
          (The problem is we need to know the exact address.)
          (and the other order has a similar problem).

*** solutions
------------------------------------------
       SOLUTION STRATEGIES FOR CALLS

[Multiple passes]:
  1. Generate code for each procedure
     (+ store offsets in symbol table,
      + layout procedure code in memory
        with placeholders for calls)
  2. Gather table of addresses
     (map from names to addresses,
      using offsets and beginning address)
  3. Patch up code addresses for calls
     (+ output code)

[Lazy evaluation, labels]:
  1. Generate code for each procedure
     with calls to "labels"
     (+ store or update
        labels in symbol table)
  (+ output code)
------------------------------------------
       These solutions assume that where a procedure is in memory
           does not affect the size of
           the code/instructions needed to call it
       (That is true on the SSM)

    With a language that has true lazy evaluation (such as Haskell)
     the lazy evaluation solution with labels would be easiest.
     It can still work even if the language does not support true lazy
     evaluation, but then it's a bit harder and may require another pass.

** Multiple Passes Solution
------------------------------------------
    GENERAL SOLUTION: MULTIPLE PASSES

Problem: where does each procedure start?

Passes over the IR:
  1. Compile all procedure code
     (now know how big each procedure is)
  2. Lay out procedure code in memory
     (now know where each starts)
  3. Change each call instruction


------------------------------------------
        Step 3 could be done by a "linker"
           (when compiler outputs information from steps 1 and 2)

        Q: What would a program need to do to change all the call instructions?
           iterate over the sequence of instructions,
                if it's a call, then adjust it
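
           A minimal sketch of step 3, assuming a simplified instruction
           form in which each call initially holds the index of the
           called procedure as its placeholder. The types instr_t and
           opcode_t and the function patch_calls are hypothetical; the
           real SSM instruction formats differ.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, simplified instruction form for this sketch
typedef enum { OP_CALL, OP_OTHER } opcode_t;

typedef struct {
    opcode_t op;
    // for a call, before patching: index of the called procedure;
    // for a call, after patching: address of the called procedure
    unsigned int target;
} instr_t;

// Requires: every call's target is an index < nprocs
// Replace each call's placeholder (a procedure index)
// with that procedure's start address from proc_start.
// Instructions that are not calls are left unchanged.
void patch_calls(instr_t code[], size_t len,
                 const unsigned int proc_start[],
                 unsigned int nprocs)
{
    for (size_t i = 0; i < len; i++) {
        if (code[i].op == OP_CALL) {
            assert(code[i].target < nprocs);
            code[i].target = proc_start[code[i].target];
        }
    }
}
```

           The loop visits each instruction once, so the extra pass
           takes time linear in the size of the compiled code.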
            
** Labels Solution
------------------------------------------
        GENERAL SOLUTION: LABELS

Use "labels" to allow


Term "label" is from assembly language

    ;  ...
    jmp L
    ; ...
    L: ; ...

------------------------------------------
    ... the IR to specify a call target (address)
        that will be determined later

------------------------------------------
        APPROACHES TO FIXING LABELS

Problem: convert labels to addresses

 (1) Use multiple passes
       a. Generate code with labels
       b. Lay out memory for procedures
          (determine starting addresses)
       c. Change labels to addresses

     advantages:


     disadvantages:


 (2) Use shared mutable data (lazy eval.)
       a. labels are unique placeholders,
          shared by all uses (calls)
       b. when address is determined,
          update the placeholder
          (and all uses are updated)

     advantages:


     disadvantages:


------------------------------------------
        ... (advantages of multiple passes)
            + easy to understand/program
            + need a second pass
               (to adjust addresses) anyway

        ... (disadvantages of multiple passes)
            - takes extra time for the additional passes,
              but the time needed is (only) linear in the size of the compiled code

        ... (advantages of lazy eval)
            + can debug some code early
               (before full implementation)

        ... (disadvantages of lazy eval)
            - harder to understand, timing is everything
               (even in a language with true lazy evaluation)
            - label data structure must be truly unique
               (copies destroy the whole idea,
                so need pointers or references)
            - may still require multiple passes
               (if the language does not have true lazy evaluation,
                or if lazy evaluation is not fully implemented;
                so in C another pass is needed
                to force all resolutions)

*** label data structure

    This is patterned after lazy evaluation,
    but true lazy evaluation is not a feature of C
    and doesn't need to be fully implemented to work.

**** Creating and propagating labels

     The compiler needs to propagate pointers to labels,
     and the labels themselves must be unique.
     (It's an error if they are copied.)
     
------------------------------------------
       LABEL DATA STRUCTURE

// file label.h
// ...
#include "machine_types.h"

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label that is not set
extern label *label_create(void);

// Requires: lab != NULL
// Set the address in the label
extern void label_set(label *lab,
               unsigned int word_offset);

// Is the given label set?
extern bool label_is_set(label *lab);

// Requires: label_is_set(lab)
// Return the word offset in lab
extern
unsigned int label_read(label *lab);
------------------------------------------

        So the compiler can create a label (on the heap),
        and all data pointing to it see the updates
           (once it's set)
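
        One possible implementation of these functions is sketched
        below. It is an assumption about what label.c might contain,
        not the provided code: the struct is repeated so that the
        sketch stands alone, and calling abort() when malloc fails is
        a choice made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// Same declaration as in label.h, repeated here
// so that the sketch stands alone.
typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh (heap-allocated) label that is not set
label *label_create(void)
{
    label *ret = (label *)malloc(sizeof(label));
    if (ret == NULL) {
        abort(); // out of memory (handling chosen for this sketch)
    }
    ret->is_set = false;
    ret->word_offset = 0;
    return ret;
}

// Requires: lab != NULL
// Set the address in the label
void label_set(label *lab, unsigned int word_offset)
{
    assert(lab != NULL);
    lab->word_offset = word_offset;
    lab->is_set = true;
}

// Requires: lab != NULL
// Is the given label set?
bool label_is_set(label *lab)
{
    assert(lab != NULL);
    return lab->is_set;
}

// Requires: label_is_set(lab)
// Return the word offset in lab
unsigned int label_read(label *lab)
{
    assert(label_is_set(lab));
    return lab->word_offset;
}
```

        If two pointers refer to the same label, then after a call to
        label_set through one of them, label_read through the other
        returns the new offset. A struct copy made before the call
        (e.g., label copy = *lab;) stays unset, which is why copying
        labels is an error.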

------------------------------------------
            CREATING LABELS

Labels created for all procedures
when creating proc_decl_t ASTs.

// file ast.h
// ...
#include "label.h"

// ...

typedef struct proc_decl_s {
    file_location *file_loc;
    AST_type type_tag;
    struct proc_decl_s *next; // for lists
    const char *name;
    struct block_s *block;
    label *lab; // for code generation
} proc_decl_t;


// file ast.c

// Return an AST for a proc_decl
proc_decl_t ast_proc_decl(ident_t ident,
                          block_t block)
{
    proc_decl_t ret;
    ret.file_loc = // ...
    ret.type_tag = proc_decl_ast;
    ret.next = NULL;
    ret.name = ident.name;
    block_t *p = // ...
    ret.block = p;

    // this is the source of the labels
    ret.lab = label_create();
    assert(ret.lab != NULL);

    return ret;
}
------------------------------------------

------------------------------------------
     PROPAGATING POINTERS TO LABELS (1)

Labels added to attributes
of procedure names

// file id_attrs.h
//...
#include "label.h"

typedef struct {
    file_location file_loc;
    id_kind kind;  // kind of identifier
    unsigned int offset_count;
    // for a procedure, its label
    label *lab;
} id_attrs;

------------------------------------------


------------------------------------------
     PROPAGATING POINTERS TO LABELS (2)

Make call statement ASTs point to the
label of the procedure being called

// file scope_check.c

// check the statement to make sure that
// the procedure has been declared
// (if not, then produce an error).
// Modifies the given AST
// to have appropriate id_use pointers.
void scope_check_callStmt(
                        call_stmt_t *stmt)
{


}

------------------------------------------

  ...
  stmt->idu = scope_check_ident_declared(
          *(stmt->file_loc), stmt->name);
  assert(stmt->idu != NULL);
  id_attrs *attrs
            = id_use_get_attrs(stmt->idu);
  // check that it's a procedure, or error
  assert(attrs != NULL);
  assert(attrs->lab != NULL);


   So now each call statement points to
   the label of the procedure being called

------------------------------------------
     PROPAGATING POINTERS TO LABELS (3)

Associate labels with
each call instruction in code structures

// file code.h
// ...
#include "label.h"

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    // labels for call instructions
    label *lab; 
} code;

// ...

// Requires: lab != NULL
// Create and return a fresh instruction
// with the named mnemonic and parameters
extern code *code_call(address_type a,
                       label *lab);

// ...

------------------------------------------

        Q: Where should the label passed to the code_call function
           come from?
           From the AST for the proc declaration of the called
           procedure (via the call statement). Has to be the same pointer!
           
        This puts labels in code sequences also

        Q: So what has been achieved?
           Every procedure declared has a label,
           every call to a procedure points to that label
              (as does every call instruction code)
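
        A sketch of how code_call might store the label pointer. The
        instruction types below (jump_instr_t, bin_instr_t,
        address_type) and the opcode CALL_OP_SKETCH are simplified,
        hypothetical stand-ins for the course's machine types, and the
        function body is an assumption, not the provided code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct { bool is_set; unsigned int word_offset; } label;

// Hypothetical stand-ins for the course's machine types
typedef unsigned int address_type;
typedef struct {
    unsigned int op;    // opcode
    address_type addr;  // target address
} jump_instr_t;
typedef union { jump_instr_t jump; } bin_instr_t;
#define CALL_OP_SKETCH 1 // made-up opcode number

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    label *lab; // labels for call instructions
} code;

// Requires: lab != NULL
// Create and return a fresh call instruction
// that remembers (a pointer to) lab.
code *code_call(address_type a, label *lab)
{
    assert(lab != NULL);
    code *ret = (code *)malloc(sizeof(code));
    if (ret == NULL) {
        abort(); // out of memory (handling chosen for this sketch)
    }
    ret->next = NULL;
    ret->instr.jump.op = CALL_OP_SKETCH;
    ret->instr.jump.addr = a; // placeholder until labels are fixed
    ret->lab = lab;           // the same pointer, not a copy
    return ret;
}
```

        Two calls to the same procedure get distinct code structs, but
        both lab fields hold the same pointer, so setting the label
        once is visible from both.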

**** Using labels to fill in addresses of procedures

     Now that the labels are where they need to be,
     information about where each procedure starts needs to get to its
     label and from there to the call instruction

***** Putting addresses of Procedures in Labels
------------------------------------------
    SETTING LABELS IN PROCEDURES (1)

// file label.h

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;


------------------------------------------

      Q: Where is the address of a procedure known?
         In code generation when the procedure has been compiled

------------------------------------------
    SETTING LABELS IN PROCEDURES (2)

// file gen_code.c

// ...

void gen_code_proc_decl(proc_decl_t pd)
{


}

------------------------------------------

    Q: What makes code for a procedure?
       Its block

    I also added a data structure to hold the code_seq for procedures
       called a proc_holder, with a register function.
       The proc_holder_register function returns the procedure's offset
       (based on where it goes in the text section)

    ...
    code_seq pdc = gen_code_block(*(pd.block));
    // add code to return from the procedure call
    code_seq_add_to_end(&pdc, code_rtn());
    unsigned int proc_offset
         = proc_holder_register(pdc);

    label_set(pd.lab, proc_offset);

    Q: At the end of code generation, what has this achieved?
       Now we know each procedure's address,
       which is in its label.
       And the label is pointed to by each code
       for each call instruction.
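
    One way a proc_holder could compute the offsets it returns is
    sketched below. This is a guess at the idea, not the actual
    proc_holder: code and code_seq are reduced to a bare linked list,
    and the variables held, held_last, and held_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, simplified code sequence: a linked list of nodes
// (the real code type also holds an instruction and a label).
typedef struct code_s {
    struct code_s *next;
} code;
typedef code *code_seq; // NULL is the empty sequence

// the code for all procedures registered so far
static code_seq held = NULL;
static code *held_last = NULL;
static unsigned int held_size = 0; // in instructions (words)

// Requires: pdc is not empty
// Add pdc to the end of the held procedure code and return the
// offset (from the start of the procedure code) where pdc starts.
unsigned int proc_holder_register(code_seq pdc)
{
    assert(pdc != NULL);
    unsigned int offset = held_size;
    if (held == NULL) {
        held = pdc;
    } else {
        held_last->next = pdc;
    }
    // walk to the end of pdc, counting its instructions
    for (code *c = pdc; c != NULL; c = c->next) {
        held_last = c;
        held_size++;
    }
    return offset;
}
```

    If the procedure code is placed at the start of the text section,
    as in layout scheme (2), then this offset is also the procedure's
    address.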

***** Fixing up the Call Instructions
    Q: Can some call instructions not have their labels set
       at the end of code generation?

       No, assuming that scope checking made sure that each call was a
       call to a declared procedure,
       then by the end of code generation
       each procedure declaration has been processed,
       and during gen_code_proc_decl we set each label,
       so each label should be set by then

------------------------------------------
  PUTTING ADDRESSES IN CALL INSTRUCTIONS

Write a function to fix
all call instructions in a code_seq

extern
void code_seq_fix_labels(code_seq cs);


------------------------------------------
   Q: How would you write such a function?
      Use a loop, written with code_seq_is_empty,
      code_seq_first, and code_seq_rest.

   ...
       For each (code *)c in a code_seq
        that is a call instruction (so c->lab != NULL) do
           unsigned int a = label_read(c->lab);
           c->instr.jump.addr = a;

   It's best to use assert(label_is_set(c->lab));
       in such a loop, for debugging.
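
   A sketch of such a loop, under stated assumptions: the types and
   the three code_seq functions below are simplified stand-ins written
   only for this example (the real ones come from the course's code
   and machine types), and instructions that are not calls are assumed
   to have a NULL lab.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { bool is_set; unsigned int word_offset; } label;

// Hypothetical, simplified stand-ins for the course's types
typedef struct { unsigned int op; unsigned int addr; } jump_instr_t;
typedef union { jump_instr_t jump; } bin_instr_t;
typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    label *lab; // assumed NULL unless this is a call instruction
} code;
typedef code *code_seq; // NULL is the empty sequence

bool code_seq_is_empty(code_seq cs) { return cs == NULL; }
// Requires: !code_seq_is_empty(cs)
code *code_seq_first(code_seq cs) { return cs; }
// Requires: !code_seq_is_empty(cs)
code_seq code_seq_rest(code_seq cs) { return cs->next; }

// Put the address from each call instruction's label
// into the instruction itself.
void code_seq_fix_labels(code_seq cs)
{
    while (!code_seq_is_empty(cs)) {
        code *c = code_seq_first(cs);
        if (c->lab != NULL) {
            // every label should be set by the end of code generation
            assert(c->lab->is_set);
            c->instr.jump.addr = c->lab->word_offset;
        }
        cs = code_seq_rest(cs);
    }
}
```

   The function would be applied to every code sequence that can
   contain calls (the main program's code and the procedure code)
   before the instructions are written to the BOF.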

** testing the solution

   Q: How would you test a solution?
   Write SPL code that uses procedures and procedure calls
     start with the simplest examples

** exercise
    Write code for the float calculator
        with let statements
           <stmt> ::= let { <var-decl> } in <stmt>