COP 3402 meeting -*- Outline -*-

* Code Generation for Procedures
------------------------------------------
      SUPPORTING PROCEDURES AND CALLS

Main issues:
   - storing their code
     Why?


   - knowing exactly where each starts
     Why?


Another issue:
   - sending the right static link

------------------------------------------
        ... we only want to execute their code when called,
            and they can be called from anywhere (their name is visible)
               (in some languages they can also be called
                through data structures that hold them)

        ... because machines don't support symbolic names,
            need (absolute) address for the call instruction

            e.g., in the SRM, need an address for JAL instruction

        Q: What static link does a called procedure need?
           The one for the scope in which it is defined;
           this will be the link for the AR given by the number of
           levels outward where the procedure name was declared.

           For example, suppose we are compiling code for procedure P,
               and the block in P has a statement that calls procedure Q.
               There are several cases:

               Q was declared in P's block,
                  so the call and Q are in the same scope,
                  so levelsOutward == 0 for Q (from the call);
                  thus want to pass FP as the static link.

               Q was declared in the same block as P:
                  so levelsOutward == 1 for Q
                  thus want to pass P's static link as the static link

               Q was declared in a scope surrounding where P was declared
                  so levelsOutward == N > 1 for Q (from call site),
                  so want to pass static link for N levels outward
                  
               Thus it's always the levels outward of the name Q
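
               The cases above can be sketched in C as a walk along the
               static chain. This is only an illustration: the type ar
               and the function static_link_for_call are hypothetical
               names, and a real AR holds much more than a static link.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical activation record: only the static link matters here.
typedef struct ar {
    struct ar *static_link; // AR of the lexically enclosing scope
} ar;

// Requires: fp != NULL and the static chain from fp has at least
// levels_outward links.
// Return the AR to pass as the callee's static link.
ar *static_link_for_call(ar *fp, unsigned int levels_outward)
{
    ar *link = fp;
    for (unsigned int i = 0; i < levels_outward; i++) {
        assert(link != NULL);
        link = link->static_link;
    }
    return link;
}
```

               With levels_outward == 0 this returns FP itself, and with
               levels_outward == 1 it returns P's static link, matching
               the first two cases.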

** Where to store code for procedures?
    We can't just put the code for procedures in the main program's
    code sequence
    Why?
      We don't want them executed when the program starts running,
      only when called!
      So they have to be stored somewhere...

   A fundamental issue is this:
   Q: Where do we put the code sequences for each procedure?
      Because we have to know where they are in the memory of the VM
      so we can call them with their address
      (As the VM doesn't support names for procedures directly.)

      Note that in the SRM, the BOF loading process only allows
      code/instructions at the beginning of memory,
      but does allow the BOF to specify where execution should start
------------------------------------------
       WHERE TO PUT PROCEDURE CODE?

Possible layouts in VM's code array:


------------------------------------------
   ... (1) store main program first, then procedures:
            [code to set up the program's AR]
            [code for program block]
            [(optional: code to tear down the program's AR)]
            EXIT
            [code for each procedure...]

   ... (2) store procedures first, then main program:
            BEQ $0,$0, size(all-proc-code) # skip past the procedures
            [code for each procedure...]
            [code to set up the program's AR]
            [code for program block]
            [(optional: code to tear down the program's AR)]
            EXIT

           Note that the BOF file can set the PC to a start address
             other than 0, without a jump at the beginning,
             so the first jump isn't really needed

    ... (3) store procedures under programmer control:
            Many languages (like PL/0) don't allow procedure expressions
             and so procedures can't be values and stored in data
            But some languages (like Scheme, functional languages)
             have expressions that denote function values (closures).
            In such a language, the compiler can just:
            - generate the code for the function as an expression value
            - let the program do what it wants with the value

       Q: How would you implement each?
          Main ideas:
           A. track start address of each procedure declaration
               (e.g., as an attribute)
           B. procedure code is written out (to the BOF)
               after code generation,
               so can adjust starting addresses and call instructions,
               at that time

          Scheme (1):
          need to store code for each procedure somewhere,
          track offset of each procedure,
            when generating code to call a procedure p,
                 can find p's offset in symbol table
          when done with the main program's block, know its length,
            then add that length to offset when writing call instrs into BOF

          Scheme (2):
          Store the main program's code sequence somewhere,
          track start address of each procedure,
            when generating code to call a procedure p,
                 can find p's start address in symbol table
            (no changes to call instructions needed when writing to BOF)

          Scheme (1) has the advantage that it
              works for code without procedures,
                 making initial testing easier
              
          Scheme (2) has the advantage that it
              doesn't need to fix call instructions when writing to BOF

          Scheme (3) is usually required by the programming language
               (when there are procedures/functions as expressions)

       Q: Which layout makes the most sense?
          Since scheme (2) boils down to scheme (1) if there are no
          procedures, and since scheme (2) uses offsets in a less
          complex way than scheme (1), we recommend using scheme (2)
          and setting the start address of the main program's code in
          the BOF.
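
          A minimal sketch of the address bookkeeping for layout (2),
          assuming sizes are measured in words and that procedure code
          starts at address 0. The function name layout_procs_first and
          its parameters are hypothetical, not part of any provided code.

```c
#include <assert.h>
#include <stddef.h>

// Requires: sizes and starts both have n elements.
// Lay out n procedures one after another beginning at address 0
// (layout (2)), storing each start address in starts.
// Return the address where the main program's code begins.
unsigned int layout_procs_first(const unsigned int *sizes,
                                unsigned int *starts, size_t n)
{
    assert(n == 0 || (sizes != NULL && starts != NULL));
    unsigned int next = 0;
    for (size_t i = 0; i < n; i++) {
        starts[i] = next;
        next += sizes[i];
    }
    return next;
}
```

          The returned value is what would be recorded as the program's
          start address. With no procedures it is 0, which is the sense
          in which scheme (2) boils down to scheme (1).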

** how to find each procedure's starting address?

   Most machines only support calls to absolute addresses
   so the compiler needs to know exactly where each procedure starts
   (its code address) to put in the call (JAL for the SRM) instruction...

   However, nesting requires patching up call instructions in some way
   Consider:

------------------------------------------
      NESTED PROCEDURES ARE A PROBLEM

  procedure A;
    procedure B;
      begin # B's body code...
            call A # ...
            # ...
      end
  begin
     # A's body code
     call B # ...
     # ...
  end

If we lay out the code as

   [ code for A ]
   [ code for B ]

How do we know the address of B
    to compile the call to B?


What about the other direction?


------------------------------------------
       Note that these calls are all to previously declared procedures
       
       ... in the layout shown,
              we don't know B's start address
                 until we know how big A's body is
       ... in the second scheme,
              we don't know A's starting address,
                 until we know how big B's body is

       So we'll need some mechanism for filling in these addresses...

------------------------------------------
   RECURSIVE PROCEDURES, SIMILAR PROBLEM

  procedure R;
    begin
      # R's body code ...
      call R
      # ...
    end

Before storing code for R,
  how do we know where it starts?


------------------------------------------
    It's like the nested procedure case,
        but it's the procedure itself that we need to know the size of
        however:

    ... we can know R's offset when we start walking R's AST,
         so we can put in a call to be adjusted later

------------------------------------------
        MUTUAL RECURSION
        
  procedure O;
    begin # O's body code...
      call E
      # ...
    end

  procedure E;
    begin
      # E's body code ...
      call O
      # ...
    end

One of these must come before the other in
  the code area of the VM...


------------------------------------------
      If the language, like PL/0, requires procedures to be declared
       before use in calls,
       then examples with mutual recursion, like the O and E example,
       are illegal

      However, if the language allows such mutually recursive calls,
         perhaps with forward declarations, as in C,
         then this is a problem.

      Q: No matter which of O or E is put first,
         how is the call to the second one to know
         where the second one starts?
          (The problem is we need to know the exact address.)
          (and the other order has a similar problem).

*** solutions
------------------------------------------
       SOLUTION STRATEGIES FOR CALLS

[Multiple passes]:
  1. Generate code for each procedure
     (+ store offsets in symbol table,
      + layout procedure code in memory)
  2. Gather table of addresses
     (map from names to addresses,
      using offsets and beginning address)
  3. Patch up code addresses for calls
     (+ output code)

[Lazy evaluation, labels]:
  1. Generate code for each procedure
     with calls to labels
     (+ store or update
        labels in symbol table)
  (+ output code)
------------------------------------------
       These solutions assume that where a procedure is in memory
           does not affect the size of the code/instructions
       (That is true on the SRM.)

** Multiple Passes as a Solution
------------------------------------------
      GENERAL SOLUTION: MULTIPLE PASSES

Problem: where does each procedure start?

Solution idea:
  1. Compile all procedure code
     (now know how big each procedure is)
  2. Lay out procedure code in memory
     (now know where each starts)
  3. Change each call instruction


------------------------------------------
        Step 3 could be done by a "linker"
           (when compiler outputs information from steps 1 and 2)

        Q: What would a program need to do to change all the call instructions?
           iterate over the sequence of instructions,
                if it's a call, then adjust it
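
           A sketch of such a patching pass, under stated assumptions:
           the instruction format below (opcode, instr, OP_CALL) is
           hypothetical and is not the SRM's real encoding, and each
           unpatched call is assumed to hold the index of the procedure
           it calls.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical instruction format, not the SRM's real encoding.
typedef enum { OP_OTHER, OP_CALL } opcode;

typedef struct {
    opcode op;
    unsigned int target; // before patching: procedure index
                         // after patching: start address
} instr;

// Requires: every OP_CALL in code has target < nprocs.
// Replace each call's procedure index with that procedure's address.
// Return the number of calls patched.
size_t patch_calls(instr *code, size_t len,
                   const unsigned int *proc_start, size_t nprocs)
{
    size_t patched = 0;
    for (size_t i = 0; i < len; i++) {
        if (code[i].op == OP_CALL) {
            assert(code[i].target < nprocs);
            code[i].target = proc_start[code[i].target];
            patched++;
        }
    }
    return patched;
}
```

           The pass visits each instruction once, so its cost is linear
           in the size of the compiled code.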
            
** Labels as a Solution
------------------------------------------
         GENERAL SOLUTION: LABELS

Use "labels" to allow


Term "label" is from assembly language

    ;  ...
    jmp L
    ; ...
    L: ; ...

------------------------------------------
    ... the IR to specify a call target (address)
        that will be determined later

------------------------------------------
        APPROACHES TO FIXING LABELS

Problem: convert labels to addresses

 (1) Use multiple passes
       a. Generate code with labels
       b. Lay out memory for procedures
          (determine starting addresses)
       c. Change labels to addresses

     advantages:


     disadvantages:


 (2) Use shared mutable data (lazy eval.)
       a. labels are unique placeholders,
          shared by all uses (calls)
       b. when address is determined,
          update the placeholder
          (and all uses are updated)

     advantages:


     disadvantages:


------------------------------------------
        ... (advantages of multiple passes)
            + easy to understand/program
            + need a second pass
               (to adjust addresses) anyway

        ... (disadvantages of multiple passes)
            - time needed is linear in size of compiled code

        ... (advantages of lazy eval)
            + can debug some code early
               (before full implementation)

        ... (disadvantages of lazy eval)
            - harder to understand, timing is everything
            - label data structure must be truly unique
               (copies destroy the whole idea,
                so need pointers or references)
            - still requires multiple passes for mutual recursion
               (to force all resolutions)

*** label data structure for lazy evaluation
------------------------------------------
    LABEL DATA STRUCTURE FOR LAZY EVAL

// file label.h
// ...
#include <stdbool.h>
#include "machine_types.h"

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label that is not set
extern label *label_create();

// Requires: lab != NULL
// Set the address in the label
extern void label_set(label *lab,
               unsigned int word_offset);

// Is the given label set?
extern bool label_is_set(label *lab);

// Requires: label_is_set(lab)
// Return the word offset in lab
extern
unsigned int label_read(label *lab);
------------------------------------------

        So the compiler can create a label (on the heap),
        and all data pointing to it see the updates
           (once it's set)
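
        One possible implementation of this interface is sketched
        below. It is an assumption, not the course's provided code: the
        struct is repeated so the sketch compiles on its own, and
        aborting on allocation failure is just one simple choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label (on the heap) that is not set
label *label_create(void)
{
    label *lab = malloc(sizeof(label));
    if (lab == NULL) {
        abort(); // out of memory; a real compiler might report an error
    }
    lab->is_set = false;
    lab->word_offset = 0;
    return lab;
}

// Requires: lab != NULL
// Set the address in the label
void label_set(label *lab, unsigned int word_offset)
{
    assert(lab != NULL);
    lab->word_offset = word_offset;
    lab->is_set = true;
}

// Is the given label set?
bool label_is_set(label *lab)
{
    return lab->is_set;
}

// Requires: label_is_set(lab)
// Return the word offset in lab
unsigned int label_read(label *lab)
{
    assert(label_is_set(lab));
    return lab->word_offset;
}
```

        If several call instructions hold the same label pointer, a
        single label_set makes the address visible to every one of
        them the next time they call label_read.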

** exercise
    Write code for the float calculator
        with let statements
           <stmt> ::= let { <var-decl> } in <stmt>