COP 3402 meeting -*- Outline -*-

* Code Generation for Procedures

------------------------------------------
    SUPPORTING PROCEDURES AND CALLS

Main issues:

 - storing their code
     Why?

 - knowing exactly where each starts
     Why?

Another issue:

 - sending the right static link
------------------------------------------
        ... we only want to execute their code when called,
            and they can be called from anywhere (their name is visible)
            (in some languages, also from wherever they can be
             reached through data structures)

        ... because machines don't support symbolic names,
            we need an (absolute) address for the call instruction
            e.g., in the SSM, need an address for the CALL instruction

   Q: What static link does a called procedure need?

      The one for the scope in which it is defined;
      this is the AR found by going outward from the current AR
      by the number of levels outward at which the procedure's
      name was declared.

      For example, suppose we are compiling code for procedure P,
      and it (i.e., P's body) calls procedure Q.
      There are several cases:

        Q was declared in P's body,
          so the call and Q are in the same scope,
          so levelsOutward == 0 for Q (from the call);
          thus want to pass FP as the static link.

        Q was declared in the same block as P (but before P):
          so levelsOutward == 1 for Q
          thus want to pass P's static link as the static link

        Q was declared in a scope surrounding where P was declared
          so levelsOutward == N > 1 for Q,
          so want to pass static link for N levels outward

      Thus it's always the levels outward of the name Q

** Where to store code for procedures?

   We can't just put the code for procedures in the main program's
   code sequence

   Why?
      We don't want them executed when the program starts running,
      only when called!

   So they have to be stored somewhere...

   A fundamental issue is this:

   Q: Where do we put the code sequences for each procedure?
      This matters because we have to know where they are
      in the memory of the VM so we can call them with their address

   Note that in the SSM, the BOF loading process only allows
   code/instructions at the beginning of memory,
   but does allow the BOF to specify where execution should start

------------------------------------------
     WHERE TO PUT PROCEDURE CODE?

Possible layouts in VM's code array:



------------------------------------------
        ... (1) % store main program first, then procedures:

                [code to set up the program's AR]
                [code for main program's block]
                [(optional: code to tear down the program's AR)]
                EXIT 0
                [code for each procedure...]

        ... (2) % store procedures first, then main program:

                JMPA, size(all-proc-code)+1  # jump around the procedures
                [code for each procedure...]
                [code to set up the program's AR]
                [code for main program's block]
                [(optional: code to tear down the program's AR)]
                EXIT 0

            Note that the BOF file can set the PC to a start address
            other than 0, without a jump at the beginning,
            so the first jump isn't really needed!

        ... (3) store procedures under programmer control:

            Many languages (like SPL) don't allow procedure expressions
            and so procedures can't be values and stored in data

            But some languages (like Scheme, functional languages)
            have expressions that denote function values (closures).
            In such a language, the compiler can just:
               - generate the code for the function as an expression value
               - let the program do what it wants with the value

   Q: How would you implement each?

      Main ideas:

      A. track start address of each procedure declaration
         (e.g., as an attribute)

      B.
         procedure code is written out (to the BOF) after code generation,
         so can adjust starting addresses and call instructions at that time

      Scheme (1): (program first, then procedures)
         need to store code for each procedure somewhere in the compiler,
         track offset of each procedure,
         when generating code to call a procedure p,
            can find p's offset in symbol table
         when done with the main program's block, know its length,
            then add that length to offset when writing call instrs into BOF

      Scheme (2): (procedures first, then program)
         store the main program's code sequence somewhere in the compiler
            (but that is probably done anyway, since its length
             is needed for the BOF header),
         track start address of each procedure,
         when generating code to call a procedure p,
            can find p's start address in symbol table
            (no changes to call instructions needed when writing to BOF)

      Scheme (1) has the advantage that it works for code without procedures,
         making initial testing easier

      Scheme (2) has the advantage that it doesn't need to fix
         call instructions when writing to BOF,
         if the language the compiler is written in has true lazy evaluation.
         Without lazy evaluation, however, a pass that fixes
         the call instructions is still needed

      Scheme (3) is usually required only if the programming language
         needs it (for procedures/functions as expressions),
         but SPL doesn't need it

   Q: Which layout makes the most sense?

      Either scheme (1) or scheme (2) could work for SPL.
      Since scheme (2) boils down to scheme (1) if there are no procedures,
      and since scheme (2) uses offsets in a less complex way than scheme (1),
      we recommend using scheme (2) and setting the start address
      of the main program's code in the BOF.

** how to find each procedure's starting address?

   Most machines only support calls to absolute addresses,
   so the compiler needs to know exactly where each procedure starts
   (its code address) to put in the call instruction...
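
   As a rough illustration of tracking start addresses under scheme (2),
   the sketch below records each procedure's length as it is compiled and
   hands back the word offset where its code will begin. The names
   proc_layout, proc_layout_add, and proc_layout_main_start are
   hypothetical (they are not part of the course's code), and the sketch
   assumes each instruction occupies one word and that procedures are
   laid out in the order they are registered.

```c
#include <assert.h>

#define MAX_PROCS 16  /* arbitrary limit for this sketch */

/* Hypothetical record of where one procedure's code lands. */
typedef struct {
    const char *name;
    unsigned int start;   /* word offset of the first instruction */
    unsigned int length;  /* number of instructions (words) */
} proc_entry;

/* Hypothetical table of all procedures laid out so far. */
typedef struct {
    proc_entry procs[MAX_PROCS];
    unsigned int count;
    unsigned int next_free;  /* next unused word offset in the code area */
} proc_layout;

/* Start with an empty layout; procedure code begins at offset 0. */
void proc_layout_init(proc_layout *pl)
{
    pl->count = 0;
    pl->next_free = 0;
}

/* Record a procedure of the given length and return its start offset. */
unsigned int proc_layout_add(proc_layout *pl, const char *name,
                             unsigned int length)
{
    assert(pl->count < MAX_PROCS);
    unsigned int start = pl->next_free;
    pl->procs[pl->count++] = (proc_entry){name, start, length};
    pl->next_free += length;
    return start;
}

/* Under scheme (2), the main program starts right after all procedures. */
unsigned int proc_layout_main_start(const proc_layout *pl)
{
    return pl->next_free;
}
```

   With this sketch, registering a 5-instruction procedure and then a
   3-instruction procedure gives start offsets 0 and 5, and the main
   program's start address (the value to put in the BOF header) is 8.
   A procedure's start is known only once it has been registered.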
   However, nesting still requires patching up call instructions in some way

   Consider:

------------------------------------------
    NESTED PROCEDURES ARE A PROBLEM

begin
  proc A
  begin
    proc B
    begin
      # B's body code...
      call A
      # ...
      # ...
    end;
    # A's body code
    call B
    # ...
    # ...
  end;
  call A
end.

If we lay out the code as

   [ code for A ]
   [ code for B ]

How do we know the address of B
to compile the call to B in A?

What about the other direction?
------------------------------------------
        Note that these calls are all to previously declared procedures

        ... in the layout shown, we don't know B's start address
            until we know how big A's body is

        ... in the other order (code for B before code for A),
            we don't know A's starting address
            until we know how big B's body is

   So we'll need some mechanism for filling in these addresses...

------------------------------------------
   RECURSIVE PROCEDURES, SIMILAR PROBLEM

begin
  proc R
  begin
    # R's body code ...
    call R
    # ...
  end;
  # ...
  call R
  # ...
end.

Before storing code for R,
how do we know where it starts?
------------------------------------------
        ... we can know R's offset when we start walking R's AST,
            so we can put in a call to be adjusted later

------------------------------------------
   MUTUAL RECURSION (NOT IN OUR LANGUAGE)

begin
  proc O
  begin
    # O's body code...
    call E
    # ...
  end;
  proc E;
  begin
    # E's body code ...
    call O
    # ...
  end;
  # ...
  call O;
  call E
  # ...
end.

One of these must come before the other
in the code area of the VM...
------------------------------------------

   If the language, like SPL, requires procedures to be declared
   before use in calls, then examples with mutual recursion,
   like the O and E example, are illegal

   However, if the language allows such mutually recursive calls,
   perhaps with forward declarations, as in C, then this is a problem.

   Q: No matter which of O or E is put first,
      how is the call to the second one to know where the second one starts?

      There's no good way to do that.
      (The problem is we need to know the exact address.)
      (and the other order has a similar problem).

*** solutions

------------------------------------------
   SOLUTION STRATEGIES FOR CALLS

[Multiple passes]:
 1. Generate code for each procedure
    (+ store offsets in symbol table,
     + lay out procedure code in memory
       with placeholders for calls)
 2. Gather table of addresses
    (map from names to addresses,
     using offsets and beginning address)
 3. Patch up code addresses for calls
    (+ output code)

[Lazy evaluation, labels]:
 1. Generate code for each procedure
    with calls to "labels"
    (+ store or update labels in symbol table)
    (+ output code)
------------------------------------------

   These solutions assume that where a procedure is in memory
   does not affect the size of the code/instructions needed to call it
   (That is true on the SSM)

   With a language that has true lazy evaluation (such as Haskell)
   the lazy evaluation solution with labels would be easiest.
   It can still work even if the language does not support
   true lazy evaluation, but then it's a bit harder
   and may require another pass.

** Multiple Passes Solution

------------------------------------------
   GENERAL SOLUTION: MULTIPLE PASSES

Problem: where does each procedure start?

Passes over the IR:

 1. Compile all procedure code
    (now know how big each procedure is)

 2. Lay out procedure code in memory
    (now know where each starts)

 3. Change each call instruction
------------------------------------------

   Step 3 could be done by a "linker"
   (when compiler outputs information from steps 1 and 2)

   Q: What would a program need to do to change all the call instructions?

      iterate over the sequence of instructions,
      if it's a call, then adjust it

** Labels Solution

------------------------------------------
   GENERAL SOLUTION: LABELS

Use "labels" to allow



Term "label" is from assembly language

       ; ...
       jmp L
       ; ...
    L: ; ...
------------------------------------------
        ...
        the IR to specify a call target (address)
        that will be determined later

------------------------------------------
   APPROACHES TO FIXING LABELS

Problem: convert labels to addresses

(1) Use multiple passes
    a. Generate code with labels
    b. Lay out memory for procedures
       (determine starting addresses)
    c. Change labels to addresses

   advantages:

   disadvantages:

(2) Use shared mutable data (lazy eval.)
    a. labels are unique placeholders,
       shared by all uses (calls)
    b. when address is determined,
       update the placeholder
       (and all uses are updated)

   advantages:

   disadvantages:
------------------------------------------
        ... (advantages of multiple passes)
            + easy to understand/program
            + need a second pass (to adjust addresses) anyway

        ... (disadvantages of multiple passes)
            - the extra passes take time,
              but the time needed is (only) linear
              in size of compiled code

        ... (advantages of lazy eval)
            + can debug some code early (before full implementation)

        ... (disadvantages of lazy eval)
            - harder to understand, timing is everything
              (even in a language with true lazy evaluation)
            - label data structure must be truly unique
              (copies destroy the whole idea,
               so need pointers or references)
            - still requires multiple passes
              (needed, if the language does not have true lazy evaluation,
               or if lazy evaluation is not implemented fully,
               so in C another pass is needed in this case
               to force all resolutions)

*** label data structure

    This is patterned after lazy evaluation,
    but true lazy evaluation is not a feature of C
    and doesn't need to be fully implemented to work.

**** Creating and propagating labels

     The compiler needs to propagate pointers to labels,
     and the labels themselves must be unique.
     (It's an error if they are copied.)

------------------------------------------
       LABEL DATA STRUCTURE

// file label.h
// ...
#include "machine_types.h"

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

// Return a fresh label that is not set
extern label *label_create();

// Requires: lab != NULL
// Set the address in the label
extern void label_set(label *lab, unsigned int word_offset);

// Is the given label set?
extern bool label_is_set(label *lab);

// Requires: label_is_set(lab)
// Return the word offset in lab
extern unsigned int label_read(label *lab);
------------------------------------------
     So the compiler can create a label (on the heap),
     and all data pointing to it see the updates (once it's set)

------------------------------------------
          CREATING LABELS

Labels created for all procedures
when creating proc_decl_t ASTs.

// file ast.h
// ...
#include "label.h"
// ...
typedef struct proc_decl_s {
    file_location *file_loc;
    AST_type type_tag;
    struct proc_decl_s *next; // for lists
    const char *name;
    struct block_s *block;
    label *lab; // for code generation
} proc_decl_t;

// file ast.c
// Return an AST for a proc_decl
proc_decl_t ast_proc_decl(ident_t ident, block_t block)
{
    proc_decl_t ret;
    ret.file_loc = // ...
    ret.type_tag = proc_decl_ast;
    ret.next = NULL;
    ret.name = ident.name;
    block_t *p = // ...
    ret.block = p;
    // this is the source of the labels
    ret.lab = label_create();
    assert(ret.lab != NULL);
    return ret;
}
------------------------------------------

------------------------------------------
   PROPAGATING POINTERS TO LABELS (1)

Labels added to attributes of procedure names

// file id_attrs.h
//...
#include "label.h"

typedef struct {
    file_location file_loc;
    id_kind kind;  // kind of identifier
    unsigned int offset_count;
    // for a procedure, its label
    label *lab;
} id_attrs;
------------------------------------------

------------------------------------------
   PROPAGATING POINTERS TO LABELS (2)

Make call statement ASTs point to the label
of the procedure being called

// file scope_check.c
// check the statement to make sure that
// the procedure has been declared
// (if not, then produce an error).
// Modifies the given AST
// to have appropriate id_use pointers.
void scope_check_callStmt(
        call_stmt_t *stmt)
{




}
------------------------------------------
        ...
            stmt->idu = scope_check_ident_declared(
                            *(stmt->file_loc), stmt->name);
            assert(stmt->idu != NULL);
            id_attrs *attrs = id_use_get_attrs(stmt->idu);
            // check that it's a procedure, or error
            assert(attrs != NULL);
            assert(attrs->lab != NULL);

     So now each call statement points to the label
     of the procedure being called

------------------------------------------
   PROPAGATING POINTERS TO LABELS (3)

Associate labels with each call instruction
in code structures

// file code.h
// ...
#include "label.h"

typedef struct code_s {
    struct code_s *next;
    bin_instr_t instr;
    // labels for call instructions
    label *lab;
} code;
// ...
// Requires: lab != NULL
// Create and return a fresh instruction
// with the named mnemonic and parameters
extern code *code_call(address_type a, label *lab);
// ...
------------------------------------------

     Q: Where should the label passed to the code_call function come from?

        From the AST for the proc declaration of the called procedure
        (via the call statement).
        Has to be the same pointer!

     This puts labels in code sequences also

     Q: So what has been achieved?
        Every procedure declared has a label,
        every call to a procedure points to that label
        (as does every call instruction code)

**** Using labels to fill in addresses of procedures

     Now that the labels are where they need to be,
     information about where each procedure starts needs to get to
     its label and from there to the call instruction

***** Putting addresses of Procedures in Labels

------------------------------------------
    SETTING LABELS IN PROCEDURES (1)

// file label.h

typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;
------------------------------------------

      Q: Where is the address of a procedure known?

         In code generation when the procedure has been compiled

------------------------------------------
    SETTING LABELS IN PROCEDURES (2)

// file gen_code.c
// ...
void gen_code_proc_decl(proc_decl_t pd)
{




}
------------------------------------------

      Q: What makes code for a procedure?

         Its block

      I also added a data structure to hold the code_seq for procedures
      called a proc_holder, with a register function.
      The proc_holder_register function returns the procedure's offset
      (based on where it goes in the text section)

        ...
            code_seq pdc = gen_code_block(*(pd.block));
            // add code to return from the procedure call
            code_seq_add_to_end(&pdc, code_rtn());
            unsigned int proc_offset = proc_holder_register(pdc);
            label_set(pd.lab, proc_offset);

      Q: At the end of code generation, what has this achieved?

         Now we know for each procedure its address,
         which is in its label.
         And the label is pointed to by the code for each call instruction.

***** Fixing up the Call Instructions

      Q: Can some call instructions not have their labels set
         at the end of code generation?
         No, assuming that scope checking made sure that each call
         was a call to a declared procedure,
         then by the end of code generation each procedure declaration
         has been processed, and during gen_code_proc_decl we set each label,
         so each label should be set by then

------------------------------------------
   PUTTING ADDRESSES IN CALL INSTRUCTIONS

Write a function to fix all call instructions
in a code_seq

   extern void code_seq_fix_labels(code_seq cs);
------------------------------------------

      Q: How would you write such a function?

         Use a loop, written with code_seq_is_empty, code_seq_first,
         and code_seq_rest.

        ...
         For each (code *)c in a code_seq do
             if c is a call instruction then
                 unsigned int a = label_read(c->lab);
                 c->instr.jump.addr = a;

         It's best to use
             assert(label_is_set(c->lab));
         in such a loop, for debugging.

** testing the solution

   Q: How would you test a solution?

      Write SPL code that uses procedures and procedure calls;
      start with the simplest examples

** exercise

   Write code for the float calculator with let statements

   ::= let { } in
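
   Returning to the fix-up loop for call instructions described earlier,
   here is a minimal, self-contained sketch of the idea of shared labels.
   It uses simplified stand-ins rather than the course's real types:
   the code struct here has a plain addr field in place of
   instr.jump.addr, the list is walked through next pointers instead of
   the code_seq functions, and a NULL lab is assumed to mark an
   instruction that is not a call. The function name fix_labels is
   hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same shape as the label struct in the notes. */
typedef struct {
    bool is_set;
    unsigned int word_offset;
} label;

/* Simplified stand-in for the notes' code struct (an assumption):
   addr plays the role of instr.jump.addr. */
typedef struct code_s {
    struct code_s *next;
    unsigned int addr;
    label *lab;  /* assumed NULL for instructions that are not calls */
} code;

/* Requires: lab != NULL.  Record the address in the label. */
void label_set(label *lab, unsigned int word_offset)
{
    assert(lab != NULL);
    lab->word_offset = word_offset;
    lab->is_set = true;
}

/* Requires: lab != NULL and lab is set.  Return its word offset. */
unsigned int label_read(const label *lab)
{
    assert(lab != NULL);
    assert(lab->is_set);  /* catches calls whose target was never set */
    return lab->word_offset;
}

/* Copy each call's label address into the call instruction itself. */
void fix_labels(code *cs)
{
    for (code *c = cs; c != NULL; c = c->next) {
        if (c->lab != NULL) {
            c->addr = label_read(c->lab);
        }
    }
}
```

   Because every call to a procedure holds a pointer to the same label,
   setting that label once (as gen_code_proc_decl does with label_set)
   is enough for this loop to fill in every call to that procedure,
   including recursive calls that were generated before the address
   was known.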