I. Code Generation A. overview ------------------------------------------ OVERVIEW OF CODE GENERATION .. ASTs...-> [ Static Analysis ] | | IR v [ Code Generation ] | | Machine Code | v Virtual Machine Execution The IR (= Intermediate Representation) records ------------------------------------------ 1. IR (Intermediate Representation) What kind of information is needed from a name's use in order to generate code? Should the parser create the lexical address of a name's use during paring? Is the symbol table unchanging (immutable)? ------------------------------------------ IR TREES An IR is a tree structure, Helps in modularizing compilers and code generation WITHOUT IR WITH IR Java x86 Java x86 C MIPS C MIPS IR C++ Sparc C++ Sparc C# A1 C# A1 ------------------------------------------ ------------------------------------------ OUR CHOICES FOR AN IR To keep things simple, we will use a modified AST type as an IR Parser: - records - provides Static analysis: - records ------------------------------------------ 2. General strategy ------------------------------------------ GENERAL STRATEGY FOR CODE GENERATION Follow the grammar of ------------------------------------------ ------------------------------------------ FOLLOWING THE GRAMMAR Code resembles the grammar that When ------------------------------------------ How does this relate to the parser? Why is this useful? B. Translation target: code sequences ------------------------------------------ TARGET: CODE SEQUENCES Need sequences of machine code Why? ------------------------------------------ Why are code sequences needed? ------------------------------------------ REPRESENTING CODE SEQUENCES IN C // code that can be in a sequence typedef struct code_s code; // code sequences typedef code *code_seq; // machine code instructions typedef struct code_s { code_seq next; instruction instr; } code; ------------------------------------------ C. Expressions ------------------------------------------ TRANSLATING EXPRESSIONS Abstract syntax of expressions: E ::= E1 o E2 | x | n o ::= + | - | * | / VM instructions: NEG "negate val on top of stack" ADD "add top 2 elems on stack" SUB "subtract top from 2nd top" MUL "multiply top 2 elems" DIV "divide 2nd from top by top" MOD "2nd from top elem mod top" PBP "push BP on the stack" LOD o "load offset from addr in top" LIT n "literal" ------------------------------------------ ------------------------------------------ TRANSLATION SCHEME FOR EXPRESSIONS Expression MACHINE CODE x n E1 o E2 ------------------------------------------ The grammar (abstract syntax) is recursive, so what are the base cases? Why use PBP (push base pointer) followed by levl PSI instructions? What are we assuming about the evaluation of expressions? D. Declarations Where are constants and variables stored? ------------------------------------------ TRANSLATION SCHEME FOR DECLARATIONS const c = n; var x; ------------------------------------------ What does LIT n do? What does INC do? How do we track the association between names and these locations? Where do global constants and variables go? E. Straight-line Code (simple statements without control flow) What are the base cases in the grammar for statements? ------------------------------------------ TRANSLATION SCHEME FOR BASIC STATEMENTS skip x := E write E read x ------------------------------------------ Can the "levels outwards" part of the lexical address be determined when the variable is declared? Can the "offset from start of AR" part of the lexical address be determined when the variable is declared? Does the same hold for constants? F. Conditions ------------------------------------------ GRAMMAR FOR CONDITIONS ::= odd | ::= = | <> | < | <= | > | >= So structure is? Code looks like: ------------------------------------------ What should these functions return? G. Control Flow Statements (Compound Statements) Why is it useful to write the base cases first? ------------------------------------------ ABSTRACT SYNTAX FOR COMPOUND STATEMENTS S ::= begin { S } | if C S1 S2 | while C S So what is the code structure? Code looks like: begin S1 S2 ... if C S1 S2 while C S ------------------------------------------ ------------------------------------------ TRANSLATION PROBLEM FOR CONDITIONAL STMTS If jumps use an absolute address, how to compute that? Problem: For code in procedures: Approaches: ------------------------------------------ H. procedures ------------------------------------------ SUPPORTING PROCEDURES AND CALLS Main issues: - storing their code Why? - knowing exactly where each starts Why? Another issue: - trimming the stack for return ------------------------------------------ 1. Where to store code for procedures? Where do we put the code sequences for each procedure? ------------------------------------------ WHERE TO PUT PROCEDURE CODE? Possible layouts in VM's code array: ------------------------------------------ How would you implement each? Which layout makes the most sense? 2. how to find each procedure's starting address? ------------------------------------------ NESTED PROCEDURES ARE A PROBLEM procedure A; procedure B; begin # B's body code... call A # ... # ... end begin # A's body code call B # ... # ... end If store the code as [ code for A ] [ code for B ] how do we know the address of B to compile the call to B? What about the other direction? ------------------------------------------ ------------------------------------------ RECURSIVE PROCEDURES, ANOTHER PROBLEM procedure R; begin # R's body code ... call R # ... end Before storing code for R, how do we know where it starts? procedure O; begin # O's body code... call E # ... end procedure E; begin # E's body code ... call O # ... One of these must before the other in the code area of the VM... ------------------------------------------ No matter which of O or E is put first, how is the call to the second one to know where the second one starts? 3. Labels as a solution ------------------------------------------ GENERAL SOLUTION: LABELS Use "labels" to allow Term "label" is from assembly language ; ... jmp L ; ... L: ; ... ------------------------------------------ ------------------------------------------ APPROACHES TO FIXING LABELS Problem: convert labels to addresses (1) Use multiple passes - lay out memory for procedures (determine starting addresses) - change labels to addresses advantages: disadvantages: (2) Use shared mutable data (lazy eval.) - label is a unique placeholder, shared by all uses - when address is determined, update the placeholder (and all uses are updated) advantages: disadvantages: ------------------------------------------ a. label data structure for lazy evaluation ------------------------------------------ LABEL DATA STRUCTURE FOR LAZY EVAL In file label.h for HW4: typedef struct { bool is_set; address addr; } label; // Return a fresh label that is not set extern label *label_create(); // Set the address in the label extern void label_set(label *lab, address addr); // Is lab set? extern bool label_is_set(label *lab); // Requires: label_is_set(lab) // Return the address in lab. extern address label_read(label *lab); ------------------------------------------ 4. Trimming the stack for returns ------------------------------------------ PROBLEM OF TRIMMING THE STACK FOR RETURNS Consider a procedure's block like: procedure P; var x; skip; code for P would be: INC 1 NOP RTN Why? Picture of stack before a call of P: 6 | | 5 | | 4 | | <- SP 3 | | 2 | | <- BP |---------| Picture after the call 7 | | <- SP 6 | ret adr | 5 | old BP | 4 | stat lnk| <- BP 3 | | 2 | | |---------| After executing INC 1 and NOP 8 | | <- SP 7 | x | 6 | ret adr | 5 | old BP | 4 | stat lnk| <- BP 3 | | 2 | | |---------| The return will pop the stack 3 times: 5 | old BP | <- BP 4 | | 3 | | 2 | | |---------| ------------------------------------------ Is the picture at the end the same as the start? What got put into the BP register? How to fix this? I. exercise