COP 3402 meeting -*- Outline -*-

* Overview

------------------------------------------
            SYSTEMS SOFTWARE

Helping people run programs

  Human   -----> [ Translator ]
  Readable        /     |
  Input          /      |
  (Textual)     /       |
               /        v
              v      Object
           Library    Code
          (Binary)  (Binary)
               \        |
                \       |
                 \      v
                  \->[ Linker/
                       Loader ]
                         |
                         |
                         v
                      Process
                         |
                         |
                         v
                 [ OS + Computer ]
                         |
                         |
                         v
                    Computation
------------------------------------------

        Q: How would a debugger fit into this picture?

           It would interact with the process and OS+Computer

        Q: What job or jobs does a translator have?

           Two:
             - translating the source language into machine code
             - giving the user (programmer) good feedback (error messages)

** Compiler vs. Interpreter

------------------------------------------
           RECALL DEFINITIONS

def: An *assembler* translates a low-level
     language that is close to machine code
     into machine code.

def: A *compiler* translates a (high-level)
     language into a form that can be
     (more easily) executed by a computer.

def: An *interpreter* executes a
     programming language directly, often
     without translating it into object
     code first.
------------------------------------------

        Assembly language uses human-readable names for instructions
        and macros. Typically there is a one-to-one correspondence
        between instructions in assembly language and instructions
        in the translated machine code.

        An interpreter may do some translation into an internal format
        (e.g., a virtual machine language)

        Q: What are some examples of languages with interpreters?

           Python, Java (to some extent)

        Modern languages, like Java, often combine interpreters and
        compilers, using a "just-in-time compiler" that only compiles
        programs into machine code as needed, and interprets seldom
        used code.
        Such systems may be called *hybrids* of compilers and interpreters.
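
        A minimal sketch in C of the difference between the two
        definitions above. Everything in it is made up for
        illustration (the node type, the three-instruction stack
        machine, and the names interpret, compile, and run); it is
        not the course's VM. interpret() computes the value straight
        from the AST, while compile() only produces instructions,
        which run() executes later.

```c
#include <assert.h>
#include <stddef.h>

/* A tiny AST for integer expressions (hypothetical, for illustration). */
typedef enum { NUM, ADD, MUL } node_kind;
typedef struct node {
    node_kind kind;
    int value;                  /* used when kind == NUM */
    const struct node *left;    /* used when kind is ADD or MUL */
    const struct node *right;
} node;

/* Interpreter: walk the AST and compute the result directly. */
int interpret(const node *n)
{
    switch (n->kind) {
    case NUM: return n->value;
    case ADD: return interpret(n->left) + interpret(n->right);
    case MUL: return interpret(n->left) * interpret(n->right);
    }
    return 0; /* not reached for well-formed trees */
}

/* Compiler: translate the AST into code for a made-up stack machine. */
typedef enum { PUSH, ADDOP, MULOP } opcode;
typedef struct { opcode op; int arg; } instr;

/* Appends the code for n at code[len]; returns the new length. */
size_t compile(const node *n, instr code[], size_t len, size_t cap)
{
    if (n->kind == NUM) {
        assert(len < cap);
        code[len++] = (instr){ PUSH, n->value };
        return len;
    }
    /* operands first, then the operator (postfix order) */
    len = compile(n->left, code, len, cap);
    len = compile(n->right, code, len, cap);
    assert(len < cap);
    code[len++] = (instr){ n->kind == ADD ? ADDOP : MULOP, 0 };
    return len;
}

/* The "machine": executes previously compiled code. */
int run(const instr code[], size_t len)
{
    int stack[64];
    size_t sp = 0;
    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc].op) {
        case PUSH:
            assert(sp < 64);
            stack[sp++] = code[pc].arg;
            break;
        case ADDOP:
        case MULOP: {
            assert(sp >= 2);
            int r = stack[--sp];
            int l = stack[--sp];
            stack[sp++] = (code[pc].op == ADDOP) ? l + r : l * r;
            break;
        }
        }
    }
    assert(sp == 1);
    return stack[0];
}
```

        For the tree of 2 + 3 * 4, interpret() and run() on the
        compiled code should agree on the value; the compiled form
        can be saved and run many times without revisiting the AST.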

------------------------------------------
           COMPILER PICTURE

Source -----> [ Lexical Analyzer ]
                       |
                       v
                 Token stream
                       |
                       v
                  [ Parser ]
                       |
                       v
        Parse Tree (AST) + Symbol Table
                       |
                       v
              [ Static Analysis ]
                       |
                       v
        Parse Tree (AST) + Symbol Table
                       |
                       v
              [ Code Generator ]
                       |
                       v
                  Object Code
                       |
                       v
               [ Linker/Loader ]
                       |
                       v
                    Process
                       |
                       v
               [ OS + Computer ]
------------------------------------------

        Point out the front end (lexical analysis and parsing),
        back end (code generator)
        and "middle end" (static analysis)

------------------------------------------
          INTERPRETER PICTURE

Source -----> [ Lexical Analyzer ]
                       |
                       v
                 Token stream
                       |
                       v
                  [ Parser ]
                       |
                       v
        Parse Tree (AST) + Symbol Table
                       |
                       v
                [ Interpreter ]
                       |
                       v
               [ OS + Computer ]
------------------------------------------

        The interpreter is a process running on the computer;
        it executes the instructions, perhaps using the AST.

        An interpreter may also do static analysis.

*** advantages and disadvantages

------------------------------------------
      ADVANTAGES AND DISADVANTAGES

Compilers
 Advantages:



 Disadvantages:



Interpreters
 Advantages:



 Disadvantages:


------------------------------------------

        ... Compilers
             Advantages:
                - often produce code that executes faster
                  (perhaps 10x faster)
                  why?
                     - optimization
                     - direct execution of object code on machine
                       (vs. indirect execution of a VM)
                - separates preparation (compile time) vs.
                  running (runtime), so have time to:
                     - carefully analyze code for problems
                     - improve (optimize) code
                     - carry out complex transformations
                       (e.g., expanding macros, expanding generics)
                - compiled (object) code is hard to read
                  so can help protect IP
             Disadvantages:
                - compiled (object) code can be quite different
                  from source, so can be hard to debug
                - compilation time can slow development
                  (long waits for the compiler cause breaks
                   between writing code and testing it)
                - object code can be hard to check
                  so can lead to security vulnerabilities
                - object code may be large,
                  so slow to send over network
                - compilers are bigger programs, so harder to write

            Interpreters
             Advantages:
                - little time needed to start running a program
                  so better feedback for initial development efforts
                - intermediate representation can be very concise
                  so faster to send over network
                  (Java's initial pitch)
                - source code is easier to understand than object code
                  so easier to:
                     - debug
                     - check or verify
                     - protect against security vulnerabilities
                - smaller program than a compiler, so
                     - better for prototyping language ideas
                     - better for creating very high-level languages
                       (e.g., SPSS and R, Matlab, ...)
             Disadvantages:
                - execution generally slower
                  since not much time can be spent
                  on analysis and optimization

*** hybrids of compilers and interpreters

------------------------------------------
   HYBRID COMPILER/INTERPRETER PICTURE

Source -----> [ Lexical Analyzer ]
                       |
                       v
                 Token stream
                       |
                       v
                  [ Parser ]
                       |
                       v
        Parse Tree (AST) + Symbol Table
                       |
                       v
              [ Static Analysis ]
                       |
                       v
        Parse Tree (AST) + Symbol Table
                       |
                       v
              [ Code Generator ]
                       |
                       v
                    VM Code
                       |
                       v
             [ VM + JIT Compiler ]
                       |
                       v
               [ OS + Computer ]
------------------------------------------

        The VM is also a process running on the computer;
        it often features a compiler inside, the Just-in-Time (JIT)
        compiler, which may compile VM code into machine code
        to improve runtime efficiency.

------------------------------------------
  HYBRID ADVANTAGES AND DISADVANTAGES

Advantages:




Disadvantages:



------------------------------------------

        ... Advantages:
                - faster compilation so less waiting
                - faster execution speed than interpreter
                - can have better optimization than compiler
                  since has runtime execution information
                - VM code smaller than object code
                  so faster to send over network
                - VM code easier to read
                  so easier to: debug, check
                - Can build full system incrementally
                  (adding JIT compiler later)
            Disadvantages:
                - VM code is easier to read,
                  so doesn't help protect IP
                - Slower at runtime (than compiled code), because:
                     - more to do, execute and also compile
                     - indirect at first
                     - compiling code takes time away
                       from executing code
                - VM is generally larger than compiled code's
                  runtime system (so more I/O to get it started)
                - hybrid is harder to write than either
                  compiler or interpreter (it includes both!)

        Note: we will be doing something like this in the homework
              HW2: Lexical analysis
              HW3: Parsing
              HW4: Code generation and execution

** compiler structure

------------------------------------------
      STANDARD COMPILER ARCHITECTURE

        Source code (text)
                |
                v
            [ Lexer ]
                |
                v
        Stream of tokens
                |
                v
            [ Parser ]
                |
                v
        AST + Symbol Table
         |     /        |
         v    v         |
  [ Static Analyzer ]  /
         |            /
         v           /
  Intermediate Rep. /
         |         /
         v        v
   [ Code Generator ]
            |
            v
     Instruction Rep.
            |
            v
      [ Optimizer ]
            |
            v
       Machine Code
------------------------------------------

        Notes:
           Each part may generate error messages;
             typically compilation stops if there are any,
             but could do error recovery and continue

           The optimizer could be omitted

           Lexical analyzer (lexer)
             conceptually translates stream of chars
             to stream of tokens
             but often communicates via global variables

           Symbol table
             conceptually a map from identifiers to information
             such as:
                lexical address,
                type (or size),
                source info (line, column),
                initial value (or definition),
                (other static analysis information)

           AST = Abstract Syntax Tree
             represents the parse
             (hierarchical structure according to the grammar)
             of the input

           The intermediate representation varies:
             Some compilers use 3-address code
               (like the register VM),
             Today many use a single-assignment language
               (where each variable is only assigned once)
             Some use annotated ASTs, and manipulate those
             Some use a custom representation
               that works for their optimizer

*** tokens

**** definitions

------------------------------------------
                TOKENS

Represent distinct symbols in the input



  including punctuation and operators



Typically, comments are ignored

Reserved words:



Keywords:



White space delimits identifiers/

        aNumber   vs.   a Number
------------------------------------------

        ... e.g., 3.14  ident  "a string"
        ... e.g., ;  <=  >  :=  (  )
        ... strings that would be identifiers
            but always represent something special
            e.g., if else while var
        ...
            strings that would be identifiers
            but only play a special role in some contexts
            e.g., in PL/I can write:
                 if if=then then then=else else if=then=else

        Reserved words make lexical analysis easier

**** data structures

***** token types

------------------------------------------
                TOKENS

Defined in *.tab.h file produced by bison
Example from the SRM assembler (asm.tab.h)

#ifndef YYTOKENTYPE
# define YYTOKENTYPE
  enum yytokentype
  {
    YYEMPTY = -2,
    YYEOF = 0,              /* "end of file"  */
    YYerror = 256,          /* error  */
    YYUNDEF = 257,          /* "invalid token"  */
    eolsym = 258,           /* eolsym  */
    identsym = 259,         /* identsym  */
    unsignednumsym = 260,   /* unsignednumsym  */
    plussym = 261,          /* "+"  */
    minussym = 262,         /* "-"  */
    commasym = 263,         /* ","  */
    /* ... */
    straopsym = 303,        /* "STRA"  */
    notropsym = 304,        /* "NOTR"  */
    regsym = 305,           /* regsym  */
    wordsym = 306           /* "WORD"  */
  };
  typedef enum yytokentype yytoken_kind_t;
#endif

Examples:
   Input        Token (number)
   ident        identsym
   34           unsignednumsym
   +            plussym
------------------------------------------

        Q: What would be the token types for the input WORD x = +24

             wordsym identsym equalsym plussym unsignednumsym

        Q: How are reserved words represented?

             Each has its own token type

*** symbols and symbol table

        A symbol table is a mapping (e.g., a hash table) from
        identifiers (strings) to their attributes
        (such as file location, type, lexical address, etc.)

        Q: What part of a compiler should populate the symbol table?

           The parser, unless there is no scoping
           (as in assembly language)
           (It's done in pass1 in the SRM assembler.)

        Q: Why should that be the tool?

           - Only the parser knows about nesting;
             the lexer doesn't know the scope of variables

           However, for some languages, like C, it is necessary
           to know the kind of name for further parsing
           (e.g., may need to know what names are types).
           That kind of feedback may require multiple passes
           by the parser, or some interaction during
           lexical analysis may be able to handle it...
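
        One way the "interaction during lexical analysis" for C
        could look, as a rough sketch: the parser records each name
        it sees declared as a type, and the lexer consults that
        record before deciding which token to return. All names
        here (typenamesym, declare_type_name, classify_name) are
        hypothetical, and the sketch ignores scoping of type names.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical token kinds; a real compiler would take these
   from the header that bison generates. */
typedef enum { identsym, typenamesym } name_token;

#define MAX_TYPE_NAMES 100
static const char *type_names[MAX_TYPE_NAMES];
static int type_name_count = 0;

/* Called by the parser when it finishes a type declaration.
   Stores the pointer only, so name must outlive the table. */
void declare_type_name(const char *name)
{
    assert(type_name_count < MAX_TYPE_NAMES);
    type_names[type_name_count++] = name;
}

/* Called by the lexer on each identifier-like string:
   names already declared as types get a different token. */
name_token classify_name(const char *text)
{
    for (int i = 0; i < type_name_count; i++) {
        if (strcmp(type_names[i], text) == 0) {
            return typenamesym;
        }
    }
    return identsym;
}
```

        The same string can thus yield identsym before its
        declaration as a type and typenamesym afterwards, which is
        why the lexer cannot classify such names on its own.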

**** symbols

------------------------------------------
               SYMBOLS

What information should be remembered for
identifiers?





------------------------------------------

        - text (name)
        - lexical address (of declaration)
          (lexical level and offset)
        - kind of name (if there are different namespaces,
          e.g., constants, variables, routines, types)
        - size (e.g., for arrays)
        - filename of declaration
        - line number of declaration
        - column number of declaration

        Some other information might be useful, e.g.:
           for managing the symbol table (a delete mark), or
           for code generation may want to have a variable to mark
             if space has been allocated (for a variable)

        Q: How should we track lexical levels?

           count from outermost (which is 0) inwards

        Q: What would be a suitable data structure for a symbol?

           A struct

**** symbol table

------------------------------------------
             SYMBOL TABLE

A *symbol table*


Does each scope have its own symbol table?


What operations would a symbol table need?


What data structure would be good?


------------------------------------------

        ... maps identifiers to symbols
        ... yes, esp. if you can redeclare names in inner scopes
        ... lookup/fetch, add,
            If the symbol table is mutable, then need
              push scope, pop scope (or can use a delete operation)
            if it's not mutable,
              then add needs to produce a new table
        ... a hash table or stack of hash tables
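
        A minimal sketch in C of a mutable symbol table with the
        operations listed above (push scope, pop scope, add,
        lookup). To stay short it keeps a linked list of symbols
        per scope instead of a hash table, and it records only a
        few of the attributes. The struct fields and function names
        are assumptions for illustration, not the homework's
        interface.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct symbol {
    const char *name;      /* text of the identifier (not copied) */
    unsigned level;        /* lexical level: 0 is the outermost scope */
    unsigned offset;       /* position of the declaration in its scope */
    unsigned line;         /* line number of the declaration */
    struct symbol *next;   /* next symbol declared in the same scope */
} symbol;

typedef struct scope {
    symbol *symbols;       /* names declared in this scope */
    unsigned count;        /* how many names it holds so far */
    unsigned level;        /* lexical level of this scope */
    struct scope *outer;   /* enclosing scope, NULL for the outermost */
} scope;

static scope *current = NULL;   /* innermost scope (top of the stack) */

void push_scope(void)
{
    scope *s = malloc(sizeof *s);
    assert(s != NULL);
    s->symbols = NULL;
    s->count = 0;
    s->level = (current == NULL) ? 0 : current->level + 1;
    s->outer = current;
    current = s;
}

void pop_scope(void)
{
    assert(current != NULL);
    scope *s = current;
    current = s->outer;
    while (s->symbols != NULL) {   /* discard the scope's symbols */
        symbol *dead = s->symbols;
        s->symbols = dead->next;
        free(dead);
    }
    free(s);
}

/* Search a single scope for name. */
static symbol *find_in(const scope *s, const char *name)
{
    for (symbol *p = s->symbols; p != NULL; p = p->next) {
        if (strcmp(p->name, name) == 0) return p;
    }
    return NULL;
}

/* Search from the innermost scope outwards; NULL if undeclared. */
symbol *lookup(const char *name)
{
    for (const scope *s = current; s != NULL; s = s->outer) {
        symbol *found = find_in(s, name);
        if (found != NULL) return found;
    }
    return NULL;
}

/* Declare name in the current scope;
   returns NULL if it is already declared in that scope. */
symbol *add_symbol(const char *name, unsigned line)
{
    assert(current != NULL);
    if (find_in(current, name) != NULL) return NULL;
    symbol *sym = malloc(sizeof *sym);
    assert(sym != NULL);
    sym->name = name;
    sym->level = current->level;
    sym->offset = current->count++;
    sym->line = line;
    sym->next = current->symbols;
    current->symbols = sym;
    return sym;
}
```

        Because lookup starts at the innermost scope, a
        redeclaration in an inner scope hides the outer one until
        pop_scope is called, which matches the "yes" answer above.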