SYSTEMS SOFTWARE

Helping people run programs

 Human    -----> [ Translator ]
 Readable          /      |
 Input            /       |
 (Prog. text)    /        |
                /         v
               v       Object
            Library     Code
           (Binary)   (Binary)
               \          |
                \         |
                 \        v
                  \->[ Linker/ Loader ]
                             |
                             |
                             v
                          Process
                             |
                             |
                             v
                     [ OS + Computer ]
                             |
                             |
                             v
                        Computation

RECALL DEFINITIONS

def: An *assembler* translates a low-level language
     that is close to machine code
     into machine code.

def: A *compiler* translates a (high-level) language
     into a form that can be (more easily)
     executed by a computer.

def: An *interpreter* executes a programming language
     directly, often without translating it
     into object code first.

COMPILER PICTURE

 Source -----> [ Lexical Analyzer ]
                        |
                        v
                  Token stream
                        |
                        v
                   [ Parser ]
                        |
                        v
         Parse Tree (AST) + Symbol Table
                        |
                        v
               [ Static Analysis ]
                        |
                        v
         Parse Tree (AST) + Symbol Table
                        |
                        v
               [ Code Generator ]
                        |
                        v
                   Object Code
                        |
                        v
               [ Linker/ Loader ]
                        |
                        v
                     Process
                        |
                        v
                [ OS + Computer ]

INTERPRETER PICTURE

 Source -----> [ Lexical Analyzer ]
                        |
                        v
                  Token stream
                        |
                        v
                   [ Parser ]
                        |
                        v
         Parse Tree (AST) + Symbol Table
                        |
                        v
                 [ Interpreter ]
                        |
                        v
                [ OS + Computer ]

ADVANTAGES AND DISADVANTAGES

Compiler

 Advantages:
  - programs can execute very quickly
      why faster?
       - can optimize code
       - direct execution on machine
  - separates preparation (compile time)
    from running (runtime)
      have time (before runtime) to:
       - carefully analyze code for problems
       - improve (optimize) code
       - carry out complex transformations
         (expand macros, generics)
  - compiled code is hard to read
      helps protect IP

 Disadvantages:
  - compiled code can be hard to debug
      because it might be quite different
      from the source code
  - compilation time can be slow
      may slow down development
  - object code can be hard to check
      e.g., for security vulnerabilities
  - object code can be large
      so hard to send over a network
  - compiler is a bigger program,
      so harder to write

Interpreters

 Advantages:
  - little time to start running program
      so write-debug cycle is faster
  - intermediate representation can be very concise,
      so faster to send over network
  - source code (main artifact) is easier
    to read and understand, so easier to:
       - debug
       - check or verify
       - protect against security vulnerabilities
  - smaller program than compiler
      so easier to write
  - better for prototyping language ideas
  - better for creating very high-level languages
      (R, SPSS, and Matlab)
  - might be more secure

 Disadvantages:
  - execution speed is generally slower

HYBRID COMPILER/INTERPRETER PICTURE

 Source -----> [ Lexical Analyzer ]
                        |
                        v
                  Token stream
                        |
                        v
                   [ Parser ]
                        |
                        v
         Parse Tree (AST) + Symbol Table
                        |
                        v
               [ Static Analysis ]
                        |
                        v
         Parse Tree (AST) + Symbol Table
                        |
                        v
               [ Code Generator ]
                        |
                        v
                     VM Code
                        |
                        v
     /------->[ VM + JIT Compiler ]
     |statistics        |
     |                  v
     \--------[ OS + Computer ]

HYBRID ADVANTAGES AND DISADVANTAGES

Advantages:
 - faster compilation, so less waiting
 - faster runtime execution than interpreter
 - can have better optimization than a compiler
 - compilation times are faster
   (only compile what needs to be compiled)
 - still has small VM code
 - VM code is easier to read
     better debugging
 - can build the system incrementally

Disadvantages:
 - VM code is easier to read
     so less protection for IP
 - slower at runtime
 - more to do (execute and also
   compile)
 - indirect execution at first
 - compiling code uses time
 - more I/O to get started
 - harder to write than a compiler or interpreter
   (it's both!)

STANDARD COMPILER ARCHITECTURE
(Based on Appel's book "Modern Compiler Implementation")

    Source code (text)
            |
            v
        [ Lexer ]
            |
            v
    Stream of tokens
            |
            v
       [ Parser ]
            |
            v             (AST = Abstract Syntax Tree)
           AST -------------\
            |                \
            v                 v
   [ Static Analyzer ] <- Symbol Table
            |                /
            v               /
    Intermediate Rep.      /
            |             /
    [ HL Optimizer ]     /
            |           /
            v          /
    Intermediate Rep. /
            |        /
            v       v
   [ Code Generator ]
            |
            v
    Instruction Rep.
            |
            v
    [ LL Optimizer ]
            |
            v
      Machine Code

TOKENS

Represent distinct symbols in the input
   e.g.,  anIdentifier  (  ,  )  34241234

   including punctuation and operators
   e.g.,  +

   and spaces

Typically, comments are ignored

Reserved words: strings that would normally be
   identifiers but are special in the language
   e.g.,  if  while  begin  end

Keywords: special only in certain contexts
   if if=then then then=else else if=then=else

White space delimits identifiers
   e.g.,  aNumber vs. a Number

TOKENS

Defined in *.tab.h file produced by bison

Example from the SSM assembler (asm.tab.h)

#ifndef YYTOKENTYPE
# define YYTOKENTYPE
  enum yytokentype
  {
    YYEMPTY = -2,
    YYEOF = 0,                     /* "end of file"  */
    YYerror = 256,                 /* error  */
    YYUNDEF = 257,                 /* "invalid token"  */
    eolsym = 258,                  /* eolsym  */
    identsym = 259,                /* identsym  */
    unsignednumsym = 260,
    plussym = 261,                 /* "+"  */
    minussym = 262,                /* "-"  */
    commasym = 263,                /* ","  */
    dottextsym = 264,              /* ".text"  */
    dotdatasym = 265,              /* ".data"  */
    dotstacksym = 266,             /* ".stack"  */
    dotendsym = 267,               /* ".end"  */
    colonsym = 268,                /* ":"  */
    lbracketsym = 269,             /* "["  */
    rbracketsym = 270,             /* "]"  */
    equalsym = 271,                /* "="  */
    noopsym = 272,                 /* "NOP"  */
    addopsym = 273,                /* "ADD"  */
    subopsym = 274,                /* "SUB"  */
    /* ...
    */
    wordsym = 318,                 /* "WORD"  */
    charsym = 319,                 /* "CHAR"  */
    stringsym = 320,               /* "STRING"  */
    charliteralsym = 321,
    stringliteralsym = 322
  };
  typedef enum yytokentype yytoken_kind_t;
#endif

Examples:

  Input        yytokentype
  =======================================
  ident        identsym
  34           unsignednumsym
  +            plussym

SYMBOLS

What information should be remembered for identifiers?
  - text (name)
  - lexical address (of declaration)
  - kind (const or var)
  - size
  - filename of declaration
  - line number of declaration

  maybe other information for managing the symbol table
  - delete mark
  - space allocated mark

SYMBOL TABLE

A *symbol table*

Does each scope have its own symbol table?

What operations would a symbol table need?

What data structure would be good?
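
One possible way to approach the questions above, as a minimal sketch
in C and not the course's actual design: each scope gets its own table
(here a simple linked list of symbols), scopes form a stack, and the
operations are enter scope, leave scope, declare, and lookup.
All names below (symbol, scope, scope_enter, scope_leave,
scope_declare, scope_lookup) are hypothetical. The sketch also assumes
that a lexical address is the pair (levels outward, offset in scope).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; only the fields come from the notes. */
typedef enum { KIND_CONST, KIND_VAR } id_kind;

typedef struct symbol {
    const char *name;      /* text of the identifier (not copied) */
    id_kind kind;          /* const or var */
    unsigned offset;       /* offset half of the lexical address */
    unsigned size;         /* size, in whatever unit the compiler uses */
    const char *filename;  /* filename of declaration */
    unsigned line;         /* line number of declaration */
    struct symbol *next;   /* next symbol in the same scope */
} symbol;

/* One table per scope; scopes form a stack through `outer`. */
typedef struct scope {
    symbol *symbols;       /* symbols declared in this scope */
    unsigned next_offset;  /* offset the next declaration will get */
    struct scope *outer;   /* enclosing scope, or NULL */
} scope;

/* Push a new, empty scope inside `outer` (NULL for the outermost). */
scope *scope_enter(scope *outer) {
    scope *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->symbols = NULL;
    s->next_offset = 0;
    s->outer = outer;
    return s;
}

/* Pop scope `s`, free its symbols, and return the enclosing scope. */
scope *scope_leave(scope *s) {
    assert(s != NULL);
    scope *outer = s->outer;
    for (symbol *p = s->symbols; p != NULL; ) {
        symbol *next = p->next;
        free(p);
        p = next;
    }
    free(s);
    return outer;
}

/* Search scope `s` only (linear search; a hash table would scale better). */
static symbol *find_local(const scope *s, const char *name) {
    for (symbol *p = s->symbols; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;
}

/* Declare `name` in `s`. Returns NULL if `name` is already declared
   in this same scope, or if allocation fails. */
symbol *scope_declare(scope *s, const char *name, id_kind kind,
                      unsigned size, const char *filename, unsigned line) {
    assert(s != NULL && name != NULL);
    if (find_local(s, name) != NULL) return NULL;
    symbol *sym = malloc(sizeof *sym);
    if (sym == NULL) return NULL;
    sym->name = name;
    sym->kind = kind;
    sym->offset = s->next_offset;
    sym->size = size;
    sym->filename = filename;
    sym->line = line;
    s->next_offset += size;
    sym->next = s->symbols;
    s->symbols = sym;
    return sym;
}

/* Look `name` up from the innermost scope outward. On success, stores
   the number of scopes crossed in *levels_out (if it is not NULL). */
symbol *scope_lookup(const scope *s, const char *name, unsigned *levels_out) {
    unsigned levels = 0;
    for (; s != NULL; s = s->outer, levels++) {
        symbol *sym = find_local(s, name);
        if (sym != NULL) {
            if (levels_out != NULL) *levels_out = levels;
            return sym;
        }
    }
    return NULL;
}
```

For example, after declaring x in an outer scope and entering an inner
scope, scope_lookup(inner, "x", &levels) finds x and sets levels to 1.
A linked list keeps the sketch short; for scopes with many
declarations a hash table per scope would probably be a better choice.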