CIS 6614 meeting -*- Outline -*-

* Program Analysis or Static Analysis

** What is program analysis?

   Program analysis is also called static analysis
   (esp. in the literature on programming languages and compilers)

------------------------------------------
        WHAT IS PROGRAM ANALYSIS?

Def: *program analysis* is

------------------------------------------
   ... predicting statically safe approximations to
       the set of configurations or behaviors that may occur
       dynamically in a program's execution

   Q: How is it different from testing? Runtime assertion checking?

      - It's done without running the program,
      - It takes all possible executions into account

   Q: How does this differ from (human) code inspection?

      automated, can handle larger, more complex code bases

   Traditionally (and first) used for compiler optimizations
   (such as register allocation, dead code elimination, etc.)

** use in software security

------------------------------------------
  USES OF PROGRAM ANALYSIS IN SOFTWARE SECURITY

General goal/idea:

   Program ---> Static Analyzer --> Bug Reports

Examples:
   - warn about bad library functions (e.g., gets)

------------------------------------------
   ... also examples:
   - warn about misuse of protocols (e.g., crypto)
   - warn about using tainted inputs
   - warn about possible:
      - type errors
      - buffer overflows
      - arithmetic overflows
      - double frees of memory
      - use after free
      - using var before written
      - dead code (unreachable code)
         - remember the goto fail bug?

   Q: Could all of these cause security vulnerabilities?

      Yes...

*** vs. program testing

   Q: What are the differences from program testing?

      program analysis:
      - is static, does not run the program
      - doesn't need test data
      - can guarantee program always has some property
        (in all possible cases)
      - uses approximations, so may report errors that can't happen
        (unlike testing)
      - requires an expert to design the analysis (and tool)

*** vs. program verification

   Q: What's the difference from program verification?
      program analysis:
      - no explicit specification of behavior
      - little annotation
      - targeted to a *specific* property (usually just one)
      - may lose precision to gain speed
      - no human intervention once started

      program verification:
      - behavior usually specified in detail
      - lots of annotation
      - for any (safety) property
      - may take a day per function
      - often requires human intervention

   Q: What are the similarities with program verification?

      - both based on mathematical theories
      - both summarize executions into static approximations
      - both entirely static

      but (in both cases) there is a possibility for interactions
      or hybrid approaches (with dynamic analysis)

*** kinds of properties

   Q: What is the difference between safety and liveness properties?

      Def: a *safety* property says that nothing bad happens
           and characterizes a prefix of a program's execution history

           (Since it's about a prefix, once false it can never become true)

      Def: a *liveness* property says that
           something good eventually happens

   Q: What kind of property does static analysis work on?

      safety, typically,
      but could also work on liveness properties
      with appropriate assumptions about the system
      (such as that the scheduler is fair
       or doesn't take forever to do something.)

   Q: Are typical security policies safety or liveness properties?

      e.g., only authorized users can gain access to a system

      safety, so it is a good match for static analysis

   Q: Are any security properties liveness properties?

      yes, preventing denial of service
      (i.e., every customer request is eventually served)

*** Goals of Program Analysis

   Q: What are the goals of program analysis (static analysis)?
      - little or no input from programmers
        ==> so practical, usable

        Generally speaking, the bias is towards having no programmer input,
        in contrast to work in formal methods (verification)

      - correctness
        ==> so usable "under the covers"

      - efficient (at compile time): in time and space

** equivalent ideas

   Q: In what ways (mathematical formalisms) can an analysis be expressed?

      - Dataflow equations
      - Constraints
      - Abstract Interpretation
      - Type system
      - Control Flow Analysis

   These have differences in emphasis:
      - Dataflow equations and constraints - compilers
      - Abstract interpretation - correctness

   All can approximate dependence on dynamic values
   in imperative languages

   Type systems typically have:
      - no flow dependence
      - uniform treatment of each statement

   Constraint based analysis = Control Flow Analysis
      - is typically used for analysis of functional languages

** key design idea

   Q: What's the key design idea for a program analysis?

      Slogan: Err on the safe side!

   Key questions:
      - What's the bad outcome?
      - What would track the opposite of that?
      - What would be safe?

   A sound analysis tracks the negation of the bad outcome,
   so if the analysis is sound and it says the bad outcome doesn't happen,
   then the bad outcome is not possible

      All programs
      |--------------------------------|
      | Programs that are safe         |
      | |------------------------|     |
      | | Programs claimed safe  |     |
      | |                        |     |
      | |------------------------|     |
      |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
      |                                |
      | Programs that can be bad       |
      |--------------------------------|

   Because: what we know in a sound analysis is that
   the claimed safety property holds in all possible executions

*** Example: Dead code analysis (as in the goto fail bug)

      - What's the bad outcome?
        that we say code can be deleted (is dead) and it's not
      - What would track the opposite?
        tracking code that is needed
      - What would be safe?
        saying more code is needed than is true

   So the analysis may use a bigger set of code than is truly needed
   (overapproximate what is needed)
   that way if we are wrong, it avoids the bad outcome
   (we would say something is needed, when it isn't)

      All programs
      |--------------------------------|
      | Programs with no dead code     |
      | |------------------------|     |
      | | Programs claimed to    |     |
      | | have no dead code      |     |
      | |------------------------|     |
      |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
      |                                |
      | Programs that have dead code   |
      |--------------------------------|

   note that we won't be claiming that the bad outcome occurs,
   only that it doesn't occur

   Q: Assuming the analysis is sound,
      will we have some false positives (claiming bad programs are safe)?

      No, not if the analysis is sound

   Q: Will we have some false negatives (good programs not claimed safe)?

      Yes, because making it perfectly precise is impossible

------------------------------------------
            FOR YOU TO DO

Suppose we want to prevent using gets() (from libc).

What should we do in an analysis?

------------------------------------------
   ... track program points that use gets()

   Q: What is the bad outcome to prevent?

      saying the program is okay, if it actually might use gets()

   Q: What would track the opposite?

      lines that might use gets()

   Q: What would be safe?

      saying more lines might use gets()

** Terms in static analysis

*** intraprocedural vs. interprocedural analyses

   Q: What is a procedure in the context of program analysis?

      a subroutine: procedure, function, method, ...

   Q: What is the difference between an intraprocedural analysis
      and an interprocedural analysis?
      Def: an *intraprocedural* analysis analyzes information
           only within a procedure/function/method

           - it can be modular if there are summaries of
             the analysis information for each procedure
           - so it scales well

      Def: an *interprocedural* analysis analyzes
           the entire program, across procedure calls

           - much more complex
           - not modular (if no summaries are used)
           - doesn't scale well (often cubic in size of program)
           - but can be more accurate for statically composed programs

*** may vs. must analysis

   Q: What is the difference between a "may" and a "must" property/analysis?

      Def: A "may property" (or analysis) tracks
           what might happen in some execution

           A "must property" (or analysis) tracks
           what will always happen in every execution

*** sources of imprecision

**** conditional statements

------------------------------------------
  CONDITIONAL STATEMENTS: MAY ANALYSIS

At point END, might the following code have used gets?

   if (...) {   // 2
       gets(buffer);
   } else {     // 4
       fgets(buffer, sizeof(buffer), fn);
   }
   // point END

------------------------------------------
   ... yes, maybe, and we want to warn about it

------------------------------------------
  CONDITIONAL STATEMENTS: MUST ANALYSIS

At point END, will the following code always have used gets?

   if (...) {   // 2
       gets(buffer);
   } else {     // 4
       fgets(buffer, sizeof(buffer), fn);
   }
   // point END

------------------------------------------
   ... No, the condition might be false.

   Q: What does a "may" analysis do with information
      from the branches of a conditional?

      - it unions them

   Q: What does a "must" analysis do with information
      from the branches of a conditional?

      - it intersects them

**** loops

------------------------------------------
     LOOP STATEMENTS: MAY ANALYSIS

Might the following code use gets?

   while (...) {   // 2
       gets(buffer);
   }
   // point END

What about:

   while (i < BUFSIZ) {   // 5
       if (i >= 3)
           gets(buffer);
       i++;
   }
   // point END

------------------------------------------
   ...
       yes, at END both might have used gets,
       so we need to warn about them

------------------------------------------
     LOOP STATEMENTS: MUST ANALYSIS

Will the following code always use gets?

   while (...) {   // 2
       gets(buffer);
   }
   // point END

What about:

   while (i < BUFSIZ) {   // 5
       if (i >= 3)
           gets(buffer);
       i++;
   }
   // point END

------------------------------------------

   Q: What should be done with the information in a may analysis?

      union the information on all paths

   Q: How about for a must analysis?

      intersect the information on all paths

**** handling different kinds of control structures (CFGs)

   Q: How can one write an analysis that is generic with respect to
      different kinds of control structures in different languages
      (such as switch statements, different kinds of loops,
       exception handling, etc.)?

      - Write it in terms of a universal (lower-level) representation:
        sequencing and gotos

   Q: What is a control flow graph (CFG) in static analysis?

      - nodes are statements or tests
      - edges are control flow relation between labels
        (map from a node to the next node in the forward control flow)

   Example:

      z = 1;               // 1
      while (x > 0)        // 2
      { z = z*y;           // 3
        x = x-1; }         // 4
      print("done");       // 5

   Blocks are: statements 1, 3, 4, 5 and test 2
   Flows are: (1,2), (2,3), (2,5), (3,4), (4,2)

   Picture:

             [ 1 ]
               |
               v
         /-->[ 2 ]--\
         |     |    |
         |     v    |
         |   [ 3 ]  |
         |     |    |
         |     v    |
         \---[ 4 ]  |
                    |
               /----/
               v
             [ 5 ]

**** property space

   this is the technical term for the information used in the analysis
   (and the way that information is combined, the join operator
    and an approximation ordering on the information,
    making it a lattice)

   We divide analyses into: may/must

   When the property space is a set, then the join operator is:
      - union if doing a may analysis
      - intersection if doing a must analysis

   the initial value of the information is (representing unknown):
      - empty set if a may analysis
      - the total set if a must analysis

   Q: How does a property space reflect the imprecision of an analysis?
      - unions (and intersections) may lose information
        when paths are combined

***** example

------------------------------------------
      EXAMPLE: UNREACHABLE CODE

Consider C with statements:
   assignments, empty statements (skip),
   if-then-else, while, sequencing, labels

   Stmt ::= L: x = E;
          | L : ;
          | if (L: E) { Stmt1 } else { Stmt2 }
          | while (L: E) { Stmt }
          | Stmt1 Stmt2

Atomic blocks are:

   S ::= L : x = E; | L : ; | L : E

So CFG of

   l1: x = 0;
   while (l2: x < 4) {
       l3: y = y+1;
       l4: x = x-1;
   }
   l5: y = x*x;

is
         [ l1: x = 0;]
               |
               v
      /->[ l2: x < 4; ]--\
      |        |         |
      |        v         |
      |  [ l3: y = y+1;] |
      |        |         |
      |        v         |
      \--[ l4: x = x-1;] |
                         |
               /---------/
               |
               v
         [ l5: y = x*x;]

Set of flows is:

We want a forward, may analysis with:

Property space:
   Reaches(l) = Set of labels that may be reached from start
   initial value: {}

Combination: join is \union
   EntryTo(l) = union {ExitOf(l') | (l',l) flows}

Transfer functions:
   f(l: x = E;, EntryTo(l)) = {l} union EntryTo(l)
   f(l: ;, EntryTo(l))      = EntryTo(l)
   f(l:E, EntryTo(l))       = {l} union EntryTo(l)

------------------------------------------
   ... {(l1,l2), (l2,l3), (l3,l4), (l4,l2), (l2,l5)}

------------------------------------------
   EQUATIONS AND SOLUTION FOR EXAMPLE

EntryTo(l1) = {}
ExitOf(l1)  = {l1} union EntryTo(l1)
EntryTo(l2) = union{ExitOf(l1),ExitOf(l4)}
ExitOf(l2)  = {l2} union EntryTo(l2)
EntryTo(l3) = ExitOf(l2)
ExitOf(l3)  = {l3} union EntryTo(l3)
EntryTo(l4) = ExitOf(l3)
ExitOf(l4)  = {l4} union EntryTo(l4)
EntryTo(l5) = ExitOf(l2)
ExitOf(l5)  = {l5} union EntryTo(l5)

Trace of iterations:

   Steps:         0      1
   EntryTo(l1)    {}
   ExitOf(l1)     {}
   EntryTo(l2)    {}
   ExitOf(l2)     {}
   EntryTo(l3)    {}
   ExitOf(l3)     {}
   EntryTo(l4)    {}
   ExitOf(l4)     {}
   EntryTo(l5)    {}
   ExitOf(l5)     {}

------------------------------------------
   trace this

   Q: Is this result what happens dynamically?

      No, x only decreases from 0, so x < 4 stays true
      (ignoring overflow): the loop never exits and l5 is never reached

   Q: Is it safe?

      no, claims all code reachable sometimes, but that isn't true!

   Q: How could we make this safe?
      Need to take values into account, so need to approximate those,
      or partially evaluate constant expressions

*** building tools with dataflow analysis

   Q: What are some existing tools for building an analysis?

      Parser generator and:
      - LLVM
      - JastAdd Java compiler
      - XText
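
   For a very small example, the equations can also be solved by hand-written
   iteration instead of a tool. The sketch below solves the EntryTo/ExitOf
   equations of the UNREACHABLE CODE example above by repeated passes from
   the initial value {} until nothing changes. It is only an illustration
   under stated assumptions: the names LabelSet, BIT, flows, solve_reaches
   and print_set are made up here, label sets are encoded as bits 1..5 of an
   unsigned, and the transfer function is {l} union EntryTo(l) because the
   example has no empty statements.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: labels l1..l5 are encoded as bits 1..5 of an unsigned. */
enum { NLABELS = 5 };
typedef unsigned LabelSet;
#define BIT(l) (1u << (l))

/* Flows of the example: (l1,l2), (l2,l3), (l3,l4), (l4,l2), (l2,l5). */
static const int flows[][2] = { {1, 2}, {2, 3}, {3, 4}, {4, 2}, {2, 5} };
enum { NFLOWS = sizeof flows / sizeof flows[0] };

/* Solve the equations by repeated passes over all labels, starting from {}
   (the initial value for a may analysis), until a pass changes nothing.
   On return, entry_to[l] and exit_of[l] hold EntryTo(l) and ExitOf(l).
   Returns the number of passes, counting the final pass that changes nothing. */
int solve_reaches(LabelSet entry_to[NLABELS + 1], LabelSet exit_of[NLABELS + 1])
{
    for (int l = 0; l <= NLABELS; l++)
        entry_to[l] = exit_of[l] = 0;

    int passes = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        passes++;
        for (int l = 1; l <= NLABELS; l++) {
            /* Join (union) of ExitOf(l') over all flows (l', l). */
            LabelSet in = 0;
            for (int i = 0; i < NFLOWS; i++) {
                assert(flows[i][0] >= 1 && flows[i][1] <= NLABELS);
                if (flows[i][1] == l)
                    in |= exit_of[flows[i][0]];
            }
            /* Transfer function for assignments and tests: {l} union EntryTo(l). */
            LabelSet out = BIT(l) | in;
            if (in != entry_to[l] || out != exit_of[l]) {
                entry_to[l] = in;
                exit_of[l] = out;
                changed = 1;
            }
        }
    }
    return passes;
}

/* Print a label set in the notation used in the notes, e.g. {l1,l2}. */
void print_set(LabelSet s)
{
    const char *sep = "";
    printf("{");
    for (int l = 1; l <= NLABELS; l++) {
        if (s & BIT(l)) {
            printf("%sl%d", sep, l);
            sep = ",";
        }
    }
    printf("}");
}
```

   A caller would declare two arrays of NLABELS + 1 elements, call
   solve_reaches, and print each entry with print_set. Under this encoding
   the iteration stabilizes with ExitOf(l5) containing all five labels,
   which matches the point made above: without approximating values, the
   analysis claims l5 is reachable even though it never is dynamically.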