CIS 6614 meeting -*- Outline -*-

* Symbolic Execution

** What is symbolic execution?

   See paper by James C. King,
   "Symbolic execution and program testing"
   CACM, Vol. 19, num. 7, pp. 385-394, July, 1976.
   http://doi.acm.org/10.1145/360248.360252

     Q: What does symbolic execution do?

        finds path conditions in a program,
        forming a (potentially infinite) execution tree

     Q: How does symbolic execution differ from static analysis?

        - it treats paths separately instead of combining them
        - it does not try to summarize information about each static
          point in a program, combining all possible executions into
          one approximate value
          (instead it forms a potentially infinite execution tree)
        - it does not try to analyze arbitrary properties,
          but only tries to find path conditions
          (what inputs lead to that point in the program's
           dynamic execution)

     Q: How can symbolic execution help with testing?

        - To execute (cover) more paths in a program's execution:
          find path conditions that tell what inputs are needed to
          reach that point in the program's (dynamic) execution
          (then use an SMT solver to obtain concrete input data)
        - To avoid redundant tests, make sure each test input
          satisfies a distinct path condition
          (or use path conditions to eliminate equivalent inputs)

     Q: How does knowing the path condition help with testing?

        it can be used (with an SMT solver) to find concrete data
        to execute a given path

*** path conditions

------------------------------------------
            PATH CONDITIONS

What is the path condition that would reach
the first execution of the assignment?

   int abs(int x) {
       if (x < 0) {
           x = - x;
       }
       return x;
   }

What's the symbolic state after the
assignment on that branch?

------------------------------------------
        ... PC: xsym < 0 reaches the first execution of the assignment
        ... state(xsym) = - xsym

     Q: What is the CFG for this example?

                  |
                  v
             [ x < 0 ]
              /     \
             /       \
            v         |
         x = -x;      |
             \       /
              v     v
             return x

*** execution trees

     Q: What would be the execution tree for this example?

                       PC: true
                       state(xsym) = xsym
                           |
                           v
                         x < 0
        PC: xsym<0      /     \     PC: !(xsym < 0)
        state(xsym)    /       \    state(xsym)
          = xsym      v         |     = xsym
                   x = -x;      |
                       \       /
        PC: xsym<0      \     /
        state(xsym)      |   |
          = -xsym        v   v
                       return x

     Q: Does execution fork? Why?

        yes, because true does not imply x < 0
        and true does not imply !(x < 0)

     Q: Would this execution tree be infinite?

        No, there is no loop or recursion...

*** Another example

------------------------------------------
            ANOTHER EXAMPLE

   int power(int x, int y) {
       int z = 1;
       while (y >= 1) {
           z = z * x;
           y = y - 1;
       }
       return z;
   }

What is the CFG for this?

------------------------------------------

     Q: Would the execution tree for this example be infinite?

        Yes

     Q: What would the execution tree look like?

     Q: What inputs would reach the second execution of y=y-1?

        y >= 2
        (the loop test must succeed twice: y >= 1 and then y-1 >= 1)

        How does that come out of the execution tree?

*** Uses in security

     Q: Remember the Apple Goto Fail bug?
        What would be the path condition for reaching the first
        execution of the code after the second goto fail?

        false!

     Q: Would symbolic execution have helped detect the Heartbleed bug?

        (discuss)

     Q: How can symbolic execution be used in software security?

        - generate test cases
        - increase code coverage in testing (search for bugs)
        - look for inputs that take different (e.g., longer) paths
          in crypto algorithms

*** Advantages and disadvantages (vs. static analysis)

     Q: When is symbolic execution very similar to static analysis?

        When the program has no loops or recursion.

     Q: What are the advantages and disadvantages of symbolic execution
        for program analysis (compared to static analysis)?

        advantages -- symbolic execution:
          - can generate test data
          - can help testing cover more paths in code
          - can help testing avoid redundant tests
          - does not need tailoring to individual properties by experts
          - can generate inputs that reveal bugs

        disadvantages -- symbolic execution:
          - can be slow, due to the exponential number of paths
            (in the number of tests in a program), resulting in an
            exponential increase in the size of PC and state info.
          - will miss some paths, since it can't represent an
            infinite tree

*** tools/frameworks for symbolic execution

------------------------------------------
        SYMBOLIC EXECUTION TOOLS

(Quoted from p. 1066 of:
 Cristian Cadar, et al. 2011.
 "Symbolic execution for software testing
  in practice: preliminary assessment".
 In Proceedings of ICSE '11, pp. 1066-1071
 https://doi.org/10.1145/1985793.1985995)

Open Source Tools:
  - NASA's Symbolic (Java) PathFinder,
  - Stanford's KLEE,
  - UIUC's jCUTE,
  - UC Berkeley's CREST and BitBlaze
  - FuzzBall

Commercial use at:
  - Microsoft (Pex, SAGE, YOGI and PREfix),
  - Parasoft (several tools)
------------------------------------------
        symbolic execution is used in several tools from Parasoft
        that are marketed commercially.

        Note that several tools (esp. Java PathFinder and KLEE)
        are open source frameworks that can be customized,
        and so might be good for projects

** approach

*** algorithm

**** original algorithm (King 1976)

------------------------------------------
       SYMBOLIC EXECUTION ALGORITHM

King 1976:
  - Assume program works on integers
  - Assume program uses simple tests
    for control flow (<, >, <=, etc.)
  - Assume the program is a CFG
  - Assume only data is integers

1. Allow symbolic inputs;
   e.g., isym, jsym for reading i, j
   - Replace integer ops with symbolic ops
     that make

     e.g., i - j > 0 produces expr:

2. State of program represented by:
   - path condition (PC):

   - variable values:

3. Tests (e.g. in if-statements) produce:

------------------------------------------
        ... symbolic expressions (polynomials)
        ... symbolic values, expressions, and constants
        ... isym - jsym - 1 >= 0
        ... a Boolean expression over symbolic inputs
            (of form R >= 0 or not(R >= 0),
             where R is a symbolic expression)
            that tells what inputs reach this point in the program

            Note that the PC never contains program variables
            (only symbolic variables)
        ... also represented by expressions over symbolic inputs

     Q: What should be the initial value for PC?

        true (i.e., no constraints yet)

        ...
        two path constraints on the test:
          - PC ==> test
          - PC ==> (not test)
        and then
          a. if exactly one holds (can be proven),
             execute the corresponding path
          b. if neither holds, then
             -- some inputs will take the true branch,
             -- and some the false one!
             need to execute with the PCs:
               - PC && test, and also
               - PC && (not test)

     Q: Does the PC need to be updated for the first case?

        no!

     Q: What's the problem with the second case?

        can lead to exponentially-sized expressions!

------------------------------------------
       EXAMPLE OF KING'S ALGORITHM

   int sum(int a, int b, int c) {
   l0:   int x, y, z;
   l1:   x = a + b;
   l2:   y = b + c;
   l3:   z = x + y - b;
   l4:   return z;
   }

Trace of symbolic execution:

PC = true
state(asym) = asym, state(bsym) = bsym
state(csym) = csym,
count(l0) = 0, count(l1) = 0, ...

l0: int x, y, z;
PC = true
state(asym) = asym, state(bsym) = bsym
state(csym) = csym,
count(l0) = 1, count(l1) = 0, ...

------------------------------------------
        Using asym as the symbolic value of a

        l1: x = a + b;
        PC = true
        state(asym) = asym, state(bsym) = bsym
        state(csym) = csym,
        state(xsym) = plussym(asym, bsym)
        count(l0) = 1, count(l1) = 1, count(l2) = 0,
        count(l3) = 0, count(l4) = 0

        l2: y = b + c;
        PC = true
        state(asym) = asym, state(bsym) = bsym
        state(csym) = csym,
        state(xsym) = plussym(asym, bsym)
        state(ysym) = plussym(bsym, csym)
        count(l0) = 1, count(l1) = 1, count(l2) = 1,
        count(l3) = 0, count(l4) = 0

        l3: z = x + y - b;
        PC = true
        state(asym) = asym, state(bsym) = bsym
        state(csym) = csym,
        state(xsym) = plussym(asym, bsym)
        state(ysym) = plussym(bsym, csym)
        state(zsym) = subsym(plussym(plussym(asym, bsym),
                                     plussym(bsym, csym)),
                             bsym)
        count(l0) = 1, count(l1) = 1, count(l2) = 1,
        count(l3) = 1, count(l4) = 0

        l4: return z;

------------------------------------------
          EXAMPLE WITH A LOOP

   int power(int x, int y) {
   l2:   int z = 1;
   l3:   int j = 1;
   l4:   if (y < j) {
   l45:      goto l8;
         } else {
   l5:       z = z * x;
   l6:       j = j + 1;
   l7:       goto l4;
         }
   l8:   return z;
   }

trace of symbolic execution:

------------------------------------------

     Q: What is the PC at the beginning?

        PC = true

     Q: What is the initial state?

        state(xsym) = xsym, state(ysym) = ysym
        count(l2) = 0, count(l3) = 0, ...

        l2: int z = 1;
        PC = true
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1
        count(l2) = 1, count(l3) = 0, ...

        l3: int j = 1;
        PC = true
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1, state(jsym) = 1
        count(l2) = 1, count(l3) = 1, count(l4) = 0, ...

     Q: What happens on line 4 the first time?

        neither true ==> not (ysym < 1)
        nor     true ==> (ysym < 1)
        are provable, so execution forks:

        case a: (y < 1)
        PC = (ysym < 1)   // i.e., (ysym-1 < 0)
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1, state(jsym) = 1
        count(l2) = 1, count(l3) = 1, count(l4) = 1,
        count(l45) = 0, ...

        case b: (!(y < 1))
        PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1, state(jsym) = 1
        count(l2) = 1, count(l3) = 1, count(l4) = 1,
        count(l45) = 0, ...

     Q: What happens next in case a?

        l45: goto l8
        PC = (ysym < 1)   // i.e., (ysym-1 < 0)
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1, state(jsym) = 1
        count(l2) = 1, count(l3) = 1, count(l4) = 1,
        count(l45) = 1, ...

        l8: return z;

     Q: What about case b?

        executes
        l5: z = z * x;
        PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1 * xsym, state(jsym) = 1
        count(l2) = 1, count(l3) = 1, count(l4) = 1,
        count(l45) = 0, count(l5) = 1, count(l6) = 0, ...

        l6: j = j + 1;
        PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
        state(xsym) = xsym, state(ysym) = ysym
        state(zsym) = 1 * xsym, state(jsym) = 1 + 1
        count(l2) = 1, count(l3) = 1, count(l4) = 1,
        count(l45) = 0, count(l5) = 1, count(l6) = 1,
        count(l7) = 0, ...

        now test the condition at l4 again...
        does the execution fork again?

        Again it will fork, as the PC, notsym(ysym < 1), doesn't tell
        us anything about the condition (y < j),
        where state(jsym) = 2

        So the symbolic execution continues like this forever...
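
        The trace above can be mimicked by a small sketch that stops
        after a fixed number of forks. This is only an illustration,
        not King's system: the names explore_power, show, and
        max_forks are hypothetical, and the path condition is kept as
        an interval lo <= ysym <= hi, which suffices here only because
        every test compares ysym with the concrete value of j.

```python
import math


def show(lo, hi):
    """Render the interval lo <= ysym <= hi in the notes' PC notation."""
    parts = []
    if lo != -math.inf:
        parts.append(f"ysym >= {lo}")
    if hi != math.inf:
        parts.append(f"ysym < {hi + 1}")
    return " && ".join(parts) or "true"


def explore_power(max_forks):
    """Bounded symbolic execution of the goto version of power(x, y).

    Returns (finished, cut_off): finished lists (PC, returned z) for the
    paths that reached l8; cut_off is the PC of the path abandoned when
    the fork bound was hit (None if every path finished).
    """
    lo, hi = -math.inf, math.inf   # PC = true: no constraint on ysym
    z = "1"                        # state(zsym) after l2
    j = 1                          # state(jsym) after l3; always concrete
    finished = []
    forks = 0
    while True:
        # l4: the test is (y < j), i.e. ysym <= j - 1
        if hi <= j - 1:            # PC ==> test: only the true branch
            finished.append((show(lo, hi), z))
            return finished, None
        if not lo >= j:            # PC ==> (not test) fails too: fork
            if forks == max_forks:
                return finished, show(lo, hi)
            forks += 1
            # case a: PC && (ysym < j), runs l45 and l8
            finished.append((show(lo, j - 1), z))
            # case b: PC && not (ysym < j), continues below
            lo = max(lo, j)
        # l5, l6, l7 on the false branch
        z = f"{z} * xsym"
        j += 1


finished, cut_off = explore_power(max_forks=3)
for pc, z in finished:
    print(f"PC: {pc:24} returns {z}")
print("cut off with PC:", cut_off)
```

        Each finished path has a distinct PC, so a solver could pick
        one concrete ysym per path as a test input; the cut-off PC
        describes the inputs that the bounded search did not cover.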

     Q: What should be done to prevent infinite loops in the tool?

        Need to cut off the search...
        (What King's system did, p. 390, is ask the user what to do...)

     Q: How would traditional static analysis handle such a loop?

        It would stop if the analysis value reached a fixedpoint,
        otherwise continue for some small number of iterations
        to find a fixedpoint,
        then approximate using a widening operator to find something
        that is both safe and guaranteed to be a fixedpoint

***** Limits of King's algorithm

     Q: What are the limits of King's original algorithm?

        - only integer data
          but now can use an SMT solver to deal with other kinds of data

          for user-defined types in OOP see the paper:
            S. Khurshid, C.S. Pasareanu, and W. Visser. 2003.
            "Generalized Symbolic Execution for Model Checking
             and Testing."
            In TACAS 2003. LNCS vol. 2619. Springer.
            https://doi.org/10.1007/3-540-36577-X_40
            (or https://rdcu.be/cU1iz)

** background on automatic theorem proving

*** SAT Solvers

------------------------------------------
        BACKGROUND: SAT SOLVERS

goal: decide if first-order predicate
      is always true (or not)

SAT Solvers:
   input: propositional formula in CNF
   output: satisfying assignment to vars
           (or failure indication)

   example: w1 && w2 && w3
      where:
         w1 == b || c
         w2 == !a || !d
         w3 == !b || d

   satisfying assignment is
       {a =  , b =  , c =  , d =  }
------------------------------------------

     Q: What's the time complexity of this problem?

        exponential in the worst case (SAT is NP-complete)!

        However, there has been a lot of work making this run very
        fast most of the time, for small enough formulas

*** SMT Solvers

------------------------------------------
             SMT SOLVERS

SMT = Satisfiability Modulo Theories

Combines SAT + decidable theories

Example theories:
  - Uninterpreted functions and equality
  - Presburger arithmetic (arith.
    without *, /)
  - Arrays, Strings
  - Bitvectors
  - Algebraic Datatypes (e.g., enums)

Uninterpreted functions and equality:
     x == x
     x == y <==> y == x
     (x == y) && (y == z) ==> (x == z)
     x == y ==> f(x) == f(y)
------------------------------------------

**** proof process

------------------------------------------
     PROVING A FORMULA WITH SMT SOLVER

To prove P

   see if not P is satisfiable

------------------------------------------

     Q: What does it mean if not P is satisfiable?

        that P is not always true
        (there are cases where its negation holds)

     Q: What does it mean if not P is unsatisfiable?

        that P is always true (there are no cases where it is false)

        Bonus: if not P is satisfiable, then the satisfying assignment
        gives a counterexample,
        which can be used to show how the formula fails

**** example

------------------------------------------
              EXAMPLE

b+2 == c
&& f(read(write(a,b,3),c-2)) != f(c-b+1)

Which parts are
   arithmetic?

   arrays?

   uninterpreted functions?

------------------------------------------
        ... b+2 == c, c-2, c-b+1
        ... write(a,b,3), read(write(a,b,3),c-2)
        ... f(read(write(a,b,3),c-2)) != f(c-b+1)

------------------------------------------
            SMT PROCESS

b+2 == c
&& f(read(write(a,b,3),c-2)) != f(c-b+1)

------------------------------------------
        ... use equality to substitute b+2 for c

            b+2 == b+2
            && f(read(write(a,b,3),(b+2)-2)) != f((b+2)-b+1)

            simplify by theory of arithmetic,
               b+2-2 == b and b+2-b+1 == 3
            and by Boolean: (true && A == A)

            f(read(write(a,b,3),b)) != f(3)

            simplify by theory of arrays:

            f(3) != f(3)

            so by theory of uninterpreted functions,
            this is unsatisfiable!
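
        The conclusion above can be sanity-checked by brute force
        over a small finite domain. This sketch is not how an SMT
        solver works, and finding no model in a finite domain does
        not prove unsatisfiability; it only agrees with the derivation.
        The names find_model, INDEX, and F_ARGS, the default value 0
        for unwritten array cells, and the restriction of f to 0/1
        results are all assumptions made for the illustration.

```python
from itertools import product

INDEX = range(0, 4)      # assumed small domain for b, c, and array cells
F_ARGS = range(-2, 6)    # arguments on which f is tabulated


def write(arr, i, v):
    """Array write: a copy of arr with cell i set to v."""
    new = dict(arr)
    new[i] = v
    return new


def read(arr, i):
    """Array read; cells never written default to 0 (an assumption)."""
    return arr.get(i, 0)


def find_model(v):
    """Search for b, c, a, f satisfying
         b+2 == c && f(read(write(a,b,v),c-2)) != f(c-b+1)
    and return the first model found, or None if there is none
    in the finite domain.
    """
    for b, c in product(INDEX, repeat=2):
        if b + 2 != c:
            continue
        for cells in product(INDEX, repeat=len(INDEX)):
            a = dict(zip(INDEX, cells))
            left = read(write(a, b, v), c - 2)
            right = c - b + 1
            # enumerate every 0/1-valued table for f
            for table in product((0, 1), repeat=len(F_ARGS)):
                f = dict(zip(F_ARGS, table))
                if f[left] != f[right]:
                    return {"b": b, "c": c, "a": a, "f": f}
    return None


print(find_model(3))           # the formula from the notes: no model found
print(find_model(4) is None)   # writing 4 instead of 3 admits a model
```

        With the written value changed to 4 the two arguments of f
        differ, so the search returns a model; that model plays the
        role of the counterexample mentioned in the proof process.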