CIS 6614 meeting -*- Outline -*-

* Symbolic Execution

** What is symbolic execution?
   See paper by James C. King, "Symbolic execution and program testing"
                CACM, Vol. 19, num. 7, pp. 385-394, July, 1976.
                http://doi.acm.org/10.1145/360248.360252

   Q: What does symbolic execution do?
      finds path conditions in a program,
        forming a (potentially infinite) execution tree

   Q: How does symbolic execution differ from static analysis?
      - it treats paths separately instead of combining them
      - it does not try to summarize information about each static
        point in a program,
        combining all possible executions into one approximate value
        (instead it forms a potentially infinite execution tree)
      - It does not try to analyze arbitrary properties,
         but only tries to find path conditions (what inputs lead to
           that point in the program's dynamic execution)

   Q: How can symbolic execution help with testing?
      - To execute (cover) more paths in a program's execution:
            find path conditions that tell what inputs are needed to
            reach that point in the program's (dynamic) execution
            (then use an SMT solver to obtain concrete input data)
      - To avoid redundant tests,
            make sure each test input satisfies a distinct path condition
            (or use path conditions to eliminate equivalent inputs)

      How does knowing the path condition help with testing?
         it can be used (with an SMT solver) to find concrete data
          to execute a given path

*** path conditions
------------------------------------------
        PATH CONDITIONS

What is the path that would reach
  the first execution of the assignment?

  int abs(int x) {
     if (x < 0) {
        x = - x;
     }
     return x;
  }


What's the symbolic state
   after the assignment on that branch?
------------------------------------------
      ... PC: xsym < 0 reaches the first execution of the assignment
      ... state(xsym) = - xsym

      Q: What is the CFG for this example?

                  |
                  v
               [ x < 0 ]
                 / \               
                /   \
               v     |
            x = -x;  |
                \   /                    
                 v v
              return x

*** execution trees

      Q: What would be the execution tree for this example?

                 PC: true
                 state(xsym) = xsym
                   |
                   v
                  x < 0
   PC: xsym<0    /   \  PC: !(xsym < 0)
 state(xsym)    /     \  state(xsym) = xsym
   = xsym      v       |
            x = -x;    |
                \     /
   PC: xsym<0    \   /
 state(xsym)      | | 
   = -xsym        v v
               return x

      Q: Does execution fork?  Why?
          yes, because true does not imply   xsym < 0
                   and true does not imply !(xsym < 0)

      Q: Would this execution tree be infinite?
         No, there is no loop or recursion...
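
     The two leaves of this execution tree can be written down and checked
     against a concrete run. The following is a minimal sketch, not a
     symbolic executor: the Leaf struct, the function names, and the witness
     inputs (-1 and 0) are assumptions made for illustration, and the leaves
     are entered by hand rather than computed.

```c
#include <stdbool.h>

/* One leaf of the execution tree for abs (struct and names are illustrative). */
typedef struct {
    bool pc_x_negative; /* PC: true means xsym < 0, false means !(xsym < 0) */
    int  coeff;         /* final state(xsym) = coeff * xsym */
    int  witness;       /* hand-picked concrete input satisfying the PC */
} Leaf;

/* PC = true decides neither x < 0 nor !(x < 0), so execution forks in two. */
int abs_leaves(Leaf out[2])
{
    out[0] = (Leaf){ true,  -1, -1 }; /* PC: xsym < 0,    state(xsym) = -xsym */
    out[1] = (Leaf){ false,  1,  0 }; /* PC: !(xsym < 0), state(xsym) =  xsym */
    return 2;
}

/* The function from the notes, renamed to avoid clashing with stdlib abs. */
int abs_concrete(int x)
{
    if (x < 0) {
        x = -x;
    }
    return x;
}

/* True if the witness satisfies the leaf's PC and the concrete result
   equals the leaf's symbolic state evaluated at the witness. */
bool leaf_agrees(Leaf leaf)
{
    bool pc_holds = (leaf.witness < 0) == leaf.pc_x_negative;
    return pc_holds && abs_concrete(leaf.witness) == leaf.coeff * leaf.witness;
}
```

     Calling leaf_agrees on each leaf returned by abs_leaves is one way to
     turn the two path conditions into two non-redundant test cases.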

*** Another example

------------------------------------------
            ANOTHER EXAMPLE

int power(int x, int y) { 
   int z = 1; 
   while (y >= 1) {
      z = z * x; 
      y = y - 1; 
   } 
   return z;
}

What is the CFG for this?


------------------------------------------
   Q: Would the execution tree for this example be infinite?
      Yes
      
   Q: What would the execution tree look like?


   Q: What inputs would reach the second execution of y=y-1?
        y >= 2
        How does that come out of the execution tree?
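
     One way to see how path conditions come out of the execution tree is to
     explore the loop up to a fixed depth. The sketch below is specialized to
     this one loop and keeps the PC as an interval over ysym, which is an
     assumption that only works because every test here compares ysym with a
     constant. The names PathCond, Path, and explore_power are hypothetical.

```c
#include <limits.h>

/* PC over the single symbolic input ysym, kept as lo <= ysym <= hi. */
typedef struct { long lo, hi; } PathCond;

/* A finished path: after iters loop bodies, y = ysym - iters, z = xsym^iters. */
typedef struct { PathCond pc; int iters; } Path;

/* Explore `while (y >= 1) { z = z * x; y = y - 1; }` for up to max_iters
   iterations. Each finished path (loop exit) is stored in out[]; returns the
   number of paths stored. Deeper paths are cut off, so the result is partial. */
int explore_power(int max_iters, Path *out, int cap)
{
    PathCond pc = { INT_MIN, INT_MAX };   /* PC = true for an int input */
    int n = 0;
    for (int iters = 0; iters <= max_iters; iters++) {
        /* The test y >= 1 is (ysym - iters >= 1), i.e. ysym >= iters + 1. */
        long bound = (long)iters + 1;

        /* False branch: PC && !(ysym >= bound); record it if satisfiable. */
        PathCond exit_pc = pc;
        if (exit_pc.hi > bound - 1) exit_pc.hi = bound - 1;
        if (exit_pc.lo <= exit_pc.hi && n < cap) {
            out[n].pc = exit_pc;
            out[n].iters = iters;
            n++;
        }

        /* True branch: PC && (ysym >= bound); continue along it. */
        if (pc.lo < bound) pc.lo = bound;
        if (pc.lo > pc.hi) break;         /* unsatisfiable: no more paths */
    }
    return n;
}
```

     With max_iters = 3 this records four paths: ysym <= 0 (no iterations),
     then ysym == 1, ysym == 2, and ysym == 3. Every path that runs the body
     at least twice has ysym >= 2 in its PC.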

*** Uses in security

   Q: Remember the Apple Goto Fail bug?  What would be the path
      condition for reaching the first execution of the code
      after the second goto fail?
       false!
       
   Q: Would symbolic execution have helped detect the heartbleed bug?
      (discuss)

   Q: How can symbolic execution be used in software security?
      - generate test cases
      - increase code coverage in testing (search for bugs)
      - look for inputs that take different (e.g., longer) paths
             in crypto algorithms


*** Advantages and disadvantages (vs. static analysis)
   Q: When is symbolic execution very similar to static analysis?
      When the program has no loops or recursion.
   Q: What are the advantages and disadvantages of symbolic execution
    for program analysis (compared to static analysis)?

    advantages -- symbolic execution:
       - can generate test data
       - can help testing cover more paths in code
       - can help testing avoid redundant tests
       - does not need tailoring to individual properties by experts
       - can generate inputs that reveal bugs

    disadvantages -- symbolic execution:
       - can be slow: the number of paths can grow exponentially with the
           number of tests (branches) executed, which also makes the total
           size of the PC and state info grow exponentially
       - will miss some paths, since it can't explore an infinite tree

*** tools/frameworks for symbolic execution
------------------------------------------
       SYMBOLIC EXECUTION TOOLS

(Quoted from p. 1066 of:
Cristian Cadar, et al. 2011.
"Symbolic execution for software testing
in practice: preliminary assessment".
In Proceedings of ICSE '11, pp. 1066-1071
https://doi.org/10.1145/1985793.1985995)

Open Source Tools: 
- NASA's Symbolic (Java) PathFinder,
- Stanford's KLEE,
- UIUC's jCUTE,
- UC Berkeley's CREST and BitBlaze
- FuzzBall

Commercial use at:
- Microsoft (Pex, SAGE, YOGI and PREfix),
- Parasoft (several tools)

------------------------------------------
    symbolic execution is used in several tools from Parasoft that are
    marketed commercially.

    Note that several tools (esp. Java PathFinder and KLEE) are open
    source frameworks that can be customized; they might be good for projects.

** approach
*** algorithm
**** original algorithm (King 1976)
------------------------------------------
       SYMBOLIC EXECUTION ALGORITHM

King 1976:
    - Assume the only data is integers
    - Assume program uses simple tests
       for control flow (<, >, <=, etc.)
    - Assume the program is a CFG

 1. Allow symbolic inputs;
    e.g., isym, jsym for reading i, j
    
    - Replace integer ops with
       symbolic ops that make 


    e.g., i - j > 0
        produces expr:


 2. State of program represented by:
    - path condition (PC):


    - variable values:


 3. Tests (e.g. in if-statements) produce:


------------------------------------------
     ... symbolic expressions (polynomials)
     ... symbolic values, expressions,
         and constants
     ... isym - jsym -1 >= 0
     
     ... a Boolean expression
         over symbolic inputs
         (of the form R >= 0 or not(R >= 0),
          where R is a symbolic expression)
         that tells what inputs reach
         this point in the program

         Note that the PC never contains program variables
         (only symbolic variables)

     ... also represented by
         expressions over symbolic inputs

     Q: What should be the initial value for PC?
        true (i.e., no constraints yet)

     ... two path constraints on the test:
          - PC ==> test
          - PC ==> (not test)
         and then
          a. if exactly one holds (can be proven),
             execute the corresponding path
          b. if neither holds, then
             -- some inputs will take the true branch,
             --             and some the false one!
             need to execute with the PCs:
               - PC && test,      and also
               - PC && (not test)

     Q: Does the PC need to be updated for the first case?
         no!
     Q: What's the problem with the second case?
         can lead to exponentially-sized expressions!
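
     The rule for tests can be sketched in C under a strong simplification:
     here the PC constrains a single symbolic integer and is kept as an
     interval, so both implications can be decided by comparing bounds.
     King's PCs are conjunctions over several inputs and need a theorem
     prover; the type and function names below are hypothetical.

```c
/* PC over one symbolic integer, kept as the interval lo <= sym <= hi
   (a simplification; real PCs range over several symbolic inputs). */
typedef struct { long lo, hi; } PathCond;

typedef enum { TAKE_TRUE, TAKE_FALSE, FORK } Decision;

/* Decide the test (sym >= c) under pc, assuming pc is satisfiable (lo <= hi). */
Decision decide_ge(PathCond pc, long c)
{
    if (pc.lo >= c) return TAKE_TRUE;   /* PC ==> test */
    if (pc.hi <  c) return TAKE_FALSE;  /* PC ==> (not test) */
    return FORK;                        /* neither implication holds */
}

/* PC && test: only needed on a fork; otherwise the PC stays unchanged. */
PathCond and_ge(PathCond pc, long c)
{
    if (pc.lo < c) pc.lo = c;
    return pc;
}

/* PC && (not test), i.e. sym <= c - 1. */
PathCond and_lt(PathCond pc, long c)
{
    if (pc.hi > c - 1) pc.hi = c - 1;
    return pc;
}
```

     Starting from a PC that allows both outcomes, decide_ge reports FORK,
     and the two successor PCs come from and_ge and and_lt. Under either
     successor the same test is then decided without forking.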

------------------------------------------
       EXAMPLE OF KING'S ALGORITHM

int sum(int a, int b, int c) {
    l0: int x, y, z;
    l1: x = a + b;
    l2: y = b + c;
    l3: z = x + y - b;
    l4: return z;
}


Trace of symbolic execution:

    PC = true
    state(asym) = asym, state(bsym) = bsym
    state(csym) = csym,
    count(l0) = 0, count(l1) = 0, ...

      l0: int x, y, z;

    PC = true
    state(asym) = asym, state(bsym) = bsym
    state(csym) = csym,
    count(l0) = 1, count(l1) = 0, ...


------------------------------------------
     Using asym as the symbolic value of a
        

      l1: x = a + b;

    PC = true
    state(asym) = asym, state(bsym) = bsym
    state(csym) = csym,
    state(xsym) = plussym(asym, bsym)
    count(l0) = 1, count(l1) = 1,
    count(l2) = 0, count(l3) = 0,
    count(l4) = 0

      l2: y = b + c;

    PC = true
    state(asym) = asym, state(bsym) = bsym
    state(csym) = csym,
    state(xsym) = plussym(asym, bsym)
    state(ysym) = plussym(bsym, csym)
    count(l0) = 1, count(l1) = 1,
    count(l2) = 1, count(l3) = 0,
    count(l4) = 0

      l3: z = x + y - b;

    PC = true
    state(asym) = asym, state(bsym) = bsym
    state(csym) = csym,
    state(xsym) = plussym(asym, bsym)
    state(ysym) = plussym(bsym, csym)
    state(zsym) =
      subsym(plussym(plussym(asym, bsym),
                     plussym(bsym, csym)),
             bsym)
    count(l0) = 1, count(l1) = 1,
    count(l2) = 1, count(l3) = 1,
    count(l4) = 0

      l4: return z;
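
     Since sum only adds and subtracts, its symbolic values can be kept in a
     simplified form instead of nested plussym/subsym terms. The sketch below
     assumes a representation as linear expressions over asym, bsym, and csym;
     the Lin type and its helpers are illustrative, not King's data structures.

```c
/* A linear expression ca*asym + cb*bsym + cc*csym + k over the symbolic inputs. */
typedef struct { int ca, cb, cc, k; } Lin;

Lin lin_add(Lin p, Lin q)
{
    return (Lin){ p.ca + q.ca, p.cb + q.cb, p.cc + q.cc, p.k + q.k };
}

Lin lin_sub(Lin p, Lin q)
{
    return (Lin){ p.ca - q.ca, p.cb - q.cb, p.cc - q.cc, p.k - q.k };
}

/* Evaluate a symbolic expression for concrete inputs. */
int lin_eval(Lin e, int a, int b, int c)
{
    return e.ca * a + e.cb * b + e.cc * c + e.k;
}

/* Symbolically execute l1..l4 of sum and return the symbolic value of z. */
Lin sum_symbolic(void)
{
    Lin a = { 1, 0, 0, 0 }, b = { 0, 1, 0, 0 }, c = { 0, 0, 1, 0 };
    Lin x = lin_add(a, b);                 /* l1: x = a + b     */
    Lin y = lin_add(b, c);                 /* l2: y = b + c     */
    Lin z = lin_sub(lin_add(x, y), b);     /* l3: z = x + y - b */
    return z;                              /* l4: return z      */
}
```

     The result has coefficient 1 for each input, i.e. z = asym + bsym + csym,
     which is the simplified form of the subsym(...) term in the trace.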

------------------------------------------
          EXAMPLE WITH A LOOP

int power(int x, int y) {
    l2: int z = 1;
    l3: int j = 1;
    l4: if (y < j) {
        l45: goto l8;
    } else {
        l5: z = z * x;
        l6: j = j + 1;
        l7: goto l4;
    }
    l8: return z;
}


trace of symbolic execution:


------------------------------------------

      Q: What is the PC at the beginning?
          PC = true
      Q: What is the initial state?
          state(xsym) = xsym, state(ysym) = ysym
          count(l2) = 0, count(l3) = 0, ...

       l2: int z = 1;

           PC = true
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1
          count(l2) = 1, count(l3) = 0, ...

       l3: int j = 1;

          PC = true
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1, state(jsym) = 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 0, ...

      Q: What happens at l4 the first time?
         neither true ==> not (ysym<1)
             nor true ==> (ysym<1)
         are provable, so execution forks:

 case a: (y < 1)
          PC = (ysym < 1)   // i.e., (ysym-1 < 0)
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1, state(jsym) = 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 1, count(l45) = 0, ...

 case b: (!(y < 1))
          PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1, state(jsym) = 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 1, count(l45) = 0, ...

      Q: What happens next in case a?

         l45: goto l8

          PC = (ysym < 1)   // i.e., (ysym-1 < 0)
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1, state(jsym) = 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 1, count(l45) = 1, ...

         l8: return z;

      Q: What about case b?

          executes

         l5: z = z * x;
             
          PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1 * xsym, state(jsym) = 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 1, count(l45) = 0,
          count(l5) = 1, count(l6) = 0, ...

          l6: j = j + 1;

          PC = notsym(ysym < 1)   // i.e., notsym(ysym-1 < 0)
          state(xsym) = xsym, state(ysym) = ysym
          state(zsym) = 1 * xsym, state(jsym) = 1 + 1
          count(l2) = 1, count(l3) = 1,
          count(l4) = 1, count(l45) = 0,
          count(l5) = 1, count(l6) = 1,
          count(l7) = 0...

          now test the condition at l4 again...
          does the execution fork again?

    It will fork again, since the PC notsym(ysym < 1) decides neither
     (ysym < 2) nor notsym(ysym < 2); the test is (y < j), where state(jsym) = 2

    So the symbolic execution continues like this forever...

     Q: What should be done to prevent infinite loops in the tool?
      Need to cut off the search...
      (What King's system did, p. 390, is ask the user what to do...)

     Q: How would traditional static analysis handle such a loop?
         It would iterate until the analysis value reached a fixed point;
         if none is reached after some small number of iterations,
         it would approximate using a widening operator
         to get something that is both safe and guaranteed to be a fixed point
         

***** Limits of King's algorithm
      Q: What are the limits of King's original algorithm?
        - only integer data
           but now can use an SMT solver to deal with other kinds of data
           for user-defined types in OOP see the paper:
    
           S. Khurshid, C.S. Pasareanu, and W. Visser.
           2003. "Generalized Symbolic Execution for
           Model Checking and Testing." In TACAS 2003.
           LNCS vol. 2619. Springer.
           https://doi.org/10.1007/3-540-36577-X_40
           (or https://rdcu.be/cU1iz)


** background on automatic theorem proving
*** SAT Solvers
------------------------------------------
     BACKGROUND: SAT SOLVERS

goal: decide if a first-order formula is
       always true (or not)

SAT Solvers:

   input: propositional formula in CNF
   output: satisfying assignment to vars
           (or failure indication)

   example: w1 && w2 && w3
     where: w1 == b || c
            w2 == !a || !d
            w3 == !b || d

   satisfying assignment is
      {a =   , b =   , c =   , d =   }
   
------------------------------------------
    Q: What's the time complexity of this problem?
       NP-complete; every known algorithm takes exponential time
       in the worst case!

    However, there has been a lot of work making this run very fast
        most of the time, for small enough formulas
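
     For a formula this small, satisfiability can be checked by trying every
     assignment. The sketch below does that for the slide's example; it is a
     brute-force enumeration, not how real SAT solvers work, and the function
     names are made up for illustration. One satisfying assignment is
     a = false, b = false, c = true, d = false.

```c
#include <stdbool.h>
#include <stddef.h>

/* The CNF formula from the slide: w1 && w2 && w3. */
bool formula(bool a, bool b, bool c, bool d)
{
    bool w1 = b || c;
    bool w2 = !a || !d;
    bool w3 = !b || d;
    return w1 && w2 && w3;
}

/* Try all 2^4 assignments; bit 0 of the mask is a, bit 1 is b, and so on.
   Returns the first satisfying mask, or -1 if the formula is unsatisfiable.
   If count is not NULL it receives the number of satisfying assignments. */
int brute_force_sat(int *count)
{
    int first = -1, found = 0;
    for (int mask = 0; mask < 16; mask++) {
        bool a = mask & 1, b = mask & 2, c = mask & 4, d = mask & 8;
        if (formula(a, b, c, d)) {
            if (first < 0) first = mask;
            found++;
        }
    }
    if (count != NULL) *count = found;
    return first;
}
```

     The loop visits 2^n assignments for n variables, which is the
     exponential worst case that practical solvers try to avoid.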

*** SMT Solvers
------------------------------------------
         SMT SOLVERS

SMT = Satisfiability Modulo Theories

Combines SAT + decidable theories
 Example theories:
  - Uninterpreted functions and equality
  - Presburger arithmetic (arith. without *, /)
  - Arrays, Strings
  - Bitvectors
  - Algebraic Datatypes (e.g., enums)


Uninterpreted functions and equality:

   x == x
   x == y <==> y == x
   (x == y) && (y == z) ==> (x == z)

   x == y ==> f(x) == f(y)

------------------------------------------

**** proof process
------------------------------------------
   PROVING A FORMULA WITH SMT SOLVER

To prove P

see if not P is satisfiable


------------------------------------------
     Q: What does it mean if not P is satisfiable?
        that P is not always true (there are cases where its negation holds)

     Q: What does it mean if not P is unsatisfiable?
        that P is always true (there are no cases where it is false)

     Bonus, if not P is satisfiable,
        then the satisfying assignment gives a counterexample
          which can be used to show how the formula fails
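
     The same idea can be sketched for propositional formulas by enumeration:
     search for an assignment that satisfies not P, and report it as a
     counterexample if one exists. The restriction to two Boolean variables
     and all names below are assumptions made to keep the example short; an
     SMT solver does this search symbolically.

```c
#include <stdbool.h>

/* A propositional formula over two variables (an illustrative restriction). */
typedef bool (*Formula)(bool a, bool b);

/* Prove p by searching for an assignment that satisfies (not p).
   Returns true if (not p) is unsatisfiable, i.e. p is always true.
   Otherwise stores a counterexample in *ca and *cb and returns false. */
bool prove(Formula p, bool *ca, bool *cb)
{
    for (int mask = 0; mask < 4; mask++) {
        bool a = mask & 1, b = mask & 2;
        if (!p(a, b)) {            /* (not p) is satisfied here */
            *ca = a;
            *cb = b;
            return false;
        }
    }
    return true;
}

/* Example formulas: (a && b) ==> a is always true; a ==> b is not. */
bool and_implies_left(bool a, bool b) { return !(a && b) || a; }
bool a_implies_b(bool a, bool b)      { return !a || b; }
```

     For a_implies_b the search stops at a = true, b = false, which is the
     counterexample showing how the formula fails.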

**** example
------------------------------------------
           EXAMPLE

b+2 == c
  && f(read(write(a,b,3),c-2)) != f(c-b+1)

Which parts are arithmetic?


arrays?


uninterpreted functions?


------------------------------------------
  ... b+2 == c, c-2, c-b+1
  ... write(a,b,3), read(write(a,b,3),c-2)
  ... f(read(write(a,b,3),c-2)) != f(c-b+1)

------------------------------------------
       SMT PROCESS

b+2 == c
  && f(read(write(a,b,3),c-2)) != f(c-b+1)


------------------------------------------
      ... use equality to substitute
           b+2 for c
         
          b+2 == b+2
           && f(read(write(a,b,3),(b+2)-2)) != f((b+2)-b+1)

          simplify by theory of arithmetic, b+2-2 == b and b+2-b+1 == 3
               and by Boolean: (true && A == A)

          f(read(write(a,b,3),b)) != f(3)

          simplify by theory of arrays:

          f(3) != f(3)

          so by theory of uninterpreted functions,
             this is unsatisfiable!
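
     The array step can be sanity-checked on concrete values. The sketch
     below models write and read on a small fixed-size array (the size N,
     the Array type, and the arr_ names are assumptions) and checks that,
     whenever b+2 == c, both arguments of f evaluate to 3. This is a bounded
     concrete check, not a proof, but it shows why no function f can make
     the disequality true.

```c
#include <stdbool.h>

#define N 8   /* small array size, chosen only for this check */

typedef struct { int cell[N]; } Array;

/* write(a, i, v): a copy of a with index i set to v (arrays are values). */
Array arr_write(Array a, int i, int v)
{
    a.cell[i] = v;
    return a;
}

/* read(a, i): the value stored at index i. */
int arr_read(Array a, int i)
{
    return a.cell[i];
}

/* For every b in range, with c = b + 2 (the first conjunct), check that both
   arguments of f are equal, so f(left) != f(right) cannot hold for any f. */
bool args_always_equal(void)
{
    Array a = { { 0 } };
    for (int b = 0; b < N; b++) {
        int c = b + 2;
        int left  = arr_read(arr_write(a, b, 3), c - 2);
        int right = c - b + 1;
        if (left != right) return false;
    }
    return true;
}
```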