CIS 6614 meeting -*- Outline -*-

* Program Analysis or Static Analysis

** What is program analysis?
   Program analysis is also called static analysis (esp. in the
   literature on programming languages and compilers)

------------------------------------------
        WHAT IS PROGRAM ANALYSIS?

Def: *program analysis* is


------------------------------------------
    ... predicting statically safe approximations
        to the set of configurations or behaviors
        that may occur dynamically in a program's
        execution

     Q: How is it different from testing? From runtime assertion checking?
        - It's done without running the program,
        - It takes all possible executions into account

     Q: How does this differ from (human) code inspection?
        It's automated, so it can handle larger, more complex code bases

     Traditionally (and first) used for compiler optimizations
           (such as register allocation, dead code elimination, etc.)

** use in software security
------------------------------------------
          USES OF PROGRAM ANALYSIS
            IN SOFTWARE SECURITY

General goal/idea:


  Program ---> Static Analyzer --> Bug
                                   Reports

Examples:
  - warn about bad library functions
      (e.g., gets)


------------------------------------------
  ... also examples:
    - warn about misuse of protocols
       (e.g., crypto)
    - warn about using tainted inputs
    - warn about possible:
        - type errors

        - buffer overflows

        - arithmetic overflows

        - double frees of memory
        - use after free

        - using var before written
        - dead code (unreachable code)
            - remember the goto fail bug?

    Q: Could all of these cause security vulnerabilities?
       Yes...

*** vs. program testing

    Q: What are the differences from program testing?
    program analysis:
       - is static, does not run the program
       - doesn't need test data
       - can guarantee program always has some property
           (in all possible cases)
       - uses approximations,
         so may report errors that can't happen
           (unlike testing)
       - requires an expert to design the analysis (and tool)
       
*** vs. program verification
    Q: What's the difference from program verification?
    program analysis:
      - no explicit specification of behavior
      - little annotation
      - targeted to a *specific* property (usually just one)
      - may lose precision to gain speed
      - no human intervention once started
    
    program verification:
      - behavior usually specified in detail
      - lots of annotation
      - for any (safety) property
      - may take a day per function
      - often requires human intervention
    
    Q: What are the similarities with program verification?
     - both based on mathematical theories
     - both summarize executions into static approximations
     - both entirely static
       but (in both cases) there is a possibility for
       interactions or hybrid approaches
        (with dynamic  analysis)

*** kinds of properties
    Q: What is the difference between safety and liveness properties?

    Def: a *safety* property says that nothing bad happens
       and characterizes a prefix of a program's execution history
       (Since it's about a prefix, once false it can never become true)

    Def: a *liveness* property says that something good eventually happens

    Q: What kind of property does static analysis work on?
       safety, typically, but could also work on liveness properties
                             with appropriate assumptions about the system
       (such as that the scheduler is fair or doesn't take forever to
        do something.)

    Q: Are typical security policies safety or liveness properties?
         e.g., only authorized users can gain access to a system
       safety, so it is a good match for static analysis

   Q: Are any security properties liveness properties?
         yes, preventing denial of service
                (i.e., every customer request is eventually served)
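
    The prefix view of a safety property can be sketched as a monitor
    over a finite trace of events. This is only an illustration: the
    event names and the function first_violation are hypothetical, and
    the property checked ("no access while logged out") is a simplified
    stand-in for the authorization policy mentioned above.

```c
#include <stddef.h>

/* Hypothetical events, invented for this sketch. */
typedef enum { EV_LOGIN, EV_ACCESS, EV_LOGOUT } Event;

/* Safety property: no EV_ACCESS happens while logged out.
 * Returns the index of the first event that violates the property,
 * or -1 if the given finite prefix satisfies it. */
long first_violation(const Event trace[], size_t len)
{
    int logged_in = 0;
    for (size_t i = 0; i < len; i++) {
        switch (trace[i]) {
        case EV_LOGIN:
            logged_in = 1;
            break;
        case EV_LOGOUT:
            logged_in = 0;
            break;
        case EV_ACCESS:
            if (!logged_in)
                return (long)i;   /* the bad thing happened */
            break;
        }
    }
    return -1;
}
```

    Once a prefix is bad, every extension of it reports the same
    violation index, which is the "once false it can never become
    true" point. A liveness property has no such finite witness.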

*** Goals of Program Analysis
    Q: What are the goals of program analysis (static analysis)?
       - little or no input from programmers
           ==> so practical, usable

        Generally speaking, the bias is towards having no programmer
        input, in contrast to work in formal methods (verification)

       - correctness
           ==> so usable "under the covers"
       - efficient (at compile time):
           in time and space


** equivalent ideas
   Q: In what ways (mathematical formalisms) can an analysis be expressed?
    - Dataflow equations
    - Constraints
    - Abstract Interpretation
    - Type system
    - Control Flow Analysis

    These differ in emphasis:

       - Dataflow equations and constraints - compilers
       - Abstract interpretation - correctness

    All can approximate dependence
        on dynamic values
        in imperative languages

    Type systems typically have:
       - no flow dependence
       - uniform treatment of each statement

    Constraint based analysis = Control Flow Analysis
       - is typically used for analysis of functional languages

** key design idea
    Q: What's the key design idea for a program analysis?

       Slogan: Err on the safe side!

    Key questions:
      - What's the bad outcome?
      - What would track the opposite of that?
      - What would be safe?

    A sound analysis tracks the negation of the bad outcome,
           so if the analysis is sound,
             and if it says the bad outcome doesn't happen,
                 then the bad outcome is not possible

             All programs
  |--------------------------------|
  |      Programs that are safe    |
  |   |------------------------|   |
  |   |  Programs claimed safe |   |
  |   |                        |   |
  |   |------------------------|   |
  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
  |                                |
  |    Programs that can be bad    |
  |--------------------------------|


    Because: what we know in a sound analysis
       is that the claimed safety property
       holds in all possible executions

*** Example: Dead code analysis (as in the goto fail bug)
     - What's the bad outcome?
       that we say code can be deleted (is dead) and it's not

     - What would track the opposite?
         tracking code that is needed

     - What would be safe?
         saying more code is needed than is true

       So the analysis may use a bigger set of code than is truly needed
           (overapproximate what is needed)
           that way if we are wrong, it avoids the bad outcome
             (we would say something is needed, when it isn't)

             All programs
  |--------------------------------|
  |   Programs with no dead code   |
  |   |------------------------|   |
  |   |  Programs claimed to   |   |
  |   |   have no dead code    |   |
  |   |------------------------|   |
  |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
  |                                |
  |  Programs that have dead code  |
  |--------------------------------|

         note that we won't be claiming that the bad outcome occurs,
              only that it doesn't occur

     Q: Assuming the analysis is sound,
        will we have some false positives (claiming bad programs are safe)?
         No, not if the analysis is sound

     Q: Will we have some false negatives (good programs not claimed safe)?
         Yes, because making it perfectly precise is impossible

------------------------------------------
           FOR YOU TO DO

Suppose we want to prevent using gets()
  (from libc).
  What should we do in an analysis?


------------------------------------------
    ... track program points that use gets()
        Q: What is the bad outcome to prevent?
            saying the program is okay, if it actually might use gets()

        Q: What would track the opposite?
            lines that might use gets()

        Q: What would be safe?
            saying more lines might use gets()
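
    A minimal sketch of "saying more lines might use gets()", assuming
    a purely textual check on one source line at a time. The helper
    names is_ident_char and might_use_gets are hypothetical. The check
    deliberately overapproximates: it flags gets in comments and string
    literals too. It is not sound against macro tricks such as token
    pasting, which a real tool would handle after preprocessing.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* True if c can appear inside a C identifier. */
static bool is_ident_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Hypothetical helper: true if `line` mentions gets as a whole
 * identifier.  Identifiers that merely contain "gets" (such as fgets)
 * are not flagged; any other mention is, even in a comment. */
bool might_use_gets(const char *line)
{
    const char *p = line;
    while ((p = strstr(p, "gets")) != NULL) {
        bool left_ok = (p == line) || !is_ident_char(p[-1]);
        bool right_ok = !is_ident_char(p[4]);
        if (left_ok && right_ok)
            return true;
        p += 1;   /* keep searching after this occurrence */
    }
    return false;
}
```

    Flagging the comment and the function-pointer use is the safe
    direction here: the warning may be spurious, but a program that
    passes the check does not name gets on any line.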

** Terms in static analysis
*** intraprocedural vs. interprocedural analyses
   Q: What is a procedure in the context of program analysis?
       a subroutine: procedure, function, method, ...

   Q: What is the difference between an intraprocedural analysis and
      an interprocedural analysis?

Def: an *intraprocedural* analysis
     analyzes information only within a procedure/function/method
         - it can be modular
             if there are summaries of the analysis information
             for each procedure
         - so it scales well

Def: an *interprocedural* analysis
     analyzes the entire program, across procedure calls
         - much more complex
         - not modular (if no summaries are used)
         - doesn't scale well (often cubic in size of program)
         - but can be more accurate for statically composed programs

*** may vs. must analysis
     Q: What is the difference between a "may" and a "must" property/analysis?

    Def: A "may property" (or analysis) tracks
         what might happen in some execution

    Def: A "must property" (or analysis) tracks
         what will always happen in every execution

*** sources of imprecision
**** conditional statements
------------------------------------------
    CONDITIONAL STATEMENTS: MAY ANALYSIS

At point END,
might the following code have used gets?

    if (...) {
        // 2
        gets(buffer);
    } else {
        // 4
        fgets(buffer, sizeof(buffer), fn);
    }
    // point END

------------------------------------------
    ... yes, maybe, and we want to warn about it

------------------------------------------
   CONDITIONAL STATEMENTS: MUST ANALYSIS

At point END, will
the following code always have used gets?

    if (...) {
        // 2
        gets(buffer);
    } else {
        // 4
        fgets(buffer, sizeof(buffer), fn);
    }
    // point END

------------------------------------------
    ... No, the condition might be false.

    Q: What does a "may" analysis do with information from the
       branches of a conditional?
         - it unions them

    Q: What does a "must" analysis do with information from the
       branches of a conditional?
         - it intersects them
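
    The two ways of combining branch information can be sketched with
    a bit set of facts. The names FactSet, USED_GETS, USED_FGETS,
    may_join and must_join are made up for this sketch.

```c
#include <stdint.h>

/* Each bit records one fact about what the code has done so far. */
typedef uint32_t FactSet;
enum { USED_GETS = 1, USED_FGETS = 2 };

/* May analysis: a fact holds after the conditional
 * if it holds at the end of at least one branch (union). */
FactSet may_join(FactSet then_out, FactSet else_out)
{
    return then_out | else_out;
}

/* Must analysis: a fact holds after the conditional
 * only if it holds at the end of both branches (intersection). */
FactSet must_join(FactSet then_out, FactSet else_out)
{
    return then_out & else_out;
}
```

    For the conditional above, the then-branch ends with USED_GETS and
    the else-branch with USED_FGETS, so may_join keeps USED_GETS at
    END while must_join drops it.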
    
**** loops

------------------------------------------
       LOOP STATEMENTS: MAY ANALYSIS

Might the following code use gets?

    while (...) {
        // 2
        gets(buffer);
    }
    // point END


What about:

    while (i < BUFSIZ) {
       // 5
       if (i >= 3) gets(buffer);
       i++;
    }
    // point END

------------------------------------------

        ... yes, at END both might have used gets,
            so we need to warn about them

------------------------------------------
     LOOP STATEMENTS: MUST ANALYSIS

Will the following code always use gets?

    while (...) {
        // 2
        gets(buffer);
    }
    // point END


What about:

    while (i < BUFSIZ) {
       // 5
       if (i >= 3) gets(buffer);
       i++;
    }
    // point END

------------------------------------------

      ... no, at END neither must have used gets:
          the loop body might execute zero times

      Q: What should be done with the information in a may analysis?
          union the information on all paths
      Q: How about for a must analysis?
          intersect the information on all paths

**** handling different kinds of control structures (CFGs)

     Q: How can one write an analysis that is generic with respect to
     different kinds of control structures in different languages
     (such as switch statements, different kinds of loops,
     exception handling, etc.)?
       - Write it in terms of a universal (lower-level) representation:
            sequencing and gotos

     Q: What is a control flow graph (CFG) in static analysis?
         - nodes are statements or tests (each with a label)
         - edges are the control flow relation between labels
             (map from a node to the next nodes in the forward control flow)

Example:

    z = 1; // 1
    while (x > 0) // 2
    {  z = z*y;   // 3
       x = x-1; }  // 4
    print("done"); // 5

  Blocks are:
        statements 1, 3, 4, 5 and test 2

  Flows are:
           (1,2), (2,3), (2,5), (3,4), (4,2)

Picture:
       [ 1 ]
         |
         v
   /-->[ 2 ]--\
   |     |    |
   |     v    |
   |   [ 3 ]  |
   |     |    |
   |     v    |
   \---[ 4 ]  |
              |
         /----/
         v
       [ 5 ]
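
    One possible encoding of this CFG, assuming the flows are kept as
    a plain edge list. The type Flow and the function successors are
    names chosen for this sketch, not part of any particular tool.

```c
#include <stddef.h>

typedef struct { int from, to; } Flow;

/* Flows of the example: (1,2), (2,3), (2,5), (3,4), (4,2). */
static const Flow flows[] = { {1, 2}, {2, 3}, {2, 5}, {3, 4}, {4, 2} };
static const size_t num_flows = sizeof flows / sizeof flows[0];

/* Store the successors of `node` in out[] (at most cap of them)
 * and return how many successors the node has. */
size_t successors(int node, int out[], size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < num_flows; i++) {
        if (flows[i].from == node) {
            if (n < cap)
                out[n] = flows[i].to;
            n++;
        }
    }
    return n;
}
```

    Test 2 has two successors (3 and 5), which is where paths split;
    node 2 also has two predecessors (1 and 4), which is where an
    analysis has to combine information.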


**** property space
    this is the technical term for the information used in the analysis
       (and the way that information is combined, the join operator
        and an approximation ordering on the information, making it a lattice)

    We divide analyses into: may/must
       When the property space is a set,
       then the join operator is:
          - union if doing a may analysis
          - intersection if doing a must analysis
       the initial value of the information is (representing unknown):
          - empty set if a may analysis
          - the total set if a must analysis

    Q: How does a property space reflect the imprecision of an analysis?
         - unions (and intersections) may lose information
           when paths are combined

***** example
------------------------------------------
      EXAMPLE: UNREACHABLE CODE

Consider C with statements:
   assignments, empty statements (skip),
   if-then-else, while,
   sequencing, labels

 Stmt ::= L: x = E; | L : ;
      | if (L: E) { Stmt1 } else { Stmt2 }
      | while (L: E) { Stmt }
      | Stmt1 Stmt2

 Atomic blocks are:
    S ::= L : x = E; | L : ;
        | L : E

So CFG of

    l1: x = 0;
    while (l2: x < 4) {
       l3: y = y+1;
       l4: x = x-1; }
    l5: y = x*x;

is

     [ l1: x = 0;]
           |
           v
  /->[ l2: x < 4; ]--\
  |        |         |
  |        v         |
  |  [ l3: y = y+1;] |
  |        |         |
  |        v         |
  \--[ l4: x = x-1;] |
                     |
           /---------/
           |
           v
     [ l5: y = x*x;]

Set of flows is:


We want a forward, may analysis with:

Property space:
  Reaches(l) = Set of labels that may
               be reached from start
  initial value: {}

Combination:
  join is union
  EntryTo(l) =
    union {ExitOf(l') | (l',l) in flows}

Transfer functions:
 f(l: x = E;, EntryTo(l))
      = {l} union EntryTo(l)
 f(l: ;, EntryTo(l)) = EntryTo(l)
 f(l:E, EntryTo(l)) = {l} union EntryTo(l)


------------------------------------------
    ... {(l1,l2), (l2,l3),
             (l3,l4), (l4,l2), (l2,l5)}

------------------------------------------
    EQUATIONS AND SOLUTION FOR EXAMPLE

EntryTo(l1) = {}
ExitOf(l1) = {l1} union EntryTo(l1)
EntryTo(l2) = union{ExitOf(l1),ExitOf(l4)}
ExitOf(l2) = {l2} union EntryTo(l2)
EntryTo(l3) = ExitOf(l2)
ExitOf(l3) = {l3} union EntryTo(l3)
EntryTo(l4) = ExitOf(l3)
ExitOf(l4) = {l4} union EntryTo(l4)
EntryTo(l5) = ExitOf(l2)
ExitOf(l5) = {l5} union EntryTo(l5)

Trace of iterations:

     Steps: 0  1
EntryTo(l1) {}
ExitOf(l1)  {}

EntryTo(l2) {}
ExitOf(l2)  {}

EntryTo(l3) {}
ExitOf(l3)  {}

EntryTo(l4) {}
ExitOf(l4)  {}

EntryTo(l5) {}
ExitOf(l5)  {}
------------------------------------------

   trace this
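
   A hand trace of these equations can be checked with a small solver.
   This is a sketch under assumptions: label sets are bit masks (bit
   l-1 stands for label l), every block in the example is an assignment
   or a test, and the names LabelSet, transfer and solve_reaches are
   invented here. It updates the sets in place, so its intermediate
   steps may differ from a trace that recomputes every equation from
   the previous step; the final solution is the same.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { NLABELS = 5 };          /* labels l1..l5 */
typedef uint32_t LabelSet;     /* bit (l-1) set means label l is in the set */
typedef struct { int from, to; } Flow;

static const Flow flows[] = { {1, 2}, {2, 3}, {3, 4}, {4, 2}, {2, 5} };
static const size_t num_flows = sizeof flows / sizeof flows[0];

LabelSet entry_to[NLABELS + 1];   /* index 0 is unused */
LabelSet exit_of[NLABELS + 1];

/* Transfer function for assignments and tests: add the block's label. */
static LabelSet transfer(int l, LabelSet in)
{
    return in | (1u << (l - 1));
}

/* Start every set at {} and re-evaluate the equations until no set
 * changes.  Returns the number of passes over the labels, counting
 * the last pass in which nothing changed. */
int solve_reaches(void)
{
    int passes = 0;
    bool changed = true;
    for (int l = 1; l <= NLABELS; l++)
        entry_to[l] = exit_of[l] = 0;
    while (changed) {
        changed = false;
        passes++;
        for (int l = 1; l <= NLABELS; l++) {
            LabelSet in = 0;
            /* EntryTo(l) = union of ExitOf(l') over flows (l', l). */
            for (size_t i = 0; i < num_flows; i++)
                if (flows[i].to == l)
                    in |= exit_of[flows[i].from];
            LabelSet out = transfer(l, in);
            if (in != entry_to[l] || out != exit_of[l])
                changed = true;
            entry_to[l] = in;
            exit_of[l] = out;
        }
    }
    return passes;
}
```

   After solve_reaches(), exit_of[5] has all five bits set: the
   analysis says every label may be reached.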

   Q: Is this result what happens dynamically?
        No, the loop never terminates
        (x starts at 0 and only decreases, so x < 4 always holds),
        so l5 is never reached
   Q: Is it safe?
        no, it claims all code is reachable sometimes, but that isn't true!

   Q: How could we make this safe?
       Need to take values into account,
       so need to approximate those, or partially evaluate constant
       exprs

*** building tools with dataflow analysis
   Q: What are some existing tools for building an analysis?
      Parser generator and:
       - LLVM
       - JastAdd Java compiler
       - XText