CIS 6614 meeting -*- Outline -*-

* Fuzz Testing

** What is Fuzz Testing

This is a kind of dynamic, randomized testing.

The first publication may be:

  Barton P. Miller, Louis Fredriksen, and Bryan So.
  "An empirical study of the reliability of UNIX utilities".
  CACM, 33(12):32-44, Dec. 1990.
  https://doi.org/10.1145/96267.96279

Q: What is fuzz testing?

------------------------------------------
WHAT IS FUZZING

Def: *fuzz testing* (or black box fuzzing) is

Idea:
                 random input
   [ Fuzzer ] ----------------> [ PUT ]
      |  ^                        /
      |   \ - - - - - - - - - - -/
      |     crashes and timeouts
      v
   inputs that crash the PUT
------------------------------------------
... a dynamic analysis technique that feeds a program
    random inputs and tries to make the program crash

Q: What is the goal of fuzz testing?
   to find bugs/vulnerabilities by provoking crashes

Q: What could cause a program crash?
   security violations that the OS notices
   (e.g., segmentation faults),
   semantic violations in the PL, or assertion errors

Q: What is likely to be the largest cost in terms of time
   in fuzz testing?
   Executing the program under test (the PUT)

------------------------------------------
FUZZING VS. REGRESSION TESTING

           Regression        Fuzzing

Goal

Inputs

------------------------------------------
Q: What is regression testing?
   re-running test cases when a program is changed

Q: What is the goal of regression testing?
   Checking proper functionality

Q: What is the goal of fuzz testing?
   Finding bugs/vulnerabilities

Q: What kind of inputs are used in regression testing?
   Expected, or normal, inputs

Q: What kind of inputs are used in fuzz testing?
   Random, possibly abnormal, inputs
   (such as might be used in an attack)

Q: Can we use fuzz testing during regression testing?
   Yes, but to be repeatable we need a list of test inputs,
   which the fuzzer can save for us...

*** monitoring program execution

------------------------------------------
MONITORING OF PROGRAM

Fuzz tester:

  - maintains list of inputs that cause crashes

------------------------------------------
Q: How can the fuzzer know what inputs cause crashes?
   It watches their execution; the OS tells it what happens.
   If an input leads to a crash, then it records that input.

Q: What if the program goes into an infinite loop?
   There must be a timeout (usually set by the user)

Q: What information can a crash tell the developer/tester?
   the kind of crash and a trace;
   if it's an assertion violation,
   where it occurred and the condition

Q: How could we find more bugs than just running the program?
   use a dynamic memory error detector
   (e.g., valgrind, purify, or address sanitizer)
   advantage: finds more potential problems
   disadvantage: runs slower
   Valgrind also allows custom checkers (tools, formerly
   called skins) that can find other bugs...

*** advantages and disadvantages

------------------------------------------
ADVANTAGES AND DISADVANTAGES OF FUZZING

Advantages:


Disadvantages:

------------------------------------------
Advantages:

Q: How hard would this be to implement?
   not hard at all; it's easy to generate random inputs,
   and the only interesting part is the execution monitor

Q: How much of a programmer's time is involved
   in doing fuzz testing?
   none, it can run automatically (e.g., overnight)

Disadvantages:

Q: Are random inputs going to be effective at finding crashes?
   They seem to be, in fact,
   unless the program validates its inputs.
   Valid inputs may be a small fraction of all possible inputs,
   so a random search might not find them.

Q: How does the fuzzer know what inputs are valid?
   It doesn't.
   Part of the idea is to generate inputs outside what is expected
   (as an attacker might do).
   This can be a major problem.

*** finding inputs that crash the program

------------------------------------------
FUZZ TESTING AND INPUT VALIDATION

Reality:

   [ Fuzzer ]
       |
       |  random
       |  bytes
       v
   [ Driver ] ----------------> [ Program ]
                                    |
                                    | crashes
                                    v
                         inputs that crash program
------------------------------------------
If the program takes bytes as input (as would a Unix command),
then the driver can pass them along;
otherwise it converts the bytes into inputs
that the program will accept
(this is especially needed for unit testing
of procedures/methods)

Q: What kind of programs would take bytes as inputs?
   programs that work on files or network sockets
   Unix commands usually operate on files (of chars),
   and this was the first application,
   in Miller, Fredriksen, and So's 1990 paper
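To make the driver picture concrete, here is a minimal sketch
(not any particular tool) of a black-box fuzzer for a Unix-style
program that reads a file; the target path ./put, the trial count,
and the size limit are made-up choices, and a real fuzzer would
also enforce the timeout discussed above.

  /* minimal_fuzz.c -- black-box fuzzing sketch (illustrative).
     Writes random bytes to a file, runs the hypothetical PUT
     "./put" on it, and saves any input whose run is killed by
     a signal, i.e., crashes.  No timeout handling is shown. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      srandom((unsigned)getpid());
      for (int trial = 0; trial < 10000; trial++) {
          /* 1. generate a random input file */
          FILE *f = fopen("input.bin", "wb");
          if (!f) { perror("fopen"); return 1; }
          long len = random() % 4096;
          for (long i = 0; i < len; i++)
              fputc((int)(random() % 256), f);
          fclose(f);

          /* 2. execute the PUT on that input */
          pid_t pid = fork();
          if (pid < 0) { perror("fork"); return 1; }
          if (pid == 0) {                /* child: run the PUT */
              execl("./put", "put", "input.bin", (char *)NULL);
              _exit(127);                /* exec failed */
          }
          int status;
          waitpid(pid, &status, 0);

          /* 3. monitor: death by signal (e.g., SIGSEGV) is a
                crash, so record the offending input */
          if (WIFSIGNALED(status)) {
              char name[64];
              snprintf(name, sizeof name, "crash-%d.bin", trial);
              rename("input.bin", name);
              printf("trial %d: signal %d, input saved as %s\n",
                     trial, WTERMSIG(status), name);
          }
      }
      return 0;
  }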
** mutation-based fuzzing

Q: Could we add some human expertise to fuzz testing? How?
   Ask a human to give some starting inputs,
   then try to modify them to explore around them
   (as was done in hybrid symbolic execution,
   but less systematically)

*** idea of mutation-based fuzzing

------------------------------------------
MUTATION-BASED FUZZING

1. get seed input(s) from human
2. track list of good inputs
3. a. pick a random good input
   b. mutate it
   c. feed mutant to the program
   d. if it causes a crash,
4. repeat

Example fuzzers:
  ZZUF, FileFuzz, Taof, ProxyFuzz, etc.

Example scenario: fuzzing a PDF viewer
  1. collect PDF files from web
  2. Mutate those files
  3. Record which mutants crash viewer
------------------------------------------
... then add it to the list of good inputs

Q: When does this process stop?
   It doesn't; the user needs to stop it (by a timeout, say)

Q: What kinds of mutations of inputs would be effective?
   - small ones, so that the mutants can still pass validations,
     e.g., bit flips (see the sketch after this section)
   - or one could use a heuristic
     (e.g., changing pointers to null)

Examples:
  ZZUF and FileFuzz work on files;
  Taof and ProxyFuzz are protocol fuzzers

*** advantages and disadvantages

------------------------------------------
MUTATION-BASED FUZZING

Advantages:


Disadvantages:

------------------------------------------
Advantages:

Q: How hard is mutation-based fuzzing to implement?
   Not hard, especially if one already has a monitor

Q: Does mutation-based fuzzing require precise specifications?
   No, only examples of inputs

Disadvantages:

Q: Is there any danger of missing kinds of attacks?
   Yes, the testing is at least limited by the seeds,
   as it assumes that attacks are similar to the seeds

Q: What kinds of inputs would resist fuzz testing with mutation?
   Those that have checksums, certificates,
   or challenge/response protocols
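As a sketch of the small-mutation idea mentioned above, the
following flips a few random bits in a copy of a seed buffer;
the function name and the one-flip-per-64-bytes rate are
illustrative choices, not taken from any particular fuzzer.

  #include <stdlib.h>
  #include <string.h>

  /* Copy the seed and flip a few random bits, so the mutant
     stays close to the seed and can still pass many of the
     program's input validations. */
  void mutate_bitflip(const unsigned char *seed, size_t len,
                      unsigned char *out) {
      if (len == 0) return;
      memcpy(out, seed, len);
      size_t flips = len / 64 + 1;  /* a few flips, scaled to size */
      for (size_t i = 0; i < flips; i++) {
          size_t pos = (size_t)(random() % (long)len);
          out[pos] ^= (unsigned char)(1u << (random() % 8));
      }
  }

Keeping the number of flipped bits small is what lets mutants
slip past magic numbers and simple field checks that purely
random inputs would fail.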
** generation-based fuzzing

Q: How could we fuzz-test programs that parse/validate
   their inputs?
   We could use a specification of valid inputs
   to generate new inputs for fuzz testing...

------------------------------------------
GENERATION-BASED FUZZING

Idea:

1. Use specification of format to generate inputs
2. Mutations added to syntactic spots in inputs
3. Otherwise like mutation-based

------------------------------------------
Q: What does it mean to be like mutation-based fuzzing otherwise?
   Track a list of good inputs (those that cause crashes)
   and base mutations on those

Q: Could we use heuristics in generating inputs?
   Yes, that might help find problems faster
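As a sketch of generating inputs from a format specification,
the toy grammar below (expr ::= num | expr op expr) is
hypothetical; the rare out-of-grammar token stands in for the
"mutations added to syntactic spots" step, so outputs are
usually valid by construction but occasionally probe the code
behind the parser.

  #include <stdio.h>
  #include <stdlib.h>

  /* Generate from the toy grammar
       expr ::= num | expr op expr ,  op ::= '+' | '*'
     with an occasional out-of-grammar token spliced in
     at a syntactic spot. */
  static void gen_expr(FILE *out, int depth) {
      if (depth <= 0 || random() % 2 == 0) {
          if (random() % 16 == 0)
              fputs("@#!", out);       /* rare syntactic mutation */
          else
              fprintf(out, "%ld", random() % 1000);
      } else {
          gen_expr(out, depth - 1);
          fputc(random() % 2 ? '+' : '*', out);
          gen_expr(out, depth - 1);
      }
  }

  int main(void) {
      srandom(42);                  /* fixed seed: repeatable */
      for (int i = 0; i < 5; i++) { /* print a few sample inputs */
          gen_expr(stdout, 4);
          fputc('\n', stdout);
      }
      return 0;
  }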
** comparison

------------------------------------------
ADVANTAGES: MUTATION VS. GENERATION

Mutation-based fuzz testing:


Generation-based fuzz testing:

------------------------------------------
Q: Which is easier to implement?
   mutation-based

Q: Which requires the least amount of expertise to use?
   mutation-based

Q: Which can work for programs that validate input syntax?
   generation-based

Q: Which works for data with checksums?
   generation-based is better

Q: Which is more likely to find problems?
   generation-based, if the program validates input syntax

Q: How should we measure the performance of these?
   time to find bugs, or coverage of code...

Q: Which has better performance?
   generation-based could be faster at finding bugs,
   especially for programs that validate input syntax

** when to stop fuzzing

------------------------------------------
WHEN TO STOP FUZZING?

How do we know we have tested enough?

------------------------------------------
Q: The original and mutation-based fuzz testers run forever,
   so when should we stop them?
   Usually based on resource limits,
   but how do we know that's good enough?

Q: Generation-based fuzzers may only generate a limited number
   of inputs; how do we know if that's enough?
   Code coverage may be an answer, as is standard in testing

*** code coverage

------------------------------------------
CODE COVERAGE

Def: *Code coverage* is


Tools for profiling: gcov, lcov
  (see https://about.codecov.io/tool/lcov/)

Also for LLVM see:
  clang.llvm.org/docs/SourceBasedCodeCoverage.html
------------------------------------------
... a measure of how much of a program's code
    is executed by a testing process

gcov is the GNU profiling tool; it gives statement coverage
lcov is a graphical front end to gcov

There are also tools for Java, e.g., JaCoCo

Q: Does code coverage actually tell us about a testing process's
   ability to find bugs?

**** line coverage

------------------------------------------
LINE COVERAGE

Measures: Percentage of lines of code executed (or statements)

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% line coverage?

Variation: block coverage is

------------------------------------------
... 1, e.g., (3,4), which takes both true branches
    and so executes every line

Q: Do we care about the lines with only a } in them?
   no

... percentage of basic blocks executed by a testing process

**** branch coverage

------------------------------------------
BRANCH COVERAGE

Also called: decision coverage

Measures: Percentage of branches of code executed

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% branch coverage?

------------------------------------------
... 2, to take both the true and false branches
    of both if-statements,
    e.g., (0,0), which takes both false branches,
    and (3,3), which takes both true branches

**** path coverage

------------------------------------------
PATH COVERAGE

Measures: Percentage of paths in code executed

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% path coverage?

------------------------------------------
... 4, one for each combination of branch outcomes:
    true and true, true and false,
    false and true, and false and false,
    e.g., (3,3), (3,0), (0,3), and (0,0)

Q: How many paths can a program have?
   - exponential in the number of if-tests
   - but infinite for programs with loops!

**** comparison (omit)

------------------------------------------
COMPARING TYPES OF COVERAGE

Which type is most desirable?

Does 100% coverage guarantee finding bugs?

------------------------------------------
... testers typically want path coverage,
    but it is impossible to get 100% (due to loops)

... no, the example below of strcpy shows that isn't true,
    and in practice there might be bugs hiding
    in paths far away...

The examples above show that X percent of one measure
doesn't necessarily imply X percent of another
(although 100% path coverage implies 100% of the others);
e.g., if path coverage is 25%, branch coverage may be 50%,
and line coverage might be any of a number of different values

*** evaluation

**** benefits of code coverage

------------------------------------------
BENEFITS OF CODE COVERAGE

------------------------------------------
Q: What benefits could one get from code coverage?
   - seeing if testing never executes some code
   - comparing fuzzers
   - helping decide when to stop fuzzing

**** problems with code coverage

Conventional wisdom in SE:
  more coverage implies finding more bugs
  (affirmed by the Cai & Lyu reference below)

------------------------------------------
PROBLEMS OF CODE COVERAGE

Consider:

  void myStrCpy(char *dst, char *src) {
    if (dst && src) {
      strcpy(dst, src);
    }
  }

------------------------------------------
Q: Is there any vulnerability in that code?
   yes, there could be a buffer overflow...
   the lengths are not checked
   (dst may be shorter than src)

Q: Will code coverage help us find that vulnerability?
   no, it is easy to get 100% path coverage in this function
   without triggering the overflow

*** how good is code coverage?

------------------------------------------
HOW GOOD A MEASURE IS CODE COVERAGE?

See:
  Xia Cai and Michael R. Lyu.
  "The effect of code coverage on fault detection
  under different testing profiles".
  In A-MOST '05: Proceedings of the 1st International Workshop
  on Advances in Model-Based Testing, pp. 1-7, May 2005.
  https://doi.org/10.1145/1083274.1083288

"code coverage is simply a moderate indicator for
 the capability of fault detection on the whole test set."

[code coverage is] "clearly a good estimator for the fault
 detection of exceptional test cases, but a poor one for
 test cases in normal operations."

"the correlation between code coverage and fault coverage
 is higher in functional test than in random testing"
------------------------------------------
The test profiles considered are:
 - functional testing (based on a specification), and
 - random testing

The lack of correlation between random testing and fault coverage
may be explained by functional testing being designed
to cover more of the code; from section 3.2:

  "In general, functional test cases are designed to increase
  their code coverage (i.e., to cover more code fragments),
  while random test cases are generated to simulate real
  operational environment and not likely to improve code
  coverage."

** coverage-guided gray-box fuzzing

------------------------------------------
COVERAGE-GUIDED GRAY-BOX FUZZING

Idea:

1. Mutants of good inputs are good if


Algorithm:

1. Good inputs start from seed input(s)
   that don't crash or loop forever
2. Mutate random good input
3. Execute it and monitor execution and its coverage
4. If program crashes or
   then save input as a new good input

Example: AFL
------------------------------------------
... they increase code coverage

... increases code coverage,

*** AFL

An implementation of this idea for x86 programs,
from Google, is AFL

------------------------------------------
AMERICAN FUZZY LOP (AFL)

Instruments code at compile-time
(can instrument assembly code or LLVM)

Tracks all edges in CFG (with counters)

 - Hashtable tracks log_2 of executions with 8 bits per edge

Target program runs as separate process
------------------------------------------
Note: the execution counter maxes out at its top bucket, 128+

Q: What's the advantage of using log base 2 of executions?
   efficient in terms of space and time

Q: What's the disadvantage of using log base 2 of executions?
   imprecise
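The edge-counting instrumentation can be sketched as follows;
this is a simplification of AFL's documented design, with the
map size and names chosen for illustration.

  #include <stdint.h>

  #define MAP_SIZE (1 << 16)
  static uint8_t edge_map[MAP_SIZE];  /* shared with the fuzzer */
  static uint16_t prev_loc;           /* id of previous block */

  /* Inserted at the start of every basic block; cur_loc is a
     compile-time random constant identifying that block.  One
     8-bit counter per (prev block, current block) edge. */
  static inline void log_edge(uint16_t cur_loc) {
      edge_map[cur_loc ^ prev_loc]++; /* counter may wrap */
      prev_loc = cur_loc >> 1;        /* so A->B differs from B->A */
  }

After each run the counters are coarsened into power-of-two
buckets (1, 2, 3, 4-7, ..., 32-127, 128+), which is the log_2
counting and the 128+ maximum noted above; a mutant that hits a
new edge, or a new bucket for a known edge, is saved as a new
good input.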
*** libfuzzer

Q: Would it help to combine the ideas of fuzzing
   and concolic testing?
   Yes, then we could more systematically explore paths

Q: How would we do that?
   track the effects of branch decisions on the path condition,
   then try to generate (with an SMT solver) inputs
   that explore different paths,
   which would (potentially) increase coverage

------------------------------------------
LIBFUZZER

libFuzzer, "a library for coverage-guided fuzz testing"
in LLVM -- llvm.org/docs/LibFuzzer.html

See also tutorial.libfuzzer.info

Uses LLVM's SanitizerCoverage for instrumentation

e.g., to build also with address sanitizer:

  clang -g -O1 -fsanitize=fuzzer,address \
        target.c

Can be used with AFL
------------------------------------------

*** go-fuzz

also guided by dataflow...

------------------------------------------
GO-FUZZ

For the Go programming language

Built into their tool chain

See https://go.dev/security/fuzz/
------------------------------------------

** challenges in fuzz testing

------------------------------------------
CHALLENGES IN FUZZ TESTING

Seeds need to be crafted well:
 - must not crash the program
 - must not cause loops
 - should cover different branches

Some branches make code hard to cover...
------------------------------------------

*** coverage challenges

------------------------------------------
DIFFICULT TO FUZZ BRANCHES

  void test1(int n) {
    if (n == 0x12345678) {
      abort();
    }
  }
------------------------------------------
Q: How hard would it be to get to the abort in test1?
   needs 2^32 attempts (about 4 billion) in the worst case,
   since only one of the 2^32 values of n takes the branch

------------------------------------------
EASIER TO FUZZ TEST

  void test2(int n) {
    int dummy = 0;
    char *p = (char *)&n;
    if (p[0] == 0x12) dummy++;
    if (p[1] == 0x34) dummy++;
    if (p[2] == 0x56) dummy++;
    if (p[3] == 0x78) dummy++;
    if (dummy == 4) {
      abort();
    }
  }
------------------------------------------
Q: How many inputs are needed in the worst case for test2?
   only about 2^10 (about 1024) in the worst case:
   a coverage-guided fuzzer gets feedback each time one of the
   byte comparisons is newly passed, so it can solve the four
   bytes one at a time (about 4 * 2^8 attempts)
   instead of all at once

** rules of thumb

------------------------------------------
WISDOM FROM FUZZ TESTERS

- Knowing the input format helps

- Generational testing is better than random,
  but needs good specifications

- Implementations vary:
  different fuzzers ==> different bugs

- Running longer will find more bugs
  (but reaches a limit)

- Coverage guidance works best

- Use profiling to see where the fuzzer is getting stuck,
  then design better seeds to get past those spots...
------------------------------------------
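To connect the libFuzzer slide above with these rules of thumb,
here is a sketch of a fuzz target; the entry point
LLVMFuzzerTestOneInput is libFuzzer's actual interface, but
parse_header is a made-up stand-in for real code under test.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical code under test: reads a 4-byte tag and a
     1-byte length; a real parser might omit the final bound
     check, which fuzzing with ASan would catch. */
  static int parse_header(const uint8_t *data, size_t size) {
      if (size < 5) return -1;
      if (data[0] != 'F' || data[1] != 'U' ||
          data[2] != 'Z' || data[3] != 'Z') return -1;
      size_t len = data[4];
      return (size >= 5 + len) ? 0 : -1;
  }

  /* libFuzzer calls this once per generated input. */
  int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      parse_header(data, size);
      return 0;  /* non-zero values are reserved by libFuzzer */
  }

This would be built with the clang command from the libFuzzer
slide (-fsanitize=fuzzer,address), and, per the seed advice
above, the corpus should include at least one valid "FUZZ..."
input so the fuzzer does not have to guess the tag.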