CIS 6614 meeting -*- Outline -*-

* Fuzz Testing

** What is Fuzz Testing

This is a kind of dynamic, randomized testing.

The first publication may be:

  Barton P. Miller, Louis Fredriksen, and Bryan So.
  "An empirical study of the reliability of UNIX utilities".
  CACM, 33(12):32-44, Dec. 1990.
  https://doi.org/10.1145/96267.96279

Q: What is fuzz testing?

------------------------------------------
WHAT IS FUZZING

Def: *fuzz testing* (or black box fuzzing) is

Idea:
                 random input
   [ Fuzzer ] ----------------> [ PUT ]
      |  ^                        /
      |   \ - - - - - - - - - - -/
      |     crashes and timeouts
      v
   inputs that crash the PUT
------------------------------------------
... a dynamic analysis technique that feeds a program
    random inputs and tries to make the program crash

Q: What is the goal of fuzz testing?
   to find bugs/vulnerabilities by provoking crashes

Q: What could cause a program crash?
   security violations that the OS notices
   (e.g., segmentation faults),
   semantic violations in the PL, or assertion errors

Q: What is likely to be the largest cost in terms of time
   in fuzz testing?
   Executing the program under test (the PUT)

------------------------------------------
FUZZING VS. REGRESSION TESTING

           Regression        Fuzzing

Goal

Inputs

------------------------------------------
Q: What is regression testing?
   re-running test cases when a program is changed

Q: What is the goal of regression testing?
   Checking proper functionality

Q: What is the goal of fuzz testing?
   Finding bugs/vulnerabilities

Q: What kind of inputs are used in regression testing?
   Expected, or normal, inputs

Q: What kind of inputs are used in fuzz testing?
   Random, possibly abnormal, inputs
   (such as might be used in an attack)

Q: Can we use fuzz testing during regression testing?
   Yes, but to be repeatable we need a list of test inputs,
   which the fuzzer can save for us...

*** monitoring program execution

------------------------------------------
MONITORING OF PROGRAM

Fuzz tester:

  - maintains list of inputs that cause crashes

------------------------------------------
Q: How can the fuzzer know what inputs cause crashes?
   It watches their execution; the OS tells it what happens.
   If an input leads to a crash, then it records that input.

Q: What if the program goes into an infinite loop?
   There must be a timeout (usually set by the user)

Q: What information can a crash tell the developer/tester?
   the kind of crash and a trace;
   if it's an assertion violation,
   where it occurred and the condition

Q: How could we find more bugs than just running the program?
   use a dynamic memory error detector
   (e.g., valgrind, purify, or address sanitizer)
   advantage: finds more potential problems
   disadvantage: runs slower
   Valgrind also allows custom checkers (tools, formerly
   called skins) that can find other bugs...

*** advantages and disadvantages

------------------------------------------
ADVANTAGES AND DISADVANTAGES OF FUZZING

Advantages:


Disadvantages:

------------------------------------------
Advantages:

Q: How hard would this be to implement?
   not hard at all; it's easy to generate random inputs,
   and the only interesting part is the execution monitor

Q: How much of a programmer's time is involved
   in doing fuzz testing?
   none, it can run automatically (e.g., overnight)

Disadvantages:

Q: Are random inputs going to be effective at finding crashes?
   They seem to be, in fact,
   unless the program validates its inputs.
   Valid inputs may be a small fraction of all possible inputs,
   so a random search might not find them.

Q: How does the fuzzer know what inputs are valid?
   It doesn't.
   Part of the idea is to generate inputs outside what is expected
   (as an attacker might do).
   This can be a major problem.

*** finding inputs that crash the program

------------------------------------------
FUZZ TESTING AND INPUT VALIDATION

Reality:

   [ Fuzzer ]
       |
       |  random
       |  bytes
       v
   [ Driver ] ----------------> [ Program ]
                                    |
                                    | crashes
                                    v
                         inputs that crash program
------------------------------------------
If the program takes bytes as input (as would a Unix command),
then the driver can pass them along;
otherwise it converts the bytes into inputs
that the program will accept
(this is especially needed for unit testing
of procedures/methods)

Q: What kind of programs would take bytes as inputs?
   programs that work on files or network sockets
   Unix commands usually operate on files (of chars),
   and this was the first application,
   in Miller, Fredriksen, and So's 1990 paper
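To make the driver picture concrete, here is a minimal sketch
(not any particular tool) of a black-box fuzzer for a Unix-style
program that reads a file; the target path ./put, the trial count,
and the size limit are made-up choices, and a real fuzzer would
also enforce the timeout discussed above.

  /* minimal_fuzz.c -- black-box fuzzing sketch (illustrative).
     Writes random bytes to a file, runs the hypothetical PUT
     "./put" on it, and saves any input whose run is killed by
     a signal, i.e., crashes.  No timeout handling is shown. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      srandom((unsigned)getpid());
      for (int trial = 0; trial < 10000; trial++) {
          /* 1. generate a random input file */
          FILE *f = fopen("input.bin", "wb");
          if (!f) { perror("fopen"); return 1; }
          long len = random() % 4096;
          for (long i = 0; i < len; i++)
              fputc((int)(random() % 256), f);
          fclose(f);

          /* 2. execute the PUT on that input */
          pid_t pid = fork();
          if (pid < 0) { perror("fork"); return 1; }
          if (pid == 0) {                /* child: run the PUT */
              execl("./put", "put", "input.bin", (char *)NULL);
              _exit(127);                /* exec failed */
          }
          int status;
          waitpid(pid, &status, 0);

          /* 3. monitor: death by signal (e.g., SIGSEGV) is a
                crash, so record the offending input */
          if (WIFSIGNALED(status)) {
              char name[64];
              snprintf(name, sizeof name, "crash-%d.bin", trial);
              rename("input.bin", name);
              printf("trial %d: signal %d, input saved as %s\n",
                     trial, WTERMSIG(status), name);
          }
      }
      return 0;
  }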
** mutation-based fuzzing

Q: Could we add some human expertise to fuzz testing? How?
   Ask a human to give some starting inputs,
   then try to modify them to explore around them
   (as was done in hybrid symbolic execution,
   but less systematically)

*** idea of mutation-based fuzzing

------------------------------------------
MUTATION-BASED FUZZING

1. get seed input(s) from human
2. track list of good inputs
3. a. pick a random good input
   b. mutate it
   c. feed mutant to the program
   d. if it causes a crash,
4. repeat

Example fuzzers:
  ZZUF, FileFuzz, Taof, ProxyFuzz, etc.

Example scenario: fuzzing a PDF viewer
  1. collect PDF files from web
  2. Mutate those files
  3. Record which mutants crash viewer
------------------------------------------
... then add it to the list of good inputs

Q: When does this process stop?
   It doesn't; the user needs to stop it (by a timeout, say)

Q: What kinds of mutations of inputs would be effective?
   - small ones, so that the mutants can still pass validations,
     e.g., bit flips (see the sketch after this section)
   - or one could use a heuristic
     (e.g., changing pointers to null)

Examples:
  ZZUF and FileFuzz work on files;
  Taof and ProxyFuzz are protocol fuzzers

*** advantages and disadvantages

------------------------------------------
MUTATION-BASED FUZZING

Advantages:


Disadvantages:

------------------------------------------
Advantages:

Q: How hard is mutation-based fuzzing to implement?
   Not hard, especially if one already has a monitor

Q: Does mutation-based fuzzing require precise specifications?
   No, only examples of inputs

Disadvantages:

Q: Is there any danger of missing kinds of attacks?
   Yes, the testing is at least limited by the seeds,
   as it assumes that attacks are similar to the seeds

Q: What kinds of inputs would resist fuzz testing with mutation?
   Those that have checksums, certificates,
   or challenge/response protocols
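As a sketch of the small-mutation idea mentioned above, the
following flips a few random bits in a copy of a seed buffer;
the function name and the one-flip-per-64-bytes rate are
illustrative choices, not taken from any particular fuzzer.

  #include <stdlib.h>
  #include <string.h>

  /* Copy the seed and flip a few random bits, so the mutant
     stays close to the seed and can still pass many of the
     program's input validations. */
  void mutate_bitflip(const unsigned char *seed, size_t len,
                      unsigned char *out) {
      if (len == 0) return;
      memcpy(out, seed, len);
      size_t flips = len / 64 + 1;  /* a few flips, scaled to size */
      for (size_t i = 0; i < flips; i++) {
          size_t pos = (size_t)(random() % (long)len);
          out[pos] ^= (unsigned char)(1u << (random() % 8));
      }
  }

Keeping the number of flipped bits small is what lets mutants
slip past magic numbers and simple field checks that purely
random inputs would fail.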
** generation-based fuzzing

Q: How could we fuzz-test programs that parse/validate
   their inputs?
   We could use a specification of valid inputs
   to generate new inputs for fuzz testing...

------------------------------------------
GENERATION-BASED FUZZING

Idea:

1. Use specification of format to generate inputs
2. Mutations added to syntactic spots in inputs
3. Otherwise like mutation-based

------------------------------------------
Q: What does it mean to be like mutation-based fuzzing otherwise?
   Track a list of good inputs (those that cause crashes)
   and base mutations on those

Q: Could we use heuristics in generating inputs?
   Yes, that might help find problems faster
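As a sketch of generating inputs from a format specification,
the toy grammar below (expr ::= num | expr op expr) is
hypothetical; the rare out-of-grammar token stands in for the
"mutations added to syntactic spots" step, so outputs are
usually valid by construction but occasionally probe the code
behind the parser.

  #include <stdio.h>
  #include <stdlib.h>

  /* Generate from the toy grammar
       expr ::= num | expr op expr ,  op ::= '+' | '*'
     with an occasional out-of-grammar token spliced in
     at a syntactic spot. */
  static void gen_expr(FILE *out, int depth) {
      if (depth <= 0 || random() % 2 == 0) {
          if (random() % 16 == 0)
              fputs("@#!", out);       /* rare syntactic mutation */
          else
              fprintf(out, "%ld", random() % 1000);
      } else {
          gen_expr(out, depth - 1);
          fputc(random() % 2 ? '+' : '*', out);
          gen_expr(out, depth - 1);
      }
  }

  int main(void) {
      srandom(42);                  /* fixed seed: repeatable */
      for (int i = 0; i < 5; i++) { /* print a few sample inputs */
          gen_expr(stdout, 4);
          fputc('\n', stdout);
      }
      return 0;
  }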
** comparison

------------------------------------------
ADVANTAGES: MUTATION VS. GENERATION

Mutation-based fuzz testing:


Generation-based fuzz testing:

------------------------------------------
Q: Which is easier to implement?
   mutation-based

Q: Which requires the least amount of expertise to use?
   mutation-based

Q: Which can work for programs that validate input syntax?
   generation-based

Q: Which works for data with checksums?
   generation-based is better

Q: Which is more likely to find problems?
   generation-based, if the program validates input syntax

Q: How should we measure the performance of these?
   time to find bugs, or coverage of code...

Q: Which has better performance?
   generation-based could be faster at finding bugs,
   especially for programs that validate input syntax

** when to stop fuzzing

------------------------------------------
WHEN TO STOP FUZZING?

How do we know we have tested enough?

------------------------------------------
Q: The original and mutation-based fuzz testers run forever,
   so when should we stop them?
   Usually based on resource limits,
   but how do we know that's good enough?

Q: Generation-based fuzzers may only generate a limited number
   of inputs; how do we know if that's enough?
   Code coverage may be an answer, as is standard in testing

*** code coverage

------------------------------------------
CODE COVERAGE

Def: *Code coverage* is


Tools for profiling: gcov, lcov
  (see https://about.codecov.io/tool/lcov/)

Also for LLVM see:
  clang.llvm.org/docs/SourceBasedCodeCoverage.html
------------------------------------------
... a measure of how much of a program's code
    is executed by a testing process

gcov is the GNU profiling tool; it gives statement coverage
lcov is a graphical front end to gcov

There are also tools for Java, e.g., JaCoCo

Q: Does code coverage actually tell us about a testing process's
   ability to find bugs?

**** line coverage

------------------------------------------
LINE COVERAGE

Measures: Percentage of lines of code executed (or statements)

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% line coverage?

Variation: block coverage is

------------------------------------------
... 1, e.g., (3,4), which takes both true branches
    and so executes every line

Q: Do we care about the lines with only a } in them?
   no

... percentage of basic blocks executed by a testing process

**** branch coverage

------------------------------------------
BRANCH COVERAGE

Also called: decision coverage

Measures: Percentage of branches of code executed

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% branch coverage?

------------------------------------------
... 2, to take both the true and false branches
    of both if-statements,
    e.g., (0,0), which takes both false branches,
    and (3,3), which takes both true branches

**** path coverage

------------------------------------------
PATH COVERAGE

Measures: Percentage of paths in code executed

Example:

  void check(int a, int b) {
    if (a > 2) {
      x = 1;
    }
    if (b > 2) {
      y = 2;
    }
  }

How many pairs (a,b) are needed for 100% path coverage?

------------------------------------------
... 4, one for each combination of branch outcomes:
    true and true, true and false,
    false and true, and false and false,
    e.g., (3,3), (3,0), (0,3), and (0,0)

Q: How many paths can a program have?
   - exponential in the number of if-tests
   - but infinite for programs with loops!

**** comparison (omit)

------------------------------------------
COMPARING TYPES OF COVERAGE

Which type is most desirable?

Does 100% coverage guarantee finding bugs?

------------------------------------------
... testers typically want path coverage,
    but it is impossible to get 100% (due to loops)

... no, the example below of strcpy shows that isn't true,
    and in practice there might be bugs hiding
    in paths far away...

The examples above show that X percent of one measure
doesn't necessarily imply X percent of another
(although 100% path coverage implies 100% of the others);
e.g., if path coverage is 25%, branch coverage may be 50%,
and line coverage might be any of a number of different values

*** evaluation

**** benefits of code coverage

------------------------------------------
BENEFITS OF CODE COVERAGE

------------------------------------------
Q: What benefits could one get from code coverage?
   - seeing if testing never executes some code
   - comparing fuzzers
   - helping decide when to stop fuzzing

**** problems with code coverage

Conventional wisdom in SE:
  more coverage implies finding more bugs
  (affirmed by the Cai & Lyu reference below)

------------------------------------------
PROBLEMS OF CODE COVERAGE

Consider:

  void myStrCpy(char *dst, char *src) {
    if (dst && src) {
      strcpy(dst, src);
    }
  }

------------------------------------------
Q: Is there any vulnerability in that code?
   yes, there could be a buffer overflow...
   the lengths are not checked
   (dst may be shorter than src)

Q: Will code coverage help us find that vulnerability?
   no, it is easy to get 100% path coverage in this function
   without triggering the overflow

*** how good is code coverage?

------------------------------------------
HOW GOOD A MEASURE IS CODE COVERAGE?

See:
  Xia Cai and Michael R. Lyu.
  "The effect of code coverage on fault detection
  under different testing profiles".
  In A-MOST '05: Proceedings of the 1st International Workshop
  on Advances in Model-Based Testing, pp. 1-7, May 2005.
  https://doi.org/10.1145/1083274.1083288

"code coverage is simply a moderate indicator for
 the capability of fault detection on the whole test set."

[code coverage is] "clearly a good estimator for the fault
 detection of exceptional test cases, but a poor one for
 test cases in normal operations."

"the correlation between code coverage and fault coverage
 is higher in functional test than in random testing"
------------------------------------------
The test profiles considered are:
 - functional testing (based on a specification), and
 - random testing

The lack of correlation between random testing and fault coverage
may be explained by functional testing being designed
to cover more of the code; from section 3.2:

  "In general, functional test cases are designed to increase
  their code coverage (i.e., to cover more code fragments),
  while random test cases are generated to simulate real
  operational environment and not likely to improve code
  coverage."

** coverage-guided gray-box fuzzing

------------------------------------------
COVERAGE-GUIDED GRAY-BOX FUZZING

Idea:

1. Mutants of good inputs are good if


Algorithm:

1. Good inputs start from seed input(s)
   that don't crash or loop forever
2. Mutate random good input
3. Execute it and monitor execution and its coverage
4. If program crashes or
   then save input as a new good input

Example: AFL
------------------------------------------
... they increase code coverage

... increases code coverage,

*** AFL

An implementation of this idea for x86 programs,
from Google, is AFL

------------------------------------------
AMERICAN FUZZY LOP (AFL)

Instruments code at compile-time
(can instrument assembly code or LLVM)

Tracks all edges in CFG (with counters)

 - Hashtable tracks log_2 of executions with 8 bits per edge

Target program runs as separate process
------------------------------------------
Note: the execution counter maxes out at its top bucket, 128+

Q: What's the advantage of using log base 2 of executions?
   efficient in terms of space and time

Q: What's the disadvantage of using log base 2 of executions?
   imprecise
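The edge-counting instrumentation can be sketched as follows;
this is a simplification of AFL's documented design, with the
map size and names chosen for illustration.

  #include <stdint.h>

  #define MAP_SIZE (1 << 16)
  static uint8_t edge_map[MAP_SIZE];  /* shared with the fuzzer */
  static uint16_t prev_loc;           /* id of previous block */

  /* Inserted at the start of every basic block; cur_loc is a
     compile-time random constant identifying that block.  One
     8-bit counter per (prev block, current block) edge. */
  static inline void log_edge(uint16_t cur_loc) {
      edge_map[cur_loc ^ prev_loc]++; /* counter may wrap */
      prev_loc = cur_loc >> 1;        /* so A->B differs from B->A */
  }

After each run the counters are coarsened into power-of-two
buckets (1, 2, 3, 4-7, ..., 32-127, 128+), which is the log_2
counting and the 128+ maximum noted above; a mutant that hits a
new edge, or a new bucket for a known edge, is saved as a new
good input.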
*** libfuzzer

Q: Would it help to combine the ideas of fuzzing
   and concolic testing?
   Yes, then we could more systematically explore paths

Q: How would we do that?
   track the effects of branch decisions on the path condition,
   then try to generate (with an SMT solver) inputs
   that explore different paths,
   which would (potentially) increase coverage

------------------------------------------
LIBFUZZER

libFuzzer, "a library for coverage-guided fuzz testing"
in LLVM -- llvm.org/docs/LibFuzzer.html

See also tutorial.libfuzzer.info

Uses LLVM's SanitizerCoverage for instrumentation

e.g., to build also with address sanitizer:

  clang -g -O1 -fsanitize=fuzzer,address \
        target.c

Can be used with AFL
------------------------------------------

*** go-fuzz

also guided by dataflow...

------------------------------------------
GO-FUZZ

For the Go programming language

Built into their tool chain

See https://go.dev/security/fuzz/
------------------------------------------

** challenges in fuzz testing

------------------------------------------
CHALLENGES IN FUZZ TESTING

Seeds need to be crafted well:
 - must not crash the program
 - must not cause loops
 - should cover different branches

Some branches make code hard to cover...
------------------------------------------

*** coverage challenges

------------------------------------------
DIFFICULT TO FUZZ BRANCHES

  void test1(int n) {
    if (n == 0x12345678) {
      abort();
    }
  }
------------------------------------------
Q: How hard would it be to get to the abort in test1?
   needs 2^32 attempts (about 4 billion) in the worst case,
   since only one of the 2^32 values of n takes the branch

------------------------------------------
EASIER TO FUZZ TEST

  void test2(int n) {
    int dummy = 0;
    char *p = (char *)&n;
    if (p[0] == 0x12) dummy++;
    if (p[1] == 0x34) dummy++;
    if (p[2] == 0x56) dummy++;
    if (p[3] == 0x78) dummy++;
    if (dummy == 4) {
      abort();
    }
  }
------------------------------------------
Q: How many inputs are needed in the worst case for test2?
   only about 2^10 (about 1024) in the worst case:
   a coverage-guided fuzzer gets feedback each time one of the
   byte comparisons is newly passed, so it can solve the four
   bytes one at a time (about 4 * 2^8 attempts)
   instead of all at once

** rules of thumb

------------------------------------------
WISDOM FROM FUZZ TESTERS

- Knowing the input format helps

- Generational testing is better than random,
  but needs good specifications

- Implementations vary:
  different fuzzers ==> different bugs

- Running longer will find more bugs
  (but reaches a limit)

- Coverage guidance works best

- Use profiling to see where the fuzzer is getting stuck,
  then design better seeds to get past those spots...
------------------------------------------
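To connect the libFuzzer slide above with these rules of thumb,
here is a sketch of a fuzz target; the entry point
LLVMFuzzerTestOneInput is libFuzzer's actual interface, but
parse_header is a made-up stand-in for real code under test.

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical code under test: reads a 4-byte tag and a
     1-byte length; a real parser might omit the final bound
     check, which fuzzing with ASan would catch. */
  static int parse_header(const uint8_t *data, size_t size) {
      if (size < 5) return -1;
      if (data[0] != 'F' || data[1] != 'U' ||
          data[2] != 'Z' || data[3] != 'Z') return -1;
      size_t len = data[4];
      return (size >= 5 + len) ? 0 : -1;
  }

  /* libFuzzer calls this once per generated input. */
  int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
      parse_header(data, size);
      return 0;  /* non-zero values are reserved by libFuzzer */
  }

This would be built with the clang command from the libFuzzer
slide (-fsanitize=fuzzer,address), and, per the seed advice
above, the corpus should include at least one valid "FUZZ..."
input so the fuzzer does not have to guess the tag.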