# Fuzzing Evaluation Guidelines


Current version: 1.0.3

Proposals for changes are welcome (please open an issue for discussion or a pull request for changes).

DISCLAIMER: These items represent a best-effort attempt at capturing action items to follow during the evaluation of a scientific paper that focuses on fuzzing. **They do not apply universally to all fuzzing methods; in certain scenarios, techniques may deviate from these guidelines for good reason. In any case, a case-by-case judgment is necessary.**
The guidelines do not discuss malicious choices that immediately negate any chance of a fair evaluation, such as giving your fuzzer an unfair advantage (e.g., by fine-tuning the fuzzer or its targets) or putting other fuzzers at a disadvantage.


A. Preparation for Evaluation
1. Find relevant tools and baselines to compare against
   - 1.1 Include state-of-the-art techniques from both academia and industry
   - 1.2 If your fuzzer is based on an existing fuzzer, include this baseline (to measure the delta of your changes, which allows attributing improvements to your technique)
   - 1.3 Use recent versions of fuzzers
   - 1.4 If applicable, derive a baseline variant of your technique that replaces core contributions with alternatives. For example, consider a variant that replaces an informed algorithm with randomness (a minimal sketch of such an ablation is shown below).
   - 1.5 If using AFL-style fuzzers, use afl-clang-fast or afl-clang-lto rather than afl-gcc

2. Identify suitable targets for the evaluation
   - 2.1 If applicable, consider using evaluation benchmarks, such as FuzzBench (this allows testing many fuzzers under standardized conditions)
   - 2.2 Select a representative set of programs from the target domain
   - 2.3 Include targets used by related work (for comparability)
   - 2.4 Do not cherry-pick targets based on preliminary results
   - 2.5 Do not pick multiple targets that share a considerable amount of code (e.g., two wrappers for the same library)
   - 2.6 Do not use artificial programs or programs with artificially injected bugs
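
The following is a minimal, hypothetical sketch of such an ablation (item 1.4) in Python. The function `select_seed` and its score input are purely illustrative and do not correspond to any particular fuzzer; the point is only that the informed decision is swapped for a uniformly random one while everything else stays fixed, so that measured differences can be attributed to the informed component.

```python
import random

def select_seed(queue, scores, ablation=False):
    """Pick the next seed to mutate (illustrative only).

    Informed variant: pick the seed with the highest score.
    Ablation variant: ignore the scores and pick uniformly at random,
    keeping the rest of the fuzzer unchanged.
    """
    if ablation:
        return random.choice(queue)
    return max(queue, key=lambda seed: scores[seed])
```
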
3. Derive suitable experiments to evaluate your approach
   - 3.1 Evaluate on found bugs (if applicable)
     - 3.1.1 If using *new* bugs,
       - 3.1.1.1 report whether other fuzzers find the bug as well (so you can attribute finding this bug to your technique rather than to being the first to fuzz this target); the other fuzzers must have had the same computing resources
       - 3.1.1.2 deduplicate crashing inputs to derive the true bug count
         - 3.1.1.2.1 If possible, use vendor confirmation to identify true bugs
         - 3.1.1.2.2 Otherwise, use manual triaging (consider automated deduplication as a pre-step to reduce the number of findings; a minimal sketch follows at the end of this section)
       - 3.1.1.3 do not fuzz unsuitable programs for the sake of finding bugs (e.g., small hobby projects that are no longer maintained are not suitable)
       - 3.1.1.4 do not search for bugs in unstable, fast-moving development branches, but prefer stable/release versions
     - 3.1.2 If using *known* bugs,
       - 3.1.2.1 use the known bugs as ground truth
       - 3.1.2.2 take into account that known bugs may not have been deduplicated
       - 3.1.2.3 do not evaluate on artificial bugs
     - 3.1.3 Do not use the number of (unique) crashing inputs as the bug count
   - 3.2 Evaluate code coverage over time (if applicable)
     - 3.2.1 If possible, use source code-based coverage (e.g., llvm-cov or lcov)
     - 3.2.2 Otherwise, use a collision-free encoding
     - 3.2.3 Measure coverage on a neutral binary; this binary should include only the instrumentation needed to measure coverage, but no sanitizers or fuzzer-specific instrumentation
     - 3.2.4 If using dynamic binary translation, the coverage measurement should be independent of the translation (e.g., emulators may split a basic block into multiple translation blocks, disturbing measurements)
   - 3.3 If applicable, evaluate domain-specific aspects of your fuzzer
   - 3.4 If applicable, conduct ablation studies to measure individual design choices
   - 3.5 If applicable, evaluate the influence of hyperparameters on your design
   - 3.6 If doing experiments using custom metrics,
     - 3.6.1 take special care to ensure a fair comparison to existing work
     - 3.6.2 in particular, avoid queue survivor bias (i.e., the queue only contains inputs fulfilling specific criteria), as it may favor your fuzzer. For example, your fuzzer, which optimizes towards the new custom metric, may keep inputs in the queue that other fuzzers discard (even though they find the input at runtime); evaluating only inputs in the queue thus gives your fuzzer an unfair advantage
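
As a hedged illustration of automated deduplication as a pre-step to manual triaging (item 3.1.1.2.2), the following Python sketch buckets crashing inputs by a hash over the top stack frames extracted from their sanitizer reports. The report format assumed by the regular expression (ASAN-style `#N 0x... in <function> ...` frames) and the frame depth of 3 are simplifying assumptions; real triaging pipelines need more robust parsing, and the resulting buckets still require manual confirmation.

```python
import hashlib
import re
from collections import defaultdict

# Matches ASAN-style backtrace lines such as:
#   "#0 0x4f1b2d in parse_header /src/parser.c:42:7"
FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)", re.MULTILINE)

def bucket_key(report: str, depth: int = 3) -> str:
    """Hash the top `depth` function names of a crash backtrace."""
    frames = FRAME_RE.findall(report)
    return hashlib.sha256("|".join(frames[:depth]).encode()).hexdigest()

def deduplicate(reports: dict) -> dict:
    """Group crashing inputs (input name -> sanitizer report) into buckets
    that likely correspond to the same underlying bug."""
    buckets = defaultdict(list)
    for name, report in reports.items():
        buckets[bucket_key(report)].append(name)
    return dict(buckets)
```
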
B. Documenting the Evaluation
1. Describe the setup, including
   - 1.1 the hardware used (such as CPU and RAM)
   - 1.2 how many cores were available to each fuzzing campaign (e.g., via CPU pinning)
   - 1.3 the technologies used, such as Docker or virtualization
2. Choose and document experiment parameters, including
   - 2.1 a sufficiently long runtime (if possible, >= 24h)
   - 2.2 a sufficient number of repetitions/trials to account for randomness and enable a robust statistical evaluation (if possible, >= 10 trials)
   - 2.3 fairness of computing resource allocation, i.e., all fuzzers have access to the same amount of computing resources; this requires particular consideration if a tool requires pre-computation
   - 2.4 suitable seeds:
     - 2.4.1 If possible, use uninformed seeds for coverage evaluation (for bug experiments, informed seeds may be beneficial)
     - 2.4.2 Otherwise, identify the coverage achieved by the initial seed set
     - 2.4.3 Provide all fuzzers with the same set of seeds
     - 2.4.4 Publish the set of seeds used
   - 2.5 targets:
     - 2.5.1 Use recent versions
     - 2.5.2 If applicable, explain modifications to the programs or the runtime environment (e.g., when you patch the program or set a lower stack size)
   - 2.6 other tools/fuzzers:
     - 2.6.1 Use recent versions
     - 2.6.2 If your fuzzer is based on another one, make sure the version you base your tool on and the one used in the evaluation are the same


C. Experiment Postprocessing
1. Data Analysis
   - 1.1 Run a robust statistical evaluation to measure significance, such as the Mann-Whitney U test or bootstrap-based methods
   - 1.2 Measure effect size, for example using the Vargha and Delaney Â12 statistic (a minimal sketch of both steps appears at the end of this document)
2. Data Visualization
   - 2.1 If applicable, plot absolute values (such as coverage over time)
   - 2.2 Show uncertainty, for example using the standard deviation or confidence intervals in plots
3. Bug Handling
   - 3.1 Deduplicate and triage crashing inputs
   - 3.2 Report new bugs
     - 3.2.1 Follow responsible disclosure guidelines
     - 3.2.2 If possible, minimize samples before reporting
     - 3.2.3 If possible, attach all available information, such as the precise environment (OS, compilation flags, command line arguments, ...), ASAN reports, and the (minimized) crashing input
     - 3.2.4 Consider reporting the bug under an anonymous identity and linking to it in the paper during submission, so that reviewers can assess the bug and its impact themselves
   - 3.3 CVEs
     - 3.3.1 CVEs should be requested by the maintainers
     - 3.3.2 If the maintainers do not request a CVE, link to the bug tracker instead of requesting a CVE yourself


D. Artifact Release
1. Artifact Contents
   - 1.1 Publish your code on a platform ensuring long-term availability, such as Zenodo or GitHub
   - 1.2 Publish modifications of other tools
     - 1.2.1 If you modified other tools, publish these modifications
     - 1.2.2 Publish your integration of other tools
   - 1.3 If possible, publish your experiment data
2. Artifact Documentation
   - 2.1 Document how to build your fuzzer
   - 2.2 Document how to interact with your fuzzer
   - 2.3 Document the source code
   - 2.4 Document modifications/extensions to other tools and their integration
   - 2.5 Document how to run and reproduce the experiments described in the paper
3. Artifact Reusability
   - 3.1 Specify the versions of all tools used
   - 3.2 If possible, enable execution of your fuzzer independent of the underlying system, e.g., through virtualization or container engines
   - 3.3 Avoid external dependencies that may be unavailable in the future, such as tarball downloads via HTTPS
   - 3.4 Pin the versions of dependencies
   - 3.5 If applicable, maintain the commit history of underlying tools instead of squashing it
   - 3.6 Double-check that your code is complete and reusable
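
The following is a minimal sketch of the statistical evaluation described in C.1, assuming SciPy is available. The arrays `coverage_a` and `coverage_b` are hypothetical final coverage values from repeated trials of two fuzzers and only stand in for real measurements; the Â12 implementation follows the standard definition (probability that a random observation from the first group exceeds one from the second).

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a12(x, y):
    """Vargha and Delaney's A12: probability that a random observation
    from x is larger than a random observation from y (0.5 = no effect)."""
    greater = sum(1 for a in x for b in y if a > b)
    equal = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * equal) / (len(x) * len(y))

# Hypothetical final branch coverage of 10 trials per fuzzer (placeholder data).
coverage_a = [1510, 1498, 1525, 1530, 1507, 1512, 1519, 1503, 1528, 1516]
coverage_b = [1460, 1482, 1475, 1470, 1491, 1465, 1488, 1473, 1479, 1484]

u_stat, p_value = mannwhitneyu(coverage_a, coverage_b, alternative="two-sided")
a12 = vargha_delaney_a12(coverage_a, coverage_b)
print(f"U = {u_stat}, p = {p_value:.4f}, A12 = {a12:.2f}")
```

Commonly cited thresholds interpret Â12 values of roughly 0.56, 0.64, and 0.71 as small, medium, and large effects, respectively.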