.
675 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | 🤖 Automated Fuzzing Corpus Generation 📁
6 |
7 |
8 |
9 | 
10 | 
11 | 
12 | 
13 |
14 |
15 |
16 |
17 | AutoCorpus is a tool backed by a large language model (LLM) for automatically generating corpus files for fuzzing.
18 |
19 | AutoCorpus works best when generating corpus files that are based on natural language, such as JSON, XML, or other config files.
20 |
21 | # ⚙️ Setup
22 |
23 | ## System Requirements
24 | AutoCorpus utilizes the Mistral-7B-Instruct-v0.2 model and, where possible, offloads processing to your system's GPU. It is recommended to run AutoCorpus on a machine with a minimum of 16GB of RAM and a dedicated Nvidia GPU with at least 4GB of memory. However, it can run on lower-spec machines, albeit at a significantly slower pace.
25 |
26 | **AutoCorpus has been tested on Windows 11; however, it should be compatible with Unix and other systems.**
27 |
28 | ## Dependencies
29 | AutoCorpus requires **Nvidia CUDA** for enhanced LLM performance. Follow the steps below:
30 | - Ensure your Nvidia drivers are up to date: [Nvidia Drivers](https://www.nvidia.com/en-us/geforce/drivers/)
31 | - Install the appropriate dependencies from [here](https://pytorch.org/get-started/locally/)
32 | - Validate CUDA installation by running the following command and receiving a prompt: ```python -c "import torch; print(torch.rand(2,3).cuda())"```
33 |
34 | Python dependencies can be found in the `requirements.txt` file:
35 |
36 | ```
37 | pip install -r requirements.txt
38 | ```
39 |
40 | AutoCorpus can then be installed using the ```./setup.py``` script as below:
41 |
42 | ```
43 | python -m pip install .
44 | ```
45 |
46 | ## Running
47 |
48 | AutoCorpus can generate corpus files via three different scenarios:
49 |
50 | ### A Single Prompt
51 | For example asking AutoCorpus to generate an XML file would be as follows:
52 | ```
53 | AutoCorpus.exe -o "out" -p "xml file"
54 | ```
55 | ### Existing Corpus File(s)
56 | AutoCorpus can base new corpus files off existing ones.
57 | ```
58 | AutoCorpus.exe -i "input_folder" -o "out"
59 | ```
60 | ### Both Existing Corpus Files And a Prompt.
61 | Generation can be run by using both an existing corpus and a prompt.
62 | ```
63 | AutoCorpus.exe -i "input_folder" -o "out" -p "xml file"
64 | ```
65 |
66 | ### Usage
67 | ```
68 | usage: AutoCorpus [-h] [--input_folder INPUT_FOLDER] [--output_folder OUTPUT_FOLDER] [--number_of_corpus_files NUMBER_OF_CORPUS_FILES] [--prompt PROMPT]
69 | [--size SIZE] [--verbose]
70 |
71 | A tool for automatically generating initial fuzzing input corpus test cases
72 |
73 | optional arguments:
74 | -h, --help show this help message and exit
75 | --input_folder INPUT_FOLDER, -i INPUT_FOLDER
76 | The input folder to base generated corpus files off. If no prompt is given, the folder needs at least 1 file.
77 | --output_folder OUTPUT_FOLDER, -o OUTPUT_FOLDER
78 | The folder to save generated corpus files to (will default to input folder).
79 | --number_of_corpus_files NUMBER_OF_CORPUS_FILES, -n NUMBER_OF_CORPUS_FILES
80 | The number of corpus files to generate
81 | --prompt PROMPT, -p PROMPT
82 | A sentence defining what the corpus files are for. This helps steer generation.
83 | --size SIZE, -s SIZE Max size of tokens created by the LLM
84 | --verbose, -v Provides verbose outputs
85 | ```
86 |
87 | ### Examples
88 |
89 | #### JSON Corpus Generation
90 | Generates 5 corpus files solely on the prompt ```complex json files with varying data```.
91 | ```
92 | AutoCorpus.exe -o "out" -p "complex json files with varying data"
93 | ```
94 |
95 | ```
96 | [{"id": 1, "name": "John Doe", "age": 30, "city": "New York"},
97 |
98 | {"id": 2, "name": "Jane Smith", "age": 28, "city": "Los Angeles"},
99 |
100 | {"id": 3, "name": "Mike Johnson", "age": 35, "city": "Chicago"},
101 |
102 | {"id": 4, "name": "Emma Watson", "age": 27, "city": "London"}]
103 | ```
104 |
105 | #### AWK Config Corpus Generation
106 |
107 | Creates an AWK config based on existing example awk configs in the ``` ..\corpus\awk\``` directory along with the prompt ```config file for busybox awk```.
108 | ```
109 | AutoCorpus.exe -i ..\corpus\awk\ -p "config file for busybox awk" -n 10 -s 700
110 | ```
111 | ```
112 | ```bash
113 | #!/usr/bin/awk -f
114 |
115 | BEGIN {
116 | FS="\t"
117 | if (ARGC != 3) {
118 | print "Usage: awk-script.awk "
119 | exit 1
120 | }
121 | print "Input file:", ARGV[1]
122 | print "Field to print:", ARGC[2]
123 | print "Delimiter:", ARGC[3]
124 |
125 | FILENAME = ARGV[1]
126 | if (open(FILENAME, "r")) {
127 | while ((getline line < FILENAME) > 0) {
128 | gsub(/[[:space:]]+/, "", line) # remove whitespaces
129 | split(line, fields, FS)
130 | for (i = 1; i <= NF; i++) {
131 | if (length(fields[i]) > 0 && fields[i] ~ /\Q"[\"']"ARGV[2]"\Q/ && i != ARGC[3]) {
132 | next
133 | }
134 | if (i == ARGC[3] || i == NF) {
135 | print fields[i]
136 | break
137 | }
138 | }
139 | }
140 | close(FILENAME)
141 | } else {
142 | print "Error opening file:", FILENAME
143 | exit 1
144 | }
145 | }```
146 | ```
147 |
148 | # 🤖 Mistral-7B-Instruct-v0.2
149 | Behind the scenes AutoCorpus uses the ```Mistral-7B-Instruct-v0.2``` model from The Mistral AI Team - see [here](https://arxiv.org/abs/2310.06825). The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.2. More can be found on the model [here!](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
150 | - 7.24B params
151 | - Tensor type: BF16
152 | - 32k context window (vs 8k context in v0.1)
153 | - Rope-theta = 1e6
154 | - No Sliding-Window Attention
155 |
156 | # 🙏 Contributions
157 | AutoCorpus is an open-source project and welcomes contributions from the community. If you would like to contribute to
158 | AutoCorpus, please follow these guidelines:
159 |
160 | - Fork the repository to your own GitHub account.
161 | - Create a new branch with a descriptive name for your contribution.
162 | - Make your changes and test them thoroughly.
163 | - Submit a pull request to the main repository, including a detailed description of your changes and any relevant documentation.
164 | - Wait for feedback from the maintainers and address any comments or suggestions (if any).
165 | - Once your changes have been reviewed and approved, they will be merged into the main repository.
166 |
167 | # ⚖️ Code of Conduct
168 | AutoCorpus follows the Contributor Covenant Code of Conduct. Please make sure to review and adhere to this code of conduct when contributing to AutoCorpus.
169 |
170 | # 🐛 Bug Reports and Feature Requests
171 | If you encounter a bug or have a suggestion for a new feature, please open an issue in the GitHub repository. Please provide as much detail as possible, including steps to reproduce the issue or a clear description of the proposed feature. Your feedback is valuable and will help improve AutoCorpus for everyone.
172 |
173 | # 📜 License
174 |
175 | [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/)
176 |
--------------------------------------------------------------------------------
/logo.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/user1342/AutoCorpus/0b8e0b9883d542326a5ef9024a22d3c3151fc1a9/logo.gif
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | argparse
2 | rich
3 | setuptools
4 | huggingface_hub
5 | numpy
6 | accelerate
7 | pyyaml
8 | torchvision -index-url https://download.pytorch.org/whl/cu118
9 | torchaudio -index-url https://download.pytorch.org/whl/cu118
10 | bitsandbytes>=0.39.0
11 | accelerate>=0.16.0,<1
12 | transformers[torch]>=4.28.1
13 | torch>=1.13.1
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 |
3 | setup(
4 | name='AutoCorpus',
5 | version='0.1',
6 | install_requires=[
7 | 'argparse',
8 | 'rich',
9 | 'setuptools',
10 | 'huggingface_hub',
11 | 'numpy',
12 | 'pyyaml',
13 | 'torchvision',
14 | 'torchaudio',
15 | 'bitsandbytes',
16 | 'accelerate',
17 | 'transformers',
18 | 'torch'
19 | ],
20 | entry_points={
21 | 'console_scripts': [
22 | 'AutoCorpus = AutoCorpus.auto_corpus:run'
23 | ]
24 | }
25 | )
--------------------------------------------------------------------------------