├── .gitignore ├── 3rdparty └── strings.xml ├── LICENSE ├── README.md ├── prepare-release.sh ├── requirements.txt ├── screens ├── output-rule-0.14.1.png └── yargen-running.png ├── tools └── byte-mapper.py └── yarGen.py /.gitignore: -------------------------------------------------------------------------------- 1 | # IDE & Virtual Environment 2 | .idea/ 3 | venv/ 4 | 5 | # MacOS 6 | .DS_Store 7 | .AppleDouble 8 | .LSOverride 9 | 10 | # Thumbnails 11 | ._* 12 | 13 | # Prebuilt Database 14 | dbs/ 15 | 16 | # YARA Rules 17 | yargen_rules.yar 18 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | yarGen - Yara Rule Generator, Copyright (c) 2015, Florian Roth 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | * Redistributions of source code must retain the above copyright 7 | notice, this list of conditions and the following disclaimer. 8 | * Redistributions in binary form must reproduce the above copyright 9 | notice, this list of conditions and the following disclaimer in the 10 | documentation and/or other materials provided with the distribution. 11 | * Neither the name of the copyright owner nor the 12 | names of its contributors may be used to endorse or promote products 13 | derived from this software without specific prior written permission. 14 | 15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND 16 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED 17 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 18 | DISCLAIMED. IN NO EVENT SHALL Florian Roth BE LIABLE FOR ANY 19 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES 20 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 21 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND 22 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 23 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | [![Actively Maintained](https://img.shields.io/badge/Maintenance%20Level-Actively%20Maintained-green.svg)](https://gist.github.com/cheerfulstoic/d107229326a01ff0f333a1d3476e068d) 2 | 3 | # yarGen 4 | _____ 5 | __ _____ _____/ ___/__ ___ 6 | / // / _ `/ __/ (_ / -_) _ \ 7 | \_, /\_,_/_/ \___/\__/_//_/ 8 | /___/ Yara Rule Generator 9 | Florian Roth, July 2020, Version 0.23.2 10 | 11 | Note: Rules have to be post-processed 12 | See this post for details: https://medium.com/@cyb3rops/121d29322282 13 | 14 | ### What does yarGen do? 15 | 16 | yarGen is a generator for [YARA](https://github.com/plusvic/yara/) rules 17 | 18 | The main principle is the creation of yara rules from strings found in malware files while removing all strings that also appear in goodware files. Therefore yarGen includes a big goodware strings and opcode database as ZIP archives that have to be extracted before the first use. 19 | 20 | In version 0.24.0, yarGen introduces an output option (`--ai`). This feature generates a YARA rule with an expanded set of strings and includes instructions tailored for an AI. I suggest employing ChatGPT Plus with model 4 to refine these rules. Activating the `--ai` flag appends the instruction text to the `yargen_rules.yar` output file, which can subsequently be fed into your AI for processing. 21 | 22 | With version 0.23.0 yarGen has been ported to Python3. If you'd like to use a version using Python 2, try a previous release. (Note that the download location for the pre-built databases has changed, since the database format has been changed from the outdated `pickle` to `json`. The old databases are still available but in an old location on our web server only used in the old yarGen version <0.23) 23 | 24 | Since version 0.12.0 yarGen does not completely remove the goodware strings from the analysis process but includes them with a very low score depending on the number of occurrences in goodware samples. The rules will be included if no 25 | better strings can be found and marked with a comment /* Goodware rule */. 26 | Force yarGen to remove all goodware strings with --excludegood. Also since version 0.12.0 yarGen allows to place the "strings.xml" from [PEstudio](https://winitor.com/) in the program directory in order to apply the blacklist definition during the string analysis process. You'll get better results. 27 | 28 | Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet Yucesoy in order to classify the string and detect useful words instead of compression/encryption garbage. 29 | 30 | Since version 0.15.0 yarGen supports opcode elements extracted from the `.text` sections of PE files. During database creation it splits the `.text` sections with the regex [\x00]{3,} and takes the first 16 bytes of each part 31 | to build an opcode database from goodware PE files. During rule creation on sample files it compares the goodware opcodes with the opcodes extracted from the malware samples and removes all opcodes that also appear in the goodware 32 | database. (there is no further magic in it yet - no XOR loop detection etc.) The option to activate opcode integration is '--opcodes'. 33 | 34 | Since version 0.17.0 yarGen allows creating multiple databases for opcodes and strings. You can now easily create a new database by using "-c" and an identifier "-i identifier" e.g. "office". It will then create two new 35 | database files named "good-strings-office.db" and "good-opcodes-office.db" that will be initialized during startup with the built-in databases. 36 | 37 | Since version 0.18.0 yarGen supports extra conditions that make use of the `pe` module. This includes [imphash](https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html) values and the PE file's exports. We provide pre-generated imphash and export databases. 38 | 39 | Since version 0.19.0 yarGen support a 'dropzone' mode in which it initializes all strings/opcodes/imphashes/exports only once and queries a given folder for new samples. If it finds new samples dropped to the folder, it creates rules for these samples, writes the YARA rules to the defined output file (default: yargen_rules.yar) and removes the dropped samples. You can specify a text file (`-b`) from which the identifier is read. The reference parameter (`-r`) has also been extended so that it can be a text file on disk from which the reference is read. E.g. drop two files named 'identifier.txt' and 'reference.txt' together with the samples to the folder and use the parameters `-b ./dropzone/identifier.txt` and `-r ./dropzone/reference.txt` to read the respective strings from the files each time an analysis starts. 40 | 41 | Since version 0.20.0 yarGen supports the extraction and use of hex encoded strings that often appear in weaponized RTF files. 42 | 43 | The rule generation process also tries to identify similarities between the files that get analyzed and then combines the strings to so called **super rules**. The super rule generation does not remove the simple rule for the files that have been combined in a single super rule. This means that there is some redundancy when super rules are created. You can suppress a simple rule for a file that was already covered by super rule by using --nosimple. 44 | 45 | ### Installation 46 | 47 | 1. Make sure you have at least 4GB of RAM on the machine you plan to use yarGen (8GB if opcodes are included in rule generation, use with --opcodes) 48 | 2. Install all dependencies with `pip install -r requirements.txt` (or `pip3 install -r requirements.txt`) 49 | 3. Run `python yarGen.py --update` to automatically download the built-in databases. The are saved into the './dbs' sub folder. (Download: 913 MB) 50 | 4. See help with `python yarGen.py --help` for more information on the command line parameters 51 | 52 | ### Memory Requirements 53 | 54 | Warning: yarGen pulls the whole `goodstring` database to memory and uses at least 3 GB of memory for a few seconds - 6 GB if opcodes evaluation is activated (--opcodes). 55 | 56 | I've already tried to migrate the database to sqlite but the numerous string comparisons and lookups made the analysis painfully slow. 57 | 58 | # Post-Processing Video Tutorial 59 | 60 | [![YARA rule post-processing video tutorial](https://img.youtube.com/vi/y8oAjIjZMIg/0.jpg)](https://medium.com/@cyb3rops/how-to-post-process-yara-rules-generated-by-yargen-121d29322282) 61 | 62 | # Multiple Database Support 63 | 64 | yarGen allows creating multiple databases for opcodes or strings. You can easily create a new database by using "-c" for new database creation and "-i identifier" to give the new database a unique identifier as e.g. "office". It will the create two new database files named "good-strings-office.db" and "good-opcodes-office.db" that will from then on be initialized during startup with the built-in databases. 65 | 66 | ### Database Creation / Update Example 67 | 68 | Create a new strings and opcodes database from an Office 2013 program directory: 69 | ``` 70 | yarGen.py -c --opcodes -i office -g /opt/packs/office2013 71 | ``` 72 | The analysis and string extraction process will create the following new databases in the "./dbs" sub folder. 73 | ``` 74 | good-strings-office.db 75 | good-opcodes-office.db 76 | ``` 77 | The values from these new databases will be automatically applied during the rule creation process because all *.db files in the sub folder "./dbs" will be initialized during startup. 78 | 79 | You can update the once created databases with the "-u" parameter 80 | ``` 81 | yarGen.py -u --opcodes -i office -g /opt/packs/office365 82 | ``` 83 | This would update the "office" databases with new strings extracted from files in the given directory. 84 | 85 | ## Command Line Parameters 86 | 87 | ``` 88 | usage: yarGen.py [-h] [-m M] [-y min-size] [-z min-score] [-x high-scoring] 89 | [-w superrule-overlap] [-s max-size] [-rc maxstrings] 90 | [--excludegood] [-o output_rule_file] [-e output_dir_strings] 91 | [-a author] [-r ref] [-l lic] [-p prefix] [-b identifier] 92 | [--score] [--strings] [--nosimple] [--nomagic] [--nofilesize] 93 | [-fm FM] [--globalrule] [--nosuper] [--update] [-g G] [-u] 94 | [-c] [-i I] [--dropzone] [--nr] [--oe] [-fs size-in-MB] 95 | [--noextras] [--debug] [--trace] [--opcodes] [-n opcode-num] 96 | 97 | yarGen 98 | 99 | optional arguments: 100 | -h, --help show this help message and exit 101 | 102 | Rule Creation: 103 | -m M Path to scan for malware 104 | -y min-size Minimum string length to consider (default=8) 105 | -z min-score Minimum score to consider (default=0) 106 | -x high-scoring Score required to set string as 'highly specific 107 | string' (default: 30) 108 | -w superrule-overlap Minimum number of strings that overlap to create a 109 | super rule (default: 5) 110 | -s max-size Maximum length to consider (default=128) 111 | -rc maxstrings Maximum number of strings per rule (default=20, 112 | intelligent filtering will be applied) 113 | --excludegood Force the exclude all goodware strings 114 | 115 | Rule Output: 116 | -o output_rule_file Output rule file 117 | -e output_dir_strings 118 | Output directory for string exports 119 | -a author Author Name 120 | -r ref Reference (can be string or text file) 121 | -l lic License 122 | -p prefix Prefix for the rule description 123 | -b identifier Text file from which the identifier is read (default: 124 | last folder name in the full path, e.g. "myRAT" if -m 125 | points to /mnt/mal/myRAT) 126 | --score Show the string scores as comments in the rules 127 | --strings Show the string scores as comments in the rules 128 | --nosimple Skip simple rule creation for files included in super 129 | rules 130 | --nomagic Don't include the magic header condition statement 131 | --nofilesize Don't include the filesize condition statement 132 | -fm FM Multiplier for the maximum 'filesize' condition value 133 | (default: 3) 134 | --globalrule Create global rules (improved rule set speed) 135 | --nosuper Don't try to create super rules that match against 136 | various files 137 | 138 | Database Operations: 139 | --update Update the local strings and opcodes dbs from the 140 | online repository 141 | -g G Path to scan for goodware (dont use the database 142 | shipped with yaraGen) 143 | -u Update local standard goodware database with a new 144 | analysis result (used with -g) 145 | -c Create new local goodware database (use with -g and 146 | optionally -i "identifier") 147 | -i I Specify an identifier for the newly created databases 148 | (good-strings-identifier.db, good-opcodes- 149 | identifier.db) 150 | 151 | General Options: 152 | --dropzone Dropzone mode - monitors a directory [-m] for new 153 | samples to processWARNING: Processed files will be 154 | deleted! 155 | --nr Do not recursively scan directories 156 | --oe Only scan executable extensions EXE, DLL, ASP, JSP, 157 | PHP, BIN, INFECTED 158 | -fs size-in-MB Max file size in MB to analyze (default=10) 159 | --noextras Don't use extras like Imphash or PE header specifics 160 | --debug Debug output 161 | --trace Trace output 162 | 163 | Other Features: 164 | --opcodes Do use the OpCode feature (use this if not enough high 165 | scoring strings can be found) 166 | -n opcode-num Number of opcodes to add if not enough high scoring 167 | string could be found (default=3) 168 | ``` 169 | 170 | ## Best Practice 171 | 172 | See the following blog posts for a more detailed description on how to use yarGen for YARA rule creation: 173 | 174 | [How to Write Simple but Sound Yara Rules - Part 1](https://www.bsk-consulting.de/2015/02/16/write-simple-sound-yara-rules/) 175 | 176 | [How to Write Simple but Sound Yara Rules - Part 2](https://www.bsk-consulting.de/2015/10/17/how-to-write-simple-but-sound-yara-rules-part-2/) 177 | 178 | [How to Write Simple but Sound Yara Rules - Part 3](https://www.bsk-consulting.de/2016/04/15/how-to-write-simple-but-sound-yara-rules-part-3/) 179 | 180 | ## Screenshots 181 | 182 | ![Generator Run](./screens/yargen-running.png) 183 | 184 | ![Output Rule](./screens/output-rule-0.14.1.png) 185 | 186 | As you can see in the screenshot above you'll get a rule that contains strings, which are not found in the goodware strings database. 187 | 188 | You should clean up the rules afterwards. In the example above, remove the strings $s14, $s17, $s19, $s20 that look like random code to get a cleaner rule that is more likely to match on other samples of the same family. 189 | 190 | To get a more generic rule, remove string $s5, which is very specific for this compiled executable. 191 | 192 | ## Examples 193 | 194 | ### Dropzone Mode (Recommended) 195 | 196 | Monitors a given folder (-m) for new samples, processes the samples, writes YARA rules to the set output file (default: yargen_rules.yar) and deletes the folder contents afterwards. 197 | 198 | ```python yarGen.py -a "yarGen Dropzone" --dropzone -m /opt/mal/dropzone``` 199 | 200 | WARNING: All files dropped to the set dropzone will be removed! 201 | 202 | In the following example two files named `identifier.txt` and `reference.txt` are read and used for the `reference` and as identifier in the YARA rule sets. The files are read at each iteration and not only during initialization. This way you can pass specific strings to each dropzone rule generation. 203 | 204 | ```python yarGen.py --dropzone -m /opt/mal/dropzone -b /opt/mal/dropzone/identifier.txt -r /opt/mal/dropzone/reference.txt``` 205 | 206 | ### Use the shipped database (FAST) to create some rules 207 | 208 | ```python yarGen.py -m X:\MAL\Case1401``` 209 | 210 | Use the shipped database of goodware strings and scan the malware directory 211 | "X:\MAL" recursively. Create rules for all files included in this directory and 212 | below. A file named 'yargen_rules.yar' will be generated in the current 213 | directory. 214 | 215 | ### Show the score of the strings as comment 216 | 217 | yarGen will by default use the top 20 strings based on their score. To see how a 218 | certain string in the rule scored, use the "--score" parameter. 219 | 220 | ```python yarGen.py --score -m X:\MAL\Case1401``` 221 | 222 | ### Use only strings with a certain minimum score 223 | 224 | In order to use only strings for your rules that match a certain minimum score use the "-z" parameter. It is a good pratice to first create rules with "--score" and than perform a second run with a minimum score set for you sample set via "-z". 225 | 226 | ```python yarGen.py --score -z 5 -m X:\MAL\Case1401``` 227 | 228 | ### Preset author and reference 229 | 230 | ```python yarGen.py -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case_441 -o case441.yar``` 231 | 232 | ### Add opcodes to the rules 233 | 234 | ```python yarGen.py --opcodes -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case33 -o rules33.yar``` 235 | 236 | ### Show debugging output 237 | 238 | ```python yarGen.py --debug -m /opt/mal/case_441``` 239 | 240 | ### Create a new goodware strings database 241 | 242 | ```python yarGen.py -c --opcodes -g /home/user/Downloads/office2013 -i office``` 243 | 244 | This will generate two new databases for strings and opcodes named: 245 | - good-strings-office.db 246 | - good-opcodes-office.db 247 | 248 | The new databases will automatically be initialized during startup and are from then on used for rule generation. 249 | 250 | ### Update a goodware strings database (append new strings, opcodes, imphashes, exports to the old ones) 251 | 252 | ```python yarGen.py -u -g /home/user/Downloads/office365 -i office``` 253 | 254 | ### My Best Pratice Command Line 255 | 256 | ```python yarGen.py -a "Florian Roth" -r "Internal Research" -m /opt/mal/apt_case_32``` 257 | 258 | # db-lookup.py 259 | 260 | A tool named `db-lookup.py`, which was introduced with version 0.18.0 allows you to query the local databases in a simple command line interface. The interface takes an input value, which can be `string`, `export` or `imphash` value, detects the query type and then performs a lookup in the loaded databases. This allows you to query the yarGen databases with `string`, `export` and `imphash` values in order to check if this value appears in goodware that has been processed to generate the databases. 261 | 262 | This is a nice feature that helps you ta answer the following questions: 263 | 264 | * Does this string appear in goodware samples of my database? 265 | * Does this export name appear in goodware samples of my database? 266 | * Does a sample in my goodware database has this imphash? 267 | 268 | However, there are several drawbacks: 269 | 270 | * It does only match on the full string (no contains, no startswith, no endswith) 271 | * Opcode lookup is not supported (yet) 272 | 273 | I plan to release a new project named `Valknut` which extracts overlapping byte sequences from samples and creates searchable databases. This project will be the new backend API for yarGen allowing all kinds of queries, opcodes and string values, ascii and wide formatted. 274 | -------------------------------------------------------------------------------- /prepare-release.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | RELDIR=./release 4 | OUTDIR=$RELDIR/yarGen/ 5 | 6 | cp yarGen.py $OUTDIR 7 | cp -r 3rdparty $OUTDIR 8 | cp -r lib $OUTDIR 9 | cp README.md $OUTDIR 10 | cp LICENSE $OUTDIR 11 | 12 | cd $RELDIR 13 | tar -cvzf yarGen.tar.gz ./yarGen/ 14 | cd .. -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | lief 2 | lxml -------------------------------------------------------------------------------- /screens/output-rule-0.14.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Neo23x0/yarGen/c16faff06ea6461f1c423f14c7690b6624e5fcff/screens/output-rule-0.14.1.png -------------------------------------------------------------------------------- /screens/yargen-running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Neo23x0/yarGen/c16faff06ea6461f1c423f14c7690b6624e5fcff/screens/yargen-running.png -------------------------------------------------------------------------------- /tools/byte-mapper.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: iso-8859-1 -*- 3 | # -*- coding: utf-8 -*- 4 | # 5 | # Byte Mapper 6 | # Binary Signature Generator 7 | # 8 | # Florian Roth 9 | # June 2014 10 | # v0.1a 11 | 12 | import os 13 | import sys 14 | import argparse 15 | import re 16 | import traceback 17 | from colorama import Fore, Back, Style 18 | from colorama import init 19 | from hashlib import md5 20 | 21 | def getFiles(dir, recursive): 22 | # Recursive 23 | if recursive: 24 | for root, directories, files in os.walk (dir, followlinks=False): 25 | for filename in files: 26 | filePath = os.path.join(root,filename) 27 | yield filePath 28 | # Non recursive 29 | else: 30 | for filename in os.listdir(dir): 31 | filePath = os.path.join(dir,filename) 32 | yield filePath 33 | 34 | def parseDir(dir, recursive, numBytes): 35 | 36 | # Prepare dictionary 37 | byte_stats = {} 38 | 39 | fileCount = 0 40 | for filePath in getFiles(dir, recursive): 41 | 42 | if os.path.isdir(filePath): 43 | if recursive: 44 | parseDir(dir, recursive, numBytes) 45 | continue 46 | 47 | with open(filePath, 'r') as file: 48 | fileCount += 1 49 | header = file.read(int(numBytes)) 50 | 51 | pos = 0 52 | for byte in header: 53 | pos += 1 54 | if pos in byte_stats: 55 | if byte in byte_stats[pos]: 56 | byte_stats[pos][byte] += 1 57 | else: 58 | byte_stats[pos][byte] = 1 59 | else: 60 | #byte_stats.append(pos) 61 | byte_stats[pos] = { byte: 1 } 62 | 63 | return byte_stats, fileCount 64 | 65 | def visiualizeStats(byteStats, fileCount, heatMapMode, byteFiller, bytesPerLine): 66 | # Settings 67 | # print fileCount 68 | 69 | bytesPrinted = 0 70 | for byteStat in byteStats: 71 | 72 | if args.d: 73 | print "------------------------" 74 | print byteStats[byteStat] 75 | 76 | byteToPrint = ".." 77 | countOfByte = 0 78 | highestValue = 0 79 | 80 | # Evaluate the most often occured byte value at this position 81 | for ( key, val ) in byteStats[byteStat].iteritems(): 82 | if val > highestValue: 83 | highestValue = val 84 | byteToPrint = key 85 | countOfByte = val 86 | 87 | # Heat Map Mode 88 | if heatMapMode: 89 | printHeatMapValue(byteToPrint, countOfByte, fileCount, byteFiller) 90 | 91 | # Standard Mode 92 | else: 93 | if countOfByte >= fileCount: 94 | sys.stdout.write("%s%s" % ( byteToPrint.encode('hex'), byteFiller )) 95 | else: 96 | sys.stdout.write("..%s" % byteFiller) 97 | 98 | # Line break 99 | bytesPrinted += 1 100 | if bytesPrinted >= bytesPerLine: 101 | sys.stdout.write("\n") 102 | bytesPrinted = 0 103 | 104 | # Print Heat Map Legend 105 | printHeatLegend(int(fileCount)) 106 | 107 | def printHeatMapValue(byteToPrint, countOfByte, fileCount, byteFiller): 108 | if args.d: 109 | print "Count of byte: %s" % countOfByte 110 | print "File Count: %s" % fileCount 111 | if countOfByte == fileCount: 112 | sys.stdout.write(Fore.GREEN + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 113 | elif countOfByte == fileCount - 1: 114 | sys.stdout.write(Fore.CYAN + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 115 | elif countOfByte == fileCount - 2: 116 | sys.stdout.write(Fore.YELLOW + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 117 | elif countOfByte == fileCount - 3: 118 | sys.stdout.write(Fore.RED + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 119 | elif countOfByte == fileCount - 4: 120 | sys.stdout.write(Fore.MAGENTA + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 121 | elif countOfByte == fileCount - 5: 122 | sys.stdout.write(Fore.WHITE + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller) 123 | else: 124 | sys.stdout.write(Fore.WHITE + Style.DIM + '..' + Fore.WHITE + Style.RESET_ALL + '%s' % byteFiller) 125 | 126 | def printHeatLegend(fileCount): 127 | print "" 128 | print Fore.GREEN + 'GREEN\tContent of all %s files' % str(fileCount) + Fore.WHITE 129 | if fileCount > 1: 130 | print Fore.CYAN + 'CYAN\tContent of %s files' % str(fileCount-1) + Fore.WHITE 131 | if fileCount > 2: 132 | print Fore.YELLOW + 'YELLOW\tContent of %s files' % str(fileCount-2) + Fore.WHITE 133 | if fileCount > 3: 134 | print Fore.RED + 'RED\tContent of %s files' % str(fileCount-3) + Fore.WHITE 135 | if fileCount > 4: 136 | print Fore.MAGENTA + 'MAGENTA\tContent of %s files' % str(fileCount-4) + Fore.WHITE 137 | if fileCount > 5: 138 | print Fore.WHITE + 'WHITE\tContent of %s files' % str(fileCount-5) + Fore.WHITE 139 | if fileCount > 6: 140 | print Fore.WHITE + Style.DIM +'..\tNo identical bytes in more than %s files' % str(fileCount-6) + Fore.WHITE + Style.RESET_ALL 141 | 142 | # MAIN ################################################################ 143 | if __name__ == '__main__': 144 | 145 | # Parse Arguments 146 | parser = argparse.ArgumentParser(description='Yara BSG') 147 | parser.add_argument('-p', metavar="malware-dir", help='Path to scan for malware') 148 | parser.add_argument('-r', action='store_true', default=False, help='Be recursive') 149 | parser.add_argument('-m', action='store_true', default=False, help='Heat map on byte values') 150 | parser.add_argument('-f', default=" ", metavar="byte-filler", help='character to fill the gap between the bytes (default: \' \')') 151 | parser.add_argument('-c', default=None, metavar="num-occurances", help='Print only bytes that occur in at least X of the samples (default: all files; incompatible with heat map mode) ') 152 | parser.add_argument('-b', default=1024, metavar="bytes", help='Number of bytes to print (default: 1024)') 153 | parser.add_argument('-l', default=16, metavar="bytes-per-line", help='Number of bytes to print per line (default: 16)') 154 | parser.add_argument('-d', action='store_true', default=False, help='Debug Info') 155 | 156 | args = parser.parse_args() 157 | 158 | # Colorization 159 | init() 160 | 161 | # Parse the Files 162 | ( byteStats, fileCount) = parseDir(args.p, args.r, args.b) 163 | 164 | # print byteStats 165 | if args.c != None and not args.m: 166 | fileCount = int(args.c) 167 | 168 | # Vizualize Byte Stats 169 | visiualizeStats(byteStats, fileCount, args.m, args.f, args.l) -------------------------------------------------------------------------------- /yarGen.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: iso-8859-1 -*- 3 | # -*- coding: utf-8 -*- 4 | # 5 | # yarGen 6 | # A Rule Generator for YARA Rules 7 | # 8 | # Florian Roth 9 | 10 | __version__ = "0.24.0" 11 | 12 | import os 13 | import sys 14 | 15 | import argparse 16 | import re 17 | import traceback 18 | import operator 19 | import datetime 20 | import time 21 | import lief 22 | import json 23 | import gzip 24 | import urllib.request 25 | import binascii 26 | import base64 27 | import shutil 28 | from collections import Counter 29 | from hashlib import sha256 30 | import signal as signal_module 31 | from lxml import etree 32 | 33 | RELEVANT_EXTENSIONS = [".asp", ".vbs", ".ps", ".ps1", ".tmp", ".bas", ".bat", ".cmd", ".com", ".cpl", 34 | ".crt", ".dll", ".exe", ".msc", ".scr", ".sys", ".vb", ".vbe", ".vbs", ".wsc", 35 | ".wsf", ".wsh", ".input", ".war", ".jsp", ".php", ".asp", ".aspx", ".psd1", ".psm1", ".py"] 36 | 37 | AI_COMMENT = """ 38 | The provided rule is a YARA rule, encompassing a wide range of suspicious strings. Kindly review the list and pinpoint the twenty strings that are most distinctive or appear most suited for a YARA rule focused on malware detection. Arrange them in descending order based on their level of suspicion. Then, swap out the current list of strings in the YARA rule with your chosen set and supply the revised rule. 39 | --- 40 | """ 41 | 42 | REPO_URLS = { 43 | 'good-opcodes-part1.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part1.db', 44 | 'good-opcodes-part2.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part2.db', 45 | 'good-opcodes-part3.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part3.db', 46 | 'good-opcodes-part4.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part4.db', 47 | 'good-opcodes-part5.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part5.db', 48 | 'good-opcodes-part6.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part6.db', 49 | 'good-opcodes-part7.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part7.db', 50 | 'good-opcodes-part8.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part8.db', 51 | 'good-opcodes-part9.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part9.db', 52 | 53 | 'good-strings-part1.db': 'https://www.bsk-consulting.de/yargen/good-strings-part1.db', 54 | 'good-strings-part2.db': 'https://www.bsk-consulting.de/yargen/good-strings-part2.db', 55 | 'good-strings-part3.db': 'https://www.bsk-consulting.de/yargen/good-strings-part3.db', 56 | 'good-strings-part4.db': 'https://www.bsk-consulting.de/yargen/good-strings-part4.db', 57 | 'good-strings-part5.db': 'https://www.bsk-consulting.de/yargen/good-strings-part5.db', 58 | 'good-strings-part6.db': 'https://www.bsk-consulting.de/yargen/good-strings-part6.db', 59 | 'good-strings-part7.db': 'https://www.bsk-consulting.de/yargen/good-strings-part7.db', 60 | 'good-strings-part8.db': 'https://www.bsk-consulting.de/yargen/good-strings-part8.db', 61 | 'good-strings-part9.db': 'https://www.bsk-consulting.de/yargen/good-strings-part9.db', 62 | 63 | 'good-exports-part1.db': 'https://www.bsk-consulting.de/yargen/good-exports-part1.db', 64 | 'good-exports-part2.db': 'https://www.bsk-consulting.de/yargen/good-exports-part2.db', 65 | 'good-exports-part3.db': 'https://www.bsk-consulting.de/yargen/good-exports-part3.db', 66 | 'good-exports-part4.db': 'https://www.bsk-consulting.de/yargen/good-exports-part4.db', 67 | 'good-exports-part5.db': 'https://www.bsk-consulting.de/yargen/good-exports-part5.db', 68 | 'good-exports-part6.db': 'https://www.bsk-consulting.de/yargen/good-exports-part6.db', 69 | 'good-exports-part7.db': 'https://www.bsk-consulting.de/yargen/good-exports-part7.db', 70 | 'good-exports-part8.db': 'https://www.bsk-consulting.de/yargen/good-exports-part8.db', 71 | 'good-exports-part9.db': 'https://www.bsk-consulting.de/yargen/good-exports-part9.db', 72 | 73 | 'good-imphashes-part1.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part1.db', 74 | 'good-imphashes-part2.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part2.db', 75 | 'good-imphashes-part3.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part3.db', 76 | 'good-imphashes-part4.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part4.db', 77 | 'good-imphashes-part5.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part5.db', 78 | 'good-imphashes-part6.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part6.db', 79 | 'good-imphashes-part7.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part7.db', 80 | 'good-imphashes-part8.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part8.db', 81 | 'good-imphashes-part9.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part9.db', 82 | } 83 | 84 | PE_STRINGS_FILE = "./3rdparty/strings.xml" 85 | 86 | KNOWN_IMPHASHES = {'a04dd9f5ee88d7774203e0a0cfa1b941': 'PsExec', 87 | '2b8c9d9ab6fefc247adaf927e83dcea6': 'RAR SFX variant'} 88 | 89 | 90 | def get_abs_path(filename): 91 | return os.path.join(os.path.dirname(os.path.abspath(__file__)), filename) 92 | 93 | 94 | def get_files(folder, notRecursive): 95 | # Not Recursive 96 | if notRecursive: 97 | for filename in os.listdir(folder): 98 | filePath = os.path.join(folder, filename) 99 | if os.path.isdir(filePath): 100 | continue 101 | yield filePath 102 | # Recursive 103 | else: 104 | for root, dirs, files in os.walk(folder, topdown = False): 105 | for name in files: 106 | filePath = os.path.join(root, name) 107 | yield filePath 108 | 109 | 110 | def parse_sample_dir(dir, notRecursive=False, generateInfo=False, onlyRelevantExtensions=False): 111 | # Prepare dictionary 112 | string_stats = {} 113 | opcode_stats = {} 114 | file_info = {} 115 | known_sha1sums = [] 116 | 117 | for filePath in get_files(dir, notRecursive): 118 | try: 119 | print("[+] Processing %s ..." % filePath) 120 | 121 | # Get Extension 122 | extension = os.path.splitext(filePath)[1].lower() 123 | if not extension in RELEVANT_EXTENSIONS and onlyRelevantExtensions: 124 | if args.debug: 125 | print("[-] EXTENSION %s - Skipping file %s" % (extension, filePath)) 126 | continue 127 | 128 | # Info file check 129 | if os.path.basename(filePath) == os.path.basename(args.b) or \ 130 | os.path.basename(filePath) == os.path.basename(args.r): 131 | continue 132 | 133 | # Size Check 134 | size = 0 135 | try: 136 | size = os.stat(filePath).st_size 137 | if size > (args.fs * 1024 * 1024): 138 | if args.debug: 139 | print("[-] File is to big - Skipping file %s (use -fs to adjust this behaviour)" % (filePath)) 140 | continue 141 | except Exception as e: 142 | pass 143 | 144 | # Check and read file 145 | try: 146 | with open(filePath, 'rb') as f: 147 | fileData = f.read() 148 | except Exception as e: 149 | print("[-] Cannot read file - skipping %s" % filePath) 150 | 151 | # Extract strings from file 152 | strings = extract_strings(fileData) 153 | 154 | # Extract opcodes from file 155 | opcodes = [] 156 | if use_opcodes: 157 | print("[-] Extracting OpCodes: %s" % filePath) 158 | opcodes = extract_opcodes(fileData) 159 | 160 | # Add sha256 value 161 | if generateInfo: 162 | sha256sum = sha256(fileData).hexdigest() 163 | file_info[filePath] = {} 164 | file_info[filePath]["hash"] = sha256sum 165 | file_info[filePath]["imphash"], file_info[filePath]["exports"] = get_pe_info(fileData) 166 | 167 | # Skip if hash already known - avoid duplicate files 168 | if sha256sum in known_sha1sums: 169 | # if args.debug: 170 | print("[-] Skipping strings/opcodes from %s due to MD5 duplicate detection" % filePath) 171 | continue 172 | else: 173 | known_sha1sums.append(sha256sum) 174 | 175 | # Magic evaluation 176 | if not args.nomagic: 177 | file_info[filePath]["magic"] = binascii.hexlify(fileData[:2]).decode('ascii') 178 | else: 179 | file_info[filePath]["magic"] = "" 180 | 181 | # File Size 182 | file_info[filePath]["size"] = os.stat(filePath).st_size 183 | 184 | # Add stats for basename (needed for inverse rule generation) 185 | fileName = os.path.basename(filePath) 186 | folderName = os.path.basename(os.path.dirname(filePath)) 187 | if fileName not in file_info: 188 | file_info[fileName] = {} 189 | file_info[fileName]["count"] = 0 190 | file_info[fileName]["hashes"] = [] 191 | file_info[fileName]["folder_names"] = [] 192 | file_info[fileName]["count"] += 1 193 | file_info[fileName]["hashes"].append(sha256sum) 194 | if folderName not in file_info[fileName]["folder_names"]: 195 | file_info[fileName]["folder_names"].append(folderName) 196 | 197 | # Add strings to statistics 198 | for string in strings: 199 | # String is not already known 200 | if string not in string_stats: 201 | string_stats[string] = {} 202 | string_stats[string]["count"] = 0 203 | string_stats[string]["files"] = [] 204 | string_stats[string]["files_basename"] = {} 205 | # String count 206 | string_stats[string]["count"] += 1 207 | # Add file information 208 | if fileName not in string_stats[string]["files_basename"]: 209 | string_stats[string]["files_basename"][fileName] = 0 210 | string_stats[string]["files_basename"][fileName] += 1 211 | if filePath not in string_stats[string]["files"]: 212 | string_stats[string]["files"].append(filePath) 213 | 214 | # Add opcodes to statistics 215 | for opcode in opcodes: 216 | # Opcode is not already known 217 | if opcode not in opcode_stats: 218 | opcode_stats[opcode] = {} 219 | opcode_stats[opcode]["count"] = 0 220 | opcode_stats[opcode]["files"] = [] 221 | opcode_stats[opcode]["files_basename"] = {} 222 | # Opcode count 223 | opcode_stats[opcode]["count"] += 1 224 | # Add file information 225 | if fileName not in opcode_stats[opcode]["files_basename"]: 226 | opcode_stats[opcode]["files_basename"][fileName] = 0 227 | opcode_stats[opcode]["files_basename"][fileName] += 1 228 | if filePath not in opcode_stats[opcode]["files"]: 229 | opcode_stats[opcode]["files"].append(filePath) 230 | 231 | if args.debug: 232 | print("[+] Processed " + filePath + " Size: " + str(size) + " Strings: " + str(len(string_stats)) + \ 233 | " OpCodes: " + str(len(opcode_stats)) + " ... ") 234 | 235 | except Exception as e: 236 | traceback.print_exc() 237 | print("[E] ERROR reading file: %s" % filePath) 238 | 239 | return string_stats, opcode_stats, file_info 240 | 241 | 242 | def parse_good_dir(dir, notRecursive=False, onlyRelevantExtensions=True): 243 | # Prepare dictionary 244 | all_strings = Counter() 245 | all_opcodes = Counter() 246 | all_imphashes = Counter() 247 | all_exports = Counter() 248 | 249 | for filePath in get_files(dir, notRecursive): 250 | # Get Extension 251 | extension = os.path.splitext(filePath)[1].lower() 252 | if extension not in RELEVANT_EXTENSIONS and onlyRelevantExtensions: 253 | if args.debug: 254 | print("[-] EXTENSION %s - Skipping file %s" % (extension, filePath)) 255 | continue 256 | 257 | # Size Check 258 | size = 0 259 | try: 260 | size = os.stat(filePath).st_size 261 | if size > (args.fs * 1024 * 1024): 262 | continue 263 | except Exception as e: 264 | pass 265 | 266 | # Check and read file 267 | try: 268 | with open(filePath, 'rb') as f: 269 | fileData = f.read() 270 | except Exception as e: 271 | print("[-] Cannot read file - skipping %s" % filePath) 272 | 273 | # Extract strings from file 274 | strings = extract_strings(fileData) 275 | # Append to all strings 276 | all_strings.update(strings) 277 | 278 | # Extract Opcodes from file 279 | opcodes = [] 280 | if use_opcodes: 281 | print("[-] Extracting OpCodes: %s" % filePath) 282 | opcodes = extract_opcodes(fileData) 283 | # Append to all opcodes 284 | all_opcodes.update(opcodes) 285 | 286 | # Imphash and Exports 287 | (imphash, exports) = get_pe_info(fileData) 288 | if imphash != "": 289 | all_imphashes.update([imphash]) 290 | all_exports.update(exports) 291 | if args.debug: 292 | print("[+] Processed %s - %d strings %d opcodes %d exports and imphash %s" % (filePath, len(strings), 293 | len(opcodes), len(exports), 294 | imphash)) 295 | 296 | # return it as a set (unique strings) 297 | return all_strings, all_opcodes, all_imphashes, all_exports 298 | 299 | 300 | def extract_strings(fileData) -> list[str]: 301 | # String list 302 | cleaned_strings = [] 303 | # Read file data 304 | try: 305 | # Read strings 306 | strings_full = re.findall(b"[\x1f-\x7e]{6,}", fileData) 307 | strings_limited = re.findall(b"[\x1f-\x7e]{6,%d}" % args.s, fileData) 308 | strings_hex = extract_hex_strings(fileData) 309 | strings = list(set(strings_full) | set(strings_limited) | set(strings_hex)) 310 | wide_strings = [ws for ws in re.findall(b"(?:[\x1f-\x7e][\x00]){6,}", fileData)] 311 | 312 | # Post-process 313 | # WIDE 314 | for ws in wide_strings: 315 | # Decode UTF16 and prepend a marker (facilitates handling) 316 | wide_string = ("UTF16LE:%s" % ws.decode('utf-16')).encode('utf-8') 317 | if wide_string not in strings: 318 | strings.append(wide_string) 319 | for string in strings: 320 | # Escape strings 321 | if len(string) > 0: 322 | string = string.replace(b'\\', b'\\\\') 323 | string = string.replace(b'"', b'\\"') 324 | try: 325 | if isinstance(string, str): 326 | cleaned_strings.append(string) 327 | else: 328 | cleaned_strings.append(string.decode('utf-8')) 329 | except AttributeError as e: 330 | print(string) 331 | traceback.print_exc() 332 | 333 | except Exception as e: 334 | if args.debug: 335 | print(string) 336 | traceback.print_exc() 337 | pass 338 | 339 | return cleaned_strings 340 | 341 | 342 | def extract_opcodes(fileData) -> list[str]: 343 | # Opcode list 344 | opcodes = [] 345 | 346 | try: 347 | # Read file data 348 | binary = lief.parse(fileData) 349 | ep = binary.entrypoint 350 | 351 | # Locate .text section 352 | text = None 353 | if isinstance(binary, lief.PE.Binary): 354 | for sec in binary.sections: 355 | if sec.virtual_address + binary.imagebase <= ep < sec.virtual_address + binary.imagebase + sec.virtual_size: 356 | if args.debug: 357 | print(f'EP is located at {sec.name} section') 358 | text = sec.content.tobytes() 359 | break 360 | elif isinstance(binary, lief.ELF.Binary): 361 | for sec in binary.sections: 362 | if sec.virtual_address <= ep < sec.virtual_address + sec.size: 363 | if args.debug: 364 | print(f'EP is located at {sec.name} section') 365 | text = sec.content.tobytes() 366 | break 367 | 368 | if text is not None: 369 | # Split text into subs 370 | text_parts = re.split(b"[\x00]{3,}", text) 371 | # Now truncate and encode opcodes 372 | for text_part in text_parts: 373 | if text_part == '' or len(text_part) < 8: 374 | continue 375 | opcodes.append(binascii.hexlify(text_part[:16]).decode(encoding='ascii')) 376 | except Exception as e: 377 | if args.debug: 378 | traceback.print_exc() 379 | pass 380 | 381 | return opcodes 382 | 383 | 384 | def get_pe_info(fileData: bytes) -> tuple[str, list[str]]: 385 | """ 386 | Get different PE attributes and hashes by lief 387 | :param fileData: 388 | :return: 389 | """ 390 | imphash = "" 391 | exports = [] 392 | # Check for MZ header (speed improvement) 393 | if fileData[:2] != b"MZ": 394 | return imphash, exports 395 | try: 396 | if args.debug: 397 | print("Extracting PE information") 398 | binary: lief.PE.Binary = lief.parse(fileData) 399 | # Imphash 400 | imphash = lief.PE.get_imphash(binary, lief.PE.IMPHASH_MODE.PEFILE) 401 | # Exports (names) 402 | for exp in binary.get_export().entries: 403 | exp: lief.PE.ExportEntry 404 | exports.append(str(exp.name)) 405 | except Exception as e: 406 | if args.debug: 407 | traceback.print_exc() 408 | pass 409 | 410 | return imphash, exports 411 | 412 | 413 | def sample_string_evaluation(string_stats, opcode_stats, file_info): 414 | # Generate Stats ----------------------------------------------------------- 415 | print("[+] Generating statistical data ...") 416 | file_strings = {} 417 | file_opcodes = {} 418 | combinations = {} 419 | inverse_stats = {} 420 | max_combi_count = 0 421 | super_rules = [] 422 | 423 | # OPCODE EVALUATION -------------------------------------------------------- 424 | for opcode in opcode_stats: 425 | # If string occurs not too often in sample files 426 | if opcode_stats[opcode]["count"] < 10: 427 | # If string list in file dictionary not yet exists 428 | for filePath in opcode_stats[opcode]["files"]: 429 | if filePath in file_opcodes: 430 | # Append string 431 | file_opcodes[filePath].append(opcode) 432 | else: 433 | # Create list and then add the first string to the file 434 | file_opcodes[filePath] = [] 435 | file_opcodes[filePath].append(opcode) 436 | 437 | # STRING EVALUATION ------------------------------------------------------- 438 | 439 | # Iterate through strings found in malware files 440 | for string in string_stats: 441 | 442 | # If string occurs not too often in (goodware) sample files 443 | if string_stats[string]["count"] < 10: 444 | # If string list in file dictionary not yet exists 445 | for filePath in string_stats[string]["files"]: 446 | if filePath in file_strings: 447 | # Append string 448 | file_strings[filePath].append(string) 449 | else: 450 | # Create list and then add the first string to the file 451 | file_strings[filePath] = [] 452 | file_strings[filePath].append(string) 453 | 454 | # INVERSE RULE GENERATION ------------------------------------- 455 | if args.inverse: 456 | for fileName in string_stats[string]["files_basename"]: 457 | string_occurrance_count = string_stats[string]["files_basename"][fileName] 458 | total_count_basename = file_info[fileName]["count"] 459 | # print "string_occurance_count %s - total_count_basename %s" % ( string_occurance_count, 460 | # total_count_basename ) 461 | if string_occurrance_count == total_count_basename: 462 | if fileName not in inverse_stats: 463 | inverse_stats[fileName] = [] 464 | if args.trace: 465 | print("Appending %s to %s" % (string, fileName)) 466 | inverse_stats[fileName].append(string) 467 | 468 | # SUPER RULE GENERATION ----------------------------------------------- 469 | if not nosuper and not args.inverse: 470 | 471 | # SUPER RULES GENERATOR - preliminary work 472 | # If a string occurs more than once in different files 473 | # print sample_string_stats[string]["count"] 474 | if string_stats[string]["count"] > 1: 475 | if args.debug: 476 | print("OVERLAP Count: %s\nString: \"%s\"%s" % (string_stats[string]["count"], string, 477 | "\nFILE: ".join(string_stats[string]["files"]))) 478 | # Create a combination string from the file set that matches to that string 479 | combi = ":".join(sorted(string_stats[string]["files"])) 480 | # print "STRING: " + string 481 | if args.debug: 482 | print("COMBI: " + combi) 483 | # If combination not yet known 484 | if combi not in combinations: 485 | combinations[combi] = {} 486 | combinations[combi]["count"] = 1 487 | combinations[combi]["strings"] = [] 488 | combinations[combi]["strings"].append(string) 489 | combinations[combi]["files"] = string_stats[string]["files"] 490 | else: 491 | combinations[combi]["count"] += 1 492 | combinations[combi]["strings"].append(string) 493 | # Set the maximum combination count 494 | if combinations[combi]["count"] > max_combi_count: 495 | max_combi_count = combinations[combi]["count"] 496 | # print "Max Combi Count set to: %s" % max_combi_count 497 | 498 | print("[+] Generating Super Rules ... (a lot of magic)") 499 | for combi_count in range(max_combi_count, 1, -1): 500 | for combi in combinations: 501 | if combi_count == combinations[combi]["count"]: 502 | # print "Count %s - Combi %s" % ( str(combinations[combi]["count"]), combi ) 503 | # Filter the string set 504 | # print "BEFORE" 505 | # print len(combinations[combi]["strings"]) 506 | # print combinations[combi]["strings"] 507 | string_set = combinations[combi]["strings"] 508 | combinations[combi]["strings"] = [] 509 | combinations[combi]["strings"] = filter_string_set(string_set) 510 | # print combinations[combi]["strings"] 511 | # print "AFTER" 512 | # print len(combinations[combi]["strings"]) 513 | # Combi String count after filtering 514 | # print "String count after filtering: %s" % str(len(combinations[combi]["strings"])) 515 | 516 | # If the string set of the combination has a required size 517 | if len(combinations[combi]["strings"]) >= int(args.w): 518 | # Remove the files in the combi rule from the simple set 519 | if args.nosimple: 520 | for file in combinations[combi]["files"]: 521 | if file in file_strings: 522 | del file_strings[file] 523 | # Add it as a super rule 524 | print("[-] Adding Super Rule with %s strings." % str(len(combinations[combi]["strings"]))) 525 | # if args.debug: 526 | # print "Rule Combi: %s" % combi 527 | super_rules.append(combinations[combi]) 528 | 529 | # Return all data 530 | return (file_strings, file_opcodes, combinations, super_rules, inverse_stats) 531 | 532 | 533 | def filter_opcode_set(opcode_set: list[str]) -> list[str]: 534 | # Preferred Opcodes 535 | pref_opcodes = [' 34 ', 'ff ff ff '] 536 | 537 | # Useful set 538 | useful_set = [] 539 | pref_set = [] 540 | 541 | for opcode in opcode_set: 542 | opcode: str 543 | # Exclude all opcodes found in goodware 544 | if opcode in good_opcodes_db: 545 | if args.debug: 546 | print("skipping %s" % opcode) 547 | continue 548 | 549 | # Format the opcode 550 | formatted_opcode = get_opcode_string(opcode) 551 | 552 | # Preferred opcodes 553 | set_in_pref = False 554 | for pref in pref_opcodes: 555 | if pref in formatted_opcode: 556 | pref_set.append(formatted_opcode) 557 | set_in_pref = True 558 | if set_in_pref: 559 | continue 560 | 561 | # Else add to useful set 562 | useful_set.append(get_opcode_string(opcode)) 563 | 564 | # Preferred opcodes first 565 | useful_set = pref_set + useful_set 566 | 567 | # Only return the number of opcodes defined with the "-n" parameter 568 | return useful_set[:int(args.n)] 569 | 570 | 571 | def filter_string_set(string_set): 572 | # This is the only set we have - even if it's a weak one 573 | useful_set = [] 574 | 575 | # Local string scores 576 | localStringScores = {} 577 | 578 | # Local UTF strings 579 | utfstrings = [] 580 | 581 | for string in string_set: 582 | 583 | # Goodware string marker 584 | goodstring = False 585 | goodcount = 0 586 | 587 | # Goodware Strings 588 | if string in good_strings_db: 589 | goodstring = True 590 | goodcount = good_strings_db[string] 591 | # print "%s - %s" % ( goodstring, good_strings[string] ) 592 | if args.excludegood: 593 | continue 594 | 595 | # UTF 596 | original_string = string 597 | if string[:8] == "UTF16LE:": 598 | # print "removed UTF16LE from %s" % string 599 | string = string[8:] 600 | utfstrings.append(string) 601 | 602 | # Good string evaluation (after the UTF modification) 603 | if goodstring: 604 | # Reduce the score by the number of occurence in goodware files 605 | localStringScores[string] = (goodcount * -1) + 5 606 | else: 607 | localStringScores[string] = 0 608 | 609 | # PEStudio String Blacklist Evaluation 610 | if pestudio_available: 611 | (pescore, type) = get_pestudio_score(string) 612 | # print("PE Match: %s" % string) 613 | # Reset score of goodware files to 5 if blacklisted in PEStudio 614 | if type != "": 615 | pestudioMarker[string] = type 616 | # Modify the PEStudio blacklisted strings with their goodware stats count 617 | if goodstring: 618 | pescore = pescore - (goodcount / 1000.0) 619 | # print "%s - %s - %s" % (string, pescore, goodcount) 620 | localStringScores[string] = pescore 621 | 622 | if not goodstring: 623 | 624 | # Length Score 625 | #length = len(string) 626 | #if length > int(args.y) and length < int(args.s): 627 | # localStringScores[string] += round(len(string) / 8, 2) 628 | #if length >= int(args.s): 629 | # localStringScores[string] += 1 630 | 631 | # Reduction 632 | if ".." in string: 633 | localStringScores[string] -= 5 634 | if " " in string: 635 | localStringScores[string] -= 5 636 | # Packer Strings 637 | if re.search(r'(WinRAR\\SFX)', string): 638 | localStringScores[string] -= 4 639 | # US ASCII char 640 | if "\x1f" in string: 641 | localStringScores[string] -= 4 642 | # Chains of 00s 643 | if string.count('0000000000') > 2: 644 | localStringScores[string] -= 5 645 | # Repeated characters 646 | if re.search(r'(?!.* ([A-Fa-f0-9])\1{8,})', string): 647 | localStringScores[string] -= 5 648 | 649 | # Certain strings add-ons ---------------------------------------------- 650 | # Extensions - Drive 651 | if re.search(r'[A-Za-z]:\\', string, re.IGNORECASE): 652 | localStringScores[string] += 2 653 | # Relevant file extensions 654 | if re.search(r'(\.exe|\.pdb|\.scr|\.log|\.cfg|\.txt|\.dat|\.msi|\.com|\.bat|\.dll|\.pdb|\.vbs|' 655 | r'\.tmp|\.sys|\.ps1|\.vbp|\.hta|\.lnk)', string, re.IGNORECASE): 656 | localStringScores[string] += 4 657 | # System keywords 658 | if re.search(r'(cmd.exe|system32|users|Documents and|SystemRoot|Grant|hello|password|process|log)', 659 | string, re.IGNORECASE): 660 | localStringScores[string] += 5 661 | # Protocol Keywords 662 | if re.search(r'(ftp|irc|smtp|command|GET|POST|Agent|tor2web|HEAD)', string, re.IGNORECASE): 663 | localStringScores[string] += 5 664 | # Connection keywords 665 | if re.search(r'(error|http|closed|fail|version|proxy)', string, re.IGNORECASE): 666 | localStringScores[string] += 3 667 | # Browser User Agents 668 | if re.search(r'(Mozilla|MSIE|Windows NT|Macintosh|Gecko|Opera|User\-Agent)', string, re.IGNORECASE): 669 | localStringScores[string] += 5 670 | # Temp and Recycler 671 | if re.search(r'(TEMP|Temporary|Appdata|Recycler)', string, re.IGNORECASE): 672 | localStringScores[string] += 4 673 | # Malicious keywords - hacktools 674 | if re.search(r'(scan|sniff|poison|intercept|fake|spoof|sweep|dump|flood|inject|forward|scan|vulnerable|' 675 | r'credentials|creds|coded|p0c|Content|host)', string, re.IGNORECASE): 676 | localStringScores[string] += 5 677 | # Network keywords 678 | if re.search(r'(address|port|listen|remote|local|process|service|mutex|pipe|frame|key|lookup|connection)', 679 | string, re.IGNORECASE): 680 | localStringScores[string] += 3 681 | # Drive 682 | if re.search(r'([C-Zc-z]:\\)', string, re.IGNORECASE): 683 | localStringScores[string] += 4 684 | # IP 685 | if re.search( 686 | r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b', 687 | string, re.IGNORECASE): # IP Address 688 | localStringScores[string] += 5 689 | # Copyright Owner 690 | if re.search(r'(coded | c0d3d |cr3w\b|Coded by |codedby)', string, re.IGNORECASE): 691 | localStringScores[string] += 7 692 | # Extension generic 693 | if re.search(r'\.[a-zA-Z]{3}\b', string): 694 | localStringScores[string] += 3 695 | # All upper case 696 | if re.search(r'^[A-Z]{6,}$', string): 697 | localStringScores[string] += 2.5 698 | # All lower case 699 | if re.search(r'^[a-z]{6,}$', string): 700 | localStringScores[string] += 2 701 | # All lower with space 702 | if re.search(r'^[a-z\s]{6,}$', string): 703 | localStringScores[string] += 2 704 | # All characters 705 | if re.search(r'^[A-Z][a-z]{5,}$', string): 706 | localStringScores[string] += 2 707 | # URL 708 | if re.search(r'(%[a-z][:\-,;]|\\\\%s|\\\\[A-Z0-9a-z%]+\\[A-Z0-9a-z%]+)', string): 709 | localStringScores[string] += 2.5 710 | # certificates 711 | if re.search(r'(thawte|trustcenter|signing|class|crl|CA|certificate|assembly)', string, re.IGNORECASE): 712 | localStringScores[string] -= 4 713 | # Parameters 714 | if re.search(r'( \-[a-z]{,2}[\s]?[0-9]?| /[a-z]+[\s]?[\w]*)', string, re.IGNORECASE): 715 | localStringScores[string] += 4 716 | # Directory 717 | if re.search(r'([a-zA-Z]:|^|%)\\[A-Za-z]{4,30}\\', string): 718 | localStringScores[string] += 4 719 | # Executable - not in directory 720 | if re.search(r'^[^\\]+\.(exe|com|scr|bat|sys)$', string, re.IGNORECASE): 721 | localStringScores[string] += 4 722 | # Date placeholders 723 | if re.search(r'(yyyy|hh:mm|dd/mm|mm/dd|%s:%s:)', string, re.IGNORECASE): 724 | localStringScores[string] += 3 725 | # Placeholders 726 | if re.search(r'[^A-Za-z](%s|%d|%i|%02d|%04d|%2d|%3s)[^A-Za-z]', string, re.IGNORECASE): 727 | localStringScores[string] += 3 728 | # String parts from file system elements 729 | if re.search(r'(cmd|com|pipe|tmp|temp|recycle|bin|secret|private|AppData|driver|config)', string, 730 | re.IGNORECASE): 731 | localStringScores[string] += 3 732 | # Programming 733 | if re.search(r'(execute|run|system|shell|root|cimv2|login|exec|stdin|read|process|netuse|script|share)', 734 | string, re.IGNORECASE): 735 | localStringScores[string] += 3 736 | # Credentials 737 | if re.search(r'(user|pass|login|logon|token|cookie|creds|hash|ticket|NTLM|LMHASH|kerberos|spnego|session|' 738 | r'identif|account|login|auth|privilege)', string, re.IGNORECASE): 739 | localStringScores[string] += 3 740 | # Malware 741 | if re.search(r'(\.[a-z]/[^/]+\.txt|)', string, re.IGNORECASE): 742 | localStringScores[string] += 3 743 | # Variables 744 | if re.search(r'%[A-Z_]+%', string, re.IGNORECASE): 745 | localStringScores[string] += 4 746 | # RATs / Malware 747 | if re.search(r'(spy|logger|dark|cryptor|RAT\b|eye|comet|evil|xtreme|poison|meterpreter|metasploit|/veil|Blood)', 748 | string, re.IGNORECASE): 749 | localStringScores[string] += 5 750 | # Missed user profiles 751 | if re.search(r'[\\](users|profiles|username|benutzer|Documents and Settings|Utilisateurs|Utenti|' 752 | r'Usuários)[\\]', string, re.IGNORECASE): 753 | localStringScores[string] += 3 754 | # Strings: Words ending with numbers 755 | if re.search(r'^[A-Z][a-z]+[0-9]+$', string, re.IGNORECASE): 756 | localStringScores[string] += 1 757 | # Spying 758 | if re.search(r'(implant)', string, re.IGNORECASE): 759 | localStringScores[string] += 1 760 | # Program Path - not Programs or Windows 761 | if re.search(r'^[Cc]:\\\\[^PW]', string): 762 | localStringScores[string] += 3 763 | # Special strings 764 | if re.search(r'(\\\\\.\\|kernel|.dll|usage|\\DosDevices\\)', string, re.IGNORECASE): 765 | localStringScores[string] += 5 766 | # Parameters 767 | if re.search(r'( \-[a-z] | /[a-z] | \-[a-z]:[a-zA-Z]| \/[a-z]:[a-zA-Z])', string): 768 | localStringScores[string] += 4 769 | # File 770 | if re.search(r'^[a-zA-Z0-9]{3,40}\.[a-zA-Z]{3}', string, re.IGNORECASE): 771 | localStringScores[string] += 3 772 | # Comment Line / Output Log 773 | if re.search(r'^([\*\#]+ |\[[\*\-\+]\] |[\-=]> |\[[A-Za-z]\] )', string): 774 | localStringScores[string] += 4 775 | # Output typo / special expression 776 | if re.search(r'(!\.$|!!!$| :\)$| ;\)$|fucked|[\w]\.\.\.\.$)', string): 777 | localStringScores[string] += 4 778 | # Base64 779 | if re.search(r'^(?:[A-Za-z0-9+/]{4}){30,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$', string) and \ 780 | re.search(r'[A-Za-z]', string) and re.search(r'[0-9]', string): 781 | localStringScores[string] += 7 782 | # Base64 Executables 783 | if re.search(r'(TVqQAAMAAAAEAAAA//8AALgAAAA|TVpQAAIAAAAEAA8A//8AALgAAAA|TVqAAAEAAAAEABAAAAAAAAAAAAA|' 784 | r'TVoAAAAAAAAAAAAAAAAAAAAAAAA|TVpTAQEAAAAEAAAA//8AALgAAAA)', string): 785 | localStringScores[string] += 5 786 | # Malicious intent 787 | if re.search(r'(loader|cmdline|ntlmhash|lmhash|infect|encrypt|exec|elevat|dump|target|victim|override|' 788 | r'traverse|mutex|pawnde|exploited|shellcode|injected|spoofed|dllinjec|exeinj|reflective|' 789 | r'payload|inject|back conn)', 790 | string, re.IGNORECASE): 791 | localStringScores[string] += 5 792 | # Privileges 793 | if re.search(r'(administrator|highest|system|debug|dbg|admin|adm|root) privilege', string, re.IGNORECASE): 794 | localStringScores[string] += 4 795 | # System file/process names 796 | if re.search(r'(LSASS|SAM|lsass.exe|cmd.exe|LSASRV.DLL)', string): 797 | localStringScores[string] += 4 798 | # System file/process names 799 | if re.search(r'(\.exe|\.dll|\.sys)$', string, re.IGNORECASE): 800 | localStringScores[string] += 4 801 | # Indicators that string is valid 802 | if re.search(r'(^\\\\)', string, re.IGNORECASE): 803 | localStringScores[string] += 1 804 | # Compiler output directories 805 | if re.search(r'(\\Release\\|\\Debug\\|\\bin|\\sbin)', string, re.IGNORECASE): 806 | localStringScores[string] += 2 807 | # Special - Malware related strings 808 | if re.search(r'(Management Support Team1|/c rundll32|DTOPTOOLZ Co.|net start|Exec|taskkill)', string): 809 | localStringScores[string] += 4 810 | # Powershell 811 | if re.search(r'(bypass|windowstyle | hidden |-command|IEX |Invoke-Expression|Net.Webclient|Invoke[A-Z]|' 812 | r'Net.WebClient|-w hidden |-encoded' 813 | r'-encodedcommand| -nop |MemoryLoadLibrary|FromBase64String|Download|EncodedCommand)', string, re.IGNORECASE): 814 | localStringScores[string] += 4 815 | # WMI 816 | if re.search(r'( /c WMIC)', string, re.IGNORECASE): 817 | localStringScores[string] += 3 818 | # Windows Commands 819 | if re.search(r'( net user | net group |ping |whoami |bitsadmin |rundll32.exe javascript:|' 820 | r'schtasks.exe /create|/c start )', 821 | string, re.IGNORECASE): 822 | localStringScores[string] += 3 823 | # JavaScript 824 | if re.search(r'(new ActiveXObject\("WScript.Shell"\).Run|.Run\("cmd.exe|.Run\("%comspec%\)|' 825 | r'.Run\("c:\\Windows|.RegisterXLL\()', string, re.IGNORECASE): 826 | localStringScores[string] += 3 827 | # Signing Certificates 828 | if re.search(r'( Inc | Co.| Ltd.,| LLC| Limited)', string): 829 | localStringScores[string] += 2 830 | # Privilege escalation 831 | if re.search(r'(sysprep|cryptbase|secur32)', string, re.IGNORECASE): 832 | localStringScores[string] += 2 833 | # Webshells 834 | if re.search(r'(isset\($post\[|isset\($get\[|eval\(Request)', string, re.IGNORECASE): 835 | localStringScores[string] += 2 836 | # Suspicious words 1 837 | if re.search(r'(impersonate|drop|upload|download|execute|shell|\bcmd\b|decode|rot13|decrypt)', string, 838 | re.IGNORECASE): 839 | localStringScores[string] += 2 840 | # Suspicious words 1 841 | if re.search(r'([+] |[-] |[*] |injecting|exploit|dumped|dumping|scanning|scanned|elevation|' 842 | r'elevated|payload|vulnerable|payload|reverse connect|bind shell|reverse shell| dump | ' 843 | r'back connect |privesc|privilege escalat|debug privilege| inject |interactive shell|' 844 | r'shell commands| spawning |] target |] Transmi|] Connect|] connect|] Dump|] command |' 845 | r'] token|] Token |] Firing | hashes | etc/passwd| SAM | NTML|unsupported target|' 846 | r'race condition|Token system |LoaderConfig| add user |ile upload |ile download |' 847 | r'Attaching to |ser has been successfully added|target system |LSA Secrets|DefaultPassword|' 848 | r'Password: |loading dll|.Execute\(|Shellcode|Loader|inject x86|inject x64|bypass|katz|' 849 | r'sploit|ms[0-9][0-9][^0-9]|\bCVE[^a-zA-Z]|privilege::|lsadump|door)', 850 | string, re.IGNORECASE): 851 | localStringScores[string] += 4 852 | # Mutex / Named Pipes 853 | if re.search(r'(Mutex|NamedPipe|\\Global\\|\\pipe\\)', string, re.IGNORECASE): 854 | localStringScores[string] += 3 855 | # Usage 856 | if re.search(r'(isset\($post\[|isset\($get\[)', string, re.IGNORECASE): 857 | localStringScores[string] += 2 858 | # Hash 859 | if re.search(r'\b([a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64})\b', string, re.IGNORECASE): 860 | localStringScores[string] += 2 861 | # Persistence 862 | if re.search(r'(sc.exe |schtasks|at \\\\|at [0-9]{2}:[0-9]{2})', string, re.IGNORECASE): 863 | localStringScores[string] += 3 864 | # Unix/Linux 865 | if re.search(r'(;chmod |; chmod |sh -c|/dev/tcp/|/bin/telnet|selinux| shell| cp /bin/sh )', string, 866 | re.IGNORECASE): 867 | localStringScores[string] += 3 868 | # Attack 869 | if re.search( 870 | r'(attacker|brute force|bruteforce|connecting back|EXHAUSTIVE|exhaustion| spawn| evil| elevated)', 871 | string, re.IGNORECASE): 872 | localStringScores[string] += 3 873 | # Strings with less value 874 | if re.search(r'(abcdefghijklmnopqsst|ABCDEFGHIJKLMNOPQRSTUVWXYZ|0123456789:;)', string, re.IGNORECASE): 875 | localStringScores[string] -= 5 876 | # VB Backdoors 877 | if re.search( 878 | r'(kill|wscript|plugins|svr32|Select |)', 879 | string, re.IGNORECASE): 880 | localStringScores[string] += 3 881 | # Suspicious strings - combo / special characters 882 | if re.search( 883 | r'([a-z]{4,}[!\?]|\[[!+\-]\] |[a-zA-Z]{4,}...)', 884 | string, re.IGNORECASE): 885 | localStringScores[string] += 3 886 | if re.search( 887 | r'(-->|!!!| <<< | >>> )', 888 | string, re.IGNORECASE): 889 | localStringScores[string] += 5 890 | # Swear words 891 | if re.search( 892 | r'\b(fuck|damn|shit|penis)\b', 893 | string, re.IGNORECASE): 894 | localStringScores[string] += 5 895 | # Scripting Strings 896 | if re.search( 897 | r'(%APPDATA%|%USERPROFILE%|Public|Roaming|& del|& rm| && |script)', 898 | string, re.IGNORECASE): 899 | localStringScores[string] += 3 900 | # UACME Bypass 901 | if re.search( 902 | r'(Elevation|pwnd|pawn|elevate to)', 903 | string, re.IGNORECASE): 904 | localStringScores[string] += 3 905 | 906 | # ENCODING DETECTIONS -------------------------------------------------- 907 | try: 908 | if len(string) > 8: 909 | # Try different ways - fuzz string 910 | # Base64 911 | if args.trace: 912 | print("Starting Base64 string analysis ...") 913 | for m_string in (string, string[1:], string[:-1], string[1:] + "=", string + "=", string + "=="): 914 | if is_base_64(m_string): 915 | try: 916 | decoded_string = base64.b64decode(m_string, validate=False) 917 | except binascii.Error as e: 918 | continue 919 | if is_ascii_string(decoded_string, padding_allowed=True): 920 | # print "match" 921 | localStringScores[string] += 10 922 | base64strings[string] = decoded_string 923 | # Hex Encoded string 924 | if args.trace: 925 | print("Starting Hex encoded string analysis ...") 926 | for m_string in ([string, re.sub('[^a-zA-Z0-9]', '', string)]): 927 | #print m_string 928 | if is_hex_encoded(m_string): 929 | #print("^ is HEX") 930 | decoded_string = bytes.fromhex(m_string) 931 | #print removeNonAsciiDrop(decoded_string) 932 | if is_ascii_string(decoded_string, padding_allowed=True): 933 | # not too many 00s 934 | if '00' in m_string: 935 | if len(m_string) / float(m_string.count('0')) <= 1.2: 936 | continue 937 | #print("^ is ASCII / WIDE") 938 | localStringScores[string] += 8 939 | hexEncStrings[string] = decoded_string 940 | except Exception as e: 941 | if args.debug: 942 | traceback.print_exc() 943 | pass 944 | 945 | # Reversed String ----------------------------------------------------- 946 | if string[::-1] in good_strings_db: 947 | localStringScores[string] += 10 948 | reversedStrings[string] = string[::-1] 949 | 950 | # Certain string reduce ----------------------------------------------- 951 | if re.search(r'(rundll32\.exe$|kernel\.dll$)', string, re.IGNORECASE): 952 | localStringScores[string] -= 4 953 | 954 | # Set the global string score 955 | stringScores[original_string] = localStringScores[string] 956 | 957 | if args.debug: 958 | if string in utfstrings: 959 | is_utf = True 960 | else: 961 | is_utf = False 962 | # print "SCORE: %s\tUTF: %s\tSTRING: %s" % ( localStringScores[string], is_utf, string ) 963 | 964 | sorted_set = sorted(localStringScores.items(), key=operator.itemgetter(1), reverse=True) 965 | 966 | # Only the top X strings 967 | c = 0 968 | result_set = [] 969 | for string in sorted_set: 970 | 971 | # Skip the one with a score lower than -z X 972 | if not args.noscorefilter and not args.inverse: 973 | if string[1] < int(args.z): 974 | continue 975 | 976 | if string[0] in utfstrings: 977 | result_set.append("UTF16LE:%s" % string[0]) 978 | else: 979 | result_set.append(string[0]) 980 | 981 | #c += 1 982 | #if c > int(args.rc): 983 | # break 984 | 985 | if args.trace: 986 | print("RESULT SET:") 987 | print(result_set) 988 | 989 | # return the filtered set 990 | return result_set 991 | 992 | 993 | def generate_general_condition(file_info): 994 | """ 995 | Generates a general condition for a set of files 996 | :param file_info: 997 | :return: 998 | """ 999 | conditions_string = "" 1000 | conditions = [] 1001 | pe_module_neccessary = False 1002 | 1003 | # Different Magic Headers and File Sizes 1004 | magic_headers = [] 1005 | file_sizes = [] 1006 | imphashes = [] 1007 | 1008 | try: 1009 | for filePath in file_info: 1010 | # Short file name info used for inverse generation has no magic/size fields 1011 | if "magic" not in file_info[filePath]: 1012 | continue 1013 | magic = file_info[filePath]["magic"] 1014 | size = file_info[filePath]["size"] 1015 | imphash = file_info[filePath]["imphash"] 1016 | 1017 | # Add them to the lists 1018 | if magic not in magic_headers and magic != "": 1019 | magic_headers.append(magic) 1020 | if size not in file_sizes: 1021 | file_sizes.append(size) 1022 | if imphash not in imphashes and imphash != "": 1023 | imphashes.append(imphash) 1024 | 1025 | # If different magic headers are less than 5 1026 | if len(magic_headers) <= 5: 1027 | magic_string = " or ".join(get_uint_string(h) for h in magic_headers) 1028 | if " or " in magic_string: 1029 | conditions.append("( {0} )".format(magic_string)) 1030 | else: 1031 | conditions.append("{0}".format(magic_string)) 1032 | 1033 | # Biggest size multiplied with maxsize_multiplier 1034 | if not args.nofilesize and len(file_sizes) > 0: 1035 | conditions.append(get_file_range(max(file_sizes))) 1036 | 1037 | # If different magic headers are less than 5 1038 | if len(imphashes) == 1: 1039 | conditions.append("pe.imphash() == \"{0}\"".format(imphashes[0])) 1040 | pe_module_neccessary = True 1041 | 1042 | # If enough attributes were special 1043 | condition_string = " and ".join(conditions) 1044 | 1045 | except Exception as e: 1046 | if args.debug: 1047 | traceback.print_exc() 1048 | exit(1) 1049 | print("[E] ERROR while generating general condition - check the global rule and remove it if it's faulty") 1050 | 1051 | return condition_string, pe_module_neccessary 1052 | 1053 | 1054 | def generate_rules(file_strings, file_opcodes, super_rules, file_info, inverse_stats): 1055 | # Write to file --------------------------------------------------- 1056 | if args.o: 1057 | try: 1058 | fh = open(args.o, 'w') 1059 | except Exception as e: 1060 | traceback.print_exc() 1061 | 1062 | # General Info 1063 | general_info = "/*\n" 1064 | general_info += " YARA Rule Set\n" 1065 | general_info += " Author: {0}\n".format(args.a) 1066 | general_info += " Date: {0}\n".format(get_timestamp_basic()) 1067 | general_info += " Identifier: {0}\n".format(identifier) 1068 | general_info += " Reference: {0}\n".format(reference) 1069 | if args.l != "": 1070 | general_info += " License: {0}\n".format(args.l) 1071 | general_info += "*/\n\n" 1072 | 1073 | if args.ai: 1074 | fh.write(AI_COMMENT) 1075 | else: 1076 | fh.write(general_info) 1077 | 1078 | # GLOBAL RULES ---------------------------------------------------- 1079 | if args.globalrule: 1080 | 1081 | condition, pe_module_necessary = generate_general_condition(file_info) 1082 | 1083 | # Global Rule 1084 | if condition != "": 1085 | global_rule = "/* Global Rule -------------------------------------------------------------- */\n" 1086 | global_rule += "/* Will be evaluated first, speeds up scanning process, remove at will */\n\n" 1087 | global_rule += "global private rule gen_characteristics {\n" 1088 | global_rule += " condition:\n" 1089 | global_rule += " {0}\n".format(condition) 1090 | global_rule += "}\n\n" 1091 | 1092 | # Write rule 1093 | if args.o: 1094 | fh.write(global_rule) 1095 | 1096 | # General vars 1097 | rules = "" 1098 | printed_rules = {} 1099 | opcodes_to_add = [] 1100 | rule_count = 0 1101 | inverse_rule_count = 0 1102 | super_rule_count = 0 1103 | pe_module_necessary = False 1104 | 1105 | if not args.inverse: 1106 | # PROCESS SIMPLE RULES ---------------------------------------------------- 1107 | print("[+] Generating Simple Rules ...") 1108 | # Apply intelligent filters 1109 | print("[-] Applying intelligent filters to string findings ...") 1110 | for filePath in file_strings: 1111 | 1112 | print("[-] Filtering string set for %s ..." % filePath) 1113 | 1114 | # Replace the original string set with the filtered one 1115 | file_strings[filePath] = filter_string_set(file_strings[filePath]) 1116 | 1117 | print("[-] Filtering opcode set for %s ..." % filePath) 1118 | 1119 | # Replace the original opcode set with the filtered one 1120 | file_opcodes[filePath] = filter_opcode_set(file_opcodes[filePath]) if filePath in file_opcodes else [] 1121 | 1122 | # GENERATE SIMPLE RULES ------------------------------------------- 1123 | fh.write("/* Rule Set ----------------------------------------------------------------- */\n\n") 1124 | 1125 | for filePath in file_strings: 1126 | 1127 | # Skip if there is nothing to do 1128 | if len(file_strings[filePath]) == 0: 1129 | print("[W] Not enough high scoring strings to create a rule. " 1130 | "(Try -z 0 to reduce the min score or --opcodes to include opcodes) FILE: %s" % filePath) 1131 | continue 1132 | elif len(file_strings[filePath]) == 0 and len(file_opcodes[filePath]) == 0: 1133 | print("[W] Not enough high scoring strings and opcodes to create a rule. " \ 1134 | "(Try -z 0 to reduce the min score) FILE: %s" % filePath) 1135 | continue 1136 | 1137 | # Create Rule 1138 | try: 1139 | rule = "" 1140 | (path, file) = os.path.split(filePath) 1141 | # Prepare name 1142 | fileBase = os.path.splitext(file)[0] 1143 | # Create a clean new name 1144 | cleanedName = fileBase 1145 | # Adapt length of rule name 1146 | if len(fileBase) < 8: # if name is too short add part from path 1147 | cleanedName = path.split('\\')[-1:][0] + "_" + cleanedName 1148 | # File name starts with a number 1149 | if re.search(r'^[0-9]', cleanedName): 1150 | cleanedName = "sig_" + cleanedName 1151 | # clean name from all characters that would cause errors 1152 | cleanedName = re.sub(r'[^\w]', '_', cleanedName) 1153 | # Check if already printed 1154 | if cleanedName in printed_rules: 1155 | printed_rules[cleanedName] += 1 1156 | cleanedName = cleanedName + "_" + str(printed_rules[cleanedName]) 1157 | else: 1158 | printed_rules[cleanedName] = 1 1159 | 1160 | # Print rule title ---------------------------------------- 1161 | rule += "rule %s {\n" % cleanedName 1162 | 1163 | # Meta data ----------------------------------------------- 1164 | rule += " meta:\n" 1165 | rule += " description = \"%s - file %s\"\n" % (prefix, file) 1166 | rule += " author = \"%s\"\n" % args.a 1167 | rule += " reference = \"%s\"\n" % reference 1168 | rule += " date = \"%s\"\n" % get_timestamp_basic() 1169 | rule += " hash1 = \"%s\"\n" % file_info[filePath]["hash"] 1170 | rule += " strings:\n" 1171 | 1172 | # Get the strings ----------------------------------------- 1173 | # Rule String generation 1174 | (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \ 1175 | get_rule_strings(file_strings[filePath], file_opcodes[filePath]) 1176 | rule += rule_strings 1177 | 1178 | # Extract rul strings 1179 | if args.strings: 1180 | strings = get_strings(file_strings[filePath]) 1181 | write_strings(filePath, strings, args.e, args.score) 1182 | 1183 | # Condition ----------------------------------------------- 1184 | # Conditions list (will later be joined with 'or') 1185 | conditions = [] # AND connected 1186 | subconditions = [] # OR connected 1187 | 1188 | # Condition PE 1189 | # Imphash and Exports - applicable to PE files only 1190 | condition_pe = [] 1191 | condition_pe_part1 = [] 1192 | condition_pe_part2 = [] 1193 | if not args.noextras and file_info[filePath]["magic"] == "MZ": 1194 | # Add imphash - if certain conditions are met 1195 | if file_info[filePath]["imphash"] not in good_imphashes_db and file_info[filePath]["imphash"] != "": 1196 | # Comment to imphash 1197 | imphash = file_info[filePath]["imphash"] 1198 | comment = "" 1199 | if imphash in KNOWN_IMPHASHES: 1200 | comment = " /* {0} */".format(KNOWN_IMPHASHES[imphash]) 1201 | # Add imphash to condition 1202 | condition_pe_part1.append("pe.imphash() == \"{0}\"{1}".format(imphash, comment)) 1203 | pe_module_necessary = True 1204 | if file_info[filePath]["exports"]: 1205 | e_count = 0 1206 | for export in file_info[filePath]["exports"]: 1207 | if export not in good_exports_db: 1208 | condition_pe_part2.append("pe.exports(\"{0}\")".format(export)) 1209 | e_count += 1 1210 | pe_module_necessary = True 1211 | if e_count > 5: 1212 | break 1213 | 1214 | # 1st Part of Condition 1 1215 | basic_conditions = [] 1216 | # Filesize 1217 | if not args.nofilesize: 1218 | basic_conditions.insert(0, get_file_range(file_info[filePath]["size"])) 1219 | # Magic 1220 | if file_info[filePath]["magic"] != "": 1221 | uint_string = get_uint_string(file_info[filePath]["magic"]) 1222 | basic_conditions.insert(0, uint_string) 1223 | # Basic Condition 1224 | if len(basic_conditions): 1225 | conditions.append(" and ".join(basic_conditions)) 1226 | 1227 | # Add extra PE conditions to condition 1 1228 | pe_conditions_add = False 1229 | if condition_pe_part1 or condition_pe_part2: 1230 | if len(condition_pe_part1) == 1: 1231 | condition_pe.append(condition_pe_part1[0]) 1232 | elif len(condition_pe_part1) > 1: 1233 | condition_pe.append("( %s )" % " or ".join(condition_pe_part1)) 1234 | if len(condition_pe_part2) == 1: 1235 | condition_pe.append(condition_pe_part2[0]) 1236 | elif len(condition_pe_part2) > 1: 1237 | condition_pe.append("( %s )" % " and ".join(condition_pe_part2)) 1238 | # Marker that PE conditions have been added 1239 | pe_conditions_add = True 1240 | # Add to sub condition 1241 | subconditions.append(" and ".join(condition_pe)) 1242 | 1243 | # String combinations 1244 | cond_op = "" # opcodes condition 1245 | cond_hs = "" # high scoring strings condition 1246 | cond_ls = "" # low scoring strings condition 1247 | 1248 | low_scoring_strings = (string_rule_count - high_scoring_strings) 1249 | if high_scoring_strings > 0: 1250 | cond_hs = "1 of ($x*)" 1251 | if low_scoring_strings > 0: 1252 | if low_scoring_strings > 10: 1253 | if high_scoring_strings > 0: 1254 | cond_ls = "4 of them" 1255 | else: 1256 | cond_ls = "8 of them" 1257 | else: 1258 | cond_ls = "all of them" 1259 | 1260 | # If low scoring and high scoring 1261 | cond_combined = "all of them" 1262 | needs_brackets = False 1263 | if low_scoring_strings > 0 and high_scoring_strings > 0: 1264 | # If PE conditions have been added, don't be so strict with the strings 1265 | if pe_conditions_add: 1266 | cond_combined = "{0} or {1}".format(cond_hs, cond_ls) 1267 | needs_brackets = True 1268 | else: 1269 | cond_combined = "{0} and {1}".format(cond_hs, cond_ls) 1270 | elif low_scoring_strings > 0 and not high_scoring_strings > 0: 1271 | cond_combined = "{0}".format(cond_ls) 1272 | elif not low_scoring_strings > 0 and high_scoring_strings > 0: 1273 | cond_combined = "{0}".format(cond_hs) 1274 | if opcodes_included: 1275 | cond_op = " and all of ($op*)" 1276 | 1277 | # Opcodes (if needed) 1278 | if cond_op or needs_brackets: 1279 | subconditions.append("( {0}{1} )".format(cond_combined, cond_op)) 1280 | else: 1281 | subconditions.append(cond_combined) 1282 | 1283 | # Now add string condition to the conditions 1284 | if len(subconditions) == 1: 1285 | conditions.append(subconditions[0]) 1286 | elif len(subconditions) > 1: 1287 | conditions.append("( %s )" % " or ".join(subconditions)) 1288 | 1289 | # Create condition string 1290 | condition_string = " and\n ".join(conditions) 1291 | 1292 | rule += " condition:\n" 1293 | rule += " %s\n" % condition_string 1294 | rule += "}\n\n" 1295 | 1296 | # Add to rules string 1297 | rules += rule 1298 | 1299 | rule_count += 1 1300 | except Exception as e: 1301 | traceback.print_exc() 1302 | 1303 | # GENERATE SUPER RULES -------------------------------------------- 1304 | if not nosuper and not args.inverse: 1305 | 1306 | rules += "/* Super Rules ------------------------------------------------------------- */\n\n" 1307 | super_rule_names = [] 1308 | 1309 | print("[+] Generating Super Rules ...") 1310 | printed_combi = {} 1311 | for super_rule in super_rules: 1312 | try: 1313 | rule = "" 1314 | # Prepare Name 1315 | rule_name = "" 1316 | file_list = [] 1317 | 1318 | # Loop through files 1319 | imphashes = Counter() 1320 | for filePath in super_rule["files"]: 1321 | (path, file) = os.path.split(filePath) 1322 | file_list.append(file) 1323 | # Prepare name 1324 | fileBase = os.path.splitext(file)[0] 1325 | # Create a clean new name 1326 | cleanedName = fileBase 1327 | # Append it to the full name 1328 | rule_name += "_" + cleanedName 1329 | # Check if imphash of all files is equal 1330 | imphash = file_info[filePath]["imphash"] 1331 | if imphash != "-" and imphash != "": 1332 | imphashes.update([imphash]) 1333 | 1334 | # Imphash usable 1335 | if len(imphashes) == 1: 1336 | unique_imphash = list(imphashes.items())[0][0] 1337 | if unique_imphash in good_imphashes_db: 1338 | unique_imphash = "" 1339 | 1340 | # Shorten rule name 1341 | rule_name = rule_name[:124] 1342 | # Add count if rule name already taken 1343 | if rule_name not in super_rule_names: 1344 | rule_name = "%s_%s" % (rule_name, super_rule_count) 1345 | super_rule_names.append(rule_name) 1346 | 1347 | # Create a list of files 1348 | file_listing = ", ".join(file_list) 1349 | 1350 | # File name starts with a number 1351 | if re.search(r'^[0-9]', rule_name): 1352 | rule_name = "sig_" + rule_name 1353 | # clean name from all characters that would cause errors 1354 | rule_name = re.sub(r'[^\w]', '_', rule_name) 1355 | # Check if already printed 1356 | if rule_name in printed_rules: 1357 | printed_combi[rule_name] += 1 1358 | rule_name = rule_name + "_" + str(printed_combi[rule_name]) 1359 | else: 1360 | printed_combi[rule_name] = 1 1361 | 1362 | # Print rule title 1363 | rule += "rule %s {\n" % rule_name 1364 | rule += " meta:\n" 1365 | rule += " description = \"%s - from files %s\"\n" % (prefix, file_listing) 1366 | rule += " author = \"%s\"\n" % args.a 1367 | rule += " reference = \"%s\"\n" % reference 1368 | rule += " date = \"%s\"\n" % get_timestamp_basic() 1369 | for i, filePath in enumerate(super_rule["files"]): 1370 | rule += " hash%s = \"%s\"\n" % (str(i + 1), file_info[filePath]["hash"]) 1371 | 1372 | rule += " strings:\n" 1373 | 1374 | # Adding the opcodes 1375 | if file_opcodes.get(filePath) is None: 1376 | tmp_file_opcodes = {} 1377 | else: 1378 | tmp_file_opcodes = file_opcodes.get(filePath) 1379 | (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \ 1380 | get_rule_strings(super_rule["strings"], tmp_file_opcodes) 1381 | rule += rule_strings 1382 | 1383 | # Condition ----------------------------------------------- 1384 | # Conditions list (will later be joined with 'or') 1385 | conditions = [] 1386 | 1387 | # 1st condition 1388 | # Evaluate the general characteristics 1389 | file_info_super = {} 1390 | for filePath in super_rule["files"]: 1391 | file_info_super[filePath] = file_info[filePath] 1392 | condition_strings, pe_module_necessary_gen = generate_general_condition(file_info_super) 1393 | if pe_module_necessary_gen: 1394 | pe_module_necessary = True 1395 | 1396 | # 2nd condition 1397 | # String combinations 1398 | cond_op = "" # opcodes condition 1399 | cond_hs = "" # high scoring strings condition 1400 | cond_ls = "" # low scoring strings condition 1401 | 1402 | low_scoring_strings = (string_rule_count - high_scoring_strings) 1403 | if high_scoring_strings > 0: 1404 | cond_hs = "1 of ($x*)" 1405 | if low_scoring_strings > 0: 1406 | if low_scoring_strings > 10: 1407 | if high_scoring_strings > 0: 1408 | cond_ls = "4 of them" 1409 | else: 1410 | cond_ls = "8 of them" 1411 | else: 1412 | cond_ls = "all of them" 1413 | 1414 | # If low scoring and high scoring 1415 | cond_combined = "all of them" 1416 | if low_scoring_strings > 0 and high_scoring_strings > 0: 1417 | cond_combined = "{0} and {1}".format(cond_hs, cond_ls) 1418 | elif low_scoring_strings > 0 and not high_scoring_strings > 0: 1419 | cond_combined = "{0}".format(cond_ls) 1420 | elif not low_scoring_strings > 0 and high_scoring_strings > 0: 1421 | cond_combined = "{0}".format(cond_hs) 1422 | if opcodes_included: 1423 | cond_op = " and all of ($op*)" 1424 | 1425 | condition2 = "( {0} ){1}".format(cond_combined, cond_op) 1426 | conditions.append(" and ".join([condition_strings, condition2])) 1427 | 1428 | # 3nd condition 1429 | # In memory detection base condition (no magic, no filesize) 1430 | condition_pe = "all of them" 1431 | conditions.append(condition_pe) 1432 | 1433 | # Create condition string 1434 | condition_string = "\n ) or ( ".join(conditions) 1435 | 1436 | rule += " condition:\n" 1437 | rule += " ( %s )\n" % condition_string 1438 | rule += "}\n\n" 1439 | 1440 | # print rule 1441 | # Add to rules string 1442 | rules += rule 1443 | 1444 | super_rule_count += 1 1445 | except Exception as e: 1446 | traceback.print_exc() 1447 | 1448 | try: 1449 | # WRITING RULES TO FILE 1450 | # PE Module ------------------------------------------------------- 1451 | if not args.noextras: 1452 | if pe_module_necessary: 1453 | fh.write('import "pe"\n\n') 1454 | # RULES ----------------------------------------------------------- 1455 | if args.o: 1456 | fh.write(rules) 1457 | except Exception as e: 1458 | traceback.print_exc() 1459 | 1460 | # PROCESS INVERSE RULES --------------------------------------------------- 1461 | # print inverse_stats.keys() 1462 | if args.inverse: 1463 | print("[+] Generating inverse rules ...") 1464 | inverse_rules = "" 1465 | # Apply intelligent filters ------------------------------------------- 1466 | print("[+] Applying intelligent filters to string findings ...") 1467 | for fileName in inverse_stats: 1468 | 1469 | print("[-] Filtering string set for %s ..." % fileName) 1470 | 1471 | # Replace the original string set with the filtered one 1472 | string_set = inverse_stats[fileName] 1473 | inverse_stats[fileName] = [] 1474 | inverse_stats[fileName] = filter_string_set(string_set) 1475 | 1476 | # Preset if empty 1477 | if fileName not in file_opcodes: 1478 | file_opcodes[fileName] = {} 1479 | 1480 | # GENERATE INVERSE RULES ------------------------------------------- 1481 | fh.write("/* Inverse Rules ------------------------------------------------------------- */\n\n") 1482 | 1483 | for fileName in inverse_stats: 1484 | try: 1485 | rule = "" 1486 | # Create a clean new name 1487 | cleanedName = fileName.replace(".", "_") 1488 | # Add ANOMALY 1489 | cleanedName += "_ANOMALY" 1490 | # File name starts with a number 1491 | if re.search(r'^[0-9]', cleanedName): 1492 | cleanedName = "sig_" + cleanedName 1493 | # clean name from all characters that would cause errors 1494 | cleanedName = re.sub(r'[^\w]', '_', cleanedName) 1495 | # Check if already printed 1496 | if cleanedName in printed_rules: 1497 | printed_rules[cleanedName] += 1 1498 | cleanedName = cleanedName + "_" + str(printed_rules[cleanedName]) 1499 | else: 1500 | printed_rules[cleanedName] = 1 1501 | 1502 | # Print rule title ---------------------------------------- 1503 | rule += "rule %s {\n" % cleanedName 1504 | 1505 | # Meta data ----------------------------------------------- 1506 | rule += " meta:\n" 1507 | rule += " description = \"%s for anomaly detection - file %s\"\n" % (prefix, fileName) 1508 | rule += " author = \"%s\"\n" % args.a 1509 | rule += " reference = \"%s\"\n" % reference 1510 | rule += " date = \"%s\"\n" % get_timestamp_basic() 1511 | for i, hash in enumerate(file_info[fileName]["hashes"]): 1512 | rule += " hash%s = \"%s\"\n" % (str(i + 1), hash) 1513 | 1514 | rule += " strings:\n" 1515 | 1516 | # Get the strings ----------------------------------------- 1517 | # Rule String generation 1518 | (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \ 1519 | get_rule_strings(inverse_stats[fileName], file_opcodes[fileName]) 1520 | rule += rule_strings 1521 | 1522 | # Condition ----------------------------------------------- 1523 | folderNames = "" 1524 | if not args.nodirname: 1525 | folderNames += "and ( filepath matches /" 1526 | folderNames += "$/ or filepath matches /".join(file_info[fileName]["folder_names"]) 1527 | folderNames += "$/ )" 1528 | condition = "filename == \"%s\" %s and not ( all of them )" % (fileName, folderNames) 1529 | 1530 | rule += " condition:\n" 1531 | rule += " %s\n" % condition 1532 | rule += "}\n\n" 1533 | 1534 | # print rule 1535 | # Add to rules string 1536 | inverse_rules += rule 1537 | 1538 | except Exception as e: 1539 | traceback.print_exc() 1540 | 1541 | try: 1542 | # Try to write rule to file 1543 | if args.o: 1544 | fh.write(inverse_rules) 1545 | inverse_rule_count += 1 1546 | except Exception as e: 1547 | traceback.print_exc() 1548 | 1549 | # Close the rules file -------------------------------------------- 1550 | if args.o: 1551 | try: 1552 | fh.close() 1553 | except Exception as e: 1554 | traceback.print_exc() 1555 | 1556 | # Print rules to command line ------------------------------------- 1557 | if args.debug: 1558 | print(rules) 1559 | 1560 | return (rule_count, inverse_rule_count, super_rule_count) 1561 | 1562 | 1563 | def get_rule_strings(string_elements, opcode_elements): 1564 | rule_strings = "" 1565 | high_scoring_strings = 0 1566 | string_rule_count = 0 1567 | 1568 | # Adding the strings -------------------------------------- 1569 | for i, string in enumerate(string_elements): 1570 | 1571 | # Collect the data 1572 | is_fullword = True 1573 | initial_string = string 1574 | enc = " ascii" 1575 | base64comment = "" 1576 | hexEncComment = "" 1577 | reversedComment = "" 1578 | fullword = "" 1579 | pestudio_comment = "" 1580 | score_comment = "" 1581 | goodware_comment = "" 1582 | 1583 | if string in good_strings_db: 1584 | goodware_comment = " /* Goodware String - occured %s times */" % (good_strings_db[string]) 1585 | 1586 | if string in stringScores: 1587 | if args.score: 1588 | score_comment += " /* score: '%.2f'*/" % (stringScores[string]) 1589 | else: 1590 | print("NO SCORE: %s" % string) 1591 | 1592 | if string[:8] == "UTF16LE:": 1593 | string = string[8:] 1594 | enc = " wide" 1595 | if string in base64strings: 1596 | base64comment = " /* base64 encoded string '%s' */" % base64strings[string].decode() 1597 | if string in hexEncStrings: 1598 | hexEncComment = " /* hex encoded string '%s' */" % removeNonAsciiDrop(hexEncStrings[string]).decode() 1599 | if string in pestudioMarker and args.score: 1600 | pestudio_comment = " /* PEStudio Blacklist: %s */" % pestudioMarker[string] 1601 | if string in reversedStrings: 1602 | reversedComment = " /* reversed goodware string '%s' */" % reversedStrings[string] 1603 | 1604 | # Extra checks 1605 | if is_hex_encoded(string, check_length=False): 1606 | is_fullword = False 1607 | 1608 | # Checking string length 1609 | if len(string) >= args.s: 1610 | # cut string 1611 | string = string[:args.s].rstrip("\\") 1612 | # not fullword anymore 1613 | is_fullword = False 1614 | # Show as fullword 1615 | if is_fullword: 1616 | fullword = " fullword" 1617 | 1618 | # Now compose the rule line 1619 | if float(stringScores[initial_string]) > score_highly_specific: 1620 | high_scoring_strings += 1 1621 | rule_strings += " $x%s = \"%s\"%s%s%s%s%s%s%s%s\n" % ( 1622 | str(i + 1), string, fullword, enc, base64comment, reversedComment, pestudio_comment, score_comment, 1623 | goodware_comment, hexEncComment) 1624 | else: 1625 | rule_strings += " $s%s = \"%s\"%s%s%s%s%s%s%s%s\n" % ( 1626 | str(i + 1), string, fullword, enc, base64comment, reversedComment, pestudio_comment, score_comment, 1627 | goodware_comment, hexEncComment) 1628 | 1629 | # If too many string definitions found - cut it at the 1630 | # count defined via command line param -rc 1631 | if (i + 1) >= strings_per_rule: 1632 | break 1633 | 1634 | string_rule_count += 1 1635 | 1636 | # Adding the opcodes -------------------------------------- 1637 | opcodes_included = False 1638 | if len(opcode_elements) > 0: 1639 | rule_strings += "\n" 1640 | for i, opcode in enumerate(opcode_elements): 1641 | rule_strings += " $op%s = { %s }\n" % (str(i), opcode) 1642 | opcodes_included = True 1643 | else: 1644 | if args.opcodes: 1645 | print("[-] Not enough unique opcodes found to include them") 1646 | 1647 | return rule_strings, opcodes_included, string_rule_count, high_scoring_strings 1648 | 1649 | 1650 | def get_strings(string_elements): 1651 | """ 1652 | Get a dictionary of all string types 1653 | :param string_elements: 1654 | :return: 1655 | """ 1656 | strings = { 1657 | "ascii": [], 1658 | "wide": [], 1659 | "base64 encoded": [], 1660 | "hex encoded": [], 1661 | "reversed": [] 1662 | } 1663 | 1664 | # Adding the strings -------------------------------------- 1665 | for i, string in enumerate(string_elements): 1666 | 1667 | if string[:8] == "UTF16LE:": 1668 | string = string[8:] 1669 | strings["wide"].append(string) 1670 | elif string in base64strings: 1671 | strings["base64 encoded"].append(string) 1672 | elif string in hexEncStrings: 1673 | strings["hex encoded"].append(string) 1674 | elif string in reversedStrings: 1675 | strings["reversed"].append(string) 1676 | else: 1677 | strings["ascii"].append(string) 1678 | 1679 | return strings 1680 | 1681 | 1682 | def write_strings(filePath, strings, output_dir, scores): 1683 | """ 1684 | Writes string information to an output file 1685 | :param filePath: 1686 | :param strings: 1687 | :param output_dir: 1688 | :param scores: 1689 | :return: 1690 | """ 1691 | SECTIONS = ["ascii", "wide", "base64 encoded", "hex encoded", "reversed"] 1692 | # File 1693 | filename = os.path.basename(filePath) 1694 | strings_filename = os.path.join(output_dir, "%s_strings.txt" % filename) 1695 | print("[+] Writing strings to file %s" % strings_filename) 1696 | # Strings 1697 | output_string = [] 1698 | for key in SECTIONS: 1699 | # Skip empty 1700 | if len(strings[key]) < 1: 1701 | continue 1702 | # Section 1703 | output_string.append("%s Strings" % key.upper()) 1704 | output_string.append("------------------------------------------------------------------------") 1705 | for string in strings[key]: 1706 | if scores: 1707 | score = "unknown" 1708 | if key == "wide": 1709 | score = stringScores["UTF16LE:%s" % string] 1710 | else: 1711 | score = stringScores[string] 1712 | output_string.append("%d;%s" % (score, string)) 1713 | else: 1714 | output_string.append(string) 1715 | # Empty line between sections 1716 | output_string.append("\n") 1717 | with open(strings_filename, "w") as fh: 1718 | fh.write("\n".join(output_string)) 1719 | 1720 | 1721 | def initialize_pestudio_strings(): 1722 | pestudio_strings = {} 1723 | 1724 | tree = etree.parse(get_abs_path(PE_STRINGS_FILE)) 1725 | 1726 | pestudio_strings["strings"] = tree.findall(".//string") 1727 | pestudio_strings["av"] = tree.findall(".//av") 1728 | pestudio_strings["folder"] = tree.findall(".//folder") 1729 | pestudio_strings["os"] = tree.findall(".//os") 1730 | pestudio_strings["reg"] = tree.findall(".//reg") 1731 | pestudio_strings["guid"] = tree.findall(".//guid") 1732 | pestudio_strings["ssdl"] = tree.findall(".//ssdl") 1733 | pestudio_strings["ext"] = tree.findall(".//ext") 1734 | pestudio_strings["agent"] = tree.findall(".//agent") 1735 | pestudio_strings["oid"] = tree.findall(".//oid") 1736 | pestudio_strings["priv"] = tree.findall(".//priv") 1737 | 1738 | # Obsolete 1739 | # for elem in string_elems: 1740 | # strings.append(elem.text) 1741 | 1742 | return pestudio_strings 1743 | 1744 | 1745 | def get_pestudio_score(string): 1746 | for type in pestudio_strings: 1747 | for elem in pestudio_strings[type]: 1748 | # Full match 1749 | if elem.text.lower() == string.lower(): 1750 | # Exclude the "extension" black list for now 1751 | if type != "ext": 1752 | return 5, type 1753 | return 0, "" 1754 | 1755 | 1756 | def get_opcode_string(opcode): 1757 | return ' '.join(opcode[i:i + 2] for i in range(0, len(opcode), 2)) 1758 | 1759 | 1760 | def get_uint_string(magic): 1761 | if len(magic) == 2: 1762 | return "uint8(0) == 0x{0}{1}".format(magic[0], magic[1]) 1763 | if len(magic) == 4: 1764 | return "uint16(0) == 0x{2}{3}{0}{1}".format(magic[0], magic[1], magic[2], magic[3]) 1765 | return "" 1766 | 1767 | 1768 | def get_file_range(size): 1769 | size_string = "" 1770 | try: 1771 | # max sample size - args.fm times the original size 1772 | max_size_b = size * args.fm 1773 | # Minimum size 1774 | if max_size_b < 1024: 1775 | max_size_b = 1024 1776 | # in KB 1777 | max_size = int(max_size_b / 1024) 1778 | max_size_kb = max_size 1779 | # Round 1780 | if len(str(max_size)) == 2: 1781 | max_size = int(round(max_size, -1)) 1782 | elif len(str(max_size)) == 3: 1783 | max_size = int(round(max_size, -2)) 1784 | elif len(str(max_size)) == 4: 1785 | max_size = int(round(max_size, -3)) 1786 | elif len(str(max_size)) >= 5: 1787 | max_size = int(round(max_size, -3)) 1788 | size_string = "filesize < {0}KB".format(max_size) 1789 | if args.debug: 1790 | print("File Size Eval: SampleSize (b): {0} SizeWithMultiplier (b/Kb): {1} / {2} RoundedSize: {3}".format( 1791 | str(size), str(max_size_b), str(max_size_kb), str(max_size))) 1792 | except Exception as e: 1793 | traceback.print_exc() 1794 | finally: 1795 | return size_string 1796 | 1797 | 1798 | def get_timestamp_basic(date_obj=None): 1799 | if not date_obj: 1800 | date_obj = datetime.datetime.now() 1801 | date_str = date_obj.strftime("%Y-%m-%d") 1802 | return date_str 1803 | 1804 | 1805 | def is_ascii_char(b, padding_allowed=False): 1806 | if padding_allowed: 1807 | if (ord(b) < 127 and ord(b) > 31) or ord(b) == 0: 1808 | return 1 1809 | else: 1810 | if ord(b) < 127 and ord(b) > 31: 1811 | return 1 1812 | return 0 1813 | 1814 | 1815 | def is_ascii_string(string, padding_allowed=False): 1816 | for b in [i.to_bytes(1, sys.byteorder) for i in string]: 1817 | if padding_allowed: 1818 | if not ((ord(b) < 127 and ord(b) > 31) or ord(b) == 0): 1819 | return 0 1820 | else: 1821 | if not (ord(b) < 127 and ord(b) > 31): 1822 | return 0 1823 | return 1 1824 | 1825 | 1826 | def is_base_64(s): 1827 | return (len(s) % 4 == 0) and re.match('^[A-Za-z0-9+/]+[=]{0,2}$', s) 1828 | 1829 | 1830 | def is_hex_encoded(s, check_length=True): 1831 | if re.match('^[A-Fa-f0-9]+$', s): 1832 | if check_length: 1833 | if len(s) % 2 == 0: 1834 | return True 1835 | else: 1836 | return True 1837 | return False 1838 | 1839 | 1840 | # TODO: Still buggy after port to Python3 1841 | def extract_hex_strings(s): 1842 | strings = [] 1843 | hex_strings = re.findall(b"([a-fA-F0-9]{10,})", s) 1844 | for string in list(hex_strings): 1845 | hex_strings += string.split(b'0000') 1846 | hex_strings += string.split(b'0d0a') 1847 | hex_strings += re.findall(b'((?:0000|002[a-f0-9]|00[3-9a-f][0-9a-f]){6,})', string, re.IGNORECASE) 1848 | hex_strings = list(set(hex_strings)) 1849 | # ASCII Encoded Strings 1850 | for string in hex_strings: 1851 | for x in string.split(b'00'): 1852 | if len(x) > 10: 1853 | strings.append(x) 1854 | # WIDE Encoded Strings 1855 | for string in hex_strings: 1856 | try: 1857 | if len(string) % 2 != 0 or len(string) < 8: 1858 | continue 1859 | # Skip 1860 | if b'0000' in string: 1861 | continue 1862 | dec = string.replace(b'00', b'') 1863 | if is_ascii_string(dec, padding_allowed=False): 1864 | strings.append(string) 1865 | except Exception as e: 1866 | traceback.print_exc() 1867 | return strings 1868 | 1869 | 1870 | def removeNonAsciiDrop(string): 1871 | nonascii = "error" 1872 | try: 1873 | byte_list = [i.to_bytes(1, sys.byteorder) for i in string] 1874 | # Generate a new string without disturbing characters 1875 | nonascii = b"".join(i for i in byte_list if ord(i)<127 and ord(i)>31) 1876 | except Exception as e: 1877 | traceback.print_exc() 1878 | pass 1879 | return nonascii 1880 | 1881 | 1882 | def save(object, filename): 1883 | file = gzip.GzipFile(filename, 'wb') 1884 | file.write(bytes(json.dumps(object), 'utf-8')) 1885 | file.close() 1886 | 1887 | 1888 | def load(filename): 1889 | file = gzip.GzipFile(filename, 'rb') 1890 | object = json.loads(file.read()) 1891 | file.close() 1892 | return object 1893 | 1894 | 1895 | def update_databases(): 1896 | # Preparations 1897 | try: 1898 | dbDir = './dbs/' 1899 | if not os.path.exists(dbDir): 1900 | os.makedirs(dbDir) 1901 | except Exception as e: 1902 | if args.debug: 1903 | traceback.print_exc() 1904 | print("Error while creating the database directory ./dbs") 1905 | sys.exit(1) 1906 | 1907 | # Downloading current repository 1908 | try: 1909 | for filename, repo_url in REPO_URLS.items(): 1910 | print("Downloading %s from %s ..." % (filename, repo_url)) 1911 | with urllib.request.urlopen(repo_url) as response, open("./dbs/%s" % filename, 'wb') as out_file: 1912 | shutil.copyfileobj(response, out_file) 1913 | except Exception as e: 1914 | if args.debug: 1915 | traceback.print_exc() 1916 | print("Error while downloading the database file - check your Internet connection " 1917 | "(try to run it with --debug to see the full error message)") 1918 | sys.exit(1) 1919 | 1920 | 1921 | def processSampleDir(targetDir): 1922 | """ 1923 | Processes samples in a given directory and creates a yara rule file 1924 | :param directory: 1925 | :return: 1926 | """ 1927 | # Special strings 1928 | base64strings = {} 1929 | hexEncStrings = {} 1930 | reversedStrings = {} 1931 | pestudioMarker = {} 1932 | stringScores = {} 1933 | 1934 | # Extract all information 1935 | (sample_string_stats, sample_opcode_stats, file_info) = \ 1936 | parse_sample_dir(targetDir, args.nr, generateInfo=True, onlyRelevantExtensions=args.oe) 1937 | 1938 | # Evaluate Strings 1939 | (file_strings, file_opcodes, combinations, super_rules, inverse_stats) = \ 1940 | sample_string_evaluation(sample_string_stats, sample_opcode_stats, file_info) 1941 | 1942 | # Create Rule Files 1943 | (rule_count, inverse_rule_count, super_rule_count) = \ 1944 | generate_rules(file_strings, file_opcodes, super_rules, file_info, inverse_stats) 1945 | 1946 | if args.inverse: 1947 | print("[=] Generated %s INVERSE rules." % str(inverse_rule_count)) 1948 | else: 1949 | print("[=] Generated %s SIMPLE rules." % str(rule_count)) 1950 | if not nosuper: 1951 | print("[=] Generated %s SUPER rules." % str(super_rule_count)) 1952 | print("[=] All rules written to %s" % args.o) 1953 | 1954 | 1955 | def emptyFolder(dir): 1956 | """ 1957 | Removes all files from a given folder 1958 | :return: 1959 | """ 1960 | for file in os.listdir(dir): 1961 | filePath = os.path.join(dir, file) 1962 | try: 1963 | if os.path.isfile(filePath): 1964 | print("[!] Removing %s ..." % filePath) 1965 | os.unlink(filePath) 1966 | except Exception as e: 1967 | print(e) 1968 | 1969 | 1970 | def getReference(ref): 1971 | """ 1972 | Get a reference string - if the provided string is the path to a text file, then read the contents and return it as 1973 | reference 1974 | :param ref: 1975 | :return: 1976 | """ 1977 | if os.path.exists(ref): 1978 | reference = getFileContent(ref) 1979 | print("[+] Read reference from file %s > %s" % (ref, reference)) 1980 | return reference 1981 | else: 1982 | return ref 1983 | 1984 | 1985 | def getIdentifier(id, path): 1986 | """ 1987 | Get a identifier string - if the provided string is the path to a text file, then read the contents and return it as 1988 | reference, otherwise use the last element of the full path 1989 | :param ref: 1990 | :return: 1991 | """ 1992 | # Identifier 1993 | if id == "not set" or not os.path.exists(id): 1994 | # Identifier is the highest folder name 1995 | return os.path.basename(path.rstrip('/')) 1996 | else: 1997 | # Read identifier from file 1998 | identifier = getFileContent(id) 1999 | print("[+] Read identifier from file %s > %s" % (id, identifier)) 2000 | return identifier 2001 | 2002 | 2003 | def getPrefix(prefix, identifier): 2004 | """ 2005 | Get a prefix string for the rule description based on the identifier 2006 | :param prefix: 2007 | :param identifier: 2008 | :return: 2009 | """ 2010 | if prefix == "Auto-generated rule": 2011 | return identifier 2012 | else: 2013 | return prefix 2014 | 2015 | 2016 | def getFileContent(file): 2017 | """ 2018 | Gets the contents of a file (limited to 1024 characters) 2019 | :param file: 2020 | :return: 2021 | """ 2022 | try: 2023 | with open(file) as f: 2024 | return f.read(1024) 2025 | except Exception as e: 2026 | return "not found" 2027 | 2028 | 2029 | # CTRL+C Handler -------------------------------------------------------------- 2030 | def signal_handler(signal_name, frame): 2031 | print("> yarGen's work has been interrupted") 2032 | sys.exit(0) 2033 | 2034 | 2035 | def print_welcome(): 2036 | print("------------------------------------------------------------------------") 2037 | print(" _____ ") 2038 | print(" __ _____ _____/ ___/__ ___ ") 2039 | print(" / // / _ `/ __/ (_ / -_) _ \\ ") 2040 | print(" \\_, /\\_,_/_/ \\___/\\__/_//_/ ") 2041 | print(" /___/ Yara Rule Generator ") 2042 | print(" Florian Roth, August 2023, Version %s" % __version__) 2043 | print(" ") 2044 | print(" Note: Rules have to be post-processed") 2045 | print(" See this post for details: https://medium.com/@cyb3rops/121d29322282") 2046 | print("------------------------------------------------------------------------") 2047 | 2048 | 2049 | # MAIN ################################################################ 2050 | if __name__ == '__main__': 2051 | 2052 | # Signal handler for CTRL+C 2053 | signal_module.signal(signal_module.SIGINT, signal_handler) 2054 | 2055 | # Parse Arguments 2056 | parser = argparse.ArgumentParser(description='yarGen') 2057 | 2058 | group_creation = parser.add_argument_group('Rule Creation') 2059 | group_creation.add_argument('-m', help='Path to scan for malware') 2060 | group_creation.add_argument('-y', help='Minimum string length to consider (default=8)', metavar='min-size', 2061 | default=8) 2062 | group_creation.add_argument('-z', help='Minimum score to consider (default=0)', metavar='min-score', default=0) 2063 | group_creation.add_argument('-x', help='Score required to set string as \'highly specific string\' (default: 30)', 2064 | metavar='high-scoring', default=30) 2065 | group_creation.add_argument('-w', help='Minimum number of strings that overlap to create a super rule (default: 5)', 2066 | metavar='superrule-overlap', default=5) 2067 | group_creation.add_argument('-s', help='Maximum length to consider (default=128)', metavar='max-size', default=128, type=int) 2068 | group_creation.add_argument('-rc', help='Maximum number of strings per rule (default=20, intelligent filtering ' 2069 | 'will be applied)', metavar='maxstrings', default=20) 2070 | group_creation.add_argument('--excludegood', help='Force the exclude all goodware strings', action='store_true', 2071 | default=False) 2072 | 2073 | group_output = parser.add_argument_group('Rule Output') 2074 | group_output.add_argument('-o', help='Output rule file', metavar='output_rule_file', default='yargen_rules.yar') 2075 | group_output.add_argument('-e', help='Output directory for string exports', metavar='output_dir_strings', default='') 2076 | group_output.add_argument('-a', help='Author Name', metavar='author', default='yarGen Rule Generator') 2077 | group_output.add_argument('-r', help='Reference (can be string or text file)', metavar='ref', 2078 | default='https://github.com/Neo23x0/yarGen') 2079 | group_output.add_argument('-l', help='License', metavar='lic', default='') 2080 | group_output.add_argument('-p', help='Prefix for the rule description', metavar='prefix', 2081 | default='Auto-generated rule') 2082 | group_output.add_argument('-b', help='Text file from which the identifier is read (default: last folder name in ' 2083 | 'the full path, e.g. "myRAT" if -m points to /mnt/mal/myRAT)', 2084 | metavar='identifier', 2085 | default='not set') 2086 | group_output.add_argument('--score', help='Show the string scores as comments in the rules', action='store_true', 2087 | default=False) 2088 | group_output.add_argument('--strings', help='Show the string scores as comments in the rules', action='store_true', 2089 | default=False) 2090 | group_output.add_argument('--nosimple', help='Skip simple rule creation for files included in super rules', 2091 | action='store_true', default=False) 2092 | group_output.add_argument('--nomagic', help='Don\'t include the magic header condition statement', 2093 | action='store_true', default=False) 2094 | group_output.add_argument('--nofilesize', help='Don\'t include the filesize condition statement', 2095 | action='store_true', default=False) 2096 | group_output.add_argument('-fm', help='Multiplier for the maximum \'filesize\' condition value (default: 3)', 2097 | default=3) 2098 | group_output.add_argument('--globalrule', help='Create global rules (improved rule set speed)', 2099 | action='store_true', default=False) 2100 | group_output.add_argument('--nosuper', action='store_true', default=False, help='Don\'t try to create super rules ' 2101 | 'that match against various files') 2102 | 2103 | group_db = parser.add_argument_group('Database Operations') 2104 | group_db.add_argument('--update', action='store_true', default=False, help='Update the local strings and opcodes ' 2105 | 'dbs from the online repository') 2106 | group_db.add_argument('-g', help='Path to scan for goodware (dont use the database shipped with yaraGen)') 2107 | group_db.add_argument('-u', action='store_true', default=False, help='Update local standard goodware database with ' 2108 | 'a new analysis result (used with -g)') 2109 | group_db.add_argument('-c', action='store_true', default=False, help='Create new local goodware database ' 2110 | '(use with -g and optionally -i "identifier")') 2111 | group_db.add_argument('-i', default="", help='Specify an identifier for the newly created databases ' 2112 | '(good-strings-identifier.db, good-opcodes-identifier.db)') 2113 | 2114 | group_general = parser.add_argument_group('General Options') 2115 | group_general.add_argument('--dropzone', action='store_true', default=False, 2116 | help='Dropzone mode - monitors a directory [-m] for new samples to process. ' 2117 | 'WARNING: Processed files will be deleted!') 2118 | group_general.add_argument('--nr', action='store_true', default=False, help='Do not recursively scan directories') 2119 | group_general.add_argument('--oe', action='store_true', default=False, help='Only scan executable extensions EXE, ' 2120 | 'DLL, ASP, JSP, PHP, BIN, INFECTED') 2121 | group_general.add_argument('-fs', help='Max file size in MB to analyze (default=10)', metavar='size-in-MB', 2122 | default=10) 2123 | group_general.add_argument('--noextras', action='store_true', default=False, 2124 | help='Don\'t use extras like Imphash or PE header specifics') 2125 | group_general.add_argument('--ai', action='store_true', default=False, help='Create output to be used as ChatGPT4 input') 2126 | group_general.add_argument('--debug', action='store_true', default=False, help='Debug output') 2127 | group_general.add_argument('--trace', action='store_true', default=False, help='Trace output') 2128 | 2129 | group_opcode = parser.add_argument_group('Other Features') 2130 | group_opcode.add_argument('--opcodes', action='store_true', default=False, help='Do use the OpCode feature ' 2131 | '(use this if not enough high ' 2132 | 'scoring strings can be found)') 2133 | group_opcode.add_argument('-n', help='Number of opcodes to add if not enough high scoring string could be found ' 2134 | '(default=3)', metavar='opcode-num', default=3) 2135 | 2136 | group_inverse = parser.add_argument_group('Inverse Mode (unstable)') 2137 | group_inverse.add_argument('--inverse', help=argparse.SUPPRESS, action='store_true', default=False) 2138 | group_inverse.add_argument('--nodirname', help=argparse.SUPPRESS, action='store_true', default=False) 2139 | group_inverse.add_argument('--noscorefilter', help=argparse.SUPPRESS, action='store_true', default=False) 2140 | 2141 | args = parser.parse_args() 2142 | 2143 | # Print Welcome 2144 | print_welcome() 2145 | 2146 | if not args.update and not args.m and not args.g: 2147 | parser.print_help() 2148 | print("") 2149 | print(""" 2150 | [E] You have to select --update to update yarGens database or -m for signature generation or -g for the 2151 | creation of goodware string collections 2152 | (see https://github.com/Neo23x0/yarGen#examples for more details) 2153 | 2154 | Recommended command line: 2155 | python yarGen.py -a 'Your Name' --opcodes --dropzone -m ./dropzone""") 2156 | sys.exit(1) 2157 | 2158 | # Update 2159 | if args.update: 2160 | update_databases() 2161 | print("[+] Updated databases - you can now start creating YARA rules") 2162 | sys.exit(0) 2163 | 2164 | # Typical input erros 2165 | if args.m: 2166 | if os.path.isfile(args.m): 2167 | print("[E] Input is a file, please use a directory instead (-m path)") 2168 | sys.exit(0) 2169 | 2170 | # Opcodes evaluation or not 2171 | use_opcodes = False 2172 | if args.opcodes: 2173 | use_opcodes = True 2174 | 2175 | # Read PEStudio string list 2176 | pestudio_strings = {} 2177 | pestudio_available = False 2178 | 2179 | # Super Rule Generation 2180 | nosuper = args.nosuper 2181 | 2182 | # Identifier 2183 | sourcepath = args.m 2184 | if args.g: 2185 | sourcepath = args.g 2186 | identifier = getIdentifier(args.b, sourcepath) 2187 | print("[+] Using identifier '%s'" % identifier) 2188 | 2189 | # Reference 2190 | reference = getReference(args.r) 2191 | print("[+] Using reference '%s'" % reference) 2192 | 2193 | # Prefix 2194 | prefix = getPrefix(args.p, identifier) 2195 | print("[+] Using prefix '%s'" % prefix) 2196 | 2197 | if os.path.isfile(get_abs_path(PE_STRINGS_FILE)): 2198 | print("[+] Processing PEStudio strings ...") 2199 | pestudio_strings = initialize_pestudio_strings() 2200 | pestudio_available = True 2201 | 2202 | # Highly specific string score 2203 | score_highly_specific = int(args.x) 2204 | 2205 | # Scan goodware files 2206 | if args.g: 2207 | print("[+] Processing goodware files ...") 2208 | good_strings_db, good_opcodes_db, good_imphashes_db, good_exports_db = \ 2209 | parse_good_dir(args.g, args.nr, args.oe) 2210 | 2211 | # Update existing databases 2212 | if args.u: 2213 | try: 2214 | print("[+] Updating databases ...") 2215 | 2216 | # Evaluate the database identifiers 2217 | db_identifier = "" 2218 | if args.i != "": 2219 | db_identifier = "-%s" % args.i 2220 | strings_db = "./dbs/good-strings%s.db" % db_identifier 2221 | opcodes_db = "./dbs/good-opcodes%s.db" % db_identifier 2222 | imphashes_db = "./dbs/good-imphashes%s.db" % db_identifier 2223 | exports_db = "./dbs/good-exports%s.db" % db_identifier 2224 | 2225 | # Strings ----------------------------------------------------- 2226 | print("[+] Updating %s ..." % strings_db) 2227 | good_pickle = load(get_abs_path(strings_db)) 2228 | print("Old string database entries: %s" % len(good_pickle)) 2229 | good_pickle.update(good_strings_db) 2230 | print("New string database entries: %s" % len(good_pickle)) 2231 | save(good_pickle, strings_db) 2232 | 2233 | # Opcodes ----------------------------------------------------- 2234 | print("[+] Updating %s ..." % opcodes_db) 2235 | good_opcode_pickle = load(get_abs_path(opcodes_db)) 2236 | print("Old opcode database entries: %s" % len(good_opcode_pickle)) 2237 | good_opcode_pickle.update(good_opcodes_db) 2238 | print("New opcode database entries: %s" % len(good_opcode_pickle)) 2239 | save(good_opcode_pickle, opcodes_db) 2240 | 2241 | # Imphashes --------------------------------------------------- 2242 | print("[+] Updating %s ..." % imphashes_db) 2243 | good_imphashes_pickle = load(get_abs_path(imphashes_db)) 2244 | print("Old opcode database entries: %s" % len(good_imphashes_pickle)) 2245 | good_imphashes_pickle.update(good_imphashes_db) 2246 | print("New opcode database entries: %s" % len(good_imphashes_pickle)) 2247 | save(good_imphashes_pickle, imphashes_db) 2248 | 2249 | # Exports ----------------------------------------------------- 2250 | print("[+] Updating %s ..." % exports_db) 2251 | good_exports_pickle = load(get_abs_path(exports_db)) 2252 | print("Old opcode database entries: %s" % len(good_exports_pickle)) 2253 | good_exports_pickle.update(good_exports_db) 2254 | print("New opcode database entries: %s" % len(good_exports_pickle)) 2255 | save(good_exports_pickle, exports_db) 2256 | 2257 | except Exception as e: 2258 | traceback.print_exc() 2259 | 2260 | # Create new databases 2261 | if args.c: 2262 | print("[+] Creating local database ...") 2263 | # Evaluate the database identifiers 2264 | db_identifier = "" 2265 | if args.i != "": 2266 | db_identifier = "-%s" % args.i 2267 | strings_db = "./dbs/good-strings%s.db" % db_identifier 2268 | opcodes_db = "./dbs/good-opcodes%s.db" % db_identifier 2269 | imphashes_db = "./dbs/good-imphashes%s.db" % db_identifier 2270 | exports_db = "./dbs/good-exports%s.db" % db_identifier 2271 | 2272 | # Creating the databases 2273 | print("[+] Using '%s' as filename for newly created strings database" % strings_db) 2274 | print("[+] Using '%s' as filename for newly created opcodes database" % opcodes_db) 2275 | print("[+] Using '%s' as filename for newly created opcodes database" % imphashes_db) 2276 | print("[+] Using '%s' as filename for newly created opcodes database" % exports_db) 2277 | 2278 | try: 2279 | 2280 | if os.path.isfile(strings_db): 2281 | input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % strings_db) 2282 | os.remove(strings_db) 2283 | if os.path.isfile(opcodes_db): 2284 | input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % opcodes_db) 2285 | os.remove(opcodes_db) 2286 | if os.path.isfile(imphashes_db): 2287 | input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % imphashes_db) 2288 | os.remove(imphashes_db) 2289 | if os.path.isfile(exports_db): 2290 | input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % exports_db) 2291 | os.remove(exports_db) 2292 | 2293 | # Strings 2294 | good_json = Counter() 2295 | good_json = good_strings_db 2296 | # Opcodes 2297 | good_op_json = Counter() 2298 | good_op_json = good_opcodes_db 2299 | # Imphashes 2300 | good_imphashes_json = Counter() 2301 | good_imphashes_json = good_imphashes_db 2302 | # Exports 2303 | good_exports_json = Counter() 2304 | good_exports_json = good_exports_db 2305 | 2306 | # Save 2307 | save(good_json, strings_db) 2308 | save(good_op_json, opcodes_db) 2309 | save(good_imphashes_json, imphashes_db) 2310 | save(good_exports_json, exports_db) 2311 | 2312 | print("New database with %d string, %d opcode, %d imphash, %d export entries created. " \ 2313 | "(remember to use --opcodes to extract opcodes from the samples and create the opcode databases)"\ 2314 | % (len(good_strings_db), len(good_opcodes_db), len(good_imphashes_db), len(good_exports_db))) 2315 | except Exception as e: 2316 | traceback.print_exc() 2317 | 2318 | # Analyse malware samples and create rules 2319 | else: 2320 | print("[+] Reading goodware strings from database 'good-strings.db' ...") 2321 | print(" (This could take some time and uses several Gigabytes of RAM depending on your db size)") 2322 | 2323 | good_strings_db = Counter() 2324 | good_opcodes_db = Counter() 2325 | good_imphashes_db = Counter() 2326 | good_exports_db = Counter() 2327 | 2328 | opcodes_num = 0 2329 | strings_num = 0 2330 | imphash_num = 0 2331 | exports_num = 0 2332 | 2333 | # Initialize all databases 2334 | for file in os.listdir(get_abs_path("./dbs/")): 2335 | if not file.endswith(".db"): 2336 | continue 2337 | filePath = os.path.join("./dbs/", file) 2338 | # String databases 2339 | if file.startswith("good-strings"): 2340 | try: 2341 | print("[+] Loading %s ..." % filePath) 2342 | good_json = load(get_abs_path(filePath)) 2343 | good_strings_db.update(good_json) 2344 | print("[+] Total: %s / Added %d entries" % ( 2345 | len(good_strings_db), len(good_strings_db) - strings_num)) 2346 | strings_num = len(good_strings_db) 2347 | except Exception as e: 2348 | traceback.print_exc() 2349 | # Opcode databases 2350 | if file.startswith("good-opcodes"): 2351 | try: 2352 | if use_opcodes: 2353 | print("[+] Loading %s ..." % filePath) 2354 | good_op_json = load(get_abs_path(filePath)) 2355 | good_opcodes_db.update(good_op_json) 2356 | print("[+] Total: %s (removed duplicates) / Added %d entries" % ( 2357 | len(good_opcodes_db), len(good_opcodes_db) - opcodes_num)) 2358 | opcodes_num = len(good_opcodes_db) 2359 | except Exception as e: 2360 | use_opcodes = False 2361 | traceback.print_exc() 2362 | # Imphash databases 2363 | if file.startswith("good-imphash"): 2364 | try: 2365 | print("[+] Loading %s ..." % filePath) 2366 | good_imphashes_json = load(get_abs_path(filePath)) 2367 | good_imphashes_db.update(good_imphashes_json) 2368 | print("[+] Total: %s / Added %d entries" % ( 2369 | len(good_imphashes_db), len(good_imphashes_db) - imphash_num)) 2370 | imphash_num = len(good_imphashes_db) 2371 | except Exception as e: 2372 | traceback.print_exc() 2373 | # Export databases 2374 | if file.startswith("good-exports"): 2375 | try: 2376 | print("[+] Loading %s ..." % filePath) 2377 | good_exports_json = load(get_abs_path(filePath)) 2378 | good_exports_db.update(good_exports_json) 2379 | print("[+] Total: %s / Added %d entries" % ( 2380 | len(good_exports_db), len(good_exports_db) - exports_num)) 2381 | exports_num = len(good_exports_db) 2382 | except Exception as e: 2383 | traceback.print_exc() 2384 | 2385 | if use_opcodes and len(good_opcodes_db) < 1: 2386 | print("[E] Missing goodware opcode databases." 2387 | " Please run 'yarGen.py --update' to retrieve the newest database set.") 2388 | use_opcodes = False 2389 | 2390 | if len(good_exports_db) < 1 and len(good_imphashes_db) < 1: 2391 | print("[E] Missing goodware imphash/export databases. " 2392 | " Please run 'yarGen.py --update' to retrieve the newest database set.") 2393 | 2394 | if len(good_strings_db) < 1 and not args.c: 2395 | print("[E] Error - no goodware databases found. " 2396 | " Please run 'yarGen.py --update' to retrieve the newest database set.") 2397 | sys.exit(1) 2398 | 2399 | # If malware directory given 2400 | if args.m: 2401 | 2402 | # Deactivate super rule generation if there's only a single file in the folder 2403 | if len(os.listdir(args.m)) < 2: 2404 | nosuper = True 2405 | 2406 | # AI input generation 2407 | strings_per_rule = int(args.rc) 2408 | if args.ai: 2409 | strings_per_rule = 200 2410 | 2411 | # Special strings 2412 | base64strings = {} 2413 | reversedStrings = {} 2414 | hexEncStrings = {} 2415 | pestudioMarker = {} 2416 | stringScores = {} 2417 | 2418 | # Dropzone mode 2419 | if args.dropzone: 2420 | # Monitoring folder for changes 2421 | print("Monitoring %s for new sample files (processed samples will be removed)" % args.m) 2422 | while(True): 2423 | if len(os.listdir(args.m)) > 0: 2424 | # Deactivate super rule generation if there's only a single file in the folder 2425 | if len(os.listdir(args.m)) < 2: 2426 | nosuper = True 2427 | else: 2428 | nosuper = False 2429 | # Read a new identifier 2430 | identifier = getIdentifier(args.b, args.m) 2431 | # Read a new reference 2432 | reference = getReference(args.r) 2433 | # Generate a new description prefix 2434 | prefix = getPrefix(args.p, identifier) 2435 | # Process the samples 2436 | processSampleDir(args.m) 2437 | # Delete all samples from the dropzone folder 2438 | emptyFolder(args.m) 2439 | time.sleep(1) 2440 | else: 2441 | # Scan malware files 2442 | print("[+] Processing malware files ...") 2443 | processSampleDir(args.m) 2444 | 2445 | print("[+] yarGen run finished") 2446 | --------------------------------------------------------------------------------