├── .gitignore
├── 3rdparty
    └── strings.xml
├── LICENSE
├── README.md
├── prepare-release.sh
├── requirements.txt
├── screens
    ├── output-rule-0.14.1.png
    └── yargen-running.png
├── tools
    └── byte-mapper.py
└── yarGen.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | # IDE & Virtual Environment
 2 | .idea/
 3 | venv/
 4 | 
 5 | # MacOS
 6 | .DS_Store
 7 | .AppleDouble
 8 | .LSOverride
 9 | 
10 | # Thumbnails
11 | ._*
12 | 
13 | # Prebuilt Database
14 | dbs/
15 | 
16 | # YARA Rules
17 | yargen_rules.yar
18 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | yarGen - Yara Rule Generator, Copyright (c) 2015, Florian Roth
 2 | All rights reserved.
 3 | 
 4 | Redistribution and use in source and binary forms, with or without
 5 | modification, are permitted provided that the following conditions are met:
 6 |     * Redistributions of source code must retain the above copyright
 7 |       notice, this list of conditions and the following disclaimer.
 8 |     * Redistributions in binary form must reproduce the above copyright
 9 |       notice, this list of conditions and the following disclaimer in the
10 |       documentation and/or other materials provided with the distribution.
11 |     * Neither the name of the copyright owner nor the
12 |       names of its contributors may be used to endorse or promote products
13 |       derived from this software without specific prior written permission.
14 | 
15 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
16 | ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
17 | WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
18 | DISCLAIMED. IN NO EVENT SHALL Florian Roth BE LIABLE FOR ANY
19 | DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
20 | (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
21 | LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
22 | ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
23 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
24 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | [![Actively Maintained](https://img.shields.io/badge/Maintenance%20Level-Actively%20Maintained-green.svg)](https://gist.github.com/cheerfulstoic/d107229326a01ff0f333a1d3476e068d)
  2 | 
  3 | # yarGen
  4 |                        _____
  5 |         __ _____ _____/ ___/__ ___
  6 |        / // / _ `/ __/ (_ / -_) _ \
  7 |        \_, /\_,_/_/  \___/\__/_//_/
  8 |       /___/  Yara Rule Generator
  9 |              Florian Roth, July 2020, Version 0.23.2
 10 |     
 11 |       Note: Rules have to be post-processed
 12 |       See this post for details: https://medium.com/@cyb3rops/121d29322282
 13 | 
 14 | ### What does yarGen do?
 15 | 
 16 | yarGen is a generator for [YARA](https://github.com/plusvic/yara/) rules
 17 | 
 18 | The main principle is the creation of yara rules from strings found in malware files while removing all strings that also appear in goodware files. Therefore yarGen includes a big goodware strings and opcode database as ZIP archives that have to be extracted before the first use.
 19 | 
 20 | In version 0.24.0, yarGen introduces an output option (`--ai`). This feature generates a YARA rule with an expanded set of strings and includes instructions tailored for an AI. I suggest employing ChatGPT Plus with model 4 to refine these rules. Activating the `--ai` flag appends the instruction text to the `yargen_rules.yar` output file, which can subsequently be fed into your AI for processing.
 21 | 
 22 | With version 0.23.0 yarGen has been ported to Python3. If you'd like to use a version using Python 2, try a previous release. (Note that the download location for the pre-built databases has changed, since the database format has been changed from the outdated `pickle` to `json`. The old databases are still available but in an old location on our web server only used in the old yarGen version <0.23) 
 23 | 
 24 | Since version 0.12.0 yarGen does not completely remove the goodware strings from the analysis process but includes them with a very low score depending on the number of occurrences in goodware samples. The rules will be included if no
 25 | better strings can be found and marked with a comment /* Goodware rule */.
 26 | Force yarGen to remove all goodware strings with --excludegood. Also since version 0.12.0 yarGen allows to place the "strings.xml" from [PEstudio](https://winitor.com/) in the program directory in order to apply the blacklist definition during the string analysis process. You'll get better results.
 27 | 
 28 | Since version 0.14.0 it uses naive-bayes-classifier by Mustafa Atik and Nejdet Yucesoy in order to classify the string and detect useful words instead of compression/encryption garbage.
 29 | 
 30 | Since version 0.15.0 yarGen supports opcode elements extracted from the `.text` sections of PE files. During database creation it splits the `.text` sections with the regex [\x00]{3,} and takes the first 16 bytes of each part
 31 | to build an opcode database from goodware PE files. During rule creation on sample files it compares the goodware opcodes with the opcodes extracted from the malware samples and removes all opcodes that also appear in the goodware
 32 | database. (there is no further magic in it yet - no XOR loop detection etc.) The option to activate opcode integration is '--opcodes'.
 33 | 
 34 | Since version 0.17.0 yarGen allows creating multiple databases for opcodes and strings. You can now easily create a new database by using "-c" and an identifier "-i identifier" e.g. "office". It will then create two new
 35 | database files named "good-strings-office.db" and "good-opcodes-office.db" that will be initialized during startup with the built-in databases.
 36 | 
 37 | Since version 0.18.0 yarGen supports extra conditions that make use of the `pe` module. This includes [imphash](https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html) values and the PE file's exports. We provide pre-generated imphash and export databases.
 38 | 
 39 | Since version 0.19.0 yarGen support a 'dropzone' mode in which it initializes all strings/opcodes/imphashes/exports only once and queries a given folder for new samples. If it finds new samples dropped to the folder, it creates rules for these samples, writes the YARA rules to the defined output file (default: yargen_rules.yar) and removes the dropped samples. You can specify a text file (`-b`) from which the identifier is read. The reference parameter (`-r`) has also been extended so that it can be a text file on disk from which the reference is read. E.g. drop two files named 'identifier.txt' and 'reference.txt' together with the samples to the folder and use the parameters `-b ./dropzone/identifier.txt` and `-r ./dropzone/reference.txt` to read the respective strings from the files each time an analysis starts.
 40 | 
 41 | Since version 0.20.0 yarGen supports the extraction and use of hex encoded strings that often appear in weaponized RTF files.
 42 | 
 43 | The rule generation process also tries to identify similarities between the files that get analyzed and then combines the strings to so called **super rules**. The super rule generation does not remove the simple rule for the files that have been combined in a single super rule. This means that there is some redundancy when super rules are created. You can suppress a simple rule for a file that was already covered by super rule by using --nosimple.
 44 | 
 45 | ### Installation
 46 | 
 47 | 1. Make sure you have at least 4GB of RAM on the machine you plan to use yarGen (8GB if opcodes are included in rule generation, use with --opcodes)
 48 | 2. Install all dependencies with `pip install -r requirements.txt` (or `pip3 install -r requirements.txt`)
 49 | 3. Run `python yarGen.py --update` to automatically download the built-in databases. The are saved into the  './dbs' sub folder. (Download: 913 MB)
 50 | 4. See help with `python yarGen.py --help` for more information on the command line parameters
 51 | 
 52 | ### Memory Requirements
 53 | 
 54 | Warning: yarGen pulls the whole `goodstring` database to memory and uses at least 3 GB of memory for a few seconds - 6 GB if opcodes evaluation is activated (--opcodes).
 55 | 
 56 | I've already tried to migrate the database to sqlite but the numerous string comparisons and lookups made the analysis painfully slow.
 57 | 
 58 | # Post-Processing Video Tutorial
 59 | 
 60 | [![YARA rule post-processing video tutorial](https://img.youtube.com/vi/y8oAjIjZMIg/0.jpg)](https://medium.com/@cyb3rops/how-to-post-process-yara-rules-generated-by-yargen-121d29322282)
 61 | 
 62 | # Multiple Database Support
 63 | 
 64 | yarGen allows creating multiple databases for opcodes or strings. You can easily create a new database by using "-c" for new database creation and "-i identifier" to give the new database a unique identifier as e.g. "office". It will the create two new database files named "good-strings-office.db" and "good-opcodes-office.db" that will from then on be initialized during startup with the built-in databases.
 65 | 
 66 | ### Database Creation / Update Example
 67 | 
 68 | Create a new strings and opcodes database from an Office 2013 program directory:
 69 | ```
 70 | yarGen.py -c --opcodes -i office -g /opt/packs/office2013
 71 | ```
 72 | The analysis and string extraction process will create the following new databases in the "./dbs" sub folder.
 73 | ```
 74 | good-strings-office.db
 75 | good-opcodes-office.db
 76 | ```
 77 | The values from these new databases will be automatically applied during the rule creation process because all *.db files in the sub folder "./dbs" will be initialized during startup.
 78 | 
 79 | You can update the once created databases with the "-u" parameter
 80 | ```
 81 | yarGen.py -u --opcodes -i office -g /opt/packs/office365
 82 | ```
 83 | This would update the "office" databases with new strings extracted from files in the given directory.
 84 | 
 85 | ## Command Line Parameters
 86 | 
 87 | ```
 88 | usage: yarGen.py [-h] [-m M] [-y min-size] [-z min-score] [-x high-scoring]
 89 |                  [-w superrule-overlap] [-s max-size] [-rc maxstrings]
 90 |                  [--excludegood] [-o output_rule_file] [-e output_dir_strings]
 91 |                  [-a author] [-r ref] [-l lic] [-p prefix] [-b identifier]
 92 |                  [--score] [--strings] [--nosimple] [--nomagic] [--nofilesize]
 93 |                  [-fm FM] [--globalrule] [--nosuper] [--update] [-g G] [-u]
 94 |                  [-c] [-i I] [--dropzone] [--nr] [--oe] [-fs size-in-MB]
 95 |                  [--noextras] [--debug] [--trace] [--opcodes] [-n opcode-num]
 96 | 
 97 | yarGen
 98 | 
 99 | optional arguments:
100 |   -h, --help            show this help message and exit
101 | 
102 | Rule Creation:
103 |   -m M                  Path to scan for malware
104 |   -y min-size           Minimum string length to consider (default=8)
105 |   -z min-score          Minimum score to consider (default=0)
106 |   -x high-scoring       Score required to set string as 'highly specific
107 |                         string' (default: 30)
108 |   -w superrule-overlap  Minimum number of strings that overlap to create a
109 |                         super rule (default: 5)
110 |   -s max-size           Maximum length to consider (default=128)
111 |   -rc maxstrings        Maximum number of strings per rule (default=20,
112 |                         intelligent filtering will be applied)
113 |   --excludegood         Force the exclude all goodware strings
114 | 
115 | Rule Output:
116 |   -o output_rule_file   Output rule file
117 |   -e output_dir_strings
118 |                         Output directory for string exports
119 |   -a author             Author Name
120 |   -r ref                Reference (can be string or text file)
121 |   -l lic                License
122 |   -p prefix             Prefix for the rule description
123 |   -b identifier         Text file from which the identifier is read (default:
124 |                         last folder name in the full path, e.g. "myRAT" if -m
125 |                         points to /mnt/mal/myRAT)
126 |   --score               Show the string scores as comments in the rules
127 |   --strings             Show the string scores as comments in the rules
128 |   --nosimple            Skip simple rule creation for files included in super
129 |                         rules
130 |   --nomagic             Don't include the magic header condition statement
131 |   --nofilesize          Don't include the filesize condition statement
132 |   -fm FM                Multiplier for the maximum 'filesize' condition value
133 |                         (default: 3)
134 |   --globalrule          Create global rules (improved rule set speed)
135 |   --nosuper             Don't try to create super rules that match against
136 |                         various files
137 | 
138 | Database Operations:
139 |   --update              Update the local strings and opcodes dbs from the
140 |                         online repository
141 |   -g G                  Path to scan for goodware (dont use the database
142 |                         shipped with yaraGen)
143 |   -u                    Update local standard goodware database with a new
144 |                         analysis result (used with -g)
145 |   -c                    Create new local goodware database (use with -g and
146 |                         optionally -i "identifier")
147 |   -i I                  Specify an identifier for the newly created databases
148 |                         (good-strings-identifier.db, good-opcodes-
149 |                         identifier.db)
150 | 
151 | General Options:
152 |   --dropzone            Dropzone mode - monitors a directory [-m] for new
153 |                         samples to processWARNING: Processed files will be
154 |                         deleted!
155 |   --nr                  Do not recursively scan directories
156 |   --oe                  Only scan executable extensions EXE, DLL, ASP, JSP,
157 |                         PHP, BIN, INFECTED
158 |   -fs size-in-MB        Max file size in MB to analyze (default=10)
159 |   --noextras            Don't use extras like Imphash or PE header specifics
160 |   --debug               Debug output
161 |   --trace               Trace output
162 | 
163 | Other Features:
164 |   --opcodes             Do use the OpCode feature (use this if not enough high
165 |                         scoring strings can be found)
166 |   -n opcode-num         Number of opcodes to add if not enough high scoring
167 |                         string could be found (default=3)
168 | ```
169 | 
170 | ## Best Practice
171 | 
172 | See the following blog posts for a more detailed description on how to use yarGen for YARA rule creation: 
173 | 
174 | [How to Write Simple but Sound Yara Rules - Part 1](https://www.bsk-consulting.de/2015/02/16/write-simple-sound-yara-rules/)
175 | 
176 | [How to Write Simple but Sound Yara Rules - Part 2](https://www.bsk-consulting.de/2015/10/17/how-to-write-simple-but-sound-yara-rules-part-2/)
177 | 
178 | [How to Write Simple but Sound Yara Rules - Part 3](https://www.bsk-consulting.de/2016/04/15/how-to-write-simple-but-sound-yara-rules-part-3/)
179 |   
180 | ## Screenshots
181 | 
182 | ![Generator Run](./screens/yargen-running.png)
183 | 
184 | ![Output Rule](./screens/output-rule-0.14.1.png)
185 | 
186 | As you can see in the screenshot above you'll get a rule that contains strings, which are not found in the goodware strings database. 
187 | 
188 | You should clean up the rules afterwards. In the example above, remove the strings $s14, $s17, $s19, $s20 that look like random code to get a cleaner rule that is more likely to match on other samples of the same family.
189 | 
190 | To get a more generic rule, remove string $s5, which is very specific for this compiled executable.
191 |  
192 | ## Examples
193 | 
194 | ### Dropzone Mode (Recommended)
195 | 
196 | Monitors a given folder (-m) for new samples, processes the samples, writes YARA rules to the set output file (default: yargen_rules.yar) and deletes the folder contents afterwards.
197 | 
198 | ```python yarGen.py -a "yarGen Dropzone" --dropzone -m /opt/mal/dropzone```
199 | 
200 | WARNING: All files dropped to the set dropzone will be removed!
201 | 
202 | In the following example two files named `identifier.txt` and `reference.txt` are read and used for the `reference` and as identifier in the YARA rule sets. The files are read at each iteration and not only during initialization. This way you can pass specific strings to each dropzone rule generation.
203 | 
204 | ```python yarGen.py --dropzone -m /opt/mal/dropzone -b /opt/mal/dropzone/identifier.txt -r /opt/mal/dropzone/reference.txt```
205 | 
206 | ### Use the shipped database (FAST) to create some rules
207 | 
208 | ```python yarGen.py -m X:\MAL\Case1401```
209 | 
210 | Use the shipped database of goodware strings and scan the malware directory 
211 | "X:\MAL" recursively. Create rules for all files included in this directory and 
212 | below. A file named 'yargen_rules.yar' will be generated in the current 
213 | directory. 
214 | 
215 | ### Show the score of the strings as comment
216 | 
217 | yarGen will by default use the top 20 strings based on their score. To see how a
218 | certain string in the rule scored, use the "--score" parameter.
219 | 
220 | ```python yarGen.py --score -m X:\MAL\Case1401```
221 | 
222 | ### Use only strings with a certain minimum score
223 | 
224 | In order to use only strings for your rules that match a certain minimum score use the "-z" parameter. It is a good pratice to first create rules with "--score" and than perform a second run with a minimum score set for you sample set via "-z".  
225 | 
226 | ```python yarGen.py --score -z 5 -m X:\MAL\Case1401```
227 | 
228 | ### Preset author and reference
229 | 
230 | ```python yarGen.py -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case_441 -o case441.yar```
231 | 
232 | ### Add opcodes to the rules
233 | 
234 | ```python yarGen.py --opcodes -a "Florian Roth" -r "http://goo.gl/c2qgFx" -m /opt/mal/case33 -o rules33.yar```
235 | 
236 | ### Show debugging output
237 | 
238 | ```python yarGen.py --debug -m /opt/mal/case_441```
239 | 
240 | ### Create a new goodware strings database
241 | 
242 | ```python yarGen.py -c --opcodes -g /home/user/Downloads/office2013 -i office```
243 | 
244 | This will generate two new databases for strings and opcodes named:
245 | - good-strings-office.db
246 | - good-opcodes-office.db
247 | 
248 | The new databases will automatically be initialized during startup and are from then on used for rule generation.
249 | 
250 | ### Update a goodware strings database (append new strings, opcodes, imphashes, exports to the old ones)
251 | 
252 | ```python yarGen.py -u -g /home/user/Downloads/office365 -i office```
253 | 
254 | ### My Best Pratice Command Line
255 | 
256 | ```python yarGen.py -a "Florian Roth" -r "Internal Research" -m /opt/mal/apt_case_32```
257 | 
258 | # db-lookup.py
259 | 
260 | A tool named `db-lookup.py`, which was introduced with version 0.18.0 allows you to query the local databases in a simple command line interface. The interface takes an input value, which can be `string`, `export` or `imphash` value, detects the query type and then performs a lookup in the loaded databases. This allows you to query the yarGen databases with `string`, `export` and `imphash` values in order to check if this value appears in goodware that has been processed to generate the databases.
261 | 
262 | This is a nice feature that helps you ta answer the following questions:
263 | 
264 | * Does this string appear in goodware samples of my database?
265 | * Does this export name appear in goodware samples of my database?
266 | * Does a sample in my goodware database has this imphash?
267 | 
268 | However, there are several drawbacks:
269 | 
270 | * It does only match on the full string (no contains, no startswith, no endswith)
271 | * Opcode lookup is not supported (yet)
272 | 
273 | I plan to release a new project named `Valknut` which extracts overlapping byte sequences from samples and creates searchable databases. This project will be the new backend API for yarGen allowing all kinds of queries, opcodes and string values, ascii and wide formatted.
274 | 


--------------------------------------------------------------------------------
/prepare-release.sh:
--------------------------------------------------------------------------------
 1 | #!/bin/bash
 2 | 
 3 | RELDIR=./release
 4 | OUTDIR=$RELDIR/yarGen/
 5 | 
 6 | cp yarGen.py $OUTDIR
 7 | cp -r 3rdparty $OUTDIR
 8 | cp -r lib $OUTDIR
 9 | cp README.md $OUTDIR
10 | cp LICENSE $OUTDIR
11 | 
12 | cd $RELDIR
13 | tar -cvzf yarGen.tar.gz ./yarGen/
14 | cd ..


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | lief
2 | lxml


--------------------------------------------------------------------------------
/screens/output-rule-0.14.1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Neo23x0/yarGen/c16faff06ea6461f1c423f14c7690b6624e5fcff/screens/output-rule-0.14.1.png


--------------------------------------------------------------------------------
/screens/yargen-running.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Neo23x0/yarGen/c16faff06ea6461f1c423f14c7690b6624e5fcff/screens/yargen-running.png


--------------------------------------------------------------------------------
/tools/byte-mapper.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | # -*- coding: iso-8859-1 -*-
  3 | # -*- coding: utf-8 -*-
  4 | #
  5 | # Byte Mapper
  6 | # Binary Signature Generator
  7 | # 
  8 | # Florian Roth
  9 | # June 2014
 10 | # v0.1a
 11 | 
 12 | import os
 13 | import sys
 14 | import argparse
 15 | import re
 16 | import traceback
 17 | from colorama import Fore, Back, Style
 18 | from colorama import init
 19 | from hashlib import md5
 20 | 
 21 | def getFiles(dir, recursive):
 22 | 	# Recursive
 23 | 	if recursive:
 24 | 		for root, directories, files in os.walk (dir, followlinks=False):
 25 | 			for filename in files:
 26 | 				filePath = os.path.join(root,filename)
 27 | 				yield filePath
 28 | 	# Non recursive
 29 | 	else:
 30 | 		for filename in os.listdir(dir):
 31 | 			filePath = os.path.join(dir,filename)
 32 | 			yield filePath
 33 | 			
 34 | def parseDir(dir, recursive, numBytes):
 35 | 
 36 | 	# Prepare dictionary
 37 | 	byte_stats = {}
 38 | 	
 39 | 	fileCount = 0
 40 | 	for filePath in getFiles(dir, recursive):
 41 | 	
 42 | 		if os.path.isdir(filePath):
 43 | 			if recursive:
 44 | 				parseDir(dir, recursive, numBytes)
 45 | 			continue
 46 | 	
 47 | 		with open(filePath, 'r') as file:
 48 | 			fileCount += 1
 49 | 			header = file.read(int(numBytes))
 50 | 	
 51 | 			pos = 0
 52 | 			for byte in header:
 53 | 				pos += 1
 54 | 				if pos in byte_stats:
 55 | 					if byte in byte_stats[pos]:
 56 | 						byte_stats[pos][byte] += 1
 57 | 					else:
 58 | 						byte_stats[pos][byte] = 1 
 59 | 				else:
 60 | 					#byte_stats.append(pos) 
 61 | 					byte_stats[pos] = { byte: 1 }
 62 | 
 63 | 	return byte_stats, fileCount
 64 | 	
 65 | def visiualizeStats(byteStats, fileCount, heatMapMode, byteFiller, bytesPerLine):
 66 | 	# Settings
 67 | 	# print fileCount
 68 | 	
 69 | 	bytesPrinted = 0
 70 | 	for byteStat in byteStats:
 71 | 		
 72 | 		if args.d:
 73 | 			print "------------------------"
 74 | 			print byteStats[byteStat]
 75 | 		
 76 | 		byteToPrint = ".."
 77 | 		countOfByte = 0
 78 | 		highestValue = 0
 79 | 		
 80 | 		# Evaluate the most often occured byte value at this position
 81 | 		for ( key, val ) in byteStats[byteStat].iteritems():
 82 | 			if val > highestValue:
 83 | 				highestValue = val
 84 | 				byteToPrint = key
 85 | 				countOfByte = val 
 86 | 		
 87 | 		# Heat Map Mode
 88 | 		if heatMapMode:
 89 | 			printHeatMapValue(byteToPrint, countOfByte, fileCount, byteFiller)				
 90 | 			
 91 | 		# Standard Mode			
 92 | 		else:
 93 | 			if countOfByte >= fileCount:
 94 | 				sys.stdout.write("%s%s" % ( byteToPrint.encode('hex'), byteFiller ))
 95 | 			else:
 96 | 				sys.stdout.write("..%s" % byteFiller)
 97 | 		
 98 | 		# Line break
 99 | 		bytesPrinted += 1
100 | 		if bytesPrinted >= bytesPerLine:
101 | 			sys.stdout.write("\n")
102 | 			bytesPrinted = 0
103 | 			
104 | 	# Print Heat Map Legend
105 | 	printHeatLegend(int(fileCount))
106 | 				
107 | def printHeatMapValue(byteToPrint, countOfByte, fileCount, byteFiller):
108 | 	if args.d:
109 | 		print "Count of byte: %s" % countOfByte
110 | 		print "File Count: %s" % fileCount	
111 | 	if countOfByte == fileCount:
112 | 		sys.stdout.write(Fore.GREEN + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)
113 | 	elif countOfByte == fileCount - 1:
114 | 		sys.stdout.write(Fore.CYAN + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)
115 | 	elif countOfByte == fileCount - 2:
116 | 		sys.stdout.write(Fore.YELLOW + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)
117 | 	elif countOfByte == fileCount - 3:
118 | 		sys.stdout.write(Fore.RED + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)
119 | 	elif countOfByte == fileCount - 4:
120 | 		sys.stdout.write(Fore.MAGENTA + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)
121 | 	elif countOfByte == fileCount - 5:
122 | 		sys.stdout.write(Fore.WHITE + '%s' % byteToPrint.encode('hex') + Fore.WHITE + '%s' % byteFiller)		
123 | 	else: 
124 | 		sys.stdout.write(Fore.WHITE + Style.DIM + '..' + Fore.WHITE + Style.RESET_ALL + '%s' % byteFiller)
125 | 		
126 | def printHeatLegend(fileCount):
127 | 	print ""
128 | 	print Fore.GREEN + 'GREEN\tContent of all %s files' % str(fileCount) + Fore.WHITE
129 | 	if fileCount > 1:
130 | 		print Fore.CYAN + 'CYAN\tContent of %s files' % str(fileCount-1) + Fore.WHITE
131 | 	if fileCount > 2:
132 | 		print Fore.YELLOW + 'YELLOW\tContent of %s files' % str(fileCount-2) + Fore.WHITE
133 | 	if fileCount > 3:
134 | 		print Fore.RED + 'RED\tContent of %s files' % str(fileCount-3) + Fore.WHITE
135 | 	if fileCount > 4:
136 | 		print Fore.MAGENTA + 'MAGENTA\tContent of %s files' % str(fileCount-4) + Fore.WHITE
137 | 	if fileCount > 5:		
138 | 		print Fore.WHITE + 'WHITE\tContent of %s files' % str(fileCount-5) + Fore.WHITE
139 | 	if fileCount > 6:		
140 | 		print Fore.WHITE + Style.DIM +'..\tNo identical bytes in more than %s files' % str(fileCount-6) + Fore.WHITE + Style.RESET_ALL	
141 | 		
142 | # MAIN ################################################################
143 | if __name__ == '__main__':
144 | 	
145 | 	# Parse Arguments
146 | 	parser = argparse.ArgumentParser(description='Yara BSG')
147 | 	parser.add_argument('-p', metavar="malware-dir", help='Path to scan for malware')
148 | 	parser.add_argument('-r', action='store_true', default=False, help='Be recursive')
149 | 	parser.add_argument('-m', action='store_true', default=False, help='Heat map on byte values')
150 | 	parser.add_argument('-f', default=" ", metavar="byte-filler", help='character to fill the gap between the bytes (default: \' \')')	
151 | 	parser.add_argument('-c', default=None, metavar="num-occurances", help='Print only bytes that occur in at least X of the samples (default: all files; incompatible with heat map mode) ')
152 | 	parser.add_argument('-b', default=1024, metavar="bytes", help='Number of bytes to print (default: 1024)')
153 | 	parser.add_argument('-l', default=16, metavar="bytes-per-line", help='Number of bytes to print per line (default: 16)')	
154 | 	parser.add_argument('-d', action='store_true', default=False, help='Debug Info')
155 | 	
156 | 	args = parser.parse_args()
157 | 	
158 | 	# Colorization
159 | 	init()
160 | 	
161 | 	# Parse the Files
162 | 	( byteStats, fileCount) = parseDir(args.p, args.r, args.b)
163 | 	
164 | 	# print byteStats
165 | 	if args.c != None and not args.m:
166 | 		fileCount = int(args.c)
167 | 	
168 | 	# Vizualize Byte Stats
169 | 	visiualizeStats(byteStats, fileCount, args.m, args.f, args.l)


--------------------------------------------------------------------------------
/yarGen.py:
--------------------------------------------------------------------------------
   1 | #!/usr/bin/env python
   2 | # -*- coding: iso-8859-1 -*-
   3 | # -*- coding: utf-8 -*-
   4 | #
   5 | # yarGen
   6 | # A Rule Generator for YARA Rules
   7 | #
   8 | # Florian Roth
   9 | 
  10 | __version__ = "0.24.0"
  11 | 
  12 | import os
  13 | import sys
  14 | 
  15 | import argparse
  16 | import re
  17 | import traceback
  18 | import operator
  19 | import datetime
  20 | import time
  21 | import lief
  22 | import json
  23 | import gzip
  24 | import urllib.request
  25 | import binascii
  26 | import base64
  27 | import shutil
  28 | from collections import Counter
  29 | from hashlib import sha256
  30 | import signal as signal_module
  31 | from lxml import etree
  32 | 
  33 | RELEVANT_EXTENSIONS = [".asp", ".vbs", ".ps", ".ps1", ".tmp", ".bas", ".bat", ".cmd", ".com", ".cpl",
  34 |                        ".crt", ".dll", ".exe", ".msc", ".scr", ".sys", ".vb", ".vbe", ".vbs", ".wsc",
  35 |                        ".wsf", ".wsh", ".input", ".war", ".jsp", ".php", ".asp", ".aspx", ".psd1", ".psm1", ".py"]
  36 | 
  37 | AI_COMMENT = """
  38 | The provided rule is a YARA rule, encompassing a wide range of suspicious strings. Kindly review the list and pinpoint the twenty strings that are most distinctive or appear most suited for a YARA rule focused on malware detection. Arrange them in descending order based on their level of suspicion. Then, swap out the current list of strings in the YARA rule with your chosen set and supply the revised rule.
  39 | ---
  40 | """
  41 | 
  42 | REPO_URLS = {
  43 |     'good-opcodes-part1.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part1.db',
  44 |     'good-opcodes-part2.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part2.db',
  45 |     'good-opcodes-part3.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part3.db',
  46 |     'good-opcodes-part4.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part4.db',
  47 |     'good-opcodes-part5.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part5.db',
  48 |     'good-opcodes-part6.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part6.db',
  49 |     'good-opcodes-part7.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part7.db',
  50 |     'good-opcodes-part8.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part8.db',
  51 |     'good-opcodes-part9.db': 'https://www.bsk-consulting.de/yargen/good-opcodes-part9.db',
  52 | 
  53 |     'good-strings-part1.db': 'https://www.bsk-consulting.de/yargen/good-strings-part1.db',
  54 |     'good-strings-part2.db': 'https://www.bsk-consulting.de/yargen/good-strings-part2.db',
  55 |     'good-strings-part3.db': 'https://www.bsk-consulting.de/yargen/good-strings-part3.db',
  56 |     'good-strings-part4.db': 'https://www.bsk-consulting.de/yargen/good-strings-part4.db',
  57 |     'good-strings-part5.db': 'https://www.bsk-consulting.de/yargen/good-strings-part5.db',
  58 |     'good-strings-part6.db': 'https://www.bsk-consulting.de/yargen/good-strings-part6.db',
  59 |     'good-strings-part7.db': 'https://www.bsk-consulting.de/yargen/good-strings-part7.db',
  60 |     'good-strings-part8.db': 'https://www.bsk-consulting.de/yargen/good-strings-part8.db',
  61 |     'good-strings-part9.db': 'https://www.bsk-consulting.de/yargen/good-strings-part9.db',
  62 | 
  63 |     'good-exports-part1.db': 'https://www.bsk-consulting.de/yargen/good-exports-part1.db',
  64 |     'good-exports-part2.db': 'https://www.bsk-consulting.de/yargen/good-exports-part2.db',
  65 |     'good-exports-part3.db': 'https://www.bsk-consulting.de/yargen/good-exports-part3.db',
  66 |     'good-exports-part4.db': 'https://www.bsk-consulting.de/yargen/good-exports-part4.db',
  67 |     'good-exports-part5.db': 'https://www.bsk-consulting.de/yargen/good-exports-part5.db',
  68 |     'good-exports-part6.db': 'https://www.bsk-consulting.de/yargen/good-exports-part6.db',
  69 |     'good-exports-part7.db': 'https://www.bsk-consulting.de/yargen/good-exports-part7.db',
  70 |     'good-exports-part8.db': 'https://www.bsk-consulting.de/yargen/good-exports-part8.db',
  71 |     'good-exports-part9.db': 'https://www.bsk-consulting.de/yargen/good-exports-part9.db',
  72 | 
  73 |     'good-imphashes-part1.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part1.db',
  74 |     'good-imphashes-part2.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part2.db',
  75 |     'good-imphashes-part3.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part3.db',
  76 |     'good-imphashes-part4.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part4.db',
  77 |     'good-imphashes-part5.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part5.db',
  78 |     'good-imphashes-part6.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part6.db',
  79 |     'good-imphashes-part7.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part7.db',
  80 |     'good-imphashes-part8.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part8.db',
  81 |     'good-imphashes-part9.db': 'https://www.bsk-consulting.de/yargen/good-imphashes-part9.db',
  82 | }
  83 | 
  84 | PE_STRINGS_FILE = "./3rdparty/strings.xml"
  85 | 
  86 | KNOWN_IMPHASHES = {'a04dd9f5ee88d7774203e0a0cfa1b941': 'PsExec',
  87 |                    '2b8c9d9ab6fefc247adaf927e83dcea6': 'RAR SFX variant'}
  88 | 
  89 | 
  90 | def get_abs_path(filename):
  91 |     return os.path.join(os.path.dirname(os.path.abspath(__file__)), filename)
  92 | 
  93 | 
  94 | def get_files(folder, notRecursive):
  95 |     # Not Recursive
  96 |     if notRecursive:
  97 |         for filename in os.listdir(folder):
  98 |             filePath = os.path.join(folder, filename)
  99 |             if os.path.isdir(filePath):
 100 |                 continue
 101 |             yield filePath
 102 |     # Recursive
 103 |     else:
 104 |         for root, dirs, files in os.walk(folder, topdown = False):
 105 |             for name in files:
 106 |                 filePath = os.path.join(root, name)
 107 |                 yield filePath
 108 | 
 109 | 
 110 | def parse_sample_dir(dir, notRecursive=False, generateInfo=False, onlyRelevantExtensions=False):
 111 |     # Prepare dictionary
 112 |     string_stats = {}
 113 |     opcode_stats = {}
 114 |     file_info = {}
 115 |     known_sha1sums = []
 116 | 
 117 |     for filePath in get_files(dir, notRecursive):
 118 |         try:
 119 |             print("[+] Processing %s ..." % filePath)
 120 | 
 121 |             # Get Extension
 122 |             extension = os.path.splitext(filePath)[1].lower()
 123 |             if not extension in RELEVANT_EXTENSIONS and onlyRelevantExtensions:
 124 |                 if args.debug:
 125 |                     print("[-] EXTENSION %s - Skipping file %s" % (extension, filePath))
 126 |                 continue
 127 | 
 128 |             # Info file check
 129 |             if os.path.basename(filePath) == os.path.basename(args.b) or \
 130 |                     os.path.basename(filePath) == os.path.basename(args.r):
 131 |                 continue
 132 | 
 133 |             # Size Check
 134 |             size = 0
 135 |             try:
 136 |                 size = os.stat(filePath).st_size
 137 |                 if size > (args.fs * 1024 * 1024):
 138 |                     if args.debug:
 139 |                         print("[-] File is to big - Skipping file %s (use -fs to adjust this behaviour)" % (filePath))
 140 |                     continue
 141 |             except Exception as e:
 142 |                 pass
 143 | 
 144 |             # Check and read file
 145 |             try:
 146 |                 with open(filePath, 'rb') as f:
 147 |                     fileData = f.read()
 148 |             except Exception as e:
 149 |                 print("[-] Cannot read file - skipping %s" % filePath)
 150 | 
 151 |             # Extract strings from file
 152 |             strings = extract_strings(fileData)
 153 | 
 154 |             # Extract opcodes from file
 155 |             opcodes = []
 156 |             if use_opcodes:
 157 |                 print("[-] Extracting OpCodes: %s" % filePath)
 158 |                 opcodes = extract_opcodes(fileData)
 159 | 
 160 |             # Add sha256 value
 161 |             if generateInfo:
 162 |                 sha256sum = sha256(fileData).hexdigest()
 163 |                 file_info[filePath] = {}
 164 |                 file_info[filePath]["hash"] = sha256sum
 165 |                 file_info[filePath]["imphash"], file_info[filePath]["exports"] = get_pe_info(fileData)
 166 | 
 167 |             # Skip if hash already known - avoid duplicate files
 168 |             if sha256sum in known_sha1sums:
 169 |                 # if args.debug:
 170 |                 print("[-] Skipping strings/opcodes from %s due to MD5 duplicate detection" % filePath)
 171 |                 continue
 172 |             else:
 173 |                 known_sha1sums.append(sha256sum)
 174 | 
 175 |             # Magic evaluation
 176 |             if not args.nomagic:
 177 |                 file_info[filePath]["magic"] = binascii.hexlify(fileData[:2]).decode('ascii')
 178 |             else:
 179 |                 file_info[filePath]["magic"] = ""
 180 | 
 181 |             # File Size
 182 |             file_info[filePath]["size"] = os.stat(filePath).st_size
 183 | 
 184 |             # Add stats for basename (needed for inverse rule generation)
 185 |             fileName = os.path.basename(filePath)
 186 |             folderName = os.path.basename(os.path.dirname(filePath))
 187 |             if fileName not in file_info:
 188 |                 file_info[fileName] = {}
 189 |                 file_info[fileName]["count"] = 0
 190 |                 file_info[fileName]["hashes"] = []
 191 |                 file_info[fileName]["folder_names"] = []
 192 |             file_info[fileName]["count"] += 1
 193 |             file_info[fileName]["hashes"].append(sha256sum)
 194 |             if folderName not in file_info[fileName]["folder_names"]:
 195 |                 file_info[fileName]["folder_names"].append(folderName)
 196 | 
 197 |             # Add strings to statistics
 198 |             for string in strings:
 199 |                 # String is not already known
 200 |                 if string not in string_stats:
 201 |                     string_stats[string] = {}
 202 |                     string_stats[string]["count"] = 0
 203 |                     string_stats[string]["files"] = []
 204 |                     string_stats[string]["files_basename"] = {}
 205 |                 # String count
 206 |                 string_stats[string]["count"] += 1
 207 |                 # Add file information
 208 |                 if fileName not in string_stats[string]["files_basename"]:
 209 |                     string_stats[string]["files_basename"][fileName] = 0
 210 |                 string_stats[string]["files_basename"][fileName] += 1
 211 |                 if filePath not in string_stats[string]["files"]:
 212 |                     string_stats[string]["files"].append(filePath)
 213 | 
 214 |             # Add opcodes to statistics
 215 |             for opcode in opcodes:
 216 |                 # Opcode is not already known
 217 |                 if opcode not in opcode_stats:
 218 |                     opcode_stats[opcode] = {}
 219 |                     opcode_stats[opcode]["count"] = 0
 220 |                     opcode_stats[opcode]["files"] = []
 221 |                     opcode_stats[opcode]["files_basename"] = {}
 222 |                 # Opcode count
 223 |                 opcode_stats[opcode]["count"] += 1
 224 |                 # Add file information
 225 |                 if fileName not in opcode_stats[opcode]["files_basename"]:
 226 |                     opcode_stats[opcode]["files_basename"][fileName] = 0
 227 |                 opcode_stats[opcode]["files_basename"][fileName] += 1
 228 |                 if filePath not in opcode_stats[opcode]["files"]:
 229 |                     opcode_stats[opcode]["files"].append(filePath)
 230 | 
 231 |             if args.debug:
 232 |                 print("[+] Processed " + filePath + " Size: " + str(size) + " Strings: " + str(len(string_stats)) + \
 233 |                       " OpCodes: " + str(len(opcode_stats)) + " ... ")
 234 | 
 235 |         except Exception as e:
 236 |             traceback.print_exc()
 237 |             print("[E] ERROR reading file: %s" % filePath)
 238 | 
 239 |     return string_stats, opcode_stats, file_info
 240 | 
 241 | 
 242 | def parse_good_dir(dir, notRecursive=False, onlyRelevantExtensions=True):
 243 |     # Prepare dictionary
 244 |     all_strings = Counter()
 245 |     all_opcodes = Counter()
 246 |     all_imphashes = Counter()
 247 |     all_exports = Counter()
 248 | 
 249 |     for filePath in get_files(dir, notRecursive):
 250 |         # Get Extension
 251 |         extension = os.path.splitext(filePath)[1].lower()
 252 |         if extension not in RELEVANT_EXTENSIONS and onlyRelevantExtensions:
 253 |             if args.debug:
 254 |                 print("[-] EXTENSION %s - Skipping file %s" % (extension, filePath))
 255 |             continue
 256 | 
 257 |         # Size Check
 258 |         size = 0
 259 |         try:
 260 |             size = os.stat(filePath).st_size
 261 |             if size > (args.fs * 1024 * 1024):
 262 |                 continue
 263 |         except Exception as e:
 264 |             pass
 265 | 
 266 |         # Check and read file
 267 |         try:
 268 |             with open(filePath, 'rb') as f:
 269 |                 fileData = f.read()
 270 |         except Exception as e:
 271 |             print("[-] Cannot read file - skipping %s" % filePath)
 272 | 
 273 |         # Extract strings from file
 274 |         strings = extract_strings(fileData)
 275 |         # Append to all strings
 276 |         all_strings.update(strings)
 277 | 
 278 |         # Extract Opcodes from file
 279 |         opcodes = []
 280 |         if use_opcodes:
 281 |             print("[-] Extracting OpCodes: %s" % filePath)
 282 |             opcodes = extract_opcodes(fileData)
 283 |             # Append to all opcodes
 284 |             all_opcodes.update(opcodes)
 285 | 
 286 |         # Imphash and Exports
 287 |         (imphash, exports) = get_pe_info(fileData)
 288 |         if imphash != "":
 289 |             all_imphashes.update([imphash])
 290 |         all_exports.update(exports)
 291 |         if args.debug:
 292 |             print("[+] Processed %s - %d strings %d opcodes %d exports and imphash %s" % (filePath, len(strings),
 293 |                                                                                           len(opcodes), len(exports),
 294 |                                                                                           imphash))
 295 | 
 296 |     # return it as a set (unique strings)
 297 |     return all_strings, all_opcodes, all_imphashes, all_exports
 298 | 
 299 | 
 300 | def extract_strings(fileData) -> list[str]:
 301 |     # String list
 302 |     cleaned_strings = []
 303 |     # Read file data
 304 |     try:
 305 |         # Read strings
 306 |         strings_full = re.findall(b"[\x1f-\x7e]{6,}", fileData)
 307 |         strings_limited = re.findall(b"[\x1f-\x7e]{6,%d}" % args.s, fileData)
 308 |         strings_hex = extract_hex_strings(fileData)
 309 |         strings = list(set(strings_full) | set(strings_limited) | set(strings_hex))
 310 |         wide_strings = [ws for ws in re.findall(b"(?:[\x1f-\x7e][\x00]){6,}", fileData)]
 311 | 
 312 |         # Post-process
 313 |         # WIDE
 314 |         for ws in wide_strings:
 315 |             # Decode UTF16 and prepend a marker (facilitates handling)
 316 |             wide_string = ("UTF16LE:%s" % ws.decode('utf-16')).encode('utf-8')
 317 |             if wide_string not in strings:
 318 |                 strings.append(wide_string)
 319 |         for string in strings:
 320 |             # Escape strings
 321 |             if len(string) > 0:
 322 |                 string = string.replace(b'\\', b'\\\\')
 323 |                 string = string.replace(b'"', b'\\"')
 324 |             try:
 325 |                 if isinstance(string, str):
 326 |                     cleaned_strings.append(string)
 327 |                 else:
 328 |                     cleaned_strings.append(string.decode('utf-8'))
 329 |             except AttributeError as e:
 330 |                 print(string)
 331 |                 traceback.print_exc()
 332 | 
 333 |     except Exception as e:
 334 |         if args.debug:
 335 |             print(string)
 336 |             traceback.print_exc()
 337 |         pass
 338 | 
 339 |     return cleaned_strings
 340 | 
 341 | 
 342 | def extract_opcodes(fileData) -> list[str]:
 343 |     # Opcode list
 344 |     opcodes = []
 345 | 
 346 |     try:
 347 |         # Read file data
 348 |         binary = lief.parse(fileData)
 349 |         ep = binary.entrypoint
 350 | 
 351 |         # Locate .text section
 352 |         text = None
 353 |         if isinstance(binary, lief.PE.Binary):
 354 |             for sec in binary.sections:
 355 |                 if sec.virtual_address + binary.imagebase <= ep < sec.virtual_address + binary.imagebase + sec.virtual_size:
 356 |                     if args.debug:
 357 |                         print(f'EP is located at {sec.name} section')
 358 |                     text = sec.content.tobytes()
 359 |                     break
 360 |         elif isinstance(binary, lief.ELF.Binary):
 361 |             for sec in binary.sections:
 362 |                 if sec.virtual_address <= ep < sec.virtual_address + sec.size:
 363 |                     if args.debug:
 364 |                         print(f'EP is located at {sec.name} section')
 365 |                     text = sec.content.tobytes()
 366 |                     break
 367 |         
 368 |         if text is not None:
 369 |             # Split text into subs
 370 |             text_parts = re.split(b"[\x00]{3,}", text)
 371 |             # Now truncate and encode opcodes
 372 |             for text_part in text_parts:
 373 |                 if text_part == '' or len(text_part) < 8:
 374 |                     continue
 375 |                 opcodes.append(binascii.hexlify(text_part[:16]).decode(encoding='ascii'))
 376 |     except Exception as e:
 377 |         if args.debug:
 378 |             traceback.print_exc()
 379 |         pass
 380 | 
 381 |     return opcodes
 382 | 
 383 | 
 384 | def get_pe_info(fileData: bytes) -> tuple[str, list[str]]:
 385 |     """
 386 |     Get different PE attributes and hashes by lief
 387 |     :param fileData:
 388 |     :return:
 389 |     """
 390 |     imphash = ""
 391 |     exports = []
 392 |     # Check for MZ header (speed improvement)
 393 |     if fileData[:2] != b"MZ":
 394 |         return imphash, exports
 395 |     try:
 396 |         if args.debug:
 397 |             print("Extracting PE information")
 398 |         binary: lief.PE.Binary = lief.parse(fileData)
 399 |         # Imphash
 400 |         imphash = lief.PE.get_imphash(binary, lief.PE.IMPHASH_MODE.PEFILE)
 401 |         # Exports (names)
 402 |         for exp in binary.get_export().entries:
 403 |             exp: lief.PE.ExportEntry
 404 |             exports.append(str(exp.name))
 405 |     except Exception as e:
 406 |         if args.debug:
 407 |             traceback.print_exc()
 408 |         pass
 409 | 
 410 |     return imphash, exports
 411 | 
 412 | 
 413 | def sample_string_evaluation(string_stats, opcode_stats, file_info):
 414 |     # Generate Stats -----------------------------------------------------------
 415 |     print("[+] Generating statistical data ...")
 416 |     file_strings = {}
 417 |     file_opcodes = {}
 418 |     combinations = {}
 419 |     inverse_stats = {}
 420 |     max_combi_count = 0
 421 |     super_rules = []
 422 | 
 423 |     # OPCODE EVALUATION --------------------------------------------------------
 424 |     for opcode in opcode_stats:
 425 |         # If string occurs not too often in sample files
 426 |         if opcode_stats[opcode]["count"] < 10:
 427 |             # If string list in file dictionary not yet exists
 428 |             for filePath in opcode_stats[opcode]["files"]:
 429 |                 if filePath in file_opcodes:
 430 |                     # Append string
 431 |                     file_opcodes[filePath].append(opcode)
 432 |                 else:
 433 |                     # Create list and then add the first string to the file
 434 |                     file_opcodes[filePath] = []
 435 |                     file_opcodes[filePath].append(opcode)
 436 | 
 437 |     # STRING EVALUATION -------------------------------------------------------
 438 | 
 439 |     # Iterate through strings found in malware files
 440 |     for string in string_stats:
 441 | 
 442 |         # If string occurs not too often in (goodware) sample files
 443 |         if string_stats[string]["count"] < 10:
 444 |             # If string list in file dictionary not yet exists
 445 |             for filePath in string_stats[string]["files"]:
 446 |                 if filePath in file_strings:
 447 |                     # Append string
 448 |                     file_strings[filePath].append(string)
 449 |                 else:
 450 |                     # Create list and then add the first string to the file
 451 |                     file_strings[filePath] = []
 452 |                     file_strings[filePath].append(string)
 453 | 
 454 |                 # INVERSE RULE GENERATION -------------------------------------
 455 |                 if args.inverse:
 456 |                     for fileName in string_stats[string]["files_basename"]:
 457 |                         string_occurrance_count = string_stats[string]["files_basename"][fileName]
 458 |                         total_count_basename = file_info[fileName]["count"]
 459 |                         # print "string_occurance_count %s - total_count_basename %s" % ( string_occurance_count,
 460 |                         # total_count_basename )
 461 |                         if string_occurrance_count == total_count_basename:
 462 |                             if fileName not in inverse_stats:
 463 |                                 inverse_stats[fileName] = []
 464 |                             if args.trace:
 465 |                                 print("Appending %s to %s" % (string, fileName))
 466 |                             inverse_stats[fileName].append(string)
 467 | 
 468 |         # SUPER RULE GENERATION -----------------------------------------------
 469 |         if not nosuper and not args.inverse:
 470 | 
 471 |             # SUPER RULES GENERATOR	- preliminary work
 472 |             # If a string occurs more than once in different files
 473 |             # print sample_string_stats[string]["count"]
 474 |             if string_stats[string]["count"] > 1:
 475 |                 if args.debug:
 476 |                     print("OVERLAP Count: %s\nString: \"%s\"%s" % (string_stats[string]["count"], string,
 477 |                                                                    "\nFILE: ".join(string_stats[string]["files"])))
 478 |                 # Create a combination string from the file set that matches to that string
 479 |                 combi = ":".join(sorted(string_stats[string]["files"]))
 480 |                 # print "STRING: " + string
 481 |                 if args.debug:
 482 |                     print("COMBI: " + combi)
 483 |                 # If combination not yet known
 484 |                 if combi not in combinations:
 485 |                     combinations[combi] = {}
 486 |                     combinations[combi]["count"] = 1
 487 |                     combinations[combi]["strings"] = []
 488 |                     combinations[combi]["strings"].append(string)
 489 |                     combinations[combi]["files"] = string_stats[string]["files"]
 490 |                 else:
 491 |                     combinations[combi]["count"] += 1
 492 |                     combinations[combi]["strings"].append(string)
 493 |                 # Set the maximum combination count
 494 |                 if combinations[combi]["count"] > max_combi_count:
 495 |                     max_combi_count = combinations[combi]["count"]
 496 |                     # print "Max Combi Count set to: %s" % max_combi_count
 497 | 
 498 |     print("[+] Generating Super Rules ... (a lot of magic)")
 499 |     for combi_count in range(max_combi_count, 1, -1):
 500 |         for combi in combinations:
 501 |             if combi_count == combinations[combi]["count"]:
 502 |                 # print "Count %s - Combi %s" % ( str(combinations[combi]["count"]), combi )
 503 |                 # Filter the string set
 504 |                 # print "BEFORE"
 505 |                 # print len(combinations[combi]["strings"])
 506 |                 # print combinations[combi]["strings"]
 507 |                 string_set = combinations[combi]["strings"]
 508 |                 combinations[combi]["strings"] = []
 509 |                 combinations[combi]["strings"] = filter_string_set(string_set)
 510 |                 # print combinations[combi]["strings"]
 511 |                 # print "AFTER"
 512 |                 # print len(combinations[combi]["strings"])
 513 |                 # Combi String count after filtering
 514 |                 # print "String count after filtering: %s" % str(len(combinations[combi]["strings"]))
 515 | 
 516 |                 # If the string set of the combination has a required size
 517 |                 if len(combinations[combi]["strings"]) >= int(args.w):
 518 |                     # Remove the files in the combi rule from the simple set
 519 |                     if args.nosimple:
 520 |                         for file in combinations[combi]["files"]:
 521 |                             if file in file_strings:
 522 |                                 del file_strings[file]
 523 |                     # Add it as a super rule
 524 |                     print("[-] Adding Super Rule with %s strings." % str(len(combinations[combi]["strings"])))
 525 |                     # if args.debug:
 526 |                     # print "Rule Combi: %s" % combi
 527 |                     super_rules.append(combinations[combi])
 528 | 
 529 |     # Return all data
 530 |     return (file_strings, file_opcodes, combinations, super_rules, inverse_stats)
 531 | 
 532 | 
 533 | def filter_opcode_set(opcode_set: list[str]) -> list[str]:
 534 |     # Preferred Opcodes
 535 |     pref_opcodes = [' 34 ', 'ff ff ff ']
 536 | 
 537 |     # Useful set
 538 |     useful_set = []
 539 |     pref_set = []
 540 | 
 541 |     for opcode in opcode_set:
 542 |         opcode: str
 543 |         # Exclude all opcodes found in goodware
 544 |         if opcode in good_opcodes_db:
 545 |             if args.debug:
 546 |                 print("skipping %s" % opcode)
 547 |             continue
 548 | 
 549 |         # Format the opcode
 550 |         formatted_opcode = get_opcode_string(opcode)
 551 | 
 552 |         # Preferred opcodes
 553 |         set_in_pref = False
 554 |         for pref in pref_opcodes:
 555 |             if pref in formatted_opcode:
 556 |                 pref_set.append(formatted_opcode)
 557 |                 set_in_pref = True
 558 |         if set_in_pref:
 559 |             continue
 560 | 
 561 |         # Else add to useful set
 562 |         useful_set.append(get_opcode_string(opcode))
 563 | 
 564 |     # Preferred opcodes first
 565 |     useful_set = pref_set + useful_set
 566 | 
 567 |     # Only return the number of opcodes defined with the "-n" parameter
 568 |     return useful_set[:int(args.n)]
 569 | 
 570 | 
 571 | def filter_string_set(string_set):
 572 |     # This is the only set we have - even if it's a weak one
 573 |     useful_set = []
 574 | 
 575 |     # Local string scores
 576 |     localStringScores = {}
 577 | 
 578 |     # Local UTF strings
 579 |     utfstrings = []
 580 | 
 581 |     for string in string_set:
 582 | 
 583 |         # Goodware string marker
 584 |         goodstring = False
 585 |         goodcount = 0
 586 | 
 587 |         # Goodware Strings
 588 |         if string in good_strings_db:
 589 |             goodstring = True
 590 |             goodcount = good_strings_db[string]
 591 |             # print "%s - %s" % ( goodstring, good_strings[string] )
 592 |             if args.excludegood:
 593 |                 continue
 594 | 
 595 |         # UTF
 596 |         original_string = string
 597 |         if string[:8] == "UTF16LE:":
 598 |             # print "removed UTF16LE from %s" % string
 599 |             string = string[8:]
 600 |             utfstrings.append(string)
 601 | 
 602 |         # Good string evaluation (after the UTF modification)
 603 |         if goodstring:
 604 |             # Reduce the score by the number of occurence in goodware files
 605 |             localStringScores[string] = (goodcount * -1) + 5
 606 |         else:
 607 |             localStringScores[string] = 0
 608 | 
 609 |         # PEStudio String Blacklist Evaluation
 610 |         if pestudio_available:
 611 |             (pescore, type) = get_pestudio_score(string)
 612 |             # print("PE Match: %s" % string)
 613 |             # Reset score of goodware files to 5 if blacklisted in PEStudio
 614 |             if type != "":
 615 |                 pestudioMarker[string] = type
 616 |                 # Modify the PEStudio blacklisted strings with their goodware stats count
 617 |                 if goodstring:
 618 |                     pescore = pescore - (goodcount / 1000.0)
 619 |                     # print "%s - %s - %s" % (string, pescore, goodcount)
 620 |                 localStringScores[string] = pescore
 621 | 
 622 |         if not goodstring:
 623 | 
 624 |             # Length Score
 625 |             #length = len(string)
 626 |             #if length > int(args.y) and length < int(args.s):
 627 |             #    localStringScores[string] += round(len(string) / 8, 2)
 628 |             #if length >= int(args.s):
 629 |             #    localStringScores[string] += 1
 630 | 
 631 |             # Reduction
 632 |             if ".." in string:
 633 |                 localStringScores[string] -= 5
 634 |             if "   " in string:
 635 |                 localStringScores[string] -= 5
 636 |             # Packer Strings
 637 |             if re.search(r'(WinRAR\\SFX)', string):
 638 |                 localStringScores[string] -= 4
 639 |             # US ASCII char
 640 |             if "\x1f" in string:
 641 |                 localStringScores[string] -= 4
 642 |             # Chains of 00s
 643 |             if string.count('0000000000') > 2:
 644 |                 localStringScores[string] -= 5
 645 |             # Repeated characters
 646 |             if re.search(r'(?!.* ([A-Fa-f0-9])\1{8,})', string):
 647 |                 localStringScores[string] -= 5
 648 | 
 649 |             # Certain strings add-ons ----------------------------------------------
 650 |             # Extensions - Drive
 651 |             if re.search(r'[A-Za-z]:\\', string, re.IGNORECASE):
 652 |                 localStringScores[string] += 2
 653 |             # Relevant file extensions
 654 |             if re.search(r'(\.exe|\.pdb|\.scr|\.log|\.cfg|\.txt|\.dat|\.msi|\.com|\.bat|\.dll|\.pdb|\.vbs|'
 655 |                          r'\.tmp|\.sys|\.ps1|\.vbp|\.hta|\.lnk)', string, re.IGNORECASE):
 656 |                 localStringScores[string] += 4
 657 |             # System keywords
 658 |             if re.search(r'(cmd.exe|system32|users|Documents and|SystemRoot|Grant|hello|password|process|log)',
 659 |                          string, re.IGNORECASE):
 660 |                 localStringScores[string] += 5
 661 |             # Protocol Keywords
 662 |             if re.search(r'(ftp|irc|smtp|command|GET|POST|Agent|tor2web|HEAD)', string, re.IGNORECASE):
 663 |                 localStringScores[string] += 5
 664 |             # Connection keywords
 665 |             if re.search(r'(error|http|closed|fail|version|proxy)', string, re.IGNORECASE):
 666 |                 localStringScores[string] += 3
 667 |             # Browser User Agents
 668 |             if re.search(r'(Mozilla|MSIE|Windows NT|Macintosh|Gecko|Opera|User\-Agent)', string, re.IGNORECASE):
 669 |                 localStringScores[string] += 5
 670 |             # Temp and Recycler
 671 |             if re.search(r'(TEMP|Temporary|Appdata|Recycler)', string, re.IGNORECASE):
 672 |                 localStringScores[string] += 4
 673 |             # Malicious keywords - hacktools
 674 |             if re.search(r'(scan|sniff|poison|intercept|fake|spoof|sweep|dump|flood|inject|forward|scan|vulnerable|'
 675 |                          r'credentials|creds|coded|p0c|Content|host)', string, re.IGNORECASE):
 676 |                 localStringScores[string] += 5
 677 |             # Network keywords
 678 |             if re.search(r'(address|port|listen|remote|local|process|service|mutex|pipe|frame|key|lookup|connection)',
 679 |                          string, re.IGNORECASE):
 680 |                 localStringScores[string] += 3
 681 |             # Drive
 682 |             if re.search(r'([C-Zc-z]:\\)', string, re.IGNORECASE):
 683 |                 localStringScores[string] += 4
 684 |             # IP
 685 |             if re.search(
 686 |                     r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
 687 |                     string, re.IGNORECASE):  # IP Address
 688 |                 localStringScores[string] += 5
 689 |             # Copyright Owner
 690 |             if re.search(r'(coded | c0d3d |cr3w\b|Coded by |codedby)', string, re.IGNORECASE):
 691 |                 localStringScores[string] += 7
 692 |             # Extension generic
 693 |             if re.search(r'\.[a-zA-Z]{3}\b', string):
 694 |                 localStringScores[string] += 3
 695 |             # All upper case
 696 |             if re.search(r'^[A-Z]{6,}$', string):
 697 |                 localStringScores[string] += 2.5
 698 |             # All lower case
 699 |             if re.search(r'^[a-z]{6,}$', string):
 700 |                 localStringScores[string] += 2
 701 |             # All lower with space
 702 |             if re.search(r'^[a-z\s]{6,}$', string):
 703 |                 localStringScores[string] += 2
 704 |             # All characters
 705 |             if re.search(r'^[A-Z][a-z]{5,}$', string):
 706 |                 localStringScores[string] += 2
 707 |             # URL
 708 |             if re.search(r'(%[a-z][:\-,;]|\\\\%s|\\\\[A-Z0-9a-z%]+\\[A-Z0-9a-z%]+)', string):
 709 |                 localStringScores[string] += 2.5
 710 |             # certificates
 711 |             if re.search(r'(thawte|trustcenter|signing|class|crl|CA|certificate|assembly)', string, re.IGNORECASE):
 712 |                 localStringScores[string] -= 4
 713 |             # Parameters
 714 |             if re.search(r'( \-[a-z]{,2}[\s]?[0-9]?| /[a-z]+[\s]?[\w]*)', string, re.IGNORECASE):
 715 |                 localStringScores[string] += 4
 716 |             # Directory
 717 |             if re.search(r'([a-zA-Z]:|^|%)\\[A-Za-z]{4,30}\\', string):
 718 |                 localStringScores[string] += 4
 719 |             # Executable - not in directory
 720 |             if re.search(r'^[^\\]+\.(exe|com|scr|bat|sys)$', string, re.IGNORECASE):
 721 |                 localStringScores[string] += 4
 722 |             # Date placeholders
 723 |             if re.search(r'(yyyy|hh:mm|dd/mm|mm/dd|%s:%s:)', string, re.IGNORECASE):
 724 |                 localStringScores[string] += 3
 725 |             # Placeholders
 726 |             if re.search(r'[^A-Za-z](%s|%d|%i|%02d|%04d|%2d|%3s)[^A-Za-z]', string, re.IGNORECASE):
 727 |                 localStringScores[string] += 3
 728 |             # String parts from file system elements
 729 |             if re.search(r'(cmd|com|pipe|tmp|temp|recycle|bin|secret|private|AppData|driver|config)', string,
 730 |                          re.IGNORECASE):
 731 |                 localStringScores[string] += 3
 732 |             # Programming
 733 |             if re.search(r'(execute|run|system|shell|root|cimv2|login|exec|stdin|read|process|netuse|script|share)',
 734 |                          string, re.IGNORECASE):
 735 |                 localStringScores[string] += 3
 736 |             # Credentials
 737 |             if re.search(r'(user|pass|login|logon|token|cookie|creds|hash|ticket|NTLM|LMHASH|kerberos|spnego|session|'
 738 |                          r'identif|account|login|auth|privilege)', string, re.IGNORECASE):
 739 |                 localStringScores[string] += 3
 740 |             # Malware
 741 |             if re.search(r'(\.[a-z]/[^/]+\.txt|)', string, re.IGNORECASE):
 742 |                 localStringScores[string] += 3
 743 |             # Variables
 744 |             if re.search(r'%[A-Z_]+%', string, re.IGNORECASE):
 745 |                 localStringScores[string] += 4
 746 |             # RATs / Malware
 747 |             if re.search(r'(spy|logger|dark|cryptor|RAT\b|eye|comet|evil|xtreme|poison|meterpreter|metasploit|/veil|Blood)',
 748 |                          string, re.IGNORECASE):
 749 |                 localStringScores[string] += 5
 750 |             # Missed user profiles
 751 |             if re.search(r'[\\](users|profiles|username|benutzer|Documents and Settings|Utilisateurs|Utenti|'
 752 |                          r'Usuários)[\\]', string, re.IGNORECASE):
 753 |                 localStringScores[string] += 3
 754 |             # Strings: Words ending with numbers
 755 |             if re.search(r'^[A-Z][a-z]+[0-9]+$', string, re.IGNORECASE):
 756 |                 localStringScores[string] += 1
 757 |             # Spying
 758 |             if re.search(r'(implant)', string, re.IGNORECASE):
 759 |                 localStringScores[string] += 1
 760 |             # Program Path - not Programs or Windows
 761 |             if re.search(r'^[Cc]:\\\\[^PW]', string):
 762 |                 localStringScores[string] += 3
 763 |             # Special strings
 764 |             if re.search(r'(\\\\\.\\|kernel|.dll|usage|\\DosDevices\\)', string, re.IGNORECASE):
 765 |                 localStringScores[string] += 5
 766 |             # Parameters
 767 |             if re.search(r'( \-[a-z] | /[a-z] | \-[a-z]:[a-zA-Z]| \/[a-z]:[a-zA-Z])', string):
 768 |                 localStringScores[string] += 4
 769 |             # File
 770 |             if re.search(r'^[a-zA-Z0-9]{3,40}\.[a-zA-Z]{3}', string, re.IGNORECASE):
 771 |                 localStringScores[string] += 3
 772 |             # Comment Line / Output Log
 773 |             if re.search(r'^([\*\#]+ |\[[\*\-\+]\] |[\-=]> |\[[A-Za-z]\] )', string):
 774 |                 localStringScores[string] += 4
 775 |             # Output typo / special expression
 776 |             if re.search(r'(!\.$|!!!$| :\)$| ;\)$|fucked|[\w]\.\.\.\.$)', string):
 777 |                 localStringScores[string] += 4
 778 |             # Base64
 779 |             if re.search(r'^(?:[A-Za-z0-9+/]{4}){30,}(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$', string) and \
 780 |                     re.search(r'[A-Za-z]', string) and re.search(r'[0-9]', string):
 781 |                 localStringScores[string] += 7
 782 |             # Base64 Executables
 783 |             if re.search(r'(TVqQAAMAAAAEAAAA//8AALgAAAA|TVpQAAIAAAAEAA8A//8AALgAAAA|TVqAAAEAAAAEABAAAAAAAAAAAAA|'
 784 |                          r'TVoAAAAAAAAAAAAAAAAAAAAAAAA|TVpTAQEAAAAEAAAA//8AALgAAAA)', string):
 785 |                 localStringScores[string] += 5
 786 |             # Malicious intent
 787 |             if re.search(r'(loader|cmdline|ntlmhash|lmhash|infect|encrypt|exec|elevat|dump|target|victim|override|'
 788 |                          r'traverse|mutex|pawnde|exploited|shellcode|injected|spoofed|dllinjec|exeinj|reflective|'
 789 |                          r'payload|inject|back conn)',
 790 |                          string, re.IGNORECASE):
 791 |                 localStringScores[string] += 5
 792 |             # Privileges
 793 |             if re.search(r'(administrator|highest|system|debug|dbg|admin|adm|root) privilege', string, re.IGNORECASE):
 794 |                 localStringScores[string] += 4
 795 |             # System file/process names
 796 |             if re.search(r'(LSASS|SAM|lsass.exe|cmd.exe|LSASRV.DLL)', string):
 797 |                 localStringScores[string] += 4
 798 |             # System file/process names
 799 |             if re.search(r'(\.exe|\.dll|\.sys)$', string, re.IGNORECASE):
 800 |                 localStringScores[string] += 4
 801 |             # Indicators that string is valid
 802 |             if re.search(r'(^\\\\)', string, re.IGNORECASE):
 803 |                 localStringScores[string] += 1
 804 |             # Compiler output directories
 805 |             if re.search(r'(\\Release\\|\\Debug\\|\\bin|\\sbin)', string, re.IGNORECASE):
 806 |                 localStringScores[string] += 2
 807 |             # Special - Malware related strings
 808 |             if re.search(r'(Management Support Team1|/c rundll32|DTOPTOOLZ Co.|net start|Exec|taskkill)', string):
 809 |                 localStringScores[string] += 4
 810 |             # Powershell
 811 |             if re.search(r'(bypass|windowstyle | hidden |-command|IEX |Invoke-Expression|Net.Webclient|Invoke[A-Z]|'
 812 |                          r'Net.WebClient|-w hidden |-encoded'
 813 |                          r'-encodedcommand| -nop |MemoryLoadLibrary|FromBase64String|Download|EncodedCommand)', string, re.IGNORECASE):
 814 |                 localStringScores[string] += 4
 815 |             # WMI
 816 |             if re.search(r'( /c WMIC)', string, re.IGNORECASE):
 817 |                 localStringScores[string] += 3
 818 |             # Windows Commands
 819 |             if re.search(r'( net user | net group |ping |whoami |bitsadmin |rundll32.exe javascript:|'
 820 |                          r'schtasks.exe /create|/c start )',
 821 |                          string, re.IGNORECASE):
 822 |                 localStringScores[string] += 3
 823 |             # JavaScript
 824 |             if re.search(r'(new ActiveXObject\("WScript.Shell"\).Run|.Run\("cmd.exe|.Run\("%comspec%\)|'
 825 |                          r'.Run\("c:\\Windows|.RegisterXLL\()', string, re.IGNORECASE):
 826 |                 localStringScores[string] += 3
 827 |             # Signing Certificates
 828 |             if re.search(r'( Inc | Co.|  Ltd.,| LLC| Limited)', string):
 829 |                 localStringScores[string] += 2
 830 |             # Privilege escalation
 831 |             if re.search(r'(sysprep|cryptbase|secur32)', string, re.IGNORECASE):
 832 |                 localStringScores[string] += 2
 833 |             # Webshells
 834 |             if re.search(r'(isset\($post\[|isset\($get\[|eval\(Request)', string, re.IGNORECASE):
 835 |                 localStringScores[string] += 2
 836 |             # Suspicious words 1
 837 |             if re.search(r'(impersonate|drop|upload|download|execute|shell|\bcmd\b|decode|rot13|decrypt)', string,
 838 |                          re.IGNORECASE):
 839 |                 localStringScores[string] += 2
 840 |             # Suspicious words 1
 841 |             if re.search(r'([+] |[-] |[*] |injecting|exploit|dumped|dumping|scanning|scanned|elevation|'
 842 |                          r'elevated|payload|vulnerable|payload|reverse connect|bind shell|reverse shell| dump | '
 843 |                          r'back connect |privesc|privilege escalat|debug privilege| inject |interactive shell|'
 844 |                          r'shell commands| spawning |] target |] Transmi|] Connect|] connect|] Dump|] command |'
 845 |                          r'] token|] Token |] Firing | hashes | etc/passwd| SAM | NTML|unsupported target|'
 846 |                          r'race condition|Token system |LoaderConfig| add user |ile upload |ile download |'
 847 |                          r'Attaching to |ser has been successfully added|target system |LSA Secrets|DefaultPassword|'
 848 |                          r'Password: |loading dll|.Execute\(|Shellcode|Loader|inject x86|inject x64|bypass|katz|'
 849 |                          r'sploit|ms[0-9][0-9][^0-9]|\bCVE[^a-zA-Z]|privilege::|lsadump|door)',
 850 |                          string, re.IGNORECASE):
 851 |                 localStringScores[string] += 4
 852 |             # Mutex / Named Pipes
 853 |             if re.search(r'(Mutex|NamedPipe|\\Global\\|\\pipe\\)', string, re.IGNORECASE):
 854 |                 localStringScores[string] += 3
 855 |             # Usage
 856 |             if re.search(r'(isset\($post\[|isset\($get\[)', string, re.IGNORECASE):
 857 |                 localStringScores[string] += 2
 858 |             # Hash
 859 |             if re.search(r'\b([a-f0-9]{32}|[a-f0-9]{40}|[a-f0-9]{64})\b', string, re.IGNORECASE):
 860 |                 localStringScores[string] += 2
 861 |             # Persistence
 862 |             if re.search(r'(sc.exe |schtasks|at \\\\|at [0-9]{2}:[0-9]{2})', string, re.IGNORECASE):
 863 |                 localStringScores[string] += 3
 864 |             # Unix/Linux
 865 |             if re.search(r'(;chmod |; chmod |sh -c|/dev/tcp/|/bin/telnet|selinux| shell| cp /bin/sh )', string,
 866 |                          re.IGNORECASE):
 867 |                 localStringScores[string] += 3
 868 |             # Attack
 869 |             if re.search(
 870 |                     r'(attacker|brute force|bruteforce|connecting back|EXHAUSTIVE|exhaustion| spawn| evil| elevated)',
 871 |                     string, re.IGNORECASE):
 872 |                 localStringScores[string] += 3
 873 |             # Strings with less value
 874 |             if re.search(r'(abcdefghijklmnopqsst|ABCDEFGHIJKLMNOPQRSTUVWXYZ|0123456789:;)', string, re.IGNORECASE):
 875 |                 localStringScores[string] -= 5
 876 |             # VB Backdoors
 877 |             if re.search(
 878 |                     r'(kill|wscript|plugins|svr32|Select |)',
 879 |                     string, re.IGNORECASE):
 880 |                 localStringScores[string] += 3
 881 |             # Suspicious strings - combo / special characters
 882 |             if re.search(
 883 |                     r'([a-z]{4,}[!\?]|\[[!+\-]\] |[a-zA-Z]{4,}...)',
 884 |                     string, re.IGNORECASE):
 885 |                 localStringScores[string] += 3
 886 |             if re.search(
 887 |                     r'(-->|!!!| <<< | >>> )',
 888 |                     string, re.IGNORECASE):
 889 |                 localStringScores[string] += 5
 890 |             # Swear words
 891 |             if re.search(
 892 |                     r'\b(fuck|damn|shit|penis)\b',
 893 |                     string, re.IGNORECASE):
 894 |                 localStringScores[string] += 5
 895 |             # Scripting Strings
 896 |             if re.search(
 897 |                     r'(%APPDATA%|%USERPROFILE%|Public|Roaming|& del|& rm| && |script)',
 898 |                     string, re.IGNORECASE):
 899 |                 localStringScores[string] += 3
 900 |             # UACME Bypass
 901 |             if re.search(
 902 |                     r'(Elevation|pwnd|pawn|elevate to)',
 903 |                     string, re.IGNORECASE):
 904 |                 localStringScores[string] += 3
 905 | 
 906 |             # ENCODING DETECTIONS --------------------------------------------------
 907 |             try:
 908 |                 if len(string) > 8:
 909 |                     # Try different ways - fuzz string
 910 |                     # Base64
 911 |                     if args.trace:
 912 |                         print("Starting Base64 string analysis ...")
 913 |                     for m_string in (string, string[1:], string[:-1], string[1:] + "=", string + "=", string + "=="):
 914 |                         if is_base_64(m_string):
 915 |                             try:
 916 |                                 decoded_string = base64.b64decode(m_string, validate=False)
 917 |                             except binascii.Error as e:
 918 |                                 continue
 919 |                             if is_ascii_string(decoded_string, padding_allowed=True):
 920 |                                 # print "match"
 921 |                                 localStringScores[string] += 10
 922 |                                 base64strings[string] = decoded_string
 923 |                     # Hex Encoded string
 924 |                     if args.trace:
 925 |                         print("Starting Hex encoded string analysis ...")
 926 |                     for m_string in ([string, re.sub('[^a-zA-Z0-9]', '', string)]):
 927 |                         #print m_string
 928 |                         if is_hex_encoded(m_string):
 929 |                             #print("^ is HEX")
 930 |                             decoded_string = bytes.fromhex(m_string)
 931 |                             #print removeNonAsciiDrop(decoded_string)
 932 |                             if is_ascii_string(decoded_string, padding_allowed=True):
 933 |                                 # not too many 00s
 934 |                                 if '00' in m_string:
 935 |                                     if len(m_string) / float(m_string.count('0')) <= 1.2:
 936 |                                         continue
 937 |                                 #print("^ is ASCII / WIDE")
 938 |                                 localStringScores[string] += 8
 939 |                                 hexEncStrings[string] = decoded_string
 940 |             except Exception as e:
 941 |                 if args.debug:
 942 |                     traceback.print_exc()
 943 |                 pass
 944 | 
 945 |             # Reversed String -----------------------------------------------------
 946 |             if string[::-1] in good_strings_db:
 947 |                 localStringScores[string] += 10
 948 |                 reversedStrings[string] = string[::-1]
 949 | 
 950 |             # Certain string reduce	-----------------------------------------------
 951 |             if re.search(r'(rundll32\.exe$|kernel\.dll$)', string, re.IGNORECASE):
 952 |                 localStringScores[string] -= 4
 953 | 
 954 |         # Set the global string score
 955 |         stringScores[original_string] = localStringScores[string]
 956 | 
 957 |         if args.debug:
 958 |             if string in utfstrings:
 959 |                 is_utf = True
 960 |             else:
 961 |                 is_utf = False
 962 |                 # print "SCORE: %s\tUTF: %s\tSTRING: %s" % ( localStringScores[string], is_utf, string )
 963 | 
 964 |     sorted_set = sorted(localStringScores.items(), key=operator.itemgetter(1), reverse=True)
 965 | 
 966 |     # Only the top X strings
 967 |     c = 0
 968 |     result_set = []
 969 |     for string in sorted_set:
 970 | 
 971 |         # Skip the one with a score lower than -z X
 972 |         if not args.noscorefilter and not args.inverse:
 973 |             if string[1] < int(args.z):
 974 |                 continue
 975 | 
 976 |         if string[0] in utfstrings:
 977 |             result_set.append("UTF16LE:%s" % string[0])
 978 |         else:
 979 |             result_set.append(string[0])
 980 | 
 981 |         #c += 1
 982 |         #if c > int(args.rc):
 983 |         #    break
 984 | 
 985 |     if args.trace:
 986 |         print("RESULT SET:")
 987 |         print(result_set)
 988 | 
 989 |     # return the filtered set
 990 |     return result_set
 991 | 
 992 | 
 993 | def generate_general_condition(file_info):
 994 |     """
 995 |     Generates a general condition for a set of files
 996 |     :param file_info:
 997 |     :return:
 998 |     """
 999 |     conditions_string = ""
1000 |     conditions = []
1001 |     pe_module_neccessary = False
1002 | 
1003 |     # Different Magic Headers and File Sizes
1004 |     magic_headers = []
1005 |     file_sizes = []
1006 |     imphashes = []
1007 | 
1008 |     try:
1009 |         for filePath in file_info:
1010 |             # Short file name info used for inverse generation has no magic/size fields
1011 |             if "magic" not in file_info[filePath]:
1012 |                 continue
1013 |             magic = file_info[filePath]["magic"]
1014 |             size = file_info[filePath]["size"]
1015 |             imphash = file_info[filePath]["imphash"]
1016 | 
1017 |             # Add them to the lists
1018 |             if magic not in magic_headers and magic != "":
1019 |                 magic_headers.append(magic)
1020 |             if size not in file_sizes:
1021 |                 file_sizes.append(size)
1022 |             if imphash not in imphashes and imphash != "":
1023 |                 imphashes.append(imphash)
1024 | 
1025 |         # If different magic headers are less than 5
1026 |         if len(magic_headers) <= 5:
1027 |             magic_string = " or ".join(get_uint_string(h) for h in magic_headers)
1028 |             if " or " in magic_string:
1029 |                 conditions.append("( {0} )".format(magic_string))
1030 |             else:
1031 |                 conditions.append("{0}".format(magic_string))
1032 | 
1033 |         # Biggest size multiplied with maxsize_multiplier
1034 |         if not args.nofilesize and len(file_sizes) > 0:
1035 |             conditions.append(get_file_range(max(file_sizes)))
1036 | 
1037 |         # If different magic headers are less than 5
1038 |         if len(imphashes) == 1:
1039 |             conditions.append("pe.imphash() == \"{0}\"".format(imphashes[0]))
1040 |             pe_module_neccessary = True
1041 | 
1042 |         # If enough attributes were special
1043 |         condition_string = " and ".join(conditions)
1044 | 
1045 |     except Exception as e:
1046 |         if args.debug:
1047 |             traceback.print_exc()
1048 |             exit(1)
1049 |         print("[E] ERROR while generating general condition - check the global rule and remove it if it's faulty")
1050 | 
1051 |     return condition_string, pe_module_neccessary
1052 | 
1053 | 
1054 | def generate_rules(file_strings, file_opcodes, super_rules, file_info, inverse_stats):
1055 |     # Write to file ---------------------------------------------------
1056 |     if args.o:
1057 |         try:
1058 |             fh = open(args.o, 'w')
1059 |         except Exception as e:
1060 |             traceback.print_exc()
1061 | 
1062 |     # General Info
1063 |     general_info = "/*\n"
1064 |     general_info += "   YARA Rule Set\n"
1065 |     general_info += "   Author: {0}\n".format(args.a)
1066 |     general_info += "   Date: {0}\n".format(get_timestamp_basic())
1067 |     general_info += "   Identifier: {0}\n".format(identifier)
1068 |     general_info += "   Reference: {0}\n".format(reference)
1069 |     if args.l != "":
1070 |         general_info += "   License: {0}\n".format(args.l)
1071 |     general_info += "*/\n\n"
1072 | 
1073 |     if args.ai: 
1074 |         fh.write(AI_COMMENT)
1075 |     else:
1076 |         fh.write(general_info)
1077 | 
1078 |     # GLOBAL RULES ----------------------------------------------------
1079 |     if args.globalrule:
1080 | 
1081 |         condition, pe_module_necessary = generate_general_condition(file_info)
1082 | 
1083 |         # Global Rule
1084 |         if condition != "":
1085 |             global_rule = "/* Global Rule -------------------------------------------------------------- */\n"
1086 |             global_rule += "/* Will be evaluated first, speeds up scanning process, remove at will */\n\n"
1087 |             global_rule += "global private rule gen_characteristics {\n"
1088 |             global_rule += "   condition:\n"
1089 |             global_rule += "      {0}\n".format(condition)
1090 |             global_rule += "}\n\n"
1091 | 
1092 |             # Write rule
1093 |             if args.o:
1094 |                 fh.write(global_rule)
1095 | 
1096 |     # General vars
1097 |     rules = ""
1098 |     printed_rules = {}
1099 |     opcodes_to_add = []
1100 |     rule_count = 0
1101 |     inverse_rule_count = 0
1102 |     super_rule_count = 0
1103 |     pe_module_necessary = False
1104 | 
1105 |     if not args.inverse:
1106 |         # PROCESS SIMPLE RULES ----------------------------------------------------
1107 |         print("[+] Generating Simple Rules ...")
1108 |         # Apply intelligent filters
1109 |         print("[-] Applying intelligent filters to string findings ...")
1110 |         for filePath in file_strings:
1111 | 
1112 |             print("[-] Filtering string set for %s ..." % filePath)
1113 | 
1114 |             # Replace the original string set with the filtered one
1115 |             file_strings[filePath] = filter_string_set(file_strings[filePath])
1116 | 
1117 |             print("[-] Filtering opcode set for %s ..." % filePath)
1118 | 
1119 |             # Replace the original opcode set with the filtered one
1120 |             file_opcodes[filePath] = filter_opcode_set(file_opcodes[filePath]) if filePath in file_opcodes else []
1121 | 
1122 |         # GENERATE SIMPLE RULES -------------------------------------------
1123 |         fh.write("/* Rule Set ----------------------------------------------------------------- */\n\n")
1124 | 
1125 |         for filePath in file_strings:
1126 | 
1127 |             # Skip if there is nothing to do
1128 |             if len(file_strings[filePath]) == 0:
1129 |                 print("[W] Not enough high scoring strings to create a rule. "
1130 |                       "(Try -z 0 to reduce the min score or --opcodes to include opcodes) FILE: %s" % filePath)
1131 |                 continue
1132 |             elif len(file_strings[filePath]) == 0 and len(file_opcodes[filePath]) == 0:
1133 |                 print("[W] Not enough high scoring strings and opcodes to create a rule. " \
1134 |                       "(Try -z 0 to reduce the min score) FILE: %s" % filePath)
1135 |                 continue
1136 | 
1137 |             # Create Rule
1138 |             try:
1139 |                 rule = ""
1140 |                 (path, file) = os.path.split(filePath)
1141 |                 # Prepare name
1142 |                 fileBase = os.path.splitext(file)[0]
1143 |                 # Create a clean new name
1144 |                 cleanedName = fileBase
1145 |                 # Adapt length of rule name
1146 |                 if len(fileBase) < 8:  # if name is too short add part from path
1147 |                     cleanedName = path.split('\\')[-1:][0] + "_" + cleanedName
1148 |                 # File name starts with a number
1149 |                 if re.search(r'^[0-9]', cleanedName):
1150 |                     cleanedName = "sig_" + cleanedName
1151 |                 # clean name from all characters that would cause errors
1152 |                 cleanedName = re.sub(r'[^\w]', '_', cleanedName)
1153 |                 # Check if already printed
1154 |                 if cleanedName in printed_rules:
1155 |                     printed_rules[cleanedName] += 1
1156 |                     cleanedName = cleanedName + "_" + str(printed_rules[cleanedName])
1157 |                 else:
1158 |                     printed_rules[cleanedName] = 1
1159 | 
1160 |                 # Print rule title ----------------------------------------
1161 |                 rule += "rule %s {\n" % cleanedName
1162 | 
1163 |                 # Meta data -----------------------------------------------
1164 |                 rule += "   meta:\n"
1165 |                 rule += "      description = \"%s - file %s\"\n" % (prefix, file)
1166 |                 rule += "      author = \"%s\"\n" % args.a
1167 |                 rule += "      reference = \"%s\"\n" % reference
1168 |                 rule += "      date = \"%s\"\n" % get_timestamp_basic()
1169 |                 rule += "      hash1 = \"%s\"\n" % file_info[filePath]["hash"]
1170 |                 rule += "   strings:\n"
1171 | 
1172 |                 # Get the strings -----------------------------------------
1173 |                 # Rule String generation
1174 |                 (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \
1175 |                     get_rule_strings(file_strings[filePath], file_opcodes[filePath])
1176 |                 rule += rule_strings
1177 | 
1178 |                 # Extract rul strings
1179 |                 if args.strings:
1180 |                     strings = get_strings(file_strings[filePath])
1181 |                     write_strings(filePath, strings, args.e, args.score)
1182 | 
1183 |                 # Condition -----------------------------------------------
1184 |                 # Conditions list (will later be joined with 'or')
1185 |                 conditions = []  # AND connected
1186 |                 subconditions = []  # OR connected
1187 | 
1188 |                 # Condition PE
1189 |                 # Imphash and Exports - applicable to PE files only
1190 |                 condition_pe = []
1191 |                 condition_pe_part1 = []
1192 |                 condition_pe_part2 = []
1193 |                 if not args.noextras and file_info[filePath]["magic"] == "MZ":
1194 |                     # Add imphash - if certain conditions are met
1195 |                     if file_info[filePath]["imphash"] not in good_imphashes_db and file_info[filePath]["imphash"] != "":
1196 |                         # Comment to imphash
1197 |                         imphash = file_info[filePath]["imphash"]
1198 |                         comment = ""
1199 |                         if imphash in KNOWN_IMPHASHES:
1200 |                             comment = " /* {0} */".format(KNOWN_IMPHASHES[imphash])
1201 |                         # Add imphash to condition
1202 |                         condition_pe_part1.append("pe.imphash() == \"{0}\"{1}".format(imphash, comment))
1203 |                         pe_module_necessary = True
1204 |                     if file_info[filePath]["exports"]:
1205 |                         e_count = 0
1206 |                         for export in file_info[filePath]["exports"]:
1207 |                             if export not in good_exports_db:
1208 |                                 condition_pe_part2.append("pe.exports(\"{0}\")".format(export))
1209 |                                 e_count += 1
1210 |                                 pe_module_necessary = True
1211 |                             if e_count > 5:
1212 |                                 break
1213 | 
1214 |                 # 1st Part of Condition 1
1215 |                 basic_conditions = []
1216 |                 # Filesize
1217 |                 if not args.nofilesize:
1218 |                     basic_conditions.insert(0, get_file_range(file_info[filePath]["size"]))
1219 |                 # Magic
1220 |                 if file_info[filePath]["magic"] != "":
1221 |                     uint_string = get_uint_string(file_info[filePath]["magic"])
1222 |                     basic_conditions.insert(0, uint_string)
1223 |                 # Basic Condition
1224 |                 if len(basic_conditions):
1225 |                     conditions.append(" and ".join(basic_conditions))
1226 | 
1227 |                 # Add extra PE conditions to condition 1
1228 |                 pe_conditions_add = False
1229 |                 if condition_pe_part1 or condition_pe_part2:
1230 |                     if len(condition_pe_part1) == 1:
1231 |                         condition_pe.append(condition_pe_part1[0])
1232 |                     elif len(condition_pe_part1) > 1:
1233 |                         condition_pe.append("( %s )" % " or ".join(condition_pe_part1))
1234 |                     if len(condition_pe_part2) == 1:
1235 |                         condition_pe.append(condition_pe_part2[0])
1236 |                     elif len(condition_pe_part2) > 1:
1237 |                         condition_pe.append("( %s )" % " and ".join(condition_pe_part2))
1238 |                     # Marker that PE conditions have been added
1239 |                     pe_conditions_add = True
1240 |                     # Add to sub condition
1241 |                     subconditions.append(" and ".join(condition_pe))
1242 | 
1243 |                 # String combinations
1244 |                 cond_op = ""  # opcodes condition
1245 |                 cond_hs = ""  # high scoring strings condition
1246 |                 cond_ls = ""  # low scoring strings condition
1247 | 
1248 |                 low_scoring_strings = (string_rule_count - high_scoring_strings)
1249 |                 if high_scoring_strings > 0:
1250 |                     cond_hs = "1 of ($x*)"
1251 |                 if low_scoring_strings > 0:
1252 |                     if low_scoring_strings > 10:
1253 |                         if high_scoring_strings > 0:
1254 |                             cond_ls = "4 of them"
1255 |                         else:
1256 |                             cond_ls = "8 of them"
1257 |                     else:
1258 |                         cond_ls = "all of them"
1259 | 
1260 |                 # If low scoring and high scoring
1261 |                 cond_combined = "all of them"
1262 |                 needs_brackets = False
1263 |                 if low_scoring_strings > 0 and high_scoring_strings > 0:
1264 |                     # If PE conditions have been added, don't be so strict with the strings
1265 |                     if pe_conditions_add:
1266 |                         cond_combined = "{0} or {1}".format(cond_hs, cond_ls)
1267 |                         needs_brackets = True
1268 |                     else:
1269 |                         cond_combined = "{0} and {1}".format(cond_hs, cond_ls)
1270 |                 elif low_scoring_strings > 0 and not high_scoring_strings > 0:
1271 |                     cond_combined = "{0}".format(cond_ls)
1272 |                 elif not low_scoring_strings > 0 and high_scoring_strings > 0:
1273 |                     cond_combined = "{0}".format(cond_hs)
1274 |                 if opcodes_included:
1275 |                     cond_op = " and all of ($op*)"
1276 | 
1277 |                 # Opcodes (if needed)
1278 |                 if cond_op or needs_brackets:
1279 |                     subconditions.append("( {0}{1} )".format(cond_combined, cond_op))
1280 |                 else:
1281 |                     subconditions.append(cond_combined)
1282 | 
1283 |                 # Now add string condition to the conditions
1284 |                 if len(subconditions) == 1:
1285 |                     conditions.append(subconditions[0])
1286 |                 elif len(subconditions) > 1:
1287 |                     conditions.append("( %s )" % " or ".join(subconditions))
1288 | 
1289 |                 # Create condition string
1290 |                 condition_string = " and\n      ".join(conditions)
1291 | 
1292 |                 rule += "   condition:\n"
1293 |                 rule += "      %s\n" % condition_string
1294 |                 rule += "}\n\n"
1295 | 
1296 |                 # Add to rules string
1297 |                 rules += rule
1298 | 
1299 |                 rule_count += 1
1300 |             except Exception as e:
1301 |                 traceback.print_exc()
1302 | 
1303 |     # GENERATE SUPER RULES --------------------------------------------
1304 |     if not nosuper and not args.inverse:
1305 | 
1306 |         rules += "/* Super Rules ------------------------------------------------------------- */\n\n"
1307 |         super_rule_names = []
1308 | 
1309 |         print("[+] Generating Super Rules ...")
1310 |         printed_combi = {}
1311 |         for super_rule in super_rules:
1312 |             try:
1313 |                 rule = ""
1314 |                 # Prepare Name
1315 |                 rule_name = ""
1316 |                 file_list = []
1317 | 
1318 |                 # Loop through files
1319 |                 imphashes = Counter()
1320 |                 for filePath in super_rule["files"]:
1321 |                     (path, file) = os.path.split(filePath)
1322 |                     file_list.append(file)
1323 |                     # Prepare name
1324 |                     fileBase = os.path.splitext(file)[0]
1325 |                     # Create a clean new name
1326 |                     cleanedName = fileBase
1327 |                     # Append it to the full name
1328 |                     rule_name += "_" + cleanedName
1329 |                     # Check if imphash of all files is equal
1330 |                     imphash = file_info[filePath]["imphash"]
1331 |                     if imphash != "-" and imphash != "":
1332 |                         imphashes.update([imphash])
1333 | 
1334 |                 # Imphash usable
1335 |                 if len(imphashes) == 1:
1336 |                     unique_imphash = list(imphashes.items())[0][0]
1337 |                     if unique_imphash in good_imphashes_db:
1338 |                         unique_imphash = ""
1339 | 
1340 |                 # Shorten rule name
1341 |                 rule_name = rule_name[:124]
1342 |                 # Add count if rule name already taken
1343 |                 if rule_name not in super_rule_names:
1344 |                     rule_name = "%s_%s" % (rule_name, super_rule_count)
1345 |                 super_rule_names.append(rule_name)
1346 | 
1347 |                 # Create a list of files
1348 |                 file_listing = ", ".join(file_list)
1349 | 
1350 |                 # File name starts with a number
1351 |                 if re.search(r'^[0-9]', rule_name):
1352 |                     rule_name = "sig_" + rule_name
1353 |                 # clean name from all characters that would cause errors
1354 |                 rule_name = re.sub(r'[^\w]', '_', rule_name)
1355 |                 # Check if already printed
1356 |                 if rule_name in printed_rules:
1357 |                     printed_combi[rule_name] += 1
1358 |                     rule_name = rule_name + "_" + str(printed_combi[rule_name])
1359 |                 else:
1360 |                     printed_combi[rule_name] = 1
1361 | 
1362 |                 # Print rule title
1363 |                 rule += "rule %s {\n" % rule_name
1364 |                 rule += "   meta:\n"
1365 |                 rule += "      description = \"%s - from files %s\"\n" % (prefix, file_listing)
1366 |                 rule += "      author = \"%s\"\n" % args.a
1367 |                 rule += "      reference = \"%s\"\n" % reference
1368 |                 rule += "      date = \"%s\"\n" % get_timestamp_basic()
1369 |                 for i, filePath in enumerate(super_rule["files"]):
1370 |                     rule += "      hash%s = \"%s\"\n" % (str(i + 1), file_info[filePath]["hash"])
1371 | 
1372 |                 rule += "   strings:\n"
1373 | 
1374 |                 # Adding the opcodes
1375 |                 if file_opcodes.get(filePath) is None:
1376 |                     tmp_file_opcodes = {}
1377 |                 else:
1378 |                     tmp_file_opcodes = file_opcodes.get(filePath)
1379 |                 (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \
1380 |                     get_rule_strings(super_rule["strings"], tmp_file_opcodes)
1381 |                 rule += rule_strings
1382 | 
1383 |                 # Condition -----------------------------------------------
1384 |                 # Conditions list (will later be joined with 'or')
1385 |                 conditions = []
1386 | 
1387 |                 # 1st condition
1388 |                 # Evaluate the general characteristics
1389 |                 file_info_super = {}
1390 |                 for filePath in super_rule["files"]:
1391 |                     file_info_super[filePath] = file_info[filePath]
1392 |                 condition_strings, pe_module_necessary_gen = generate_general_condition(file_info_super)
1393 |                 if pe_module_necessary_gen:
1394 |                      pe_module_necessary = True
1395 | 
1396 |                 # 2nd condition
1397 |                 # String combinations
1398 |                 cond_op = ""  # opcodes condition
1399 |                 cond_hs = ""  # high scoring strings condition
1400 |                 cond_ls = ""  # low scoring strings condition
1401 | 
1402 |                 low_scoring_strings = (string_rule_count - high_scoring_strings)
1403 |                 if high_scoring_strings > 0:
1404 |                     cond_hs = "1 of ($x*)"
1405 |                 if low_scoring_strings > 0:
1406 |                     if low_scoring_strings > 10:
1407 |                         if high_scoring_strings > 0:
1408 |                             cond_ls = "4 of them"
1409 |                         else:
1410 |                             cond_ls = "8 of them"
1411 |                     else:
1412 |                         cond_ls = "all of them"
1413 | 
1414 |                 # If low scoring and high scoring
1415 |                 cond_combined = "all of them"
1416 |                 if low_scoring_strings > 0 and high_scoring_strings > 0:
1417 |                     cond_combined = "{0} and {1}".format(cond_hs, cond_ls)
1418 |                 elif low_scoring_strings > 0 and not high_scoring_strings > 0:
1419 |                     cond_combined = "{0}".format(cond_ls)
1420 |                 elif not low_scoring_strings > 0 and high_scoring_strings > 0:
1421 |                     cond_combined = "{0}".format(cond_hs)
1422 |                 if opcodes_included:
1423 |                     cond_op = " and all of ($op*)"
1424 | 
1425 |                 condition2 = "( {0} ){1}".format(cond_combined, cond_op)
1426 |                 conditions.append(" and ".join([condition_strings, condition2]))
1427 | 
1428 |                 # 3nd condition
1429 |                 # In memory detection base condition (no magic, no filesize)
1430 |                 condition_pe = "all of them"
1431 |                 conditions.append(condition_pe)
1432 | 
1433 |                 # Create condition string
1434 |                 condition_string = "\n      ) or ( ".join(conditions)
1435 | 
1436 |                 rule += "   condition:\n"
1437 |                 rule += "      ( %s )\n" % condition_string
1438 |                 rule += "}\n\n"
1439 | 
1440 |                 # print rule
1441 |                 # Add to rules string
1442 |                 rules += rule
1443 | 
1444 |                 super_rule_count += 1
1445 |             except Exception as e:
1446 |                 traceback.print_exc()
1447 | 
1448 |     try:
1449 |         # WRITING RULES TO FILE
1450 |         # PE Module -------------------------------------------------------
1451 |         if not args.noextras:
1452 |             if pe_module_necessary:
1453 |                 fh.write('import "pe"\n\n')
1454 |         # RULES -----------------------------------------------------------
1455 |         if args.o:
1456 |             fh.write(rules)
1457 |     except Exception as e:
1458 |         traceback.print_exc()
1459 | 
1460 |     # PROCESS INVERSE RULES ---------------------------------------------------
1461 |     # print inverse_stats.keys()
1462 |     if args.inverse:
1463 |         print("[+] Generating inverse rules ...")
1464 |         inverse_rules = ""
1465 |         # Apply intelligent filters -------------------------------------------
1466 |         print("[+] Applying intelligent filters to string findings ...")
1467 |         for fileName in inverse_stats:
1468 | 
1469 |             print("[-] Filtering string set for %s ..." % fileName)
1470 | 
1471 |             # Replace the original string set with the filtered one
1472 |             string_set = inverse_stats[fileName]
1473 |             inverse_stats[fileName] = []
1474 |             inverse_stats[fileName] = filter_string_set(string_set)
1475 | 
1476 |             # Preset if empty
1477 |             if fileName not in file_opcodes:
1478 |                 file_opcodes[fileName] = {}
1479 | 
1480 |         # GENERATE INVERSE RULES -------------------------------------------
1481 |         fh.write("/* Inverse Rules ------------------------------------------------------------- */\n\n")
1482 | 
1483 |         for fileName in inverse_stats:
1484 |             try:
1485 |                 rule = ""
1486 |                 # Create a clean new name
1487 |                 cleanedName = fileName.replace(".", "_")
1488 |                 # Add ANOMALY
1489 |                 cleanedName += "_ANOMALY"
1490 |                 # File name starts with a number
1491 |                 if re.search(r'^[0-9]', cleanedName):
1492 |                     cleanedName = "sig_" + cleanedName
1493 |                 # clean name from all characters that would cause errors
1494 |                 cleanedName = re.sub(r'[^\w]', '_', cleanedName)
1495 |                 # Check if already printed
1496 |                 if cleanedName in printed_rules:
1497 |                     printed_rules[cleanedName] += 1
1498 |                     cleanedName = cleanedName + "_" + str(printed_rules[cleanedName])
1499 |                 else:
1500 |                     printed_rules[cleanedName] = 1
1501 | 
1502 |                 # Print rule title ----------------------------------------
1503 |                 rule += "rule %s {\n" % cleanedName
1504 | 
1505 |                 # Meta data -----------------------------------------------
1506 |                 rule += "   meta:\n"
1507 |                 rule += "      description = \"%s for anomaly detection - file %s\"\n" % (prefix, fileName)
1508 |                 rule += "      author = \"%s\"\n" % args.a
1509 |                 rule += "      reference = \"%s\"\n" % reference
1510 |                 rule += "      date = \"%s\"\n" % get_timestamp_basic()
1511 |                 for i, hash in enumerate(file_info[fileName]["hashes"]):
1512 |                     rule += "      hash%s = \"%s\"\n" % (str(i + 1), hash)
1513 | 
1514 |                 rule += "   strings:\n"
1515 | 
1516 |                 # Get the strings -----------------------------------------
1517 |                 # Rule String generation
1518 |                 (rule_strings, opcodes_included, string_rule_count, high_scoring_strings) = \
1519 |                     get_rule_strings(inverse_stats[fileName], file_opcodes[fileName])
1520 |                 rule += rule_strings
1521 | 
1522 |                 # Condition -----------------------------------------------
1523 |                 folderNames = ""
1524 |                 if not args.nodirname:
1525 |                     folderNames += "and ( filepath matches /"
1526 |                     folderNames += "$/ or filepath matches /".join(file_info[fileName]["folder_names"])
1527 |                     folderNames += "$/ )"
1528 |                 condition = "filename == \"%s\" %s and not ( all of them )" % (fileName, folderNames)
1529 | 
1530 |                 rule += "   condition:\n"
1531 |                 rule += "      %s\n" % condition
1532 |                 rule += "}\n\n"
1533 | 
1534 |                 # print rule
1535 |                 # Add to rules string
1536 |                 inverse_rules += rule
1537 | 
1538 |             except Exception as e:
1539 |                 traceback.print_exc()
1540 | 
1541 |         try:
1542 |             # Try to write rule to file
1543 |             if args.o:
1544 |                 fh.write(inverse_rules)
1545 |             inverse_rule_count += 1
1546 |         except Exception as e:
1547 |             traceback.print_exc()
1548 | 
1549 |     # Close the rules file --------------------------------------------
1550 |     if args.o:
1551 |         try:
1552 |             fh.close()
1553 |         except Exception as e:
1554 |             traceback.print_exc()
1555 | 
1556 |     # Print rules to command line -------------------------------------
1557 |     if args.debug:
1558 |         print(rules)
1559 | 
1560 |     return (rule_count, inverse_rule_count, super_rule_count)
1561 | 
1562 | 
1563 | def get_rule_strings(string_elements, opcode_elements):
1564 |     rule_strings = ""
1565 |     high_scoring_strings = 0
1566 |     string_rule_count = 0
1567 | 
1568 |     # Adding the strings --------------------------------------
1569 |     for i, string in enumerate(string_elements):
1570 | 
1571 |         # Collect the data
1572 |         is_fullword = True
1573 |         initial_string = string
1574 |         enc = " ascii"
1575 |         base64comment = ""
1576 |         hexEncComment = ""
1577 |         reversedComment = ""
1578 |         fullword = ""
1579 |         pestudio_comment = ""
1580 |         score_comment = ""
1581 |         goodware_comment = ""
1582 | 
1583 |         if string in good_strings_db:
1584 |             goodware_comment = " /* Goodware String - occured %s times */" % (good_strings_db[string])
1585 | 
1586 |         if string in stringScores:
1587 |             if args.score:
1588 |                 score_comment += " /* score: '%.2f'*/" % (stringScores[string])
1589 |         else:
1590 |             print("NO SCORE: %s" % string)
1591 | 
1592 |         if string[:8] == "UTF16LE:":
1593 |             string = string[8:]
1594 |             enc = " wide"
1595 |         if string in base64strings:
1596 |             base64comment = " /* base64 encoded string '%s' */" % base64strings[string].decode()
1597 |         if string in hexEncStrings:
1598 |             hexEncComment = " /* hex encoded string '%s' */" % removeNonAsciiDrop(hexEncStrings[string]).decode()
1599 |         if string in pestudioMarker and args.score:
1600 |             pestudio_comment = " /* PEStudio Blacklist: %s */" % pestudioMarker[string]
1601 |         if string in reversedStrings:
1602 |             reversedComment = " /* reversed goodware string '%s' */" % reversedStrings[string]
1603 | 
1604 |         # Extra checks
1605 |         if is_hex_encoded(string, check_length=False):
1606 |             is_fullword = False
1607 | 
1608 |         # Checking string length
1609 |         if len(string) >= args.s:
1610 |             # cut string
1611 |             string = string[:args.s].rstrip("\\")
1612 |             # not fullword anymore
1613 |             is_fullword = False
1614 |         # Show as fullword
1615 |         if is_fullword:
1616 |             fullword = " fullword"
1617 | 
1618 |         # Now compose the rule line
1619 |         if float(stringScores[initial_string]) > score_highly_specific:
1620 |             high_scoring_strings += 1
1621 |             rule_strings += "      $x%s = \"%s\"%s%s%s%s%s%s%s%s\n" % (
1622 |             str(i + 1), string, fullword, enc, base64comment, reversedComment, pestudio_comment, score_comment,
1623 |             goodware_comment, hexEncComment)
1624 |         else:
1625 |             rule_strings += "      $s%s = \"%s\"%s%s%s%s%s%s%s%s\n" % (
1626 |             str(i + 1), string, fullword, enc, base64comment, reversedComment, pestudio_comment, score_comment,
1627 |             goodware_comment, hexEncComment)
1628 | 
1629 |         # If too many string definitions found - cut it at the
1630 |         # count defined via command line param -rc
1631 |         if (i + 1) >= strings_per_rule:
1632 |             break
1633 | 
1634 |         string_rule_count += 1
1635 | 
1636 |     # Adding the opcodes --------------------------------------
1637 |     opcodes_included = False
1638 |     if len(opcode_elements) > 0:
1639 |         rule_strings += "\n"
1640 |         for i, opcode in enumerate(opcode_elements):
1641 |             rule_strings += "      $op%s = { %s }\n" % (str(i), opcode)
1642 |             opcodes_included = True
1643 |     else:
1644 |         if args.opcodes:
1645 |             print("[-] Not enough unique opcodes found to include them")
1646 | 
1647 |     return rule_strings, opcodes_included, string_rule_count, high_scoring_strings
1648 | 
1649 | 
1650 | def get_strings(string_elements):
1651 |     """
1652 |     Get a dictionary of all string types
1653 |     :param string_elements:
1654 |     :return:
1655 |     """
1656 |     strings = {
1657 |         "ascii": [],
1658 |         "wide": [],
1659 |         "base64 encoded": [],
1660 |         "hex encoded": [],
1661 |         "reversed": []
1662 |     }
1663 | 
1664 |     # Adding the strings --------------------------------------
1665 |     for i, string in enumerate(string_elements):
1666 | 
1667 |         if string[:8] == "UTF16LE:":
1668 |             string = string[8:]
1669 |             strings["wide"].append(string)
1670 |         elif string in base64strings:
1671 |             strings["base64 encoded"].append(string)
1672 |         elif string in hexEncStrings:
1673 |             strings["hex encoded"].append(string)
1674 |         elif string in reversedStrings:
1675 |             strings["reversed"].append(string)
1676 |         else:
1677 |             strings["ascii"].append(string)
1678 | 
1679 |     return strings
1680 | 
1681 | 
1682 | def write_strings(filePath, strings, output_dir, scores):
1683 |     """
1684 |     Writes string information to an output file
1685 |     :param filePath:
1686 |     :param strings:
1687 |     :param output_dir:
1688 |     :param scores:
1689 |     :return:
1690 |     """
1691 |     SECTIONS = ["ascii", "wide", "base64 encoded", "hex encoded", "reversed"]
1692 |     # File
1693 |     filename = os.path.basename(filePath)
1694 |     strings_filename = os.path.join(output_dir, "%s_strings.txt" % filename)
1695 |     print("[+] Writing strings to file %s" % strings_filename)
1696 |     # Strings
1697 |     output_string = []
1698 |     for key in SECTIONS:
1699 |         # Skip empty
1700 |         if len(strings[key]) < 1:
1701 |             continue
1702 |         # Section
1703 |         output_string.append("%s Strings" % key.upper())
1704 |         output_string.append("------------------------------------------------------------------------")
1705 |         for string in strings[key]:
1706 |             if scores:
1707 |                 score = "unknown"
1708 |                 if key == "wide":
1709 |                     score = stringScores["UTF16LE:%s" % string]
1710 |                 else:
1711 |                     score = stringScores[string]
1712 |                 output_string.append("%d;%s" % (score, string))
1713 |             else:
1714 |                 output_string.append(string)
1715 |         # Empty line between sections
1716 |         output_string.append("\n")
1717 |     with open(strings_filename, "w") as fh:
1718 |         fh.write("\n".join(output_string))
1719 | 
1720 | 
1721 | def initialize_pestudio_strings():
1722 |     pestudio_strings = {}
1723 | 
1724 |     tree = etree.parse(get_abs_path(PE_STRINGS_FILE))
1725 | 
1726 |     pestudio_strings["strings"] = tree.findall(".//string")
1727 |     pestudio_strings["av"] = tree.findall(".//av")
1728 |     pestudio_strings["folder"] = tree.findall(".//folder")
1729 |     pestudio_strings["os"] = tree.findall(".//os")
1730 |     pestudio_strings["reg"] = tree.findall(".//reg")
1731 |     pestudio_strings["guid"] = tree.findall(".//guid")
1732 |     pestudio_strings["ssdl"] = tree.findall(".//ssdl")
1733 |     pestudio_strings["ext"] = tree.findall(".//ext")
1734 |     pestudio_strings["agent"] = tree.findall(".//agent")
1735 |     pestudio_strings["oid"] = tree.findall(".//oid")
1736 |     pestudio_strings["priv"] = tree.findall(".//priv")
1737 | 
1738 |     # Obsolete
1739 |     # for elem in string_elems:
1740 |     #    strings.append(elem.text)
1741 | 
1742 |     return pestudio_strings
1743 | 
1744 | 
1745 | def get_pestudio_score(string):
1746 |     for type in pestudio_strings:
1747 |         for elem in pestudio_strings[type]:
1748 |             # Full match
1749 |             if elem.text.lower() == string.lower():
1750 |                 # Exclude the "extension" black list for now
1751 |                 if type != "ext":
1752 |                     return 5, type
1753 |     return 0, ""
1754 | 
1755 | 
1756 | def get_opcode_string(opcode):
1757 |     return ' '.join(opcode[i:i + 2] for i in range(0, len(opcode), 2))
1758 | 
1759 | 
1760 | def get_uint_string(magic):
1761 |     if len(magic) == 2:
1762 |         return "uint8(0) == 0x{0}{1}".format(magic[0], magic[1])
1763 |     if len(magic) == 4:
1764 |         return "uint16(0) == 0x{2}{3}{0}{1}".format(magic[0], magic[1], magic[2], magic[3])
1765 |     return ""
1766 | 
1767 | 
1768 | def get_file_range(size):
1769 |     size_string = ""
1770 |     try:
1771 |         # max sample size - args.fm times the original size
1772 |         max_size_b = size * args.fm
1773 |         # Minimum size
1774 |         if max_size_b < 1024:
1775 |             max_size_b = 1024
1776 |         # in KB
1777 |         max_size = int(max_size_b / 1024)
1778 |         max_size_kb = max_size
1779 |         # Round
1780 |         if len(str(max_size)) == 2:
1781 |             max_size = int(round(max_size, -1))
1782 |         elif len(str(max_size)) == 3:
1783 |             max_size = int(round(max_size, -2))
1784 |         elif len(str(max_size)) == 4:
1785 |             max_size = int(round(max_size, -3))
1786 |         elif len(str(max_size)) >= 5:
1787 |             max_size = int(round(max_size, -3))
1788 |         size_string = "filesize < {0}KB".format(max_size)
1789 |         if args.debug:
1790 |             print("File Size Eval: SampleSize (b): {0} SizeWithMultiplier (b/Kb): {1} / {2} RoundedSize: {3}".format(
1791 |                 str(size), str(max_size_b), str(max_size_kb), str(max_size)))
1792 |     except Exception as e:
1793 |         traceback.print_exc()
1794 |     finally:
1795 |         return size_string
1796 | 
1797 | 
1798 | def get_timestamp_basic(date_obj=None):
1799 |     if not date_obj:
1800 |         date_obj = datetime.datetime.now()
1801 |     date_str = date_obj.strftime("%Y-%m-%d")
1802 |     return date_str
1803 | 
1804 | 
1805 | def is_ascii_char(b, padding_allowed=False):
1806 |     if padding_allowed:
1807 |         if (ord(b) < 127 and ord(b) > 31) or ord(b) == 0:
1808 |             return 1
1809 |     else:
1810 |         if ord(b) < 127 and ord(b) > 31:
1811 |             return 1
1812 |     return 0
1813 | 
1814 | 
1815 | def is_ascii_string(string, padding_allowed=False):
1816 |     for b in [i.to_bytes(1, sys.byteorder) for i in string]:
1817 |         if padding_allowed:
1818 |             if not ((ord(b) < 127 and ord(b) > 31) or ord(b) == 0):
1819 |                 return 0
1820 |         else:
1821 |             if not (ord(b) < 127 and ord(b) > 31):
1822 |                 return 0
1823 |     return 1
1824 | 
1825 | 
1826 | def is_base_64(s):
1827 |     return (len(s) % 4 == 0) and re.match('^[A-Za-z0-9+/]+[=]{0,2}$', s)
1828 | 
1829 | 
1830 | def is_hex_encoded(s, check_length=True):
1831 |     if re.match('^[A-Fa-f0-9]+$', s):
1832 |         if check_length:
1833 |             if len(s) % 2 == 0:
1834 |                 return True
1835 |         else:
1836 |             return True
1837 |     return False
1838 | 
1839 | 
1840 | # TODO: Still buggy after port to Python3
1841 | def extract_hex_strings(s):
1842 |     strings = []
1843 |     hex_strings = re.findall(b"([a-fA-F0-9]{10,})", s)
1844 |     for string in list(hex_strings):
1845 |         hex_strings += string.split(b'0000')
1846 |         hex_strings += string.split(b'0d0a')
1847 |         hex_strings += re.findall(b'((?:0000|002[a-f0-9]|00[3-9a-f][0-9a-f]){6,})', string, re.IGNORECASE)
1848 |     hex_strings = list(set(hex_strings))
1849 |     # ASCII Encoded Strings
1850 |     for string in hex_strings:
1851 |         for x in string.split(b'00'):
1852 |             if len(x) > 10:
1853 |                 strings.append(x)
1854 |     # WIDE Encoded Strings
1855 |     for string in hex_strings:
1856 |         try:
1857 |             if len(string) % 2 != 0 or len(string) < 8:
1858 |                 continue
1859 |             # Skip
1860 |             if b'0000' in string:
1861 |                 continue
1862 |             dec = string.replace(b'00', b'')
1863 |             if is_ascii_string(dec, padding_allowed=False):
1864 |                 strings.append(string)
1865 |         except Exception as e:
1866 |             traceback.print_exc()
1867 |     return strings
1868 | 
1869 | 
1870 | def removeNonAsciiDrop(string):
1871 |     nonascii = "error"
1872 |     try:
1873 |         byte_list = [i.to_bytes(1, sys.byteorder) for i in string]
1874 |         # Generate a new string without disturbing characters
1875 |         nonascii = b"".join(i for i in byte_list if ord(i)<127 and ord(i)>31)
1876 |     except Exception as e:
1877 |         traceback.print_exc()
1878 |         pass
1879 |     return nonascii
1880 | 
1881 | 
1882 | def save(object, filename):
1883 |     file = gzip.GzipFile(filename, 'wb')
1884 |     file.write(bytes(json.dumps(object), 'utf-8'))
1885 |     file.close()
1886 | 
1887 | 
1888 | def load(filename):
1889 |     file = gzip.GzipFile(filename, 'rb')
1890 |     object = json.loads(file.read())
1891 |     file.close()
1892 |     return object
1893 | 
1894 | 
1895 | def update_databases():
1896 |     # Preparations
1897 |     try:
1898 |         dbDir = './dbs/'
1899 |         if not os.path.exists(dbDir):
1900 |             os.makedirs(dbDir)
1901 |     except Exception as e:
1902 |         if args.debug:
1903 |             traceback.print_exc()
1904 |         print("Error while creating the database directory ./dbs")
1905 |         sys.exit(1)
1906 | 
1907 |     # Downloading current repository
1908 |     try:
1909 |         for filename, repo_url in REPO_URLS.items():
1910 |             print("Downloading %s from %s ..." % (filename, repo_url))
1911 |             with urllib.request.urlopen(repo_url) as response, open("./dbs/%s" % filename, 'wb') as out_file:
1912 |                 shutil.copyfileobj(response, out_file)
1913 |     except Exception as e:
1914 |         if args.debug:
1915 |             traceback.print_exc()
1916 |         print("Error while downloading the database file - check your Internet connection "
1917 |               "(try to run it with --debug to see the full error message)")
1918 |         sys.exit(1)
1919 | 
1920 | 
1921 | def processSampleDir(targetDir):
1922 |     """
1923 |     Processes samples in a given directory and creates a yara rule file
1924 |     :param directory:
1925 |     :return:
1926 |     """
1927 |     # Special strings
1928 |     base64strings = {}
1929 |     hexEncStrings = {}
1930 |     reversedStrings = {}
1931 |     pestudioMarker = {}
1932 |     stringScores = {}
1933 | 
1934 |     # Extract all information
1935 |     (sample_string_stats, sample_opcode_stats, file_info) = \
1936 |         parse_sample_dir(targetDir, args.nr, generateInfo=True, onlyRelevantExtensions=args.oe)
1937 | 
1938 |     # Evaluate Strings
1939 |     (file_strings, file_opcodes, combinations, super_rules, inverse_stats) = \
1940 |         sample_string_evaluation(sample_string_stats, sample_opcode_stats, file_info)
1941 | 
1942 |     # Create Rule Files
1943 |     (rule_count, inverse_rule_count, super_rule_count) = \
1944 |         generate_rules(file_strings, file_opcodes, super_rules, file_info, inverse_stats)
1945 | 
1946 |     if args.inverse:
1947 |         print("[=] Generated %s INVERSE rules." % str(inverse_rule_count))
1948 |     else:
1949 |         print("[=] Generated %s SIMPLE rules." % str(rule_count))
1950 |         if not nosuper:
1951 |             print("[=] Generated %s SUPER rules." % str(super_rule_count))
1952 |         print("[=] All rules written to %s" % args.o)
1953 | 
1954 | 
1955 | def emptyFolder(dir):
1956 |     """
1957 |     Removes all files from a given folder
1958 |     :return:
1959 |     """
1960 |     for file in os.listdir(dir):
1961 |         filePath = os.path.join(dir, file)
1962 |         try:
1963 |             if os.path.isfile(filePath):
1964 |                 print("[!] Removing %s ..." % filePath)
1965 |                 os.unlink(filePath)
1966 |         except Exception as e:
1967 |             print(e)
1968 | 
1969 | 
1970 | def getReference(ref):
1971 |     """
1972 |     Get a reference string - if the provided string is the path to a text file, then read the contents and return it as
1973 |     reference
1974 |     :param ref:
1975 |     :return:
1976 |     """
1977 |     if os.path.exists(ref):
1978 |         reference = getFileContent(ref)
1979 |         print("[+] Read reference from file %s > %s" % (ref, reference))
1980 |         return reference
1981 |     else:
1982 |         return ref
1983 | 
1984 | 
1985 | def getIdentifier(id, path):
1986 |     """
1987 |     Get a identifier string - if the provided string is the path to a text file, then read the contents and return it as
1988 |     reference, otherwise use the last element of the full path
1989 |     :param ref:
1990 |     :return:
1991 |     """
1992 |     # Identifier
1993 |     if id == "not set" or not os.path.exists(id):
1994 |         # Identifier is the highest folder name
1995 |         return os.path.basename(path.rstrip('/'))
1996 |     else:
1997 |         # Read identifier from file
1998 |         identifier = getFileContent(id)
1999 |         print("[+] Read identifier from file %s > %s" % (id, identifier))
2000 |         return identifier
2001 | 
2002 | 
2003 | def getPrefix(prefix, identifier):
2004 |     """
2005 |     Get a prefix string for the rule description based on the identifier
2006 |     :param prefix:
2007 |     :param identifier:
2008 |     :return:
2009 |     """
2010 |     if prefix == "Auto-generated rule":
2011 |         return identifier
2012 |     else:
2013 |         return prefix
2014 | 
2015 | 
2016 | def getFileContent(file):
2017 |     """
2018 |     Gets the contents of a file (limited to 1024 characters)
2019 |     :param file:
2020 |     :return:
2021 |     """
2022 |     try:
2023 |         with open(file) as f:
2024 |             return f.read(1024)
2025 |     except Exception as e:
2026 |         return "not found"
2027 | 
2028 | 
2029 | # CTRL+C Handler --------------------------------------------------------------
2030 | def signal_handler(signal_name, frame):
2031 |     print("> yarGen's work has been interrupted")
2032 |     sys.exit(0)
2033 | 
2034 | 
2035 | def print_welcome():
2036 |     print("------------------------------------------------------------------------")
2037 |     print("                   _____            ")
2038 |     print("    __ _____ _____/ ___/__ ___      ")
2039 |     print("   / // / _ `/ __/ (_ / -_) _ \\     ")
2040 |     print("   \\_, /\\_,_/_/  \\___/\\__/_//_/     ")
2041 |     print("  /___/  Yara Rule Generator        ")
2042 |     print("         Florian Roth, August 2023, Version %s" % __version__)
2043 |     print("   ")
2044 |     print("  Note: Rules have to be post-processed")
2045 |     print("  See this post for details: https://medium.com/@cyb3rops/121d29322282")
2046 |     print("------------------------------------------------------------------------")
2047 | 
2048 | 
2049 | # MAIN ################################################################
2050 | if __name__ == '__main__':
2051 | 
2052 |     # Signal handler for CTRL+C
2053 |     signal_module.signal(signal_module.SIGINT, signal_handler)
2054 | 
2055 |     # Parse Arguments
2056 |     parser = argparse.ArgumentParser(description='yarGen')
2057 | 
2058 |     group_creation = parser.add_argument_group('Rule Creation')
2059 |     group_creation.add_argument('-m', help='Path to scan for malware')
2060 |     group_creation.add_argument('-y', help='Minimum string length to consider (default=8)', metavar='min-size',
2061 |                                 default=8)
2062 |     group_creation.add_argument('-z', help='Minimum score to consider (default=0)', metavar='min-score', default=0)
2063 |     group_creation.add_argument('-x', help='Score required to set string as \'highly specific string\' (default: 30)',
2064 |                                 metavar='high-scoring', default=30)
2065 |     group_creation.add_argument('-w', help='Minimum number of strings that overlap to create a super rule (default: 5)',
2066 |                                 metavar='superrule-overlap', default=5)
2067 |     group_creation.add_argument('-s', help='Maximum length to consider (default=128)', metavar='max-size', default=128, type=int)
2068 |     group_creation.add_argument('-rc', help='Maximum number of strings per rule (default=20, intelligent filtering '
2069 |                                             'will be applied)', metavar='maxstrings', default=20)
2070 |     group_creation.add_argument('--excludegood', help='Force the exclude all goodware strings', action='store_true',
2071 |                                 default=False)
2072 | 
2073 |     group_output = parser.add_argument_group('Rule Output')
2074 |     group_output.add_argument('-o', help='Output rule file', metavar='output_rule_file', default='yargen_rules.yar')
2075 |     group_output.add_argument('-e', help='Output directory for string exports', metavar='output_dir_strings', default='')
2076 |     group_output.add_argument('-a', help='Author Name', metavar='author', default='yarGen Rule Generator')
2077 |     group_output.add_argument('-r', help='Reference (can be string or text file)', metavar='ref',
2078 |                               default='https://github.com/Neo23x0/yarGen')
2079 |     group_output.add_argument('-l', help='License', metavar='lic', default='')
2080 |     group_output.add_argument('-p', help='Prefix for the rule description', metavar='prefix',
2081 |                               default='Auto-generated rule')
2082 |     group_output.add_argument('-b', help='Text file from which the identifier is read (default: last folder name in '
2083 |                                          'the full path, e.g. "myRAT" if -m points to /mnt/mal/myRAT)',
2084 |                               metavar='identifier',
2085 |                               default='not set')
2086 |     group_output.add_argument('--score', help='Show the string scores as comments in the rules', action='store_true',
2087 |                               default=False)
2088 |     group_output.add_argument('--strings', help='Show the string scores as comments in the rules', action='store_true',
2089 |                               default=False)
2090 |     group_output.add_argument('--nosimple', help='Skip simple rule creation for files included in super rules',
2091 |                               action='store_true', default=False)
2092 |     group_output.add_argument('--nomagic', help='Don\'t include the magic header condition statement',
2093 |                               action='store_true', default=False)
2094 |     group_output.add_argument('--nofilesize', help='Don\'t include the filesize condition statement',
2095 |                               action='store_true', default=False)
2096 |     group_output.add_argument('-fm', help='Multiplier for the maximum \'filesize\' condition value (default: 3)',
2097 |                               default=3)
2098 |     group_output.add_argument('--globalrule', help='Create global rules (improved rule set speed)',
2099 |                               action='store_true', default=False)
2100 |     group_output.add_argument('--nosuper', action='store_true', default=False, help='Don\'t try to create super rules '
2101 |                                                                                     'that match against various files')
2102 | 
2103 |     group_db = parser.add_argument_group('Database Operations')
2104 |     group_db.add_argument('--update', action='store_true', default=False, help='Update the local strings and opcodes '
2105 |                                                                                'dbs from the online repository')
2106 |     group_db.add_argument('-g', help='Path to scan for goodware (dont use the database shipped with yaraGen)')
2107 |     group_db.add_argument('-u', action='store_true', default=False, help='Update local standard goodware database with '
2108 |                                                                          'a new analysis result (used with -g)')
2109 |     group_db.add_argument('-c', action='store_true', default=False, help='Create new local goodware database '
2110 |                                                                          '(use with -g and optionally -i "identifier")')
2111 |     group_db.add_argument('-i', default="", help='Specify an identifier for the newly created databases '
2112 |                                                  '(good-strings-identifier.db, good-opcodes-identifier.db)')
2113 | 
2114 |     group_general = parser.add_argument_group('General Options')
2115 |     group_general.add_argument('--dropzone', action='store_true', default=False,
2116 |                                help='Dropzone mode - monitors a directory [-m] for new samples to process. '
2117 |                                     'WARNING: Processed files will be deleted!')
2118 |     group_general.add_argument('--nr', action='store_true', default=False, help='Do not recursively scan directories')
2119 |     group_general.add_argument('--oe', action='store_true', default=False, help='Only scan executable extensions EXE, '
2120 |                                                                                 'DLL, ASP, JSP, PHP, BIN, INFECTED')
2121 |     group_general.add_argument('-fs', help='Max file size in MB to analyze (default=10)', metavar='size-in-MB',
2122 |                                default=10)
2123 |     group_general.add_argument('--noextras', action='store_true', default=False,
2124 |                               help='Don\'t use extras like Imphash or PE header specifics')
2125 |     group_general.add_argument('--ai', action='store_true', default=False, help='Create output to be used as ChatGPT4 input')
2126 |     group_general.add_argument('--debug', action='store_true', default=False, help='Debug output')
2127 |     group_general.add_argument('--trace', action='store_true', default=False, help='Trace output')
2128 | 
2129 |     group_opcode = parser.add_argument_group('Other Features')
2130 |     group_opcode.add_argument('--opcodes', action='store_true', default=False, help='Do use the OpCode feature '
2131 |                                                                                     '(use this if not enough high '
2132 |                                                                                     'scoring strings can be found)')
2133 |     group_opcode.add_argument('-n', help='Number of opcodes to add if not enough high scoring string could be found '
2134 |                                          '(default=3)', metavar='opcode-num', default=3)
2135 | 
2136 |     group_inverse = parser.add_argument_group('Inverse Mode (unstable)')
2137 |     group_inverse.add_argument('--inverse', help=argparse.SUPPRESS, action='store_true', default=False)
2138 |     group_inverse.add_argument('--nodirname', help=argparse.SUPPRESS, action='store_true', default=False)
2139 |     group_inverse.add_argument('--noscorefilter', help=argparse.SUPPRESS, action='store_true', default=False)
2140 | 
2141 |     args = parser.parse_args()
2142 | 
2143 |     # Print Welcome
2144 |     print_welcome()
2145 | 
2146 |     if not args.update and not args.m and not args.g:
2147 |         parser.print_help()
2148 |         print("")
2149 |         print("""
2150 | [E] You have to select --update to update yarGens database or -m for signature generation or -g for the 
2151 | creation of goodware string collections 
2152 | (see https://github.com/Neo23x0/yarGen#examples for more details)
2153 | 
2154 | Recommended command line:
2155 |     python yarGen.py -a 'Your Name' --opcodes --dropzone -m ./dropzone""")
2156 |         sys.exit(1)
2157 | 
2158 |     # Update
2159 |     if args.update:
2160 |         update_databases()
2161 |         print("[+] Updated databases - you can now start creating YARA rules")
2162 |         sys.exit(0)
2163 | 
2164 |     # Typical input erros
2165 |     if args.m:
2166 |         if os.path.isfile(args.m):
2167 |             print("[E] Input is a file, please use a directory instead (-m path)")
2168 |             sys.exit(0)
2169 | 
2170 |     # Opcodes evaluation or not
2171 |     use_opcodes = False
2172 |     if args.opcodes:
2173 |         use_opcodes = True
2174 | 
2175 |     # Read PEStudio string list
2176 |     pestudio_strings = {}
2177 |     pestudio_available = False
2178 | 
2179 |     # Super Rule Generation
2180 |     nosuper = args.nosuper
2181 | 
2182 |     # Identifier
2183 |     sourcepath = args.m
2184 |     if args.g:
2185 |         sourcepath = args.g
2186 |     identifier = getIdentifier(args.b, sourcepath)
2187 |     print("[+] Using identifier '%s'" % identifier)
2188 | 
2189 |     # Reference
2190 |     reference = getReference(args.r)
2191 |     print("[+] Using reference '%s'" % reference)
2192 | 
2193 |     # Prefix
2194 |     prefix = getPrefix(args.p, identifier)
2195 |     print("[+] Using prefix '%s'" % prefix)
2196 | 
2197 |     if os.path.isfile(get_abs_path(PE_STRINGS_FILE)):
2198 |         print("[+] Processing PEStudio strings ...")
2199 |         pestudio_strings = initialize_pestudio_strings()
2200 |         pestudio_available = True
2201 | 
2202 |     # Highly specific string score
2203 |     score_highly_specific = int(args.x)
2204 | 
2205 |     # Scan goodware files
2206 |     if args.g:
2207 |         print("[+] Processing goodware files ...")
2208 |         good_strings_db, good_opcodes_db, good_imphashes_db, good_exports_db = \
2209 |             parse_good_dir(args.g, args.nr, args.oe)
2210 | 
2211 |         # Update existing databases
2212 |         if args.u:
2213 |             try:
2214 |                 print("[+] Updating databases ...")
2215 | 
2216 |                 # Evaluate the database identifiers
2217 |                 db_identifier = ""
2218 |                 if args.i != "":
2219 |                     db_identifier = "-%s" % args.i
2220 |                 strings_db = "./dbs/good-strings%s.db" % db_identifier
2221 |                 opcodes_db = "./dbs/good-opcodes%s.db" % db_identifier
2222 |                 imphashes_db = "./dbs/good-imphashes%s.db" % db_identifier
2223 |                 exports_db = "./dbs/good-exports%s.db" % db_identifier
2224 | 
2225 |                 # Strings -----------------------------------------------------
2226 |                 print("[+] Updating %s ..." % strings_db)
2227 |                 good_pickle = load(get_abs_path(strings_db))
2228 |                 print("Old string database entries: %s" % len(good_pickle))
2229 |                 good_pickle.update(good_strings_db)
2230 |                 print("New string database entries: %s" % len(good_pickle))
2231 |                 save(good_pickle, strings_db)
2232 | 
2233 |                 # Opcodes -----------------------------------------------------
2234 |                 print("[+] Updating %s ..." % opcodes_db)
2235 |                 good_opcode_pickle = load(get_abs_path(opcodes_db))
2236 |                 print("Old opcode database entries: %s" % len(good_opcode_pickle))
2237 |                 good_opcode_pickle.update(good_opcodes_db)
2238 |                 print("New opcode database entries: %s" % len(good_opcode_pickle))
2239 |                 save(good_opcode_pickle, opcodes_db)
2240 | 
2241 |                 # Imphashes ---------------------------------------------------
2242 |                 print("[+] Updating %s ..." % imphashes_db)
2243 |                 good_imphashes_pickle = load(get_abs_path(imphashes_db))
2244 |                 print("Old opcode database entries: %s" % len(good_imphashes_pickle))
2245 |                 good_imphashes_pickle.update(good_imphashes_db)
2246 |                 print("New opcode database entries: %s" % len(good_imphashes_pickle))
2247 |                 save(good_imphashes_pickle, imphashes_db)
2248 | 
2249 |                 # Exports -----------------------------------------------------
2250 |                 print("[+] Updating %s ..." % exports_db)
2251 |                 good_exports_pickle = load(get_abs_path(exports_db))
2252 |                 print("Old opcode database entries: %s" % len(good_exports_pickle))
2253 |                 good_exports_pickle.update(good_exports_db)
2254 |                 print("New opcode database entries: %s" % len(good_exports_pickle))
2255 |                 save(good_exports_pickle, exports_db)
2256 | 
2257 |             except Exception as e:
2258 |                 traceback.print_exc()
2259 | 
2260 |         # Create new databases
2261 |         if args.c:
2262 |             print("[+] Creating local database ...")
2263 |             # Evaluate the database identifiers
2264 |             db_identifier = ""
2265 |             if args.i != "":
2266 |                 db_identifier = "-%s" % args.i
2267 |             strings_db = "./dbs/good-strings%s.db" % db_identifier
2268 |             opcodes_db = "./dbs/good-opcodes%s.db" % db_identifier
2269 |             imphashes_db = "./dbs/good-imphashes%s.db" % db_identifier
2270 |             exports_db = "./dbs/good-exports%s.db" % db_identifier
2271 | 
2272 |             # Creating the databases
2273 |             print("[+] Using '%s' as filename for newly created strings database" % strings_db)
2274 |             print("[+] Using '%s' as filename for newly created opcodes database" % opcodes_db)
2275 |             print("[+] Using '%s' as filename for newly created opcodes database" % imphashes_db)
2276 |             print("[+] Using '%s' as filename for newly created opcodes database" % exports_db)
2277 | 
2278 |             try:
2279 | 
2280 |                 if os.path.isfile(strings_db):
2281 |                     input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % strings_db)
2282 |                     os.remove(strings_db)
2283 |                 if os.path.isfile(opcodes_db):
2284 |                     input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % opcodes_db)
2285 |                     os.remove(opcodes_db)
2286 |                 if os.path.isfile(imphashes_db):
2287 |                     input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % imphashes_db)
2288 |                     os.remove(imphashes_db)
2289 |                 if os.path.isfile(exports_db):
2290 |                     input("File %s alread exists. Press enter to proceed or CTRL+C to exit." % exports_db)
2291 |                     os.remove(exports_db)
2292 | 
2293 |                 # Strings
2294 |                 good_json = Counter()
2295 |                 good_json = good_strings_db
2296 |                 # Opcodes
2297 |                 good_op_json = Counter()
2298 |                 good_op_json = good_opcodes_db
2299 |                 # Imphashes
2300 |                 good_imphashes_json = Counter()
2301 |                 good_imphashes_json = good_imphashes_db
2302 |                 # Exports
2303 |                 good_exports_json = Counter()
2304 |                 good_exports_json = good_exports_db
2305 | 
2306 |                 # Save
2307 |                 save(good_json, strings_db)
2308 |                 save(good_op_json, opcodes_db)
2309 |                 save(good_imphashes_json, imphashes_db)
2310 |                 save(good_exports_json, exports_db)
2311 | 
2312 |                 print("New database with %d string, %d opcode, %d imphash, %d export entries created. " \
2313 |                       "(remember to use --opcodes to extract opcodes from the samples and create the opcode databases)"\
2314 |                       % (len(good_strings_db), len(good_opcodes_db), len(good_imphashes_db), len(good_exports_db)))
2315 |             except Exception as e:
2316 |                 traceback.print_exc()
2317 | 
2318 |     # Analyse malware samples and create rules
2319 |     else:
2320 |         print("[+] Reading goodware strings from database 'good-strings.db' ...")
2321 |         print("    (This could take some time and uses several Gigabytes of RAM depending on your db size)")
2322 | 
2323 |         good_strings_db = Counter()
2324 |         good_opcodes_db = Counter()
2325 |         good_imphashes_db = Counter()
2326 |         good_exports_db = Counter()
2327 | 
2328 |         opcodes_num = 0
2329 |         strings_num = 0
2330 |         imphash_num = 0
2331 |         exports_num = 0
2332 | 
2333 |         # Initialize all databases
2334 |         for file in os.listdir(get_abs_path("./dbs/")):
2335 |             if not file.endswith(".db"):
2336 |                 continue
2337 |             filePath = os.path.join("./dbs/", file)
2338 |             # String databases
2339 |             if file.startswith("good-strings"):
2340 |                 try:
2341 |                     print("[+] Loading %s ..." % filePath)
2342 |                     good_json = load(get_abs_path(filePath))
2343 |                     good_strings_db.update(good_json)
2344 |                     print("[+] Total: %s / Added %d entries" % (
2345 |                     len(good_strings_db), len(good_strings_db) - strings_num))
2346 |                     strings_num = len(good_strings_db)
2347 |                 except Exception as e:
2348 |                     traceback.print_exc()
2349 |             # Opcode databases
2350 |             if file.startswith("good-opcodes"):
2351 |                 try:
2352 |                     if use_opcodes:
2353 |                         print("[+] Loading %s ..." % filePath)
2354 |                         good_op_json = load(get_abs_path(filePath))
2355 |                         good_opcodes_db.update(good_op_json)
2356 |                         print("[+] Total: %s (removed duplicates) / Added %d entries" % (
2357 |                         len(good_opcodes_db), len(good_opcodes_db) - opcodes_num))
2358 |                         opcodes_num = len(good_opcodes_db)
2359 |                 except Exception as e:
2360 |                     use_opcodes = False
2361 |                     traceback.print_exc()
2362 |             # Imphash databases
2363 |             if file.startswith("good-imphash"):
2364 |                 try:
2365 |                     print("[+] Loading %s ..." % filePath)
2366 |                     good_imphashes_json = load(get_abs_path(filePath))
2367 |                     good_imphashes_db.update(good_imphashes_json)
2368 |                     print("[+] Total: %s / Added %d entries" % (
2369 |                     len(good_imphashes_db), len(good_imphashes_db) - imphash_num))
2370 |                     imphash_num = len(good_imphashes_db)
2371 |                 except Exception as e:
2372 |                     traceback.print_exc()
2373 |             # Export databases
2374 |             if file.startswith("good-exports"):
2375 |                 try:
2376 |                     print("[+] Loading %s ..." % filePath)
2377 |                     good_exports_json = load(get_abs_path(filePath))
2378 |                     good_exports_db.update(good_exports_json)
2379 |                     print("[+] Total: %s / Added %d entries" % (
2380 |                     len(good_exports_db), len(good_exports_db) - exports_num))
2381 |                     exports_num = len(good_exports_db)
2382 |                 except Exception as e:
2383 |                     traceback.print_exc()
2384 | 
2385 |         if use_opcodes and len(good_opcodes_db) < 1:
2386 |             print("[E] Missing goodware opcode databases."
2387 |                   "    Please run 'yarGen.py --update' to retrieve the newest database set.")
2388 |             use_opcodes = False
2389 | 
2390 |         if len(good_exports_db) < 1 and len(good_imphashes_db) < 1:
2391 |             print("[E] Missing goodware imphash/export databases. "
2392 |                   "    Please run 'yarGen.py --update' to retrieve the newest database set.")
2393 | 
2394 |         if len(good_strings_db) < 1 and not args.c:
2395 |             print("[E] Error - no goodware databases found. "
2396 |                   "    Please run 'yarGen.py --update' to retrieve the newest database set.")
2397 |             sys.exit(1)
2398 | 
2399 |     # If malware directory given
2400 |     if args.m:
2401 | 
2402 |         # Deactivate super rule generation if there's only a single file in the folder
2403 |         if len(os.listdir(args.m)) < 2:
2404 |             nosuper = True
2405 | 
2406 |         # AI input generation
2407 |         strings_per_rule = int(args.rc)
2408 |         if args.ai:
2409 |             strings_per_rule = 200
2410 | 
2411 |         # Special strings
2412 |         base64strings = {}
2413 |         reversedStrings = {}
2414 |         hexEncStrings = {}
2415 |         pestudioMarker = {}
2416 |         stringScores = {}
2417 | 
2418 |         # Dropzone mode
2419 |         if args.dropzone:
2420 |             # Monitoring folder for changes
2421 |             print("Monitoring %s for new sample files (processed samples will be removed)" % args.m)
2422 |             while(True):
2423 |                 if len(os.listdir(args.m)) > 0:
2424 |                     # Deactivate super rule generation if there's only a single file in the folder
2425 |                     if len(os.listdir(args.m)) < 2:
2426 |                         nosuper = True
2427 |                     else:
2428 |                         nosuper = False
2429 |                     # Read a new identifier
2430 |                     identifier = getIdentifier(args.b, args.m)
2431 |                     # Read a new reference
2432 |                     reference = getReference(args.r)
2433 |                     # Generate a new description prefix
2434 |                     prefix = getPrefix(args.p, identifier)
2435 |                     # Process the samples
2436 |                     processSampleDir(args.m)
2437 |                     # Delete all samples from the dropzone folder
2438 |                     emptyFolder(args.m)
2439 |                 time.sleep(1)
2440 |         else:
2441 |             # Scan malware files
2442 |             print("[+] Processing malware files ...")
2443 |             processSampleDir(args.m)
2444 | 
2445 |         print("[+] yarGen run finished")
2446 | 


--------------------------------------------------------------------------------