├── LICENCE ├── README.md ├── data ├── dga_training.txt └── test_domains.txt ├── dga_detection.py ├── install.sh └── requirements.txt /LICENCE: -------------------------------------------------------------------------------- 1 | Software License Agreement (BSD License) 2 | Copyright (c) 2017 Phil Arkwright 3 | All rights reserved. 4 | 5 | Redistribution and use in source and binary forms, with or without 6 | modification, are permitted provided that the following conditions 7 | are met: 8 | 1. Redistributions of source code must retain the above copyright 9 | notice, this list of conditions and the following disclaimer. 10 | 2. Redistributions in binary form must reproduce the above copyright 11 | notice, this list of conditions and the following disclaimer in the 12 | documentation and/or other materials provided with the distribution. 13 | 3. The name of the author may not be used to endorse or promote products 14 | derived from this software without specific prior written permission. 15 | 16 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR 17 | IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES 18 | OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 19 | IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, 20 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT 21 | NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 22 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 23 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT 24 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF 25 | THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DGA-Detection 2 | More and more malware is being created with advanced techniques for circumventing blocking. One of the most prevalent is the Domain Generation Algorithm (DGA), which periodically generates a set of domains used to contact a C&C server. Most DGAs generate random alphanumeric strings that differ significantly in structure from a standard domain. By looking at how frequently the bigrams in a domain occur within the Alexa top 1M, we were able to detect whether a domain was a random string or a legitimate human-readable domain. If a domain is composed almost entirely of low-frequency bigrams that occur rarely within the Alexa top 1M, it is more likely to be a random string. Bigrams of a vowel and a consonant occurred most frequently, whereas bigrams of a letter and a digit occurred least frequently. The script was run against 100,000 GameoverZeus domains and had a detection rate of 100% and a false positive rate against the Alexa top 1M of 8% without any domain whitelisting applied. 3 | 4 | 5 | This system has been tested on Ubuntu and Raspberry Pi. 6 | Currently I have my Raspberry Pi set up as a DNS server using Bind9. 7 | The DGA-Detection script also runs on the Raspberry Pi and reads the requests. 8 | The requests are then processed to determine whether they are a potential DGA. 9 | 10 | ## Install 11 | 12 | ```sh 13 | git clone https://github.com/philarkwright/DGA-Detection.git 14 | cd DGA-Detection 15 | chmod +x install.sh 16 | ./install.sh 17 | ``` 18 | 19 | ## Use 20 | 21 | ```sh 22 | sudo python dga_detection.py 23 | ``` 24 | 25 | ## Training 26 | - The /data/dga_training.txt file contains DGA domains from the Tinba DGA. 
I'd suggest using this to train the model, as it follows the structure of the majority of DGA domains; however, you may replace the domains with your own set if you wish. 27 | 28 | ## Testing 29 | - To test domains against the model after training is complete, create a text file called test_domains.txt and place it in /data/. 30 | - A sample of the Tinba DGA domains has been included in the download. 31 | 32 | ## Settings File 33 | - The settings file is where the model stores the baseline value used to decide whether or not a domain is a potential DGA. This value can be manually raised to increase the detection rate or lowered to reduce false positives. 34 | 35 | ## Live Capture Arguments 36 | 37 | ```sh 38 | nohup sudo python dga_detection.py -o 2 -i 39 | ``` 40 | 41 | ## Potential Issues 42 | When running install.sh, note that the git:// protocol uses port 9418, so make sure your firewall allows outbound connections to this port. 43 | 44 | This project is still very much in development. 45 | 46 | While running the script at the network level on my home network, I noticed that certain Google searches in Google Chrome produced a bunch of alerts for what appeared to be DGA domains. Even if you don't visit these sites, which are normally Chinese (since they use gibberish strings for their domains), Google Chrome will preload and fetch them, causing the alerts. 47 | 48 | NOTE: The whitelist feature uses the Alexa Top 1m. 
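
As a rough illustration of the bigram scoring and the baseline stored in the settings file, here is a minimal Python 3 sketch (the script itself targets Python 2). The function names and the tiny inline domain lists are hypothetical stand-ins for the Alexa Top 1m data and data/dga_training.txt. The baseline is taken as the midpoint between the mean legitimate score and the mean DGA score, and a domain is flagged when its score is at or below that value.

```python
from collections import Counter
from statistics import mean


def bigrams(label):
    """Return the overlapping two-character substrings of a domain label."""
    return [label[i:i + 2] for i in range(len(label) - 1)]


def train(labels):
    """Count every bigram in the legitimate labels; return (counts, total)."""
    counts = Counter(bg for label in labels for bg in bigrams(label))
    return counts, sum(counts.values())


def average_percentage(label, counts, total):
    """Mean share (in percent) of all training bigrams held by each bigram
    of the label. Assumes the label has at least two characters."""
    return mean(100.0 * counts.get(bg, 0) / total for bg in bigrams(label))


def baseline(legit_scores, dga_scores):
    """Midpoint between the mean DGA score and the mean legitimate score."""
    return mean(dga_scores) + (mean(legit_scores) - mean(dga_scores)) / 2


def is_potential_dga(label, counts, total, threshold):
    """Flag the label when its average percentage is at or below the baseline."""
    return average_percentage(label, counts, total) <= threshold


# Hypothetical sample data, for illustration only.
legit = ["google", "facebook", "youtube", "wikipedia"]
dga = ["xqzjkvwp", "qwzxkjvb"]

counts, total = train(legit)
threshold = baseline(
    [average_percentage(d, counts, total) for d in legit],
    [average_percentage(d, counts, total) for d in dga],
)

for name in legit + dga:
    print(name, is_potential_dga(name, counts, total, threshold))
```

Raising `threshold` flags more domains (higher detection rate, more false positives); lowering it does the opposite, which is the trade-off the settings file exposes.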
49 | 50 | Contact me via Twitter @philarkwright 51 | 52 | ## Completed 53 | 54 | - [ ] Add Alexa Whitelisting 55 | - [ ] Add Pushbullet (Notify admin) 56 | - [ ] Fix Lag on capture traffic alerts 57 | - [ ] Add arguments so capture live can be ran with nohup 58 | 59 | -------------------------------------------------------------------------------- /dga_detection.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | #Software License Agreement (BSD License) 3 | #Copyright (c) 2017 Phil Arkwright 4 | #All rights reserved. 5 | 6 | from __future__ import division 7 | from pprint import pprint 8 | from scapy.all import * 9 | import scipy 10 | 11 | import ConfigParser 12 | import os.path 13 | import json 14 | import tldextract #Seperating subdomain from input_domain in capture 15 | import alexa 16 | 17 | from pushbullet import PushBullet 18 | 19 | import argparse 20 | 21 | pushbullet_key = '' 22 | if pushbullet_key != '': 23 | #Configure pushbulet 24 | p = PushBullet(pushbullet_key) 25 | 26 | def send_note(note): 27 | push = p.push_note('%s' % (note), '') 28 | 29 | def hasNumbers(inputString): 30 | return any(char.isdigit() for char in inputString) 31 | 32 | def ConfigSectionMap(section): 33 | dict1 = {} 34 | options = Config.options(section) 35 | for option in options: 36 | try: 37 | dict1[option] = Config.get(section, option) 38 | if dict1[option] == -1: 39 | DebugPrint("skip: %s" % option) 40 | except: 41 | print("exception on %s!" 
% option) 42 | dict1[option] = None 43 | return dict1 44 | 45 | Config = ConfigParser.ConfigParser() 46 | previous_domain = '' 47 | whitelist = {} 48 | 49 | def load_settings(): 50 | 51 | if os.path.isfile('data/settings.conf'): 52 | Config.read("data/settings.conf") 53 | percentage_list_dga_settings = float(ConfigSectionMap("Percentages")['percentage_list_dga_settings']) 54 | percentage_list_alexa_settings = float(ConfigSectionMap("Percentages")['percentage_list_alexa_settings']) 55 | baseline = float(ConfigSectionMap("Percentages")['baseline']) 56 | total_bigrams_settings = float(ConfigSectionMap("Values")['total_bigrams_settings']) 57 | return baseline, total_bigrams_settings 58 | else: 59 | print "No settings file. Please run training function." 60 | 61 | def load_data(): 62 | 63 | if os.path.isfile('data/database.json') and os.path.isfile('data/settings.conf'): 64 | 65 | baseline, total_bigrams_settings = load_settings() 66 | 67 | with open('data/database.json', 'r') as f: 68 | try: 69 | bigram_dict = json.load(f) 70 | process_data(bigram_dict, total_bigrams_settings) #Call process_data 71 | # if the file is empty the ValueError will be thrown 72 | except ValueError: 73 | bigram_dict = {} 74 | else: 75 | 76 | try: 77 | cfgfile = open("data/settings.conf",'w') 78 | Config.add_section('Percentages') 79 | Config.add_section('Values') 80 | Config.set('Percentages','baseline', 0) 81 | Config.write(cfgfile) 82 | cfgfile.close() 83 | except: 84 | print "Settings file error. Please Delete." 85 | exit() 86 | 87 | 88 | if os.path.isfile('data/alexa_top_1m_domain.json'): 89 | with open('data/alexa_top_1m_domain.json', 'r') as f: 90 | training_data = json.load(f) 91 | else: 92 | print "Downloading Alexa Top 1m Domains..." 
93 | training_data = alexa.top_list(1000000) 94 | with open('data/alexa_top_1m_domain.json', 'w') as f: 95 | json.dump(training_data, f) 96 | 97 | 98 | bigram_dict = {} #Define bigram_dict 99 | total_bigrams = 0 #Set initial total to 0 100 | for input_domain in xrange(len(training_data)): #Run through each input_domain in the training list 101 | input_domain = tldextract.extract(training_data[input_domain][1]) 102 | if len(input_domain.domain) > 5 and "-" not in input_domain.domain: 103 | print "Processing domain:", input_domain.domain #Print input_domain number in list 104 | for bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in input_domain 105 | total_bigrams = total_bigrams + 1 #Increment bigram total 106 | if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram already exists in dictionary 107 | bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] = bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] + 1 #Increment dictionary value by 1 108 | else: 109 | bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] = 1 #Add bigram to list and set value to 1 110 | 111 | pprint(bigram_dict) #Print bigram list 112 | with open('data/database.json', 'w') as f: 113 | json.dump(bigram_dict, f) 114 | 115 | process_data(bigram_dict, total_bigrams) #Call process_data 116 | 117 | def process_data(bigram_dict, total_bigrams): 118 | 119 | if os.path.isfile('data/alexa_top_1m_domain.json'): 120 | with open('data/alexa_top_1m_domain.json', 'r') as f: 121 | data = json.load(f) 122 | 123 | percentage_list_alexa = [] #Define average_percentage 124 | 125 | 126 | for input_domain in xrange(len(data)): #Run through each input_domain in the data 127 | input_domain = tldextract.extract(data[input_domain][1]) 128 | if len(input_domain.domain) > 5 and "-" not in input_domain.domain: 129 | percentage = [] #Clear percentage list 130 | for bigram_position in 
xrange(len(input_domain.domain) - 1): #Run through each bigram in the data 131 | if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 132 | percentage.append((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams) * 100) #Get bigram dictionary value and convert to percantage 133 | else: 134 | percentage.append(0) #Bigram value is 0 as it doesn't exist 135 | 136 | percentage_list_alexa.append(scipy.mean(percentage)) #Add percentage value to list for total average 137 | print input_domain.domain, "AP:", scipy.mean(percentage) #Print input_domain and percentage list 138 | 139 | 140 | data = open('data/dga_training.txt').read().splitlines() 141 | percentage_list_dga = [] #Define average_percentage 142 | 143 | for input_domain in xrange(len(data)): #Run through each input_domain in the data 144 | input_domain = tldextract.extract(data[input_domain]) 145 | if len(input_domain.domain) > 5 and "-" not in input_domain.domain: 146 | percentage = [] #Clear percentage list 147 | for bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in the data 148 | if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 149 | percentage.append((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams) * 100) #Get bigram dictionary value and convert to percantage 150 | else: 151 | percentage.append(0) #Bigram value is 0 as it doesn't exist 152 | 153 | percentage_list_dga.append(scipy.mean(percentage)) #Add percentage value to list for total average 154 | print input_domain.domain, "AP:", scipy.mean(percentage) #Print input_domain and percentage list 155 | 156 | print 67 * "*" 157 | print "Total Average Percentage Alexa:", scipy.mean(percentage_list_alexa), "( Min:", min(percentage_list_alexa), "Max:", max(percentage_list_alexa), ")" #Get average percentage 158 | print "Total Average Percentage 
DGA:", scipy.mean(percentage_list_dga), "( Min:", min(percentage_list_dga), "Max:", max(percentage_list_dga), ")" #Get average percentage 159 | print "Baseline:", (((scipy.mean(percentage_list_alexa) - scipy.mean(percentage_list_dga)) / 2) + scipy.mean(percentage_list_dga)) 160 | print 67 * "*" 161 | 162 | cfgfile = open("data/settings.conf",'w') 163 | Config.set('Percentages','percentage_list_alexa_settings', scipy.mean(percentage_list_alexa)) 164 | Config.set('Percentages','percentage_list_dga_settings', scipy.mean(percentage_list_dga)) 165 | Config.set('Percentages','baseline', (((scipy.mean(percentage_list_alexa) - scipy.mean(percentage_list_dga)) / 2) + scipy.mean(percentage_list_dga))) 166 | Config.set('Values','total_bigrams_settings', total_bigrams) 167 | Config.write(cfgfile) 168 | cfgfile.close() 169 | 170 | percentage = [] #Define percentage 171 | 172 | 173 | def testing(): 174 | 175 | baseline, total_bigrams_settings = load_settings() 176 | 177 | if os.path.isfile('data/database.json'): 178 | with open('data/database.json', 'r') as f: 179 | try: 180 | bigram_dict = json.load(f) 181 | # if the file is empty the ValueError will be thrown 182 | except ValueError: 183 | bigram_dict = {} 184 | 185 | 186 | data = open('data/test_domains.txt').read().splitlines() 187 | 188 | 189 | flag = 0 190 | total_flags = 0 191 | percentage = [] #Define percentage 192 | 193 | for input_domain in xrange(len(data)): #Run through each input_domain in the data 194 | input_domain = tldextract.extract(data[input_domain]) 195 | if len(input_domain.domain) > 5 and "-" not in input_domain.domain: 196 | for bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in the data 197 | if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 198 | percentage.append((round(((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams_settings) * 100), 2))) #Get bigram dictionary value 
and convert to percentage 199 | else: 200 | percentage.append(0) #Bigram value is 0 as it doesn't exist 201 | 202 | 203 | total_flags = total_flags + 1 204 | 205 | if baseline >= scipy.mean(percentage): 206 | flag = flag + 1 207 | print input_domain.domain, percentage,"AP:", scipy.mean(percentage) 208 | else: 209 | print input_domain.domain, percentage, "AP:", scipy.mean(percentage) 210 | 211 | 212 | percentage = [] #Clear percentage list 213 | 214 | print 67 * "*" 215 | print "Detection Rate:", flag / total_flags * 100 216 | print 67 * "*" 217 | 218 | def check_domain(input_domain): 219 | 220 | baseline, total_bigrams_settings = load_settings() 221 | 222 | if os.path.isfile('data/database.json'): 223 | with open('data/database.json', 'r') as f: 224 | try: 225 | bigram_dict = json.load(f) 226 | # if the file is empty the ValueError will be thrown 227 | except ValueError: 228 | bigram_dict = {} 229 | 230 | percentage = [] 231 | 232 | for bigram_position in xrange(len(input_domain) - 1): #Run through each bigram in the data 233 | if input_domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 234 | percentage.append((bigram_dict[input_domain[bigram_position:bigram_position + 2]] / total_bigrams_settings) * 100) #Get bigram dictionary value and convert to percentage 235 | else: 236 | percentage.append(0) #Bigram value is 0 as it doesn't exist 237 | 238 | if baseline >= scipy.mean(percentage): 239 | print 67 * "*" 240 | print 'Baseline:', baseline, 'Domain Average Bigram Percentage:',scipy.mean(percentage) 241 | return 1 242 | else: 243 | return 0 244 | 245 | percentage = [] #Clear percentage list 246 | 247 | def capture_traffic(pkt): 248 | 249 | global previous_domain 250 | global baseline 251 | global total_bigrams_settings 252 | global previous_domain 253 | global whitelist 254 | 255 | if IP in pkt: 256 | ip_src = pkt[IP].src 257 | ip_dst = pkt[IP].dst 258 | if pkt.haslayer(DNS) and pkt.getlayer(DNS).qr == 0: 259 | input_domain
= tldextract.extract(pkt.getlayer(DNS).qd.qname) 260 | if input_domain.suffix != '' and input_domain.suffix != 'localdomain' and input_domain.subdomain == '' and len(input_domain.domain) > 5 and "-" not in input_domain.domain and previous_domain != input_domain.domain: #Domains are no smaller than 6 261 | previous_domain = input_domain.domain 262 | if ("%s.%s" % (input_domain.domain, input_domain.suffix)) not in whitelist.values() and check_domain(input_domain.domain) == 1: 263 | print 'Extracted Domain:', input_domain.domain 264 | print str(ip_src) + "->", str(ip_dst), "Warning! Potential DGA Detected ", "(", (pkt.getlayer(DNS).qd.qname), ")" 265 | print 67 * "*" 266 | print '\n' 267 | if pushbullet_key != '': 268 | alert_message = str((str(ip_src) + "->", str(ip_dst), "Warning! Potential DGA Detected ", "(", (pkt.getlayer(DNS).qd.qname), ")")) 269 | send_note(alert_message) 270 | 271 | #else: 272 | #print "Safe input_domain", "(" + input_domain + ")" 273 | 274 | 275 | 276 | parser = argparse.ArgumentParser() 277 | parser.add_argument('-o', '--option', required=False) 278 | parser.add_argument('-i', '--interface', required=False) 279 | args = parser.parse_args() 280 | 281 | if args.option == '2': 282 | if os.path.isfile('data/settings.conf'): 283 | print 'Please wait while the whitelist is read...' 284 | with open('data/alexa_top_1m_domain.json', 'r') as f: 285 | whitelist = json.load(f) 286 | whitelist = dict((k) for k in whitelist) 287 | ################################### 288 | baseline, total_bigrams_settings = load_settings() 289 | print 'Capturing DNS Requests...' 290 | sniff(iface = args.interface, filter = "port 53", prn = capture_traffic, store = 0) 291 | #Using Alexa as a white list (Potentially not the best method in case malware domains make it in the list) More filtering needs to be done. 292 | #This is in beta and might want to be modified or removed. 293 | else: 294 | print 67 * '#' 295 | print 'You must run the training algorithm first.' 
296 | print 67 * '#' 297 | exit() 298 | 299 | 300 | ans=True 301 | while ans: 302 | print 30 * "-" , "MENU" , 30 * "-" 303 | print (""" 304 | 1. Train Data 305 | 2. Start Capturing DNS 306 | 3. Testing 307 | 4. View Config File 308 | 5. Delete script data 309 | 6. Exit/Quit 310 | """) 311 | print 67 * "-" 312 | ans=raw_input("Select an option to proceed: ") 313 | if ans=="1": 314 | load_data() 315 | elif ans=="2": 316 | if os.path.isfile('data/settings.conf'): 317 | print 'Please wait while the whitelist is read...' 318 | with open('data/alexa_top_1m_domain.json', 'r') as f: 319 | whitelist = json.load(f) 320 | whitelist = dict((k) for k in whitelist) 321 | ################################### 322 | baseline, total_bigrams_settings = load_settings() 323 | try: 324 | interface = raw_input("[*] Enter Desired Interface: ") 325 | except KeyboardInterrupt: 326 | print "[*] User Requested Shutdown..." 327 | print "[*] Exiting..." 328 | sys.exit(1) 329 | sniff(iface = interface,filter = "port 53", prn = capture_traffic, store = 0) 330 | #Using Alexa as a white list (Potentially not the best method in case malware domains make it in the list) More filtering needs to be done. 331 | #This is in beta and might want to be modified or removed. 332 | else: 333 | print 67 * '#' 334 | print 'You must run the training algorithm first.' 335 | print 67 * '#' 336 | exit() 337 | elif ans=="3": 338 | if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'): 339 | testing() 340 | else: 341 | print "\nYou must run the training algorithm first." 342 | elif ans=="4": 343 | if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'): 344 | baseline, total_bigrams_settings = load_settings() 345 | print 67 * "*" 346 | print "Total Bigrams:", total_bigrams_settings 347 | print "Baseline (Baseline):", baseline 348 | print 67 * "*" 349 | else: 350 | print "\n No data files available." 
351 | elif ans=="5": 352 | if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'): 353 | os.remove('data/settings.conf') 354 | os.remove('data/database.json') 355 | print "\nData has been deleted" 356 | else: 357 | print "\nNo data to delete." 358 | elif ans=="6": 359 | print("\nExiting") 360 | quit() 361 | elif ans !="": 362 | print("\n Not Valid Choice Try again") 363 | 364 | 365 | 366 | 367 | 368 | 369 | -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | sudo apt-get install python-scipy -y 2 | sudo pip install -r requirements.txt 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | scapy 2 | tldextract 3 | -e git://github.com/philarkwright/Alexa-Top-Sites.git#egg=alexa-top-sites 4 | pushbullet.py --------------------------------------------------------------------------------