├── LICENCE
├── README.md
├── data
    ├── dga_training.txt
    └── test_domains.txt
├── dga_detection.py
├── install.sh
└── requirements.txt


/LICENCE:
--------------------------------------------------------------------------------
 1 | Software License Agreement (BSD License)
 2 | Copyright (c) 2017 Phil Arkwright
 3 | All rights reserved.
 4 | 
 5 | Redistribution and use in source and binary forms, with or without
 6 | modification, are permitted provided that the following conditions
 7 | are met:
 8 | 1. Redistributions of source code must retain the above copyright
 9 |    notice, this list of conditions and the following disclaimer.
10 | 2. Redistributions in binary form must reproduce the above copyright
11 |    notice, this list of conditions and the following disclaimer in the
12 |    documentation and/or other materials provided with the distribution.
13 | 3. The name of the author may not be used to endorse or promote products
14 |    derived from this software without specific prior written permission.
15 | 
16 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
17 | IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
18 | OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
19 | IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
20 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
21 | NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
22 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
23 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
24 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
25 | THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
26 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # DGA-Detection
 2 | Malware increasingly uses techniques to circumvent blocking. One of the most prevalent is the Domain Generation Algorithm (DGA), which periodically generates a set of domains that the malware uses to contact a C&C server. Most DGAs produce random alphanumeric strings whose structure differs significantly from that of a standard domain. By looking at how frequently each bigram in a domain occurs within the Alexa top 1M, we were able to detect whether a domain was a random string or a legitimate human-readable domain. If a domain consists almost entirely of low-frequency bigrams that rarely occur within the Alexa top 1M, it is more likely to be a random string. Bigrams made up of a vowel and a consonant occurred most frequently, whereas bigrams made up of a letter and a digit occurred least frequently. The script was run against 100,000 GameoverZeus domains and had a detection rate of 100%, with a false positive rate of 8% against the Alexa top 1M without any domain whitelisting applied.
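
The bigram scoring described above can be sketched as follows. This is a minimal illustration rather than the script's exact code: it is written for Python 3 (the script itself targets Python 2), the helper names are hypothetical, and the five-domain corpus is a toy stand-in for the Alexa top 1M.

```python
from collections import Counter


def bigrams(label):
    """Return the overlapping two-character substrings of a domain label."""
    return [label[i:i + 2] for i in range(len(label) - 1)]


def build_bigram_counts(labels):
    """Count every bigram across a corpus of known-good domain labels."""
    counts = Counter()
    for label in labels:
        counts.update(bigrams(label))
    return counts


def average_bigram_percentage(label, counts, total):
    """Mean corpus frequency (in percent) of the bigrams found in `label`."""
    grams = bigrams(label)
    if not grams or total == 0:
        return 0.0
    # Bigrams never seen in the corpus contribute 0 to the average.
    return sum(counts.get(g, 0) / total * 100 for g in grams) / len(grams)


# Toy corpus (assumption): a stand-in for the Alexa top 1M labels.
corpus = ["google", "youtube", "facebook", "wikipedia", "amazon"]
counts = build_bigram_counts(corpus)
total = sum(counts.values())

readable = average_bigram_percentage("booking", counts, total)
random_like = average_bigram_percentage("xq7zk2vw", counts, total)
print("booking:", readable, "xq7zk2vw:", random_like)
```

A human-readable label shares bigrams with the corpus and scores above zero, while the random-looking label shares none and scores zero. With a corpus this small the numbers are only illustrative.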
 3 | 
 4 | 
 5 | This system has been tested on Ubuntu and Raspberry Pi.
 6 | I currently have my Raspberry Pi set up as a DNS server using Bind9.
 7 | The DGA-Detection script also runs on the Raspberry Pi and reads the DNS requests.
 8 | The requests are then processed to determine whether they are potential DGA domains.
 9 | 
10 | ## Install
11 | 
12 | ```bash
13 | git clone https://github.com/philarkwright/DGA-Detection.git  
14 | cd DGA-Detection  
15 | chmod +x install.sh
16 | ./install.sh
17 | ```
18 | 
19 | ## Use
20 | 
21 | ```bash
22 | sudo python dga_detection.py
23 | ```
24 | 
25 | ## Training
26 | - The /data/dga_training.txt file contains DGA domains from the Tinba DGA. I'd suggest using this to train the model, as it follows the structure of the majority of DGA domains; however, you may replace the domains with your own set if you wish.
27 | 
28 | ## Testing
29 | - To test domains against the model after training has completed, create a text file called test_domains.txt and place it in /data/.
30 | - A sample of the Tinba DGA domains has been included in the download.
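
The detection rate reported by the testing option can be sketched roughly as below. This is a simplified Python 3 illustration, not the script's code: `detection_rate` and the toy scoring function are hypothetical names, and the label is taken with a naive split on `.` where the script uses `tldextract`. As in the script, labels shorter than six characters or containing a hyphen are skipped.

```python
def detection_rate(domains, score, baseline, min_length=6):
    """Percentage of eligible domains whose score is at or below the baseline."""
    checked = 0
    flagged = 0
    for domain in domains:
        # Naive label extraction (assumption); the script uses tldextract.
        label = domain.strip().lower().split(".")[0]
        if len(label) < min_length or "-" in label:
            continue  # skipped, as in the script
        checked += 1
        if score(label) <= baseline:
            flagged += 1
    return flagged / checked * 100 if checked else 0.0


def toy_score(label):
    """Stand-in scorer (assumption): pretends one label has no known bigrams."""
    return 0.0 if label == "qwxzvkpt" else 1.0


sample = ["qwxzvkpt.com", "short.com", "my-site.com", "wikipedia.org"]
rate = detection_rate(sample, toy_score, baseline=0.5)
print("Detection Rate:", rate)
```

Only two of the four sample domains are eligible, and one of them is flagged, so the sketch reports 50%.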
31 | 
32 | ## Settings File
33 | - The settings file is where the model stores the baseline value used to decide whether or not a domain is a potential DGA. This value can be raised manually to increase the detection rate or lowered to decrease false positives.
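
As a rough sketch of how the baseline works: training places it halfway between the mean score of the Alexa domains and the mean score of the DGA training domains, and a domain is flagged when its score is at or below that value. The Python 3 example below uses made-up numbers and hypothetical helper names. The section and key names match what the script writes to `data/settings.conf`, but the sample values are assumptions.

```python
import configparser


def midpoint_baseline(alexa_mean, dga_mean):
    """Baseline halfway between the Alexa and DGA mean bigram percentages."""
    return (alexa_mean - dga_mean) / 2 + dga_mean


def is_potential_dga(score, baseline):
    """A domain is flagged when its score does not exceed the baseline."""
    return score <= baseline


# Illustrative means (assumption), not measured values.
baseline = midpoint_baseline(alexa_mean=0.75, dga_mean=0.25)

# Sample settings text (assumption) in the layout the script writes.
SAMPLE_SETTINGS = """
[Percentages]
baseline = 0.5

[Values]
total_bigrams_settings = 1000
"""

settings = configparser.ConfigParser()
settings.read_string(SAMPLE_SETTINGS)
stored_baseline = settings.getfloat("Percentages", "baseline")

print(baseline, stored_baseline, is_potential_dga(0.1, stored_baseline))
```

Raising the stored baseline flags more domains (more detections and more false positives), and lowering it flags fewer.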
34 | 
35 | ## Live Capture Arguments
36 | 
37 | ```bash
38 | nohup sudo python dga_detection.py -o 2 -i <interface>
39 | ```
40 | 
41 | ## Potential Issues
42 | When running the install.sh file please note that the git:// protocol uses port 9418, so you should make sure your firewall allows outbound connections to this port.
43 | 
44 | This project is still very much in development.
45 | 
46 | While running the script at network level on my home network, I noticed that certain Google searches in Google Chrome produced a bunch of alerts for what appeared to be DGA domains. Even if you don't visit these sites, which are normally Chinese (since they use gibberish strings for their domains), Google Chrome will preload and fetch them, causing the alerts.
47 | 
48 | NOTE: The whitelist feature uses the Alexa Top 1M.
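
One possible way to do the whitelist lookup is sketched below in Python 3. The script keeps the Alexa entries in a dictionary and checks its values for every DNS request. A set gives constant-time membership tests instead. The helper names are hypothetical, and the two-entry list is a toy stand-in for the cached Alexa data, which the script treats as a list of entries with the domain at index 1.

```python
def build_whitelist(ranked_domains):
    """Collect the domain (index 1 of each entry) into a set for fast lookups."""
    return {entry[1].lower() for entry in ranked_domains}


def is_whitelisted(registered_domain, whitelist):
    """Check a 'domain.suffix' string against the whitelist set."""
    return registered_domain.lower() in whitelist


# Toy data (assumption) in the same [rank, domain] shape.
ranked = [[1, "google.com"], [2, "youtube.com"]]
whitelist = build_whitelist(ranked)

print(is_whitelisted("Google.com", whitelist))
print(is_whitelisted("xq7zk2vw.com", whitelist))
```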
49 | 
50 | Contact me via Twitter @philarkwright
51 | 
52 | ## Completed
53 | 
54 | - [ ] Add Alexa Whitelisting
55 | - [ ] Add Pushbullet (Notify admin)
56 | - [ ] Fix Lag on capture traffic alerts
57 | - [ ] Add arguments so live capture can be run with nohup
58 | 
59 | 


--------------------------------------------------------------------------------
/dga_detection.py:
--------------------------------------------------------------------------------
  1 | #!/usr/bin/env python
  2 | #Software License Agreement (BSD License)
  3 | #Copyright (c) 2017 Phil Arkwright
  4 | #All rights reserved.
  5 | 
  6 | from __future__ import division
  7 | from pprint import pprint
  8 | from scapy.all import *
  9 | import scipy
 10 | 
 11 | import ConfigParser
 12 | import os.path
 13 | import json
 14 | import tldextract #Separating subdomain from input_domain in capture
 15 | import alexa
 16 | 
 17 | from pushbullet import PushBullet
 18 | 
 19 | import argparse
 20 | import sys #Used by sys.exit() in the menu
 21 | pushbullet_key = ''
 22 | if pushbullet_key != '':
 23 | 	#Configure Pushbullet
 24 | 	p = PushBullet(pushbullet_key)
 25 | 
 26 | 	def send_note(note):
 27 | 		push = p.push_note('%s' % (note), '')
 28 | 
 29 | def hasNumbers(inputString):
 30 | 	return any(char.isdigit() for char in inputString)
 31 | 
 32 | def ConfigSectionMap(section):
 33 |     dict1 = {}
 34 |     options = Config.options(section)
 35 |     for option in options:
 36 |         try:
 37 |             dict1[option] = Config.get(section, option)
 38 |             if dict1[option] == -1:
 39 |                 print("skip: %s" % option)
 40 |         except:
 41 |             print("exception on %s!" % option)
 42 |             dict1[option] = None
 43 |     return dict1
 44 | 
 45 | Config = ConfigParser.ConfigParser()
 46 | previous_domain = ''
 47 | whitelist = {}
 48 | 
 49 | def load_settings():
 50 | 
 51 | 	if os.path.isfile('data/settings.conf'):
 52 | 		Config.read("data/settings.conf")
 53 | 		percentage_list_dga_settings = float(ConfigSectionMap("Percentages")['percentage_list_dga_settings'])
 54 | 		percentage_list_alexa_settings = float(ConfigSectionMap("Percentages")['percentage_list_alexa_settings'])
 55 | 		baseline = float(ConfigSectionMap("Percentages")['baseline'])
 56 | 		total_bigrams_settings = float(ConfigSectionMap("Values")['total_bigrams_settings'])
 57 | 		return baseline, total_bigrams_settings
 58 | 	else:
 59 | 		print "No settings file. Please run training function."
 60 | 
 61 | def load_data():
 62 | 
 63 | 	if os.path.isfile('data/database.json') and os.path.isfile('data/settings.conf'):
 64 | 
 65 | 		baseline, total_bigrams_settings = load_settings()
 66 | 
 67 | 		with open('data/database.json', 'r') as f:
 68 | 		    try:
 69 | 		        bigram_dict = json.load(f)
 70 | 		        process_data(bigram_dict, total_bigrams_settings) #Call process_data
 71 | 		    # if the file is empty the ValueError will be thrown
 72 | 		    except ValueError:
 73 | 		        bigram_dict = {}
 74 | 	else:
 75 | 
 76 | 		try:
 77 | 			cfgfile = open("data/settings.conf",'w')
 78 | 			Config.add_section('Percentages')
 79 | 			Config.add_section('Values')
 80 | 			Config.set('Percentages','baseline', 0)
 81 | 			Config.write(cfgfile)
 82 | 			cfgfile.close()
 83 | 		except:
 84 | 			print "Settings file error. Please Delete."
 85 | 			exit()
 86 | 
 87 | 		
 88 | 		if os.path.isfile('data/alexa_top_1m_domain.json'):
 89 | 			with open('data/alexa_top_1m_domain.json', 'r') as f:
 90 | 				training_data = json.load(f)
 91 | 		else:
 92 | 			print "Downloading Alexa Top 1m Domains..."
 93 | 			training_data = alexa.top_list(1000000)
 94 | 			with open('data/alexa_top_1m_domain.json', 'w') as f:
 95 | 				json.dump(training_data, f)
 96 | 
 97 | 
 98 | 		bigram_dict = {} #Define bigram_dict
 99 | 		total_bigrams = 0 #Set initial total to 0
100 | 		for input_domain in xrange(len(training_data)): #Run through each input_domain in the training list
101 | 			input_domain = tldextract.extract(training_data[input_domain][1])
102 | 			if len(input_domain.domain) > 5 and "-" not in input_domain.domain:
103 | 				print "Processing domain:", input_domain.domain #Print input_domain number in list
104 | 				for  bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in input_domain
105 | 					total_bigrams = total_bigrams + 1 #Increment bigram total
106 | 					if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram already exists in dictionary
107 | 						bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] = bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] + 1 #Increment dictionary value by 1
108 | 					else:
109 | 						bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] = 1 #Add bigram to list and set value to 1
110 | 
111 | 		pprint(bigram_dict) #Print bigram list
112 | 		with open('data/database.json', 'w') as f:
113 | 			json.dump(bigram_dict, f)
114 | 
115 | 		process_data(bigram_dict, total_bigrams) #Call process_data
116 | 
117 | def process_data(bigram_dict, total_bigrams):
118 | 
119 | 	if os.path.isfile('data/alexa_top_1m_domain.json'):
120 | 		with open('data/alexa_top_1m_domain.json', 'r') as f:
121 | 			data = json.load(f)
122 | 
123 | 	percentage_list_alexa = [] #Define average_percentage
124 | 
125 | 
126 | 	for input_domain in xrange(len(data)): #Run through each input_domain in the data
127 | 		input_domain = tldextract.extract(data[input_domain][1])
128 | 		if len(input_domain.domain) > 5 and "-" not in input_domain.domain:
129 | 			percentage = [] #Clear percentage list
130 | 			for  bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in the data
131 | 				if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 
132 | 					percentage.append((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams) * 100) #Get bigram dictionary value and convert to percentage
133 | 				else:
134 | 					percentage.append(0) #Bigram value is 0 as it doesn't exist
135 | 
136 | 			percentage_list_alexa.append(scipy.mean(percentage)) #Add percentage value to list for total average
137 | 			print input_domain.domain, "AP:", scipy.mean(percentage) #Print input_domain and percentage list
138 | 
139 | 
140 | 	data = open('data/dga_training.txt').read().splitlines()
141 | 	percentage_list_dga = [] #Define average_percentage
142 | 
143 | 	for input_domain in xrange(len(data)): #Run through each input_domain in the data
144 | 		input_domain = tldextract.extract(data[input_domain])
145 | 		if len(input_domain.domain) > 5 and "-" not in input_domain.domain:
146 | 			percentage = [] #Clear percentage list
147 | 			for  bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in the data
148 | 				if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 
149 | 					percentage.append((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams) * 100) #Get bigram dictionary value and convert to percentage
150 | 				else:
151 | 					percentage.append(0) #Bigram value is 0 as it doesn't exist
152 | 
153 | 			percentage_list_dga.append(scipy.mean(percentage)) #Add percentage value to list for total average
154 | 			print input_domain.domain, "AP:", scipy.mean(percentage) #Print input_domain and percentage list
155 | 
156 | 	print 67 * "*"
157 | 	print "Total Average Percentage Alexa:", scipy.mean(percentage_list_alexa), "( Min:", min(percentage_list_alexa), "Max:", max(percentage_list_alexa), ")" #Get average percentage
158 | 	print "Total Average Percentage DGA:", scipy.mean(percentage_list_dga), "( Min:", min(percentage_list_dga), "Max:", max(percentage_list_dga), ")" #Get average percentage
159 | 	print "Baseline:", (((scipy.mean(percentage_list_alexa) - scipy.mean(percentage_list_dga)) / 2) + scipy.mean(percentage_list_dga))
160 | 	print 67 * "*"
161 | 
162 | 	cfgfile = open("data/settings.conf",'w')
163 | 	Config.set('Percentages','percentage_list_alexa_settings', scipy.mean(percentage_list_alexa))
164 | 	Config.set('Percentages','percentage_list_dga_settings', scipy.mean(percentage_list_dga))
165 | 	Config.set('Percentages','baseline', (((scipy.mean(percentage_list_alexa) - scipy.mean(percentage_list_dga)) / 2) + scipy.mean(percentage_list_dga)))
166 | 	Config.set('Values','total_bigrams_settings', total_bigrams)
167 | 	Config.write(cfgfile)
168 | 	cfgfile.close()
169 | 
170 | 	percentage = [] #Define percentage
171 | 
172 | 
173 | def testing():
174 | 
175 | 	baseline, total_bigrams_settings = load_settings()
176 | 
177 | 	if os.path.isfile('data/database.json'):
178 | 		with open('data/database.json', 'r') as f:
179 | 		    try:
180 | 		        bigram_dict = json.load(f)
181 | 		    # if the file is empty the ValueError will be thrown
182 | 		    except ValueError:
183 | 		        bigram_dict = {}
184 | 
185 | 
186 | 	data = open('data/test_domains.txt').read().splitlines()
187 | 
188 | 
189 | 	flag = 0
190 | 	total_flags = 0
191 | 	percentage = [] #Define percentage
192 | 
193 | 	for input_domain in xrange(len(data)): #Run through each input_domain in the data
194 | 		input_domain = tldextract.extract(data[input_domain])
195 | 		if len(input_domain.domain) > 5 and "-" not in input_domain.domain:
196 | 			for  bigram_position in xrange(len(input_domain.domain) - 1): #Run through each bigram in the data
197 | 				if input_domain.domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary
198 | 					percentage.append((round(((bigram_dict[input_domain.domain[bigram_position:bigram_position + 2]] / total_bigrams_settings) * 100), 2))) #Get bigram dictionary value and convert to percentage
199 | 				else:
200 | 					percentage.append(0) #Bigram value is 0 as it doesn't exist
201 | 			
202 | 
203 | 			total_flags = total_flags + 1
204 | 
205 | 			if baseline >= scipy.mean(percentage):
206 | 				flag = flag + 1
207 | 				print input_domain.domain, percentage,"AP:", scipy.mean(percentage)
208 | 			else:
209 | 				print input_domain.domain, percentage, "AP:", scipy.mean(percentage)
210 | 
211 | 
212 | 			percentage = [] #Clear percentage list
213 | 
214 | 	print 67 * "*"
215 | 	print "Detection Rate:", flag / total_flags * 100
216 | 	print 67 * "*"
217 | 
218 | def check_domain(input_domain):
219 | 
220 | 	baseline, total_bigrams_settings = load_settings()
221 | 
222 | 	if os.path.isfile('data/database.json'):
223 | 		with open('data/database.json', 'r') as f:
224 | 		    try:
225 | 		        bigram_dict = json.load(f)
226 | 		    # if the file is empty the ValueError will be thrown
227 | 		    except ValueError:
228 | 		        bigram_dict = {}
229 | 	
230 | 	percentage = []
231 | 
232 | 	for  bigram_position in xrange(len(input_domain) - 1): #Run through each bigram in the data
233 | 		if input_domain[bigram_position:bigram_position + 2] in bigram_dict: #Check if bigram is in dictionary 
234 | 			percentage.append((bigram_dict[input_domain[bigram_position:bigram_position + 2]] / total_bigrams_settings) * 100) #Get bigram dictionary value and convert to percentage
235 | 		else:
236 | 			percentage.append(0) #Bigram value is 0 as it doesn't exist
237 | 
238 | 	if baseline >= scipy.mean(percentage):
239 | 		print 67 * "*"
240 | 		print 'Baseline:', baseline, 'Domain Average Bigram Percentage:',scipy.mean(percentage)
241 | 		return 1
242 | 	else:
243 | 		return 0
244 | 
245 | 	percentage = [] #Clear percentage list
246 | 
247 | def capture_traffic(pkt):
248 | 
249 | 	global previous_domain
250 | 	global baseline
251 | 	global total_bigrams_settings
252 | 	global previous_domain
253 | 	global whitelist
254 | 
255 | 	if IP in pkt:
256 | 		ip_src = pkt[IP].src
257 | 		ip_dst = pkt[IP].dst
258 | 		if pkt.haslayer(DNS) and pkt.getlayer(DNS).qr == 0:
259 | 			input_domain = tldextract.extract(pkt.getlayer(DNS).qd.qname)
260 | 			if input_domain.suffix != '' and input_domain.suffix != 'localdomain' and input_domain.subdomain == '' and len(input_domain.domain) > 5 and "-" not in input_domain.domain and previous_domain != input_domain.domain: #Domains are no smaller than 6
261 | 				previous_domain = input_domain.domain
262 | 				if ("%s.%s" % (input_domain.domain, input_domain.suffix)) not in whitelist.values() and check_domain(input_domain.domain) == 1:
263 | 					print 'Extracted Domain:', input_domain.domain
264 | 					print str(ip_src) +  "->",  str(ip_dst), "Warning! Potential DGA Detected ", "(", (pkt.getlayer(DNS).qd.qname), ")"
265 | 					print 67 * "*"
266 | 					print '\n'
267 | 					if pushbullet_key != '':
268 | 						alert_message = str((str(ip_src) +  "->",  str(ip_dst), "Warning! Potential DGA Detected ", "(", (pkt.getlayer(DNS).qd.qname), ")"))
269 | 						send_note(alert_message)
270 | 
271 | 				#else:
272 | 					#print "Safe input_domain", "(" + input_domain + ")"
273 | 
274 | 
275 | 
276 | parser = argparse.ArgumentParser()
277 | parser.add_argument('-o', '--option', required=False)
278 | parser.add_argument('-i', '--interface', required=False)
279 | args = parser.parse_args()
280 | 
281 | if args.option == '2':
282 | 	if os.path.isfile('data/settings.conf'):
283 | 		print 'Please wait while the whitelist is read...'
284 | 		with open('data/alexa_top_1m_domain.json', 'r') as f:
285 | 			whitelist = json.load(f)
286 | 		whitelist = dict((k) for k in whitelist)
287 | 		###################################
288 | 		baseline, total_bigrams_settings = load_settings()
289 | 		print 'Capturing DNS Requests...'
290 | 		sniff(iface = args.interface, filter = "port 53", prn = capture_traffic, store = 0)
291 | 		#Using Alexa as a whitelist (potentially not the best method, in case malware domains make it into the list). More filtering needs to be done.
292 | 		#This is in beta and might want to be modified or removed.
293 | 	else:
294 | 		print 67 * '#'
295 | 		print 'You must run the training algorithm first.'
296 | 		print 67 * '#'
297 | 		exit()
298 | 
299 | 
300 | ans=True
301 | while ans:
302 | 	print 30 * "-" , "MENU" , 30 * "-"
303 | 	print ("""
304 | 	1. Train Data
305 | 	2. Start Capturing DNS
306 | 	3. Testing
307 | 	4. View Config File
308 | 	5. Delete script data
309 | 	6. Exit/Quit
310 | 	""")
311 | 	print 67 * "-"
312 | 	ans=raw_input("Select an option to proceed: ") 
313 | 	if ans=="1": 
314 | 		load_data()
315 | 	elif ans=="2":
316 | 		if os.path.isfile('data/settings.conf'):
317 | 			print 'Please wait while the whitelist is read...'
318 | 			with open('data/alexa_top_1m_domain.json', 'r') as f:
319 | 				whitelist = json.load(f)
320 | 			whitelist = dict((k) for k in whitelist)
321 | 			###################################
322 | 			baseline, total_bigrams_settings = load_settings()
323 | 			try:
324 | 				interface = raw_input("[*] Enter Desired Interface: ")
325 | 			except KeyboardInterrupt:
326 | 				print "[*] User Requested Shutdown..."
327 | 				print "[*] Exiting..."
328 | 				sys.exit(1)
329 | 			sniff(iface = interface,filter = "port 53", prn = capture_traffic, store = 0)
330 | 			#Using Alexa as a whitelist (potentially not the best method, in case malware domains make it into the list). More filtering needs to be done.
331 | 			#This is in beta and might want to be modified or removed.
332 | 		else:
333 | 			print 67 * '#'
334 | 			print 'You must run the training algorithm first.'
335 | 			print 67 * '#'
336 | 			exit()
337 | 	elif ans=="3":
338 | 		if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'):
339 | 			testing()
340 | 		else:
341 | 			print "\nYou must run the training algorithm first."
342 | 	elif ans=="4":
343 | 		if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'):
344 | 			baseline, total_bigrams_settings = load_settings()
345 | 			print 67 * "*"
346 | 			print "Total Bigrams:", total_bigrams_settings
347 | 			print "Baseline (Baseline):", baseline
348 | 			print 67 * "*"
349 | 		else:
350 | 			print "\n No data files available."
351 | 	elif ans=="5":
352 | 		if os.path.isfile('data/settings.conf') and os.path.isfile('data/database.json'):
353 | 		  os.remove('data/settings.conf')
354 | 		  os.remove('data/database.json')
355 | 		  print "\nData has been deleted"
356 | 		else:
357 | 			print "\nNo data to delete."
358 | 	elif ans=="6":
359 | 	  print("\nExiting") 
360 | 	  quit()
361 | 	elif ans !="":
362 | 	  print("\n Not Valid Choice Try again") 
363 | 
364 | 
365 | 
366 | 
367 | 
368 | 
369 | 


--------------------------------------------------------------------------------
/install.sh:
--------------------------------------------------------------------------------
1 | sudo apt-get install python-scipy -y
2 | sudo pip install -r requirements.txt
3 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | scapy
2 | tldextract
3 | -e git://github.com/philarkwright/Alexa-Top-Sites.git#egg=alexa-top-sites
4 | pushbullet.py


--------------------------------------------------------------------------------