├── README.md ├── entropy.go ├── py │   ├── cryptostalker.py │   └── randumb.py └── util.go /README.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | randumb naively estimates an input's level of entropy by running some tests on it. Scores range from 0 to 1, where 1 is as random as randumb can guess. 4 | 5 | The types of analysis currently supported are frequency and skewness. The first of [Kendall and Smith's randomness tests](https://en.wikipedia.org/wiki/Statistical_randomness) is the frequency test. It tries to answer the question of how uniform the distribution of byte values is in a given input. 6 | 7 | The skewness analysis is based upon [Pearson's second coefficient](http://mathworld.wolfram.com/PearsonsSkewnessCoefficients.html). It makes a binary guess of randomness based upon the asymmetry of the input's distribution. 8 | 9 | # Description 10 | Tests: 11 | * Frequency - iterate over the input in chunks and compute the ratio "unique-byte-values / chunk-length". A chunk of `n` bytes is uniformly distributed (and scores 1 in the frequency test) if every byte value in it is distinct. 12 | * Example: if a 26-byte chunk contained the value "abcdefghijklmnopqrstuvwxyz", it'd be random as far as the frequency test is concerned, since each value is present only once. Of course there is a pattern there, but that's for another test to decide. 13 | * Skewness - iterate over the input and build a histogram of it, where each bucket counts a bit tuple (8-bit tuples by default). The amount of variance across the distribution determines the input's randomness. I currently use an anecdotal constant as a threshold for determining randomness. The basis for this equation is [Pearson's second coefficient](http://mathworld.wolfram.com/PearsonsSkewnessCoefficients.html).
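The frequency ratio described above can be sketched in a few lines of Python. This is a hypothetical standalone helper (`chunk_frequency` is not part of randumb's API) and assumes Python 3 `bytes` input:

```python
def chunk_frequency(chunk: bytes) -> float:
    """Return unique-byte-values / chunk-length for one chunk (0.0 if empty)."""
    if not chunk:
        return 0.0
    return len(set(chunk)) / len(chunk)

# Every byte value distinct: a perfect 1.0, even though the content is patterned.
print(chunk_frequency(b"abcdefghijklmnopqrstuvwxyz"))  # 1.0

# Heavy repetition drags the ratio down: 1 unique value over 26 bytes.
print(chunk_frequency(b"a" * 26))  # 1/26, about 0.0385
```

Averaging this ratio over all fixed-size chunks of an input gives per-input figures like the Frequency(avg) numbers shown in the examples further down.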
14 | 15 | # Simple example 16 | I use a naïve combination of the randomness tests described above to arrive at a guess of randomness. Given an input, randumb.py exits with a binary status: 0 for random, 1 for non-random. 17 | 18 | Let's look at an ELF binary in encrypted and unencrypted form. 19 | ```bash 20 | # input - regular OpenSSH binary 21 | vagrant@precise64:~/randumb$ python randumb.py < /usr/bin/ssh; echo $? 22 | 1 23 | vagrant@precise64:~/randumb$ 24 | 25 | # input - encrypted OpenSSH binary 26 | vagrant@precise64:~/randumb$ python randumb.py < /tmp/ssh.enc; echo $? 27 | 0 28 | vagrant@precise64:~/randumb$ 29 | ``` 30 | 31 | # Cryptostalker example 32 | This tool uses the randumb library to monitor a filesystem path and detect newly-written files. If these new files are deemed random and occur at a fast enough (configurable) rate, then it notifies you. 33 | 34 | ## MOVED: cryptostalker.py has [moved to its own repository](https://github.com/unixist/cryptostalker) and been ported to Go, so it works on Linux, OS X, and Windows. 35 | 36 | #### Python version 37 | I implemented this initially using Linux's inotify facility. This allows a file write event to be filtered on IN_CLOSE_WRITE, which fires when the file is finished being written. I'd prefer to use auditd to alert on new file writes, since it can also give the process ID of the writer. That would allow the process to be killed if we have enough confidence that it's probably bad. (Although auditd can place a recursive watch similar to inotify, I don't know whether auditd can alert on a file only *after* all writes are complete and only if it was opened for writing.) 38 | 39 | #### Go version 40 | The file notification mechanism is Google's [fsnotify](https://github.com/fsnotify/fsnotify). Since it doesn't use the Linux-specific [inotify](https://en.wikipedia.org/wiki/Inotify), cryptostalker currently relies on notifications of new files.
So random/encrypted files will only be detected if they belong to new inodes, which means it won't catch the following case: a file is opened, truncated, and only then filled in with encrypted content. Fortunately, this is not how most malware works. 41 | 42 | See the [new repo for the Go version](https://github.com/unixist/cryptostalker). 43 | 44 | #### Misc 45 | * I'd be stoked if someone can show me how to get auditd to behave optimally for this use case! 46 | * cryptostalker may incorrectly identify compressed files as encrypted files. I haven't seen this happen in my testing, but I can only imagine that given a quality compressor and the right type of input data this will yield some false positives. You can always tweak the `RAND_THRESHOLD`s in randumb.py to your liking. 47 | 48 | ```bash 49 | # Running with only the --path parameter defaults to a detection threshold of 10 files per 60 seconds 50 | vagrant@precise64:~/randumb$ python cryptostalker.py --path /home 51 | Seen 10 random files in 30.1896209717 seconds: 52 | /home/vagrant/.foo.8299 53 | /home/vagrant/.foo.5266 54 | /home/vagrant/.foo.10551 55 | /home/vagrant/.foo.8444 56 | /home/vagrant/.foo.20163 57 | /home/vagrant/.foo.820 58 | /home/vagrant/.foo.28854 59 | /home/vagrant/.foo.27284 60 | /home/vagrant/.foo.21306 61 | /home/vagrant/.foo.24437 62 | 63 | # See example usage 64 | vagrant@precise64:~/randumb$ python cryptostalker.py --help 65 | usage: cryptostalker.py [-h] --path PATH [--count COUNT] [--window WINDOW] 66 | [--sleep SLEEP] 67 | 68 | Detect high throughput of newly-created encrypted files. 69 | 70 | optional arguments: 71 | -h, --help show this help message and exit 72 | --path PATH The directory to watch. 73 | --count COUNT The number of random files required to be seen within 74 | the --window period. 75 | --window WINDOW The number of seconds within which random files 76 | must be observed. 77 | --sleep SLEEP The time in seconds to sleep between processing each new 78 | file to determine whether it is random.
79 | vagrant@precise64:~/randumb$ 80 | ``` 81 | # Frequency example (old) 82 | ```bash 83 | 84 | # input - ASCII file 85 | vagrant@precise64:~/randumb$ python randumb.py < /etc/passwd 86 | Frequency(avg): 0.165232 87 | vagrant@precise64:~/randumb$ 88 | 89 | # input - Linux kernel .text 90 | vagrant@precise64:~/randumb$ objcopy -O binary --only-section=.text ~/vmlinux /dev/stdout | python randumb.py 91 | Frequency(avg): 0.326217 92 | vagrant@precise64:~/randumb$ 93 | 94 | # input - /dev/random 95 | vagrant@precise64:~/randumb$ python randumb.py < ~/input.random 96 | Frequency(avg): 0.632422 97 | vagrant@precise64:~/randumb$ 98 | 99 | # input - openssl enc -aes-256-cbc 100 | vagrant@precise64:~/randumb$ python randumb.py < ~/input.enc 101 | Frequency(avg): 0.613281 102 | vagrant@precise64:~/randumb$ 103 | ``` 104 | 105 | The closer the score is to 1, the more likely it is that portions of the input are encrypted. I've found that random/well-encrypted content yields about .6-.7 in my testing. 106 | 107 | # Future 108 | * Support for more randomness tests, and better ways to evaluate the tests in combination 109 | * Example encrypted binaries. 110 | -------------------------------------------------------------------------------- /entropy.go: -------------------------------------------------------------------------------- 1 | package randumb 2 | 3 | import ( 4 | "sort" 5 | "sync" 6 | ) 7 | 8 | // Arbitrary thresholds outside of which randomness is "likely" 9 | const ( 10 | FreqThresh = .6 11 | SkewThresh = .2 12 | ) 13 | 14 | /* 15 | * Pearson's second skewness coefficient: 16 | * 3 * (avg - median) / std_dev 17 | */ 18 | func Skewness(data []byte, tuple int) float64 { 19 | binHist := makeBinHist(data, tuple) 20 | values := []float64{} 21 | for _, i := range binHist { 22 | values = append(values, float64(i)) 23 | } 24 | sort.Float64s(values) 25 | 26 | var a, m, s float64 27 | wg := sync.WaitGroup{} 28 | 29 | // Calculate the average.
30 | wg.Add(1) 31 | go func() { 32 | a = avg(values) 33 | wg.Done() 34 | }() 35 | 36 | // Calculate the median. 37 | wg.Add(1) 38 | go func() { 39 | m = median(values) 40 | wg.Done() 41 | }() 42 | 43 | // Calculate the standard deviation. 44 | wg.Add(1) 45 | go func() { 46 | s = stdDev(values) 47 | wg.Done() 48 | }() 49 | 50 | wg.Wait() 51 | return 3 * (a - m) / s 52 | } 53 | 54 | func Frequency(data []byte, chunkSize int) float64 { 55 | fl := frequencyList(data, chunkSize) 56 | return sum(fl) / float64(len(fl)) 57 | } 58 | 59 | func IsRandom(data []byte) bool { 60 | // These vars may be changed to adjust randomness measurement 61 | var tuple = 8 62 | var chunkSize = 256 63 | 64 | wg := sync.WaitGroup{} 65 | wg.Add(1) 66 | var f, s float64 67 | go func(){ 68 | f = Frequency(data, chunkSize) 69 | wg.Done() 70 | }() 71 | 72 | wg.Add(1) 73 | go func(){ 74 | s = Skewness(data, tuple) 75 | wg.Done() 76 | }() 77 | wg.Wait() 78 | 79 | return f >= FreqThresh && s <= SkewThresh 80 | } 81 | -------------------------------------------------------------------------------- /py/cryptostalker.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import inotify.adapters 3 | import os 4 | import sys 5 | import time 6 | 7 | from randumb import Random 8 | 9 | class Stalk(object): 10 | def __init__(self, path, count, window): 11 | self.path = path 12 | self.count = count 13 | self.window = window 14 | self.files = [] 15 | self.last = time.time() 16 | 17 | def _reset(self): 18 | self.files = [] 19 | self.last = time.time() 20 | 21 | def is_random(self, filename): 22 | try: 23 | data = open(filename, 'rb').read() 24 | if len(data) > 0 and Random.is_random(data): 25 | return True 26 | except Exception as e: 27 | pass 28 | return False 29 | 30 | def _log(self, diff): 31 | print('Seen {} random files in {} seconds:'.format(len(self.files), diff)) 32 | for f in self.files: 33 | print(f) 34 | 35 | def meow(self): 36 | try: 37 | i = 
inotify.adapters.InotifyTree(self.path) 38 | for event in i.event_gen(): 39 | if event: 40 | (header, type_names, d_name, f_name) = event 41 | if 'IN_CLOSE_WRITE' in type_names: 42 | filename = os.path.join(d_name, f_name) 43 | if self.is_random(filename): 44 | 45 | # Code needs to be a smidge ugly to get immediate reporting. 46 | self.files.append(filename) 47 | now = time.time() 48 | if now - self.last <= self.window and len(self.files) >= self.count: 49 | self._log(now - self.last) 50 | self._reset() 51 | if now - self.last >= self.window: 52 | self._reset() 53 | except Exception as e: 54 | print('[Error]: ', e) 55 | return 1 56 | return 0 57 | 58 | 59 | if __name__ == '__main__': 60 | 61 | p = argparse.ArgumentParser(description='Detect high throughput of newly-' 62 | 'created encrypted files.') 63 | p.add_argument('--path', required=True, help='The directory to watch.') 64 | p.add_argument('--count', type=int, default=10, 65 | help='The number of random files required to be seen within' 66 | ' the --window period.') 67 | p.add_argument('--window', type=int, default=60, 68 | help='The number of seconds within which random files' 69 | ' must be observed.') 70 | p.add_argument('--sleep', type=float, default=0.0, 71 | help='The time in seconds to sleep between processing each new' 72 | ' file to determine whether it is random.') 73 | # TODO: --sleep is parsed but not yet used by Stalk. 74 | args = p.parse_args() 75 | sys.exit(Stalk(args.path, args.count, args.window).meow()) 76 | -------------------------------------------------------------------------------- /py/randumb.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import math 4 | 5 | class Frequency(object): 6 | '''Byte-frequency of an input''' 7 | 8 | # If the calculation is at least RAND_THRESHOLD, it's possibly random.
9 | RAND_THRESHOLD = .6 10 | 11 | def __init__(self, data, chunk_size=256, byte_width=256, case_sensitive=True): 12 | ''' 13 | data - string of bytes to consider 14 | chunk_size - number of bytes to evaluate at a time 15 | byte_width - spectrum of valid byte values for the data (not working) 16 | case_sensitive - when False and data is ASCII, treat 'A' the same as 'a', etc. 17 | ''' 18 | self.data = data 19 | self.chunk_size = chunk_size 20 | self.byte_width = byte_width 21 | self.case_sensitive = case_sensitive 22 | 23 | def _min_max_gap(self, histogram): 24 | items = histogram.values() 25 | return max(items) - min(items) 26 | 27 | def _histogram(self, chunk): 28 | b = {} 29 | for c in chunk: 30 | if not self.case_sensitive: 31 | c = c.lower() 32 | if c in b: 33 | b[c] += 1 34 | else: 35 | b[c] = 1 36 | return b 37 | 38 | # Split data into equal-sized chunks (except for final chunk) 39 | def _chunk_it(self): 40 | return [self.data[i:i+self.chunk_size] 41 | for i in range(0, len(self.data), self.chunk_size)] 42 | 43 | def _frequency(self, chunk): 44 | histo = self._histogram(chunk) 45 | histo_len = len(histo) 46 | # chunk_len should only be different than self.chunk_size for the final 47 | # chunk, or when data < self.chunk_size 48 | chunk_len = len(chunk) 49 | # This expression doesn't factor in byte_width. 50 | if histo_len == 0 or chunk_len == 0: 51 | return 0.0 52 | return float(histo_len) / chunk_len 53 | 54 | ### API 55 | def frequency_list(self): 56 | ''' 57 | List of frequency values.
Similar to avg() except 58 | without performing the average 59 | ''' 60 | chunks = self._chunk_it() 61 | return [self._frequency(c) for c in chunks] 62 | 63 | def gap(self): 64 | '''Gap between the highest and lowest frequency chunks''' 65 | freqs = self.frequency_list() 66 | return float(max(freqs)) - float(min(freqs)) 67 | 68 | def avg(self): 69 | '''Average of the uniformity of bytes over all chunks''' 70 | freqs = self.frequency_list() 71 | return float(sum(freqs)) / float(len(freqs)) 72 | 73 | class Skewness(object): 74 | ''' 75 | Pearson's second skewness coefficient: 76 | 3 * (avg - median) / std_dev 77 | ''' 78 | # If the calculation is less than RAND_THRESHOLD, it's possibly random. 79 | RAND_THRESHOLD = .2 80 | 81 | def __init__(self, data): 82 | self.data = data 83 | 84 | def _avg(self, l): 85 | return sum(l) * 1.0 / len(l) 86 | 87 | def _median(self, l): 88 | l.sort() 89 | c = len(l) 90 | sub = c % 2 == 0 91 | return l[c/2-sub] 92 | 93 | def _std_deviation(self, l): 94 | return math.sqrt(self._avg(map(lambda x: (x - self._avg(l))**2, l))) 95 | 96 | def skew(self, tuples): 97 | cur_bin_num = '' 98 | # Map each of the 2^tuples possible bit sequences to its frequency of occurrence. 99 | bin_map = {} 100 | # bin() drops leading zeros, so zero-pad each byte to a full 8 bits. 101 | for byte in self.data: 102 | for bit in bin(ord(byte))[2:].zfill(8): 103 | if len(cur_bin_num) == tuples: 104 | if cur_bin_num in bin_map: 105 | bin_map[cur_bin_num] += 1 106 | else: 107 | bin_map[cur_bin_num] = 1 108 | # Now that we've tallied this bit-sequence, start over. 109 | cur_bin_num = '' 110 | # Always keep the current bit; tallying must not consume it. 111 | cur_bin_num += bit 112 | 113 | vals = bin_map.values() 114 | 115 | # Corner case where all bit tuple patterns occur the same number of times. 116 | # This prevents a divide-by-zero when calculating standard deviation. 117 | # Hmm, smells janky.
118 | if len(set(vals)) <= 1: 119 | return 1 120 | 121 | # Calculate skewness 122 | return 3 * (self._avg(vals) - self._median(vals)) / self._std_deviation(vals) 123 | 124 | class Random(object): 125 | ''' 126 | Mash up some characteristics of randomness and make an arbitrary decision 127 | as to whether the input is statistically random. 128 | 129 | Returns: 130 | True if random 131 | False if not random 132 | ''' 133 | 134 | @staticmethod 135 | def is_random(data, bit_tuples=8): 136 | freq = Frequency(data) 137 | skew = Skewness(data) 138 | 139 | skewness = skew.skew(bit_tuples) # Compute the skew once rather than twice. 140 | if freq.avg() >= freq.RAND_THRESHOLD and \ 141 | skewness <= skew.RAND_THRESHOLD: 142 | return True 143 | else: 144 | return False 145 | 146 | def main(args): 147 | if len(args) > 1: 148 | data = open(args[1], 'rb').read() 149 | else: 150 | data = sys.stdin.read() 151 | 152 | return 0 if Random.is_random(data) else 1 153 | 154 | if __name__ == '__main__': 155 | sys.exit(main(sys.argv)) 156 | -------------------------------------------------------------------------------- /util.go: -------------------------------------------------------------------------------- 1 | package randumb 2 | 3 | import ( 4 | "fmt" 5 | "math" 6 | "strconv" 7 | ) 8 | 9 | func avg(nums []float64) float64 { 10 | sum := 0.0 11 | for i := range nums { 12 | sum += nums[i] 13 | } 14 | return sum / float64(len(nums)) 15 | } 16 | 17 | func stdDev(nums []float64) float64 { 18 | a := avg(nums) // Hoist the mean so it isn't recomputed for every element. 19 | return math.Sqrt(avg(mapFloat64(nums, func(f float64) float64 { 20 | return math.Pow(f-a, 2) }))) 21 | } 22 | 23 | func median(nums []float64) float64 { 24 | var sub = 0 25 | c := len(nums) 26 | if c%2 == 0 { 27 | sub = 1 28 | } 29 | return nums[c/2-sub] 30 | } 31 | 32 | // Graciously plucked from gobyexample.com 33 | func mapFloat64(vs []float64, f func(float64) float64) []float64 { 34 | vsm := make([]float64, len(vs)) 35 | for i, v := range vs { 36 | vsm[i] = f(v) 37 | } 38 | return vsm 39 | } 40 | 41 | func frequency(chunk
[]byte) float64 { 42 | hist := makeByteHist(chunk) 43 | lhist := float64(len(hist)) 44 | lchunk := float64(len(chunk)) 45 | if lchunk == 0.0 || lhist == 0.0 { 46 | return 0.0 47 | } 48 | return lhist / lchunk 49 | } 50 | func frequencyList(data []byte, chunkSize int) []float64 { 51 | chunks := makeChunks(data, chunkSize) 52 | fl := []float64{} 53 | for _, c := range chunks { 54 | fl = append(fl, frequency(c)) 55 | } 56 | return fl 57 | } 58 | 59 | // Chunk the data into a slice of size-sized byte slices. 60 | func makeChunks(data []byte, size int) [][]byte { 61 | j := 0 62 | l := len(data) 63 | chunks := make([][]byte, (l+size-1)/size) // Ceiling division so an exact multiple of size doesn't add an empty chunk. 64 | for i := 0; i < l; i += size { 65 | chunks[j] = append(chunks[j], data[i:int(math.Min(float64(i+size), float64(l)))]...) 66 | j += 1 67 | } 68 | return chunks 69 | } 70 | 71 | // Histogram of tuple-wide bit frequency in data 72 | // Example: { '0010': 4, '0110': 41 } 73 | func makeBinHist(data []byte, tuple int) map[string]int { 74 | chrMap := map[byte]string{} 75 | binMap := map[string]int{} 76 | binStr := "" 77 | for _, d := range data { 78 | // Convert byte -> zero-padded 8-bit binary string and cache it for 79 | // better performance; OR-ing with 0x100 preserves leading zeros. 80 | if _, in := chrMap[d]; !in { 81 | chrMap[d] = strconv.FormatInt(int64(d)|0x100, 2)[1:] 82 | } 83 | // Iterate over the binary representation and construct the 84 | // histogram of binary sequences.
85 | for _, bit := range chrMap[d] { 86 | // Once a full tuple has been collected, tally it and reset. 87 | if len(binStr) == tuple { 88 | binMap[binStr] += 1 89 | binStr = "" 90 | } 91 | // Always append the current bit so the bit that triggers a 92 | // tally isn't dropped. 93 | binStr += string(bit) 94 | } 95 | } 96 | return binMap 97 | } 98 | 99 | // Histogram of byte frequency in data 100 | func makeByteHist(data []byte) map[string]int { 101 | byteMap := map[string]int{} 102 | for _, d := range data { 103 | byteMap[fmt.Sprintf("%d", d)] += 1 104 | } 105 | return byteMap 106 | } 107 | 108 | func sum(nums []float64) float64 { 109 | total := 0.0 110 | for _, i := range nums { 111 | total += i 112 | } 113 | return total 114 | } 115 | --------------------------------------------------------------------------------
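For reference, Pearson's second skewness coefficient used by both implementations above can be sketched as a standalone Python 3 helper. The names are illustrative rather than part of the library, and two choices differ from the sources above: this sketch uses the textbook even-length median (averaging the two middle values, where the Go and Python code take the lower-middle element) and returns 0.0 for an all-equal distribution instead of short-circuiting.

```python
import math

def pearson_skew(values):
    """3 * (mean - median) / population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = sorted(values)
    mid = n // 2
    # Textbook median: average the two middle values when n is even.
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # All buckets equal; avoid dividing by zero.
        return 0.0
    return 3 * (mean - median) / std

# A symmetric distribution has zero skew (mean equals median).
print(pearson_skew([2, 4, 6]))      # 0.0
# A long right tail pulls the mean above the median: positive skew.
print(pearson_skew([1, 1, 1, 10]))  # positive, about 1.73
```

A bit-tuple histogram from random input tends to have roughly equal bucket counts, so its skew stays near zero, which is why the code above treats small skew values (below `SkewThresh`/`RAND_THRESHOLD` = .2) as a sign of randomness.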