├── README.md ├── entropy.go ├── py │   ├── cryptostalker.py │   └── randumb.py └── util.go /README.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | randumb naively estimates an input's level of entropy by running some tests on it. Scores range from 0 to 1, where 1 is as random as randumb can guess. 4 | 5 | The types of analysis currently supported are frequency and skewness. The first of [Kendall and Smith's randomness tests](https://en.wikipedia.org/wiki/Statistical_randomness) is the frequency test. It tries to answer the question of how uniform the distribution of byte values is in a given input. 6 | 7 | The skewness analysis is based upon [Pearson's second coefficient](http://mathworld.wolfram.com/PearsonsSkewnessCoefficients.html). It makes a binary guess of randomness based upon the asymmetry of the input's distribution. 8 | 9 | # Description 10 | Tests: 11 | * Frequency - iterate over the input in chunks and compute the ratio "unique-byte-values / chunk-length". A chunk of `n` bytes is uniformly distributed (and scores 1 in the frequency test) if every byte value in it is distinct. 12 | * Example: if a 26-byte chunk contained the value "abcdefghijklmnopqrstuvwxyz", it'd be random as far as the frequency test is concerned, since each value is present only once. Of course there is a pattern there, but that's for another test to decide. 13 | * Skewness - iterate over the input and build a histogram of it, where each bucket counts a bit tuple (8-bit tuples by default). The amount of variance across the distribution determines the input's randomness. I currently use an anecdotal constant as a threshold for determining randomness. The basis for this equation is [Pearson's second coefficient](http://mathworld.wolfram.com/PearsonsSkewnessCoefficients.html).
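The frequency ratio described above can be sketched in a few lines of Python. This is a hypothetical standalone helper (`chunk_frequency` is not part of randumb's API) and assumes Python 3 `bytes` input:

```python
def chunk_frequency(chunk: bytes) -> float:
    """Return unique-byte-values / chunk-length for one chunk (0.0 if empty)."""
    if not chunk:
        return 0.0
    return len(set(chunk)) / len(chunk)

# Every byte value distinct: a perfect 1.0, even though the content is patterned.
print(chunk_frequency(b"abcdefghijklmnopqrstuvwxyz"))  # 1.0

# Heavy repetition drags the ratio down: 1 unique value over 26 bytes.
print(chunk_frequency(b"a" * 26))  # 1/26, about 0.0385
```

Averaging this ratio over all fixed-size chunks of an input gives per-input figures like the Frequency(avg) numbers shown in the examples further down.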
14 | 15 | # Simple example 16 | I use a naïve combination of the randomness tests described above to arrive at a guess of randomness. Given an input, randumb.py exits with a binary status: 0 for random, 1 for non-random. 17 | 18 | Let's look at an ELF binary in encrypted and unencrypted form. 19 | ```bash 20 | # input - regular OpenSSH binary 21 | vagrant@precise64:~/randumb$ python randumb.py < /usr/bin/ssh; echo $? 22 | 1 23 | vagrant@precise64:~/randumb$ 24 | 25 | # input - encrypted OpenSSH binary 26 | vagrant@precise64:~/randumb$ python randumb.py < /tmp/ssh.enc; echo $? 27 | 0 28 | vagrant@precise64:~/randumb$ 29 | ``` 30 | 31 | # Cryptostalker example 32 | This tool uses the randumb library to monitor a filesystem path and detect newly-written files. If these new files are deemed random and occur at a fast enough (configurable) rate, then it notifies you. 33 | 34 | ## MOVED: cryptostalker.py has [moved to its own repository](https://github.com/unixist/cryptostalker) and been ported to Go, so it works on Linux, OS X, and Windows. 35 | 36 | #### Python version 37 | I implemented this initially using Linux's inotify facility. This allows a file write event to be filtered on IN_CLOSE_WRITE, which fires when the file is finished being written. I'd prefer to use auditd to alert on new file writes, since it can also give the process ID of the writer. That would allow the process to be killed if we have enough confidence that it's probably bad. (Although auditd can place a recursive watch similar to inotify, I don't know whether auditd can alert on a file only *after* all writes are complete and only if it was opened for writing.) 38 | 39 | #### Go version 40 | The file notification mechanism is Google's [fsnotify](https://github.com/fsnotify/fsnotify). Since it doesn't use the Linux-specific [inotify](https://en.wikipedia.org/wiki/Inotify), cryptostalker currently relies on notifications of new files.
So random/encrypted files will only be detected if they belong to new inodes, which means it won't catch the following case: a file is opened, truncated, and only then filled in with encrypted content. Fortunately, this is not how most malware works. 41 | 42 | See the [new repo for the Go version](https://github.com/unixist/cryptostalker). 43 | 44 | #### Misc 45 | * I'd be stoked if someone can show me how to get auditd to behave optimally for this use case! 46 | * cryptostalker may incorrectly identify compressed files as encrypted files. I haven't seen this happen in my testing, but I can only imagine that given a quality compressor and the right type of input data this will yield some false positives. You can always tweak the `RAND_THRESHOLD`s in randumb.py to your liking. 47 | 48 | ```bash 49 | # Running with only the --path parameter defaults to a detection threshold of 10 files per 60 seconds 50 | vagrant@precise64:~/randumb$ python cryptostalker.py --path /home 51 | Seen 10 random files in 30.1896209717 seconds: 52 | /home/vagrant/.foo.8299 53 | /home/vagrant/.foo.5266 54 | /home/vagrant/.foo.10551 55 | /home/vagrant/.foo.8444 56 | /home/vagrant/.foo.20163 57 | /home/vagrant/.foo.820 58 | /home/vagrant/.foo.28854 59 | /home/vagrant/.foo.27284 60 | /home/vagrant/.foo.21306 61 | /home/vagrant/.foo.24437 62 | 63 | # See example usage 64 | vagrant@precise64:~/randumb$ python cryptostalker.py --help 65 | usage: cryptostalker.py [-h] --path PATH [--count COUNT] [--window WINDOW] 66 | [--sleep SLEEP] 67 | 68 | Detect high throughput of newly-created encrypted files. 69 | 70 | optional arguments: 71 | -h, --help show this help message and exit 72 | --path PATH The directory to watch. 73 | --count COUNT The number of random files required to be seen within 74 | the --window period. 75 | --window WINDOW The number of seconds within which random files 76 | must be observed. 77 | --sleep SLEEP The time in seconds to sleep between processing each new 78 | file to determine whether it is random.
79 | vagrant@precise64:~/randumb$ 80 | ``` 81 | # Frequency example (old) 82 | ```bash 83 | 84 | # input - ASCII file 85 | vagrant@precise64:~/randumb$ python randumb.py < /etc/passwd 86 | Frequency(avg): 0.165232 87 | vagrant@precise64:~/randumb$ 88 | 89 | # input - Linux kernel .text 90 | vagrant@precise64:~/randumb$ objcopy -O binary --only-section=.text ~/vmlinux /dev/stdout | python randumb.py 91 | Frequency(avg): 0.326217 92 | vagrant@precise64:~/randumb$ 93 | 94 | # input - /dev/random 95 | vagrant@precise64:~/randumb$ python randumb.py < ~/input.random 96 | Frequency(avg): 0.632422 97 | vagrant@precise64:~/randumb$ 98 | 99 | # input - openssl enc -aes-256-cbc 100 | vagrant@precise64:~/randumb$ python randumb.py < ~/input.enc 101 | Frequency(avg): 0.613281 102 | vagrant@precise64:~/randumb$ 103 | ``` 104 | 105 | The closer the score is to 1, the more likely it is that portions of the input are encrypted. I've found that random/well-encrypted content yields about .6-.7 in my testing. 106 | 107 | # Future 108 | * Support for more randomness tests, and better ways to evaluate the tests in combination 109 | * Example encrypted binaries. 110 | -------------------------------------------------------------------------------- /entropy.go: -------------------------------------------------------------------------------- 1 | package randumb 2 | 3 | import ( 4 | "sort" 5 | "sync" 6 | ) 7 | 8 | // Arbitrary thresholds outside of which randomness is "likely" 9 | const ( 10 | FreqThresh = .6 11 | SkewThresh = .2 12 | ) 13 | 14 | /* 15 | * Pearson's second skewness coefficient: 16 | * 3 * (avg - median) / std_dev 17 | */ 18 | func Skewness(data []byte, tuple int) float64 { 19 | binHist := makeBinHist(data, tuple) 20 | values := []float64{} 21 | for _, i := range binHist { 22 | values = append(values, float64(i)) 23 | } 24 | sort.Float64s(values) 25 | 26 | var a, m, s float64 27 | wg := sync.WaitGroup{} 28 | 29 | // Calculate the average.
30 | wg.Add(1) 31 | go func() { 32 | a = avg(values) 33 | wg.Done() 34 | }() 35 | 36 | // Calculate the median. 37 | wg.Add(1) 38 | go func() { 39 | m = median(values) 40 | wg.Done() 41 | }() 42 | 43 | // Calculate the standard deviation. 44 | wg.Add(1) 45 | go func() { 46 | s = stdDev(values) 47 | wg.Done() 48 | }() 49 | 50 | wg.Wait() 51 | return 3 * (a - m) / s 52 | } 53 | 54 | func Frequency(data []byte, chunkSize int) float64 { 55 | fl := frequencyList(data, chunkSize) 56 | return sum(fl) / float64(len(fl)) 57 | } 58 | 59 | func IsRandom(data []byte) bool { 60 | // These vars may be changed to adjust randomness measurement 61 | var tuple = 8 62 | var chunkSize = 256 63 | 64 | wg := sync.WaitGroup{} 65 | wg.Add(1) 66 | var f, s float64 67 | go func(){ 68 | f = Frequency(data, chunkSize) 69 | wg.Done() 70 | }() 71 | 72 | wg.Add(1) 73 | go func(){ 74 | s = Skewness(data, tuple) 75 | wg.Done() 76 | }() 77 | wg.Wait() 78 | 79 | return f >= FreqThresh && s <= SkewThresh 80 | } 81 | -------------------------------------------------------------------------------- /py/cryptostalker.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import inotify.adapters 3 | import os 4 | import sys 5 | import time 6 | 7 | from randumb import Random 8 | 9 | class Stalk(object): 10 | def __init__(self, path, count, window): 11 | self.path = path 12 | self.count = count 13 | self.window = window 14 | self.files = [] 15 | self.last = time.time() 16 | 17 | def _reset(self): 18 | self.files = [] 19 | self.last = time.time() 20 | 21 | def is_random(self, filename): 22 | try: 23 | data = open(filename, 'rb').read() 24 | if len(data) > 0 and Random.is_random(data): 25 | return True 26 | except Exception as e: 27 | pass 28 | return False 29 | 30 | def _log(self, diff): 31 | print('Seen {} random files in {} seconds:'.format(len(self.files), diff)) 32 | for f in self.files: 33 | print(f) 34 | 35 | def meow(self): 36 | try: 37 | i = 
inotify.adapters.InotifyTree(self.path) 38 | for event in i.event_gen(): 39 | if event: 40 | (header, type_names, d_name, f_name) = event 41 | if 'IN_CLOSE_WRITE' in type_names: 42 | filename = os.path.join(d_name, f_name) 43 | if self.is_random(filename): 44 | 45 | # Code needs to be a smidge ugly to get immediate reporting. 46 | self.files.append(filename) 47 | now = time.time() 48 | if now - self.last <= self.window and len(self.files) >= self.count: 49 | self._log(now - self.last) 50 | self._reset() 51 | if now - self.last >= self.window: 52 | self._reset() 53 | except Exception as e: 54 | print('[Error]: ', e) 55 | return 1 56 | return 0 57 | 58 | 59 | if __name__ == '__main__': 60 | 61 | p = argparse.ArgumentParser(description='Detect high throughput of newly-' 62 | 'created encrypted files.') 63 | p.add_argument('--path', required=True, help='The directory to watch.') 64 | p.add_argument('--count', type=int, default=10, 65 | help='The number of random files required to be seen within' 66 | ' the --window period.') 67 | p.add_argument('--window', type=int, default=60, 68 | help='The number of seconds within which random files' 69 | ' must be observed.') 70 | p.add_argument('--sleep', type=float, default=0.0, 71 | help='The time in seconds to sleep between processing each new' 72 | ' file to determine whether it is random.') 73 | # TODO: --sleep is parsed but not yet used by Stalk. 74 | args = p.parse_args() 75 | sys.exit(Stalk(args.path, args.count, args.window).meow()) 76 | -------------------------------------------------------------------------------- /py/randumb.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sys 3 | import math 4 | 5 | class Frequency(object): 6 | '''Byte-frequency of an input''' 7 | 8 | # If the calculation is at least RAND_THRESHOLD, it's possibly random.
9 | RAND_THRESHOLD = .6 10 | 11 | def __init__(self, data, chunk_size=256, byte_width=256, case_sensitive=True): 12 | ''' 13 | data - string of bytes to consider 14 | chunk_size - number of bytes to evaluate at a time 15 | byte_width - spectrum of valid byte values for the data (not working) 16 | case_sensitive - when False and data is ASCII, treat 'A' the same as 'a', etc. 17 | ''' 18 | self.data = data 19 | self.chunk_size = chunk_size 20 | self.byte_width = byte_width 21 | self.case_sensitive = case_sensitive 22 | 23 | def _min_max_gap(self, histogram): 24 | items = histogram.values() 25 | return max(items) - min(items) 26 | 27 | def _histogram(self, chunk): 28 | b = {} 29 | for c in chunk: 30 | if not self.case_sensitive: 31 | c = c.lower() 32 | if c in b: 33 | b[c] += 1 34 | else: 35 | b[c] = 1 36 | return b 37 | 38 | # Split data into equal-sized chunks (except for final chunk) 39 | def _chunk_it(self): 40 | return [self.data[i:i+self.chunk_size] 41 | for i in range(0, len(self.data), self.chunk_size)] 42 | 43 | def _frequency(self, chunk): 44 | histo = self._histogram(chunk) 45 | histo_len = len(histo) 46 | # chunk_len should only be different than self.chunk_size for the final 47 | # chunk, or when data < self.chunk_size 48 | chunk_len = len(chunk) 49 | # This expression doesn't factor in byte_width. 50 | if histo_len == 0 or chunk_len == 0: 51 | return 0.0 52 | return float(histo_len) / chunk_len 53 | 54 | ### API 55 | def frequency_list(self): 56 | ''' 57 | List of frequency values.
Similar to avg() except 58 | without performing the average 59 | ''' 60 | chunks = self._chunk_it() 61 | return [self._frequency(c) for c in chunks] 62 | 63 | def gap(self): 64 | '''Gap between the highest and lowest frequency chunks''' 65 | freqs = self.frequency_list() 66 | return float(max(freqs)) - float(min(freqs)) 67 | 68 | def avg(self): 69 | '''Average of the uniformity of bytes over all chunks''' 70 | freqs = self.frequency_list() 71 | return float(sum(freqs)) / float(len(freqs)) 72 | 73 | class Skewness(object): 74 | ''' 75 | Pearson's second skewness coefficient: 76 | 3 * (avg - median) / std_dev 77 | ''' 78 | # If the calculation is less than RAND_THRESHOLD, it's possibly random. 79 | RAND_THRESHOLD = .2 80 | 81 | def __init__(self, data): 82 | self.data = data 83 | 84 | def _avg(self, l): 85 | return sum(l) * 1.0 / len(l) 86 | 87 | def _median(self, l): 88 | l.sort() 89 | c = len(l) 90 | sub = c % 2 == 0 91 | return l[c/2-sub] 92 | 93 | def _std_deviation(self, l): 94 | return math.sqrt(self._avg(map(lambda x: (x - self._avg(l))**2, l))) 95 | 96 | def skew(self, tuples): 97 | cur_bin_num = '' 98 | # Map each of the 2^tuples possible bit sequences to its frequency of occurrence. 99 | bin_map = {} 100 | # bin() drops leading zeros, so zero-pad each byte to a full 8 bits. 101 | for byte in self.data: 102 | for bit in bin(ord(byte))[2:].zfill(8): 103 | if len(cur_bin_num) == tuples: 104 | if cur_bin_num in bin_map: 105 | bin_map[cur_bin_num] += 1 106 | else: 107 | bin_map[cur_bin_num] = 1 108 | # Now that we've tallied this bit-sequence, start over. 109 | cur_bin_num = '' 110 | # Always keep the current bit; tallying must not consume it. 111 | cur_bin_num += bit 112 | 113 | vals = bin_map.values() 114 | 115 | # Corner case where all bit tuple patterns occur the same number of times. 116 | # This prevents a divide-by-zero when calculating standard deviation. 117 | # Hmm, smells janky.
118 | if len(set(vals)) <= 1: 119 | return 1 120 | 121 | # Calculate skewness 122 | return 3 * (self._avg(vals) - self._median(vals)) / self._std_deviation(vals) 123 | 124 | class Random(object): 125 | ''' 126 | Mash up some characteristics of randomness and make an arbitrary decision 127 | as to whether the input is statistically random. 128 | 129 | Returns: 130 | True if random 131 | False if not random 132 | ''' 133 | 134 | @staticmethod 135 | def is_random(data, bit_tuples=8): 136 | freq = Frequency(data) 137 | skew = Skewness(data) 138 | 139 | skewness = skew.skew(bit_tuples) # Compute the skew once rather than twice. 140 | if freq.avg() >= freq.RAND_THRESHOLD and \ 141 | skewness <= skew.RAND_THRESHOLD: 142 | return True 143 | else: 144 | return False 145 | 146 | def main(args): 147 | if len(args) > 1: 148 | data = open(args[1], 'rb').read() 149 | else: 150 | data = sys.stdin.read() 151 | 152 | return 0 if Random.is_random(data) else 1 153 | 154 | if __name__ == '__main__': 155 | sys.exit(main(sys.argv)) 156 | -------------------------------------------------------------------------------- /util.go: -------------------------------------------------------------------------------- 1 | package randumb 2 | 3 | import ( 4 | "fmt" 5 | "math" 6 | "strconv" 7 | ) 8 | 9 | func avg(nums []float64) float64 { 10 | sum := 0.0 11 | for i := range nums { 12 | sum += nums[i] 13 | } 14 | return sum / float64(len(nums)) 15 | } 16 | 17 | func stdDev(nums []float64) float64 { 18 | a := avg(nums) // Hoist the mean so it isn't recomputed for every element. 19 | return math.Sqrt(avg(mapFloat64(nums, func(f float64) float64 { 20 | return math.Pow(f-a, 2) }))) 21 | } 22 | 23 | func median(nums []float64) float64 { 24 | var sub = 0 25 | c := len(nums) 26 | if c%2 == 0 { 27 | sub = 1 28 | } 29 | return nums[c/2-sub] 30 | } 31 | 32 | // Graciously plucked from gobyexample.com 33 | func mapFloat64(vs []float64, f func(float64) float64) []float64 { 34 | vsm := make([]float64, len(vs)) 35 | for i, v := range vs { 36 | vsm[i] = f(v) 37 | } 38 | return vsm 39 | } 40 | 41 | func frequency(chunk
[]byte) float64 { 42 | hist := makeByteHist(chunk) 43 | lhist := float64(len(hist)) 44 | lchunk := float64(len(chunk)) 45 | if lchunk == 0.0 || lhist == 0.0 { 46 | return 0.0 47 | } 48 | return lhist / lchunk 49 | } 50 | func frequencyList(data []byte, chunkSize int) []float64 { 51 | chunks := makeChunks(data, chunkSize) 52 | fl := []float64{} 53 | for _, c := range chunks { 54 | fl = append(fl, frequency(c)) 55 | } 56 | return fl 57 | } 58 | 59 | // Chunk the data into a slice of size-sized byte slices. 60 | func makeChunks(data []byte, size int) [][]byte { 61 | j := 0 62 | l := len(data) 63 | chunks := make([][]byte, (l+size-1)/size) // Ceiling division so an exact multiple of size doesn't add an empty chunk. 64 | for i := 0; i < l; i += size { 65 | chunks[j] = append(chunks[j], data[i:int(math.Min(float64(i+size), float64(l)))]...) 66 | j += 1 67 | } 68 | return chunks 69 | } 70 | 71 | // Histogram of tuple-wide bit frequency in data 72 | // Example: { '0010': 4, '0110': 41 } 73 | func makeBinHist(data []byte, tuple int) map[string]int { 74 | chrMap := map[byte]string{} 75 | binMap := map[string]int{} 76 | binStr := "" 77 | for _, d := range data { 78 | // Convert byte -> zero-padded 8-bit binary string and cache it for 79 | // better performance; OR-ing with 0x100 preserves leading zeros. 80 | if _, in := chrMap[d]; !in { 81 | chrMap[d] = strconv.FormatInt(int64(d)|0x100, 2)[1:] 82 | } 83 | // Iterate over the binary representation and construct the 84 | // histogram of binary sequences.
85 | for _, bit := range chrMap[d] { 86 | // Once a full tuple has been collected, tally it and reset. 87 | if len(binStr) == tuple { 88 | binMap[binStr] += 1 89 | binStr = "" 90 | } 91 | // Always append the current bit so the bit that triggers a 92 | // tally isn't dropped. 93 | binStr += string(bit) 94 | } 95 | } 96 | return binMap 97 | } 98 | 99 | // Histogram of byte frequency in data 100 | func makeByteHist(data []byte) map[string]int { 101 | byteMap := map[string]int{} 102 | for _, d := range data { 103 | byteMap[fmt.Sprintf("%d", d)] += 1 104 | } 105 | return byteMap 106 | } 107 | 108 | func sum(nums []float64) float64 { 109 | total := 0.0 110 | for _, i := range nums { 111 | total += i 112 | } 113 | return total 114 | } 115 | --------------------------------------------------------------------------------
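For reference, Pearson's second skewness coefficient used by both implementations above can be sketched as a standalone Python 3 helper. The names are illustrative rather than part of the library, and two choices differ from the sources above: this sketch uses the textbook even-length median (averaging the two middle values, where the Go and Python code take the lower-middle element) and returns 0.0 for an all-equal distribution instead of short-circuiting.

```python
import math

def pearson_skew(values):
    """3 * (mean - median) / population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = sorted(values)
    mid = n // 2
    # Textbook median: average the two middle values when n is even.
    median = s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # All buckets equal; avoid dividing by zero.
        return 0.0
    return 3 * (mean - median) / std

# A symmetric distribution has zero skew (mean equals median).
print(pearson_skew([2, 4, 6]))      # 0.0
# A long right tail pulls the mean above the median: positive skew.
print(pearson_skew([1, 1, 1, 10]))  # positive, about 1.73
```

A bit-tuple histogram from random input tends to have roughly equal bucket counts, so its skew stays near zero, which is why the code above treats small skew values (below `SkewThresh`/`RAND_THRESHOLD` = .2) as a sign of randomness.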