├── .gitignore ├── LICENSE.txt ├── README.md ├── add.js ├── add.lua ├── benchmark.sh ├── cas.js ├── cas.lua ├── check.js ├── check.lua ├── layer-add.js ├── layer-add.lua ├── layer-benchmark.sh ├── layer-check.js ├── layer-check.lua └── package.sh /.gitignore: -------------------------------------------------------------------------------- 1 | node_modules 2 | *.rpm 3 | *.deb 4 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2017 Erik Dubbelboer 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | redis-lua-scaling-bloom-filter 3 | ============================== 4 | 5 | `add.lua`, `cas.lua` and `check.lua` are three Lua scripts for a [scaling bloom filter](http://en.wikipedia.org/wiki/Bloom_filter#Scalable_Bloom_filters) for [Redis](http://redis.io/) 6 | 7 | `layer-add.lua` and `layer-check.lua` are two Lua scripts for a [scaling layered bloom filter](https://en.wikipedia.org/wiki/Bloom_filter#Layered_Bloom_filters) for [Redis](http://redis.io/) 8 | 9 | The scripts are to be executed using the [EVAL](http://redis.io/commands/eval) command in Redis. 10 | 11 | _These scripts will probably not work on Redis Cluster, since the keys used inside the scripts aren't all passed as arguments!_ 12 | 13 | The layered filter has a maximum of 32 layers. You can change this limit in the source. 14 | 15 | 16 | `add.lua`, `cas.lua` and `layer-add.lua` 17 | ---------------------------------------- 18 | 19 | The `add.lua` script adds a new element to the filter. It will create the filter if it doesn't exist yet. 20 | 21 | `cas.lua` does a Check And Set: it will not add the element if it already exists. 22 | `cas.lua` will return 0 if the element was added, or 1 if the element was already in the filter. 23 | Since we use a scaling filter, adding an element using `add.lua` might cause the element 24 | to exist in multiple parts of the filter at the same time. `cas.lua` prevents this. 25 | Using only `cas.lua`, the `:count` key of the filter will accurately count the number of elements added to the filter. 
26 | Only using `cas.lua` will also lower the number of false positives by a small amount (fewer duplicates in the filter mean fewer bits set). 27 | 28 | `layer-add.lua` does a similar check to `cas.lua`, since this is necessary for the layering to work 29 | (it needs to check all the filters in a layer to see if the element already exists in that layer). 30 | `layer-add.lua` will return the number of the layer the element was added to. 31 | 32 | These scripts expect 4 arguments. 33 | 34 | 1. The base name of the keys to use. 35 | 2. The initial size of the bloom filter (in number of elements). 36 | 3. The probability of false positives. 37 | 4. The element to add to the filter. 38 | 39 | 40 | For example, the following call would add "something" to a filter named test 41 | which will initially be able to hold 10000 elements with a probability of false positives of 1%. 42 | 43 | ` 44 | eval "add.lua source here" 0 test 10000 0.01 something 45 | ` 46 | 47 | 48 | `check.lua` and `layer-check.lua` 49 | --------------------------------- 50 | 51 | The `check.lua` and `layer-check.lua` scripts check if an element is contained in the bloom filter. 52 | 53 | `layer-check.lua` returns the layer the element was found in. 54 | 55 | These scripts expect 4 arguments. 56 | 57 | 1. The base name of the keys to use. 58 | 2. The initial size of the bloom filter (in number of elements). 59 | 3. The probability of false positives. 60 | 4. The element to check for. 61 | 62 | 63 | For example, the following call would check if "something" is part of the filter named test 64 | which will initially be able to hold 10000 elements with a probability of false positives of 1%. 65 | 66 | ` 67 | eval "check.lua source here" 0 test 10000 0.01 something 68 | ` 69 | 70 | 71 | Tests 72 | ----- 73 | 74 | ``` 75 | $ npm install redis srand 76 | $ node add.js 77 | $ node cas.js 78 | $ node check.js 79 | $ # and/or 80 | $ node layer-add.js 81 | $ node layer-check.js 82 | ``` 83 | 84 | `add.js` and `layer-add.js` will add elements to a filter named test and then check that the elements are part of the filter. 85 | 86 | `check.js` and `layer-check.js` will test random elements against the filter built by `add.js` or `layer-add.js` to find the probability of false positives. 87 | 88 | These scripts assume Redis is running on the default port. 89 | 90 | 91 | Benchmark 92 | --------- 93 | 94 | You can run `./benchmark.sh` and `./layer-benchmark.sh` to see how fast the scripts are. 95 | 96 | These scripts assume Redis is running on the default port and that `redis-cli` and `redis-benchmark` are installed. 
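If you want to exercise a single call by hand instead of through the benchmark scripts, loading a script once and then invoking it by its SHA1 looks roughly like this (a sketch, assuming Redis is listening on the default port and the commands are run from the repository root):

```
# Load the script once; SCRIPT LOAD replies with the SHA1 of the cached script.
sha=$(redis-cli SCRIPT LOAD "$(cat add.lua)")

# Call it by hash: 0 keys, then base name, initial size, false positive probability, element.
redis-cli EVALSHA "$sha" 0 test 10000 0.01 something
```

Swap `add.lua` for `check.lua` (or the layer variants) to call the other scripts the same way.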
97 | 98 | This is the outputs on my 2.3GHz 2012 MacBook Pro Retina: 99 | ``` 100 | add.lua 101 | ====== evalsha ab31647b3931a68b3b93a7354a297ed273349d39 0 HSwVBmHECt 1000000 0.01 :rand:000000000000 ====== 102 | 200000 requests completed in 8.27 seconds 103 | 20 parallel clients 104 | 3 bytes payload 105 | keep alive: 1 106 | 107 | 94.57% <= 1 milliseconds 108 | 100.00% <= 2 milliseconds 109 | 24175.03 requests per second 110 | 111 | 112 | check.lua 113 | ====== evalsha 437a3b0c6a452b5f7a1f10487974c002d41f4a04 0 HSwVBmHECt 1000000 0.01 :rand:000000000000 ====== 114 | 200000 requests completed in 8.54 seconds 115 | 20 parallel clients 116 | 3 bytes payload 117 | keep alive: 1 118 | 119 | 92.52% <= 1 milliseconds 120 | 100.00% <= 8 milliseconds 121 | 23419.20 requests per second 122 | 123 | 124 | layer-add.lua 125 | ====== evalsha 7ae29948e3096dd064c22fcd8b628a5c77394b0c 0 ooPb5enskU 1000000 0.01 :rand:000000000000 ====== 126 | 20000 requests completed in 12.61 seconds 127 | 20 parallel clients 128 | 3 bytes payload 129 | keep alive: 1 130 | 131 | 55.53% <= 12 milliseconds 132 | 75.42% <= 13 milliseconds 133 | 83.71% <= 14 milliseconds 134 | 91.48% <= 15 milliseconds 135 | 97.76% <= 16 milliseconds 136 | 99.90% <= 24 milliseconds 137 | 100.00% <= 24 milliseconds 138 | 1586.04 requests per second 139 | 140 | 141 | layer-check.lua 142 | ====== evalsha c1386438944daedfc4b5c06f79eadb6a83d4b4ea 0 ooPb5enskU 1000000 0.01 :rand:000000000000 ====== 143 | 20000 requests completed in 11.13 seconds 144 | 20 parallel clients 145 | 3 bytes payload 146 | keep alive: 1 147 | 148 | 0.00% <= 9 milliseconds 149 | 74.12% <= 11 milliseconds 150 | 80.43% <= 12 milliseconds 151 | 83.93% <= 13 milliseconds 152 | 97.43% <= 14 milliseconds 153 | 99.89% <= 15 milliseconds 154 | 100.00% <= 15 milliseconds 155 | 1797.59 requests per second 156 | ``` 157 | 158 | -------------------------------------------------------------------------------- /add.js: -------------------------------------------------------------------------------- 1 | 2 | var fs = require('fs'); 3 | 4 | var redis = require('redis'); 5 | var srand = require('srand'); 6 | 7 | 8 | var client = redis.createClient(6379, '127.0.0.1'); 9 | 10 | var addsource = fs.readFileSync('add.lua', 'ascii'); 11 | var checksource = fs.readFileSync('check.lua', 'ascii'); 12 | 13 | var entries = process.argv[2] || 10000; 14 | var precision = process.argv[3] || 0.01; 15 | 16 | var addsha = ''; 17 | var checksha = ''; 18 | 19 | var start; 20 | 21 | var count = process.argv[4] || 100000; 22 | var added = []; 23 | 24 | 25 | console.log('entries = ' + entries); 26 | console.log('precision = ' + (precision * 100) + '%'); 27 | console.log('count = ' + count); 28 | 29 | 30 | srand.seed(1); 31 | 32 | 33 | function randomstring(length) { 34 | var dict = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ`1234567890-=~!@#$%^&*()_+[]{};:",./<>?'; 35 | var r = ''; 36 | for (var i = 0; i < length; i++) { 37 | r += dict[Math.floor(srand.random() * dict.length)]; 38 | } 39 | return r; 40 | } 41 | 42 | 43 | function check(n) { 44 | if (n == count) { 45 | var sec = count / ((Date.now() - start) / 1000); 46 | console.log(sec + ' per second'); 47 | 48 | console.log('done.'); 49 | process.exit(); 50 | return; 51 | } 52 | 53 | client.evalsha(checksha, 0, 'test', entries, precision, added[n], function(err, found) { 54 | if (err) { 55 | throw err; 56 | } 57 | 58 | if (!found) { 59 | console.log(added[n] + ' was not found!'); 60 | } 61 | 62 | check(n + 1); 63 | }); 64 | } 65 | 66 | 67 | function 
add(n) { 68 | if (n == count) { 69 | var sec = count / ((Date.now() - start) / 1000); 70 | console.log(sec + ' per second'); 71 | 72 | console.log('checking...'); 73 | 74 | start = Date.now(); 75 | 76 | check(0); 77 | return; 78 | } 79 | 80 | var id = randomstring(10); 81 | 82 | added.push(id); 83 | 84 | client.evalsha(addsha, 0, 'test', entries, precision, id, function(err) { 85 | if (err) { 86 | throw err; 87 | } 88 | 89 | add(n + 1); 90 | }); 91 | } 92 | 93 | 94 | function load() { 95 | client.send_command('script', ['load', addsource], function(err, sha) { 96 | if (err) { 97 | throw err; 98 | } 99 | 100 | addsha = sha; 101 | console.log('adding add function... ' + addsha); 102 | 103 | 104 | client.send_command('script', ['load', checksource], function(err, sha) { 105 | if (err) { 106 | throw err; 107 | } 108 | 109 | checksha = sha; 110 | 111 | console.log('adding check function... ' + checksha); 112 | 113 | start = Date.now(); 114 | 115 | add(0); 116 | }); 117 | }); 118 | } 119 | 120 | 121 | client.keys('test:*', function(err, keys) { 122 | if (err) { 123 | throw err; 124 | } 125 | 126 | console.log('clearing...'); 127 | 128 | function clear(i) { 129 | if (i == keys.length) { 130 | load(); 131 | return; 132 | } 133 | 134 | client.del(keys[i], function(err) { 135 | if (err) { 136 | throw err; 137 | } 138 | 139 | clear(i + 1); 140 | }); 141 | } 142 | 143 | clear(0); 144 | }); 145 | 146 | -------------------------------------------------------------------------------- /add.lua: -------------------------------------------------------------------------------- 1 | 2 | local entries = ARGV[2] 3 | local precision = ARGV[3] 4 | local hash = redis.sha1hex(ARGV[4]) 5 | local countkey = ARGV[1] .. ':count' 6 | local count = redis.call('GET', countkey) 7 | if not count then 8 | count = 1 9 | else 10 | count = count + 1 11 | end 12 | 13 | local factor = math.ceil((entries + count) / entries) 14 | -- 0.69314718055995 = ln(2) 15 | local index = math.ceil(math.log(factor) / 0.69314718055995) 16 | local scale = math.pow(2, index - 1) * entries 17 | local key = ARGV[1] .. ':' .. index 18 | 19 | -- Based on the math from: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives 20 | -- Combined with: http://www.sciencedirect.com/science/article/pii/S0020019006003127 21 | -- 0.4804530139182 = ln(2)^2 22 | local bits = math.floor(-(scale * math.log(precision * math.pow(0.5, index))) / 0.4804530139182) 23 | 24 | -- 0.69314718055995 = ln(2) 25 | local k = math.floor(0.69314718055995 * bits / scale) 26 | 27 | -- This uses a variation on: 28 | -- 'Less Hashing, Same Performance: Building a Better Bloom Filter' 29 | -- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf 30 | local h = { } 31 | h[0] = tonumber(string.sub(hash, 1 , 8 ), 16) 32 | h[1] = tonumber(string.sub(hash, 9 , 16), 16) 33 | h[2] = tonumber(string.sub(hash, 17, 24), 16) 34 | h[3] = tonumber(string.sub(hash, 25, 32), 16) 35 | 36 | local found = true 37 | for i=1, k do 38 | if redis.call('SETBIT', key, (h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) % bits, 1) == 0 then 39 | found = false 40 | end 41 | end 42 | 43 | -- We only increment the count key when we actually added the item to the filter. 44 | -- This doesn't mean count is accurate. Since this is a scaling bloom filter 45 | -- it is possible the item was already present in one of the filters in a lower index. 46 | -- If you really want to make sure an items isn't added multile times you 47 | -- can use cas.lua (Check And Set). 
48 | if found == false then 49 | -- INCR is a little bit faster than SET. 50 | redis.call('INCR', countkey) 51 | end 52 | -------------------------------------------------------------------------------- /benchmark.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | PORT="${1:-6379}" 4 | 5 | # Get the SHA1 hash for our scripts by simply generating an error and 6 | # filtering out the hash. 7 | add=$(redis-cli -p "$PORT" --eval add.lua | sed 's/.*f_\([0-9a-z]\{40\}\).*/\1/') 8 | check=$(redis-cli -p "$PORT" --eval check.lua | sed 's/.*f_\([0-9a-z]\{40\}\).*/\1/') 9 | 10 | # Find a free key to use. 11 | # 10 characters should be enough to find a free key quickly. 12 | while true; do 13 | key=$(LC_ALL=C tr -dc "[:alnum:]" < /dev/urandom | head -c 10) 14 | 15 | if [ -z "$(echo "keys $key:*" | redis-cli -p "$PORT" --raw)" ]; then 16 | break 17 | fi 18 | done 19 | 20 | 21 | args="0 $key 1000000 0.01 :rand:000000000000" 22 | iter=200000 23 | 24 | echo add.lua 25 | redis-benchmark -p "$PORT" -c 20 -n $iter -r 2000000000 evalsha "$add" "$args" 26 | 27 | echo check.lua 28 | redis-benchmark -p "$PORT" -c 20 -n $iter -r 2000000000 evalsha "$check" "$args" 29 | 30 | 31 | # Delete all the keys we used. 32 | for k in $(echo "keys $key:*" | redis-cli -p "$PORT" --raw); do 33 | echo "del $k" | redis-cli -p "$PORT" > /dev/null 34 | done 35 | 36 | -------------------------------------------------------------------------------- /cas.js: -------------------------------------------------------------------------------- 1 | 2 | var fs = require('fs'); 3 | 4 | var redis = require('redis'); 5 | var srand = require('srand'); 6 | 7 | 8 | var client = redis.createClient(6379, '127.0.0.1'); 9 | 10 | var cassource = fs.readFileSync('cas.lua', 'ascii'); 11 | var checksource = fs.readFileSync('check.lua', 'ascii'); 12 | 13 | var entries = process.argv[2] || 10000; 14 | var precision = process.argv[3] || 0.01; 15 | 16 | var cassha = ''; 17 | var checksha = ''; 18 | 19 | var start; 20 | 21 | var count = process.argv[4] || 100000; 22 | var added = []; 23 | var found = 0; 24 | 25 | 26 | console.log('entries = ' + entries); 27 | console.log('precision = ' + (precision * 100) + '%'); 28 | console.log('count = ' + count); 29 | 30 | 31 | srand.seed(1); 32 | 33 | 34 | function check(n) { 35 | if (n == count) { 36 | var sec = count / ((Date.now() - start) / 1000); 37 | console.log(sec + ' per second'); 38 | 39 | console.log('done.'); 40 | process.exit(); 41 | return; 42 | } 43 | 44 | client.evalsha(checksha, 0, 'test', entries, precision, added[n], function(err, yes) { 45 | if (err) { 46 | throw err; 47 | } 48 | 49 | if (!yes) { 50 | console.log(added[n] + ' was not found!'); 51 | } 52 | 53 | check(n + 1); 54 | }); 55 | } 56 | 57 | 58 | function cas(n) { 59 | if (n == count) { 60 | var sec = count / ((Date.now() - start) / 1000); 61 | console.log(sec + ' per second'); 62 | 63 | console.log((found / (count / 100)) + '% false positives'); 64 | 65 | console.log('checking...'); 66 | 67 | start = Date.now(); 68 | 69 | check(0); 70 | return; 71 | } 72 | 73 | var id = Math.ceil(srand.random() * 4000000000); 74 | 75 | added.push(id); 76 | 77 | client.evalsha(cassha, 0, 'test', entries, precision, id, function(err, yes) { 78 | if (err) { 79 | throw err; 80 | } 81 | 82 | if (yes) { 83 | ++found; 84 | } 85 | 86 | cas(n + 1); 87 | }); 88 | } 89 | 90 | 91 | function load() { 92 | client.send_command('script', ['load', cassource], function(err, sha) { 93 | if (err) { 94 | throw err; 95 | } 96 
| 97 | cassha = sha; 98 | 99 | console.log('adding cas function... ' + cassha); 100 | 101 | 102 | client.send_command('script', ['load', checksource], function(err, sha) { 103 | if (err) { 104 | throw err; 105 | } 106 | 107 | checksha = sha; 108 | 109 | console.log('adding check function... ' + checksha); 110 | 111 | start = Date.now(); 112 | 113 | cas(0); 114 | }); 115 | }); 116 | } 117 | 118 | 119 | client.keys('test:*', function(err, keys) { 120 | if (err) { 121 | throw err; 122 | } 123 | 124 | console.log('clearing...'); 125 | 126 | function clear(i) { 127 | if (i == keys.length) { 128 | load(); 129 | return; 130 | } 131 | 132 | client.del(keys[i], function(err) { 133 | if (err) { 134 | throw err; 135 | } 136 | 137 | clear(i + 1); 138 | }); 139 | } 140 | 141 | clear(0); 142 | }); 143 | 144 | -------------------------------------------------------------------------------- /cas.lua: -------------------------------------------------------------------------------- 1 | 2 | -- Check And Set 3 | -- Check if the item is already present in one of the layers and 4 | -- only add the item if it wasn't. 5 | -- Returns 1 if the item was already present. 6 | -- 7 | -- If only this script is used to add items to the filter the :count 8 | -- key will accurately indicate the number of unique items added to 9 | -- the filter. 10 | 11 | local entries = ARGV[2] 12 | local precision = ARGV[3] 13 | local hash = redis.sha1hex(ARGV[4]) 14 | local countkey = ARGV[1] .. ':count' 15 | local count = redis.call('GET', countkey) 16 | if not count then 17 | count = 1 18 | else 19 | count = count + 1 20 | end 21 | 22 | local factor = math.ceil((entries + count) / entries) 23 | -- 0.69314718055995 = ln(2) 24 | local index = math.ceil(math.log(factor) / 0.69314718055995) 25 | local scale = math.pow(2, index - 1) * entries 26 | 27 | -- This uses a variation on: 28 | -- 'Less Hashing, Same Performance: Building a Better Bloom Filter' 29 | -- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf 30 | local h = { } 31 | h[0] = tonumber(string.sub(hash, 1 , 8 ), 16) 32 | h[1] = tonumber(string.sub(hash, 9 , 16), 16) 33 | h[2] = tonumber(string.sub(hash, 17, 24), 16) 34 | h[3] = tonumber(string.sub(hash, 25, 32), 16) 35 | 36 | -- Based on the math from: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives 37 | -- Combined with: http://www.sciencedirect.com/science/article/pii/S0020019006003127 38 | -- 0.4804530139182 = ln(2)^2 39 | local maxbits = math.floor((scale * math.log(precision * math.pow(0.5, index))) / -0.4804530139182) 40 | 41 | -- 0.69314718055995 = ln(2) 42 | local maxk = math.floor(0.69314718055995 * maxbits / scale) 43 | local b = { } 44 | 45 | for i=1, maxk do 46 | table.insert(b, h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) 47 | end 48 | 49 | -- Only do this if we have data already. 50 | if index > 1 then 51 | -- The last fiter will be handled below. 52 | for n=1, index-1 do 53 | local key = ARGV[1] .. ':' .. 
n 54 | local scale = math.pow(2, n - 1) * entries 55 | 56 | -- 0.4804530139182 = ln(2)^2 57 | local bits = math.floor((scale * math.log(precision * math.pow(0.5, n))) / -0.4804530139182) 58 | 59 | -- 0.69314718055995 = ln(2) 60 | local k = math.floor(0.69314718055995 * bits / scale) 61 | 62 | local found = true 63 | for i=1, k do 64 | if redis.call('GETBIT', key, b[i] % bits) == 0 then 65 | found = false 66 | break 67 | end 68 | end 69 | 70 | if found then 71 | return 1 72 | end 73 | end 74 | end 75 | 76 | -- For the last filter we do a SETBIT where we check the result value. 77 | local key = ARGV[1] .. ':' .. index 78 | 79 | local found = 1 80 | for i=1, maxk do 81 | if redis.call('SETBIT', key, b[i] % maxbits, 1) == 0 then 82 | found = 0 83 | end 84 | end 85 | 86 | if found == 0 then 87 | -- INCR is a little bit faster than SET. 88 | redis.call('INCR', countkey) 89 | end 90 | 91 | return found 92 | 93 | -------------------------------------------------------------------------------- /check.js: -------------------------------------------------------------------------------- 1 | 2 | var fs = require('fs'); 3 | 4 | var redis = require('redis'); 5 | var srand = require('srand'); 6 | 7 | 8 | var client = redis.createClient(6379, '127.0.0.1'); 9 | 10 | var checksource = fs.readFileSync('check.lua', 'ascii'); 11 | 12 | var entries = process.argv[2] || 10000; 13 | var precision = process.argv[3] || 0.01; 14 | 15 | var checksha = ''; 16 | 17 | var start; 18 | 19 | var count = process.argv[4] || 100000; 20 | var found = 0; 21 | 22 | 23 | console.log('entries = ' + entries); 24 | console.log('precision = ' + (precision * 100) + '%'); 25 | console.log('count = ' + count); 26 | 27 | 28 | srand.seed(2); 29 | 30 | 31 | function check(n) { 32 | if (n == count) { 33 | var sec = count / ((Date.now() - start) / 1000); 34 | console.log(sec + ' per second'); 35 | 36 | console.log((found / (count / 100)) + '% false positives'); 37 | 38 | console.log('done.'); 39 | process.exit(); 40 | return; 41 | } 42 | 43 | var id = Math.ceil(srand.random() * 4000000000); 44 | 45 | client.evalsha(checksha, 0, 'test', entries, precision, id, function(err, yes) { 46 | if (err) { 47 | throw err; 48 | } 49 | 50 | if (yes) { 51 | ++found; 52 | } 53 | 54 | check(n + 1); 55 | }); 56 | } 57 | 58 | 59 | client.send_command('script', ['load', checksource], function(err, sha) { 60 | if (err) { 61 | throw err; 62 | } 63 | 64 | checksha = sha; 65 | 66 | console.log('adding check function... ' + checksha); 67 | 68 | start = Date.now(); 69 | 70 | check(0); 71 | }); 72 | 73 | -------------------------------------------------------------------------------- /check.lua: -------------------------------------------------------------------------------- 1 | 2 | local entries = ARGV[2] 3 | local precision = ARGV[3] 4 | local count = redis.call('GET', ARGV[1] .. 
':count') 5 | 6 | if not count then 7 | return 0 8 | end 9 | 10 | local factor = math.ceil((entries + count) / entries) 11 | -- 0.69314718055995 = ln(2) 12 | local index = math.ceil(math.log(factor) / 0.69314718055995) 13 | local scale = math.pow(2, index - 1) * entries 14 | 15 | local hash = redis.sha1hex(ARGV[4]) 16 | 17 | -- This uses a variation on: 18 | -- 'Less Hashing, Same Performance: Building a Better Bloom Filter' 19 | -- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf 20 | local h = { } 21 | h[0] = tonumber(string.sub(hash, 1 , 8 ), 16) 22 | h[1] = tonumber(string.sub(hash, 9 , 16), 16) 23 | h[2] = tonumber(string.sub(hash, 17, 24), 16) 24 | h[3] = tonumber(string.sub(hash, 25, 32), 16) 25 | 26 | -- Based on the math from: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives 27 | -- Combined with: http://www.sciencedirect.com/science/article/pii/S0020019006003127 28 | -- 0.4804530139182 = ln(2)^2 29 | local maxbits = math.floor((scale * math.log(precision * math.pow(0.5, index))) / -0.4804530139182) 30 | 31 | -- 0.69314718055995 = ln(2) 32 | local maxk = math.floor(0.69314718055995 * maxbits / scale) 33 | local b = { } 34 | 35 | for i=1, maxk do 36 | table.insert(b, h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) 37 | end 38 | 39 | for n=1, index do 40 | local key = ARGV[1] .. ':' .. n 41 | local found = true 42 | local scalen = math.pow(2, n - 1) * entries 43 | 44 | -- 0.4804530139182 = ln(2)^2 45 | local bits = math.floor((scalen * math.log(precision * math.pow(0.5, n))) / -0.4804530139182) 46 | 47 | -- 0.69314718055995 = ln(2) 48 | local k = math.floor(0.69314718055995 * bits / scalen) 49 | 50 | for i=1, k do 51 | if redis.call('GETBIT', key, b[i] % bits) == 0 then 52 | found = false 53 | break 54 | end 55 | end 56 | 57 | if found then 58 | return 1 59 | end 60 | end 61 | 62 | return 0 63 | -------------------------------------------------------------------------------- /layer-add.js: -------------------------------------------------------------------------------- 1 | 2 | var fs = require('fs'); 3 | 4 | var redis = require('redis'); 5 | var srand = require('srand'); 6 | 7 | 8 | var client = redis.createClient(6379, '127.0.0.1'); 9 | 10 | var addsource = fs.readFileSync('layer-add.lua', 'ascii'); 11 | var checksource = fs.readFileSync('layer-check.lua', 'ascii'); 12 | 13 | var entries = process.argv[2] || 10000; 14 | var precision = process.argv[3] || 0.01; 15 | 16 | var addsha = ''; 17 | var checksha = ''; 18 | 19 | var start; 20 | 21 | var count = process.argv[4] || 100000; 22 | var added = []; 23 | var addto = []; 24 | var wrong = 0; 25 | 26 | 27 | console.log('entries = ' + entries); 28 | console.log('precision = ' + (precision * 100) + '%'); 29 | console.log('count = ' + count); 30 | 31 | 32 | srand.seed(1); 33 | 34 | 35 | function check(n) { 36 | if (n == added.length) { 37 | var sec = count / ((Date.now() - start) / 1000); 38 | console.log(sec + ' per second'); 39 | 40 | console.log((wrong / (count / 100)) + '% in a too high layer'); 41 | 42 | console.log('done.'); 43 | process.exit(); 44 | return; 45 | } 46 | 47 | client.evalsha(checksha, 0, 'test', entries, precision, added[n][0], function(err, found) { 48 | if (err) { 49 | throw err; 50 | } 51 | 52 | var layer = added[n][1]; 53 | 54 | if (found != layer) { 55 | // Finding one in a too low layer means it wasn't added to the higher layer! 
56 | if (found < layer) { 57 | console.log(added[n][0] + ' expected in ' + layer + ' found in ' + found + '!'); 58 | } 59 | 60 | ++wrong; 61 | } 62 | 63 | check(n + 1); 64 | }); 65 | } 66 | 67 | 68 | function add(n) { 69 | if (n == count) { 70 | var sec = count / ((Date.now() - start) / 1000); 71 | console.log(sec + ' per second'); 72 | 73 | // This will never print 100% for layer 1 since false positives will 74 | // make some new items be added to higher layers right away. 75 | for (var i = 1; i < addto.length; ++i) { 76 | console.log('layer ' + i + ': ' + addto[i] + ' (' + (addto[i] / (count / 100)) + '%) added'); 77 | } 78 | 79 | console.log('checking...'); 80 | 81 | start = Date.now(); 82 | 83 | check(0); 84 | return; 85 | } 86 | 87 | var i = 0; 88 | 89 | // 30% of the time we add an item we already know, 90 | // pushing it up one layer. 91 | if (added.length > 100 && srand.random() < 0.3) { 92 | i = Math.floor(srand.random()*added.length); 93 | } else { 94 | var id = Math.ceil(srand.random() * 4000000000); 95 | 96 | i = added.push([id, 0]) - 1; 97 | } 98 | 99 | client.evalsha(addsha, 0, 'test', entries, precision, added[i][0], function(err, layer) { 100 | if (err) { 101 | throw err; 102 | } 103 | 104 | if (layer == 0) { 105 | throw new Error('We have run out of layers!'); 106 | } else { 107 | if (addto[layer]) { 108 | addto[layer]++; 109 | } else { 110 | addto[layer] = 1; 111 | } 112 | } 113 | 114 | added[i][1]++; 115 | 116 | add(n + 1); 117 | }); 118 | } 119 | 120 | 121 | function load() { 122 | client.send_command('script', ['load', addsource], function(err, sha) { 123 | if (err) { 124 | throw err; 125 | } 126 | 127 | addsha = sha; 128 | console.log('adding add function... ' + addsha); 129 | 130 | client.send_command('script', ['load', checksource], function(err, sha) { 131 | if (err) { 132 | throw err; 133 | } 134 | 135 | checksha = sha; 136 | 137 | console.log('adding check function... ' + checksha); 138 | 139 | start = Date.now(); 140 | 141 | add(0); 142 | }); 143 | }); 144 | } 145 | 146 | 147 | client.keys('test:*', function(err, keys) { 148 | if (err) { 149 | throw err; 150 | } 151 | 152 | console.log('clearing...'); 153 | 154 | function clear(i) { 155 | if (i == keys.length) { 156 | load(); 157 | return; 158 | } 159 | 160 | client.del(keys[i], function(err) { 161 | if (err) { 162 | throw err; 163 | } 164 | 165 | clear(i + 1); 166 | }); 167 | } 168 | 169 | clear(0); 170 | }); 171 | 172 | -------------------------------------------------------------------------------- /layer-add.lua: -------------------------------------------------------------------------------- 1 | 2 | local entries = ARGV[2] 3 | local precision = ARGV[3] 4 | local hash = redis.sha1hex(ARGV[4]) 5 | 6 | -- This uses a variation on: 7 | -- 'Less Hashing, Same Performance: Building a Better Bloom Filter' 8 | -- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf 9 | local h = { } 10 | h[0] = tonumber(string.sub(hash, 1 , 8 ), 16) 11 | h[1] = tonumber(string.sub(hash, 9 , 16), 16) 12 | h[2] = tonumber(string.sub(hash, 17, 24), 16) 13 | h[3] = tonumber(string.sub(hash, 25, 32), 16) 14 | 15 | for layer=1,32 do 16 | local key = ARGV[1] .. ':' .. layer .. ':' 17 | local countkey = key .. 
'count' 18 | local count = redis.call('GET', countkey) 19 | if not count then 20 | count = 1 21 | else 22 | count = count + 1 23 | end 24 | local factor = math.ceil((entries + count) / entries) 25 | -- 0.69314718055995 = ln(2) 26 | local index = math.ceil(math.log(factor) / 0.69314718055995) 27 | local scale = math.pow(2, index - 1) * entries 28 | 29 | -- Based on the math from: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives 30 | -- Combined with: http://www.sciencedirect.com/science/article/pii/S0020019006003127 31 | -- 0.4804530139182 = ln(2)^2 32 | local maxbits = math.floor((scale * math.log(precision * math.pow(0.5, index))) / -0.4804530139182) 33 | 34 | -- 0.69314718055995 = ln(2) 35 | local maxk = math.floor(0.69314718055995 * maxbits / scale) 36 | local b = { } 37 | 38 | for i=1, maxk do 39 | table.insert(b, h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) 40 | end 41 | 42 | local inlayer = false 43 | 44 | -- Only do this if we have data already. 45 | if index > 1 then 46 | -- The last fiter will be handled below. 47 | for n=1, index-1 do 48 | local keyn = key .. n 49 | local scalen = math.pow(2, n - 1) * entries 50 | 51 | -- 0.4804530139182 = ln(2)^2 52 | local bits = math.floor((scalen * math.log(precision * math.pow(0.5, n))) / -0.4804530139182) 53 | 54 | -- 0.69314718055995 = ln(2) 55 | local k = math.floor(0.69314718055995 * bits / scalen) 56 | 57 | local found = true 58 | for i=1, k do 59 | if redis.call('GETBIT', keyn, b[i] % bits) == 0 then 60 | found = false 61 | break 62 | end 63 | end 64 | 65 | if found then 66 | inlayer = true 67 | break 68 | end 69 | end 70 | end 71 | 72 | if inlayer == false then 73 | key = key .. index 74 | 75 | local found = true 76 | for i=1, maxk do 77 | if redis.call('SETBIT', key, (h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) % maxbits, 1) == 0 then 78 | found = false 79 | end 80 | end 81 | 82 | -- If it wasn't found in this layer break 83 | if found == false then 84 | -- INCR is a little bit faster than SET. 85 | redis.call('INCR', countkey) 86 | return layer 87 | end 88 | end 89 | end 90 | 91 | -- We only reach this is we ran out of layers 92 | return 0 93 | -------------------------------------------------------------------------------- /layer-benchmark.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | PORT="${1:-6379}" 4 | 5 | # Get the SHA1 hash for our scripts by simply generating an error and 6 | # filtering out the hash. 7 | add=$(redis-cli -p "$PORT" --eval layer-add.lua | sed 's/.*f_\([0-9a-z]\{40\}\).*/\1/') 8 | check=$(redis-cli -p "$PORT" --eval layer-check.lua | sed 's/.*f_\([0-9a-z]\{40\}\).*/\1/') 9 | 10 | # Find a free key to use. 11 | # 10 characters should be enough to find a free key quickly. 12 | while true; do 13 | key=$(LC_ALL=C tr -dc "[:alnum:]" < /dev/urandom | head -c 10) 14 | 15 | if [ -z "$(echo "keys $key:*" | redis-cli -p "$PORT" --raw)" ]; then 16 | break 17 | fi 18 | done 19 | 20 | 21 | args="0 $key 1000000 0.01 :rand:000000000000" 22 | iter=20000 23 | 24 | echo layer-add.lua 25 | redis-benchmark -p "$PORT" -c 20 -n $iter -r 2000000000 evalsha "$add" "$args" 26 | 27 | echo layer-check.lua 28 | redis-benchmark -p "$PORT" -c 20 -n $iter -r 2000000000 evalsha "$check" "$args" 29 | 30 | 31 | # Delete all the keys we used. 
32 | for k in $(echo "keys $key:*" | redis-cli -p "$PORT" --raw); do 33 | echo "del $k" | redis-cli -p "$PORT" > /dev/null 34 | done 35 | 36 | -------------------------------------------------------------------------------- /layer-check.js: -------------------------------------------------------------------------------- 1 | 2 | var fs = require('fs'); 3 | 4 | var redis = require('redis'); 5 | var srand = require('srand'); 6 | 7 | 8 | var client = redis.createClient(6379, '127.0.0.1'); 9 | 10 | var checksource = fs.readFileSync('layer-check.lua', 'ascii'); 11 | 12 | var entries = process.argv[2] || 10000; 13 | var precision = process.argv[3] || 0.01; 14 | 15 | var checksha = ''; 16 | 17 | var start; 18 | 19 | var count = process.argv[4] || 100000; 20 | var found = []; 21 | var added = 0; 22 | 23 | 24 | console.log('entries = ' + entries); 25 | console.log('precision = ' + (precision * 100) + '%'); 26 | console.log('count = ' + count); 27 | 28 | 29 | srand.seed(2); 30 | 31 | 32 | function check(n) { 33 | if (n == count) { 34 | var sec = count / ((Date.now() - start) / 1000); 35 | console.log(sec + ' per second'); 36 | 37 | var total = 0; 38 | 39 | for (var i = 1; i < found.length; ++i) { 40 | total += found[i]; 41 | 42 | console.log('layer ' + i + ': ' + (found[i] / (count / 100)) + '% false positives'); 43 | } 44 | 45 | console.log((total / (count / 100)) + '% false positives total'); 46 | 47 | console.log('done.'); 48 | process.exit(); 49 | return; 50 | } 51 | 52 | // Mimic the same number of srand.random() calls as layer-add.js 53 | // so we get the same id's if we use the same seed. 54 | while (added > 100 && srand.random() < 0.3) { 55 | srand.random(); 56 | ++added; 57 | } 58 | 59 | var id = Math.ceil(srand.random() * 4000000000); 60 | 61 | client.evalsha(checksha, 0, 'test', entries, precision, id, function(err, layer) { 62 | if (err) { 63 | throw err; 64 | } 65 | 66 | if (layer) { 67 | if (found[layer]) { 68 | found[layer]++; 69 | } else { 70 | found[layer] = 1; 71 | } 72 | } 73 | 74 | check(n + 1); 75 | }); 76 | } 77 | 78 | 79 | client.send_command('script', ['load', checksource], function(err, sha) { 80 | if (err) { 81 | throw err; 82 | } 83 | 84 | checksha = sha; 85 | 86 | console.log('adding check function... ' + checksha); 87 | 88 | start = Date.now(); 89 | 90 | check(0); 91 | }); 92 | 93 | -------------------------------------------------------------------------------- /layer-check.lua: -------------------------------------------------------------------------------- 1 | 2 | local entries = ARGV[2] 3 | local precision = ARGV[3] 4 | local hash = redis.sha1hex(ARGV[4]) 5 | 6 | -- This uses a variation on: 7 | -- 'Less Hashing, Same Performance: Building a Better Bloom Filter' 8 | -- https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf 9 | local h = { } 10 | h[0] = tonumber(string.sub(hash, 1 , 8 ), 16) 11 | h[1] = tonumber(string.sub(hash, 9 , 16), 16) 12 | h[2] = tonumber(string.sub(hash, 17, 24), 16) 13 | h[3] = tonumber(string.sub(hash, 25, 32), 16) 14 | 15 | for layer=1,32 do 16 | local key = ARGV[1] .. ':' .. layer .. ':' 17 | local count = redis.call('GET', key .. 
'count') 18 | 19 | if not count then 20 | return layer - 1 21 | end 22 | 23 | local factor = math.ceil((entries + count) / entries) 24 | -- 0.69314718055995 = ln(2) 25 | local index = math.ceil(math.log(factor) / 0.69314718055995) 26 | local scale = math.pow(2, index - 1) * entries 27 | 28 | -- Based on the math from: http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives 29 | -- Combined with: http://www.sciencedirect.com/science/article/pii/S0020019006003127 30 | -- 0.4804530139182 = ln(2)^2 31 | local maxbits = math.floor((scale * math.log(precision * math.pow(0.5, index))) / -0.4804530139182) 32 | 33 | -- 0.69314718055995 = ln(2) 34 | local maxk = math.floor(0.69314718055995 * maxbits / scale) 35 | local b = { } 36 | 37 | for i=1, maxk do 38 | table.insert(b, h[i % 2] + i * h[2 + (((i + (i % 2)) % 4) / 2)]) 39 | end 40 | 41 | local inlayer = false 42 | 43 | for n=1, index do 44 | local keyn = key .. n 45 | local found = true 46 | local scalen = math.pow(2, n - 1) * entries 47 | 48 | -- 0.4804530139182 = ln(2)^2 49 | local bits = math.floor((scalen * math.log(precision * math.pow(0.5, n))) / -0.4804530139182) 50 | 51 | -- 0.69314718055995 = ln(2) 52 | local k = math.floor(0.69314718055995 * bits / scalen) 53 | 54 | for i=1, k do 55 | if redis.call('GETBIT', keyn, b[i] % bits) == 0 then 56 | found = false 57 | break 58 | end 59 | end 60 | 61 | if found then 62 | inlayer = true 63 | break 64 | end 65 | end 66 | 67 | if inlayer == false then 68 | return layer - 1 69 | end 70 | end 71 | 72 | return 32 73 | 74 | -------------------------------------------------------------------------------- /package.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | ### 4 | ### Note, if you run this on OSX El Capitan, you may run into an issue with FPM 5 | ### not working with the error message: 6 | ### Could not open library 'libc.dylib': dlopen(libc.dylib, 5): image not found (LoadError) 7 | ### If so, follow these steps: 8 | ### https://github.com/jordansissel/fpm/issues/1010#issuecomment-193217675 9 | 10 | #set -xe 11 | 12 | ### The dir for the package script 13 | MY_DIR=$( dirname $0 ) 14 | cd $MY_DIR 15 | 16 | ### Build debian pckages by default, but any other type will do if FPM understands it. 17 | TYPE=${1:-deb} 18 | 19 | echo "Building $TYPE package" 20 | 21 | ### Throw away any old packages 22 | rm -f *.$TYPE 23 | 24 | ### Name of the package, project, etc 25 | NAME=redis-lua-scaling-bloom-filter 26 | 27 | _GIT_VERSION=`git tag -l | tail -n 1` 28 | VERSION=${_GIT_VERSION:-1} 29 | PACKAGE_VERSION=$VERSION~$( date -u +%Y%m%d%H%M ) 30 | PACKAGE_NAME=$NAME 31 | 32 | ### List of files to package 33 | FILES="*.lua *.js *.md" 34 | 35 | ### Where this package will be installed 36 | DEST_DIR="/usr/local/${NAME}/" 37 | 38 | ### Where the sources live 39 | SOURCE_DIR=$MY_DIR 40 | 41 | fpm -s dir -t $TYPE -a all -n $PACKAGE_NAME -v $PACKAGE_VERSION --prefix $DEST_DIR -C $SOURCE_DIR $FILES 42 | --------------------------------------------------------------------------------