├── .coveragerc
├── .gitignore
├── LICENSE
├── README.md
├── cuckoofilter
    ├── __init__.py
    ├── bucket.py
    ├── cuckoofilter.py
    └── tests
    │   ├── __init__.py
    │   ├── test_bucket.py
    │   └── test_filter.py
├── example.py
└── requirements.txt

/.coveragerc:
--------------------------------------------------------------------------------
[run]
branch = true
omit = */tests/*
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
__pycache__
*.py[co]
*env/
.cache/
.coverage
htmlcov/
/*.egg-info
/build
/dist
/.eggs
monkeytype.sqlite3
/.ipynb_checkpoints
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2016 Michael The

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# python-cuckoo
`python-cuckoo` is an implementation of a Cuckoo filter in Python 3.

Cuckoo filters serve as a drop-in replacement for Bloom filters.
Just like Bloom filters, we can add items to the filter and then ask the filter whether that item is in the filter or not.
Just like Bloom filters, there is a (very small and tunable) chance this will return a false positive.
Unlike regular Bloom filters, Cuckoo filters also support deletion of items.
Cuckoo filters use less space than Bloom filters for low false positive rates.

# Usage

```python
>>> import cuckoofilter
>>> cf = cuckoofilter.CuckooFilter(capacity=100000, fingerprint_size=1)

>>> cf.insert('Bin Fan')
66349

>>> cf.contains('Bin Fan')
True

>>> cf.contains('Michael The')
False

>>> cf.delete('Bin Fan')
True

>>> cf.contains('Bin Fan')
False
```

# Why use Cuckoo filters
The short answer: if you don't know whether you should use a Bloom filter or a Cuckoo filter, use a Cuckoo filter.
For a well-written and visual explanation, check out [Probabilistic Filters By Example](https://bdupras.github.io/filter-tutorial/).

Cuckoo filters are very similar to Bloom filters.
They are both great at reducing disk queries.
Some example usages of Bloom filters:
- Checking whether a URL is malicious or not ([Google Chrome](http://blog.alexyakunin.com/2010/03/nice-bloom-filter-application.html))
- Deciding whether an item should be cached or not ([Akamai](http://dl.acm.org/citation.cfm?doid=2805789.2805800))
- Keeping track of what articles a user has already read ([Medium](https://medium.com/blog/what-are-bloom-filters-1ec2a50c68ff#.xlkqtn1vy))

# Why not use Cuckoo filters
As a Cuckoo filter is filled up, insertion will become slower as more items need to be "kicked" around.
If your application is sensitive to timing on insertion, choose a different data structure.

Cuckoo filters might reject insertions.
This occurs when the filter is about to reach full capacity or a fingerprint is inserted more than 2b times, where b is the bucket size.
This limitation is also present in Counting Bloom filters.
If this limitation is unacceptable for your application, use a different data structure.

# Testing & profiling
Python-cuckoo comes with a test suite (```cuckoofilter/tests/```).
We recommend using [```py.test```](http://pytest.org/) to run unit tests.

```
pip install pytest
pytest cuckoofilter/
```

To generate a test coverage report, install the pytest coverage plugin and generate an HTML report.

```
pip install pytest-cov
pytest --cov-report html cuckoofilter/
```

The report will be created in a folder called ```htmlcov```.
Open ```htmlcov/index.html``` to inspect what parts of the library are tested.

To find out what parts of the library are slow, we need to profile our library.
To do this, we can use ```cProfile``` by running ```python -m cProfile example.py```.
For a visualization of this profiling information, we can use [snakeviz](https://jiffyclub.github.io/snakeviz/).
```
pip install snakeviz
python -m cProfile -o out.profile example.py
snakeviz out.profile
```

# Original paper

Cuckoo filters were first described in:

>Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December).
>Cuckoo filter: Practically better than bloom.
>In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM.

Their reference implementation in C++ can be found on [github](https://github.com/efficient/cuckoofilter).

## See also

- [Probabilistic Filters By Example](https://bdupras.github.io/filter-tutorial/) for a well-written and visual explanation of Cuckoo filters vs. Bloom filters.
- [Cuckoo Filters](http://mybiasedcoin.blogspot.nl/2014/10/cuckoo-filters.html), a blog post by M. Mitzenmacher, the fourth author of the original paper.
- [Cuckoo Filter implementation in Java](https://github.com/bdupras/guava-probably)
- [Cuckoo Filter implementation in Go](https://github.com/irfansharif/cfilter)
--------------------------------------------------------------------------------
/cuckoofilter/__init__.py:
--------------------------------------------------------------------------------
from .cuckoofilter import CuckooFilter
--------------------------------------------------------------------------------
/cuckoofilter/bucket.py:
--------------------------------------------------------------------------------
import random


class Bucket:
    '''Bucket class for storing fingerprints.'''

    def __init__(self, size=4):
        '''
        Initialize bucket.

        size : the maximum nr. of fingerprints the bucket can store
            Default size is 4, which closely approaches the best size for FPP between 0.00001 and 0.002 (see Fan et al.).
            If your targeted FPP is greater than 0.002, a bucket size of 2 is more space efficient.
        '''
        self.size = size
        self.b = []

    def insert(self, fingerprint):
        '''
        Insert a fingerprint into the bucket.
        The insertion of duplicate entries is allowed.
        '''
        if not self.is_full():
            self.b.append(fingerprint)
            return True
        return False

    def contains(self, fingerprint):
        return fingerprint in self.b

    def delete(self, fingerprint):
        '''
        Delete a fingerprint from the bucket.

        Returns True if the fingerprint was present in the bucket.
        This is useful for keeping track of how many items are present in the filter.
        '''
        try:
            del self.b[self.b.index(fingerprint)]
            return True
        except ValueError:
            # This error is explicitly silenced.
            # It simply means the fingerprint was never present in the bucket.
            return False

    def swap(self, fingerprint):
        '''
        Swap a fingerprint with a randomly chosen fingerprint from the bucket.

        The given fingerprint is stored in the bucket.
        The swapped fingerprint is returned.
        '''
        bucket_index = random.choice(range(len(self.b)))
        fingerprint, self.b[bucket_index] = self.b[bucket_index], fingerprint
        return fingerprint

    def is_full(self):
        return len(self.b) >= self.size

    def __contains__(self, fingerprint):
        return self.contains(fingerprint)

    def __repr__(self):
        return ''

    def __sizeof__(self):
        return super().__sizeof__() + self.b.__sizeof__()
--------------------------------------------------------------------------------
/cuckoofilter/cuckoofilter.py:
--------------------------------------------------------------------------------
import mmh3
import random

from . import bucket


class CuckooFilter:
    '''
    A Cuckoo filter is a data structure for probabilistic set-membership queries.
    We can insert items into the filter and, with a very low false positive probability (FPP), ask whether it contains an item or not.
    Cuckoo filters serve as a drop-in replacement for Bloom filters, but are more space-efficient and support deletion of items.

    Cuckoo filters were originally described in:
        Fan, B., Andersen, D. G., Kaminsky, M., & Mitzenmacher, M. D. (2014, December).
        Cuckoo filter: Practically better than bloom.
        In Proceedings of the 10th ACM International on Conference on emerging Networking Experiments and Technologies (pp. 75-88). ACM.

    Their reference implementation in C++ can be found here: https://github.com/efficient/cuckoofilter
    '''

    def __init__(self, capacity, fingerprint_size, bucket_size=4, max_kicks=500):
        '''
        Initialize Cuckoo filter parameters.

        capacity : size of the filter
            Defines how many buckets the filter contains.
        fingerprint_size: size of the fingerprint in bytes
            A larger fingerprint size results in a lower FPP.
        bucket_size : nr. of entries a bucket can hold
            A bucket can hold multiple entries.
            Default size is 4, which closely approaches the best size for FPP between 0.00001 and 0.002 (see Fan et al.).
            If your targeted FPP is greater than 0.002, a bucket size of 2 is more space efficient.
        max_kicks : nr. of times entries are kicked around before deciding the filter is full
            Defaults to 500. This is an arbitrary number also used by Fan et al. and seems reasonable enough.
        '''
        self.capacity = capacity
        self.fingerprint_size = fingerprint_size
        self.max_kicks = max_kicks
        self.buckets = [bucket.Bucket(size=bucket_size) for _ in range(self.capacity)]
        self.size = 0
        self.bucket_size = bucket_size

    def insert(self, item):
        '''
        Inserts a string into the filter.

        Throws an exception if the insertion fails.
        '''
        self.size = self.size + 1
        fingerprint = self.fingerprint(item)
        i1, i2 = self.calculate_index_pair(item, fingerprint)

        if self.buckets[i1].insert(fingerprint):
            return i1
        elif self.buckets[i2].insert(fingerprint):
            return i2

        i = random.choice((i1, i2))
        for kick_count in range(self.max_kicks):
            fingerprint = self.buckets[i].swap(fingerprint)
            i = (i ^ self.index_hash(fingerprint)) % self.capacity

            if self.buckets[i].insert(fingerprint):
                return i

        self.size = self.size - 1
        raise Exception('Filter is full')

    def contains(self, item):
        '''Checks if a string was inserted into the filter.'''
        fingerprint = self.fingerprint(item)
        i1, i2 = self.calculate_index_pair(item, fingerprint)
        return (fingerprint in self.buckets[i1]) or (fingerprint in self.buckets[i2])

    def delete(self, item):
        '''Removes a string from the filter.'''
        fingerprint = self.fingerprint(item)
        i1, i2 = self.calculate_index_pair(item, fingerprint)
        if self.buckets[i1].delete(fingerprint) or self.buckets[i2].delete(fingerprint):
            self.size = self.size - 1
            return True
        return False

    def index_hash(self, item):
        '''Calculate the (first) index of an item in the filter.'''
        item_hash = mmh3.hash_bytes(item)
        index = int.from_bytes(item_hash, byteorder='big') % self.capacity
        return index

    def calculate_index_pair(self, item, fingerprint):
        '''Calculate both possible indices for the item'''
        i1 = self.index_hash(item)
        i2 = (i1 ^ self.index_hash(fingerprint)) % self.capacity
        return i1, i2

    def fingerprint(self, item):
        '''
        Takes a string and returns its fingerprint as bytes.

        The length of the fingerprint (in bytes) is given by fingerprint_size.
        To calculate this fingerprint, we hash the string with MurmurHash3 and truncate the hash.
        '''
        item_hash = mmh3.hash_bytes(item)
        return item_hash[:self.fingerprint_size]

    def load_factor(self):
        return self.size / (self.capacity * self.bucket_size)

    def __contains__(self, item):
        return self.contains(item)

    def __repr__(self):
        return ''

    def __sizeof__(self):
        return super().__sizeof__() + sum(b.__sizeof__() for b in self.buckets)
--------------------------------------------------------------------------------
/cuckoofilter/tests/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/michael-the1/python-cuckoo/ec1a6381ec68d6106bcf7b549834d56982128865/cuckoofilter/tests/__init__.py
--------------------------------------------------------------------------------
/cuckoofilter/tests/test_bucket.py:
--------------------------------------------------------------------------------
import pytest
from cuckoofilter import bucket as b

@pytest.fixture
def bucket():
    return b.Bucket()

def test_initialization(bucket):
    assert bucket.size == 4
    assert bucket.b == []

def test_insert(bucket):
    assert bucket.insert('hello')

def test_insert_full(bucket):
    for i in range(bucket.size):
        bucket.insert('a')
    assert not bucket.insert('a')

def test_contains(bucket):
    bucket.insert('hello')
    assert bucket.contains('hello')

def test_delete(bucket):
    bucket.insert('hello')
    assert bucket.delete('hello')
    assert not bucket.contains('hello')

def test_delete_non_existing_fingerprint(bucket):
    assert not bucket.delete('hello')

def test_swap(bucket):
    bucket.insert('hello')
    swapped_fingerprint = bucket.swap('world')
    assert swapped_fingerprint == 'hello'
    assert bucket.contains('world')

def test_is_full(bucket):
    for i in range(4):
        bucket.insert(i)
    assert bucket.is_full()
--------------------------------------------------------------------------------
/cuckoofilter/tests/test_filter.py:
--------------------------------------------------------------------------------
import pytest
import cuckoofilter

@pytest.fixture
def cf():
    return cuckoofilter.CuckooFilter(1000, 4)

def test_insert(cf):
    assert cf.insert('hello')
    assert cf.size == 1

def test_insert_second_position(cf):
    for _ in range(cf.bucket_size - 1):
        cf.insert('hello')
    i1 = cf.insert('hello')
    i2 = cf.insert('hello')
    assert i1 != i2

def test_insert_full(cf):
    # A cuckoofilter can hold at most 2 * bucket_size of the same fingerprint
    for _ in range(cf.bucket_size * 2):
        cf.insert('hello')

    with pytest.raises(Exception) as e:
        cf.insert('hello')

    assert str(e.value) == 'Filter is full'
    assert cf.size == (cf.bucket_size * 2)

def test_insert_over_capacitiy(cf):
    with pytest.raises(Exception) as e:
        for i in range((cf.capacity * cf.bucket_size) + 1):
            cf.insert(str(i))
    assert str(e.value) == 'Filter is full'
    assert cf.load_factor() > 0.9

def test_contains(cf):
    cf.insert('hello')
    assert cf.contains('hello'), 'Key was not inserted'

def test_contains_builtin(cf):
    cf.insert('hello')
    assert 'hello' in cf

def test_delete(cf):
    cf.insert('hello')
    assert cf.delete('hello')
    assert not cf.contains('hello')
    assert cf.size == 0

def test_delete_second_bucket(cf):
    for _ in range(cf.bucket_size + 1):
        cf.insert('hello')
    for _ in range(cf.bucket_size + 1):
        cf.delete('hello')
    assert cf.size == 0

def test_delete_non_existing(cf):
    assert not cf.delete('hello')

def test_load_factor_empty(cf):
    assert cf.load_factor() == 0

def test_load_factor_non_empty(cf):
    cf.insert('hello')
    assert cf.load_factor() == (1 / (cf.capacity * cf.bucket_size))
--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
'''
Example usage. Modeled after https://github.com/efficient/cuckoofilter/blob/master/example/test.cc
'''

import cuckoofilter

if __name__ == '__main__':
    total_items = 100000
    cf = cuckoofilter.CuckooFilter(total_items, 2)

    num_inserted = 0
    for i in range(total_items):
        cf.insert(str(i))
        num_inserted = num_inserted + 1

    for i in range(num_inserted):
        assert cf.contains(str(i))

    total_queries = 0
    false_queries = 0
    for i in range(total_items, 2 * total_items):
        if cf.contains(str(i)):
            false_queries = false_queries + 1
        total_queries = total_queries + 1

    print('False positive rate is {:%}'.format(false_queries / total_queries))
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
mmh3
--------------------------------------------------------------------------------
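The index arithmetic used by `calculate_index_pair` (partial-key cuckoo hashing) can be looked at in isolation. The following is a minimal sketch and not part of the repository: it uses `hashlib.blake2b` from the standard library as a stand-in for MurmurHash3, and the helper names `index_pair` and `alternate_index` are hypothetical. It also assumes the number of buckets is a power of two; under that assumption, XOR-ing either candidate index with the fingerprint hash yields the other candidate, which is what lets an entry be moved between its two buckets using only the stored fingerprint.

```python
import hashlib


def _hash_to_int(data: bytes) -> int:
    """Hash bytes and read the digest as an integer (stand-in for MurmurHash3)."""
    digest = hashlib.blake2b(data, digest_size=16).digest()
    return int.from_bytes(digest, byteorder='big')


def fingerprint(item: str, fingerprint_size: int) -> bytes:
    """Truncate the item's digest to fingerprint_size bytes."""
    digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
    return digest[:fingerprint_size]


def index_pair(item: str, num_buckets: int, fingerprint_size: int = 2):
    """Return both candidate bucket indices for an item.

    Assumption: num_buckets is a power of two, so the XOR result
    always stays inside [0, num_buckets).
    """
    if num_buckets <= 0 or num_buckets & (num_buckets - 1):
        raise ValueError('num_buckets must be a power of two')
    fp = fingerprint(item, fingerprint_size)
    i1 = _hash_to_int(item.encode()) % num_buckets
    i2 = i1 ^ (_hash_to_int(fp) % num_buckets)
    return i1, i2


def alternate_index(index: int, fp: bytes, num_buckets: int) -> int:
    """Given one candidate index and the stored fingerprint, compute the other."""
    return index ^ (_hash_to_int(fp) % num_buckets)


if __name__ == '__main__':
    num_buckets = 1024
    fp = fingerprint('hello', 2)
    i1, i2 = index_pair('hello', num_buckets)
    # Each candidate index maps back to the other one.
    print(alternate_index(i1, fp, num_buckets) == i2)
    print(alternate_index(i2, fp, num_buckets) == i1)
```

With a bucket count that is not a power of two, reducing the XOR result with a modulo can break this symmetry, so the sketch rejects such sizes rather than guessing at a fallback.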