├── .gitignore
├── LICENSE
├── README.md
├── approximating_sheep_python
    ├── have_i_see_this_one.py
    └── how_many.py
├── approximating_sheep_redis
    ├── have_i_seen_this_one.py
    └── how_many.py
├── counting_sheep_python
    ├── have_i_seen_this_one.py
    └── how_many.py
├── counting_sheep_redis
    ├── have_i_seen_this_one.py
    └── how_many.py
├── counting_things.png
├── docker-compose.yml
└── requirements.txt


/.gitignore:
--------------------------------------------------------------------------------
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2021 Simon Prickett

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# No, Maybe and Close Enough!

Exploring Probabilistic Data Structures in Python - my 2021 PyCon USA and Australia talk. I've updated the code slightly for PyCon MEA Dubai 2022 - it now uses a newer redis-py Redis client (pinned to 4.3.4 in `requirements.txt`).

![Title Page](counting_things.png)

If you'd like to see the slides for the 2021 version of this talk, they're [here](https://simonprickett.dev/no_maybe_and_close_enough_slides.pdf) (PDF). Watch the 2021 video [here](https://www.youtube.com/watch?v=hM1JPkEUtks) or [read the transcript on my website](https://simonprickett.dev/no-maybe-and-good-enough-probabilistic-data-structures-in-python/). I gave a shorter version of this talk in person for PyCon MEA Dubai 2022 - [watch that here](https://www.youtube.com/watch?v=tqy8WtjBe1Q).

This repository contains supporting code to run the examples from my talk. The example code uses in-memory probabilistic data structures with the [hyperloglog](https://pypi.org/project/hyperloglog/) and [pyprobables](https://pypi.org/project/pyprobables/) libraries. It also uses [Redis](https://redis.io) with the [RedisBloom](https://redisbloom.io) module: this is provided for you as part of [Redis Stack](https://redis.io/docs/stack/get-started/) in a Docker container.

The two probabilistic data structures examined in this code base are:

* [HyperLogLog](https://en.wikipedia.org/wiki/HyperLogLog) - an algorithm for estimating the cardinality of a set (the count distinct problem)
* [Bloom Filter](https://en.wikipedia.org/wiki/Bloom_filter) - a data structure used to determine whether an element may be a member of a set

## Setup

To run the example code and Redis server, you will need both Python 3 (I've tested this with 3.8.6) and [Docker](https://www.docker.com/).
Once you have these, clone the repo, create a virtual environment, install the dependencies, and start the Docker container in the background:

```bash
$ git clone https://github.com/simonprickett/python-probabilistic-data-structures.git
$ cd python-probabilistic-data-structures
$ python3 -m venv venv
$ . venv/bin/activate
(venv) $ pip install -r requirements.txt
(venv) $ docker compose up -d
```

The Docker container uses port 6379. If you have another process (for example, an existing Redis server instance) listening on that port, stop that process before starting the container.

## Running the Examples

### Counting Sheep

These examples use a Python set, then a Redis set, to count sheep. They give an accurate count, but at the cost of memory usage. Moving the count to a Redis set offloads the memory usage to Redis, and lets multiple instances of the Python code work together to count sheep.

Counting sheep in memory with a Python set:

```bash
(venv) $ cd counting_sheep_python
(venv) $ python how_many.py
There are 6 sheep.
```

Have I seen this sheep? In memory with a Python set:

```bash
(venv) $ python have_i_seen_this_one.py
I have seen sheep 1934.
I have not seen sheep 1283.
```

Counting sheep with a Redis set:

```bash
(venv) $ cd ../counting_sheep_redis
(venv) $ python how_many.py
There are 6 sheep.
```

Have I seen this sheep? With a Redis set:

```bash
(venv) $ python have_i_seen_this_one.py
I have seen sheep 1934.
I have not seen sheep 1283.
```

### Approximating Sheep

These examples use a HyperLogLog to approximate the count of sheep seen, and a Bloom Filter to determine whether or not a particular sheep may have been seen.
They use both in-memory Python implementations of the probabilistic data structures and the Redis equivalents.

Approximate count with a Python in-memory HyperLogLog, compared with the count from an in-memory set:

```bash
(venv) $ cd ../approximating_sheep_python
(venv) $ python how_many.py
There are 100000 sheep (set).
There are 100075 sheep (hyperloglog).
```

Have I (maybe) seen this sheep? With a Python in-memory Bloom Filter:

```bash
(venv) $ python have_i_see_this_one.py
I might have seen sheep 9018.
I have not seen sheep 454991.
```

Approximate count with a Redis HyperLogLog:

```bash
(venv) $ cd ../approximating_sheep_redis
(venv) $ python how_many.py
There are 100000 sheep (set: 4673012).
There are 99565 sheep (hyperloglog: 12366).
```

Have I (maybe) seen this sheep? With a Redis Bloom Filter:

```bash
(venv) $ python have_i_seen_this_one.py
I might have seen sheep 9018.
I have not seen sheep 454991.
```

--------------------------------------------------------------------------------
/approximating_sheep_python/have_i_see_this_one.py:
--------------------------------------------------------------------------------
from probables import BloomFilter

# Bloom filter sized for 200,000 elements with a 1% false positive rate.
sheep_seen_bloom = BloomFilter(
    est_elements=200000, false_positive_rate=0.01
)

# Add sheep IDs "0" to "99999".
for m in range(0, 100000):
    sheep_id = str(m)
    sheep_seen_bloom.add(sheep_id)


def have_i_seen(sheep_id):
    # A positive answer may be a false positive; a negative answer is definite.
    if sheep_seen_bloom.check(sheep_id):
        print(f"I might have seen sheep {sheep_id}.")
    else:
        print(f"I have not seen sheep {sheep_id}.")


have_i_seen("9018")
have_i_seen("454991")

--------------------------------------------------------------------------------
/approximating_sheep_python/how_many.py:
--------------------------------------------------------------------------------
from hyperloglog import HyperLogLog

# Exact count (set) alongside an approximate count (HyperLogLog, 1% error rate).
sheep_seen = set()
sheep_seen_hll = HyperLogLog(0.01)

# Add sheep IDs "0" to "99999" to both structures.
for m in range(0, 100000):
    sheep_id = str(m)
    sheep_seen.add(sheep_id)
    sheep_seen_hll.add(sheep_id)

print(f"There are {len(sheep_seen)} sheep (set).")
print(f"There are {len(sheep_seen_hll)} sheep (hyperloglog).")

--------------------------------------------------------------------------------
/approximating_sheep_redis/have_i_seen_this_one.py:
--------------------------------------------------------------------------------
from redis import Redis

redis_conn = Redis()

SHEEP_BLOOM_KEY = "sheep_seen_bloom"

# Start from a clean key, then create a non-scaling Bloom filter with a
# 0.1% error rate and capacity for 200,000 elements.
redis_conn.delete(SHEEP_BLOOM_KEY)
redis_conn.bf().create(SHEEP_BLOOM_KEY, 0.001, 200000, noScale=True)

# Add sheep IDs "0" to "99999".
for m in range(0, 100000):
    sheep_id = str(m)
    redis_conn.bf().add(SHEEP_BLOOM_KEY, sheep_id)


def have_i_seen(sheep_id):
    # A positive answer may be a false positive; a negative answer is definite.
    if redis_conn.bf().exists(SHEEP_BLOOM_KEY, sheep_id):
        print(f"I might have seen sheep {sheep_id}.")
    else:
        print(f"I have not seen sheep {sheep_id}.")


have_i_seen("9018")
have_i_seen("454991")

--------------------------------------------------------------------------------
/approximating_sheep_redis/how_many.py:
--------------------------------------------------------------------------------
from redis import Redis

redis_conn = Redis()

SHEEP_SET_KEY = "sheep_seen"
SHEEP_HLL_KEY = "sheep_seen_hll"

# Start from clean keys.
redis_conn.delete(SHEEP_SET_KEY)
redis_conn.delete(SHEEP_HLL_KEY)

# Add each sheep ID to both the set and the HyperLogLog in one round trip.
for m in range(0, 100000):
    sheep_id = str(m)
    pipeline = redis_conn.pipeline(transaction=False)
    pipeline.sadd(SHEEP_SET_KEY, sheep_id)
    pipeline.pfadd(SHEEP_HLL_KEY, sheep_id)
    pipeline.execute()

# Print each count along with the memory used by the key, in bytes.
print(f"There are {redis_conn.scard(SHEEP_SET_KEY)} sheep (set: {redis_conn.memory_usage(SHEEP_SET_KEY)}).")
print(f"There are {redis_conn.pfcount(SHEEP_HLL_KEY)} sheep (hyperloglog: {redis_conn.memory_usage(SHEEP_HLL_KEY)}).")

--------------------------------------------------------------------------------
/counting_sheep_python/have_i_seen_this_one.py:
--------------------------------------------------------------------------------
# Exact membership check using a Python set.
sheep_seen = {
    "1934", "1201", "1199", "0007", "3409", "1015"
}


def have_i_seen(sheep_id):
    if sheep_id in sheep_seen:
        print(f"I have seen sheep {sheep_id}.")
    else:
        print(f"I have not seen sheep {sheep_id}.")


have_i_seen("1934")
have_i_seen("1283")

--------------------------------------------------------------------------------
/counting_sheep_python/how_many.py:
--------------------------------------------------------------------------------
# Exact distinct count using a Python set.
sheep_seen = set()

sheep_seen.add("1934")
sheep_seen.add("1201")
sheep_seen.add("1199")
sheep_seen.add("0007")
sheep_seen.add("3409")
sheep_seen.add("1934")  # Duplicate: the set ignores it.
sheep_seen.add("1015")

print(f"There are {len(sheep_seen)} sheep.")

--------------------------------------------------------------------------------
/counting_sheep_redis/have_i_seen_this_one.py:
--------------------------------------------------------------------------------
from redis import Redis

redis_conn = Redis()

SHEEP_SET_KEY = "sheep_seen"

# Start from a clean key, then add all sheep IDs in a single SADD call.
redis_conn.delete(SHEEP_SET_KEY)
redis_conn.sadd(SHEEP_SET_KEY, "1934", "1201", "1199", "0007", "3409", "1015")


def have_i_seen(sheep_id):
    if redis_conn.sismember(SHEEP_SET_KEY, sheep_id):
        print(f"I have seen sheep {sheep_id}.")
    else:
        print(f"I have not seen sheep {sheep_id}.")


have_i_seen("1934")
have_i_seen("1283")

--------------------------------------------------------------------------------
/counting_sheep_redis/how_many.py:
--------------------------------------------------------------------------------
from redis import Redis

redis_conn = Redis()

SHEEP_SET_KEY = "sheep_seen"

# Start from a clean key.
redis_conn.delete(SHEEP_SET_KEY)

redis_conn.sadd(SHEEP_SET_KEY, "1934")
redis_conn.sadd(SHEEP_SET_KEY, "1201")
redis_conn.sadd(SHEEP_SET_KEY, "1199")
redis_conn.sadd(SHEEP_SET_KEY, "0007")
redis_conn.sadd(SHEEP_SET_KEY, "3409")
redis_conn.sadd(SHEEP_SET_KEY, "1934")  # Duplicate: the set ignores it.
redis_conn.sadd(SHEEP_SET_KEY, "1015")

print(f"There are {redis_conn.scard(SHEEP_SET_KEY)} sheep.")

--------------------------------------------------------------------------------
/counting_things.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/simonprickett/python-probabilistic-data-structures/41b6a5cc8f6f4f6699793fb9d585858aa01467ef/counting_things.png

--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------

version: "3.9"
services:
  redis:
    container_name: redisprobabilistic
    image: "redis/redis-stack:latest"
    ports:
      - 6379:6379
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
async-timeout==4.0.2
Deprecated==1.2.13
hyperloglog==0.0.13
packaging==21.3
pyparsing==3.0.9
pyprobables==0.5.6
redis==4.3.4
wrapt==1.14.1

--------------------------------------------------------------------------------
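
As a rough companion to the Bloom Filter examples above, the sketch below applies the textbook Bloom filter sizing formulas to the parameters used in this repository: 200,000 expected elements, with a 1% false positive rate in the Python example and 0.1% in the Redis example. `bloom_parameters` is a hypothetical helper written for illustration; it is not part of pyprobables or RedisBloom, and the sizes those libraries actually allocate may differ.

```python
import math
from typing import Tuple


def bloom_parameters(expected_items: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) from the textbook Bloom filter formulas.

    bits       m = -n * ln(p) / (ln 2)^2
    hash_count k = (m / n) * ln 2

    Hypothetical helper for illustration only; real libraries may round
    or size their filters differently.
    """
    bits = math.ceil(
        -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hash_count = max(1, round(bits / expected_items * math.log(2)))
    return bits, hash_count


# Parameters taken from the two Bloom Filter examples in this repository.
for rate in (0.01, 0.001):
    bits, hashes = bloom_parameters(200_000, rate)
    print(f"p={rate}: about {bits / 8 / 1024:.0f} KiB, {hashes} hash functions")
```

Under these formulas, tightening the false positive rate by a factor of ten costs roughly 50% more bits and a few extra hash functions, which is why a Bloom Filter stays far smaller than a set holding every sheep ID.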