Bloom Filters by Example
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.
The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.
The base data structure of a Bloom filter is a bit vector. Here's a small one we'll use to demonstrate:
Each empty cell in that table represents a bit, and the number below it its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1.
It's easier to see what that means than explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions:
[Interactive demo: enter a string to see its fnv and murmur hash values and the current contents of your set.]
When you add a string, you can see that the bits at the index given by the hashes are set to 1. I've used the color green to show the newly added ones, but any colored cell is simply a 1.
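For readers without the interactive demo, here is a minimal Python sketch of the add step. The vector size of 15 bits is an assumption, and `zlib.crc32` is only a standard-library stand-in for murmur (which Python does not ship); `bit_indexes` and `add` are names made up for this illustration.

```python
import zlib

M = 15  # number of bits in the demo vector (an assumed size)

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

def bit_indexes(item: str, m: int = M) -> list[int]:
    """Map an item to one bit index per hash function."""
    data = item.encode("utf-8")
    # crc32 stands in for murmur here (assumption, to stay stdlib-only).
    return [fnv1a_32(data) % m, zlib.crc32(data) % m]

def add(bits: list[int], item: str) -> None:
    """Set the bit at every index the item hashes to."""
    for i in bit_indexes(item, len(bits)):
        bits[i] = 1

bits = [0] * M
add(bits, "hello")
print(bits)  # one or two cells are now 1 (two hashes can land on the same index)
```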
To test for membership, you simply hash the string with the same hash functions, then see if those values are set in the bit vector. If they aren't, you know that the element isn't in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let's demonstrate:
[Interactive demo: enter an element to see its fnv and murmur hash values, whether the element may be in the set, and the current probability of a false positive.]
And that's the basics of a Bloom filter!
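The add and membership-test steps described above fit in a few lines of Python. This is a rough sketch rather than a production design: the class name, the method names, and the choice of `zlib.crc32` and `zlib.adler32` as the two hashes are all assumptions made so the example needs nothing outside the standard library.

```python
import zlib

class BloomFilter:
    """A tiny Bloom filter with two hash functions (illustrative only)."""

    def __init__(self, m: int) -> None:
        self.m = m            # number of bits in the filter
        self.bits = [0] * m   # the bit vector, all zeros to start

    def _indexes(self, item: str) -> list[int]:
        data = item.encode("utf-8")
        # crc32 and adler32 stand in for fnv and murmur (an assumption).
        return [zlib.crc32(data) % self.m, zlib.adler32(data) % self.m]

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item: str) -> bool:
        # False means "definitely not in the set"; True only means "maybe".
        return all(self.bits[i] for i in self._indexes(item))

bf = BloomFilter(m=64)
for word in ["apple", "banana"]:
    bf.add(word)

print(bf.might_contain("apple"))   # True: an added item always tests positive
print(bf.might_contain("cherry"))  # usually False, but a false positive is possible
```

Note that there is no way to remove an element: clearing its bits could also clear bits that other elements rely on.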
Advanced Topics
Before I write a bit more about Bloom filters, a disclaimer: I've never used them in production. Don't take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more.
In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted.
Hash Functions
The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible, so cryptographic hashes such as SHA-1, though widely used, are not very good choices.
Examples of fast, simple hashes that are independent enough [3] include murmur, xxHash, the fnv series of hashes, and HashMix.
To see the difference that a faster-than-cryptographic hash function can make, check out this story of a ~800% speedup when switching a Bloom filter implementation from MD5 to murmur.
In a short survey of Bloom filter implementations:
- Chromium uses murmur. (Also, here's a short description of how they use Bloom filters.)
- Plan9 uses a simple hash as proposed in Mitzenmacher 2005.
- Sdroege Bloom filter uses fnv1a (included just because I wanted to show one that uses fnv).
- Squid uses MD5.
- RedisBloom uses murmur.
- Apache Spark uses murmur.
- influxdb uses xxhash.
- bloomd (a neat project that uses a redis-ish protocol) uses murmur for the first two hashes, SpookyHash for the second two hashes, and a combination of the two for further hashes, as described in [3].
- fleur (C), flor (python), and bloom (go) all use fnv.
- Sqlite added a Bloom filter for analytic queries, but I do not understand the hash algorithm. Dr. Hipp explains the purpose of the filters on the sqlite forum.
- RocksDB is configurable, but claims in the source that xxh3, a member of the xxhash family, performed best for them.
  - They also link "Bloom Filters in Probabilistic Verification" by Dillinger and Manolios, but it's pretty far over my head.
- ScyllaDB uses murmur.
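Several of the implementations above derive all k indexes from just two base hashes, the trick from Kirsch and Mitzenmacher [3]: the i-th index is h1(x) + i * h2(x) mod m. Here is a hedged Python sketch of that idea. The function name `double_hash_indexes` is invented for this example, and `zlib.crc32` is again a stand-in for murmur; I am not claiming any project listed above does it exactly this way.

```python
import zlib

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

def double_hash_indexes(item: str, k: int, m: int) -> list[int]:
    """Derive k bit indexes from two base hashes: g_i(x) = h1(x) + i*h2(x) mod m."""
    data = item.encode("utf-8")
    h1 = fnv1a_32(data)
    h2 = zlib.crc32(data)  # stand-in for a second independent hash (assumption)
    return [(h1 + i * h2) % m for i in range(k)]

print(double_hash_indexes("hello", k=5, m=97))
```

One caveat with this simple form: if h2 happens to be a multiple of m, every index collapses to the same bit, so real implementations often adjust h2 to avoid that.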
How big should I make my Bloom filter?
It's a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have fewer false positives, and a smaller one more.
Your false positive rate will be approximately $(1 - e^{-kn/m})^k$, so you can just plug in the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application. [2]
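If you'd rather let a computer do the plugging in, the approximation is a one-liner in Python. The function name and the sample values of n, m, and k below are just illustrative assumptions.

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

# Try a few filter sizes for 1,000 expected elements and 3 hash functions.
for m in (4_000, 8_000, 16_000):
    print(m, round(false_positive_rate(n=1_000, m=m, k=3), 4))
```

For a fixed n and k, the rate drops as m grows, which matches the intuition that a larger filter gives fewer false positives.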
This leads to an obvious question:
How many hash functions should I use?
The more hash functions you have, the slower your Bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Since you have to pick k when you create the filter, you'll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).
It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k: $k = (m/n)\ln 2$ [2, 3]
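As a quick worked example, assume a budget of 10 bits per element (m/n = 10, a value picked only for illustration). Since k has to be a whole number, we round the optimum to the nearest integer and then plug it back into the false positive approximation:

```latex
\begin{align*}
k_{\text{opt}} &= \frac{m}{n}\ln 2 = 10 \times 0.693 \approx 6.93 \quad\Rightarrow\quad k = 7 \\
p &\approx \left(1 - e^{-kn/m}\right)^{k} = \left(1 - e^{-0.7}\right)^{7} \approx 0.0082
\end{align*}
```

So at 10 bits per element and 7 hash functions, roughly 0.8% of lookups for absent elements would come back as "maybe".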
So, to choose the size of a Bloom filter, we:
1. Choose a ballpark value for n.
2. Choose a value for m.
3. Calculate the optimal value of k.
4. Calculate the error rate for our chosen values of n, m, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.
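The procedure above can be sketched as a small loop. The starting guess (m = n) and the doubling of m on each retry are my own assumptions for the sake of a runnable example; any other way of revisiting step 2 would do.

```python
import math

def false_positive_rate(n: int, m: int, k: int) -> float:
    """Approximate false positive rate: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(n: int, m: int) -> int:
    """Optimal number of hashes, (m/n) * ln 2, rounded to an integer of at least 1."""
    return max(1, round((m / n) * math.log(2)))

def size_filter(n: int, target: float) -> tuple[int, int, float]:
    """Grow m until the estimated false positive rate is at or below target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    m = n  # step 2: an arbitrary first guess (assumption)
    while True:
        k = optimal_k(n, m)                # step 3
        p = false_positive_rate(n, m, k)   # step 4
        if p <= target:
            return m, k, p
        m *= 2  # unacceptable: go back to step 2 with a larger m

m, k, p = size_filter(n=1_000, target=0.01)
print(f"m={m} bits, k={k} hashes, estimated false positive rate={p:.5f}")
```

Doubling overshoots the smallest workable m, so treat the result as a starting point and shrink m by hand if memory is tight.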
How fast and space efficient is a Bloom filter?
Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and then set or check the corresponding bits.
The space advantages are more difficult to sum up; again it depends on the error rate you're willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can't even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter [4].
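To get a rough feel for the space cost, you can combine the false positive approximation with the optimal k. With k = (m/n) ln 2 the rate becomes (1/2)^k, which rearranges to m/n = -ln(p) / (ln 2)^2 bits per element. The sketch below assumes the optimal k is used and ignores rounding k to an integer; the function name is made up for this example.

```python
import math

def bits_per_element(p: float) -> float:
    """Bits per element needed for false positive rate p, assuming the optimal k."""
    return -math.log(p) / (math.log(2) ** 2)

for p in (0.1, 0.01, 0.001):
    print(f"p={p}: about {bits_per_element(p):.2f} bits per element")
```

Notice that the size of the elements themselves never appears in the formula: the filter stores only bits, not the elements.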
What can I use them for?
I'll link you to Wikipedia instead of copying what they say. C. Titus Brown also has an excellent talk on an application of Bloom filters to bioinformatics.
References
1: Network Applications of Bloom Filters: A Survey, Broder and Mitzenmacher. An excellent overview.
2: Wikipedia, which has an excellent and comprehensive page on Bloom filters.
3: Less Hashing, Same Performance, Kirsch and Mitzenmacher.
4: Scalable Bloom Filters, Almeida et al.