├── LICENSE
├── Makefile
├── README
├── index.html
├── murmurhash.js
└── zh_CN
    └── index.html


/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2016 Bill Mill

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
push:
	-git branch -D gh-pages
	git switch -c gh-pages
	git push -f -u origin gh-pages
	git checkout main

--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
Page hosted at http://billmill.org/bloomfilter-tutorial/

--------------------------------------------------------------------------------
/index.html:
--------------------------------------------------------------------------------
Bloom Filters by Example

Bloom Filters by Example

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

The base data structure of a Bloom filter is a Bit Vector. Here's a small one we'll use to demonstrate:

[Interactive bit vector: a row of empty cells, each labeled with its index]

Each empty cell in that table represents a bit, and the number below it is its index. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the indexes given by those hashes to 1.

It's easier to see what that means than to explain it, so enter some strings and see how the bit vector changes. Fnv and Murmur are two simple hash functions:

[Interactive demo: enter a string to see its fnv and murmur hash values, the bits they set, and the current contents of your set]

When you add a string, you can see that the bits at the indexes given by the hashes are set to 1. I've used the color green to show the newly added ones, but any colored cell is simply a 1.
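The add step can be sketched in a few lines of JavaScript. This is a minimal sketch and not the page's own code: the `fnv1a` helper hashes UTF-16 code units rather than bytes, the `seed` parameter is an assumption used to get a second hash out of the same function, and the 15-bit size is a hypothetical choice.

```javascript
// 32-bit FNV-1a over UTF-16 code units. The seed parameter is an assumption
// made for this sketch so one function can stand in for two hashes.
function fnv1a(str, seed = 0) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return h >>> 0; // force an unsigned result
}

const M = 15;                   // number of bits (hypothetical size)
const bits = new Uint8Array(M); // one byte per bit, for readability

// Set the bit at each hash index to 1 and return the indexes touched.
function add(str) {
  const indexes = [fnv1a(str) % M, fnv1a(str, 1) % M];
  for (const i of indexes) bits[i] = 1;
  return indexes;
}

const touched = add("hello");
console.log(touched, Array.from(bits).join(""));
```

A real filter would pack the bits into words instead of spending a byte per bit; the byte array just keeps the sketch easy to read.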

To test for membership, you simply hash the string with the same hash functions, then see if those bits are set in the bit vector. If they aren't, you know that the element isn't in the set. If they are, you only know that it might be, because another element or some combination of other elements could have set the same bits. Again, let's demonstrate:

[Interactive demo: enter an element to see its fnv and murmur hash values, whether it may be in the set, and the current probability of a false positive]
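A membership test can be sketched the same way. As before, this is an illustration under assumptions: `fnv1a` with a `seed` stands in for the two hash functions, and `makeFilter` and `mightContain` are hypothetical names chosen for this example.

```javascript
// 32-bit FNV-1a over UTF-16 code units; the seed is an assumption of this sketch.
function fnv1a(str, seed = 0) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Build a tiny filter with m bits and two hash functions.
function makeFilter(m) {
  const bits = new Uint8Array(m);
  const indexes = (s) => [fnv1a(s) % m, fnv1a(s, 1) % m];
  return {
    add(s) {
      for (const i of indexes(s)) bits[i] = 1;
    },
    // false means "definitely not in the set"; true means "might be in the set".
    mightContain(s) {
      return indexes(s).every((i) => bits[i] === 1);
    },
  };
}

const filter = makeFilter(15);
filter.add("apple");
console.log(filter.mightContain("apple")); // true: added elements always test positive
console.log(filter.mightContain("pear"));  // false unless both of its bits happen to collide
```

The asymmetry is the point: a `false` answer is certain, while a `true` answer only says that every bit the element hashes to has been set by something.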

And that's the basics of a Bloom filter!

Advanced Topics

Before I write a bit more about Bloom filters, a disclaimer: I've never used them in production. Don't take my word for it. All I intend to do is give you general ideas and pointers to where you can find out more.

In the following text, we will refer to a Bloom filter with k hashes, m bits in the filter, and n elements that have been inserted.

Hash Functions

The hash functions used in a Bloom filter should be independent and uniformly distributed. They should also be as fast as possible (cryptographic hashes such as sha1, though widely used, are therefore not very good choices).

Examples of fast, simple hashes that are independent enough [3] include murmur, xxHash, the fnv series of hashes, and HashMix.
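One way to get k indexes without k separate hash functions is the double-hashing scheme analyzed in reference 3, which derives each index as g_i(x) = (h1(x) + i * h2(x)) mod m. The sketch below assumes a seeded `fnv1a` for both base hashes; the helper names and the seed constant are hypothetical and are not taken from the page.

```javascript
// 32-bit FNV-1a over UTF-16 code units; the seed is an assumption of this sketch.
function fnv1a(str, seed = 0) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// Derive k bit indexes from two base hashes: g_i(x) = (h1(x) + i * h2(x)) mod m.
function indexesFor(str, k, m) {
  const h1 = fnv1a(str);
  const h2 = fnv1a(str, 0x9747b28c); // hypothetical seed for the second hash
  const out = [];
  for (let i = 0; i < k; i++) {
    // Both hashes are below 2^32, so the sum stays exact in a double.
    out.push((h1 + i * h2) % m);
  }
  return out;
}

console.log(indexesFor("hello", 4, 1000));
```

If `h2` happens to be a multiple of m, all k indexes coincide, so a production implementation would likely guard against that case.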

To see the difference that a faster-than-cryptographic hash function can make, check out this story of a ~800% speedup when switching a Bloom filter implementation from md5 to murmur.

In a short survey of Bloom filter implementations:

How big should I make my Bloom filter?

It's a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have fewer false positives, and a smaller one more.

Your false positive rate will be approximately $(1 - e^{-kn/m})^k$, so you can just plug in the number n of elements you expect to insert, and try various values of k and m to configure your filter for your application. [2]
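To try values quickly, the approximation can be wrapped in a small function. This is a direct transcription of the formula above; the function name and the sample numbers are illustrative only.

```javascript
// Approximate false positive rate for k hashes, n inserted elements, m bits:
// (1 - e^(-kn/m))^k
function falsePositiveRate(k, n, m) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

// Example: 1000 elements in a 10000-bit filter with 7 hash functions.
console.log(falsePositiveRate(7, 1000, 10000).toFixed(4)); // about 0.0082
```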

This leads to an obvious question:

How many hash functions should I use?

The more hash functions you have, the slower your Bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.

Since you have to pick k when you create the filter, you'll have to ballpark what range you expect n to be in. Once you have that, you still have to choose a potential m (the number of bits) and k (the number of hash functions).

It seems a difficult optimization problem, but fortunately, given an m and an n, we have a function to choose the optimal value of k: $k = (m/n)\ln 2$ [2, 3]

So, to choose the size of a Bloom filter, we:

  1. Choose a ballpark value for n
  2. Choose a value for m
  3. Calculate the optimal value of k
  4. Calculate the error rate for our chosen values of n, m, and k. If it's unacceptable, return to step 2 and change m; otherwise we're done.
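The four steps above can be sketched as a loop. This is one possible reading of the procedure, with some assumptions: k is rounded to the nearest whole number (at least 1), the search starts at m = n, and m grows by an arbitrary 10% each time the error rate is too high. The function names are hypothetical.

```javascript
// Step 3: optimal k = (m/n) ln 2, rounded to a usable whole number.
function optimalK(m, n) {
  return Math.max(1, Math.round((m / n) * Math.LN2));
}

// Step 4: approximate error rate (1 - e^(-kn/m))^k.
function falsePositiveRate(k, n, m) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

// Grow m until the error rate for the optimal k is acceptable.
function sizeFilter(n, targetRate, startBits = n) {
  if (!(n > 0) || !(targetRate > 0 && targetRate < 1)) {
    throw new RangeError("need n > 0 and 0 < targetRate < 1");
  }
  let m = startBits; // step 2: initial guess for m
  for (;;) {
    const k = optimalK(m, n);
    const rate = falsePositiveRate(k, n, m);
    if (rate <= targetRate) return { m, k, rate };
    m = Math.ceil(m * 1.1); // back to step 2 with a larger filter
  }
}

console.log(sizeFilter(1000, 0.01));
```

Because of the 10% step, the result is not the smallest possible m, only the first candidate that meets the target.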

How fast and space efficient is a Bloom filter?

Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and then set or check those bits.

The space advantages are more difficult to sum up; again it depends on the error rate you're willing to tolerate. It also depends on the potential range of the elements to be inserted; if it is very limited, a deterministic bit vector can do better. If you can't even ballpark estimate the number of elements to be inserted, you may be better off with a hash table or a scalable Bloom filter [4].
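To put a rough number on the space cost, the two formulas from the sizing sections can be combined. With the optimal k = (m/n) ln 2, the term e^(-kn/m) equals 1/2, so the error rate p becomes (1/2)^k, and solving for m/n gives -ln(p) / (ln 2)^2 bits per element. The sketch below only evaluates that expression; it ignores rounding k to a whole number, and the function name is hypothetical.

```javascript
// Bits per inserted element needed for error rate p, assuming the optimal k:
// m/n = -ln(p) / (ln 2)^2
function bitsPerElement(p) {
  return -Math.log(p) / (Math.LN2 * Math.LN2);
}

console.log(bitsPerElement(0.01).toFixed(1)); // about 9.6 bits per element for 1% error
```

Note that the figure does not depend on how large each element is, which is where the saving over storing the elements themselves comes from.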

What can I use them for?

I'll link you to Wikipedia instead of copying what they say. C. Titus Brown also has an excellent talk on an application of Bloom filters to bioinformatics.

References

1: Network Applications of Bloom Filters: A Survey, Broder and Mitzenmacher. An excellent overview.

2: Wikipedia, which has an excellent and comprehensive page on Bloom filters

3: Less Hashing, Same Performance, Kirsch and Mitzenmacher

4: Scalable Bloom Filters, Almeida et al


--------------------------------------------------------------------------------
/murmurhash.js:
--------------------------------------------------------------------------------
//via https://gist.github.com/588423
//thanks github.com/raycmorgan!
function murmur(str, seed) {
  var m = 0x5bd1e995;
  var r = 24;
  var h = seed ^ str.length;
  var length = str.length;
  var currentIndex = 0;

  // Mix the input four characters at a time.
  while (length >= 4) {
    var k = UInt32(str, currentIndex);

    k = Umul32(k, m);
    k ^= k >>> r;
    k = Umul32(k, m);

    h = Umul32(h, m);
    h ^= k;

    currentIndex += 4;
    length -= 4;
  }

  // Fold in the remaining one to three characters, if any.
  switch (length) {
  case 3:
    h ^= UInt16(str, currentIndex);
    h ^= str.charCodeAt(currentIndex + 2) << 16;
    h = Umul32(h, m);
    break;

  case 2:
    h ^= UInt16(str, currentIndex);
    h = Umul32(h, m);
    break;

  case 1:
    h ^= str.charCodeAt(currentIndex);
    h = Umul32(h, m);
    break;
  }

  // Final avalanche.
  h ^= h >>> 13;
  h = Umul32(h, m);
  h ^= h >>> 15;

  return h >>> 0;
}

// Read four characters starting at pos as a little-endian 32-bit value.
function UInt32(str, pos) {
  return (str.charCodeAt(pos++)) +
         (str.charCodeAt(pos++) << 8) +
         (str.charCodeAt(pos++) << 16) +
         (str.charCodeAt(pos) << 24);
}

// Read two characters starting at pos as a little-endian 16-bit value.
function UInt16(str, pos) {
  return (str.charCodeAt(pos++)) +
         (str.charCodeAt(pos++) << 8);
}

// 32-bit multiply built from 16-bit halves so the product does not lose precision.
function Umul32(n, m) {
  n = n | 0;
  m = m | 0;
  var nlo = n & 0xffff;
  var nhi = n >>> 16;
  var res = ((nlo * m) + (((nhi * m) & 0xffff) << 16)) | 0;
  return res;
}

function getBucket(str, buckets) {
  var hash = murmur(str, str.length);
  var bucket = hash % buckets;
  return bucket;
}

--------------------------------------------------------------------------------
/zh_CN/index.html:
--------------------------------------------------------------------------------
Bloom Filters by Example

Bloom Filters by Example

Bloom filter 是一个数据结构,它可以用来判断某个元素是否在集合内,具有运行快速,内存占用小的特点。

而高效插入和查询的代价就是,Bloom filter 是一个基于概率的数据结构:它只能告诉我们一个元素绝对不在集合内或可能在集合内。

Bloom filter 的基础数据结构是一个比特向量。下面是一个简单的示例:

[交互式比特向量:一行空格,每个空格下方标有索引]

表中的每一个空格表示一个比特,空格下面的数字表示它的索引。只需要简单地对输入进行多次哈希操作,并把对应于其结果的比特置为 1,就可以向 Bloom filter 添加一个元素。

相较于解释,直观地看到它的变化总是更利于我们加深理解,所以你可以输入一些字符串然后观察上方的向量变化。其中使用了 Fnv 和 Murmur 这两个简单的哈希函数:

[交互式演示:输入一个字符串,查看它的 fnv 和 murmur 哈希值、被置位的比特以及当前集合的内容]

当你往集合里添加一个字符串的时候,你可以看到哈希值对应位置的比特被置为 1。这里我用了绿色表示最新添加的元素对应的位置,但实际上表格里任何带颜色的格子都只代表值为 1。

你可以简单地对字符串应用同样的哈希函数,然后看比特向量里对应的位置是否为 1,以此来判断一个元素是否在集合里。如果不是,你就知道这个元素一定不在集合里。如果是,你只知道元素可能在里面,因为这些对应位置有可能恰巧是由其他元素或者其他元素的组合所置位的。

[交互式演示:输入一个元素,查看它的 fnv 和 murmur 哈希值、它是否可能在集合中,以及当前的误判概率]

这些就是 Bloom filter 全部的基础内容了。

高级话题

在写下更多关于 Bloom filter 的内容之前,我需要声明一点:我从未在生产环境使用过 Bloom filter。所以不要不假思索地相信下面的内容,我想做的只是给你一个概括式的介绍,同时告诉你可以去哪里寻找更多内容。

在下面的内容里,我们假设 Bloom filter 里面有 k 个哈希函数,m 个比特,以及 n 个已插入元素。

哈希函数

Bloom filter 里的哈希函数需要彼此独立且均匀分布。同时,它们也需要尽可能地快(尽管 sha1 之类的加密哈希算法被广泛应用,但是从这一点上考虑并不是很好的选择)。

快速、简单且足够独立 [3] 的哈希函数的例子包括 murmur、fnv 族哈希函数以及 HashMix。

如果你希望了解一个比加密哈希函数快的哈希函数可以达到什么程度,可以参考这个故事:当把 Bloom filter 的实现从 md5 切换到 murmur 时,速度提升了约 800%。

一个关于 Bloom filter 实现方式的简单调查:

Bloom filter 应该设计为多大?

Bloom filter 的一个优良特性就是可以修改过滤器的误判率。一个大的过滤器会拥有比一个小的过滤器更低的误判率。

误判率会近似于 $(1 - e^{-kn/m})^k$,所以你只需要先确定可能插入的元素个数 n,然后再调整 k 和 m 来为你的应用配置过滤器。[2]

而这带来了一个显而易见的问题:

应该使用多少个哈希函数?

Bloom filter 使用的哈希函数越多,运行速度就会越慢,填满得也越快。但是如果哈希函数过少,又会遇到误判率高的问题。所以这个问题上需要认真考虑。

在创建一个 Bloom filter 的时候需要确定 k 的值,也就是说你需要提前估计 n 的变动范围。而一旦你这样做了,你依然需要确定 m(总比特数)和 k(哈希函数的个数)的值。

似乎这是一个十分困难的优化问题,但幸运的是,对于给定的 m 和 n,有一个函数可以帮我们确定最优的 k 值:$k = (m/n)\ln 2$ [2, 3]

所以可以通过以下的步骤来确定 Bloom filter 的大小:

  1. 确定 n 的变动范围
  2. 选定 m 的值
  3. 计算 k 的最优值
  4. 对于给定的 n、m 和 k 计算错误率。如果这个错误率不能接受,那么回到第二步并修改 m,否则结束

Bloom filter 的时间复杂度和空间复杂度?

对于一个 m 和 k 值确定的 Bloom filter,插入和测试操作的时间复杂度都是 O(k)。这意味着每次你想要插入一个元素或者查询一个元素是否在集合中,只需要使用 k 个哈希函数对这个元素求值,然后将对应的比特位置位或者检查对应的比特位。

相比之下,Bloom filter 的空间复杂度更难以概述,它取决于你可以忍受的错误率。同时也取决于输入元素的范围:如果这个范围非常有限,那么一个确定性的比特向量可以做得更好。如果你甚至不能大致估计要插入的元素个数,那么你最好选择一个哈希表或者一个可扩展的 Bloom filter [4]。

可以用 Bloom filter 来做什么?

我会将你引向 wiki 而不是将它们的内容拷贝过来。C. Titus Brown 的演讲很好地阐述了 Bloom filter 在生物信息学中的应用。

参考文献

1: Network Applications of Bloom Filters: A Survey, Broder and Mitzenmacher. An excellent overview.

2: Wikipedia, which has an excellent and comprehensive page on Bloom filters

3: Less Hashing, Same Performance, Kirsch and Mitzenmacher

4: Scalable Bloom Filters, Almeida et al

--------------------------------------------------------------------------------