├── .gitignore ├── LICENSE ├── README.md ├── docs ├── allclasses-frame.html ├── allclasses-noframe.html ├── com │ └── adroll │ │ └── cantor │ │ ├── HLLCounter.html │ │ ├── HLLWritable.html │ │ ├── class-use │ │ ├── HLLCounter.html │ │ └── HLLWritable.html │ │ ├── package-frame.html │ │ ├── package-summary.html │ │ ├── package-tree.html │ │ └── package-use.html ├── constant-values.html ├── deprecated-list.html ├── help-doc.html ├── index-all.html ├── index.html ├── overview-tree.html ├── package-list ├── resources │ ├── background.gif │ ├── tab.gif │ ├── titlebar.gif │ └── titlebar_end.gif ├── serialized-form.html └── stylesheet.css ├── pom.xml ├── src ├── main │ └── java │ │ └── com │ │ └── adroll │ │ └── cantor │ │ ├── HLLCounter.java │ │ ├── HLLWritable.java │ │ └── package-info.java └── test │ └── java │ └── com │ └── adroll │ └── cantor │ ├── TestHLLCounter.java │ └── TestHLLWritable.java └── utils ├── minhash_k.py └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | *~ 2 | target/ -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2014 AdRoll 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Cantor 2 | ====== 3 | 4 | Cantor provides utilities for estimating the cardinality 5 | of large sets. 6 | 7 | The algorithms herein are parallelizable, and a Hadoop 8 | wrapper class is provided for convenience. 9 | 10 | It employs most of the HyperLogLog++ algorithm as seen in 11 | [this paper](http://research.google.com/pubs/pub40671.html), 12 | excluding the sparse scheme, and using a simple linear 13 | interpolation instead of kNN. In addition, it can use MinHash 14 | structures to estimate cardinalities of intersections of these 15 | sets, as described in 16 | [this blog post](http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html). 17 | 18 | Both HyperLogLog and MinHash require a precision 19 | parameter. Basic guidelines are available as follows, 20 | and `HLLCounter.MIN_P = 4 <= p <= 18 = HLLCounter.MAX_P`. 
21 | 22 | ####HyperLogLog p @ 99.7% Confidence 23 | p | Relative Error 24 | ---:|---: 25 | 4 | 75% 26 | 5 | 65% 27 | 6 | 47% 28 | 7 | 32% 29 | 8 | 23% 30 | 9 | 16% 31 | 10 | 10% 32 | 11 | 8% 33 | 12 | 5% 34 | 13 | 4% 35 | 14 | 2.5% 36 | 15 | 2% 37 | 16 | 1.3% 38 | 17 | 1% 39 | 18 | 0.7% 40 | 41 | ####MinHash k @ 99% Confidence 42 | **Relative Error** | **Intersection Size -->** | | | | * 43 | :------------------|--------------------------:|-------:|-----:|------:|-----: 44 | - | 0.01% | 0.1% |1.0% | 5.0% |10.0% 45 | 100% | 90000 | 9000 |900 | 170 |75 46 | 50% | 313334 | 31334 |3134 | 587 |280 47 | 25% | - | 116800 |11520 | 2208 |1040 48 | 10% | - | - |68455 | 13128 |6210 49 | 50 | This MinHash k table can be generated by using `minhash_k.py` in the `utils` 51 | directory. For now, the only requirement is scipy, which you can install with 52 | `pip install -r utils/requirements.txt`. Then, for example, you can do: 53 | 54 | ``` 55 | %> ./utils/minhash_k.py --jaccard 0.0001 --error 1 --confidence 0.99 56 | MinHash k: 90000 57 | Error at k: 1.0 58 | %> ./utils/minhash_k.py --jaccard 0.01 --error 0.25 --confidence 0.99 59 | MinHash k: 11520 60 | Error at k: 0.25 61 | %> ./utils/minhash_k.py --jaccard 0.01 --error 0.25 --confidence 0.90 62 | MinHash k: 4800 63 | Error at k: 0.25 64 | ``` 65 | 66 | Additional information is available with `./utils/minhash_k.py --help`. -------------------------------------------------------------------------------- /docs/allclasses-frame.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | All Classes (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 |

All Classes

13 |
14 | 18 |
19 | 20 | 21 | -------------------------------------------------------------------------------- /docs/allclasses-noframe.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | All Classes (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 |

All Classes

13 |
14 | 18 |
19 | 20 | 21 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/HLLCounter.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | HLLCounter (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 79 | 80 | 81 |
82 |
com.adroll.cantor
83 |

Class HLLCounter

84 |
85 |
86 | 94 |
95 | 114 |
115 |
116 | 351 |
352 |
353 | 760 |
761 |
762 | 763 | 764 |
765 | 766 | 767 | 768 | 769 | 778 |
779 | 821 | 822 |

Copyright © 2014. All rights reserved.

823 | 824 | 825 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/HLLWritable.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | HLLWritable (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 79 | 80 | 81 |
82 |
com.adroll.cantor
83 |

Class HLLWritable

84 |
85 |
86 | 94 |
95 | 111 |
112 |
113 | 281 |
282 |
283 | 556 |
557 |
558 | 559 | 560 |
561 | 562 | 563 | 564 | 565 | 574 |
575 | 617 | 618 |


619 | 620 | 621 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/class-use/HLLCounter.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Uses of Class com.adroll.cantor.HLLCounter (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Uses of Class
com.adroll.cantor.HLLCounter

67 |
68 |
69 | 148 |
149 | 150 |
151 | 152 | 153 | 154 | 155 | 164 |
165 | 192 | 193 |


194 | 195 | 196 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/class-use/HLLWritable.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Uses of Class com.adroll.cantor.HLLWritable (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Uses of Class
com.adroll.cantor.HLLWritable

67 |
68 |
69 | 116 |
117 | 118 |
119 | 120 | 121 | 122 | 123 | 132 |
133 | 160 | 161 |


162 | 163 | 164 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/package-frame.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | com.adroll.cantor (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 |

com.adroll.cantor

13 |
14 |

Classes

15 | 19 |
20 | 21 | 22 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/package-summary.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | com.adroll.cantor (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Package com.adroll.cantor

67 |
68 |
Cantor provides utilities for estimating the cardinality 69 | of large sets.
70 |
71 |

See: Description

72 |
73 |
74 | 102 | 103 | 104 | 105 |

Package com.adroll.cantor Description

106 |
Cantor provides utilities for estimating the cardinality 107 | of large sets. 108 |

109 | The algorithms herein are parallelizable, and a Hadoop 110 | wrapper class is provided for convenience. 111 |

112 | It employs most of the HyperLogLog++ algorithm as seen in 113 | 114 | this paper, excluding the sparse scheme, and using 115 | a simple linear interpolation instead of kNN. In addition, 116 | it can use MinHash structures to estimate cardinalities of 117 | intersections of these sets, as described in 118 | 119 | this blog post. 120 |

121 | Both HyperLogLog and MinHash require a precision 122 | parameter. Basic guidelines are available as follows, 123 | and HLLCounter.MIN_P = 4 <= p <= 18 = 124 | HLLCounter.MAX_P. 125 |

126 | 127 | 128 | 149 | 160 | 161 |
129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 |
HyperLogLog p @ 99.7% Confidence
p Relative Error
4 75%
5 65%
6 47%
7 32%
8 23%
9 16%
10 10%
11 8%
12 5%
13 4%
14 2.5%
15 2%
16 1.3%
17 1%
18 0.7%
148 |
150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 |
MinHash k @ 99% Confidence
Relative Error - Intersection Size
- 0.01% 0.1% 1.0% 5.0% 10.0%
100% 90000 9000 900 170 75
50% 313334 31334 3134 587 280
25% - 116800 11520 2208 1040
10% - - 68455 13128 6210
159 |
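As a rough cross-check of the HyperLogLog p table above, the textbook HyperLogLog standard error of about 1.04 / sqrt(m), with m = 2^p registers, can be scaled to three standard deviations (roughly 99.7% confidence). The sketch below is illustrative only: the function name `hll_relative_error` is hypothetical and not part of Cantor, and the table itself may have been derived differently (for example empirically), so small disagreements are expected.

```python
import math


def hll_relative_error(p: int, sigmas: float = 3.0) -> float:
    """Approximate relative error bound for an HLL with 2**p registers.

    Uses the textbook standard error 1.04 / sqrt(m), m = 2**p, scaled by
    `sigmas` (3 sigma corresponds to roughly 99.7% confidence).
    """
    if not 4 <= p <= 18:  # MIN_P and MAX_P as stated in the text above
        raise ValueError("p must be between 4 and 18")
    m = 1 << p
    return sigmas * 1.04 / math.sqrt(m)


# Print the approximation for a few precisions to compare with the table.
for p in (4, 10, 14, 18):
    print(p, f"{hll_relative_error(p):.2%}")
```

For p = 4, 10, 14 and 18 this gives roughly 78%, 10%, 2.4% and 0.6%, which is close to, but not identical with, the tabulated 75%, 10%, 2.5% and 0.7%.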

162 |
163 | 164 |
165 | 166 | 167 | 168 | 169 | 178 |
179 | 206 | 207 |


208 | 209 | 210 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/package-tree.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | com.adroll.cantor Class Hierarchy (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Hierarchy For Package com.adroll.cantor

67 |
68 |
69 |

Class Hierarchy

70 | 78 |
79 | 80 |
81 | 82 | 83 | 84 | 85 | 94 |
95 | 122 | 123 |


124 | 125 | 126 | -------------------------------------------------------------------------------- /docs/com/adroll/cantor/package-use.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Uses of Package com.adroll.cantor (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Uses of Package
com.adroll.cantor

67 |
68 |
69 | 96 |
97 | 98 |
99 | 100 | 101 | 102 | 103 | 112 |
113 | 140 | 141 |


142 | 143 | 144 | -------------------------------------------------------------------------------- /docs/constant-values.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Constant Field Values (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Constant Field Values

67 |

Contents

68 | 71 |
72 |
73 | 74 | 75 |

com.adroll.*

76 | 118 |
119 | 120 |
121 | 122 | 123 | 124 | 125 | 134 |
135 | 162 | 163 |


164 | 165 | 166 | -------------------------------------------------------------------------------- /docs/deprecated-list.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Deprecated List (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Deprecated API

67 |

Contents

68 |
69 | 70 |
71 | 72 | 73 | 74 | 75 | 84 |
85 | 112 | 113 |


114 | 115 | 116 | -------------------------------------------------------------------------------- /docs/help-doc.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | API Help (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

How This API Document Is Organized

67 |
This API (Application Programming Interface) document has pages corresponding to the items in the navigation bar, described as follows.
68 |
69 |
70 | 169 | This help file applies to API documentation generated using the standard doclet.
170 | 171 |
172 | 173 | 174 | 175 | 176 | 185 |
186 | 213 | 214 |


215 | 216 | 217 | -------------------------------------------------------------------------------- /docs/index-all.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Index (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
C D E F G H I K M P R S T W  66 | 67 | 68 |

C

69 |
70 |
clear() - Method in class com.adroll.cantor.HLLCounter
71 |
72 |
Clears all data in the HLL and MinHash structures.
73 |
74 |
com.adroll.cantor - package com.adroll.cantor
75 |
76 |
Cantor provides utilities for estimating the cardinality 77 | of large sets.
78 |
79 |
combine(HLLCounter) - Method in class com.adroll.cantor.HLLCounter
80 |
81 |
Performs a destructive union of this 82 | HLLCounter and the one passed in.
83 |
84 |
combine(HLLCounter...) - Method in class com.adroll.cantor.HLLCounter
85 |
86 |
Performs a destructive union of this HLLCounter 87 | and all the ones passed in.
88 |
89 |
combine(HLLWritable) - Method in class com.adroll.cantor.HLLWritable
90 |
91 |
Returns a new HLLWritable that contains a 92 | representation of combining its internal 93 | HLLCounter's representation with 94 | the other's.
95 |
96 |
97 | 98 | 99 | 100 |
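The `combine` entries above describe a destructive union. For HyperLogLog structures of equal precision, a union is conventionally computed by taking the register-wise maximum, and for a bottom-k MinHash structure by keeping the k smallest hashes of the merged sets. The sketch below illustrates that general technique in Python; it is an assumption about how such a union works in principle, not a transcription of Cantor's Java code, and the names `union_registers` and `union_minhash` are hypothetical.

```python
def union_registers(a: bytes, b: bytes) -> bytes:
    """Merge two HLL register arrays of the same precision (same length)."""
    if len(a) != len(b):
        raise ValueError("register arrays must have the same length (same p)")
    # Each register keeps the largest value seen by either input.
    return bytes(max(x, y) for x, y in zip(a, b))


def union_minhash(a, b, k: int):
    """Merge two bottom-k MinHash sets, keeping the k smallest hashes."""
    return sorted(set(a) | set(b))[:k]


left = bytes([0, 3, 1, 0])
right = bytes([2, 1, 1, 0])
print(list(union_registers(left, right)))
print(union_minhash([5, 9, 12], [3, 9, 40], k=4))
```

Because the maximum is commutative, associative and idempotent, merging counters in any order (for example across Hadoop reducers) yields the same registers as inserting every element into a single counter.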

D

101 |
102 |
DEFAULT_K - Static variable in class com.adroll.cantor.HLLCounter
103 |
104 |
Default MinHash precision of 8192 if intersectable 105 | and HLL precision is DEFAULT_P
106 |
107 |
DEFAULT_P - Static variable in class com.adroll.cantor.HLLCounter
108 |
109 |
Default HLL precision of 18
110 |
111 |
112 | 113 | 114 | 115 |

E

116 |
117 |
equals(Object) - Method in class com.adroll.cantor.HLLWritable
118 |
119 |
Returns whether this HLLWritable 120 | is equivalent to the given Object.
121 |
122 |
123 | 124 | 125 | 126 |

F

127 |
128 |
fold(byte) - Method in class com.adroll.cantor.HLLCounter
129 |
130 |
Reduces the precision from p to q.
131 |
132 |
133 | 134 | 135 | 136 |
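Reducing precision from p to q, as `fold(byte)` is documented to do, can be done without re-reading the input. The following is a minimal sketch of one common way to fold HLL registers, under loudly stated assumptions: a 64-bit hash (SHA-1 truncated here, purely for illustration), the register index taken from the top p bits, and the register value being the position of the first 1 bit in the remaining bits. Cantor's actual hash and bit layout may differ, and `hash64`, `build` and `fold` are hypothetical names.

```python
import hashlib

HASH_BITS = 64  # assumption: 64-bit hashes


def hash64(item: str) -> int:
    """Illustrative 64-bit hash (first 8 bytes of SHA-1); not Cantor's hash."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def build(items, p: int):
    """Build HLL registers at precision p, indexing by the top p hash bits."""
    rest_bits = HASH_BITS - p
    regs = [0] * (1 << p)
    for item in items:
        h = hash64(item)
        idx = h >> rest_bits
        rest = h & ((1 << rest_bits) - 1)
        rho = rest_bits - rest.bit_length() + 1  # position of the first 1 bit
        regs[idx] = max(regs[idx], rho)
    return regs


def fold(regs, p: int, q: int):
    """Fold registers from precision p down to precision q (q <= p)."""
    if q > p:
        raise ValueError("can only fold to a lower or equal precision")
    d = p - q
    out = [0] * (1 << q)
    for i, v in enumerate(regs):
        if v == 0:
            continue  # empty register: nothing was ever hashed here
        j = i >> d                 # new index: top q bits of the old index
        low = i & ((1 << d) - 1)   # dropped index bits now lead the remainder
        # If the dropped bits contain a 1, it fixes the new rank; otherwise
        # the old rank is shifted by the d extra leading zeros.
        rho = v + d if low == 0 else d - low.bit_length() + 1
        out[j] = max(out[j], rho)
    return out


items = [f"user-{i}" for i in range(2000)]
print(fold(build(items, 12), 12, 8) == build(items, 8))
```

Under these assumptions the folded registers are identical to registers built directly at precision q, which is what makes it possible to combine counters of different precisions after folding the finer one.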

G

137 |
138 |
get() - Method in class com.adroll.cantor.HLLWritable
139 |
140 |
Returns a new HLLCounter that is constructed 141 | from the internal representation of the HLLCounter 142 | that this HLLWritable contains.
143 |
144 |
getByteArray() - Method in class com.adroll.cantor.HLLCounter
145 |
146 |
Returns the raw HLL structure.
147 |
148 |
getK() - Method in class com.adroll.cantor.HLLCounter
149 |
150 |
Returns the precision of the MinHash structure.
151 |
152 |
getMinHash() - Method in class com.adroll.cantor.HLLCounter
153 |
154 |
Returns the raw MinHash structure.
155 |
156 |
getP() - Method in class com.adroll.cantor.HLLCounter
157 |
158 |
Returns the precision of the HLL structure.
159 |
160 |
161 | 162 | 163 | 164 |

H

165 |
166 |
hashCode() - Method in class com.adroll.cantor.HLLWritable
167 |
168 |
Hashes this HLLWritable based on its 169 | internal structures.
170 |
171 |
HLLCounter - Class in com.adroll.cantor
172 |
173 |
HLLCounter allows for cardinality estimation of 174 | large sets with a compact data structure.
175 |
176 |
HLLCounter() - Constructor for class com.adroll.cantor.HLLCounter
177 |
178 |
Constructs a non-intersectable HLLCounter 179 | that can be used to estimate the cardinality of a set 180 | of items with precision DEFAULT_P.
181 |
182 |
HLLCounter(byte) - Constructor for class com.adroll.cantor.HLLCounter
183 |
184 |
Constructs a non-intersectable HLLCounter object 185 | that can be used to estimate the cardinality of a set of items 186 | with given precision.
187 |
188 |
HLLCounter(boolean) - Constructor for class com.adroll.cantor.HLLCounter
189 |
190 |
Constructs an HLLCounter that can be used to 191 | estimate the cardinality of a set of items with 192 | DEFAULT_P and, if intersectable is 193 | true, DEFAULT_K.
194 |
195 |
HLLCounter(byte, boolean) - Constructor for class com.adroll.cantor.HLLCounter
196 |
197 |
Constructs an HLLCounter that can be used to 198 | estimate the cardinality of a set of items with specified 199 | precision and, if intersectable is 200 | true, k is a reasonable precision 201 | guess based on p.
202 |
203 |
HLLCounter(boolean, int) - Constructor for class com.adroll.cantor.HLLCounter
204 |
205 |
Constructs an HLLCounter that can be used to 206 | estimate the cardinality of a set of items with 207 | DEFAULT_P and, if intersectable is 208 | true, specified k.
209 |
210 |
HLLCounter(byte, boolean, int) - Constructor for class com.adroll.cantor.HLLCounter
211 |
212 |
Constructs an HLLCounter that can be used to 213 | estimate the cardinality of a set of items with specified 214 | precision and, if intersectable is 215 | true, specified k.
216 |
217 |
HLLCounter(byte, boolean, int, byte[], TreeSet<Long>) - Constructor for class com.adroll.cantor.HLLCounter
218 |
219 |
Constructs an HLLCounter that can be used to 220 | estimate the cardinality of a set of items with specified 221 | precision and, if intersectable is 222 | true, specified k, along with 223 | pre-computed HLL and MinHash structures.
224 |
225 |
HLLWritable - Class in com.adroll.cantor
226 |
227 |
HLLWritable allows for serialization and 228 | deserialization of HLLCounter objects in a 229 | Hadoop framework.
230 |
231 |
HLLWritable() - Constructor for class com.adroll.cantor.HLLWritable
232 |
233 |
Constructs an HLLWritable that contains a representation 234 | of the default HLLCounter constructed by 235 | HLLCounter().
236 |
237 |
HLLWritable(HLLCounter) - Constructor for class com.adroll.cantor.HLLWritable
238 |
239 |
Constructs an HLLWritable that contains a representation 240 | of the provided HLLCounter.
241 |
242 |
HLLWritable(byte, int, int, byte[], long[]) - Constructor for class com.adroll.cantor.HLLWritable
243 |
244 |
Constructs an HLLWritable with the given set of fields.
245 |
246 |
247 | 248 | 249 | 250 |

I

251 |
252 |
intersect(HLLCounter...) - Static method in class com.adroll.cantor.HLLCounter
253 |
254 |
Returns an estimate of the size of the intersection 255 | of the given HLLCounters.
256 |
257 |
isIntersectable() - Method in class com.adroll.cantor.HLLCounter
258 |
259 |
Returns whether this structure is intersectable.
260 |
261 |
262 | 263 | 264 | 265 |
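The intersection estimate described above is usually obtained by multiplying a MinHash estimate of the Jaccard index by the cardinality of the union. The sketch below shows that idea with bottom-k sketches (the k smallest hashes of each set). It is a hedged illustration, not Cantor's implementation: the hash (truncated SHA-1), the helper names, and passing the union size in as an explicit argument are all assumptions; in the library the union size would presumably come from the combined HLL structure.

```python
import hashlib
import heapq


def hash64(item: str) -> int:
    """Illustrative 64-bit hash (first 8 bytes of SHA-1); not Cantor's hash."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def bottom_k(items, k: int):
    """Return the k smallest distinct hashes of the items as a set."""
    return set(heapq.nsmallest(k, {hash64(x) for x in items}))


def jaccard_estimate(sketches, k: int) -> float:
    """Estimate |intersection| / |union| from bottom-k sketches."""
    smallest = heapq.nsmallest(k, set().union(*sketches))
    if not smallest:
        return 0.0
    # A hash among the k smallest of the union belongs to the intersection
    # exactly when every sketch contains it.
    shared = sum(1 for h in smallest if all(h in s for s in sketches))
    return shared / len(smallest)


def intersection_estimate(sketches, k: int, union_size: float) -> float:
    """Scale the Jaccard estimate by the (estimated) size of the union."""
    return jaccard_estimate(sketches, k) * union_size


a = [str(i) for i in range(0, 100)]
b = [str(i) for i in range(50, 150)]
k = 1000  # larger than the union, so the sketches hold every hash
print(intersection_estimate([bottom_k(a, k), bottom_k(b, k)], k, union_size=150))
```

With k smaller than the union, the same code returns a noisy estimate whose relative error grows as the intersection shrinks relative to the union, which is the trade-off the MinHash k table quantifies.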

K

266 |
267 |
k - Variable in class com.adroll.cantor.HLLWritable
268 |
269 |
The MinHash precision of the contained HLLCounter representation.
270 |
271 |
272 | 273 | 274 | 275 |

M

276 |
277 |
M - Variable in class com.adroll.cantor.HLLWritable
278 |
279 |
The HLL structure of the contained HLLCounter representation.
280 |
281 |
MAX_P - Static variable in class com.adroll.cantor.HLLCounter
282 |
283 |
Maximum HLL precision of 18
284 |
285 |
MIN_P - Static variable in class com.adroll.cantor.HLLCounter
286 |
287 |
Minimum HLL precision of 4
288 |
289 |
minhash - Variable in class com.adroll.cantor.HLLWritable
290 |
291 |
The contents of the MinHash structure of the contained 292 | HLLCounter representation.
293 |
294 |
295 | 296 | 297 | 298 |

P

299 |
300 |
p - Variable in class com.adroll.cantor.HLLWritable
301 |
302 |
The HLL precision of the contained HLLCounter representation.
303 |
304 |
put(String) - Method in class com.adroll.cantor.HLLCounter
305 |
306 |
Insert an element into the HLLCounter structure.
307 |
308 |
put(String...) - Method in class com.adroll.cantor.HLLCounter
309 |
310 |
Insert multiple elements into the HLLCounter 311 | structure.
312 |
313 |
314 | 315 | 316 | 317 |

R

318 |
319 |
readFields(DataInput) - Method in class com.adroll.cantor.HLLWritable
320 |
321 |
Deserialize the fields of this HLLWritable 322 | from the given DataInput.
323 |
324 |
325 | 326 | 327 | 328 |

S

329 |
330 |
s - Variable in class com.adroll.cantor.HLLWritable
331 |
332 |
The number of current elements in the MinHash structure 333 | of the contained HLLCounter representation.
334 |
335 |
safeUnion(byte[], byte[]) - Static method in class com.adroll.cantor.HLLCounter
336 |
337 |
Returns an HLL structure that is the effective union 338 | of two other HLL structures.
339 |
340 |
set(HLLCounter) - Method in class com.adroll.cantor.HLLWritable
341 |
342 |
Encapsulates a representation of the given HLLCounter 343 | in this HLLWritable.
344 |
345 |
size() - Method in class com.adroll.cantor.HLLCounter
346 |
347 |
Returns the estimated number of unique insertions into 348 | the HLLCounter structure.
349 |
350 |
351 | 352 | 353 | 354 |
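To make the `put` and `size` entries above concrete, here is a minimal sketch of the classic HyperLogLog estimator: insert by hashing, then estimate from the harmonic mean of the registers, with linear counting for small cardinalities. This is deliberately the original HyperLogLog formula, not the HyperLogLog++ bias correction with linear interpolation that Cantor is described as using, so its results will differ from the library's. The hash (truncated SHA-1), the top-bits indexing, and the names `put` and `estimate_size` are assumptions for illustration.

```python
import hashlib
import math

HASH_BITS = 64  # assumption: 64-bit hashes


def hash64(item: str) -> int:
    """Illustrative 64-bit hash (first 8 bytes of SHA-1); not Cantor's hash."""
    return int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")


def put(registers, p: int, item: str) -> None:
    """Insert one element into registers of precision p (len == 2**p)."""
    rest_bits = HASH_BITS - p
    h = hash64(item)
    idx = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rho = rest_bits - rest.bit_length() + 1  # position of the first 1 bit
    registers[idx] = max(registers[idx], rho)


def alpha(m: int) -> float:
    """Bias constant from the original HyperLogLog analysis."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1.0 + 1.079 / m)


def estimate_size(registers) -> float:
    """Classic HLL estimate with linear counting for small cardinalities."""
    m = len(registers)
    raw = alpha(m) * m * m / sum(2.0 ** -v for v in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return m * math.log(m / zeros)  # linear counting
    return raw


p = 12
registers = [0] * (1 << p)
for i in range(50000):
    put(registers, p, f"user-{i}")
print(round(estimate_size(registers)))
```

Inserting the same element twice leaves the registers unchanged, which is why the structure counts unique insertions rather than total insertions.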

T

355 |
356 |
toString() - Method in class com.adroll.cantor.HLLWritable
357 |
358 |
Returns a String representation of this 359 | HLLWritable.
360 |
361 |
362 | 363 | 364 | 365 |

W

366 |
367 |
write(DataOutput) - Method in class com.adroll.cantor.HLLWritable
368 |
369 |
Serializes this HLLWritable to the given 370 | DataOutput.
371 |
372 |
373 | C D E F G H I K M P R S T W 
374 | 375 |
376 | 377 | 378 | 379 | 380 | 389 |
390 | 417 | 418 |


419 | 420 | 421 | -------------------------------------------------------------------------------- /docs/index.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | cantor 1.0.0 API 8 | 60 | 61 | 62 | 63 | 64 | 65 | <noscript> 66 | <div>JavaScript is disabled on your browser.</div> 67 | </noscript> 68 | <h2>Frame Alert</h2> 69 | <p>This document is designed to be viewed using the frames feature. If you see this message, you are using a non-frame-capable web client. Link to <a href="com/adroll/cantor/package-summary.html">Non-frame version</a>.</p> 70 | 71 | 72 | 73 | -------------------------------------------------------------------------------- /docs/overview-tree.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Class Hierarchy (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Hierarchy For All Packages

67 | Package Hierarchies: 68 | 71 |
72 |
73 |

Class Hierarchy

74 | 82 |
83 | 84 |
85 | 86 | 87 | 88 | 89 | 98 |
99 | 126 | 127 |


128 | 129 | 130 | -------------------------------------------------------------------------------- /docs/package-list: -------------------------------------------------------------------------------- 1 | com.adroll.cantor 2 | -------------------------------------------------------------------------------- /docs/resources/background.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdRoll/cantor/76b4e1b4fcca28e57e3a23cb6ea61fb428275442/docs/resources/background.gif -------------------------------------------------------------------------------- /docs/resources/tab.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdRoll/cantor/76b4e1b4fcca28e57e3a23cb6ea61fb428275442/docs/resources/tab.gif -------------------------------------------------------------------------------- /docs/resources/titlebar.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdRoll/cantor/76b4e1b4fcca28e57e3a23cb6ea61fb428275442/docs/resources/titlebar.gif -------------------------------------------------------------------------------- /docs/resources/titlebar_end.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/AdRoll/cantor/76b4e1b4fcca28e57e3a23cb6ea61fb428275442/docs/resources/titlebar_end.gif -------------------------------------------------------------------------------- /docs/serialized-form.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | Serialized Form (cantor 1.0.0 API) 8 | 9 | 10 | 11 | 12 | 18 | 21 | 22 |
23 | 24 | 25 | 26 | 27 | 36 |
37 | 64 | 65 |
66 |

Serialized Form

67 |
68 |
69 | 126 |
127 | 128 |
129 | 130 | 131 | 132 | 133 | 142 |
143 | 170 | 171 |


172 | 173 | 174 | -------------------------------------------------------------------------------- /docs/stylesheet.css: -------------------------------------------------------------------------------- 1 | /* Javadoc style sheet */ 2 | /* 3 | Overall document style 4 | */ 5 | body { 6 | background-color:#ffffff; 7 | color:#353833; 8 | font-family:Arial, Helvetica, sans-serif; 9 | font-size:76%; 10 | margin:0; 11 | } 12 | a:link, a:visited { 13 | text-decoration:none; 14 | color:#4c6b87; 15 | } 16 | a:hover, a:focus { 17 | text-decoration:none; 18 | color:#bb7a2a; 19 | } 20 | a:active { 21 | text-decoration:none; 22 | color:#4c6b87; 23 | } 24 | a[name] { 25 | color:#353833; 26 | } 27 | a[name]:hover { 28 | text-decoration:none; 29 | color:#353833; 30 | } 31 | pre { 32 | font-size:1.3em; 33 | } 34 | h1 { 35 | font-size:1.8em; 36 | } 37 | h2 { 38 | font-size:1.5em; 39 | } 40 | h3 { 41 | font-size:1.4em; 42 | } 43 | h4 { 44 | font-size:1.3em; 45 | } 46 | h5 { 47 | font-size:1.2em; 48 | } 49 | h6 { 50 | font-size:1.1em; 51 | } 52 | ul { 53 | list-style-type:disc; 54 | } 55 | code, tt { 56 | font-size:1.2em; 57 | } 58 | dt code { 59 | font-size:1.2em; 60 | } 61 | table tr td dt code { 62 | font-size:1.2em; 63 | vertical-align:top; 64 | } 65 | sup { 66 | font-size:.6em; 67 | } 68 | /* 69 | Document title and Copyright styles 70 | */ 71 | .clear { 72 | clear:both; 73 | height:0px; 74 | overflow:hidden; 75 | } 76 | .aboutLanguage { 77 | float:right; 78 | padding:0px 21px; 79 | font-size:.8em; 80 | z-index:200; 81 | margin-top:-7px; 82 | } 83 | .legalCopy { 84 | margin-left:.5em; 85 | } 86 | .bar a, .bar a:link, .bar a:visited, .bar a:active { 87 | color:#FFFFFF; 88 | text-decoration:none; 89 | } 90 | .bar a:hover, .bar a:focus { 91 | color:#bb7a2a; 92 | } 93 | .tab { 94 | background-color:#0066FF; 95 | background-image:url(resources/titlebar.gif); 96 | background-position:left top; 97 | background-repeat:no-repeat; 98 | color:#ffffff; 99 | padding:8px; 100 | width:5em; 101 
| font-weight:bold; 102 | } 103 | /* 104 | Navigation bar styles 105 | */ 106 | .bar { 107 | background-image:url(resources/background.gif); 108 | background-repeat:repeat-x; 109 | color:#FFFFFF; 110 | padding:.8em .5em .4em .8em; 111 | height:auto;/*height:1.8em;*/ 112 | font-size:1em; 113 | margin:0; 114 | } 115 | .topNav { 116 | background-image:url(resources/background.gif); 117 | background-repeat:repeat-x; 118 | color:#FFFFFF; 119 | float:left; 120 | padding:0; 121 | width:100%; 122 | clear:right; 123 | height:2.8em; 124 | padding-top:10px; 125 | overflow:hidden; 126 | } 127 | .bottomNav { 128 | margin-top:10px; 129 | background-image:url(resources/background.gif); 130 | background-repeat:repeat-x; 131 | color:#FFFFFF; 132 | float:left; 133 | padding:0; 134 | width:100%; 135 | clear:right; 136 | height:2.8em; 137 | padding-top:10px; 138 | overflow:hidden; 139 | } 140 | .subNav { 141 | background-color:#dee3e9; 142 | border-bottom:1px solid #9eadc0; 143 | float:left; 144 | width:100%; 145 | overflow:hidden; 146 | } 147 | .subNav div { 148 | clear:left; 149 | float:left; 150 | padding:0 0 5px 6px; 151 | } 152 | ul.navList, ul.subNavList { 153 | float:left; 154 | margin:0 25px 0 0; 155 | padding:0; 156 | } 157 | ul.navList li{ 158 | list-style:none; 159 | float:left; 160 | padding:3px 6px; 161 | } 162 | ul.subNavList li{ 163 | list-style:none; 164 | float:left; 165 | font-size:90%; 166 | } 167 | .topNav a:link, .topNav a:active, .topNav a:visited, .bottomNav a:link, .bottomNav a:active, .bottomNav a:visited { 168 | color:#FFFFFF; 169 | text-decoration:none; 170 | } 171 | .topNav a:hover, .bottomNav a:hover { 172 | text-decoration:none; 173 | color:#bb7a2a; 174 | } 175 | .navBarCell1Rev { 176 | background-image:url(resources/tab.gif); 177 | background-color:#a88834; 178 | color:#FFFFFF; 179 | margin: auto 5px; 180 | border:1px solid #c9aa44; 181 | } 182 | /* 183 | Page header and footer styles 184 | */ 185 | .header, .footer { 186 | clear:both; 187 | margin:0 
20px; 188 | padding:5px 0 0 0; 189 | } 190 | .indexHeader { 191 | margin:10px; 192 | position:relative; 193 | } 194 | .indexHeader h1 { 195 | font-size:1.3em; 196 | } 197 | .title { 198 | color:#2c4557; 199 | margin:10px 0; 200 | } 201 | .subTitle { 202 | margin:5px 0 0 0; 203 | } 204 | .header ul { 205 | margin:0 0 25px 0; 206 | padding:0; 207 | } 208 | .footer ul { 209 | margin:20px 0 5px 0; 210 | } 211 | .header ul li, .footer ul li { 212 | list-style:none; 213 | font-size:1.2em; 214 | } 215 | /* 216 | Heading styles 217 | */ 218 | div.details ul.blockList ul.blockList ul.blockList li.blockList h4, div.details ul.blockList ul.blockList ul.blockListLast li.blockList h4 { 219 | background-color:#dee3e9; 220 | border-top:1px solid #9eadc0; 221 | border-bottom:1px solid #9eadc0; 222 | margin:0 0 6px -8px; 223 | padding:2px 5px; 224 | } 225 | ul.blockList ul.blockList ul.blockList li.blockList h3 { 226 | background-color:#dee3e9; 227 | border-top:1px solid #9eadc0; 228 | border-bottom:1px solid #9eadc0; 229 | margin:0 0 6px -8px; 230 | padding:2px 5px; 231 | } 232 | ul.blockList ul.blockList li.blockList h3 { 233 | padding:0; 234 | margin:15px 0; 235 | } 236 | ul.blockList li.blockList h2 { 237 | padding:0px 0 20px 0; 238 | } 239 | /* 240 | Page layout container styles 241 | */ 242 | .contentContainer, .sourceContainer, .classUseContainer, .serializedFormContainer, .constantValuesContainer { 243 | clear:both; 244 | padding:10px 20px; 245 | position:relative; 246 | } 247 | .indexContainer { 248 | margin:10px; 249 | position:relative; 250 | font-size:1.0em; 251 | } 252 | .indexContainer h2 { 253 | font-size:1.1em; 254 | padding:0 0 3px 0; 255 | } 256 | .indexContainer ul { 257 | margin:0; 258 | padding:0; 259 | } 260 | .indexContainer ul li { 261 | list-style:none; 262 | } 263 | .contentContainer .description dl dt, .contentContainer .details dl dt, .serializedFormContainer dl dt { 264 | font-size:1.1em; 265 | font-weight:bold; 266 | margin:10px 0 0 0; 267 | 
color:#4E4E4E; 268 | } 269 | .contentContainer .description dl dd, .contentContainer .details dl dd, .serializedFormContainer dl dd { 270 | margin:10px 0 10px 20px; 271 | } 272 | .serializedFormContainer dl.nameValue dt { 273 | margin-left:1px; 274 | font-size:1.1em; 275 | display:inline; 276 | font-weight:bold; 277 | } 278 | .serializedFormContainer dl.nameValue dd { 279 | margin:0 0 0 1px; 280 | font-size:1.1em; 281 | display:inline; 282 | } 283 | /* 284 | List styles 285 | */ 286 | ul.horizontal li { 287 | display:inline; 288 | font-size:0.9em; 289 | } 290 | ul.inheritance { 291 | margin:0; 292 | padding:0; 293 | } 294 | ul.inheritance li { 295 | display:inline; 296 | list-style:none; 297 | } 298 | ul.inheritance li ul.inheritance { 299 | margin-left:15px; 300 | padding-left:15px; 301 | padding-top:1px; 302 | } 303 | ul.blockList, ul.blockListLast { 304 | margin:10px 0 10px 0; 305 | padding:0; 306 | } 307 | ul.blockList li.blockList, ul.blockListLast li.blockList { 308 | list-style:none; 309 | margin-bottom:25px; 310 | } 311 | ul.blockList ul.blockList li.blockList, ul.blockList ul.blockListLast li.blockList { 312 | padding:0px 20px 5px 10px; 313 | border:1px solid #9eadc0; 314 | background-color:#f9f9f9; 315 | } 316 | ul.blockList ul.blockList ul.blockList li.blockList, ul.blockList ul.blockList ul.blockListLast li.blockList { 317 | padding:0 0 5px 8px; 318 | background-color:#ffffff; 319 | border:1px solid #9eadc0; 320 | border-top:none; 321 | } 322 | ul.blockList ul.blockList ul.blockList ul.blockList li.blockList { 323 | margin-left:0; 324 | padding-left:0; 325 | padding-bottom:15px; 326 | border:none; 327 | border-bottom:1px solid #9eadc0; 328 | } 329 | ul.blockList ul.blockList ul.blockList ul.blockList li.blockListLast { 330 | list-style:none; 331 | border-bottom:none; 332 | padding-bottom:0; 333 | } 334 | table tr td dl, table tr td dl dt, table tr td dl dd { 335 | margin-top:0; 336 | margin-bottom:1px; 337 | } 338 | /* 339 | Table styles 340 | */ 341 | 
.contentContainer table, .classUseContainer table, .constantValuesContainer table { 342 | border-bottom:1px solid #9eadc0; 343 | width:100%; 344 | } 345 | .contentContainer ul li table, .classUseContainer ul li table, .constantValuesContainer ul li table { 346 | width:100%; 347 | } 348 | .contentContainer .description table, .contentContainer .details table { 349 | border-bottom:none; 350 | } 351 | .contentContainer ul li table th.colOne, .contentContainer ul li table th.colFirst, .contentContainer ul li table th.colLast, .classUseContainer ul li table th, .constantValuesContainer ul li table th, .contentContainer ul li table td.colOne, .contentContainer ul li table td.colFirst, .contentContainer ul li table td.colLast, .classUseContainer ul li table td, .constantValuesContainer ul li table td{ 352 | vertical-align:top; 353 | padding-right:20px; 354 | } 355 | .contentContainer ul li table th.colLast, .classUseContainer ul li table th.colLast,.constantValuesContainer ul li table th.colLast, 356 | .contentContainer ul li table td.colLast, .classUseContainer ul li table td.colLast,.constantValuesContainer ul li table td.colLast, 357 | .contentContainer ul li table th.colOne, .classUseContainer ul li table th.colOne, 358 | .contentContainer ul li table td.colOne, .classUseContainer ul li table td.colOne { 359 | padding-right:3px; 360 | } 361 | .overviewSummary caption, .packageSummary caption, .contentContainer ul.blockList li.blockList caption, .summary caption, .classUseContainer caption, .constantValuesContainer caption { 362 | position:relative; 363 | text-align:left; 364 | background-repeat:no-repeat; 365 | color:#FFFFFF; 366 | font-weight:bold; 367 | clear:none; 368 | overflow:hidden; 369 | padding:0px; 370 | margin:0px; 371 | } 372 | caption a:link, caption a:hover, caption a:active, caption a:visited { 373 | color:#FFFFFF; 374 | } 375 | .overviewSummary caption span, .packageSummary caption span, .contentContainer ul.blockList li.blockList caption span, 
.summary caption span, .classUseContainer caption span, .constantValuesContainer caption span { 376 | white-space:nowrap; 377 | padding-top:8px; 378 | padding-left:8px; 379 | display:block; 380 | float:left; 381 | background-image:url(resources/titlebar.gif); 382 | height:18px; 383 | } 384 | .overviewSummary .tabEnd, .packageSummary .tabEnd, .contentContainer ul.blockList li.blockList .tabEnd, .summary .tabEnd, .classUseContainer .tabEnd, .constantValuesContainer .tabEnd { 385 | width:10px; 386 | background-image:url(resources/titlebar_end.gif); 387 | background-repeat:no-repeat; 388 | background-position:top right; 389 | position:relative; 390 | float:left; 391 | } 392 | ul.blockList ul.blockList li.blockList table { 393 | margin:0 0 12px 0px; 394 | width:100%; 395 | } 396 | .tableSubHeadingColor { 397 | background-color: #EEEEFF; 398 | } 399 | .altColor { 400 | background-color:#eeeeef; 401 | } 402 | .rowColor { 403 | background-color:#ffffff; 404 | } 405 | .overviewSummary td, .packageSummary td, .contentContainer ul.blockList li.blockList td, .summary td, .classUseContainer td, .constantValuesContainer td { 406 | text-align:left; 407 | padding:3px 3px 3px 7px; 408 | } 409 | th.colFirst, th.colLast, th.colOne, .constantValuesContainer th { 410 | background:#dee3e9; 411 | border-top:1px solid #9eadc0; 412 | border-bottom:1px solid #9eadc0; 413 | text-align:left; 414 | padding:3px 3px 3px 7px; 415 | } 416 | td.colOne a:link, td.colOne a:active, td.colOne a:visited, td.colOne a:hover, td.colFirst a:link, td.colFirst a:active, td.colFirst a:visited, td.colFirst a:hover, td.colLast a:link, td.colLast a:active, td.colLast a:visited, td.colLast a:hover, .constantValuesContainer td a:link, .constantValuesContainer td a:active, .constantValuesContainer td a:visited, .constantValuesContainer td a:hover { 417 | font-weight:bold; 418 | } 419 | td.colFirst, th.colFirst { 420 | border-left:1px solid #9eadc0; 421 | white-space:nowrap; 422 | } 423 | td.colLast, th.colLast { 424 
| border-right:1px solid #9eadc0; 425 | } 426 | td.colOne, th.colOne { 427 | border-right:1px solid #9eadc0; 428 | border-left:1px solid #9eadc0; 429 | } 430 | table.overviewSummary { 431 | padding:0px; 432 | margin-left:0px; 433 | } 434 | table.overviewSummary td.colFirst, table.overviewSummary th.colFirst, 435 | table.overviewSummary td.colOne, table.overviewSummary th.colOne { 436 | width:25%; 437 | vertical-align:middle; 438 | } 439 | table.packageSummary td.colFirst, table.overviewSummary th.colFirst { 440 | width:25%; 441 | vertical-align:middle; 442 | } 443 | /* 444 | Content styles 445 | */ 446 | .description pre { 447 | margin-top:0; 448 | } 449 | .deprecatedContent { 450 | margin:0; 451 | padding:10px 0; 452 | } 453 | .docSummary { 454 | padding:0; 455 | } 456 | /* 457 | Formatting effect styles 458 | */ 459 | .sourceLineNo { 460 | color:green; 461 | padding:0 30px 0 0; 462 | } 463 | h1.hidden { 464 | visibility:hidden; 465 | overflow:hidden; 466 | font-size:.9em; 467 | } 468 | .block { 469 | display:block; 470 | margin:3px 0 0 0; 471 | } 472 | .strong { 473 | font-weight:bold; 474 | } 475 | -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 3 | 4 | 4.0.0 5 | com.adroll.cantor 6 | cantor 7 | 1.0.0 8 | 9 | 10 | 2.2.0 11 | UTF-8 12 | 1.8.11 13 | 14 | 15 | 16 | ${project.artifactId} 17 | 18 | 19 | org.apache.maven.plugins 20 | maven-compiler-plugin 21 | 2.3.2 22 | 23 | 1.7 24 | 1.7 25 | 26 | 27 | 28 | org.apache.maven.plugins 29 | maven-surefire-plugin 30 | 2.16 31 | 32 | true 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | main 41 | 42 | true 43 | 44 | 45 | 46 | 47 | 50 | org.apache.maven.plugins 51 | maven-shade-plugin 52 | 2.2 53 | 54 | 55 | package 56 | 57 | shade 58 | 59 | 60 | false 61 | 62 | 63 | *:* 64 | 65 | **/LICENSE* 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | junit 81 | junit 82 | 4.11 83 | test 84 | 
85 | 86 | org.slf4j 87 | slf4j-api 88 | 1.7.5 89 | 90 | 91 | org.apache.hadoop 92 | hadoop-common 93 | ${hadoop.version} 94 | 95 | 96 | 97 | 98 | -------------------------------------------------------------------------------- /src/main/java/com/adroll/cantor/HLLWritable.java: -------------------------------------------------------------------------------- 1 | package com.adroll.cantor; 2 | 3 | import java.io.DataInput; 4 | import java.io.DataOutput; 5 | import java.io.IOException; 6 | import java.util.Arrays; 7 | import java.util.TreeSet; 8 | 9 | import org.apache.hadoop.io.Writable; 10 | import org.slf4j.Logger; 11 | import org.slf4j.LoggerFactory; 12 | 13 | import com.adroll.cantor.HLLCounter; 14 | 15 | /** 16 | HLLWritable allows for serialization and 17 | deserialization of {@link HLLCounter} objects in a 18 | Hadoop framework. 19 | */ 20 | public class HLLWritable implements Writable { 21 | 22 | private static final Logger LOG = LoggerFactory.getLogger(HLLWritable.class); 23 | 24 | /** The HLL precision of the contained HLLCounter representation. 25 | {@link HLLCounter#MIN_P} <= p <= {@link HLLCounter#MAX_P}. 26 | */ 27 | protected byte p; 28 | /** The MinHash precision of the contained HLLCounter representation. */ 29 | protected int k; 30 | /** The number of current elements in the MinHash structure 31 | of the contained HLLCounter representation. */ 32 | protected int s; 33 | /** The HLL structure of the contained HLLCounter representation. */ 34 | protected byte[] M; 35 | /** The contents of the MinHash structure of the contained 36 | HLLCounter representation.*/ 37 | protected long[] minhash; 38 | 39 | /** 40 | Constructs an HLLWritable that contains a representation 41 | of the default HLLCounter constructed by 42 | {@link HLLCounter#HLLCounter()}. 43 | */ 44 | public HLLWritable() { 45 | set(new HLLCounter()); 46 | } 47 | 48 | /** 49 | Constructs an HLLWritable that contains a representation 50 | of the provided HLLCounter. 
51 | 52 | @param h the HLLCounter to represent and contain 53 | */ 54 | public HLLWritable(HLLCounter h) { 55 | set(h); 56 | } 57 | 58 | /** 59 | Constructs an HLLWritable with the given set of fields. 60 | 61 | @param p the byte precision of the HLL 62 | structure. {@link HLLCounter#MIN_P} <= p 63 | <= {@link HLLCounter#MAX_P}. 64 | @param k the int precision of the MinHash 65 | structure 66 | @param s the int number of elements in the 67 | MinHash structure 68 | @param M the byte[] HLL structure 69 | @param minhash the long[] elements in the MinHash 70 | structure 71 | */ 72 | public HLLWritable(byte p, int k, int s, byte[] M, long[] minhash){ 73 | this.p = p; 74 | this.k = k; 75 | this.s = s; 76 | this.M = M; 77 | this.minhash = minhash; 78 | } 79 | 80 | /** 81 | Encapsulates a representation of the given HLLCounter 82 | in this HLLWritable. 83 | 84 | @param h the HLLCounter to represent and contain 85 | */ 86 | public void set(HLLCounter h) { 87 | p = h.getP(); 88 | M = h.getByteArray(); 89 | k = h.getK(); 90 | if(h.isIntersectable()){ 91 | s = h.getMinHash().size(); 92 | } else { 93 | s = 0; 94 | } 95 | if(minhash == null || minhash.length != s){ 96 | minhash = new long[s]; 97 | } 98 | int i = 0; 99 | if(h.getMinHash() != null){ 100 | for(Long l : h.getMinHash()){ 101 | minhash[i] = l; 102 | i++; 103 | } 104 | } 105 | } 106 | 107 | /** 108 | Returns a new HLLCounter that is constructed 109 | from the internal representation of the HLLCounter 110 | that this HLLWritable contains. 111 | 112 | @return the HLLCounter this HLLWritable 113 | represents. 114 | */ 115 | public HLLCounter get() { 116 | TreeSet<Long> ts = new TreeSet<Long>(); 117 | for(long l : minhash){ 118 | ts.add(l); 119 | } 120 | HLLCounter hll = new HLLCounter(p, k > 0, k, M, ts); 121 | return hll; 122 | } 123 | 124 | /** 125 | Returns a new HLLWritable that contains a 126 | representation of combining its internal 127 | HLLCounter's representation with 128 | the other's. 129 |

130 | It is functionally equivalent to combining two 131 | HLLCounters 132 | ({@link HLLCounter#combine(HLLCounter h)}) and creating a 133 | new HLLWritable out of that. 134 |

135 | Returns null if the combination fails. 136 | 137 | @param other the HLLWritable to combine 138 | @return the HLLWritable that represents 139 | the union, null if fails, 140 | this if other 141 | is null. 142 | */ 143 | public HLLWritable combine(HLLWritable other){ 144 | if(other == null){ 145 | return this; 146 | } 147 | 148 | byte newP = (byte)Math.min(p, other.p); 149 | int newK = Math.min(k, other.k); 150 | byte[] newM = HLLCounter.safeUnion(M, other.M); 151 | // newMinhash will hold at most newK elements, but possibly less 152 | long[] newMinhash = new long[newK]; 153 | int i=0, j=0; 154 | int newS=0; 155 | 156 | try { 157 | if(newK > 0){ 158 | while ( i < s && j < other.s && newS < newK){ 159 | long left = minhash[i]; 160 | long right = other.minhash[j]; 161 | if(left < right){ 162 | newMinhash[newS] = left; 163 | i++; 164 | } else if(left > right){ 165 | newMinhash[newS] = right; 166 | j++; 167 | } else { // left == right 168 | newMinhash[newS] = left; 169 | i++; 170 | j++; 171 | } 172 | newS++; 173 | } 174 | while( i < s && newS < newK){ 175 | newMinhash[newS] = minhash[i]; 176 | i++; 177 | newS++; 178 | } 179 | while(j < other.s && newS < newK){ 180 | newMinhash[newS] = other.minhash[j]; 181 | j++; 182 | newS++; 183 | } 184 | // We allocated an array of newK size, but it's possible we didn't fill it up. 185 | // This would leave trailing 0's at the end of the array which we don't want to keep around. 186 | if (newS < newK) { 187 | newMinhash = Arrays.copyOf(newMinhash, newS); 188 | } 189 | } 190 | return new HLLWritable(newP, newK, newS, newM, newMinhash); 191 | } catch (Exception e){ 192 | LOG.error("Failed combining", e); 193 | return null; 194 | } 195 | } 196 | 197 | // WritableComparable 198 | /** 199 | Serializes this HLLWritable to the given 200 | {@link java.io.DataOutput}. 201 |

202 | Generally, this method should not be called on its own. 203 | 204 | @param out the DataOutput object to write to 205 | */ 206 | public void write(DataOutput out) throws IOException { 207 | try{ 208 | // minhash is not maxed out, M is redundant so don't write it 209 | if (s < k) { 210 | // Use -p to signify no M 211 | out.writeByte(-p); 212 | out.writeInt(k); 213 | out.writeInt(s); 214 | for(int i=0; i < s; i++){ 215 | out.writeLong(minhash[i]); 216 | } 217 | } else { 218 | out.writeByte(p); 219 | out.writeInt(k); 220 | out.writeInt(s); 221 | for(byte b : M){ 222 | out.writeByte(b); 223 | } 224 | for(int i=0; i < s; i++){ 225 | out.writeLong(minhash[i]); 226 | } 227 | } 228 | } catch(Exception e){ 229 | LOG.warn("Failed writing", e); 230 | } 231 | } 232 | 233 | /** 234 | Deserialize the fields of this HLLWritable 235 | from the given {@link java.io.DataInput}. 236 |

237 | Generally, this method should not be called on its own. 238 | For efficiency, implementations should attempt to re-use 239 | storage in the existing object where possible. 240 | 241 | @param in the DataInput to read from 242 | */ 243 | public void readFields(DataInput in) throws IOException { 244 | try { 245 | p = in.readByte(); 246 | k = in.readInt(); 247 | s = in.readInt(); 248 | if(k == 0) { 249 | s = 0; 250 | } 251 | // If p is negative, M does not exist 252 | if (p < 0) { 253 | p = (byte) -p; 254 | int m = (int)Math.pow(2, p); 255 | M = new byte[m]; 256 | } else { 257 | int m = (int)Math.pow(2, p); 258 | M = new byte[m]; 259 | for(int i = 0; i < m; i++) { 260 | M[i] = in.readByte(); 261 | } 262 | } 263 | minhash = new long[s]; 264 | 265 | for(int i = 0; i < s; i++) { 266 | long x = in.readLong(); 267 | minhash[i] = x; 268 | /** 269 | * If p was negative, M is empty and we need to re-populate 270 | * If p was positive and we read M, this won't change anything since it's just max 271 | */ 272 | int idx = (int)(x >>> (64 - p)); 273 | long w = x << p; 274 | M[idx] = (byte)Math.max(M[idx], Long.numberOfLeadingZeros(w) + 1); 275 | } 276 | } catch(Exception e) { 277 | throw new IOException(e); 278 | } 279 | } 280 | 281 | /** 282 | Hashes this HLLWritable based on its 283 | internal structures. 284 | 285 | @return the int hash value 286 | */ 287 | @Override 288 | public int hashCode() { 289 | final int prime = 31; 290 | int result = 1; 291 | result = prime * result + Arrays.hashCode(M); 292 | result = prime * result + k; 293 | result = prime * result + Arrays.hashCode(minhash); 294 | result = prime * result + p; 295 | result = prime * result + s; 296 | return result; 297 | } 298 | 299 | /** 300 | Returns whether this HLLWritable 301 | is equivalent to the given Object. 302 |

303 | If the input is another HLLWritable, 304 | the two are considered equivalent if all of their 305 | fields are equivalent (that is, the two 306 | HLLCounters likely saw the exact same 307 | data). 308 | 309 | @param obj the Object to compare to 310 | 311 | @return the boolean of the comparison 312 | */ 313 | @Override 314 | public boolean equals(Object obj) { 315 | if (this == obj) { 316 | return true; 317 | } 318 | if (obj == null) { 319 | return false; 320 | } 321 | if (getClass() != obj.getClass()) { 322 | return false; 323 | } 324 | HLLWritable other = (HLLWritable) obj; 325 | if (!Arrays.equals(M, other.M)) { 326 | return false; 327 | } 328 | if (k != other.k) { 329 | return false; 330 | } 331 | if (!Arrays.equals(minhash, other.minhash)) { 332 | return false; 333 | } 334 | if (p != other.p) { 335 | return false; 336 | } 337 | if (s != other.s) { 338 | return false; 339 | } 340 | return true; 341 | } 342 | 343 | /** 344 | Returns a String representation of this 345 | HLLWritable. 346 |

347 | The String encodes the p, 348 | k, and s fields. 349 | 350 | @return the String representation 351 | */ 352 | @Override 353 | public String toString() { 354 | return "HLLWritable [p=" + p + ", k=" + k + ", s=" + s + "]"; 355 | } 356 | } 357 | -------------------------------------------------------------------------------- /src/main/java/com/adroll/cantor/package-info.java: -------------------------------------------------------------------------------- 1 | /** 2 | Cantor provides utilities for estimating the cardinality 3 | of large sets. 4 |

5 | The algorithms herein are parallelizable, and a Hadoop 6 | wrapper class is provided for convenience. 7 |

8 | It employs most of the HyperLogLog++ algorithm as seen in 9 | <a href="http://research.google.com/pubs/pub40671.html"> 10 | this paper</a>, excluding the sparse scheme, and using 11 | a simple linear interpolation instead of kNN. In addition, 12 | it can use MinHash structures to estimate cardinalities of 13 | intersections of these sets, as described in 14 | <a href="http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html"> 15 | this blog post</a>. 16 |

17 | Both HyperLogLog and MinHash require a precision 18 | parameter. Basic guidelines are available as follows, 19 | and {@link com.adroll.cantor.HLLCounter#MIN_P} = 4 <= p <= 18 = 20 | {@link com.adroll.cantor.HLLCounter#MAX_P}. 21 |

HyperLogLog p @ 99.7% Confidence

 p    Relative Error
 4    75%
 5    65%
 6    47%
 7    32%
 8    23%
 9    16%
10    10%
11    8%
12    5%
13    4%
14    2.5%
15    2%
16    1.3%
17    1%
18    0.7%

MinHash k @ 99% Confidence
Rows: Relative Error. Columns: Intersection Size.

Relative Error    0.01%     0.1%      1.0%     5.0%     10.0%
100%              90000     9000      900      170      75
50%               313334    31334     3134     587      280
25%               -         116800    11520    2208     1040
10%               -         -         68455    13128    6210
58 | */ 59 | package com.adroll.cantor; -------------------------------------------------------------------------------- /src/test/java/com/adroll/cantor/TestHLLCounter.java: -------------------------------------------------------------------------------- 1 | package com.adroll.cantor; 2 | 3 | import java.io.File; 4 | import java.io.FileInputStream; 5 | import java.io.FileOutputStream; 6 | import java.io.IOException; 7 | import java.io.ObjectInputStream; 8 | import java.io.ObjectOutputStream; 9 | import java.util.Random; 10 | 11 | import static org.junit.Assert.*; 12 | 13 | import org.junit.Test; 14 | 15 | import com.adroll.cantor.HLLCounter; 16 | 17 | public class TestHLLCounter { 18 | 19 | @Test 20 | public void test_serialization() throws IOException, ClassNotFoundException { 21 | HLLCounter h = new HLLCounter(); 22 | h.put("a"); 23 | h.put("b"); 24 | h.put("c"); 25 | assertTrue(h.size() == 3L); 26 | 27 | File f = new File("hll.ser"); 28 | FileOutputStream fos = new FileOutputStream(f); 29 | ObjectOutputStream oos = new ObjectOutputStream(fos); 30 | oos.writeObject(h); 31 | oos.close(); 32 | 33 | FileInputStream fis = new FileInputStream(f); 34 | ObjectInputStream ois = new ObjectInputStream(fis); 35 | HLLCounter hi = (HLLCounter)ois.readObject(); 36 | assertTrue(hi.size() == 3L); 37 | hi.put("d"); 38 | assertTrue(hi.size() == 4L); 39 | 40 | assertTrue(f.delete()); 41 | } 42 | 43 | @Test 44 | public void test_combination() throws Exception { 45 | HLLCounter h1 = new HLLCounter(); 46 | h1.put("a"); 47 | h1.put("b"); 48 | h1.put("c"); 49 | assertTrue(h1.size() == 3L); 50 | 51 | HLLCounter h2 = new HLLCounter(); 52 | h2.put("d"); 53 | h2.put("e"); 54 | h2.put("f"); 55 | assertTrue(h2.size() == 3L); 56 | 57 | HLLCounter h3 = new HLLCounter(); 58 | h3.put("d"); 59 | h3.put("e"); 60 | h3.put("f"); 61 | assertTrue(h3.size() == 3L); 62 | 63 | h1.combine(h2); 64 | assertTrue(h1.size() == 6L); 65 | 66 | h1.combine(h3); 67 | assertTrue(h1.size() == 6L); 68 | 69 | 
h2.combine(h3); 70 | assertTrue(h2.size() == 3L); 71 | 72 | h1.clear(); 73 | h2.clear(); 74 | h3.clear(); 75 | 76 | //h1 and h3 are the same, h2 is a subset 77 | for(int i = 0; i < 1000000; i++) { 78 | String s = String.valueOf(Math.random()); 79 | h1.put(s); 80 | h3.put(s); 81 | if(i > 500000) { 82 | h2.put(s); 83 | } 84 | } 85 | 86 | //Add more uniques to h2, same to h3 87 | for(int i = 0; i < 1000000; i++) { 88 | String s = String.valueOf(Math.random()); 89 | h2.put(s); 90 | h3.put(s); 91 | } 92 | 93 | //So now the union of h1 and h2 should 94 | //be h3. 95 | h1.combine(h2); 96 | assertTrue(h3.size() == h1.size()); 97 | } 98 | 99 | @Test 100 | public void test_basic() { 101 | Random r = new Random(4618201L); 102 | HLLCounter h = new HLLCounter((byte)8); 103 | fillHLLCounter(h, r, 25851093); 104 | assertTrue(h.size() == 22787413L); 105 | 106 | r = new Random(8315542L); 107 | h = new HLLCounter((byte)9); 108 | fillHLLCounter(h, r, 4954434); 109 | assertTrue(h.size() == 5013953L); 110 | 111 | //default precision of HLLCounter.DEFAULT_P = 18 112 | h = new HLLCounter(); 113 | r = new Random(73919566L); 114 | fillHLLCounter(h, r, 17078033); 115 | assertTrue(h.size() == 17034653L); 116 | 117 | h.clear(); 118 | r = new Random(57189216L); 119 | fillHLLCounter(h, r, 18592874); 120 | assertTrue(h.size() == 18526241L); 121 | 122 | h.clear(); 123 | r = new Random(10821894L); 124 | fillHLLCounter(h, r, 3777716); 125 | assertTrue(h.size() == 3760602L); 126 | } 127 | 128 | @Test 129 | public void test_fold() { 130 | Random r = new Random(123456L); 131 | HLLCounter small = new HLLCounter((byte)8); 132 | fillHLLCounter(small, r, 100000); 133 | 134 | r = new Random(123456L); 135 | HLLCounter big = new HLLCounter((byte)12); 136 | fillHLLCounter(big, r, 100000); 137 | big.fold((byte)8); 138 | 139 | assertEquals(big.size(), small.size()); 140 | 141 | r = new Random(23456L); 142 | small = new HLLCounter((byte)4); 143 | fillHLLCounter(small, r, 100000); 144 | 145 | r = new 
Random(23456L); 146 | big = new HLLCounter((byte)18); 147 | fillHLLCounter(big, r, 100000); 148 | big.fold((byte)4); 149 | 150 | assertEquals(big.size(), small.size()); 151 | 152 | r = new Random(3456L); 153 | small = new HLLCounter((byte)7); 154 | fillHLLCounter(small, r, 100000); 155 | 156 | r = new Random(3456L); 157 | big = new HLLCounter((byte)16); 158 | fillHLLCounter(big, r, 100000); 159 | big.fold((byte)7); 160 | 161 | assertEquals(big.size(), small.size()); 162 | } 163 | 164 | @Test 165 | public void test_intersection() { 166 | HLLCounter h0 = new HLLCounter(true, 1024); 167 | HLLCounter h1 = new HLLCounter(true, 1024); 168 | HLLCounter h2 = new HLLCounter(true, 1024); 169 | HLLCounter h3 = new HLLCounter(true, 1024); 170 | for(int i = 0; i < 10000; i++) { 171 | h0.put(String.valueOf(i)); 172 | } 173 | for(int i = 5000; i < 15000; i++) { 174 | h1.put(String.valueOf(i)); 175 | } 176 | for(int i = 8000; i < 11000; i++) { 177 | h2.put(String.valueOf(i)); 178 | } 179 | for(int i = 8000; i < 9000; i++) { 180 | h3.put(String.valueOf(i)); 181 | } 182 | 183 | assertEquals(4853, HLLCounter.intersect(h0, h1)); //about 5000 184 | assertEquals(1922, HLLCounter.intersect(h0, h2)); //about 2000 185 | assertEquals(937, HLLCounter.intersect(h0, h3)); //about 1000 186 | assertEquals(2963, HLLCounter.intersect(h1, h2)); //about 3000 187 | assertEquals(958, HLLCounter.intersect(h1, h3)); //about 1000 188 | assertEquals(986, HLLCounter.intersect(h2, h3)); //about 1000 189 | assertEquals(1862, HLLCounter.intersect(h0, h1, h2)); //about 2000 190 | assertEquals(762, HLLCounter.intersect(h0, h1, h3)); //about 1000 191 | assertEquals(934, HLLCounter.intersect(h0, h2, h3)); //about 1000 192 | assertEquals(958, HLLCounter.intersect(h1, h2, h3)); //about 1000 193 | assertEquals(762, HLLCounter.intersect(h0, h1, h2, h3)); //about 1000 194 | assertEquals(0, HLLCounter.intersect()); 195 | assertEquals(0, HLLCounter.intersect(new HLLCounter(), h0)); 196 | 197 | } 198 | 199 | private void 
fillHLLCounter(HLLCounter h, Random r, int n) { 200 | for(int i = 0; i < n; i++) { 201 | h.put(String.valueOf(r.nextDouble())); 202 | } 203 | } 204 | } 205 | -------------------------------------------------------------------------------- /src/test/java/com/adroll/cantor/TestHLLWritable.java: -------------------------------------------------------------------------------- 1 | package com.adroll.cantor; 2 | 3 | import static org.junit.Assert.*; 4 | 5 | import java.io.ByteArrayInputStream; 6 | import java.io.ByteArrayOutputStream; 7 | import java.io.DataInputStream; 8 | import java.io.DataOutputStream; 9 | 10 | import org.junit.Test; 11 | 12 | import com.google.common.hash.Hashing; 13 | 14 | import com.adroll.cantor.HLLWritable; 15 | import com.adroll.cantor.HLLCounter; 16 | 17 | public class TestHLLWritable { 18 | 19 | @Test 20 | public void test_serialization() throws Exception { 21 | ByteArrayOutputStream baos = new ByteArrayOutputStream(); 22 | DataOutputStream out = new DataOutputStream(baos); 23 | 24 | HLLCounter hll = new HLLCounter(true); 25 | hll.put("one", "two", "three"); 26 | HLLWritable hllw = new HLLWritable(hll); 27 | hllw.write(out); 28 | 29 | DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray())); 30 | HLLWritable deserialized = new HLLWritable(); 31 | 32 | deserialized.readFields(in); 33 | 34 | HLLCounter d = deserialized.get(); 35 | assertEquals(HLLCounter.DEFAULT_P, d.getP()); 36 | assertEquals(3L, d.size()); 37 | assertEquals(HLLCounter.DEFAULT_K, d.getK()); 38 | assertTrue(d.isIntersectable()); 39 | assertArrayEquals(hll.getByteArray(), d.getByteArray()); 40 | assertArrayEquals(hll.getMinHash().toArray(), d.getMinHash().toArray()); 41 | } 42 | 43 | @Test 44 | public void test_serialization_non_intersectable() throws Exception { 45 | ByteArrayOutputStream baos = new ByteArrayOutputStream(); 46 | DataOutputStream out = new DataOutputStream(baos); 47 | 48 | HLLCounter hll = new HLLCounter(false); 49 | hll.put("one", 
"two", "three"); 50 | HLLWritable hllw = new HLLWritable(hll); 51 | hllw.write(out); 52 | 53 | DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray())); 54 | HLLWritable deserialized = new HLLWritable(); 55 | 56 | deserialized.readFields(in); 57 | 58 | HLLCounter d = deserialized.get(); 59 | assertEquals(HLLCounter.DEFAULT_P, d.getP()); 60 | assertEquals(3L, d.size()); 61 | assertEquals(0, d.getK()); 62 | assertFalse(d.isIntersectable()); 63 | assertNull(d.getMinHash()); 64 | assertArrayEquals(hll.getByteArray(), d.getByteArray()); 65 | } 66 | 67 | @Test 68 | public void test_serialization_larger_than_ts() throws Exception { 69 | ByteArrayOutputStream baos = new ByteArrayOutputStream(); 70 | DataOutputStream out = new DataOutputStream(baos); 71 | 72 | HLLCounter hll = new HLLCounter(true, 3); 73 | 74 | hll.put("one", "two", "three", "four", "five"); 75 | HLLWritable hllw = new HLLWritable(hll); 76 | hllw.write(out); 77 | 78 | DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray())); 79 | HLLWritable deserialized = new HLLWritable(); 80 | 81 | deserialized.readFields(in); 82 | 83 | HLLCounter d = deserialized.get(); 84 | assertEquals(HLLCounter.DEFAULT_P, d.getP()); 85 | assertEquals(5L, d.size()); 86 | assertEquals(3, d.getK()); 87 | assertTrue(d.isIntersectable()); 88 | assertArrayEquals(hll.getByteArray(), d.getByteArray()); 89 | assertArrayEquals(hll.getMinHash().toArray(), d.getMinHash().toArray()); 90 | } 91 | 92 | @Test 93 | public void test_serialization_M_created() throws Exception { 94 | /** 95 | * We need to test that M is properly created. 96 | * set p to 9 so it's different from the default of 18. 97 | * k needs to be be bigger than the number of elements so -p is written. 98 | * 99 | * Make sure M is recreated when the -p is read in. 
100 | */ 101 | ByteArrayOutputStream baos = new ByteArrayOutputStream(); 102 | DataOutputStream out = new DataOutputStream(baos); 103 | 104 | HLLCounter hll = new HLLCounter((byte)9, true, 256); 105 | 106 | hll.put("one", "two", "three", "four", "five"); 107 | HLLWritable hllw = new HLLWritable(hll); 108 | hllw.write(out); 109 | 110 | DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray())); 111 | HLLWritable deserialized = new HLLWritable(); 112 | 113 | deserialized.readFields(in); 114 | 115 | HLLCounter d = deserialized.get(); 116 | assertEquals((byte)9, d.getP()); 117 | assertEquals(5L, d.size()); 118 | assertEquals(256, d.getK()); 119 | assertTrue(d.isIntersectable()); 120 | assertArrayEquals(hll.getByteArray(), d.getByteArray()); 121 | assertArrayEquals(hll.getMinHash().toArray(), d.getMinHash().toArray()); 122 | } 123 | 124 | @Test 125 | public void test_intersection() throws Exception { 126 | ByteArrayOutputStream baos = new ByteArrayOutputStream(); 127 | DataOutputStream out = new DataOutputStream(baos); 128 | 129 | HLLCounter hll0 = new HLLCounter(true); 130 | for (int i=0; i<1000000; i++) { 131 | hll0.put(Hashing.md5().hashString(Integer.toString(i)).toString()); 132 | } 133 | 134 | HLLCounter hll1 = new HLLCounter(true); 135 | for (int i=10000; i<1100000; i++) { 136 | hll1.put(Hashing.md5().hashString(Integer.toString(i)).toString()); 137 | } 138 | HLLWritable hllw = new HLLWritable(hll0); 139 | hllw.write(out); 140 | 141 | hllw.set(hll1); 142 | hllw.write(out); 143 | 144 | DataInputStream in = new DataInputStream(new ByteArrayInputStream(baos.toByteArray())); 145 | HLLWritable deser0 = new HLLWritable(); 146 | deser0.readFields(in); 147 | 148 | HLLWritable deser1 = new HLLWritable(); 149 | deser1.readFields(in); 150 | 151 | assertEquals(998974, HLLCounter.intersect(deser0.get(), deser1.get())); 152 | } 153 | 154 | @Test 155 | public void test_set() throws Exception { 156 | HLLCounter h0 = new HLLCounter(); 157 | HLLCounter 
h1 = new HLLCounter(); 158 | h0.put("0", "1", "2"); 159 | h1.put("0", "1", "2", "3"); 160 | 161 | HLLWritable hllw = new HLLWritable(h0); 162 | hllw.set(h1); 163 | assertEquals(hllw.get().size(), 4L); 164 | } 165 | 166 | @Test 167 | public void test_combination() throws Exception { 168 | 169 | HLLCounter h1 = new HLLCounter(); 170 | h1.put("a"); 171 | h1.put("b"); 172 | h1.put("c"); 173 | 174 | HLLCounter h2 = new HLLCounter(); 175 | h2.put("d"); 176 | h2.put("e"); 177 | h2.put("f"); 178 | 179 | HLLCounter h3 = new HLLCounter(); 180 | h3.put("d"); 181 | h3.put("e"); 182 | h3.put("f"); 183 | 184 | HLLWritable w1 = new HLLWritable(h1); 185 | HLLWritable w2 = new HLLWritable(h2); 186 | HLLWritable w3 = new HLLWritable(h3); 187 | 188 | h1 = w1.combine(w2).get(); 189 | assertTrue(h1.size() == 6L); 190 | 191 | w1 = new HLLWritable(h1); 192 | h1 = w1.combine(w3).get(); 193 | assertTrue(h1.size() == 6L); 194 | 195 | h2 = w2.combine(w3).get(); 196 | assertTrue(h2.size() == 3L); 197 | 198 | h1.clear(); 199 | h2.clear(); 200 | h3.clear(); 201 | for(int i = 0; i < 1000000; i++) { 202 | String s = String.valueOf(Math.random()); 203 | h1.put(s); 204 | h3.put(s); 205 | if(i > 500000) { 206 | h2.put(s); 207 | } 208 | } 209 | 210 | for(int i = 0; i < 1000000; i++) { 211 | String s = String.valueOf(Math.random()); 212 | h2.put(s); 213 | h3.put(s); 214 | } 215 | w1 = new HLLWritable(h1); 216 | w2 = new HLLWritable(h2); 217 | h1 = w1.combine(w2).get(); 218 | assertTrue(h3.size() == h1.size()); 219 | } 220 | 221 | @Test 222 | public void test_combine_empty() throws Exception { 223 | HLLWritable empty = 224 | new HLLWritable((byte)15, Integer.MAX_VALUE, 0, new byte[(int)Math.pow(2, (byte)15)], new long[0]); 225 | assertEquals(0, empty.get().getMinHash().size()); 226 | 227 | HLLWritable empty2 = 228 | new HLLWritable((byte)15, 8192, 0, new byte[(int)Math.pow(2, (byte)15)], new long[0]); 229 | assertEquals(0, empty2.get().getMinHash().size()); 230 | 231 | empty = empty.combine(empty2); 
232 | assertEquals(0, empty.get().getMinHash().size()); 233 | assertEquals(0, empty.get().size()); 234 | } 235 | } 236 |
--------------------------------------------------------------------------------
/utils/minhash_k.py:
--------------------------------------------------------------------------------
#!/usr/bin/python

import argparse
import sys

from scipy.stats import binom

def err(ci, k, j):
    n = binom.ppf(ci, k, j)
    if n == 0:
        #this is an edge case, so we report a big error
        return 1e9
    else:
        return abs(n/(j*k) - 1)

def find_k(j, alpha, conf, k=1, maxk=1000000):
    ci = 1 - ((1 - conf)/2.0)
    e = err(ci, maxk, j)
    if e > alpha:
        #we'll never get the precision we want for this
        #range, so output the end of the range
        return (maxk, e)
    #grab bounds for search space
    kn = find_bound(j, alpha, ci, k, maxk)
    ub = find_bound(j, 0.75*alpha, ci, k, maxk)
    #start searching...
    while True:
        if kn == maxk:
            break
        broken = False
        if err(ci, kn, j) <= alpha:
            for n in range(kn, min(2*kn, ub) + 1):
                if err(ci, n, j) > alpha:
                    kn = n + 1
                    broken = True
                    break
            if not broken:
                return (kn, err(ci, kn, j))
        else:
            kn += 1
    return (maxk, err(ci, maxk, j))

def find_bound(j, alpha, ci, k, maxk):
    #just a binary search to find a good bound
    minb = k
    maxb = maxk
    e = err(ci, maxk, j)
    while True:
        midb = int((maxb + minb)/2)
        if midb - minb < 1:
            break
        midv = err(ci, midb, j)
        if midv <= alpha:
            maxb = midb
        else:
            minb = midb
    return midb

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Find an acceptable MinHash k given a desired error with desired confidence ' +
                    'at a particular Jaccard Index. Returns the k and the maximal error at that k.')
    parser.add_argument('--jaccard',
                        dest='jaccard_index',
                        required=True,
                        type=float,
                        help=('The lowest Jaccard Index to measure. In [0, 1].'))
    parser.add_argument('--error',
                        dest='error',
                        required=True,
                        type=float,
                        help=('The maximum error to tolerate at the Jaccard Index. ' +
                              '1 implies a measurement of 0 or twice the actual Jaccard Index.'))
    parser.add_argument('--confidence',
                        dest='confidence',
                        required=True,
                        type=float,
                        help=('The level of confidence the error at the Jaccard Index ' +
                              'will be less than the maximum error.'))
    parser.add_argument('--min_k',
                        dest='min_k',
                        required=False,
                        type=int,
                        default=1,
                        help=('The smallest k at which to begin the search. Default is 1.'))
    parser.add_argument('--max_k',
                        dest='max_k',
                        required=False,
                        type=int,
                        default=1000000,
                        help=('The largest k which is acceptable. Default is 1e6.'))
    args = parser.parse_args()
    k, e = find_k(args.jaccard_index, args.error, args.confidence, k=args.min_k, maxk=args.max_k)
    #single-argument print() behaves the same on Python 2 and Python 3
    print('MinHash k:\t' + str(k))
    print('Error at k:\t' + str(e))
--------------------------------------------------------------------------------
/utils/requirements.txt:
--------------------------------------------------------------------------------
scipy==0.14.0
--------------------------------------------------------------------------------
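The error measure that `err` computes in utils/minhash_k.py can be sketched without SciPy. This is only an illustration under stated assumptions: `binom_ppf` and `relative_error` are hypothetical names that do not exist in the repository, and `binom_ppf` is a plain-Python stand-in for `scipy.stats.binom.ppf` that walks the binomial CDF term by term, so it is only practical for modest k.

```python
import math


def binom_ppf(q, n, p):
    """Smallest x with P(X <= x) >= q for X ~ Binomial(n, p).

    Hypothetical stand-in for scipy.stats.binom.ppf (not part of the repo).
    """
    if p <= 0.0:
        return 0
    if p >= 1.0:
        return n
    log_odds = math.log(p) - math.log1p(-p)
    log_pmf = n * math.log1p(-p)  # log P(X = 0)
    cdf = math.exp(log_pmf)
    x = 0
    while cdf < q and x < n:
        # pmf(x + 1) = pmf(x) * (n - x) / (x + 1) * p / (1 - p)
        log_pmf += math.log(n - x) - math.log(x + 1) + log_odds
        x += 1
        cdf += math.exp(log_pmf)
    return x


def relative_error(ci, k, j):
    """Same arithmetic as err(): how far the ci-quantile sits from the mean j*k."""
    n = binom_ppf(ci, k, j)
    if n == 0:
        # same sentinel the script returns for the degenerate case
        return 1e9
    return abs(n / (j * k) - 1)


if __name__ == '__main__':
    # ci is the upper quantile that find_k derives from the confidence level
    conf = 0.99
    ci = 1 - ((1 - conf) / 2.0)
    for k in (100, 1000):
        print(k, relative_error(ci, k, 0.1))
```

Here `k` plays the role of the MinHash size and `j` the Jaccard index, as in the script; a larger k generally pulls the quantile closer to the mean, though not monotonically, which is presumably why `find_k` scans a range of k values rather than trusting a single probe.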