├── .gitignore
├── Makefile
├── LICENSE.txt
├── README.md
└── tree.c


/.gitignore:
--------------------------------------------------------------------------------
/tree
*.o

--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
CC = gcc
CFLAGS = -O3 -Wall -Wextra -Werror -std=gnu99

all : tree

clean :
	rm -f tree tree.o

tree : tree.o Makefile
	$(CC) tree.o -o tree

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Copyright (c) 2011, Dietrich Epp
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Metric tree sample implementation.

This was written in response to the Stack Overflow question,
[Efficiently find binary strings with low Hamming distance in large set][question].

[question]: http://stackoverflow.com/questions/6389841/efficiently-find-binary-strings-with-low-hamming-distance-in-large-set/6390606#6390606

This generates a bunch of pseudorandom 32-bit integers, inserts them
into an index, and queries the index for points within a certain
distance of the given point.

That is,

    Let S = { N pseudorandom 32-bit integers }
    Let d(x,y) be the (base-2) Hamming distance between x and y
    Let q(x,r) = { y in S : d(x,y) <= r }

There are three implementations in here which can be selected at
runtime.

"bk" is a BK-Tree.  Each internal node has a center point, and each
child node contains a set of all points a certain distance away from
the center.

"vp" is a VP-Tree.  Each internal node has a center point and two
children.  The "near" child contains all points contained in a closed
ball of a certain radius around the center, and the "far" node
contains all other points.

"linear" is a linear search.
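
As a rough illustration of the definitions above, d(x,y) and a
brute-force q(x,r) might be sketched as follows.  This is not the code
in tree.c: the names `hamming` and `query_scan` are made up for this
example, and the bit count here is the simple "clear the lowest set
bit" loop rather than the parallel count or `__builtin_popcount` that
tree.c uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* d(x,y): number of bit positions in which x and y differ. */
static unsigned
hamming(uint32_t x, uint32_t y)
{
    uint32_t d = x ^ y;
    unsigned n = 0;
    while (d) {
        d &= d - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* q(x,r) by brute force: copy every key within distance r of x into
   out (which must have room for n keys) and return how many were
   copied.  Hypothetical helper, for illustration only. */
static size_t
query_scan(const uint32_t *keys, size_t n, uint32_t x, unsigned r,
           uint32_t *out)
{
    size_t i, hits = 0;
    assert(keys != NULL && out != NULL);
    for (i = 0; i < n; ++i)
        if (hamming(x, keys[i]) <= r)
            out[hits++] = keys[i];
    return hits;
}
```

For example, scanning { 0x0, 0x1, 0x3, 0xff, 0x80000001 } for keys
within distance 1 of 0x1 keeps four keys and drops 0xff, which is at
distance 7.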

The tree implementations use a linear search for leaf nodes.  The
maximum number of points in a leaf node is configurable at runtime and
this parameter will affect performance.  If the number is low, say 1,
then the memory usage of the tree implementations will skyrocket to
unreasonable levels: more than 24 bytes per element.  If the number is
high, say infinity, then the tree will degenerate to a linear search.

Note that VP trees are slightly faster than BK trees for this problem,
and neither tree implementation significantly outperforms linear
search (that is, by a factor of two or more) for large r (for r > 6,
it seems).

## Test results

Parameters:

* System: 3.2 GHz AMD Phenom II / 6 cores
* RAM: 4 GiB
* Database size: 100M points
* Results: Average # of query hits (very approximate)
* Speed: Number of queries per second
* Coverage: Average percentage of database examined per query
* Sample size: 1000 queries for distance 1..5, 100 for 6..10 and linear
* Max leaf size: 1000 points

Results:

                    -- BK Tree --   -- VP Tree --   -- Linear --
    Dist  Results   Speed  Cov      Speed  Cov      Speed  Cov
    1     0.90      3800   0.048%   4200   0.048%
    2     11        300    0.68%    330    0.65%
    3     130       56     3.8%     63     3.4%
    4     970       18     12%      22     10%
    5     5700      8.5    26%      10     22%
    6     2.6e4     5.2    42%      6.0    37%
    7     1.1e5     3.7    60%      4.1    54%
    8     3.5e5     3.0    74%      3.2    70%
    9     1.0e6     2.6    85%      2.7    82%
    10    2.5e6     2.3    91%      2.4    90%
    any                                             2.2    100%

Above results were computed by:

    ./tree bk 1000 100000000 1000 1 2 3 4 5
    ./tree bk 1000 100000000 100 6 7 8 9 10
    ./tree vp 1000 100000000 1000 1 2 3 4 5
    ./tree vp 1000 100000000 100 6 7 8 9 10
    ./tree linear 1000 100000000

Except "results", which was just grabbed from whatever was convenient.
It's just an evaluation of the binomial function, so no need to
generate it specially (and it's not accurate).

Conclusion: I think VP is faster than BK because, being "deeper"
rather than "shallower", it compares against more points rather than
using finer-grained comparisons against fewer points.  I suspect that
the differences are more extreme in higher dimensional spaces.

What is the correct maximum leaf node size?

    for n in 50 60 64 70 80 90
    do ./tree vp $n 100000000 1000 3 ; done | grep Rate

    50: Rate: 94.607379 query/sec
    60: Rate: 95.877277 query/sec
    64: Rate: 97.656250 query/sec
    70: Rate: 97.370983 query/sec
    80: Rate: 96.711799 query/sec
    90: Rate: 96.618357 query/sec

I had already narrowed it down to 10 <= N <= 100 by exponential
search.  The 64 was added after I saw the results for N*10.  I suspect
that 64 plays nicer with malloc than 70 does.

Answer: Allow up to 64 points per leaf node.

What is the speed with the new leaf size?

    ./tree vp 64 100000000 1000 1 2 3 4 5
    ./tree vp 64 100000000 100 6 7 8 9 10 11 12

    Tree size: 426725132 (6.7% overhead)

    Dist  Speed  Cov      Speedup  BK Speedup
    1     9100   0.0088%  4200x
    2     580    0.18%    270x
    3     97     1.2%     45x
    4     29     4.3%     13x
    5     12     11%      5.7x
    6     6.6    21%      3.0x     2.1x
    7     4.1    34%      1.9x     1.5x
    8     2.9    50%      1.3x     1.1x
    9     2.3    64%      1.0x     0.96x
    10    1.9    77%      0.87x    0.85x
    11    1.7    87%      0.75x    0.77x
    12    1.5    93%      0.67x    0.70x

Note: These answers were computed with the original high precision
numbers, then rounded in the final step to two significant figures.

--------------------------------------------------------------------------------
/tree.c:
--------------------------------------------------------------------------------
/* Metric tree sample implementation.

   This generates a bunch of pseudorandom 32-bit integers, inserts
   them into an index, and queries the index for points within a
   certain distance of the given point.

   That is,

   Let S = { N pseudorandom 32-bit integers }
   Let d(x,y) be the (base-2) Hamming distance between x and y
   Let q(x,r) = { y in S : d(x,y) <= r }

   There are three implementations in here which can be selected at runtime.

   "bk" is a BK-Tree.  Each internal node has a center point, and each
   child node contains a set of all points a certain distance away
   from the center.

   "vp" is a VP-Tree.  Each internal node has a center point and two
   children.  The "near" child contains all points contained in a
   closed ball of a certain radius around the center, and the "far"
   node contains all other points.

   "linear" is a linear search.

   The tree implementations use a linear search for leaf nodes.  The
   maximum number of points in a leaf node is configurable at runtime,
   but 1000 is a good number.  If the number is low, say 1, then the
   memory usage of the tree implementations will skyrocket to
   unreasonable levels: more than 24 bytes per element.

   Note that VP trees are slightly faster than BK trees for this
   problem, and neither tree implementation significantly outperforms
   linear search (that is, by a factor of two or more) for r > 6.  */

#include <assert.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <time.h>

#ifndef DO_PRINT
#define DO_PRINT 0
#endif

#ifndef HAVE_POPCNT
#define HAVE_POPCNT 0
#endif

static uint32_t rand_x0, rand_x1, rand_c;
#define RAND_A 4284966893U

void
seedrand(void)
{
    time_t t;
    time(&t);
    rand_x0 = t;
    fprintf(stderr, "seed: %u\n", rand_x0);
    rand_x1 = 0x038acaf3U;
    rand_c = 0xa2cc5886U;
}

uint32_t
irand(void)
{
    uint64_t y = (uint64_t)rand_x0 * RAND_A + rand_c;
    rand_x0 = rand_x1;
    rand_x1 = y;
    rand_c = y >> 32;
    return y;
}

__attribute__((malloc))
static void *
xmalloc(size_t sz)
{
    void *p;
    if (!sz)
        return NULL;
    p = malloc(sz);
    if (!p)
        err(1, "malloc");
    return p;
}

unsigned long
xatoul(const char *p)
{
    char *e;
    unsigned long x;
    x = strtoul(p, &e, 0);
    if (*e)
        errx(1, "must be a number: '%s'", p);
    return x;
}

typedef uint32_t bkey_t;
enum { MAX_DISTANCE = 32 };

#if HAVE_POPCNT

static inline unsigned
distance(bkey_t x, bkey_t y)
{
    return __builtin_popcount(x^y);
}

#else

static inline unsigned
distance(bkey_t x, bkey_t y)
{
    /* Parallel bit count: sum adjacent 1-, 2-, 4-, 8- and 16-bit fields. */
    uint32_t d = x^y;
    d = (d & 0x55555555U) + ((d >> 1) & 0x55555555U);
    d = (d & 0x33333333U) + ((d >> 2) & 0x33333333U);
    d = (d + (d >> 4)) & 0x0f0f0f0fU;
    d = d + (d >> 8);
    d = d + (d >> 16);
    return d & 63;
}

#endif

static char keybuf[33];

static const char *
keystr(bkey_t k)
{
    unsigned i;
    for (i = 0; i < 32; ++i) {
        keybuf[31 - i] = '0' + (k & 1);
        k >>= 1;
    }
    keybuf[32] = '\0';
    return keybuf;
}

static const char *
keystr2(bkey_t k, bkey_t ref)
{
    unsigned i;
    bkey_t d = ref ^ k;
    for (i = 0; i < 32; ++i) {
        keybuf[31 - i] = (d & 1) ? ('0' + (k & 1)) : '.';
        d >>= 1;
        k >>= 1;
    }
    keybuf[32] = '\0';
    return keybuf;
}

static unsigned num_nodes = 0;
static size_t tree_size = 0;

struct buf {
    bkey_t *keys;
    size_t n, a;
};

static void
addkey(struct buf *restrict b, bkey_t k)
{
    size_t na;
    bkey_t *np;
    if (b->n >= b->a) {
        na = b->a ? 2*b->a : 16;
        np = xmalloc(sizeof(*np) * na);
        memcpy(np, b->keys, sizeof(*np) * b->n);
        free(b->keys);
        b->keys = np;
        b->a = na;
    }
    b->keys[b->n++] = k;
}

/* Linear search ==================== */

struct linear {
    size_t count;
    bkey_t *keys;
};

static struct linear *
mktree_linear(const bkey_t *restrict keys, size_t n, size_t max_linear)
{
    struct linear* node;
    (void)max_linear;
    node = xmalloc(sizeof(*node));
    node->count = n;
    node->keys = xmalloc(sizeof(bkey_t) * n);
    num_nodes += 1;
    tree_size += sizeof(bkey_t) * n + sizeof(*node);
    memcpy(node->keys, keys, sizeof(bkey_t) * n);
    return node;
}

static size_t
query_linear(struct buf *restrict b, struct linear *restrict root,
             bkey_t ref, unsigned maxd)
{
    size_t i;
    const bkey_t *restrict p = root->keys;
    for (i = 0; i < root->count; ++i)
        if (distance(ref, p[i]) <= maxd)
            addkey(b, p[i]);
    return root->count;
}

/* BK-tree ==================== */

struct bktree {
    unsigned short distance;
    unsigned short linear;
    union {
        struct {
            bkey_t key;
            struct bktree *child;
        } tree;
        struct {
            unsigned count;
            bkey_t *keys;
        } linear;
    } data;
    struct bktree *sibling;
};

static struct bktree *
mktree_bk(const bkey_t *restrict keys, size_t n, size_t max_linear)
{
    size_t dcnt[MAX_DISTANCE + 1], i, a, pos[MAX_DISTANCE + 1], off, len;
    bkey_t rootkey = keys[0], *tmp;
    struct bktree *root, *child, *prev;
    assert(n > 0);

    num_nodes += 1;

    /* Build root */
    root = xmalloc(sizeof(*root));
    tree_size += sizeof(*root);
    root->distance = 0;
    root->sibling = NULL;
    if (n <= max_linear || n <= 1) {
        root->linear = 1;
        tmp = xmalloc(sizeof(*tmp) * n);
        tree_size += sizeof(*tmp) * n;
        memcpy(tmp, keys, sizeof(*tmp) * n);
        root->data.linear.count = n;
        root->data.linear.keys = tmp;
        return root;
    }
    root->linear = 0;
    root->data.tree.key = rootkey;
    root->data.tree.child = NULL;

    n -= 1;
    keys += 1;
    if (!n)
        return root;

    /* Sort keys by distance to root (counting sort; dcnt[i] becomes the
       number of keys at distance <= i) */
    tmp = xmalloc(sizeof(*tmp) * n);
    for (i = 0; i <= MAX_DISTANCE; ++i)
        dcnt[i] = 0;
    for (i = 0; i < n; ++i)
        dcnt[distance(rootkey, keys[i])]++;
    for (i = 0, a = 0; i <= MAX_DISTANCE; ++i)
        dcnt[i] = (a += dcnt[i]);
    assert(a == n);
    memcpy(pos, dcnt, sizeof(pos));
    for (i = 0; i < n; ++i)
        tmp[--pos[distance(rootkey, keys[i])]] = keys[i];

    /* Add child nodes */
    for (i = 1, prev = NULL; i <= MAX_DISTANCE; ++i) {
        off = dcnt[i-1];
        len = dcnt[i] - off;
        if (!len)
            continue;
        child = mktree_bk(tmp + off, len, max_linear);
        child->distance = i;
        if (prev)
            prev->sibling = child;
        else
            root->data.tree.child = child;
        prev = child;
    }

    free(tmp);
    return root;
}


static size_t
query_bk(struct buf *restrict b, struct bktree *restrict root,
         bkey_t ref, unsigned maxd)
{
    /* We are trying to find all x that satisfy d(ref,x) <= maxd
       By triangle inequality, we know: d(root,x) <= d(root,ref) + d(ref,x)
       By algebra: d(root,x) - d(root,ref) <= d(ref,x)
       By transitivity: d(root,x) - d(root,ref) <= maxd
       By algebra: d(root,x) <= maxd + d(root,ref)
       Symmetrically, d(root,ref) <= d(root,x) + d(x,ref) gives:
       d(root,x) >= d(root,ref) - maxd */
    if (root->linear) {
        const bkey_t *restrict keys = root->data.linear.keys;
        unsigned i, n = root->data.linear.count;
        for (i = 0; i < n; ++i)
            if (distance(ref, keys[i]) <= maxd)
                addkey(b, keys[i]);
        return n;
    } else {
        unsigned d = distance(root->data.tree.key, ref);
        struct bktree *p = root->data.tree.child;
        size_t nc = 1;
        if (d <= maxd)
            addkey(b, root->data.tree.key);
        /* Skip children that are too close to the center, then visit
           those within [d - maxd, d + maxd] */
        for (; p && p->distance + maxd < d; p = p->sibling);
        for (; p && p->distance <= maxd + d; p = p->sibling)
            nc += query_bk(b, p, ref, maxd);
        return nc;
    }
}

/* VP-tree ==================== */

struct vptree {
    unsigned short linear;
    union {
        struct {
            /* Closed ball (d = threshold is included) */
            unsigned short threshold;
            bkey_t vantage;
            struct vptree *near;
            struct vptree *far;
        } tree;
        struct {
            unsigned count;
            bkey_t *keys;
        } linear;
    } data;
};

static struct vptree *
mktree_vp(const bkey_t *restrict keys, size_t n, size_t max_linear)
{
    size_t dcnt[MAX_DISTANCE + 1], i, a;
    bkey_t rootkey = keys[0], *tmp;
    struct vptree *root;
    unsigned k;
    size_t median, nnear, nfar, inear, ifar;
    assert(n > 0);

    num_nodes += 1;

    /* Build root */
    root = xmalloc(sizeof(*root));
    tree_size += sizeof(*root);
    if (n <= max_linear || n <= 1) {
        root->linear = 1;
        tmp = xmalloc(sizeof(*tmp) * n);
        tree_size += sizeof(*tmp) * n;
        memcpy(tmp, keys, sizeof(*tmp) * n);
        root->data.linear.count = n;
        root->data.linear.keys = tmp;
        return root;
    }
    root->linear = 0;
    root->data.tree.threshold = 0;
    root->data.tree.vantage = rootkey;
    root->data.tree.near = NULL;
    root->data.tree.far = NULL;

    n -= 1;
    keys += 1;
    if (!n)
        return root;

    /* Count keys inside the given ball */
    for (i = 0; i <= MAX_DISTANCE; ++i)
        dcnt[i] = 0;
    for (i = 0; i < n; ++i)
        dcnt[distance(rootkey, keys[i])]++;
    for (i = 0, a = 0; i <= MAX_DISTANCE; ++i)
        dcnt[i] = (a += dcnt[i]);
    assert(a == n);
    median = dcnt[0] + (n - dcnt[0]) / 2;
    for (k = 1; k <= MAX_DISTANCE; ++k)
        if (dcnt[k] > median)
            break;
    if (k != 1 && median - dcnt[k-1] <= dcnt[k] - median)
        k--;
    nnear = dcnt[k] - dcnt[0];
    nfar = n - dcnt[k];
    // printf("keys: %zu; near: %zu; far: %zu; k=%u\n", n, nnear, nfar, k);

    /* Sort keys into near and far sets */
    tmp = xmalloc(sizeof(*tmp) * (nnear + nfar));
    inear = 0;
    ifar = nnear;
    for (i = 0; i < n; ++i) {
        if (keys[i] == rootkey)
            continue;
        if (distance(rootkey, keys[i]) <= k)
            tmp[inear++] = keys[i];
        else
            tmp[ifar++] = keys[i];
    }
    assert(inear == nnear);
    assert(ifar == nnear + nfar);

    root->data.tree.threshold = k;
    if (nnear)
        root->data.tree.near = mktree_vp(tmp, nnear, max_linear);
    if (nfar)
        root->data.tree.far = mktree_vp(tmp + nnear, nfar, max_linear);

    free(tmp);
    return root;
}

static size_t
query_vp(struct buf *restrict b, struct vptree *restrict root,
         bkey_t ref, unsigned maxd)
{
    /* We are trying to find all x that satisfy d(ref,x) <= maxd
       By triangle inequality, we know: d(root,x) <= d(root,ref) + d(ref,x)
       By algebra: d(root,x) - d(root,ref) <= d(ref,x)
       By transitivity: d(root,x) - d(root,ref) <= maxd
       By algebra: d(root,x) <= maxd + d(root,ref)
       Symmetrically, d(root,ref) <= d(root,x) + d(x,ref) gives:
       d(root,x) >= d(root,ref) - maxd */
    if (root->linear) {
        const bkey_t *restrict keys = root->data.linear.keys;
        unsigned i, n = root->data.linear.count;
        for (i = 0; i < n; ++i)
            if (distance(ref, keys[i]) <= maxd)
                addkey(b, keys[i]);
        return n;
    } else {
        unsigned d = distance(root->data.tree.vantage, ref);
        unsigned thr = root->data.tree.threshold;
        size_t nc = 1;
        if (d <= maxd + thr) {
            if (root->data.tree.near)
                nc += query_vp(b, root->data.tree.near, ref, maxd);
            if (d <= maxd)
                addkey(b, root->data.tree.vantage);
        }
        if (d + maxd > thr && root->data.tree.far)
            nc += query_vp(b, root->data.tree.far, ref, maxd);
        return nc;
    }
}

/* Main ==================== */

typedef void *(*mktree_t)(bkey_t *, size_t, size_t);
typedef size_t (*query_t)(struct buf *, void *, bkey_t, unsigned);

int main(int argc, char *argv[])
{
    double tm, qc;
    clock_t ckref, t;
    struct buf q = { 0, 0, 0 };
    unsigned long nkeys, nquery, dist, i, j, k;
    void *root;
    bkey_t ref, *keys;
    unsigned long long total, totalcmp, maxlin;
    size_t nc;
    char *type;
    mktree_t mktree;
    query_t query;

    if (argc < 5) {
        fputs("Usage: TYPE MAXLIN NKEYS NQUERY DIST...\n", stderr);
        return 1;
    }
    type = argv[1];
    if (!strcasecmp(type, "bk")) {
        puts("Type: BK-tree");
        mktree = (mktree_t) mktree_bk;
        query = (query_t) query_bk;
    } else if (!strcasecmp(type, "vp")) {
        puts("Type: VP-tree");
        mktree = (mktree_t) mktree_vp;
        query = (query_t) query_vp;
    } else if (!strcasecmp(type, "linear")) {
        puts("Type: Linear search");
        mktree = (mktree_t) mktree_linear;
        query = (query_t) query_linear;
    } else {
        puts("Unknown type");
        return 1;
    }
    maxlin = xatoul(argv[2]);
    nkeys = xatoul(argv[3]);
    nquery = xatoul(argv[4]);
    if (!nkeys) {
        fputs("Need at least one key\n", stderr);
        return 1;
    }
    seedrand();
    printf("Keys: %lu\n", nkeys);
    printf("Queries: %lu\n", nquery);
    putchar('\n');

    puts("Generating keys...");
    /* xmalloc exits with an error if the allocation fails */
    keys = xmalloc(sizeof(*keys) * nkeys);
    for (i = 0; i < nkeys; ++i)
        keys[i] = irand();

    puts("Building tree...");
    ckref = clock();
    root = mktree(keys, nkeys, maxlin);
    free(keys);
    t = clock();
    printf("Time: %.3f sec\n",
           (double)(t - ckref) / CLOCKS_PER_SEC);
    printf("Nodes: %u\n", num_nodes);
    printf("Tree size: %zu\n", tree_size);

    for (k = 5; k < (unsigned) argc; ++k) {
        total = 0;
        totalcmp = 0;
        ckref = clock();
        dist = xatoul(argv[k]);
        if (dist >= MAX_DISTANCE || dist <= 0) {
            /* The check above rejects MAX_DISTANCE itself */
            fprintf(stderr, "Distance should be in the range 1..%d\n",
                    MAX_DISTANCE - 1);
            return 1;
        }
        putchar('\n');
        printf("Distance: %lu\n", dist);
        for (i = 0; i < nquery; ++i) {
            ref = irand();
            q.n = 0;
            nc = query(&q, root, ref, dist);
            totalcmp += nc;
            total += q.n;
            if (DO_PRINT) {
                printf("Query: %s\n", keystr(ref));
                for (j = 0; j < q.n; ++j)
                    printf("       %s\n", keystr2(q.keys[j], ref));
            }
        }
        t = clock();
        tm = t - ckref;
        qc = (double) CLOCKS_PER_SEC * (double) nquery;
        printf("Rate: %f query/sec\n", qc / tm);
        printf("Time: %f msec/query\n", 1000.0 * tm / qc);
        printf("Hits: %f\n", total / (double)nquery);
        printf("Coverage: %f%%\n",
               100.0 * (double)totalcmp / ((double)nkeys * nquery));
        printf("Cmp/result: %f\n", (double)totalcmp / (double)total);
    }
    return 0;
}

--------------------------------------------------------------------------------
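
The two pruning tests in `query_vp` (`d <= maxd + thr` for the near
child, `d + maxd > thr` for the far child) can be checked in isolation.
The sketch below is not part of the repository: `must_visit_near`,
`must_visit_far` and `count_pruning_errors` are hypothetical names, and
the check runs over small keys (at most 8 bits) so that every
combination of vantage point, query point and candidate can be tried.
It counts the cases where a child that would be skipped actually
contains a hit; by the triangle inequality that count should be zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hamming distance between two keys (simple portable bit count). */
static unsigned
hamming(uint32_t x, uint32_t y)
{
    uint32_t d = x ^ y;
    unsigned n = 0;
    while (d) {
        d &= d - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* The near child holds x with d(vantage,x) <= thr.  It can only
   contain a hit if d = d(vantage,ref) satisfies d <= maxd + thr. */
static int
must_visit_near(unsigned d, unsigned thr, unsigned maxd)
{
    return d <= maxd + thr;
}

/* The far child holds x with d(vantage,x) > thr.  It can only
   contain a hit if d + maxd > thr. */
static int
must_visit_far(unsigned d, unsigned thr, unsigned maxd)
{
    return d + maxd > thr;
}

/* Try every (vantage, ref, x) triple of `bits`-bit keys and count the
   hits that fall in a child the rules above would skip. */
static unsigned long
count_pruning_errors(unsigned bits, unsigned thr, unsigned maxd)
{
    uint32_t limit, v, ref, x;
    unsigned long bad = 0;
    assert(bits >= 1 && bits <= 8);
    limit = 1u << bits;
    for (v = 0; v < limit; ++v) {
        for (ref = 0; ref < limit; ++ref) {
            unsigned d = hamming(v, ref);
            int visit_near = must_visit_near(d, thr, maxd);
            int visit_far = must_visit_far(d, thr, maxd);
            for (x = 0; x < limit; ++x) {
                if (hamming(ref, x) > maxd)
                    continue;           /* x is not a hit */
                if (hamming(v, x) <= thr)
                    bad += !visit_near; /* hit in a skipped near child */
                else
                    bad += !visit_far;  /* hit in a skipped far child */
            }
        }
    }
    return bad;
}
```

Calling `count_pruning_errors(5, 2, 1)` walks all 32 * 32 * 32 triples
of 5-bit keys.  The same style of check could be applied to the
`[d - maxd, d + maxd]` window that `query_bk` uses for its children.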