├── README.md ├── btree2s.c ├── btree2v.c └── btree2u.c /README.md: -------------------------------------------------------------------------------- 1 | Btree-source-code 2 | ================= 3 | 4 | A working project for High-concurrency B-tree source code in C. You probably want to download threadskv10h.c for the latest development version. 5 | 6 | Here are the files in the btree source code: 7 | 8 | btree2s.c Single Threaded/MultiProcess version that removes keys all the way back to an original empty btree, placing removed nodes on a free list. Operates under either memory mapping or file I/O. Recommended for btrees hosted on network file systems. 9 | 10 | btree2t.c Single Threaded/MultiProcess version similar to btree2s except that fcntl locking has been replaced by test & set latches in the first few btree pages. Uses either memory mapping or file I/O. 11 | 12 | btree2u.c Single Threaded/MultiProcess version that implements a traditional buffer pool manager in the first n pages of the btree file. The buffer pool accesses its pages with mmap. Evicted pages are written back to the btree file from the buffer pool pages with pwrite. Recommended for linux. 13 | 14 | btree2v.c Single Threaded/MultiProcess version based on btree2u that utilizes linux only futex calls for latch contention. 15 | 16 | threads2h.c Multi-Threaded/Multi-Process with latching implemented by a latch manager with pthreads/SRW latches in the first few btree pages. Recommended for Windows. 17 | 18 | threads2i.c Multi-Threaded/Multi-Process with latching implemented by a latch manager with test & set latches in the first few btree pages with thread yield system calls during contention. 19 | 20 | threads2j.c Multi-Threaded/Multi-Process with latching implemented by a latch manager with test & set locks in the first few btree pages with Linux futex system calls during contention.
21 | 22 | threadskv1.c Multi-Threaded/Multi-Process based on threads2i.c that generalizes key/value storage in the btree pages. The page slots are reduced to 16 or 32 bits, and the value byte storage occurs along with the key storage. 23 | 24 | threadskv2.c Multi-Threaded/Multi-Process based on threadskv1 that replaces the linear sorted key array with a red/black tree. 25 | 26 | threadskv3.c Multi-Threaded/Multi-Process based on threadskv1 that introduces librarian filler slots in the linear key array to minimize data movement when a new key is inserted into the middle of the array. 27 | 28 | threadskv4b.c Multi-Threaded/Multi-Process based on threadskv3 that manages duplicate keys added to the btree. 29 | 30 | threadskv5.c Multi-Threaded/Multi-Process based on threadskv4b that supports bi-directional cursors through the btree. Also supports raw disk partitions for btrees. 31 | 32 | threadskv6.c Multi-Threaded/Single-Process with traditional buffer pool manager using the swap device. Based on threadskv5 and btree2u. 33 | 34 | threadskv7.c Multi-Threaded/Single-Process with atomic add of a set of keys under eventual consistency. Adds an individual key lock manager. 35 | 36 | threadskv8.c Multi-Threaded/Single-Process with atomic-consistent add of a set of keys based on threadskv6.c. Uses btree page latches as locking granularity. 37 | 38 | threadskv10h.c Multi-Threaded/Multi-Process with 2 Log-Structured-Merge (LSM) btrees based on threadskv8.c. Also adds dual leaf/interior node page sizes for each btree. Note that this file is linux only. 
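Several of the files above differ mainly in how a latch waits under contention. As a rough illustration of the test & set approach with thread yield calls described for threads2i.c, here is a minimal sketch using C11 atomics. The `SpinLatch` type and the `latch_*` function names are hypothetical and are not taken from the project source; the project's latch manager keeps its latches in the first few btree pages and may differ in detail.

```c
#include <stdatomic.h>
#include <sched.h>

/* Hypothetical latch type for illustration only; not from threads2i.c. */
typedef struct {
    atomic_flag held;   /* set while some thread owns the latch */
} SpinLatch;

/* Put the latch into the released state. */
static void latch_init(SpinLatch *latch)
{
    atomic_flag_clear(&latch->held);
}

/* Single test & set attempt: returns 1 if the latch was acquired, 0 if busy. */
static int latch_try(SpinLatch *latch)
{
    return !atomic_flag_test_and_set_explicit(&latch->held, memory_order_acquire);
}

/* Acquire the latch, yielding the processor while another thread holds it. */
static void latch_lock(SpinLatch *latch)
{
    while (atomic_flag_test_and_set_explicit(&latch->held, memory_order_acquire))
        sched_yield();
}

/* Release the latch so a waiting thread can acquire it. */
static void latch_unlock(SpinLatch *latch)
{
    atomic_flag_clear_explicit(&latch->held, memory_order_release);
}
```

A caller would bracket a short critical section with `latch_lock` and `latch_unlock`. Replacing `sched_yield` with a futex wait is, roughly, the difference the list above draws between threads2i.c and threads2j.c.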
39 | 40 | Compilation is achieved on linux or Windows by: 41 | 42 | gcc -D STANDALONE -O3 threadskv10g.c -lpthread 43 | 44 | or 45 | 46 | cl /Ox /D STANDALONE threads2h.c 47 | 48 | Please see the project wiki page for additional documentation. 49 | -------------------------------------------------------------------------------- /btree2s.c: -------------------------------------------------------------------------------- 1 | // btree version 2s 2 | // with reworked bt_deletekey code 3 | // 25 FEB 2014 4 | 5 | // author: karl malbrain, malbrain@cal.berkeley.edu 6 | 7 | /* 8 | This work, including the source code, documentation 9 | and related data, is placed into the public domain. 10 | 11 | The original author is Karl Malbrain. 12 | 13 | THIS SOFTWARE IS PROVIDED AS-IS WITHOUT WARRANTY 14 | OF ANY KIND, NOT EVEN THE IMPLIED WARRANTY OF 15 | MERCHANTABILITY. THE AUTHOR OF THIS SOFTWARE, 16 | ASSUMES _NO_ RESPONSIBILITY FOR ANY CONSEQUENCE 17 | RESULTING FROM THE USE, MODIFICATION, OR 18 | REDISTRIBUTION OF THIS SOFTWARE.
19 | */ 20 | 21 | // Please see the project home page for documentation 22 | // code.google.com/p/high-concurrency-btree 23 | 24 | #define _FILE_OFFSET_BITS 64 25 | #define _LARGEFILE64_SOURCE 26 | 27 | #ifdef linux 28 | #define _GNU_SOURCE 29 | #endif 30 | 31 | #ifdef unix 32 | #include <unistd.h> 33 | #include <stdio.h> 34 | #include <stdlib.h> 35 | #include <time.h> 36 | #include <fcntl.h> 37 | #include <sys/mman.h> 38 | #include <errno.h> 39 | #else 40 | #define WIN32_LEAN_AND_MEAN 41 | #include <windows.h> 42 | #include <stdio.h> 43 | #include <stdlib.h> 44 | #include <time.h> 45 | #include <fcntl.h> 46 | #endif 47 | 48 | #include <memory.h> 49 | #include <string.h> 50 | 51 | typedef unsigned long long uid; 52 | 53 | #ifndef unix 54 | typedef unsigned long long off64_t; 55 | typedef unsigned short ushort; 56 | typedef unsigned int uint; 57 | #endif 58 | 59 | #define BT_ro 0x6f72 // ro 60 | #define BT_rw 0x7772 // rw 61 | #define BT_fl 0x6c66 // fl 62 | 63 | #define BT_maxbits 24 // maximum page size in bits 64 | #define BT_minbits 9 // minimum page size in bits 65 | #define BT_minpage (1 << BT_minbits) // minimum page size 66 | 67 | /* 68 | There are five lock types for each node in three independent sets: 69 | 1. (set 1) AccessIntent: Sharable. Going to Read the node. Incompatible with NodeDelete. 70 | 2. (set 1) NodeDelete: Exclusive. About to release the node. Incompatible with AccessIntent. 71 | 3. (set 2) ReadLock: Sharable. Read the node. Incompatible with WriteLock. 72 | 4. (set 2) WriteLock: Exclusive. Modify the node. Incompatible with ReadLock and other WriteLocks. 73 | 5. (set 3) ParentModification: Exclusive. Change the node's parent keys. Incompatible with another ParentModification. 74 | */ 75 | 76 | typedef enum{ 77 | BtLockAccess, 78 | BtLockDelete, 79 | BtLockRead, 80 | BtLockWrite, 81 | BtLockParent 82 | }BtLock; 83 | 84 | // Define the length of the page and key pointers 85 | 86 | #define BtId 6 87 | 88 | // Page key slot definition. 89 | 90 | // If BT_maxbits is 15 or less, you can save 2 bytes 91 | // for each key stored by making the first two uints 92 | // into ushorts.
You can also save 4 bytes by removing 93 | // the tod field from the key. 94 | 95 | // Keys are marked dead, but remain on the page until 96 | // cleanup is called. The fence key (highest key) for 97 | // the page is always present, even if dead. 98 | 99 | typedef struct { 100 | uint off:BT_maxbits; // page offset for key start 101 | uint dead:1; // set for deleted key 102 | uint tod; // time-stamp for key 103 | unsigned char id[BtId]; // id associated with key 104 | } BtSlot; 105 | 106 | // The key structure occupies space at the upper end of 107 | // each page. It's a length byte followed by the value 108 | // bytes. 109 | 110 | typedef struct { 111 | unsigned char len; 112 | unsigned char key[0]; 113 | } *BtKey; 114 | 115 | // The first part of an index page. 116 | // It is immediately followed 117 | // by the BtSlot array of keys. 118 | 119 | typedef struct { 120 | uint cnt; // count of keys in page 121 | uint act; // count of active keys 122 | uint min; // next key offset 123 | unsigned char bits:7; // page size in bits 124 | unsigned char free:1; // page is on free list 125 | unsigned char lvl:6; // level of page 126 | unsigned char kill:1; // page is being deleted 127 | unsigned char dirty:1; // page is dirty 128 | unsigned char right[BtId]; // page number to right 129 | } *BtPage; 130 | 131 | // The memory mapping hash table entry 132 | 133 | typedef struct { 134 | BtPage page; // mapped page pointer 135 | uid page_no; // mapped page number 136 | void *lruprev; // least recently used previous cache block 137 | void *lrunext; // lru next cache block 138 | void *hashprev; // previous cache block for the same hash idx 139 | void *hashnext; // next cache block for the same hash idx 140 | #ifndef unix 141 | HANDLE hmap; 142 | #endif 143 | }BtHash; 144 | 145 | // The object structure for Btree access 146 | 147 | typedef struct _BtDb { 148 | uint page_size; // each page size 149 | uint page_bits; // each page size in bits 150 | uint seg_bits; // segment size in 
pages in bits 151 | uid page_no; // current page number 152 | uid cursor_page; // current cursor page number 153 | int err; 154 | uint mode; // read-write mode 155 | uint mapped_io; // use memory mapping 156 | BtPage temp; // temporary frame buffer (memory mapped/file IO) 157 | BtPage alloc; // frame buffer for alloc page ( page 0 ) 158 | BtPage cursor; // cached frame for start/next (never mapped) 159 | BtPage frame; // spare frame for the page split (never mapped) 160 | BtPage zero; // zeroes frame buffer (never mapped) 161 | BtPage page; // current page 162 | #ifdef unix 163 | int idx; 164 | #else 165 | HANDLE idx; 166 | #endif 167 | unsigned char *mem; // frame, cursor, page memory buffer 168 | int nodecnt; // highest page cache segment in use 169 | int nodemax; // highest page cache segment allocated 170 | int hashmask; // number of pages in segments - 1 171 | int hashsize; // size of hash table 172 | int found; // last deletekey found key 173 | BtHash *lrufirst; // lru list head 174 | BtHash *lrulast; // lru list tail 175 | ushort *cache; // hash table for cached segments 176 | BtHash nodes[1]; // segment cache follows 177 | } BtDb; 178 | 179 | typedef enum { 180 | BTERR_ok = 0, 181 | BTERR_struct, 182 | BTERR_ovflw, 183 | BTERR_lock, 184 | BTERR_map, 185 | BTERR_wrt, 186 | BTERR_hash 187 | } BTERR; 188 | 189 | // B-Tree functions 190 | extern void bt_close (BtDb *bt); 191 | extern BtDb *bt_open (char *name, uint mode, uint bits, uint cacheblk, uint pgblk); 192 | extern BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod); 193 | extern BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl); 194 | extern uid bt_findkey (BtDb *bt, unsigned char *key, uint len); 195 | extern uint bt_startkey (BtDb *bt, unsigned char *key, uint len); 196 | extern uint bt_nextkey (BtDb *bt, uint slot); 197 | 198 | // Helper functions to return slot values 199 | 200 | extern BtKey bt_key (BtDb *bt, uint slot); 201 | extern uid 
bt_uid (BtDb *bt, uint slot); 202 | extern uint bt_tod (BtDb *bt, uint slot); 203 | 204 | // BTree page number constants 205 | #define ALLOC_page 0 206 | #define ROOT_page 1 207 | #define LEAF_page 2 208 | 209 | // Number of levels to create in a new BTree 210 | 211 | #define MIN_lvl 2 212 | 213 | // The page is allocated from low and hi ends. 214 | // The key offsets and row-id's are allocated 215 | // from the bottom, while the text of the key 216 | // is allocated from the top. When the two 217 | // areas meet, the page is split into two. 218 | 219 | // A key consists of a length byte, two bytes of 220 | // index number (0 - 65534), and up to 253 bytes 221 | // of key value. Duplicate keys are discarded. 222 | // Associated with each key is a 48 bit row-id. 223 | 224 | // The b-tree root is always located at page 1. 225 | // The first leaf page of level zero is always 226 | // located on page 2. 227 | 228 | // The b-tree pages are linked with right 229 | // pointers to facilitate enumerators, 230 | // and provide for concurrency. 231 | 232 | // When the root page fills, it is split in two and 233 | // the tree height is raised by a new root at page 234 | // one with two keys. 235 | 236 | // Deleted keys are marked with a dead bit until 237 | // page cleanup. The fence key for a node is always 238 | // present, even after deletion and cleanup. 239 | 240 | // Deleted leaf pages are reclaimed on a free list. 241 | // The upper levels of the btree are fixed on creation. 242 | 243 | // Groups of pages from the btree are optionally 244 | // cached with memory mapping. A hash table is used to keep 245 | // track of the cached pages. This behaviour is controlled 246 | // by the number of cache blocks parameter and pages per block 247 | // given to bt_open. 248 | 249 | // To achieve maximum concurrency one page is locked at a time 250 | // as the tree is traversed to find the leaf key in question.
The right 251 | // page numbers are used in cases where the page is being split, 252 | // or consolidated. 253 | 254 | // Page 0 (ALLOC page) is dedicated to lock for new page extensions, 255 | // and chains empty leaf pages together for reuse. 256 | 257 | // Parent locks are obtained to prevent resplitting or deleting a node 258 | // before its fence is posted into its upper level. 259 | 260 | // A special open mode of BT_fl is provided to safely access files on 261 | // WIN32 networks. WIN32 network operations should not use memory mapping. 262 | // This WIN32 mode sets FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH 263 | // to prevent local caching of network file contents. 264 | 265 | // Access macros to address slot and key values from the page. 266 | // Page slots use 1 based indexing. 267 | 268 | #define slotptr(page, slot) (((BtSlot *)(page+1)) + (slot-1)) 269 | #define keyptr(page, slot) ((BtKey)((unsigned char*)(page) + slotptr(page, slot)->off)) 270 | 271 | void bt_putid(unsigned char *dest, uid id) 272 | { 273 | int i = BtId; 274 | 275 | while( i-- ) 276 | dest[i] = (unsigned char)id, id >>= 8; 277 | } 278 | 279 | uid bt_getid(unsigned char *src) 280 | { 281 | uid id = 0; 282 | int i; 283 | 284 | for( i = 0; i < BtId; i++ ) 285 | id <<= 8, id |= *src++; 286 | 287 | return id; 288 | } 289 | 290 | // place write, read, or parent lock on requested page_no. 291 | 292 | BTERR bt_lockpage(BtDb *bt, uid page_no, BtLock mode) 293 | { 294 | off64_t off = page_no << bt->page_bits; 295 | #ifdef unix 296 | int flag = PROT_READ | ( bt->mode == BT_ro ? 
0 : PROT_WRITE ); 297 | struct flock lock[1]; 298 | #else 299 | uint flags = 0, len; 300 | OVERLAPPED ovl[1]; 301 | #endif 302 | 303 | if( mode == BtLockRead || mode == BtLockWrite ) 304 | off += sizeof(*bt->page); // use second segment 305 | 306 | if( mode == BtLockParent ) 307 | off += 2 * sizeof(*bt->page); // use third segment 308 | 309 | #ifdef unix 310 | memset (lock, 0, sizeof(lock)); 311 | 312 | lock->l_start = off; 313 | lock->l_type = (mode == BtLockDelete || mode == BtLockWrite || mode == BtLockParent) ? F_WRLCK : F_RDLCK; 314 | lock->l_len = sizeof(*bt->page); 315 | lock->l_whence = 0; 316 | 317 | if( fcntl (bt->idx, F_SETLKW, lock) < 0 ) 318 | return bt->err = BTERR_lock; 319 | 320 | return 0; 321 | #else 322 | memset (ovl, 0, sizeof(ovl)); 323 | ovl->OffsetHigh = (uint)(off >> 32); 324 | ovl->Offset = (uint)off; 325 | len = sizeof(*bt->page); 326 | 327 | // use large offsets to 328 | // simulate advisory locking 329 | 330 | ovl->OffsetHigh |= 0x80000000; 331 | 332 | if( mode == BtLockDelete || mode == BtLockWrite || mode == BtLockParent ) 333 | flags |= LOCKFILE_EXCLUSIVE_LOCK; 334 | 335 | if( LockFileEx (bt->idx, flags, 0, len, 0L, ovl) ) 336 | return bt->err = 0; 337 | 338 | return bt->err = BTERR_lock; 339 | #endif 340 | } 341 | 342 | // remove write, read, or parent lock on requested page_no. 
343 | 344 | BTERR bt_unlockpage(BtDb *bt, uid page_no, BtLock mode) 345 | { 346 | off64_t off = page_no << bt->page_bits; 347 | #ifdef unix 348 | struct flock lock[1]; 349 | #else 350 | OVERLAPPED ovl[1]; 351 | uint len; 352 | #endif 353 | 354 | if( mode == BtLockRead || mode == BtLockWrite ) 355 | off += sizeof(*bt->page); // use second segment 356 | 357 | if( mode == BtLockParent ) 358 | off += 2 * sizeof(*bt->page); // use third segment 359 | 360 | #ifdef unix 361 | memset (lock, 0, sizeof(lock)); 362 | 363 | lock->l_start = off; 364 | lock->l_type = F_UNLCK; 365 | lock->l_len = sizeof(*bt->page); 366 | lock->l_whence = 0; 367 | 368 | if( fcntl (bt->idx, F_SETLK, lock) < 0 ) 369 | return bt->err = BTERR_lock; 370 | #else 371 | memset (ovl, 0, sizeof(ovl)); 372 | ovl->OffsetHigh = (uint)(off >> 32); 373 | ovl->Offset = (uint)off; 374 | len = sizeof(*bt->page); 375 | 376 | // use large offsets to 377 | // simulate advisory locking 378 | 379 | ovl->OffsetHigh |= 0x80000000; 380 | 381 | if( !UnlockFileEx (bt->idx, 0, len, 0, ovl) ) 382 | return GetLastError(), bt->err = BTERR_lock; 383 | #endif 384 | 385 | return bt->err = 0; 386 | } 387 | 388 | // close and release memory 389 | 390 | void bt_close (BtDb *bt) 391 | { 392 | BtHash *hash; 393 | #ifdef unix 394 | // release mapped pages 395 | 396 | if( hash = bt->lrufirst ) 397 | do munmap (hash->page, (bt->hashmask+1) << bt->page_bits); 398 | while(hash = hash->lrunext); 399 | 400 | if( bt->mem ) 401 | free (bt->mem); 402 | close (bt->idx); 403 | free (bt->cache); 404 | free (bt); 405 | #else 406 | if( hash = bt->lrufirst ) 407 | do 408 | { 409 | FlushViewOfFile(hash->page, 0); 410 | UnmapViewOfFile(hash->page); 411 | CloseHandle(hash->hmap); 412 | } while(hash = hash->lrunext); 413 | 414 | if( bt->mem) 415 | VirtualFree (bt->mem, 0, MEM_RELEASE); 416 | FlushFileBuffers(bt->idx); 417 | CloseHandle(bt->idx); 418 | GlobalFree (bt->cache); 419 | GlobalFree (bt); 420 | #endif 421 | } 422 | 423 | // open/create new btree 
424 | // call with file_name, BT_openmode, bits in page size (e.g. 16), 425 | // size of mapped page cache (e.g. 8192) or zero for no mapping. 426 | 427 | BtDb *bt_open (char *name, uint mode, uint bits, uint nodemax, uint pgblk) 428 | { 429 | uint lvl, attr, cacheblk, last; 430 | BtLock lockmode = BtLockWrite; 431 | BtPage alloc; 432 | off64_t size; 433 | uint amt[1]; 434 | BtKey key; 435 | BtDb* bt; 436 | 437 | #ifndef unix 438 | SYSTEM_INFO sysinfo[1]; 439 | #endif 440 | 441 | #ifdef unix 442 | bt = malloc (sizeof(BtDb) + nodemax * sizeof(BtHash)); 443 | memset (bt, 0, sizeof(BtDb)); 444 | 445 | switch (mode & 0x7fff) 446 | { 447 | case BT_fl: 448 | case BT_rw: 449 | bt->idx = open ((char*)name, O_RDWR | O_CREAT, 0666); 450 | break; 451 | 452 | case BT_ro: 453 | default: 454 | bt->idx = open ((char*)name, O_RDONLY); 455 | lockmode = BtLockRead; 456 | break; 457 | } 458 | if( bt->idx == -1 ) 459 | return free(bt), NULL; 460 | 461 | if( nodemax ) 462 | cacheblk = 4096; // page size for unix 463 | else 464 | cacheblk = 0; 465 | 466 | #else 467 | bt = GlobalAlloc (GMEM_FIXED|GMEM_ZEROINIT, sizeof(BtDb) + nodemax * sizeof(BtHash)); 468 | attr = FILE_ATTRIBUTE_NORMAL; 469 | switch (mode & 0x7fff) 470 | { 471 | case BT_fl: 472 | attr |= FILE_FLAG_WRITE_THROUGH | FILE_FLAG_NO_BUFFERING; 473 | 474 | case BT_rw: 475 | bt->idx = CreateFile(name, GENERIC_READ| GENERIC_WRITE, FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_ALWAYS, attr, NULL); 476 | break; 477 | 478 | case BT_ro: 479 | default: 480 | bt->idx = CreateFile(name, GENERIC_READ, FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_EXISTING, attr, NULL); 481 | lockmode = BtLockRead; 482 | break; 483 | } 484 | if( bt->idx == INVALID_HANDLE_VALUE ) 485 | return GlobalFree(bt), NULL; 486 | 487 | // normalize cacheblk to multiple of sysinfo->dwAllocationGranularity 488 | GetSystemInfo(sysinfo); 489 | 490 | if( nodemax ) 491 | cacheblk = sysinfo->dwAllocationGranularity; 492 | else 493 | cacheblk = 0; 494 | #endif 495 | 496 | // 
determine sanity of page size 497 | 498 | if( bits > BT_maxbits ) 499 | bits = BT_maxbits; 500 | else if( bits < BT_minbits ) 501 | bits = BT_minbits; 502 | 503 | if( bt_lockpage(bt, ALLOC_page, lockmode) ) 504 | return bt_close (bt), NULL; 505 | 506 | #ifdef unix 507 | *amt = 0; 508 | 509 | // read minimum page size to get root info 510 | 511 | if( size = lseek (bt->idx, 0L, 2) ) { 512 | alloc = malloc (BT_minpage); 513 | pread(bt->idx, alloc, BT_minpage, 0); 514 | bits = alloc->bits; 515 | free (alloc); 516 | } else if( mode == BT_ro ) 517 | return bt_close (bt), NULL; 518 | #else 519 | size = GetFileSize(bt->idx, amt); 520 | 521 | if( size || *amt ) { 522 | alloc = VirtualAlloc(NULL, BT_minpage, MEM_COMMIT, PAGE_READWRITE); 523 | if( !ReadFile(bt->idx, (char *)alloc, BT_minpage, amt, NULL) ) 524 | return bt_close (bt), NULL; 525 | bits = alloc->bits; 526 | VirtualFree (alloc, 0, MEM_RELEASE); 527 | } else if( mode == BT_ro ) 528 | return bt_close (bt), NULL; 529 | #endif 530 | 531 | bt->page_size = 1 << bits; 532 | bt->page_bits = bits; 533 | 534 | bt->nodemax = nodemax; 535 | bt->mode = mode; 536 | 537 | // setup cache mapping 538 | 539 | if( cacheblk ) { 540 | if( cacheblk < bt->page_size ) 541 | cacheblk = bt->page_size; 542 | 543 | bt->hashsize = nodemax / 8; 544 | bt->hashmask = (cacheblk >> bits) - 1; 545 | bt->mapped_io = 1; 546 | } 547 | 548 | // requested number of pages per memmap segment 549 | 550 | if( cacheblk ) 551 | if( (1 << pgblk) > bt->hashmask ) 552 | bt->hashmask = (1 << pgblk) - 1; 553 | 554 | bt->seg_bits = 0; 555 | 556 | while( (1 << bt->seg_bits) <= bt->hashmask ) 557 | bt->seg_bits++; 558 | 559 | #ifdef unix 560 | bt->mem = malloc (6 *bt->page_size); 561 | bt->cache = calloc (bt->hashsize, sizeof(ushort)); 562 | #else 563 | bt->mem = VirtualAlloc(NULL, 6 * bt->page_size, MEM_COMMIT, PAGE_READWRITE); 564 | bt->cache = GlobalAlloc (GMEM_FIXED|GMEM_ZEROINIT, bt->hashsize * sizeof(ushort)); 565 | #endif 566 | bt->frame = (BtPage)bt->mem; 567 
| bt->cursor = (BtPage)(bt->mem + bt->page_size); 568 | bt->page = (BtPage)(bt->mem + 2 * bt->page_size); 569 | bt->alloc = (BtPage)(bt->mem + 3 * bt->page_size); 570 | bt->temp = (BtPage)(bt->mem + 4 * bt->page_size); 571 | bt->zero = (BtPage)(bt->mem + 5 * bt->page_size); 572 | 573 | if( size || *amt ) { 574 | if( bt_unlockpage(bt, ALLOC_page, lockmode) ) 575 | return bt_close (bt), NULL; 576 | 577 | return bt; 578 | } 579 | 580 | // initializes an empty b-tree with root page and page of leaves 581 | 582 | memset (bt->alloc, 0, bt->page_size); 583 | bt_putid(bt->alloc->right, MIN_lvl+1); 584 | bt->alloc->bits = bt->page_bits; 585 | 586 | #ifdef unix 587 | if( write (bt->idx, bt->alloc, bt->page_size) < bt->page_size ) 588 | return bt_close (bt), NULL; 589 | #else 590 | if( !WriteFile (bt->idx, (char *)bt->alloc, bt->page_size, amt, NULL) ) 591 | return bt_close (bt), NULL; 592 | 593 | if( *amt < bt->page_size ) 594 | return bt_close (bt), NULL; 595 | #endif 596 | 597 | memset (bt->frame, 0, bt->page_size); 598 | bt->frame->bits = bt->page_bits; 599 | 600 | for( lvl=MIN_lvl; lvl--; ) { 601 | slotptr(bt->frame, 1)->off = bt->page_size - 3; 602 | bt_putid(slotptr(bt->frame, 1)->id, lvl ? 
MIN_lvl - lvl + 1 : 0); // next(lower) page number 603 | key = keyptr(bt->frame, 1); 604 | key->len = 2; // create stopper key 605 | key->key[0] = 0xff; 606 | key->key[1] = 0xff; 607 | bt->frame->min = bt->page_size - 3; 608 | bt->frame->lvl = lvl; 609 | bt->frame->cnt = 1; 610 | bt->frame->act = 1; 611 | #ifdef unix 612 | if( write (bt->idx, bt->frame, bt->page_size) < bt->page_size ) 613 | return bt_close (bt), NULL; 614 | #else 615 | if( !WriteFile (bt->idx, (char *)bt->frame, bt->page_size, amt, NULL) ) 616 | return bt_close (bt), NULL; 617 | 618 | if( *amt < bt->page_size ) 619 | return bt_close (bt), NULL; 620 | #endif 621 | } 622 | 623 | // create empty page area by writing last page of first 624 | // cache area (other pages are zeroed by O/S) 625 | 626 | if( bt->mapped_io && bt->hashmask ) { 627 | memset(bt->frame, 0, bt->page_size); 628 | last = bt->hashmask; 629 | 630 | while( last < MIN_lvl + 1 ) 631 | last += bt->hashmask + 1; 632 | #ifdef unix 633 | pwrite(bt->idx, bt->frame, bt->page_size, last << bt->page_bits); 634 | #else 635 | SetFilePointer (bt->idx, last << bt->page_bits, NULL, FILE_BEGIN); 636 | if( !WriteFile (bt->idx, (char *)bt->frame, bt->page_size, amt, NULL) ) 637 | return bt_close (bt), NULL; 638 | if( *amt < bt->page_size ) 639 | return bt_close (bt), NULL; 640 | #endif 641 | } 642 | 643 | if( bt_unlockpage(bt, ALLOC_page, lockmode) ) 644 | return bt_close (bt), NULL; 645 | 646 | return bt; 647 | } 648 | 649 | // compare two keys, returning > 0, = 0, or < 0 650 | // as the comparison value 651 | 652 | int keycmp (BtKey key1, unsigned char *key2, uint len2) 653 | { 654 | uint len1 = key1->len; 655 | int ans; 656 | 657 | if( ans = memcmp (key1->key, key2, len1 > len2 ? len2 : len1) ) 658 | return ans; 659 | 660 | if( len1 > len2 ) 661 | return 1; 662 | if( len1 < len2 ) 663 | return -1; 664 | 665 | return 0; 666 | } 667 | 668 | // Update current page of btree by writing file contents 669 | // or flushing mapped area to disk. 
670 | 671 | BTERR bt_update (BtDb *bt, BtPage page, uid page_no) 672 | { 673 | off64_t off = page_no << bt->page_bits; 674 | 675 | #ifdef unix 676 | if( !bt->mapped_io ) 677 | if( pwrite(bt->idx, page, bt->page_size, off) != bt->page_size ) 678 | return bt->err = BTERR_wrt; 679 | #else 680 | uint amt[1]; 681 | if( !bt->mapped_io ) 682 | { 683 | SetFilePointer (bt->idx, (long)off, (long*)(&off)+1, FILE_BEGIN); 684 | if( !WriteFile (bt->idx, (char *)page, bt->page_size, amt, NULL) ) 685 | return GetLastError(), bt->err = BTERR_wrt; 686 | 687 | if( *amt < bt->page_size ) 688 | return GetLastError(), bt->err = BTERR_wrt; 689 | } 690 | else if( bt->mode == BT_fl ) { 691 | FlushViewOfFile(page, bt->page_size); 692 | FlushFileBuffers(bt->idx); 693 | } 694 | #endif 695 | return 0; 696 | } 697 | 698 | // find page in cache 699 | 700 | BtHash *bt_findhash(BtDb *bt, uid page_no) 701 | { 702 | BtHash *hash; 703 | uint idx; 704 | 705 | // compute cache block first page and hash idx 706 | 707 | page_no &= ~bt->hashmask; 708 | idx = (uint)(page_no >> bt->seg_bits) % bt->hashsize; 709 | 710 | if( bt->cache[idx] ) 711 | hash = bt->nodes + bt->cache[idx]; 712 | else 713 | return NULL; 714 | 715 | do if( hash->page_no == page_no ) 716 | break; 717 | while(hash = hash->hashnext ); 718 | 719 | return hash; 720 | } 721 | 722 | // add page cache entry to hash index 723 | 724 | void bt_linkhash(BtDb *bt, BtHash *node, uid page_no) 725 | { 726 | uint idx = (uint)(page_no >> bt->seg_bits) % bt->hashsize; 727 | BtHash *hash; 728 | 729 | if( bt->cache[idx] ) { 730 | node->hashnext = hash = bt->nodes + bt->cache[idx]; 731 | hash->hashprev = node; 732 | } 733 | 734 | node->hashprev = NULL; 735 | bt->cache[idx] = (ushort)(node - bt->nodes); 736 | } 737 | 738 | // remove cache entry from hash table 739 | 740 | void bt_unlinkhash(BtDb *bt, BtHash *node) 741 | { 742 | uint idx = (uint)(node->page_no >> bt->seg_bits) % bt->hashsize; 743 | BtHash *hash; 744 | 745 | // unlink node 746 | if( hash = 
node->hashprev ) 747 | hash->hashnext = node->hashnext; 748 | else if( hash = node->hashnext ) 749 | bt->cache[idx] = (ushort)(hash - bt->nodes); 750 | else 751 | bt->cache[idx] = 0; 752 | 753 | if( hash = node->hashnext ) 754 | hash->hashprev = node->hashprev; 755 | } 756 | 757 | // add cache page to lru chain and map pages 758 | 759 | BtPage bt_linklru(BtDb *bt, BtHash *hash, uid page_no) 760 | { 761 | int flag; 762 | off64_t off = (page_no & ~bt->hashmask) << bt->page_bits; 763 | off64_t limit = off + ((bt->hashmask+1) << bt->page_bits); 764 | BtHash *node; 765 | 766 | memset(hash, 0, sizeof(BtHash)); 767 | hash->page_no = (page_no & ~bt->hashmask); 768 | bt_linkhash(bt, hash, page_no); 769 | 770 | if( node = hash->lrunext = bt->lrufirst ) 771 | node->lruprev = hash; 772 | else 773 | bt->lrulast = hash; 774 | 775 | bt->lrufirst = hash; 776 | 777 | #ifdef unix 778 | flag = PROT_READ | ( bt->mode == BT_ro ? 0 : PROT_WRITE ); 779 | hash->page = (BtPage)mmap (0, (bt->hashmask+1) << bt->page_bits, flag, MAP_SHARED, bt->idx, off); 780 | if( hash->page == MAP_FAILED ) 781 | return bt->err = BTERR_map, (BtPage)NULL; 782 | 783 | #else 784 | flag = ( bt->mode == BT_ro ? PAGE_READONLY : PAGE_READWRITE ); 785 | hash->hmap = CreateFileMapping(bt->idx, NULL, flag, (DWORD)(limit >> 32), (DWORD)limit, NULL); 786 | if( !hash->hmap ) 787 | return bt->err = BTERR_map, NULL; 788 | 789 | flag = ( bt->mode == BT_ro ? FILE_MAP_READ : FILE_MAP_WRITE ); 790 | hash->page = MapViewOfFile(hash->hmap, flag, (DWORD)(off >> 32), (DWORD)off, (bt->hashmask+1) << bt->page_bits); 791 | if( !hash->page ) 792 | return bt->err = BTERR_map, NULL; 793 | #endif 794 | 795 | return (BtPage)((char*)hash->page + ((uint)(page_no & bt->hashmask) << bt->page_bits)); 796 | } 797 | 798 | // find or place requested page in page-cache 799 | // return memory address where page is located. 
800 | 801 | BtPage bt_hashpage(BtDb *bt, uid page_no) 802 | { 803 | BtHash *hash, *node, *next; 804 | BtPage page; 805 | 806 | // find page in cache and move to top of lru list 807 | 808 | if( hash = bt_findhash(bt, page_no) ) { 809 | page = (BtPage)((char*)hash->page + ((uint)(page_no & bt->hashmask) << bt->page_bits)); 810 | // swap node in lru list 811 | if( node = hash->lruprev ) { 812 | if( next = node->lrunext = hash->lrunext ) 813 | next->lruprev = node; 814 | else 815 | bt->lrulast = node; 816 | 817 | if( next = hash->lrunext = bt->lrufirst ) 818 | next->lruprev = hash; 819 | else 820 | return bt->err = BTERR_hash, (BtPage)NULL; 821 | 822 | hash->lruprev = NULL; 823 | bt->lrufirst = hash; 824 | } 825 | return page; 826 | } 827 | 828 | // map pages and add to cache entry 829 | 830 | if( bt->nodecnt < bt->nodemax ) { 831 | hash = bt->nodes + ++bt->nodecnt; 832 | return bt_linklru(bt, hash, page_no); 833 | } 834 | 835 | // hash table is already full, replace last lru entry from the cache 836 | 837 | if( hash = bt->lrulast ) { 838 | // unlink from lru list 839 | if( node = bt->lrulast = hash->lruprev ) 840 | node->lrunext = NULL; 841 | else 842 | return bt->err = BTERR_hash, (BtPage)NULL; 843 | 844 | #ifdef unix 845 | munmap (hash->page, (bt->hashmask+1) << bt->page_bits); 846 | #else 847 | FlushViewOfFile(hash->page, 0); 848 | UnmapViewOfFile(hash->page); 849 | CloseHandle(hash->hmap); 850 | #endif 851 | // unlink from hash table 852 | 853 | bt_unlinkhash(bt, hash); 854 | 855 | // map and add to cache 856 | 857 | return bt_linklru(bt, hash, page_no); 858 | } 859 | 860 | return bt->err = BTERR_hash, (BtPage)NULL; 861 | } 862 | 863 | // map a btree page onto current page 864 | 865 | BTERR bt_mappage (BtDb *bt, BtPage *page, uid page_no) 866 | { 867 | off64_t off = page_no << bt->page_bits; 868 | #ifndef unix 869 | int amt[1]; 870 | #endif 871 | 872 | if( bt->mapped_io ) { 873 | bt->err = 0; 874 | *page = bt_hashpage(bt, page_no); 875 | return bt->err; 876 | } 
877 | #ifdef unix 878 | if( pread(bt->idx, *page, bt->page_size, off) < bt->page_size ) 879 | return bt->err = BTERR_map; 880 | #else 881 | SetFilePointer (bt->idx, (long)off, (long*)(&off)+1, FILE_BEGIN); 882 | 883 | if( !ReadFile(bt->idx, *page, bt->page_size, amt, NULL) ) 884 | return bt->err = BTERR_map; 885 | 886 | if( *amt < bt->page_size ) 887 | return bt->err = BTERR_map; 888 | #endif 889 | return 0; 890 | } 891 | 892 | // deallocate a deleted page 893 | // place on free chain out of allocator page 894 | 895 | BTERR bt_freepage(BtDb *bt, uid page_no) 896 | { 897 | if( bt_mappage (bt, &bt->temp, page_no) ) 898 | return bt->err; 899 | 900 | // lock allocation page 901 | 902 | if( bt_lockpage(bt, ALLOC_page, BtLockWrite) ) 903 | return bt->err; 904 | 905 | if( bt_mappage (bt, &bt->alloc, ALLOC_page) ) 906 | return bt->err; 907 | 908 | // store chain in second right 909 | bt_putid(bt->temp->right, bt_getid(bt->alloc[1].right)); 910 | bt_putid(bt->alloc[1].right, page_no); 911 | bt->temp->free = 1; 912 | 913 | if( bt_update(bt, bt->alloc, ALLOC_page) ) 914 | return bt->err; 915 | if( bt_update(bt, bt->temp, page_no) ) 916 | return bt->err; 917 | 918 | // unlock page zero 919 | 920 | if( bt_unlockpage(bt, ALLOC_page, BtLockWrite) ) 921 | return bt->err; 922 | 923 | // remove write lock on deleted node 924 | 925 | if( bt_unlockpage(bt, page_no, BtLockWrite) ) 926 | return bt->err; 927 | 928 | // remove delete lock on deleted node 929 | 930 | if( bt_unlockpage(bt, page_no, BtLockDelete) ) 931 | return bt->err; 932 | 933 | return 0; 934 | } 935 | 936 | // allocate a new page and write page into it 937 | 938 | uid bt_newpage(BtDb *bt, BtPage page) 939 | { 940 | uid new_page; 941 | char *pmap; 942 | int reuse; 943 | 944 | // lock page zero 945 | 946 | if( bt_lockpage(bt, ALLOC_page, BtLockWrite) ) 947 | return 0; 948 | 949 | if( bt_mappage (bt, &bt->alloc, ALLOC_page) ) 950 | return 0; 951 | 952 | // use empty chain first 953 | // else allocate empty page 954 | 955 | 
if( new_page = bt_getid(bt->alloc[1].right) ) { 956 | if( bt_mappage (bt, &bt->temp, new_page) ) 957 | return 0; // don't unlock on error 958 | bt_putid(bt->alloc[1].right, bt_getid(bt->temp->right)); 959 | reuse = 1; 960 | } else { 961 | new_page = bt_getid(bt->alloc->right); 962 | bt_putid(bt->alloc->right, new_page+1); 963 | reuse = 0; 964 | } 965 | 966 | if( bt_update(bt, bt->alloc, ALLOC_page) ) 967 | return 0; // don't unlock on error 968 | 969 | if( !bt->mapped_io ) { 970 | if( bt_update(bt, page, new_page) ) 971 | return 0; //don't unlock on error 972 | 973 | // unlock page zero 974 | 975 | if( bt_unlockpage(bt, ALLOC_page, BtLockWrite) ) 976 | return 0; 977 | 978 | return new_page; 979 | } 980 | 981 | #ifdef unix 982 | if( pwrite(bt->idx, page, bt->page_size, new_page << bt->page_bits) < bt->page_size ) 983 | return bt->err = BTERR_wrt, 0; 984 | 985 | // if writing first page of hash block, zero last page in the block 986 | 987 | if( !reuse && bt->hashmask > 0 && (new_page & bt->hashmask) == 0 ) 988 | { 989 | // use temp buffer to write zeros 990 | memset(bt->zero, 0, bt->page_size); 991 | if( pwrite(bt->idx,bt->zero, bt->page_size, (new_page | bt->hashmask) << bt->page_bits) < bt->page_size ) 992 | return bt->err = BTERR_wrt, 0; 993 | } 994 | #else 995 | // bring new page into page-cache and copy page. 996 | // this will extend the file into the new pages. 
997 | 998 | if( !(pmap = (char*)bt_hashpage(bt, new_page & ~bt->hashmask)) ) 999 | return 0; 1000 | 1001 | memcpy(pmap+((new_page & bt->hashmask) << bt->page_bits), page, bt->page_size); 1002 | #endif 1003 | 1004 | // unlock page zero 1005 | 1006 | if( bt_unlockpage(bt, ALLOC_page, BtLockWrite) ) 1007 | return 0; 1008 | 1009 | return new_page; 1010 | } 1011 | 1012 | // find slot in page for given key at a given level 1013 | 1014 | int bt_findslot (BtDb *bt, unsigned char *key, uint len) 1015 | { 1016 | uint diff, higher = bt->page->cnt, low = 1, slot; 1017 | uint good = 0; 1018 | 1019 | // make stopper key an infinite fence value 1020 | 1021 | if( bt_getid (bt->page->right) ) 1022 | higher++; 1023 | else 1024 | good++; 1025 | 1026 | // low is the lowest candidate, higher is already 1027 | // tested as .ge. the given key, loop ends when they meet 1028 | 1029 | while( diff = higher - low ) { 1030 | slot = low + ( diff >> 1 ); 1031 | if( keycmp (keyptr(bt->page, slot), key, len) < 0 ) 1032 | low = slot + 1; 1033 | else 1034 | higher = slot, good++; 1035 | } 1036 | 1037 | // return zero if key is on right link page 1038 | 1039 | return good ? higher : 0; 1040 | } 1041 | 1042 | // find and load page at given level for given key 1043 | // leave page rd or wr locked as requested 1044 | 1045 | int bt_loadpage (BtDb *bt, unsigned char *key, uint len, uint lvl, uint lock) 1046 | { 1047 | uid page_no = ROOT_page, prevpage = 0; 1048 | uint drill = 0xff, slot; 1049 | uint mode, prevmode; 1050 | 1051 | // start at root of btree and drill down 1052 | 1053 | do { 1054 | // determine lock mode of drill level 1055 | mode = (lock == BtLockWrite) && (drill == lvl) ? 
BtLockWrite : BtLockRead; 1056 | 1057 | bt->page_no = page_no; 1058 | 1059 | // obtain access lock using lock chaining 1060 | 1061 | if( page_no > ROOT_page ) 1062 | if( bt_lockpage(bt, bt->page_no, BtLockAccess) ) 1063 | return 0; 1064 | 1065 | if( prevpage ) 1066 | if( bt_unlockpage(bt, prevpage, prevmode) ) 1067 | return 0; 1068 | 1069 | // obtain read lock using lock chaining 1070 | 1071 | if( bt_lockpage(bt, bt->page_no, mode) ) 1072 | return 0; 1073 | 1074 | if( page_no > ROOT_page ) 1075 | if( bt_unlockpage(bt, bt->page_no, BtLockAccess) ) 1076 | return 0; 1077 | 1078 | // map/obtain page contents 1079 | 1080 | if( bt_mappage (bt, &bt->page, page_no) ) 1081 | return 0; 1082 | 1083 | // re-read and re-lock root after determining actual level of root 1084 | 1085 | if( bt->page->lvl != drill) { 1086 | if( bt->page_no != ROOT_page ) 1087 | return bt->err = BTERR_struct, 0; 1088 | 1089 | drill = bt->page->lvl; 1090 | 1091 | if( lock != BtLockRead && drill == lvl ) 1092 | if( bt_unlockpage(bt, page_no, mode) ) 1093 | return 0; 1094 | else 1095 | continue; 1096 | } 1097 | 1098 | prevpage = bt->page_no; 1099 | prevmode = mode; 1100 | 1101 | // find key on page at this level 1102 | // and descend to requested level 1103 | 1104 | if( !bt->page->kill ) 1105 | if( slot = bt_findslot (bt, key, len) ) { 1106 | if( drill == lvl ) 1107 | return slot; 1108 | 1109 | while( slotptr(bt->page, slot)->dead ) 1110 | if( slot++ < bt->page->cnt ) 1111 | continue; 1112 | else 1113 | goto slideright; 1114 | 1115 | page_no = bt_getid(slotptr(bt->page, slot)->id); 1116 | drill--; 1117 | continue; 1118 | } 1119 | 1120 | // or slide right into next page 1121 | 1122 | slideright: 1123 | page_no = bt_getid(bt->page->right); 1124 | 1125 | } while( page_no ); 1126 | 1127 | // return error on end of right chain 1128 | 1129 | bt->err = BTERR_struct; 1130 | return 0; // return error 1131 | } 1132 | 1133 | // a fence key was deleted from a page 1134 | // push new fence value upwards 1135 | 1136 | 
BTERR bt_fixfence (BtDb *bt, uid page_no, uint lvl) 1137 | { 1138 | unsigned char leftkey[256], rightkey[256]; 1139 | BtKey ptr; 1140 | 1141 | // remove deleted key, the old fence value 1142 | 1143 | ptr = keyptr(bt->page, bt->page->cnt); 1144 | memcpy(rightkey, ptr, ptr->len + 1); 1145 | 1146 | memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot)); 1147 | bt->page->dirty = 1; 1148 | 1149 | ptr = keyptr(bt->page, bt->page->cnt); 1150 | memcpy(leftkey, ptr, ptr->len + 1); 1151 | 1152 | if( bt_update (bt, bt->page, page_no) ) 1153 | return bt->err; 1154 | 1155 | if( bt_lockpage (bt, page_no, BtLockParent) ) 1156 | return bt->err; 1157 | 1158 | if( bt_unlockpage (bt, page_no, BtLockWrite) ) 1159 | return bt->err; 1160 | 1161 | // insert new (now smaller) fence key 1162 | 1163 | if( bt_insertkey (bt, leftkey+1, *leftkey, lvl + 1, page_no, time(NULL)) ) 1164 | return bt->err; 1165 | 1166 | // remove old (larger) fence key 1167 | 1168 | if( bt_deletekey (bt, rightkey+1, *rightkey, lvl + 1) ) 1169 | return bt->err; 1170 | 1171 | return bt_unlockpage (bt, page_no, BtLockParent); 1172 | } 1173 | 1174 | // root has a single child 1175 | // collapse a level from the btree 1176 | // call with root locked in bt->page 1177 | 1178 | BTERR bt_collapseroot (BtDb *bt, BtPage root) 1179 | { 1180 | uid child; 1181 | uint idx; 1182 | 1183 | // find the child entry 1184 | // and promote to new root 1185 | 1186 | do { 1187 | for( idx = 0; idx++ < root->cnt; ) 1188 | if( !slotptr(root, idx)->dead ) 1189 | break; 1190 | 1191 | child = bt_getid (slotptr(root, idx)->id); 1192 | 1193 | if( bt_lockpage (bt, child, BtLockDelete) ) 1194 | return bt->err; 1195 | 1196 | if( bt_lockpage (bt, child, BtLockWrite) ) 1197 | return bt->err; 1198 | 1199 | if( bt_mappage (bt, &bt->temp, child) ) 1200 | return bt->err; 1201 | 1202 | memcpy (root, bt->temp, bt->page_size); 1203 | 1204 | if( bt_update (bt, root, ROOT_page) ) 1205 | return bt->err; 1206 | 1207 | if( bt_freepage (bt, child) ) 1208 | 
return bt->err; 1209 | 1210 | } while( root->lvl > 1 && root->act == 1 ); 1211 | 1212 | return bt_unlockpage (bt, ROOT_page, BtLockWrite); 1213 | } 1214 | 1215 | // find and delete key on page by marking delete flag bit 1216 | // when page becomes empty, delete it 1217 | 1218 | BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl) 1219 | { 1220 | unsigned char lowerkey[256], higherkey[256]; 1221 | uint slot, dirty = 0, idx, fence, found; 1222 | uid page_no, right; 1223 | BtKey ptr; 1224 | 1225 | if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) ) 1226 | ptr = keyptr(bt->page, slot); 1227 | else 1228 | return bt->err; 1229 | 1230 | // are we deleting a fence slot? 1231 | 1232 | fence = slot == bt->page->cnt; 1233 | 1234 | // if key is found delete it, otherwise ignore request 1235 | 1236 | if( found = !keycmp (ptr, key, len) ) 1237 | if( found = slotptr(bt->page, slot)->dead == 0 ) { 1238 | dirty = slotptr(bt->page,slot)->dead = 1; 1239 | bt->page->dirty = 1; 1240 | bt->page->act--; 1241 | 1242 | // collapse empty slots 1243 | 1244 | while( idx = bt->page->cnt - 1 ) 1245 | if( slotptr(bt->page, idx)->dead ) { 1246 | *slotptr(bt->page, idx) = *slotptr(bt->page, idx + 1); 1247 | memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot)); 1248 | } else 1249 | break; 1250 | } 1251 | 1252 | right = bt_getid(bt->page->right); 1253 | page_no = bt->page_no; 1254 | 1255 | // did we delete a fence key in an upper level? 1256 | 1257 | if( dirty && lvl && bt->page->act && fence ) 1258 | if( bt_fixfence (bt, page_no, lvl) ) 1259 | return bt->err; 1260 | else 1261 | return bt->found = found, 0; 1262 | 1263 | // is this a collapsed root? 
1264 | 1265 | if( lvl > 1 && page_no == ROOT_page && bt->page->act == 1 ) 1266 | if( bt_collapseroot (bt, bt->page) ) 1267 | return bt->err; 1268 | else 1269 | return bt->found = found, 0; 1270 | 1271 | // return if page is not empty 1272 | 1273 | if( bt->page->act ) { 1274 | if( dirty && bt_update(bt, bt->page, page_no) ) 1275 | return bt->err; 1276 | if( bt_unlockpage(bt, page_no, BtLockWrite) ) 1277 | return bt->err; 1278 | return bt->found = found, 0; 1279 | } 1280 | 1281 | // cache copy of fence key 1282 | // in order to find parent 1283 | 1284 | ptr = keyptr(bt->page, bt->page->cnt); 1285 | memcpy(lowerkey, ptr, ptr->len + 1); 1286 | 1287 | // obtain lock on right page 1288 | 1289 | if( bt_lockpage(bt, right, BtLockWrite) ) 1290 | return bt->err; 1291 | 1292 | if( bt_mappage (bt, &bt->temp, right) ) 1293 | return bt->err; 1294 | 1295 | if( bt->temp->kill ) 1296 | return bt->err = BTERR_struct; 1297 | 1298 | // pull contents of next page into current empty page 1299 | 1300 | memcpy (bt->page, bt->temp, bt->page_size); 1301 | 1302 | // cache copy of key to update 1303 | 1304 | ptr = keyptr(bt->temp, bt->temp->cnt); 1305 | memcpy(higherkey, ptr, ptr->len + 1); 1306 | 1307 | // Mark right page as deleted and point it to left page 1308 | // until we can post updates at higher level. 
1309 | 1310 | bt_putid(bt->temp->right, page_no); 1311 | bt->temp->kill = 1; 1312 | 1313 | if( bt_update(bt, bt->page, page_no) ) 1314 | return bt->err; 1315 | 1316 | if( bt_update(bt, bt->temp, right) ) 1317 | return bt->err; 1318 | 1319 | if( bt_lockpage(bt, page_no, BtLockParent) ) 1320 | return bt->err; 1321 | 1322 | if( bt_unlockpage(bt, page_no, BtLockWrite) ) 1323 | return bt->err; 1324 | 1325 | if( bt_lockpage(bt, right, BtLockParent) ) 1326 | return bt->err; 1327 | 1328 | if( bt_unlockpage(bt, right, BtLockWrite) ) 1329 | return bt->err; 1330 | 1331 | // redirect higher key directly to consolidated node 1332 | 1333 | if( bt_insertkey (bt, higherkey+1, *higherkey, lvl+1, page_no, time(NULL)) ) 1334 | return bt->err; 1335 | 1336 | // delete old lower key to consolidated node 1337 | 1338 | if( bt_deletekey (bt, lowerkey + 1, *lowerkey, lvl + 1) ) 1339 | return bt->err; 1340 | 1341 | // obtain write & delete lock on deleted node 1342 | // add right block to free chain 1343 | 1344 | if( bt_lockpage(bt, right, BtLockDelete) ) 1345 | return bt->err; 1346 | 1347 | if( bt_lockpage(bt, right, BtLockWrite) ) 1348 | return bt->err; 1349 | 1350 | if( bt_freepage (bt, right) ) 1351 | return bt->err; 1352 | 1353 | // remove ParentModify locks 1354 | 1355 | if( bt_unlockpage(bt, right, BtLockParent) ) 1356 | return bt->err; 1357 | 1358 | return bt_unlockpage(bt, page_no, BtLockParent); 1359 | } 1360 | 1361 | // find key in leaf level and return row-id 1362 | 1363 | uid bt_findkey (BtDb *bt, unsigned char *key, uint len) 1364 | { 1365 | uint slot; 1366 | BtKey ptr; 1367 | uid id; 1368 | 1369 | if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) ) 1370 | ptr = keyptr(bt->page, slot); 1371 | else 1372 | return 0; 1373 | 1374 | // if key exists, return row-id 1375 | // otherwise return 0 1376 | 1377 | if( ptr->len == len && !memcmp (ptr->key, key, len) ) 1378 | id = bt_getid(slotptr(bt->page,slot)->id); 1379 | else 1380 | id = 0; 1381 | 1382 | if( bt_unlockpage(bt, 
bt->page_no, BtLockRead) ) 1383 | return 0; 1384 | 1385 | return id; 1386 | } 1387 | 1388 | // check page for space available, 1389 | // clean if necessary and return 1390 | // 0 - page needs splitting 1391 | // >0 - go ahead with new slot 1392 | 1393 | uint bt_cleanpage(BtDb *bt, uint amt, uint slot) 1394 | { 1395 | uint nxt = bt->page_size; 1396 | BtPage page = bt->page; 1397 | uint cnt = 0, idx = 0; 1398 | uint max = page->cnt; 1399 | uint newslot = slot; 1400 | BtKey key; 1401 | int ret; 1402 | 1403 | if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 ) 1404 | return slot; 1405 | 1406 | // skip cleanup if nothing to reclaim 1407 | 1408 | if( !page->dirty ) 1409 | return 0; 1410 | 1411 | memcpy (bt->frame, page, bt->page_size); 1412 | 1413 | // skip page info and set rest of page to zero 1414 | 1415 | memset (page+1, 0, bt->page_size - sizeof(*page)); 1416 | page->act = 0; 1417 | 1418 | while( cnt++ < max ) { 1419 | if( cnt == slot ) 1420 | newslot = idx + 1; 1421 | // always leave fence key in list 1422 | if( cnt < max && slotptr(bt->frame,cnt)->dead ) 1423 | continue; 1424 | 1425 | // copy key 1426 | key = keyptr(bt->frame, cnt); 1427 | nxt -= key->len + 1; 1428 | memcpy ((unsigned char *)page + nxt, key, key->len + 1); 1429 | 1430 | // copy slot 1431 | memcpy(slotptr(page, ++idx)->id, slotptr(bt->frame, cnt)->id, BtId); 1432 | if( !(slotptr(page, idx)->dead = slotptr(bt->frame, cnt)->dead) ) 1433 | page->act++; 1434 | slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod; 1435 | slotptr(page, idx)->off = nxt; 1436 | } 1437 | 1438 | page->min = nxt; 1439 | page->cnt = idx; 1440 | 1441 | if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 ) 1442 | return newslot; 1443 | 1444 | return 0; 1445 | } 1446 | 1447 | // split the root and raise the height of the btree 1448 | 1449 | BTERR bt_splitroot(BtDb *bt, unsigned char *leftkey, uid page_no2) 1450 | { 1451 | uint nxt = bt->page_size; 1452 | BtPage root = bt->page; 1453 | uid 
right; 1454 | 1455 | // Obtain an empty page to use, and copy the current 1456 | // root contents into it 1457 | 1458 | if( !(right = bt_newpage(bt, root)) ) 1459 | return bt->err; 1460 | 1461 | // preserve the page info at the bottom 1462 | // and set rest to zero 1463 | 1464 | memset(root+1, 0, bt->page_size - sizeof(*root)); 1465 | 1466 | // insert first key on newroot page 1467 | 1468 | nxt -= *leftkey + 1; 1469 | memcpy ((unsigned char *)root + nxt, leftkey, *leftkey + 1); 1470 | bt_putid(slotptr(root, 1)->id, right); 1471 | slotptr(root, 1)->off = nxt; 1472 | 1473 | // insert second key on newroot page 1474 | // and increase the root height 1475 | 1476 | nxt -= 3; 1477 | ((unsigned char *)root)[nxt] = 2; 1478 | ((unsigned char *)root)[nxt+1] = 0xff; 1479 | ((unsigned char *)root)[nxt+2] = 0xff; 1480 | bt_putid(slotptr(root, 2)->id, page_no2); 1481 | slotptr(root, 2)->off = nxt; 1482 | 1483 | bt_putid(root->right, 0); 1484 | root->min = nxt; // reset lowest used offset and key count 1485 | root->cnt = 2; 1486 | root->act = 2; 1487 | root->lvl++; 1488 | 1489 | // update and release root (bt->page) 1490 | 1491 | if( bt_update(bt, root, bt->page_no) ) 1492 | return bt->err; 1493 | 1494 | return bt_unlockpage(bt, bt->page_no, BtLockWrite); 1495 | } 1496 | 1497 | // split already locked full node 1498 | // return unlocked. 
1499 | 1500 | BTERR bt_splitpage (BtDb *bt) 1501 | { 1502 | uint cnt = 0, idx = 0, max, nxt = bt->page_size; 1503 | unsigned char fencekey[256], rightkey[256]; 1504 | uid page_no = bt->page_no, right; 1505 | BtPage page = bt->page; 1506 | uint lvl = page->lvl; 1507 | BtKey key; 1508 | 1509 | // split higher half of keys to bt->frame 1510 | // the last key (fence key) might be dead 1511 | 1512 | memset (bt->frame, 0, bt->page_size); 1513 | max = page->cnt; 1514 | cnt = max / 2; 1515 | idx = 0; 1516 | 1517 | while( cnt++ < max ) { 1518 | key = keyptr(page, cnt); 1519 | nxt -= key->len + 1; 1520 | memcpy ((unsigned char *)bt->frame + nxt, key, key->len + 1); 1521 | memcpy(slotptr(bt->frame,++idx)->id, slotptr(page,cnt)->id, BtId); 1522 | if( !(slotptr(bt->frame, idx)->dead = slotptr(page, cnt)->dead) ) 1523 | bt->frame->act++; 1524 | slotptr(bt->frame, idx)->tod = slotptr(page, cnt)->tod; 1525 | slotptr(bt->frame, idx)->off = nxt; 1526 | } 1527 | 1528 | // remember fence key for new right page 1529 | 1530 | memcpy (rightkey, key, key->len + 1); 1531 | 1532 | bt->frame->bits = bt->page_bits; 1533 | bt->frame->min = nxt; 1534 | bt->frame->cnt = idx; 1535 | bt->frame->lvl = lvl; 1536 | 1537 | // link right node 1538 | 1539 | if( page_no > ROOT_page ) 1540 | memcpy (bt->frame->right, page->right, BtId); 1541 | 1542 | // get new free page and write frame to it. 
1543 | 1544 | if( !(right = bt_newpage(bt, bt->frame)) ) 1545 | return bt->err; 1546 | 1547 | // update lower keys to continue in old page 1548 | 1549 | memcpy (bt->frame, page, bt->page_size); 1550 | memset (page+1, 0, bt->page_size - sizeof(*page)); 1551 | nxt = bt->page_size; 1552 | page->dirty = 0; 1553 | page->act = 0; 1554 | cnt = 0; 1555 | idx = 0; 1556 | 1557 | // assemble page of smaller keys 1558 | // (they're all active keys) 1559 | 1560 | while( cnt++ < max / 2 ) { 1561 | key = keyptr(bt->frame, cnt); 1562 | nxt -= key->len + 1; 1563 | memcpy ((unsigned char *)page + nxt, key, key->len + 1); 1564 | memcpy(slotptr(page,++idx)->id, slotptr(bt->frame,cnt)->id, BtId); 1565 | slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod; 1566 | slotptr(page, idx)->off = nxt; 1567 | page->act++; 1568 | } 1569 | 1570 | // remember fence key for smaller page 1571 | 1572 | memcpy (fencekey, key, key->len + 1); 1573 | 1574 | bt_putid(page->right, right); 1575 | page->min = nxt; 1576 | page->cnt = idx; 1577 | 1578 | // if current page is the root page, split it 1579 | 1580 | if( page_no == ROOT_page ) 1581 | return bt_splitroot (bt, fencekey, right); 1582 | 1583 | // lock right page 1584 | 1585 | if( bt_lockpage (bt, right, BtLockParent) ) 1586 | return bt->err; 1587 | 1588 | // update left (containing) node 1589 | 1590 | if( bt_update(bt, page, page_no) ) 1591 | return bt->err; 1592 | 1593 | if( bt_lockpage (bt, page_no, BtLockParent) ) 1594 | return bt->err; 1595 | 1596 | if( bt_unlockpage (bt, page_no, BtLockWrite) ) 1597 | return bt->err; 1598 | 1599 | // insert new fence for reformulated left block 1600 | 1601 | if( bt_insertkey (bt, fencekey+1, *fencekey, lvl+1, page_no, time(NULL)) ) 1602 | return bt->err; 1603 | 1604 | // switch fence for right block of larger keys to new right page 1605 | 1606 | if( bt_insertkey (bt, rightkey+1, *rightkey, lvl+1, right, time(NULL)) ) 1607 | return bt->err; 1608 | 1609 | if( bt_unlockpage (bt, page_no, BtLockParent) ) 1610 | 
return bt->err; 1611 | 1612 | return bt_unlockpage (bt, right, BtLockParent); 1613 | } 1614 | 1615 | // Insert new key into the btree at requested level. 1616 | // Level zero pages are leaf pages and are unlocked at exit. 1617 | // Interior nodes remain locked. 1618 | 1619 | BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod) 1620 | { 1621 | uint slot, idx; 1622 | BtPage page; 1623 | BtKey ptr; 1624 | 1625 | while( 1 ) { 1626 | if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) ) 1627 | ptr = keyptr(bt->page, slot); 1628 | else 1629 | { 1630 | if( !bt->err ) 1631 | bt->err = BTERR_ovflw; 1632 | return bt->err; 1633 | } 1634 | 1635 | // if key already exists, update id and return 1636 | 1637 | page = bt->page; 1638 | 1639 | if( !keycmp (ptr, key, len) ) { 1640 | if( slotptr(page, slot)->dead ) 1641 | page->act++; 1642 | slotptr(page, slot)->dead = 0; 1643 | slotptr(page, slot)->tod = tod; 1644 | bt_putid(slotptr(page,slot)->id, id); 1645 | if( bt_update(bt, bt->page, bt->page_no) ) 1646 | return bt->err; 1647 | return bt_unlockpage(bt, bt->page_no, BtLockWrite); 1648 | } 1649 | 1650 | // check if page has enough space 1651 | 1652 | if( slot = bt_cleanpage (bt, len, slot) ) 1653 | break; 1654 | 1655 | if( bt_splitpage (bt) ) 1656 | return bt->err; 1657 | } 1658 | 1659 | // calculate next available slot and copy key into page 1660 | 1661 | page->min -= len + 1; // reset lowest used offset 1662 | ((unsigned char *)page)[page->min] = len; 1663 | memcpy ((unsigned char *)page + page->min +1, key, len ); 1664 | 1665 | for( idx = slot; idx < page->cnt; idx++ ) 1666 | if( slotptr(page, idx)->dead ) 1667 | break; 1668 | 1669 | // now insert key into array before slot 1670 | // preserving the fence slot 1671 | 1672 | if( idx == page->cnt ) 1673 | idx++, page->cnt++; 1674 | 1675 | page->act++; 1676 | 1677 | while( idx > slot ) 1678 | *slotptr(page, idx) = *slotptr(page, idx -1), idx--; 1679 | 1680 | bt_putid(slotptr(page,slot)->id, id); 
1681 | slotptr(page, slot)->off = page->min; 1682 | slotptr(page, slot)->tod = tod; 1683 | slotptr(page, slot)->dead = 0; 1684 | 1685 | if( bt_update(bt, bt->page, bt->page_no) ) 1686 | return bt->err; 1687 | 1688 | return bt_unlockpage(bt, bt->page_no, BtLockWrite); 1689 | } 1690 | 1691 | // cache page of keys into cursor and return starting slot for given key 1692 | 1693 | uint bt_startkey (BtDb *bt, unsigned char *key, uint len) 1694 | { 1695 | uint slot; 1696 | 1697 | // cache page for retrieval 1698 | 1699 | if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) ) 1700 | memcpy (bt->cursor, bt->page, bt->page_size); 1701 | else 1702 | return 0; 1703 | 1704 | bt->cursor_page = bt->page_no; 1705 | 1706 | if( bt_unlockpage(bt, bt->page_no, BtLockRead) ) 1707 | return 0; 1708 | 1709 | return slot; 1710 | } 1711 | 1712 | // return next slot for cursor page 1713 | // or slide cursor right into next page 1714 | 1715 | uint bt_nextkey (BtDb *bt, uint slot) 1716 | { 1717 | off64_t right; 1718 | 1719 | do { 1720 | right = bt_getid(bt->cursor->right); 1721 | 1722 | while( slot++ < bt->cursor->cnt ) 1723 | if( slotptr(bt->cursor,slot)->dead ) 1724 | continue; 1725 | else if( right || (slot < bt->cursor->cnt)) 1726 | return slot; 1727 | else 1728 | break; 1729 | 1730 | if( !right ) 1731 | break; 1732 | 1733 | bt->cursor_page = right; 1734 | 1735 | if( bt_lockpage(bt, right, BtLockRead) ) 1736 | return 0; 1737 | 1738 | if( bt_mappage (bt, &bt->page, right) ) 1739 | return 0; 1740 | 1741 | memcpy (bt->cursor, bt->page, bt->page_size); 1742 | 1743 | if( bt_unlockpage(bt, right, BtLockRead) ) 1744 | return 0; 1745 | 1746 | slot = 0; 1747 | 1748 | } while( 1 ); 1749 | 1750 | return bt->err = 0; 1751 | } 1752 | 1753 | BtKey bt_key(BtDb *bt, uint slot) 1754 | { 1755 | return keyptr(bt->cursor, slot); 1756 | } 1757 | 1758 | uid bt_uid(BtDb *bt, uint slot) 1759 | { 1760 | return bt_getid(slotptr(bt->cursor,slot)->id); 1761 | } 1762 | 1763 | uint bt_tod(BtDb *bt, uint slot) 1764 | { 
1765 | return slotptr(bt->cursor,slot)->tod; 1766 | } 1767 | 1768 | 1769 | #ifdef STANDALONE 1770 | 1771 | #ifndef unix 1772 | double getCpuTime(int type) 1773 | { 1774 | FILETIME crtime[1]; 1775 | FILETIME xittime[1]; 1776 | FILETIME systime[1]; 1777 | FILETIME usrtime[1]; 1778 | SYSTEMTIME timeconv[1]; 1779 | double ans = 0; 1780 | 1781 | memset (timeconv, 0, sizeof(SYSTEMTIME)); 1782 | 1783 | switch( type ) { 1784 | case 0: 1785 | GetSystemTimeAsFileTime (xittime); 1786 | FileTimeToSystemTime (xittime, timeconv); 1787 | ans = (double)timeconv->wDayOfWeek * 3600 * 24; 1788 | break; 1789 | case 1: 1790 | GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime); 1791 | FileTimeToSystemTime (usrtime, timeconv); 1792 | break; 1793 | case 2: 1794 | GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime); 1795 | FileTimeToSystemTime (systime, timeconv); 1796 | break; 1797 | } 1798 | 1799 | ans += (double)timeconv->wHour * 3600; 1800 | ans += (double)timeconv->wMinute * 60; 1801 | ans += (double)timeconv->wSecond; 1802 | ans += (double)timeconv->wMilliseconds / 1000; 1803 | return ans; 1804 | } 1805 | #else 1806 | #include 1807 | #include 1808 | 1809 | double getCpuTime(int type) 1810 | { 1811 | struct rusage used[1]; 1812 | struct timeval tv[1]; 1813 | 1814 | switch( type ) { 1815 | case 0: 1816 | gettimeofday(tv, NULL); 1817 | return (double)tv->tv_sec + (double)tv->tv_usec / 1000000; 1818 | 1819 | case 1: 1820 | getrusage(RUSAGE_SELF, used); 1821 | return (double)used->ru_utime.tv_sec + (double)used->ru_utime.tv_usec / 1000000; 1822 | 1823 | case 2: 1824 | getrusage(RUSAGE_SELF, used); 1825 | return (double)used->ru_stime.tv_sec + (double)used->ru_stime.tv_usec / 1000000; 1826 | } 1827 | 1828 | return 0; 1829 | } 1830 | #endif 1831 | 1832 | // standalone program to index file of keys 1833 | // then list them onto std-out 1834 | 1835 | int main (int argc, char **argv) 1836 | { 1837 | uint slot, line = 0, off = 0, found = 0; 1838 | int 
dead, ch, cnt = 0, bits = 12; 1839 | unsigned char key[256]; 1840 | double done, start; 1841 | uint pgblk = 0; 1842 | float elapsed; 1843 | time_t tod[1]; 1844 | uint scan = 0; 1845 | uint len = 0; 1846 | uint map = 0; 1847 | uid page_no; 1848 | BtKey ptr; 1849 | BtDb *bt; 1850 | FILE *in; 1851 | 1852 | if( argc < 4 ) { 1853 | fprintf (stderr, "Usage: %s idx_file src_file Read/Write/Scan/Delete/Find [page_bits mapped_pool_segments pages_per_segment start_line_number]\n", argv[0]); 1854 | fprintf (stderr, " page_bits: size of btree page in bits\n"); 1855 | fprintf (stderr, " mapped_pool_segments: size of buffer pool in segments\n"); 1856 | fprintf (stderr, " pages_per_segment: size of buffer pool segment in pages in bits\n"); 1857 | exit(0); 1858 | } 1859 | 1860 | start = getCpuTime(0); 1861 | time(tod); 1862 | 1863 | if( argc > 4 ) 1864 | bits = atoi(argv[4]); 1865 | 1866 | if( argc > 5 ) 1867 | map = atoi(argv[5]); 1868 | 1869 | if( map > 65536 ) 1870 | fprintf (stderr, "Warning: buffer_pool > 65536 segments\n"); 1871 | 1872 | if( map && map < 8 ) 1873 | fprintf (stderr, "Buffer_pool too small\n"); 1874 | 1875 | if( argc > 6 ) 1876 | pgblk = atoi(argv[6]); 1877 | 1878 | if( bits + pgblk > 30 ) 1879 | fprintf (stderr, "Warning: very large buffer pool segment size\n"); 1880 | 1881 | if( argc > 7 ) 1882 | off = atoi(argv[7]); 1883 | 1884 | bt = bt_open ((argv[1]), BT_rw, bits, map, pgblk); 1885 | 1886 | if( !bt ) { 1887 | fprintf(stderr, "Index Open Error %s\n", argv[1]); 1888 | exit (1); 1889 | } 1890 | 1891 | switch(argv[3][0]| 0x20) 1892 | { 1893 | case 'w': 1894 | fprintf(stderr, "started indexing for %s\n", argv[2]); 1895 | if( argc > 2 && (in = fopen (argv[2], "rb")) ) 1896 | while( ch = getc(in), ch != EOF ) 1897 | if( ch == '\n' ) 1898 | { 1899 | if( off ) 1900 | sprintf((char *)key+len, "%.9d", line + off), len += 9; 1901 | 1902 | if( bt_insertkey (bt, key, len, 0, ++line, *tod) ) 1903 | fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0); 1904 | 
len = 0; 1905 | } 1906 | else if( len < 245 ) 1907 | key[len++] = ch; 1908 | fprintf(stderr, "finished adding keys for %s, %d \n", argv[2], line); 1909 | break; 1910 | 1911 | case 'd': 1912 | fprintf(stderr, "started deleting keys for %s\n", argv[2]); 1913 | if( argc > 2 && (in = fopen (argv[2], "rb")) ) 1914 | while( ch = getc(in), ch != EOF ) 1915 | if( ch == '\n' ) 1916 | { 1917 | if( off ) 1918 | sprintf((char *)key+len, "%.9d", line + off), len += 9; 1919 | line++; 1920 | if( bt_deletekey (bt, key, len, 0) ) 1921 | fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0); 1922 | len = 0; 1923 | } 1924 | else if( len < 245 ) 1925 | key[len++] = ch; 1926 | fprintf(stderr, "finished deleting keys for %s, %d \n", argv[2], line); 1927 | break; 1928 | 1929 | case 'f': 1930 | fprintf(stderr, "started finding keys for %s\n", argv[2]); 1931 | if( argc > 2 && (in = fopen (argv[2], "rb")) ) 1932 | while( ch = getc(in), ch != EOF ) 1933 | if( ch == '\n' ) 1934 | { 1935 | if( off ) 1936 | sprintf((char *)key+len, "%.9d", line + off), len += 9; 1937 | line++; 1938 | if( bt_findkey (bt, key, len) ) 1939 | found++; 1940 | else if( bt->err ) 1941 | fprintf(stderr, "Error %d Syserr %d Line: %d\n", bt->err, errno, line), exit(0); 1942 | len = 0; 1943 | } 1944 | else if( len < 245 ) 1945 | key[len++] = ch; 1946 | fprintf(stderr, "finished search of %d keys for %s, found %d\n", line, argv[2], found); 1947 | break; 1948 | 1949 | case 's': 1950 | cnt = len = key[0] = 0; 1951 | 1952 | if( slot = bt_startkey (bt, key, len) ) 1953 | slot--; 1954 | else 1955 | fprintf(stderr, "Error %d in StartKey. 
Syserror: %d\n", bt->err, errno), exit(0); 1956 | 1957 | while( slot = bt_nextkey (bt, slot) ) { 1958 | ptr = bt_key(bt, slot); 1959 | fwrite (ptr->key, ptr->len, 1, stdout); 1960 | fputc ('\n', stdout); 1961 | cnt++; 1962 | } 1963 | 1964 | fprintf(stderr, " Total keys read %d\n", cnt - 1); 1965 | break; 1966 | 1967 | case 'c': 1968 | fprintf(stderr, "started counting\n"); 1969 | cnt = 0; 1970 | 1971 | page_no = LEAF_page; 1972 | cnt = 0; 1973 | 1974 | while( 1 ) { 1975 | uid off = page_no << bt->page_bits; 1976 | #ifdef unix 1977 | if( !pread (bt->idx, bt->frame, bt->page_size, off) ) 1978 | break; 1979 | #else 1980 | DWORD amt[1]; 1981 | 1982 | SetFilePointer (bt->idx, (long)off, (long*)(&off)+1, FILE_BEGIN); 1983 | 1984 | if( !ReadFile(bt->idx, bt->frame, bt->page_size, amt, NULL)) 1985 | break; 1986 | 1987 | if( *amt < bt->page_size ) 1988 | fprintf (stderr, "unable to read page %.8x", page_no); 1989 | #endif 1990 | if( !bt->frame->free && !bt->frame->lvl ) 1991 | cnt += bt->frame->act; 1992 | 1993 | page_no++; 1994 | } 1995 | 1996 | cnt--; // remove stopper key 1997 | fprintf(stderr, " Total keys read %d\n", cnt); 1998 | break; 1999 | } 2000 | 2001 | done = getCpuTime(0); 2002 | elapsed = (float)(done - start); 2003 | fprintf(stderr, " real %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60); 2004 | elapsed = getCpuTime(1); 2005 | fprintf(stderr, " user %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60); 2006 | elapsed = getCpuTime(2); 2007 | fprintf(stderr, " sys %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60); 2008 | 2009 | return 0; 2010 | } 2011 | 2012 | #endif //STANDALONE 2013 | -------------------------------------------------------------------------------- /btree2v.c: -------------------------------------------------------------------------------- 1 | // btree version 2v linux futex contention 2 | // with combined latch & pool manager 3 | // and phase-fair reader writer lock 4 | // 12 MAR 2014 5 | 6 | // author: 
karl malbrain, malbrain@cal.berkeley.edu 7 | 8 | /* 9 | This work, including the source code, documentation 10 | and related data, is placed into the public domain. 11 | 12 | The original author is Karl Malbrain. 13 | 14 | THIS SOFTWARE IS PROVIDED AS-IS WITHOUT WARRANTY 15 | OF ANY KIND, NOT EVEN THE IMPLIED WARRANTY OF 16 | MERCHANTABILITY. THE AUTHOR OF THIS SOFTWARE, 17 | ASSUMES _NO_ RESPONSIBILITY FOR ANY CONSEQUENCE 18 | RESULTING FROM THE USE, MODIFICATION, OR 19 | REDISTRIBUTION OF THIS SOFTWARE. 20 | */ 21 | 22 | // Please see the project home page for documentation 23 | // code.google.com/p/high-concurrency-btree 24 | 25 | #define _FILE_OFFSET_BITS 64 26 | #define _LARGEFILE64_SOURCE 27 | 28 | #ifdef linux 29 | #define _GNU_SOURCE 30 | #include 31 | #include 32 | #define SYS_futex 202 33 | #endif 34 | 35 | #ifdef unix 36 | #include 37 | #include 38 | #include 39 | #include 40 | #include 41 | #include 42 | #include 43 | #else 44 | #define WIN32_LEAN_AND_MEAN 45 | #include 46 | #include 47 | #include 48 | #include 49 | #include 50 | #include 51 | #endif 52 | 53 | #include 54 | #include 55 | 56 | typedef unsigned long long uid; 57 | 58 | #ifndef unix 59 | typedef unsigned long long off64_t; 60 | typedef unsigned short ushort; 61 | typedef unsigned int uint; 62 | #endif 63 | 64 | #define BT_ro 0x6f72 // ro 65 | #define BT_rw 0x7772 // rw 66 | #define BT_fl 0x6c66 // fl 67 | 68 | #define BT_maxbits 15 // maximum page size in bits 69 | #define BT_minbits 12 // minimum page size in bits 70 | #define BT_minpage (1 << BT_minbits) // minimum page size 71 | #define BT_maxpage (1 << BT_maxbits) // maximum page size 72 | 73 | // BTree page number constants 74 | #define ALLOC_page 0 75 | #define ROOT_page 1 76 | #define LEAF_page 2 77 | #define LATCH_page 3 78 | 79 | // Number of levels to create in a new BTree 80 | 81 | #define MIN_lvl 2 82 | #define MAX_lvl 15 83 | 84 | /* 85 | There are five lock types for each node in three independent sets: 86 | 1.
(set 1) AccessIntent: Sharable. Going to Read the node. Incompatible with NodeDelete. 87 | 2. (set 1) NodeDelete: Exclusive. About to release the node. Incompatible with AccessIntent. 88 | 3. (set 2) ReadLock: Sharable. Read the node. Incompatible with WriteLock. 89 | 4. (set 2) WriteLock: Exclusive. Modify the node. Incompatible with ReadLock and other WriteLocks. 90 | 5. (set 3) ParentModification: Exclusive. Change the node's parent keys. Incompatible with another ParentModification. 91 | */ 92 | 93 | typedef enum{ 94 | BtLockAccess, 95 | BtLockDelete, 96 | BtLockRead, 97 | BtLockWrite, 98 | BtLockParent 99 | }BtLock; 100 | 101 | enum { 102 | QueRd = 1, // reader queue 103 | QueWr = 2 // writer queue 104 | } RWQueue; 105 | 106 | volatile typedef struct { 107 | ushort rin[1]; // readers in count 108 | ushort rout[1]; // readers out count 109 | ushort ticket[1]; // writers in count 110 | ushort serving[1]; // writers out count 111 | } RWLock; 112 | 113 | // define bits at bottom of rin 114 | 115 | #define PHID 0x1 // writer phase (0/1) 116 | #define PRES 0x2 // writer present 117 | #define MASK 0x3 // both write bits 118 | #define RINC 0x4 // reader increment 119 | 120 | typedef struct { 121 | union { 122 | struct { 123 | ushort serving; 124 | ushort next; 125 | } bits[1]; 126 | uint value[1]; 127 | }; 128 | } BtMutex; 129 | 130 | // Define the length of the page and key pointers 131 | 132 | #define BtId 6 133 | 134 | // Page key slot definition. 135 | 136 | // If BT_maxbits is 15 or less, you can save 2 bytes 137 | // for each key stored by making the first two uints 138 | // into ushorts. You can also save 4 bytes by removing 139 | // the tod field from the key. 140 | 141 | // Keys are marked dead, but remain on the page until 142 | // cleanup is called. The fence key (highest key) for 143 | // the page is always present, even if dead. 
144 | 145 | typedef struct { 146 | #ifdef USETOD 147 | uint tod; // time-stamp for key 148 | #endif 149 | ushort off:BT_maxbits; // page offset for key start 150 | ushort dead:1; // set for deleted key 151 | unsigned char id[BtId]; // id associated with key 152 | } BtSlot; 153 | 154 | // The key structure occupies space at the upper end of 155 | // each page. It's a length byte followed by the value 156 | // bytes. 157 | 158 | typedef struct { 159 | unsigned char len; 160 | unsigned char key[0]; 161 | } *BtKey; 162 | 163 | // The first part of an index page. 164 | // It is immediately followed 165 | // by the BtSlot array of keys. 166 | 167 | typedef struct BtPage_ { 168 | uint cnt; // count of keys in page 169 | uint act; // count of active keys 170 | uint min; // next key offset 171 | unsigned char bits:6; // page size in bits 172 | unsigned char free:1; // page is on free list 173 | unsigned char dirty:1; // page is dirty in cache 174 | unsigned char lvl:6; // level of page 175 | unsigned char kill:1; // page is being deleted 176 | unsigned char clean:1; // page needs cleaning 177 | unsigned char right[BtId]; // page number to right 178 | } *BtPage; 179 | 180 | typedef struct { 181 | struct BtPage_ alloc[2]; // next & free page_nos in right ptr 182 | BtMutex lock[1]; // allocation area lite latch 183 | volatile uint latchdeployed;// highest number of latch entries deployed 184 | volatile uint nlatchpage; // number of latch pages at BT_latch 185 | volatile uint latchtotal; // number of page latch entries 186 | volatile uint latchhash; // number of latch hash table slots 187 | volatile uint latchvictim; // next latch hash entry to examine 188 | volatile uint safelevel; // safe page level in cache 189 | volatile uint cache[MAX_lvl];// cache census counts by btree level 190 | } BtLatchMgr; 191 | 192 | // latch hash table entries 193 | 194 | typedef struct { 195 | unsigned char busy[1]; // Latch table entry is busy being reallocated 196 | uint slot:24; // Latch table 
entry at head of collision chain 197 | } BtHashEntry; 198 | 199 | // latch manager table structure 200 | 201 | typedef struct { 202 | volatile uid page_no; // latch set page number on disk 203 | RWLock readwr[1]; // read/write page lock 204 | RWLock access[1]; // Access Intent/Page delete 205 | RWLock parent[1]; // Posting of fence key in parent 206 | volatile ushort pin; // number of pins/level/clock bits 207 | volatile uint next; // next entry in hash table chain 208 | volatile uint prev; // prev entry in hash table chain 209 | } BtLatchSet; 210 | 211 | #define CLOCK_mask 0xe000 212 | #define CLOCK_unit 0x2000 213 | #define PIN_mask 0x07ff 214 | #define LVL_mask 0x1800 215 | #define LVL_shift 11 216 | 217 | // The object structure for Btree access 218 | 219 | typedef struct _BtDb { 220 | uint page_size; // each page size 221 | uint page_bits; // each page size in bits 222 | uid page_no; // current page number 223 | uid cursor_page; // current cursor page number 224 | int err; 225 | uint mode; // read-write mode 226 | BtPage cursor; // cached frame for start/next (never mapped) 227 | BtPage frame; // spare frame for the page split (never mapped) 228 | BtPage page; // current mapped page in buffer pool 229 | BtLatchSet *latch; // current page latch 230 | BtLatchMgr *latchmgr; // mapped latch page from allocation page 231 | BtLatchSet *latchsets; // mapped latch set from latch pages 232 | unsigned char *pagepool; // cached page pool set 233 | BtHashEntry *table; // the hash table 234 | #ifdef unix 235 | int idx; 236 | #else 237 | HANDLE idx; 238 | HANDLE halloc; // allocation and latch table handle 239 | #endif 240 | unsigned char *mem; // frame, cursor, memory buffers 241 | uint found; // last deletekey found key 242 | } BtDb; 243 | 244 | typedef enum { 245 | BTERR_ok = 0, 246 | BTERR_notfound, 247 | BTERR_struct, 248 | BTERR_ovflw, 249 | BTERR_read, 250 | BTERR_lock, 251 | BTERR_hash, 252 | BTERR_kill, 253 | BTERR_map, 254 | BTERR_wrt, 255 | BTERR_eof 256 | } 
BTERR; 257 | 258 | // B-Tree functions 259 | extern void bt_close (BtDb *bt); 260 | extern BtDb *bt_open (char *name, uint mode, uint bits, uint cacheblk); 261 | extern BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod); 262 | extern BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl); 263 | extern uid bt_findkey (BtDb *bt, unsigned char *key, uint len); 264 | extern uint bt_startkey (BtDb *bt, unsigned char *key, uint len); 265 | extern uint bt_nextkey (BtDb *bt, uint slot); 266 | 267 | // internal functions 268 | void bt_update (BtDb *bt, BtPage page); 269 | BtPage bt_mappage (BtDb *bt, BtLatchSet *latch); 270 | // Helper functions to return slot values 271 | 272 | extern BtKey bt_key (BtDb *bt, uint slot); 273 | extern uid bt_uid (BtDb *bt, uint slot); 274 | #ifdef USETOD 275 | extern uint bt_tod (BtDb *bt, uint slot); 276 | #endif 277 | 278 | // The page is allocated from the low and high ends. 279 | // The key offsets and row-ids are allocated 280 | // from the bottom, while the text of the key 281 | // is allocated from the top. When the two 282 | // areas meet, the page is split into two. 283 | 284 | // A key consists of a length byte, two bytes of 285 | // index number (0 - 65534), and up to 253 bytes 286 | // of key value. Duplicate keys are discarded. 287 | // Associated with each key is a 48-bit row-id. 288 | 289 | // The b-tree root is always located at page 1. 290 | // The first leaf page of level zero is always 291 | // located on page 2. 292 | 293 | // The b-tree pages are linked with right 294 | // pointers to facilitate enumerators, 295 | // and provide for concurrency. 296 | 297 | // When the root page fills, it is split in two and 298 | // the tree height is raised by a new root at page 299 | // one with two keys. 300 | 301 | // Deleted keys are marked with a dead bit until 302 | // page cleanup. The fence key for a node is always 303 | // present, even after deletion and cleanup. 
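// Illustrative sketch (not part of btree2v.c): the two-ended page layout
// described in the comments above, reduced to a minimal form. DemoHdr,
// DemoSlot, DemoPage and the demo_* helpers are hypothetical names, the
// 4096-byte page size is an assumption, and keys are appended in arrival
// order rather than kept sorted as the real insert path would require.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u          /* assumed page size (1 << 12) */

typedef struct {
    uint32_t cnt;                     /* slots in use */
    uint32_t min;                     /* lowest offset occupied by key text */
} DemoHdr;

typedef struct {
    uint16_t off;                     /* offset of the key's length byte */
    unsigned char id[6];              /* 48-bit row-id, most significant byte first */
} DemoSlot;

typedef union {
    DemoHdr hdr;                      /* header sits at the start of the page */
    unsigned char bytes[DEMO_PAGE_SIZE];
} DemoPage;

static void demo_init(DemoPage *page)
{
    memset(page, 0, sizeof *page);
    page->hdr.min = DEMO_PAGE_SIZE;   /* key text starts at the very top */
}

/* Bytes still free between the slot array (growing up)
   and the key text (growing down). */
static uint32_t demo_room(const DemoPage *page)
{
    uint32_t slots_end = (uint32_t)(sizeof(DemoHdr) + page->hdr.cnt * sizeof(DemoSlot));
    return page->hdr.min - slots_end;
}

/* Append one key. Returns 1 on success, or 0 when the two areas
   would meet, which is the point at which a real page would split. */
static int demo_append(DemoPage *page, const unsigned char *key,
                       unsigned char len, uint64_t id)
{
    uint32_t need = (uint32_t)sizeof(DemoSlot) + 1u + len;
    DemoSlot slot;
    int i;

    if (demo_room(page) < need)
        return 0;

    /* key text: a length byte followed by the key bytes, taken from the top */
    page->hdr.min -= 1u + len;
    page->bytes[page->hdr.min] = len;
    memcpy(page->bytes + page->hdr.min + 1, key, len);

    /* slot: the key offset plus the row-id packed into six bytes */
    slot.off = (uint16_t)page->hdr.min;
    for (i = 6; i--; id >>= 8)
        slot.id[i] = (unsigned char)id;

    memcpy(page->bytes + sizeof(DemoHdr) + page->hdr.cnt * sizeof(DemoSlot),
           &slot, sizeof slot);
    page->hdr.cnt++;
    return 1;
}
```

// A caller would run demo_init once, then call demo_append until it
// returns 0; at that point the slot array and the key text have met.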
304 | 305 | // Deleted leaf pages are reclaimed on a free list. 306 | // The upper levels of the btree are fixed on creation. 307 | 308 | // To achieve maximum concurrency, one page is locked at a time 309 | // as the tree is traversed to find the leaf key in question. The right 310 | // page numbers are used in cases where the page is being split 311 | // or consolidated. 312 | 313 | // Page 0 (ALLOC page) is dedicated to locking for new page extensions, 314 | // and chains empty leaf pages together for reuse. 315 | 316 | // Parent locks are obtained to prevent resplitting or deleting a node 317 | // before its fence is posted into its upper level. 318 | 319 | // A special open mode of BT_fl is provided to safely access files on 320 | // WIN32 networks. WIN32 network operations should not use memory mapping. 321 | // This WIN32 mode sets FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH 322 | // to prevent local caching of network file contents. 323 | 324 | // Access macros to address slot and key values from the page. 325 | // Page slots use 1-based indexing. 
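// Illustrative sketch (not part of btree2v.c): what 1-based slot
// indexing means for address arithmetic. MiniHdr, MiniSlot,
// mini_slotptr and mini_keyptr are hypothetical stand-ins with far
// fewer fields than the real page and slot structures; they are
// written as functions only to make the arithmetic easier to read.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t cnt; } MiniHdr;    /* hypothetical page header */
typedef struct { uint16_t off; } MiniSlot;   /* hypothetical slot: key offset only */

/* Slot numbers run 1..cnt. The slot array starts immediately after
   the header, so slot n lives at array index n - 1. */
static MiniSlot *mini_slotptr(MiniHdr *page, unsigned slot)
{
    return (MiniSlot *)(page + 1) + (slot - 1);
}

/* A slot stores the byte offset of its key, measured from the start
   of the page; the first byte at that offset is the key length. */
static unsigned char *mini_keyptr(MiniHdr *page, unsigned slot)
{
    return (unsigned char *)page + mini_slotptr(page, slot)->off;
}
```

// With this convention slot 0 is never a valid slot, which leaves
// zero free to serve as a "no slot" return value.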
326 | 327 | #define slotptr(page, slot) (((BtSlot *)(page+1)) + (slot-1)) 328 | #define keyptr(page, slot) ((BtKey)((unsigned char*)(page) + slotptr(page, slot)->off)) 329 | 330 | void bt_putid(unsigned char *dest, uid id) 331 | { 332 | int i = BtId; 333 | 334 | while( i-- ) 335 | dest[i] = (unsigned char)id, id >>= 8; 336 | } 337 | 338 | uid bt_getid(unsigned char *src) 339 | { 340 | uid id = 0; 341 | int i; 342 | 343 | for( i = 0; i < BtId; i++ ) 344 | id <<= 8, id |= *src++; 345 | 346 | return id; 347 | } 348 | 349 | BTERR bt_abort (BtDb *bt, BtPage page, uid page_no, BTERR err) 350 | { 351 | BtKey ptr; 352 | 353 | fprintf(stderr, "\n Btree2 abort, error %d on page %.8x\n", err, page_no); 354 | fprintf(stderr, "level=%d kill=%d free=%d cnt=%x act=%x\n", page->lvl, page->kill, page->free, page->cnt, page->act); 355 | ptr = keyptr(page, page->cnt); 356 | fprintf(stderr, "fence='%.*s'\n", ptr->len, ptr->key); 357 | fprintf(stderr, "right=%.8x\n", bt_getid(page->right)); 358 | return bt->err = err; 359 | } 360 | 361 | // Phase-Fair reader/writer lock implementation 362 | // with futex calls on contention 363 | 364 | int sys_futex(void *addr1, int op, int val1, struct timespec *timeout, void *addr2, int val3) 365 | { 366 | return syscall(SYS_futex, addr1, op, val1, timeout, addr2, val3); 367 | } 368 | 369 | // a phase fair reader/writer lock implementation 370 | 371 | void WriteLock (RWLock *lock) 372 | { 373 | ushort w, r, tix; 374 | uint prev; 375 | 376 | tix = __sync_fetch_and_add (lock->ticket, 1); 377 | 378 | // wait for our ticket to come up 379 | 380 | while( 1 ) { 381 | prev = *lock->ticket | *lock->serving << 16; 382 | if( tix == prev >> 16 ) 383 | break; 384 | sys_futex( (uint *)lock->ticket, FUTEX_WAIT_BITSET, prev, NULL, NULL, QueWr ); 385 | } 386 | 387 | w = PRES | (tix & PHID); 388 | r = __sync_fetch_and_add (lock->rin, w); 389 | 390 | while( 1 ) { 391 | prev = *lock->rin | *lock->rout << 16; 392 | if( r == prev >> 16 ) 393 | break; 394 | sys_futex( 
(uint *)lock->rin, FUTEX_WAIT_BITSET, prev, NULL, NULL, QueWr ); 395 | } 396 | } 397 | 398 | void WriteRelease (RWLock *lock) 399 | { 400 | __sync_fetch_and_and (lock->rin, ~MASK); 401 | lock->serving[0]++; 402 | 403 | if( (*lock->rin & ~MASK) != (*lock->rout & ~MASK) ) 404 | if( sys_futex( (uint *)lock->rin, FUTEX_WAKE_BITSET, INT_MAX, NULL, NULL, QueRd ) ) 405 | return; 406 | 407 | if( *lock->ticket != *lock->serving ) 408 | sys_futex( (uint *)lock->ticket, FUTEX_WAKE_BITSET, INT_MAX, NULL, NULL, QueWr ); 409 | } 410 | 411 | void ReadLock (RWLock *lock) 412 | { 413 | uint prev; 414 | ushort w; 415 | 416 | w = __sync_fetch_and_add (lock->rin, RINC) & MASK; 417 | 418 | if( w ) 419 | while( 1 ) { 420 | prev = *lock->rin | *lock->rout << 16; 421 | if( w != (prev & MASK) ) 422 | break; 423 | sys_futex( (uint *)lock->rin, FUTEX_WAIT_BITSET, prev, NULL, NULL, QueRd ); 424 | } 425 | } 426 | 427 | void ReadRelease (RWLock *lock) 428 | { 429 | __sync_fetch_and_add (lock->rout, RINC); 430 | 431 | if( *lock->ticket == *lock->serving ) 432 | return; 433 | 434 | if( *lock->rin & PRES ) 435 | if( sys_futex( (uint *)lock->rin, FUTEX_WAKE_BITSET, 1, NULL, NULL, QueWr ) ) 436 | return; 437 | 438 | sys_futex( (uint *)lock->ticket, FUTEX_WAKE_BITSET, INT_MAX, NULL, NULL, QueWr ); 439 | } 440 | 441 | // lite weight FIFO mutex Manager 442 | 443 | void bt_getmutex (BtMutex *mutex) 444 | { 445 | uint prev, ours; 446 | 447 | ours = __sync_fetch_and_add (&mutex->bits->next, 1); 448 | 449 | while( 1 ) { 450 | prev = mutex->value[0]; 451 | if( ours == mutex->bits->serving ) 452 | break; 453 | sys_futex( mutex->value, FUTEX_WAIT_BITSET, prev, NULL, NULL, 0 ); 454 | } 455 | } 456 | 457 | void bt_relmutex (BtMutex *mutex) 458 | { 459 | ushort serving = ++mutex->bits->serving; 460 | 461 | if( mutex->bits->next != serving ) 462 | sys_futex( mutex->value, FUTEX_WAKE_BITSET, 1, NULL, NULL, 0 ); 463 | } 464 | 465 | // read page from permanent location in Btree file 466 | 467 | BTERR bt_readpage 
(BtDb *bt, BtPage page, uid page_no) 468 | { 469 | off64_t off = page_no << bt->page_bits; 470 | 471 | #ifdef unix 472 | if( pread (bt->idx, page, bt->page_size, page_no << bt->page_bits) < bt->page_size ) { 473 | fprintf (stderr, "Unable to read page %.8x errno = %d\n", page_no, errno); 474 | return bt->err = BTERR_read; 475 | } 476 | #else 477 | OVERLAPPED ovl[1]; 478 | uint amt[1]; 479 | 480 | memset (ovl, 0, sizeof(OVERLAPPED)); 481 | ovl->Offset = off; 482 | ovl->OffsetHigh = off >> 32; 483 | 484 | if( !ReadFile(bt->idx, page, bt->page_size, amt, ovl)) { 485 | fprintf (stderr, "Unable to read page %.8x GetLastError = %d\n", page_no, GetLastError()); 486 | return bt->err = BTERR_read; 487 | } 488 | if( *amt < bt->page_size ) { 489 | fprintf (stderr, "Unable to read page %.8x GetLastError = %d\n", page_no, GetLastError()); 490 | return bt->err = BTERR_read; 491 | } 492 | #endif 493 | return 0; 494 | } 495 | 496 | // write page to permanent location in Btree file 497 | // clear the dirty bit 498 | 499 | BTERR bt_writepage (BtDb *bt, BtPage page, uid page_no) 500 | { 501 | off64_t off = page_no << bt->page_bits; 502 | 503 | #ifdef unix 504 | page->dirty = 0; 505 | 506 | if( pwrite(bt->idx, page, bt->page_size, off) < bt->page_size ) 507 | return bt->err = BTERR_wrt; 508 | #else 509 | OVERLAPPED ovl[1]; 510 | uint amt[1]; 511 | 512 | memset (ovl, 0, sizeof(OVERLAPPED)); 513 | ovl->Offset = off; 514 | ovl->OffsetHigh = off >> 32; 515 | page->dirty = 0; 516 | 517 | if( !WriteFile(bt->idx, page, bt->page_size, amt, ovl) ) 518 | return bt->err = BTERR_wrt; 519 | 520 | if( *amt < bt->page_size ) 521 | return bt->err = BTERR_wrt; 522 | #endif 523 | return 0; 524 | } 525 | 526 | // link latch table entry into head of latch hash table 527 | 528 | BTERR bt_latchlink (BtDb *bt, uint hashidx, uint slot, uid page_no) 529 | { 530 | BtPage page = (BtPage)((uid)slot * bt->page_size + bt->pagepool); 531 | BtLatchSet *latch = bt->latchsets + slot; 532 | int lvl; 533 | 534 | if( 
latch->next = bt->table[hashidx].slot ) 535 | bt->latchsets[latch->next].prev = slot; 536 | 537 | bt->table[hashidx].slot = slot; 538 | latch->page_no = page_no; 539 | latch->prev = 0; 540 | latch->pin = 1; 541 | 542 | if( bt_readpage (bt, page, page_no) ) 543 | return bt->err; 544 | 545 | lvl = page->lvl << LVL_shift; 546 | if( lvl > LVL_mask ) 547 | lvl = LVL_mask; 548 | latch->pin |= lvl; // store lvl 549 | latch->pin |= lvl << 3; // initialize clock 550 | 551 | #ifdef unix 552 | __sync_fetch_and_add (&bt->latchmgr->cache[page->lvl], 1); 553 | #else 554 | _InterlockedAdd(&bt->latchmgr->cache[page->lvl], 1); 555 | #endif 556 | return bt->err = 0; 557 | } 558 | 559 | // release latch pin 560 | 561 | void bt_unpinlatch (BtLatchSet *latch) 562 | { 563 | #ifdef unix 564 | __sync_fetch_and_add(&latch->pin, -1); 565 | #else 566 | _InterlockedDecrement16 (&latch->pin); 567 | #endif 568 | } 569 | 570 | // find existing latchset or inspire new one 571 | // return with latchset pinned 572 | 573 | BtLatchSet *bt_pinlatch (BtDb *bt, uid page_no) 574 | { 575 | uint hashidx = page_no % bt->latchmgr->latchhash; 576 | BtLatchSet *latch; 577 | uint slot, idx; 578 | uint lvl, cnt; 579 | off64_t off; 580 | uint amt[1]; 581 | BtPage page; 582 | 583 | // try to find our entry 584 | 585 | while( __sync_fetch_and_or (bt->table[hashidx].busy, 1) == 1 ) 586 | sched_yield(); 587 | 588 | if( slot = bt->table[hashidx].slot ) do 589 | { 590 | latch = bt->latchsets + slot; 591 | if( page_no == latch->page_no ) 592 | break; 593 | } while( slot = latch->next ); 594 | 595 | // found our entry 596 | // increment clock 597 | 598 | if( slot ) { 599 | latch = bt->latchsets + slot; 600 | lvl = (latch->pin & LVL_mask) >> LVL_shift; 601 | lvl *= CLOCK_unit * 2; 602 | lvl |= CLOCK_unit; 603 | #ifdef unix 604 | __sync_fetch_and_add(&latch->pin, 1); 605 | __sync_fetch_and_or(&latch->pin, lvl); 606 | #else 607 | _InterlockedIncrement16 (&latch->pin); 608 | _InterlockedOr16 (&latch->pin, lvl); 609 | #endif 
610 | bt->table[hashidx].busy[0] = 0; 611 | return latch; 612 | } 613 | 614 | // see if there are any unused pool entries 615 | #ifdef unix 616 | slot = __sync_fetch_and_add (&bt->latchmgr->latchdeployed, 1) + 1; 617 | #else 618 | slot = _InterlockedIncrement (&bt->latchmgr->latchdeployed); 619 | #endif 620 | 621 | if( slot < bt->latchmgr->latchtotal ) { 622 | latch = bt->latchsets + slot; 623 | if( bt_latchlink (bt, hashidx, slot, page_no) ) 624 | return NULL; 625 | bt->table[hashidx].busy[0] = 0; 626 | return latch; 627 | } 628 | 629 | #ifdef unix 630 | __sync_fetch_and_add (&bt->latchmgr->latchdeployed, -1); 631 | #else 632 | _InterlockedDecrement (&bt->latchmgr->latchdeployed); 633 | #endif 634 | // find and reuse previous entry on victim 635 | 636 | while( 1 ) { 637 | #ifdef unix 638 | slot = __sync_fetch_and_add(&bt->latchmgr->latchvictim, 1); 639 | #else 640 | slot = _InterlockedIncrement (&bt->latchmgr->latchvictim) - 1; 641 | #endif 642 | // try to get write lock on hash chain 643 | // skip entry if not obtained 644 | // or has outstanding pins 645 | 646 | slot %= bt->latchmgr->latchtotal; 647 | 648 | // on slot wraparound, check census 649 | // count and increment safe level 650 | 651 | cnt = bt->latchmgr->cache[bt->latchmgr->safelevel]; 652 | 653 | if( !slot ) { 654 | if( cnt < bt->latchmgr->latchtotal / 10 ) 655 | #ifdef unix 656 | __sync_fetch_and_add(&bt->latchmgr->safelevel, 1); 657 | #else 658 | _InterlockedIncrement (&bt->latchmgr->safelevel); 659 | #endif 660 | continue; 661 | } 662 | 663 | latch = bt->latchsets + slot; 664 | idx = latch->page_no % bt->latchmgr->latchhash; 665 | lvl = (latch->pin & LVL_mask) >> LVL_shift; 666 | 667 | // see if we are evicting this level yet 668 | // or if we are on same chain as hashidx 669 | 670 | if( idx == hashidx || lvl > bt->latchmgr->safelevel ) 671 | continue; 672 | 673 | if( __sync_fetch_and_or (bt->table[idx].busy, 1) == 1 ) 674 | continue; 675 | 676 | if( latch->pin & ~LVL_mask ) { 677 | if( latch->pin & 
CLOCK_mask ) 678 | #ifdef unix 679 | __sync_fetch_and_add(&latch->pin, -CLOCK_unit); 680 | #else 681 | _InterlockedExchangeAdd16 (&latch->pin, -CLOCK_unit); 682 | #endif 683 | bt->table[idx].busy[0] = 0; 684 | continue; 685 | } 686 | 687 | // update permanent page area in btree 688 | 689 | page = (BtPage)((uid)slot * bt->page_size + bt->pagepool); 690 | #ifdef unix 691 | posix_fadvise (bt->idx, page_no << bt->page_bits, bt->page_size, POSIX_FADV_WILLNEED); 692 | __sync_fetch_and_add (&bt->latchmgr->cache[page->lvl], -1); 693 | #else 694 | _InterlockedAdd(&bt->latchmgr->cache[page->lvl], -1); 695 | #endif 696 | if( page->dirty ) 697 | if( bt_writepage (bt, page, latch->page_no) ) 698 | return NULL; 699 | 700 | // unlink our available slot from its hash chain 701 | 702 | if( latch->prev ) 703 | bt->latchsets[latch->prev].next = latch->next; 704 | else 705 | bt->table[idx].slot = latch->next; 706 | 707 | if( latch->next ) 708 | bt->latchsets[latch->next].prev = latch->prev; 709 | 710 | bt->table[idx].busy[0] = 0; 711 | 712 | if( bt_latchlink (bt, hashidx, slot, page_no) ) 713 | return NULL; 714 | 715 | bt->table[hashidx].busy[0] = 0; 716 | return latch; 717 | } 718 | } 719 | 720 | // close and release memory 721 | 722 | void bt_close (BtDb *bt) 723 | { 724 | #ifdef unix 725 | munmap (bt->table, bt->latchmgr->nlatchpage * bt->page_size); 726 | munmap (bt->latchmgr, bt->page_size); 727 | #else 728 | FlushViewOfFile(bt->latchmgr, 0); 729 | UnmapViewOfFile(bt->latchmgr); 730 | CloseHandle(bt->halloc); 731 | #endif 732 | #ifdef unix 733 | if( bt->mem ) 734 | free (bt->mem); 735 | close (bt->idx); 736 | free (bt); 737 | #else 738 | if( bt->mem) 739 | VirtualFree (bt->mem, 0, MEM_RELEASE); 740 | FlushFileBuffers(bt->idx); 741 | CloseHandle(bt->idx); 742 | GlobalFree (bt); 743 | #endif 744 | } 745 | // open/create new btree 746 | 747 | // call with file_name, BT_openmode, bits in page size (e.g. 16), 748 | // size of mapped page pool (e.g. 
8192) 749 | 750 | BtDb *bt_open (char *name, uint mode, uint bits, uint nodemax) 751 | { 752 | uint lvl, attr, last, slot, idx; 753 | uint nlatchpage, latchhash; 754 | BtLatchMgr *latchmgr; 755 | off64_t size, off; 756 | uint amt[1]; 757 | BtKey key; 758 | BtDb* bt; 759 | int flag; 760 | 761 | #ifndef unix 762 | OVERLAPPED ovl[1]; 763 | #else 764 | struct flock lock[1]; 765 | #endif 766 | 767 | // determine sanity of page size and buffer pool 768 | 769 | if( bits > BT_maxbits ) 770 | bits = BT_maxbits; 771 | else if( bits < BT_minbits ) 772 | bits = BT_minbits; 773 | 774 | if( mode == BT_ro ) { 775 | fprintf(stderr, "ReadOnly mode not supported: %s\n", name); 776 | return NULL; 777 | } 778 | #ifdef unix 779 | bt = calloc (1, sizeof(BtDb)); 780 | 781 | bt->idx = open ((char*)name, O_RDWR | O_CREAT, 0666); 782 | posix_fadvise( bt->idx, 0, 0, POSIX_FADV_RANDOM); 783 | 784 | if( bt->idx == -1 ) { 785 | fprintf(stderr, "unable to open %s\n", name); 786 | return free(bt), NULL; 787 | } 788 | #else 789 | bt = GlobalAlloc (GMEM_FIXED|GMEM_ZEROINIT, sizeof(BtDb)); 790 | attr = FILE_ATTRIBUTE_NORMAL; 791 | bt->idx = CreateFile(name, GENERIC_READ| GENERIC_WRITE, FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_ALWAYS, attr, NULL); 792 | 793 | if( bt->idx == INVALID_HANDLE_VALUE ) { 794 | fprintf(stderr, "unable to open %s\n", name); 795 | return GlobalFree(bt), NULL; 796 | } 797 | #endif 798 | #ifdef unix 799 | memset (lock, 0, sizeof(lock)); 800 | lock->l_len = sizeof(struct BtPage_); 801 | lock->l_type = F_WRLCK; 802 | 803 | if( fcntl (bt->idx, F_SETLKW, lock) < 0 ) { 804 | fprintf(stderr, "unable to lock record zero %s\n", name); 805 | return bt_close (bt), NULL; 806 | } 807 | #else 808 | memset (ovl, 0, sizeof(ovl)); 809 | 810 | // use large offsets to 811 | // simulate advisory locking 812 | 813 | ovl->OffsetHigh |= 0x80000000; 814 | 815 | if( !LockFileEx (bt->idx, LOCKFILE_EXCLUSIVE_LOCK, 0, sizeof(struct BtPage_), 0L, ovl) ) { 816 | fprintf(stderr, "unable to lock record 
zero %s, GetLastError = %d\n", name, GetLastError()); 817 | return bt_close (bt), NULL; 818 | } 819 | #endif 820 | 821 | #ifdef unix 822 | latchmgr = valloc (BT_maxpage); 823 | *amt = 0; 824 | 825 | // read minimum page size to get root info 826 | 827 | if( size = lseek (bt->idx, 0L, 2) ) { 828 | if( pread(bt->idx, latchmgr, BT_minpage, 0) == BT_minpage ) 829 | bits = latchmgr->alloc->bits; 830 | else { 831 | fprintf(stderr, "Unable to read page zero\n"); 832 | return free(bt), free(latchmgr), NULL; 833 | } 834 | } 835 | #else 836 | latchmgr = VirtualAlloc(NULL, BT_maxpage, MEM_COMMIT, PAGE_READWRITE); 837 | size = GetFileSize(bt->idx, amt); 838 | 839 | if( size || *amt ) { 840 | if( !ReadFile(bt->idx, (char *)latchmgr, BT_minpage, amt, NULL) ) { 841 | fprintf(stderr, "Unable to read page zero\n"); 842 | return bt_close (bt), NULL; 843 | } else 844 | bits = latchmgr->alloc->bits; 845 | } 846 | #endif 847 | 848 | bt->page_size = 1 << bits; 849 | bt->page_bits = bits; 850 | 851 | bt->mode = mode; 852 | 853 | if( size || *amt ) { 854 | nlatchpage = latchmgr->nlatchpage; 855 | goto btlatch; 856 | } 857 | 858 | if( nodemax < 16 ) { 859 | fprintf(stderr, "Buffer pool too small: %d\n", nodemax); 860 | return bt_close(bt), NULL; 861 | } 862 | 863 | // initialize an empty b-tree with latch page, root page, page of leaves 864 | // and page(s) of latches and page pool cache 865 | 866 | memset (latchmgr, 0, 1 << bits); 867 | latchmgr->alloc->bits = bt->page_bits; 868 | 869 | // calculate number of latch hash table entries 870 | 871 | nlatchpage = (nodemax/16 * sizeof(BtHashEntry) + bt->page_size - 1) / bt->page_size; 872 | latchhash = nlatchpage * bt->page_size / sizeof(BtHashEntry); 873 | 874 | nlatchpage += nodemax; // size of the buffer pool in pages 875 | nlatchpage += (sizeof(BtLatchSet) * nodemax + bt->page_size - 1)/bt->page_size; 876 | 877 | bt_putid(latchmgr->alloc->right, MIN_lvl+1+nlatchpage); 878 | latchmgr->nlatchpage = nlatchpage; 879 | latchmgr->latchtotal = 
nodemax; 880 | latchmgr->latchhash = latchhash; 881 | 882 | if( bt_writepage (bt, latchmgr->alloc, 0) ) { 883 | fprintf (stderr, "Unable to create btree page zero\n"); 884 | return bt_close (bt), NULL; 885 | } 886 | 887 | memset (latchmgr, 0, 1 << bits); 888 | latchmgr->alloc->bits = bt->page_bits; 889 | 890 | for( lvl=MIN_lvl; lvl--; ) { 891 | last = MIN_lvl - lvl; // page number 892 | slotptr(latchmgr->alloc, 1)->off = bt->page_size - 3; 893 | bt_putid(slotptr(latchmgr->alloc, 1)->id, lvl ? last + 1 : 0); 894 | key = keyptr(latchmgr->alloc, 1); 895 | key->len = 2; // create stopper key 896 | key->key[0] = 0xff; 897 | key->key[1] = 0xff; 898 | 899 | latchmgr->alloc->min = bt->page_size - 3; 900 | latchmgr->alloc->lvl = lvl; 901 | latchmgr->alloc->cnt = 1; 902 | latchmgr->alloc->act = 1; 903 | 904 | if( bt_writepage (bt, latchmgr->alloc, last) ) { 905 | fprintf (stderr, "Unable to create btree page %.8x\n", last); 906 | return bt_close (bt), NULL; 907 | } 908 | } 909 | 910 | // clear out buffer pool pages 911 | 912 | memset(latchmgr, 0, bt->page_size); 913 | last = MIN_lvl + nlatchpage; 914 | 915 | if( bt_writepage (bt, latchmgr->alloc, last) ) { 916 | fprintf (stderr, "Unable to write buffer pool page %.8x\n", last); 917 | return bt_close (bt), NULL; 918 | } 919 | 920 | #ifdef unix 921 | free (latchmgr); 922 | #else 923 | VirtualFree (latchmgr, 0, MEM_RELEASE); 924 | #endif 925 | 926 | btlatch: 927 | #ifdef unix 928 | lock->l_type = F_UNLCK; 929 | if( fcntl (bt->idx, F_SETLK, lock) < 0 ) { 930 | fprintf (stderr, "Unable to unlock page zero\n"); 931 | return bt_close (bt), NULL; 932 | } 933 | #else 934 | if( !UnlockFileEx (bt->idx, 0, sizeof(struct BtPage_), 0, ovl) ) { 935 | fprintf (stderr, "Unable to unlock page zero, GetLastError = %d\n", GetLastError()); 936 | return bt_close (bt), NULL; 937 | } 938 | #endif 939 | #ifdef unix 940 | flag = PROT_READ | PROT_WRITE; 941 | bt->latchmgr = mmap (0, bt->page_size, flag, MAP_SHARED, bt->idx, ALLOC_page * 
bt->page_size); 942 | if( bt->latchmgr == MAP_FAILED ) { 943 | fprintf (stderr, "Unable to mmap page zero, errno = %d", errno); 944 | return bt_close (bt), NULL; 945 | } 946 | bt->table = (void *)mmap (0, (uid)nlatchpage * bt->page_size, flag, MAP_SHARED, bt->idx, LATCH_page * bt->page_size); 947 | if( bt->table == MAP_FAILED ) { 948 | fprintf (stderr, "Unable to mmap buffer pool, errno = %d", errno); 949 | return bt_close (bt), NULL; 950 | } 951 | madvise (bt->table, (uid)nlatchpage << bt->page_bits, MADV_RANDOM | MADV_WILLNEED); 952 | #else 953 | flag = PAGE_READWRITE; 954 | bt->halloc = CreateFileMapping(bt->idx, NULL, flag, 0, ((uid)nlatchpage + LATCH_page) * bt->page_size, NULL); 955 | if( !bt->halloc ) { 956 | fprintf (stderr, "Unable to create file mapping for buffer pool mgr, GetLastError = %d\n", GetLastError()); 957 | return bt_close (bt), NULL; 958 | } 959 | 960 | flag = FILE_MAP_WRITE; 961 | bt->latchmgr = MapViewOfFile(bt->halloc, flag, 0, 0, ((uid)nlatchpage + LATCH_page) * bt->page_size); 962 | if( !bt->latchmgr ) { 963 | fprintf (stderr, "Unable to map buffer pool, GetLastError = %d\n", GetLastError()); 964 | return bt_close (bt), NULL; 965 | } 966 | 967 | bt->table = (void *)((char *)bt->latchmgr + LATCH_page * bt->page_size); 968 | #endif 969 | bt->pagepool = (unsigned char *)bt->table + (uid)(nlatchpage - bt->latchmgr->latchtotal) * bt->page_size; 970 | bt->latchsets = (BtLatchSet *)(bt->pagepool - (uid)bt->latchmgr->latchtotal * sizeof(BtLatchSet)); 971 | 972 | #ifdef unix 973 | bt->mem = valloc (2 * bt->page_size); 974 | #else 975 | bt->mem = VirtualAlloc(NULL, 2 * bt->page_size, MEM_COMMIT, PAGE_READWRITE); 976 | #endif 977 | bt->frame = (BtPage)bt->mem; 978 | bt->cursor = (BtPage)(bt->mem + bt->page_size); 979 | return bt; 980 | } 981 | 982 | // place write, read, or parent lock on requested page_no. 
983 | 984 | void bt_lockpage(BtLock mode, BtLatchSet *latch) 985 | { 986 | switch( mode ) { 987 | case BtLockRead: 988 | ReadLock (latch->readwr); 989 | break; 990 | case BtLockWrite: 991 | WriteLock (latch->readwr); 992 | break; 993 | case BtLockAccess: 994 | ReadLock (latch->access); 995 | break; 996 | case BtLockDelete: 997 | WriteLock (latch->access); 998 | break; 999 | case BtLockParent: 1000 | WriteLock (latch->parent); 1001 | break; 1002 | } 1003 | } 1004 | 1005 | // remove write, read, or parent lock on requested page 1006 | 1007 | void bt_unlockpage(BtLock mode, BtLatchSet *latch) 1008 | { 1009 | switch( mode ) { 1010 | case BtLockRead: 1011 | ReadRelease (latch->readwr); 1012 | break; 1013 | case BtLockWrite: 1014 | WriteRelease (latch->readwr); 1015 | break; 1016 | case BtLockAccess: 1017 | ReadRelease (latch->access); 1018 | break; 1019 | case BtLockDelete: 1020 | WriteRelease (latch->access); 1021 | break; 1022 | case BtLockParent: 1023 | WriteRelease (latch->parent); 1024 | break; 1025 | } 1026 | } 1027 | 1028 | // allocate a new page and write page into it 1029 | 1030 | uid bt_newpage(BtDb *bt, BtPage page) 1031 | { 1032 | BtLatchSet *latch; 1033 | uid new_page; 1034 | BtPage temp; 1035 | 1036 | // lock allocation page 1037 | 1038 | bt_getmutex(bt->latchmgr->lock); 1039 | 1040 | // use empty chain first 1041 | // else allocate empty page 1042 | 1043 | if( new_page = bt_getid(bt->latchmgr->alloc[1].right) ) { 1044 | if( latch = bt_pinlatch (bt, new_page) ) 1045 | temp = bt_mappage (bt, latch); 1046 | else 1047 | return 0; 1048 | 1049 | bt_putid(bt->latchmgr->alloc[1].right, bt_getid(temp->right)); 1050 | bt_relmutex(bt->latchmgr->lock); 1051 | memcpy (temp, page, bt->page_size); 1052 | 1053 | bt_update (bt, temp); 1054 | bt_unpinlatch (latch); 1055 | return new_page; 1056 | } else { 1057 | new_page = bt_getid(bt->latchmgr->alloc->right); 1058 | bt_putid(bt->latchmgr->alloc->right, new_page+1); 1059 | bt_relmutex(bt->latchmgr->lock); 1060 | 1061 | if( 
bt_writepage (bt, page, new_page) ) 1062 | return 0; 1063 | } 1064 | 1065 | bt_update (bt, bt->latchmgr->alloc); 1066 | return new_page; 1067 | } 1068 | 1069 | // compare two keys, returning > 0, = 0, or < 0 1070 | // as the comparison value 1071 | 1072 | int keycmp (BtKey key1, unsigned char *key2, uint len2) 1073 | { 1074 | uint len1 = key1->len; 1075 | int ans; 1076 | 1077 | if( ans = memcmp (key1->key, key2, len1 > len2 ? len2 : len1) ) 1078 | return ans; 1079 | 1080 | if( len1 > len2 ) 1081 | return 1; 1082 | if( len1 < len2 ) 1083 | return -1; 1084 | 1085 | return 0; 1086 | } 1087 | 1088 | // Update current page of btree by 1089 | // flushing mapped area to disk backing of cache pool. 1090 | // mark page as dirty for rewrite to permanent location 1091 | 1092 | void bt_update (BtDb *bt, BtPage page) 1093 | { 1094 | #ifdef unix 1095 | msync (page, bt->page_size, MS_ASYNC); 1096 | #else 1097 | // FlushViewOfFile (page, bt->page_size); 1098 | #endif 1099 | page->dirty = 1; 1100 | } 1101 | 1102 | // map the btree cached page onto current page 1103 | 1104 | BtPage bt_mappage (BtDb *bt, BtLatchSet *latch) 1105 | { 1106 | return (BtPage)((uid)(latch - bt->latchsets) * bt->page_size + bt->pagepool); 1107 | } 1108 | 1109 | // deallocate a deleted page 1110 | // place on free chain out of allocator page 1111 | // call with page latched for Writing and Deleting 1112 | 1113 | BTERR bt_freepage(BtDb *bt, uid page_no, BtLatchSet *latch) 1114 | { 1115 | BtPage page = bt_mappage (bt, latch); 1116 | 1117 | // lock allocation page 1118 | 1119 | bt_getmutex (bt->latchmgr->lock); 1120 | 1121 | // store chain in second right 1122 | bt_putid(page->right, bt_getid(bt->latchmgr->alloc[1].right)); 1123 | bt_putid(bt->latchmgr->alloc[1].right, page_no); 1124 | 1125 | page->free = 1; 1126 | bt_update(bt, page); 1127 | 1128 | // unlock released page 1129 | 1130 | bt_unlockpage (BtLockDelete, latch); 1131 | bt_unlockpage (BtLockWrite, latch); 1132 | bt_unpinlatch (latch); 1133 | 1134 | // 
	// unlock allocation page

	bt_relmutex (bt->latchmgr->lock);
	bt_update (bt, bt->latchmgr->alloc);
	return 0;
}

// find slot in page for given key at a given level

int bt_findslot (BtDb *bt, unsigned char *key, uint len)
{
	uint diff, higher = bt->page->cnt, low = 1, slot;
	uint good = 0;

	// make stopper key an infinite fence value

	if( bt_getid (bt->page->right) )
		higher++;
	else
		good++;

	// low is the lowest candidate, higher is already
	// tested as .ge. the given key, loop ends when they meet

	while( diff = higher - low ) {
		slot = low + ( diff >> 1 );
		if( keycmp (keyptr(bt->page, slot), key, len) < 0 )
			low = slot + 1;
		else
			higher = slot, good++;
	}

	// return zero if key is on right link page

	return good ? higher : 0;
}

// find and load page at given level for given key
// leave page rd or wr locked as requested

int bt_loadpage (BtDb *bt, unsigned char *key, uint len, uint lvl, uint lock)
{
	uid page_no = ROOT_page, prevpage = 0;
	uint drill = 0xff, slot;
	BtLatchSet *prevlatch;
	uint mode, prevmode;

	// start at root of btree and drill down

	do {
		// determine lock mode of drill level
		mode = (lock == BtLockWrite) && (drill == lvl) ?
			BtLockWrite : BtLockRead;

		if( bt->latch = bt_pinlatch(bt, page_no) )
			bt->page_no = page_no;
		else
			return 0;

		// obtain access lock using lock chaining

		if( page_no > ROOT_page )
			bt_lockpage(BtLockAccess, bt->latch);

		if( prevpage ) {
			bt_unlockpage(prevmode, prevlatch);
			bt_unpinlatch(prevlatch);
			prevpage = 0;
		}

		// obtain read lock using lock chaining

		bt_lockpage(mode, bt->latch);

		if( page_no > ROOT_page )
			bt_unlockpage(BtLockAccess, bt->latch);

		// map/obtain page contents

		bt->page = bt_mappage (bt, bt->latch);

		// re-read and re-lock root after determining actual level of root

		if( bt->page->lvl != drill) {
			if( bt->page_no != ROOT_page )
				return bt->err = BTERR_struct, 0;

			drill = bt->page->lvl;

			if( lock != BtLockRead && drill == lvl ) {
				bt_unlockpage(mode, bt->latch);
				bt_unpinlatch(bt->latch);
				continue;
			}
		}

		prevpage = bt->page_no;
		prevlatch = bt->latch;
		prevmode = mode;

		// find key on page at this level
		// and descend to requested level

		if( !bt->page->kill )
			if( slot = bt_findslot (bt, key, len) ) {
				if( drill == lvl )
					return slot;

				while( slotptr(bt->page, slot)->dead )
					if( slot++ < bt->page->cnt )
						continue;
					else
						goto slideright;

				page_no = bt_getid(slotptr(bt->page, slot)->id);
				drill--;
				continue;
			}

		// or slide right into next page

slideright:
		page_no = bt_getid(bt->page->right);

	} while( page_no );

	// return error on end of right chain

	bt->err = BTERR_eof;
	return 0;	// return error
}

// a fence key was deleted from a page
// push new fence value upwards

BTERR bt_fixfence (BtDb *bt, uid page_no, uint lvl)
{
	unsigned char leftkey[256], rightkey[256];
	BtLatchSet *latch = bt->latch;
	BtKey ptr;

	// remove deleted key, the old fence value

	ptr = keyptr(bt->page, bt->page->cnt);
	memcpy(rightkey, ptr, ptr->len + 1);

	memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot));
	bt->page->clean = 1;

	ptr = keyptr(bt->page, bt->page->cnt);
	memcpy(leftkey, ptr, ptr->len + 1);

	bt_update (bt, bt->page);
	bt_lockpage (BtLockParent, latch);
	bt_unlockpage (BtLockWrite, latch);

	// insert new (now smaller) fence key

	if( bt_insertkey (bt, leftkey+1, *leftkey, lvl + 1, page_no, time(NULL)) )
		return bt->err;

	// remove old (larger) fence key

	if( bt_deletekey (bt, rightkey+1, *rightkey, lvl + 1) )
		return bt->err;

	bt_unlockpage (BtLockParent, latch);
	bt_unpinlatch (latch);
	return 0;
}

// root has a single child
// collapse a level from the btree
// call with root locked in bt->page

BTERR bt_collapseroot (BtDb *bt, BtPage root)
{
	BtLatchSet *latch;
	BtPage temp;
	uid child;
	uint idx;

	// find the child entry
	// and promote to new root

	do {
		for( idx = 0; idx++ < root->cnt; )
			if( !slotptr(root, idx)->dead )
				break;

		child = bt_getid (slotptr(root, idx)->id);
		if( latch = bt_pinlatch (bt, child) )
			temp = bt_mappage (bt, latch);
		else
			return bt->err;

		bt_lockpage (BtLockDelete, latch);
		bt_lockpage (BtLockWrite, latch);
		memcpy (root, temp, bt->page_size);

		bt_update (bt, root);

		if( bt_freepage (bt, child, latch) )
			return bt->err;

	} while( root->lvl > 1 && root->act == 1 );

	bt_unlockpage (BtLockWrite, bt->latch);
	bt_unpinlatch (bt->latch);
	return 0;
}

// find and delete key on page by marking delete flag bit
// when page becomes empty, delete it

BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl)
{
	unsigned char lowerkey[256], higherkey[256];
	uint slot, dirty = 0, idx, fence, found;
	BtLatchSet *latch, *rlatch;
	uid page_no, right;
	BtPage temp;
	BtKey ptr;

	if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) )
		ptr = keyptr(bt->page, slot);
	else
		return bt->err;

	// are we deleting a fence slot?

	fence = slot == bt->page->cnt;

	// if key is found delete it, otherwise ignore request

	if( found = !keycmp (ptr, key, len) )
		if( found = slotptr(bt->page, slot)->dead == 0 ) {
			dirty = slotptr(bt->page,slot)->dead = 1;
			bt->page->clean = 1;
			bt->page->act--;

			// collapse empty slots

			while( idx = bt->page->cnt - 1 )
				if( slotptr(bt->page, idx)->dead ) {
					*slotptr(bt->page, idx) = *slotptr(bt->page, idx + 1);
					memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot));
				} else
					break;
		}

	right = bt_getid(bt->page->right);
	page_no = bt->page_no;
	latch = bt->latch;

	if( !dirty ) {
		if( lvl )
			return bt_abort (bt, bt->page, page_no, BTERR_notfound);
		bt_unlockpage(BtLockWrite, latch);
		bt_unpinlatch (latch);
		return bt->found = found, 0;
	}

	// did we delete a fence key in an upper level?

	if( lvl && bt->page->act && fence )
		if( bt_fixfence (bt, page_no, lvl) )
			return bt->err;
		else
			return bt->found = found, 0;

	// is this a collapsed root?
	if( lvl > 1 && page_no == ROOT_page && bt->page->act == 1 )
		if( bt_collapseroot (bt, bt->page) )
			return bt->err;
		else
			return bt->found = found, 0;

	// return if page is not empty

	if( bt->page->act ) {
		bt_update(bt, bt->page);
		bt_unlockpage(BtLockWrite, latch);
		bt_unpinlatch (latch);
		return bt->found = found, 0;
	}

	// cache copy of fence key
	// in order to find parent

	ptr = keyptr(bt->page, bt->page->cnt);
	memcpy(lowerkey, ptr, ptr->len + 1);

	// obtain lock on right page

	if( rlatch = bt_pinlatch (bt, right) )
		temp = bt_mappage (bt, rlatch);
	else
		return bt->err;

	bt_lockpage(BtLockWrite, rlatch);

	if( temp->kill ) {
		bt_abort(bt, temp, right, 0);
		return bt_abort(bt, bt->page, bt->page_no, BTERR_kill);
	}

	// pull contents of next page into current empty page

	memcpy (bt->page, temp, bt->page_size);

	// cache copy of key to update

	ptr = keyptr(temp, temp->cnt);
	memcpy(higherkey, ptr, ptr->len + 1);

	// Mark right page as deleted and point it to left page
	// until we can post updates at higher level.

	bt_putid(temp->right, page_no);
	temp->kill = 1;

	bt_update(bt, bt->page);
	bt_update(bt, temp);

	bt_lockpage(BtLockParent, latch);
	bt_unlockpage(BtLockWrite, latch);

	bt_lockpage(BtLockParent, rlatch);
	bt_unlockpage(BtLockWrite, rlatch);

	// redirect higher key directly to consolidated node

	if( bt_insertkey (bt, higherkey+1, *higherkey, lvl+1, page_no, time(NULL)) )
		return bt->err;

	// delete old lower key to consolidated node

	if( bt_deletekey (bt, lowerkey + 1, *lowerkey, lvl + 1) )
		return bt->err;

	// obtain write & delete lock on deleted node
	// add right block to free chain

	bt_lockpage(BtLockDelete, rlatch);
	bt_lockpage(BtLockWrite, rlatch);
	bt_unlockpage(BtLockParent, rlatch);

	if( bt_freepage (bt, right, rlatch) )
		return bt->err;

	bt_unlockpage(BtLockParent, latch);
	bt_unpinlatch(latch);
	return 0;
}

// find key in leaf level and return row-id

uid bt_findkey (BtDb *bt, unsigned char *key, uint len)
{
	uint slot;
	BtKey ptr;
	uid id;

	if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) )
		ptr = keyptr(bt->page, slot);
	else
		return 0;

	// if key exists, return row-id
	// otherwise return 0

	if( ptr->len == len && !memcmp (ptr->key, key, len) )
		id = bt_getid(slotptr(bt->page,slot)->id);
	else
		id = 0;

	bt_unlockpage (BtLockRead, bt->latch);
	bt_unpinlatch (bt->latch);
	return id;
}

// check page for space available,
// clean if necessary and return
// 0 - page needs splitting
// >0 - go ahead with new slot

uint bt_cleanpage(BtDb *bt, uint amt, uint slot)
{
	uint nxt = bt->page_size;
	BtPage page = bt->page;
	uint cnt = 0, idx = 0;
	uint max = page->cnt;
	uint newslot = slot;
	BtKey key;
	int ret;

	if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 )
		return slot;

	// skip cleanup if nothing to reclaim

	if( !page->clean )
		return 0;

	memcpy (bt->frame, page, bt->page_size);

	// skip page info and set rest of page to zero

	memset (page+1, 0, bt->page_size - sizeof(*page));
	page->act = 0;

	while( cnt++ < max ) {
		if( cnt == slot )
			newslot = idx + 1;
		// always leave fence key in list
		if( cnt < max && slotptr(bt->frame,cnt)->dead )
			continue;

		// copy key
		key = keyptr(bt->frame, cnt);
		nxt -= key->len + 1;
		memcpy ((unsigned char *)page + nxt, key, key->len + 1);

		// copy slot
		memcpy(slotptr(page, ++idx)->id, slotptr(bt->frame, cnt)->id, BtId);
		if( !(slotptr(page, idx)->dead = slotptr(bt->frame, cnt)->dead) )
			page->act++;
#ifdef USETOD
		slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod;
#endif
		slotptr(page, idx)->off = nxt;
	}

	page->min = nxt;
	page->cnt = idx;

	if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 )
		return newslot;

	return 0;
}

// split the root and raise the height of the btree

BTERR bt_splitroot(BtDb *bt, unsigned char *leftkey, uid page_no2)
{
	uint nxt = bt->page_size;
	BtPage root = bt->page;
	uid right;

	// Obtain an empty page to use, and copy the current
	// root contents into it

	if( !(right = bt_newpage(bt, root)) )
		return bt->err;

	// preserve the page info at the bottom
	// and set rest to zero

	memset(root+1, 0, bt->page_size -
		sizeof(*root));

	// insert first key on newroot page

	nxt -= *leftkey + 1;
	memcpy ((unsigned char *)root + nxt, leftkey, *leftkey + 1);
	bt_putid(slotptr(root, 1)->id, right);
	slotptr(root, 1)->off = nxt;

	// insert second key on newroot page
	// and increase the root height

	nxt -= 3;
	((unsigned char *)root)[nxt] = 2;
	((unsigned char *)root)[nxt+1] = 0xff;
	((unsigned char *)root)[nxt+2] = 0xff;
	bt_putid(slotptr(root, 2)->id, page_no2);
	slotptr(root, 2)->off = nxt;

	bt_putid(root->right, 0);
	root->min = nxt;	// reset lowest used offset and key count
	root->cnt = 2;
	root->act = 2;
	root->lvl++;

	// update and release root (bt->page)

	bt_update(bt, root);

	bt_unlockpage(BtLockWrite, bt->latch);
	bt_unpinlatch(bt->latch);
	return 0;
}

// split already locked full node
// return unlocked.

BTERR bt_splitpage (BtDb *bt)
{
	uint cnt = 0, idx = 0, max, nxt = bt->page_size;
	unsigned char fencekey[256], rightkey[256];
	uid page_no = bt->page_no, right;
	BtLatchSet *latch, *rlatch;
	BtPage page = bt->page;
	uint lvl = page->lvl;
	BtKey key;

	latch = bt->latch;

	// split higher half of keys to bt->frame
	// the last key (fence key) might be dead

	memset (bt->frame, 0, bt->page_size);
	max = page->cnt;
	cnt = max / 2;
	idx = 0;

	while( cnt++ < max ) {
		key = keyptr(page, cnt);
		nxt -= key->len + 1;
		memcpy ((unsigned char *)bt->frame + nxt, key, key->len + 1);
		memcpy(slotptr(bt->frame,++idx)->id, slotptr(page,cnt)->id, BtId);
		if( !(slotptr(bt->frame, idx)->dead = slotptr(page, cnt)->dead) )
			bt->frame->act++;
#ifdef USETOD
		slotptr(bt->frame, idx)->tod = slotptr(page, cnt)->tod;
#endif
		slotptr(bt->frame, idx)->off = nxt;
	}

	// remember fence key for new right page

	memcpy (rightkey, key, key->len + 1);

	bt->frame->bits = bt->page_bits;
	bt->frame->min = nxt;
	bt->frame->cnt = idx;
	bt->frame->lvl = lvl;

	// link right node

	if( page_no > ROOT_page )
		memcpy (bt->frame->right, page->right, BtId);

	// get new free page and write frame to it.

	if( !(right = bt_newpage(bt, bt->frame)) )
		return bt->err;

	// update lower keys to continue in old page

	memcpy (bt->frame, page, bt->page_size);
	memset (page+1, 0, bt->page_size - sizeof(*page));
	nxt = bt->page_size;
	page->clean = 0;
	page->act = 0;
	cnt = 0;
	idx = 0;

	// assemble page of smaller keys
	// (they're all active keys)

	while( cnt++ < max / 2 ) {
		key = keyptr(bt->frame, cnt);
		nxt -= key->len + 1;
		memcpy ((unsigned char *)page + nxt, key, key->len + 1);
		memcpy(slotptr(page,++idx)->id, slotptr(bt->frame,cnt)->id, BtId);
#ifdef USETOD
		slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod;
#endif
		slotptr(page, idx)->off = nxt;
		page->act++;
	}

	// remember fence key for smaller page

	memcpy (fencekey, key, key->len + 1);

	bt_putid(page->right, right);
	page->min = nxt;
	page->cnt = idx;

	// if current page is the root page, split it

	if( page_no == ROOT_page )
		return bt_splitroot (bt, fencekey, right);

	// lock right page

	if( rlatch = bt_pinlatch (bt, right) )
		bt_lockpage (BtLockParent, rlatch);
	else
		return bt->err;

	// update left (containing) node

	bt_update(bt, page);

	bt_lockpage (BtLockParent, latch);
	bt_unlockpage (BtLockWrite, latch);

	// insert new fence for reformulated left block

	if( bt_insertkey (bt, fencekey+1, *fencekey, lvl+1, page_no, time(NULL)) )
		return bt->err;

	// switch fence for right block of larger keys to new right page

	if( bt_insertkey (bt, rightkey+1, *rightkey, lvl+1, right, time(NULL)) )
		return bt->err;

	bt_unlockpage (BtLockParent, latch);
	bt_unlockpage (BtLockParent, rlatch);

	bt_unpinlatch (rlatch);
	bt_unpinlatch (latch);
	return 0;
}

// Insert new key into the btree at requested level.
// Pages are unlocked at exit.

BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod)
{
	uint slot, idx;
	BtPage page;
	BtKey ptr;

	while( 1 ) {
		if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) )
			ptr = keyptr(bt->page, slot);
		else
		{
			if( !bt->err )
				bt->err = BTERR_ovflw;
			return bt->err;
		}

		// if key already exists, update id and return

		page = bt->page;

		if( !keycmp (ptr, key, len) ) {
			if( slotptr(page, slot)->dead )
				page->act++;
			slotptr(page, slot)->dead = 0;
#ifdef USETOD
			slotptr(page, slot)->tod = tod;
#endif
			bt_putid(slotptr(page,slot)->id, id);
			bt_update(bt, bt->page);
			bt_unlockpage(BtLockWrite, bt->latch);
			bt_unpinlatch (bt->latch);
			return 0;
		}

		// check if page has enough space

		if( slot = bt_cleanpage (bt, len, slot) )
			break;

		if( bt_splitpage (bt) )
			return bt->err;
	}

	// calculate next available slot and copy key into page

	page->min -= len + 1;	// reset lowest used offset
	((unsigned char *)page)[page->min] = len;
	memcpy ((unsigned char *)page + page->min +1, key, len );

	for( idx = slot; idx < page->cnt; idx++ )
		if( slotptr(page, idx)->dead )
			break;

	// now insert key into array before slot
	// preserving the fence slot

	if( idx == page->cnt )
		idx++, page->cnt++;

	page->act++;

	while( idx > slot )
		*slotptr(page, idx) = *slotptr(page, idx -1), idx--;

	bt_putid(slotptr(page,slot)->id, id);
	slotptr(page, slot)->off =
		page->min;
#ifdef USETOD
	slotptr(page, slot)->tod = tod;
#endif
	slotptr(page, slot)->dead = 0;

	bt_update(bt, bt->page);

	bt_unlockpage(BtLockWrite, bt->latch);
	bt_unpinlatch(bt->latch);
	return 0;
}

// cache page of keys into cursor and return starting slot for given key

uint bt_startkey (BtDb *bt, unsigned char *key, uint len)
{
	uint slot;

	// cache page for retrieval

	if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) )
		memcpy (bt->cursor, bt->page, bt->page_size);
	else
		return 0;

	bt_unlockpage(BtLockRead, bt->latch);
	bt->cursor_page = bt->page_no;
	bt_unpinlatch (bt->latch);
	return slot;
}

// return next slot for cursor page
// or slide cursor right into next page

uint bt_nextkey (BtDb *bt, uint slot)
{
	BtLatchSet *latch;
	off64_t right;

	do {
		right = bt_getid(bt->cursor->right);

		while( slot++ < bt->cursor->cnt )
			if( slotptr(bt->cursor,slot)->dead )
				continue;
			else if( right || (slot < bt->cursor->cnt))
				return slot;
			else
				break;

		if( !right )
			break;

		bt->cursor_page = right;

		if( latch = bt_pinlatch (bt, right) )
			bt_lockpage(BtLockRead, latch);
		else
			return 0;

		bt->page = bt_mappage (bt, latch);
		memcpy (bt->cursor, bt->page, bt->page_size);
		bt_unlockpage(BtLockRead, latch);
		bt_unpinlatch (latch);
		slot = 0;
	} while( 1 );

	return bt->err = 0;
}

BtKey bt_key(BtDb *bt, uint slot)
{
	return keyptr(bt->cursor, slot);
}

uid bt_uid(BtDb *bt, uint slot)
{
	return bt_getid(slotptr(bt->cursor,slot)->id);
}

#ifdef USETOD
uint
bt_tod(BtDb *bt, uint slot)
{
	return slotptr(bt->cursor,slot)->tod;
}
#endif

#ifdef STANDALONE

uint bt_audit (BtDb *bt)
{
	uint idx, hashidx;
	uid next, page_no;
	BtLatchSet *latch;
	uint blks[64];
	uint cnt = 0;
	BtPage page;
	uint amt[1];
	BtKey ptr;

#ifdef unix
	posix_fadvise( bt->idx, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
	if( *(ushort *)(bt->latchmgr->lock) )
		fprintf(stderr, "Alloc page locked\n");
	*(ushort *)(bt->latchmgr->lock) = 0;

	memset (blks, 0, sizeof(blks));

	for( idx = 1; idx <= bt->latchmgr->latchdeployed; idx++ ) {
		latch = bt->latchsets + idx;
		if( *(ushort *)latch->readwr )
			fprintf(stderr, "latchset %d rwlocked for page %.8x\n", idx, latch->page_no);
		*(ushort *)latch->readwr = 0;

		if( *(ushort *)latch->access )
			fprintf(stderr, "latchset %d accesslocked for page %.8x\n", idx, latch->page_no);
		*(ushort *)latch->access = 0;

		if( *(ushort *)latch->parent )
			fprintf(stderr, "latchset %d parentlocked for page %.8x\n", idx, latch->page_no);
		*(ushort *)latch->parent = 0;

		if( latch->pin & PIN_mask ) {
			fprintf(stderr, "latchset %d pinned for page %.8x\n", idx, latch->page_no);
			latch->pin = 0;
		}
		page = (BtPage)((uid)idx * bt->page_size + bt->pagepool);
		blks[page->lvl]++;

		if( page->dirty )
			if( bt_writepage (bt, page, latch->page_no) )
				fprintf(stderr, "Page %.8x Write Error\n", latch->page_no);
	}

	for( idx = 0; blks[idx]; idx++ )
		fprintf(stderr, "cache: %d lvl %d blocks\n", blks[idx], idx);

	for( hashidx = 0; hashidx < bt->latchmgr->latchhash; hashidx++ ) {
		if( bt->table[hashidx].busy[0] )
			fprintf(stderr, "hash entry %d locked\n", hashidx);

		bt->table[hashidx].busy[0] = 0;
	}

	memset (blks, 0, sizeof(blks));

	next = bt->latchmgr->nlatchpage + LATCH_page;
	page_no = LEAF_page;

	while( page_no < bt_getid(bt->latchmgr->alloc->right) ) {
		if( bt_readpage (bt, bt->frame, page_no) )
			fprintf(stderr, "page %.8x unreadable\n", page_no);
		if( !bt->frame->free ) {
			for( idx = 0; idx++ < bt->frame->cnt - 1; ) {
				ptr = keyptr(bt->frame, idx+1);
				if( keycmp (keyptr(bt->frame, idx), ptr->key, ptr->len) >= 0 )
					fprintf(stderr, "page %.8x idx %.2x out of order\n", page_no, idx);
			}
			if( !bt->frame->lvl )
				cnt += bt->frame->act;
			blks[bt->frame->lvl]++;
		}

		if( page_no > LEAF_page )
			next = page_no + 1;
		page_no = next;
	}

	for( idx = 0; blks[idx]; idx++ )
		fprintf(stderr, "btree: %d lvl %d blocks\n", blks[idx], idx);

	return cnt - 1;
}

#ifndef unix
double getCpuTime(int type)
{
	FILETIME crtime[1];
	FILETIME xittime[1];
	FILETIME systime[1];
	FILETIME usrtime[1];
	SYSTEMTIME timeconv[1];
	double ans = 0;

	memset (timeconv, 0, sizeof(SYSTEMTIME));

	switch( type ) {
	case 0:
		GetSystemTimeAsFileTime (xittime);
		FileTimeToSystemTime (xittime, timeconv);
		ans = (double)timeconv->wDayOfWeek * 3600 * 24;
		break;
	case 1:
		GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime);
		FileTimeToSystemTime (usrtime, timeconv);
		break;
	case 2:
		GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime);
		FileTimeToSystemTime (systime, timeconv);
		break;
	}

	ans += (double)timeconv->wHour * 3600;
	ans += (double)timeconv->wMinute * 60;
	ans += (double)timeconv->wSecond;
	ans += (double)timeconv->wMilliseconds /
		1000;
	return ans;
}
#else
#include <sys/time.h>		// gettimeofday
#include <sys/resource.h>	// getrusage

double getCpuTime(int type)
{
	struct rusage used[1];
	struct timeval tv[1];

	switch( type ) {
	case 0:
		gettimeofday(tv, NULL);
		return (double)tv->tv_sec + (double)tv->tv_usec / 1000000;

	case 1:
		getrusage(RUSAGE_SELF, used);
		return (double)used->ru_utime.tv_sec + (double)used->ru_utime.tv_usec / 1000000;

	case 2:
		getrusage(RUSAGE_SELF, used);
		return (double)used->ru_stime.tv_sec + (double)used->ru_stime.tv_usec / 1000000;
	}

	return 0;
}
#endif

// standalone program to index file of keys
// then list them onto std-out

int main (int argc, char **argv)
{
	uint slot, line = 0, off = 0, found = 0;
	int ch, cnt = 0, bits = 12, idx;
	unsigned char key[256];
	double done, start;
	uid next, page_no;
	BtLatchSet *latch;
	float elapsed;
	time_t tod[1];
	uint scan = 0;
	uint len = 0;
	uint map = 0;
	BtPage page;
	BtKey ptr;
	BtDb *bt;
	FILE *in;

#ifdef WIN32
	_setmode (1, _O_BINARY);
#endif
	if( argc < 4 ) {
		fprintf (stderr, "Usage: %s idx_file src_file Read/Write/Scan/Delete/Find/Count [page_bits mapped_pool_pages start_line_number]\n", argv[0]);
		fprintf (stderr, " page_bits: size of btree page in bits\n");
		fprintf (stderr, " mapped_pool_pages: number of pages in buffer pool\n");
		exit(0);
	}

	start = getCpuTime(0);
	time(tod);

	if( argc > 4 )
		bits = atoi(argv[4]);

	if( argc > 5 )
		map = atoi(argv[5]);

	if( argc > 6 )
		off = atoi(argv[6]);

	bt = bt_open ((argv[1]), BT_rw, bits, map);

	if( !bt ) {
		fprintf(stderr, "Index Open Error %s\n",
			argv[1]);
		exit (1);
	}

	switch(argv[3][0]| 0x20)
	{
	case 'p':	// display page
		if( latch = bt_pinlatch (bt, off) )
			page = bt_mappage (bt, latch);
		else {
			// page was never mapped, so there is nothing to write
			fprintf(stderr, "unable to read page %.8x\n", off);
			break;
		}

		write (1, page, bt->page_size);
		break;

	case 'a':	// buffer pool audit
		fprintf(stderr, "started audit for %s\n", argv[1]);
		cnt = bt_audit (bt);
		fprintf(stderr, "finished audit for %s, %d keys\n", argv[1], cnt);
		break;

	case 'w':	// write keys
		fprintf(stderr, "started indexing for %s\n", argv[2]);
		if( argc > 2 && (in = fopen (argv[2], "rb")) ) {
#ifdef unix
			posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE);
#endif
			while( ch = getc(in), ch != EOF )
				if( ch == '\n' )
				{
					if( off )
						sprintf((char *)key+len, "%.9d", line + off), len += 9;

					if( bt_insertkey (bt, key, len, 0, ++line, *tod) )
						fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0);
					len = 0;
				}
				else if( len < 245 )
					key[len++] = ch;
		}
		fprintf(stderr, "finished adding keys for %s, %d \n", argv[2], line);
		break;

	case 'd':	// delete keys
		fprintf(stderr, "started deleting keys for %s\n", argv[2]);
		if( argc > 2 && (in = fopen (argv[2], "rb")) ) {
#ifdef unix
			posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE);
#endif
			while( ch = getc(in), ch != EOF )
				if( ch == '\n' )
				{
					if( off )
						sprintf((char *)key+len, "%.9d", line + off), len += 9;
					line++;
					if( bt_deletekey (bt, key, len, 0) )
						fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0);
					len = 0;
				}
				else if( len < 245 )
					key[len++] = ch;
		}
		fprintf(stderr, "finished deleting keys for %s, %d \n", argv[2], line);
		break;

	case 'f':	// find keys
		fprintf(stderr, "started finding keys for %s\n", argv[2]);
		if( argc > 2 && (in = fopen (argv[2], "rb")) ) {
#ifdef unix
			posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE);
#endif
			while( ch = getc(in), ch != EOF )
				if( ch == '\n' )
				{
					if( off )
						sprintf((char *)key+len, "%.9d", line + off), len += 9;
					line++;
					if( bt_findkey (bt, key, len) )
						found++;
					else if( bt->err )
						fprintf(stderr, "Error %d Syserr %d Line: %d\n", bt->err, errno, line), exit(0);
					len = 0;
				}
				else if( len < 245 )
					key[len++] = ch;
		}
		fprintf(stderr, "finished search of %d keys for %s, found %d\n", line, argv[2], found);
		break;

	case 's':	// scan and print keys
		fprintf(stderr, "started scanning\n");
		cnt = len = key[0] = 0;

		if( slot = bt_startkey (bt, key, len) )
			slot--;
		else
			fprintf(stderr, "Error %d in StartKey. Syserror: %d\n", bt->err, errno), exit(0);

		while( slot = bt_nextkey (bt, slot) ) {
			ptr = bt_key(bt, slot);
			fwrite (ptr->key, ptr->len, 1, stdout);
			fputc ('\n', stdout);
			cnt++;
		}

		fprintf(stderr, " Total keys read %d\n", cnt - 1);
		break;

	case 'c':	// count keys
		fprintf(stderr, "started counting\n");
		cnt = 0;

		next = bt->latchmgr->nlatchpage + LATCH_page;
		page_no = LEAF_page;

		while( page_no < bt_getid(bt->latchmgr->alloc->right) ) {
			if( latch = bt_pinlatch (bt, page_no) )
				page = bt_mappage (bt, latch);
			else
				break;	// stop counting if the page cannot be pinned
			if( !page->free && !page->lvl )
				cnt += page->act;
			if( page_no > LEAF_page )
				next = page_no + 1;
			if( scan )
				for( idx = 0; idx++ < page->cnt; ) {
					if( slotptr(page, idx)->dead )
						continue;
					ptr = keyptr(page, idx);
					if( idx != page->cnt && bt_getid (page->right) ) {
						fwrite (ptr->key, ptr->len, 1, stdout);
						fputc ('\n', stdout);
					}
				}
			bt_unpinlatch (latch);
			page_no = next;
		}

		cnt--;	// remove stopper key
		fprintf(stderr, " Total keys read %d\n", cnt);
		break;
	}

	done = getCpuTime(0);
	elapsed = (float)(done - start);
	fprintf(stderr, " real %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	elapsed = getCpuTime(1);
	fprintf(stderr, " user %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	elapsed = getCpuTime(2);
	fprintf(stderr, " sys %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	return 0;
}

#endif	//STANDALONE
--------------------------------------------------------------------------------
/btree2u.c:
--------------------------------------------------------------------------------
// btree version 2u sched_yield locks
// with combined latch & pool
manager
// and phase-fair reader writer lock
// 12 MAR 2014

// author: karl malbrain, malbrain@cal.berkeley.edu

/*
This work, including the source code, documentation
and related data, is placed into the public domain.

The original author is Karl Malbrain.

THIS SOFTWARE IS PROVIDED AS-IS WITHOUT WARRANTY
OF ANY KIND, NOT EVEN THE IMPLIED WARRANTY OF
MERCHANTABILITY. THE AUTHOR OF THIS SOFTWARE,
ASSUMES _NO_ RESPONSIBILITY FOR ANY CONSEQUENCE
RESULTING FROM THE USE, MODIFICATION, OR
REDISTRIBUTION OF THIS SOFTWARE.
*/

// Please see the project home page for documentation
// code.google.com/p/high-concurrency-btree

#define _FILE_OFFSET_BITS 64
#define _LARGEFILE64_SOURCE

#ifdef linux
#define _GNU_SOURCE
#endif

#ifdef unix
#include
#include
#include
#include
#include
#include
#include
#else
#define WIN32_LEAN_AND_MEAN
#include
#include
#include
#include
#include
#include
#endif

#include
#include

typedef unsigned long long uid;

#ifndef unix
typedef unsigned long long off64_t;
typedef unsigned short ushort;
typedef unsigned int uint;
#endif

#define BT_ro 0x6f72 // ro
#define BT_rw 0x7772 // rw
#define BT_fl 0x6c66 // fl

#define BT_maxbits 15 // maximum page size in bits
#define BT_minbits 12 // minimum page size in bits
#define BT_minpage (1 << BT_minbits) // minimum page size
#define BT_maxpage (1 << BT_maxbits) // maximum page size

// BTree page number constants
#define ALLOC_page 0
#define ROOT_page 1
#define LEAF_page 2
#define LATCH_page 3

// Number of levels to create in a new BTree

#define MIN_lvl 2
#define MAX_lvl 15

/*
There are five lock types for each node in three
independent sets:
1. (set 1) AccessIntent: Sharable. Going to Read the node. Incompatible with NodeDelete.
2. (set 1) NodeDelete: Exclusive. About to release the node. Incompatible with AccessIntent.
3. (set 2) ReadLock: Sharable. Read the node. Incompatible with WriteLock.
4. (set 2) WriteLock: Exclusive. Modify the node. Incompatible with ReadLock and other WriteLocks.
5. (set 3) ParentModification: Exclusive. Change the node's parent keys. Incompatible with another ParentModification.
*/

typedef enum{
    BtLockAccess,
    BtLockDelete,
    BtLockRead,
    BtLockWrite,
    BtLockParent
}BtLock;

// definition for latch implementation

volatile typedef struct {
    ushort lock[1];
} BtSpinLatch;

#define XCL 1
#define PEND 2
#define BOTH 3
#define SHARE 4

volatile typedef struct {
    ushort rin[1];      // readers in count
    ushort rout[1];     // readers out count
    ushort serving[1];  // writers out count
    ushort ticket[1];   // writers in count
} RWLock;

// define bits at bottom of rin

#define PHID 0x1 // writer phase (0/1)
#define PRES 0x2 // writer present
#define MASK 0x3 // both write bits
#define RINC 0x4 // reader increment

// Define the length of the page and key pointers

#define BtId 6

// Page key slot definition.

// If BT_maxbits is 15 or less, you can save 2 bytes
// for each key stored by making the first two uints
// into ushorts. You can also save 4 bytes by removing
// the tod field from the key.

// Keys are marked dead, but remain on the page until
// cleanup is called. The fence key (highest key) for
// the page is always present, even if dead.

typedef struct {
#ifdef USETOD
    uint tod;                   // time-stamp for key
#endif
    ushort off:BT_maxbits;      // page offset for key start
    ushort dead:1;              // set for deleted key
    unsigned char id[BtId];     // id associated with key
} BtSlot;

// The key structure occupies space at the upper end of
// each page. It's a length byte followed by the value
// bytes.

typedef struct {
    unsigned char len;
    unsigned char key[0];
} *BtKey;

// The first part of an index page.
// It is immediately followed
// by the BtSlot array of keys.

typedef struct BtPage_ {
    uint cnt;                   // count of keys in page
    uint act;                   // count of active keys
    uint min;                   // next key offset
    unsigned char bits:6;       // page size in bits
    unsigned char free:1;       // page is on free list
    unsigned char dirty:1;      // page is dirty in cache
    unsigned char lvl:6;        // level of page
    unsigned char kill:1;       // page is being deleted
    unsigned char clean:1;      // page needs cleaning
    unsigned char right[BtId];  // page number to right
} *BtPage;

typedef struct {
    struct BtPage_ alloc[2];        // next & free page_nos in right ptr
    BtSpinLatch lock[1];            // allocation area lite latch
    volatile uint latchdeployed;    // highest number of latch entries deployed
    volatile uint nlatchpage;       // number of latch pages at BT_latch
    volatile uint latchtotal;       // number of page latch entries
    volatile uint latchhash;        // number of latch hash table slots
    volatile uint latchvictim;      // next latch hash entry to examine
    volatile uint safelevel;        // safe page level in cache
    volatile uint cache[MAX_lvl];   // cache census counts by btree level
} BtLatchMgr;

// latch hash table entries

typedef struct {
    volatile uint slot;     // Latch table entry at head of collision chain
    BtSpinLatch latch[1];   // lock for the collision chain
} BtHashEntry;

// latch manager table structure

typedef struct {
    volatile uid page_no;   // latch set page number on disk
    RWLock readwr[1];       // read/write page lock
    RWLock access[1];       // Access Intent/Page delete
    RWLock parent[1];       // Posting of fence key in parent
    volatile ushort pin;    // number of pins/level/clock bits
    volatile uint next;     // next entry in hash table chain
    volatile uint prev;     // prev entry in hash table chain
} BtLatchSet;

#define CLOCK_mask 0xe000
#define CLOCK_unit 0x2000
#define PIN_mask 0x07ff
#define LVL_mask 0x1800
#define LVL_shift 11

// The object structure for Btree access

typedef struct _BtDb {
    uint page_size;         // each page size
    uint page_bits;         // each page size in bits
    uid page_no;            // current page number
    uid cursor_page;        // current cursor page number
    int err;
    uint mode;              // read-write mode
    BtPage cursor;          // cached frame for start/next (never mapped)
    BtPage frame;           // spare frame for the page split (never mapped)
    BtPage page;            // current mapped page in buffer pool
    BtLatchSet *latch;      // current page latch
    BtLatchMgr *latchmgr;   // mapped latch page from allocation page
    BtLatchSet *latchsets;  // mapped latch set from latch pages
    unsigned char *pagepool;    // cached page pool set
    BtHashEntry *table;     // the hash table
#ifdef unix
    int idx;
#else
    HANDLE idx;
    HANDLE halloc;          // allocation and latch table handle
#endif
    unsigned char *mem;     // frame, cursor, memory buffers
    uint found;             // last deletekey found key
} BtDb;

typedef enum {
    BTERR_ok = 0,
    BTERR_notfound,
    BTERR_struct,
    BTERR_ovflw,
    BTERR_read,
    BTERR_lock,
    BTERR_hash,
    BTERR_kill,
    BTERR_map,
    BTERR_wrt,
    BTERR_eof
} BTERR;
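
// Illustrative sketch, not part of the original source: judging from the
// CLOCK_mask, LVL_mask and PIN_mask values above, the 16-bit BtLatchSet.pin
// word appears to be packed as follows. The helper names bt_pin_count,
// bt_pin_level and bt_pin_clock are hypothetical and are not used anywhere
// else in this file; the literal constants repeat the mask values so the
// helpers can be read (and compiled) on their own.
//
//   bits 15..13  clock value  (CLOCK_mask 0xe000, one step is CLOCK_unit 0x2000)
//   bits 12..11  page level   (LVL_mask 0x1800, shifted down by LVL_shift 11)
//   bits 10..0   pin count    (PIN_mask 0x07ff)

static unsigned bt_pin_count (unsigned short pin) { return pin & 0x07ff; }          // PIN_mask
static unsigned bt_pin_level (unsigned short pin) { return (pin & 0x1800) >> 11; }  // LVL_mask, LVL_shift
static unsigned bt_pin_clock (unsigned short pin) { return (pin & 0xe000) >> 13; }  // CLOCK_mask

// Under this reading, a pin word of 0x2803 would decode as clock 1, level 1
// and three outstanding pins.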

// B-Tree functions
extern void bt_close (BtDb *bt);
extern BtDb *bt_open (char *name, uint mode, uint bits, uint cacheblk);
extern BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod);
extern BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl);
extern uid bt_findkey (BtDb *bt, unsigned char *key, uint len);
extern uint bt_startkey (BtDb *bt, unsigned char *key, uint len);
extern uint bt_nextkey (BtDb *bt, uint slot);

// internal functions
void bt_update (BtDb *bt, BtPage page);
BtPage bt_mappage (BtDb *bt, BtLatchSet *latch);

// Helper functions to return slot values

extern BtKey bt_key (BtDb *bt, uint slot);
extern uid bt_uid (BtDb *bt, uint slot);
#ifdef USETOD
extern uint bt_tod (BtDb *bt, uint slot);
#endif

// The page is allocated from low and hi ends.
// The key offsets and row-id's are allocated
// from the bottom, while the text of the key
// is allocated from the top. When the two
// areas meet, the page is split into two.

// A key consists of a length byte, two bytes of
// index number (0 - 65534), and up to 253 bytes
// of key value. Duplicate keys are discarded.
// Associated with each key is a 48 bit row-id.

// The b-tree root is always located at page 1.
// The first leaf page of level zero is always
// located on page 2.

// The b-tree pages are linked with right
// pointers to facilitate enumerators,
// and provide for concurrency.

// When the root page fills, it is split in two and
// the tree height is raised by a new root at page
// one with two keys.

// Deleted keys are marked with a dead bit until
// page cleanup. The fence key for a node is always
// present, even after deletion and cleanup.

// Deleted leaf pages are reclaimed on a free list.
// The upper levels of the btree are fixed on creation.

// To achieve maximum concurrency one page is locked at a time
// as the tree is traversed to find the leaf key in question. The right
// page numbers are used in cases where the page is being split,
// or consolidated.

// Page 0 (ALLOC page) is dedicated to the lock for new page extensions,
// and chains empty leaf pages together for reuse.

// Parent locks are obtained to prevent resplitting or deleting a node
// before its fence is posted into its upper level.

// A special open mode of BT_fl is provided to safely access files on
// WIN32 networks. WIN32 network operations should not use memory mapping.
// This WIN32 mode sets FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH
// to prevent local caching of network file contents.

// Access macros to address slot and key values from the page.
// Page slots use 1 based indexing.

// macro arguments are parenthesized so that expression arguments expand safely
#define slotptr(page, slot) (((BtSlot *)((page)+1)) + ((slot)-1))
#define keyptr(page, slot) ((BtKey)((unsigned char*)(page) + slotptr(page, slot)->off))

// store a page/row id as BtId big-endian bytes

void bt_putid(unsigned char *dest, uid id)
{
    int i = BtId;

    while( i-- )
        dest[i] = (unsigned char)id, id >>= 8;
}

// read a BtId byte big-endian page/row id

uid bt_getid(unsigned char *src)
{
    uid id = 0;
    int i;

    for( i = 0; i < BtId; i++ )
        id <<= 8, id |= *src++;

    return id;
}

// report an error on a page and record it in bt->err
// (uid is unsigned long long, so page numbers print with %llx)

BTERR bt_abort (BtDb *bt, BtPage page, uid page_no, BTERR err)
{
    BtKey ptr;

    fprintf(stderr, "\n Btree2 abort, error %d on page %.8llx\n", err, page_no);
    fprintf(stderr, "level=%d kill=%d free=%d cnt=%x act=%x\n", page->lvl, page->kill, page->free, page->cnt, page->act);
    ptr = keyptr(page, page->cnt);
    fprintf(stderr, "fence='%.*s'\n", ptr->len, ptr->key);
    fprintf(stderr, "right=%.8llx\n", bt_getid(page->right));
    return bt->err = err;
}

// Phase-Fair reader/writer lock implementation

void WriteLock (RWLock *lock)
{
    ushort w, r, tix;

#ifdef unix
    tix = __sync_fetch_and_add (lock->ticket, 1);
#else
    tix = _InterlockedExchangeAdd16 (lock->ticket, 1);
#endif
    // wait for our ticket to come up

    while( tix != lock->serving[0] )
#ifdef unix
        sched_yield();
#else
        SwitchToThread ();
#endif

    w = PRES | (tix & PHID);
#ifdef unix
    r = __sync_fetch_and_add (lock->rin, w);
#else
    r = _InterlockedExchangeAdd16 (lock->rin, w);
#endif
    // wait for the readers that entered before us to leave

    while( r != *lock->rout )
#ifdef unix
        sched_yield();
#else
        SwitchToThread();
#endif
}

void WriteRelease (RWLock *lock)
{
#ifdef unix
    __sync_fetch_and_and (lock->rin, ~MASK);
#else
    _InterlockedAnd16 (lock->rin, ~MASK);
#endif
lock->serving[0]++; 396 | } 397 | 398 | void ReadLock (RWLock *lock) 399 | { 400 | ushort w; 401 | #ifdef unix 402 | w = __sync_fetch_and_add (lock->rin, RINC) & MASK; 403 | #else 404 | w = _InterlockedExchangeAdd16 (lock->rin, RINC) & MASK; 405 | #endif 406 | if( w ) 407 | while( w == (*lock->rin & MASK) ) 408 | #ifdef unix 409 | sched_yield (); 410 | #else 411 | SwitchToThread (); 412 | #endif 413 | } 414 | 415 | void ReadRelease (RWLock *lock) 416 | { 417 | #ifdef unix 418 | __sync_fetch_and_add (lock->rout, RINC); 419 | #else 420 | _InterlockedExchangeAdd16 (lock->rout, RINC); 421 | #endif 422 | } 423 | 424 | // Spin Latch Manager 425 | 426 | // wait until write lock mode is clear 427 | // and add 1 to the share count 428 | 429 | void bt_spinreadlock(BtSpinLatch *latch) 430 | { 431 | ushort prev; 432 | 433 | do { 434 | #ifdef unix 435 | prev = __sync_fetch_and_add (latch->lock, SHARE); 436 | #else 437 | prev = _InterlockedExchangeAdd16(latch->lock, SHARE); 438 | #endif 439 | // see if exclusive request is granted or pending 440 | 441 | if( !(prev & BOTH) ) 442 | return; 443 | #ifdef unix 444 | prev = __sync_fetch_and_add (latch->lock, -SHARE); 445 | #else 446 | prev = _InterlockedExchangeAdd16(latch->lock, -SHARE); 447 | #endif 448 | #ifdef unix 449 | } while( sched_yield(), 1 ); 450 | #else 451 | } while( SwitchToThread(), 1 ); 452 | #endif 453 | } 454 | 455 | // wait for other read and write latches to relinquish 456 | 457 | void bt_spinwritelock(BtSpinLatch *latch) 458 | { 459 | ushort prev; 460 | 461 | do { 462 | #ifdef unix 463 | prev = __sync_fetch_and_or(latch->lock, PEND | XCL); 464 | #else 465 | prev = _InterlockedOr16(latch->lock, PEND | XCL); 466 | #endif 467 | if( !(prev & XCL) ) 468 | if( !(prev & ~BOTH) ) 469 | return; 470 | else 471 | #ifdef unix 472 | __sync_fetch_and_and (latch->lock, ~XCL); 473 | #else 474 | _InterlockedAnd16(latch->lock, ~XCL); 475 | #endif 476 | #ifdef unix 477 | } while( sched_yield(), 1 ); 478 | #else 479 | } while( 
SwitchToThread(), 1 ); 480 | #endif 481 | } 482 | 483 | // try to obtain write lock 484 | 485 | // return 1 if obtained, 486 | // 0 otherwise 487 | 488 | int bt_spinwritetry(BtSpinLatch *latch) 489 | { 490 | ushort prev; 491 | 492 | #ifdef unix 493 | prev = __sync_fetch_and_or(latch->lock, XCL); 494 | #else 495 | prev = _InterlockedOr16(latch->lock, XCL); 496 | #endif 497 | // take write access if all bits are clear 498 | 499 | if( !(prev & XCL) ) 500 | if( !(prev & ~BOTH) ) 501 | return 1; 502 | else 503 | #ifdef unix 504 | __sync_fetch_and_and (latch->lock, ~XCL); 505 | #else 506 | _InterlockedAnd16(latch->lock, ~XCL); 507 | #endif 508 | return 0; 509 | } 510 | 511 | // clear write mode 512 | 513 | void bt_spinreleasewrite(BtSpinLatch *latch) 514 | { 515 | #ifdef unix 516 | __sync_fetch_and_and(latch->lock, ~BOTH); 517 | #else 518 | _InterlockedAnd16(latch->lock, ~BOTH); 519 | #endif 520 | } 521 | 522 | // decrement reader count 523 | 524 | void bt_spinreleaseread(BtSpinLatch *latch) 525 | { 526 | #ifdef unix 527 | __sync_fetch_and_add(latch->lock, -SHARE); 528 | #else 529 | _InterlockedExchangeAdd16(latch->lock, -SHARE); 530 | #endif 531 | } 532 | 533 | // read page from permanent location in Btree file 534 | 535 | BTERR bt_readpage (BtDb *bt, BtPage page, uid page_no) 536 | { 537 | off64_t off = page_no << bt->page_bits; 538 | 539 | #ifdef unix 540 | if( pread (bt->idx, page, bt->page_size, page_no << bt->page_bits) < bt->page_size ) { 541 | fprintf (stderr, "Unable to read page %.8x errno = %d\n", page_no, errno); 542 | return bt->err = BTERR_read; 543 | } 544 | #else 545 | OVERLAPPED ovl[1]; 546 | uint amt[1]; 547 | 548 | memset (ovl, 0, sizeof(OVERLAPPED)); 549 | ovl->Offset = off; 550 | ovl->OffsetHigh = off >> 32; 551 | 552 | if( !ReadFile(bt->idx, page, bt->page_size, amt, ovl)) { 553 | fprintf (stderr, "Unable to read page %.8x GetLastError = %d\n", page_no, GetLastError()); 554 | return bt->err = BTERR_read; 555 | } 556 | if( *amt < bt->page_size ) { 
557 | fprintf (stderr, "Unable to read page %.8x GetLastError = %d\n", page_no, GetLastError()); 558 | return bt->err = BTERR_read; 559 | } 560 | #endif 561 | return 0; 562 | } 563 | 564 | // write page to permanent location in Btree file 565 | // clear the dirty bit 566 | 567 | BTERR bt_writepage (BtDb *bt, BtPage page, uid page_no) 568 | { 569 | off64_t off = page_no << bt->page_bits; 570 | 571 | #ifdef unix 572 | page->dirty = 0; 573 | 574 | if( pwrite(bt->idx, page, bt->page_size, off) < bt->page_size ) 575 | return bt->err = BTERR_wrt; 576 | #else 577 | OVERLAPPED ovl[1]; 578 | uint amt[1]; 579 | 580 | memset (ovl, 0, sizeof(OVERLAPPED)); 581 | ovl->Offset = off; 582 | ovl->OffsetHigh = off >> 32; 583 | page->dirty = 0; 584 | 585 | if( !WriteFile(bt->idx, page, bt->page_size, amt, ovl) ) 586 | return bt->err = BTERR_wrt; 587 | 588 | if( *amt < bt->page_size ) 589 | return bt->err = BTERR_wrt; 590 | #endif 591 | return 0; 592 | } 593 | 594 | // link latch table entry into head of latch hash table 595 | 596 | BTERR bt_latchlink (BtDb *bt, uint hashidx, uint slot, uid page_no) 597 | { 598 | BtPage page = (BtPage)((uid)slot * bt->page_size + bt->pagepool); 599 | BtLatchSet *latch = bt->latchsets + slot; 600 | int lvl; 601 | 602 | if( latch->next = bt->table[hashidx].slot ) 603 | bt->latchsets[latch->next].prev = slot; 604 | 605 | bt->table[hashidx].slot = slot; 606 | latch->page_no = page_no; 607 | latch->prev = 0; 608 | latch->pin = 1; 609 | 610 | if( bt_readpage (bt, page, page_no) ) 611 | return bt->err; 612 | 613 | lvl = page->lvl << LVL_shift; 614 | if( lvl > LVL_mask ) 615 | lvl = LVL_mask; 616 | latch->pin |= lvl; // store lvl 617 | latch->pin |= lvl << 3; // initialize clock 618 | 619 | #ifdef unix 620 | __sync_fetch_and_add (&bt->latchmgr->cache[page->lvl], 1); 621 | #else 622 | _InterlockedExchangeAdd(&bt->latchmgr->cache[page->lvl], 1); 623 | #endif 624 | return bt->err = 0; 625 | } 626 | 627 | // release latch pin 628 | 629 | void bt_unpinlatch 
(BtLatchSet *latch) 630 | { 631 | #ifdef unix 632 | __sync_fetch_and_add(&latch->pin, -1); 633 | #else 634 | _InterlockedDecrement16 (&latch->pin); 635 | #endif 636 | } 637 | 638 | // find existing latchset or inspire new one 639 | // return with latchset pinned 640 | 641 | BtLatchSet *bt_pinlatch (BtDb *bt, uid page_no) 642 | { 643 | uint hashidx = page_no % bt->latchmgr->latchhash; 644 | BtLatchSet *latch; 645 | uint slot, idx; 646 | uint lvl, cnt; 647 | off64_t off; 648 | uint amt[1]; 649 | BtPage page; 650 | 651 | // try to find our entry 652 | 653 | bt_spinwritelock(bt->table[hashidx].latch); 654 | 655 | if( slot = bt->table[hashidx].slot ) do 656 | { 657 | latch = bt->latchsets + slot; 658 | if( page_no == latch->page_no ) 659 | break; 660 | } while( slot = latch->next ); 661 | 662 | // found our entry 663 | // increment clock 664 | 665 | if( slot ) { 666 | latch = bt->latchsets + slot; 667 | lvl = (latch->pin & LVL_mask) >> LVL_shift; 668 | lvl *= CLOCK_unit * 2; 669 | lvl |= CLOCK_unit; 670 | #ifdef unix 671 | __sync_fetch_and_add(&latch->pin, 1); 672 | __sync_fetch_and_or(&latch->pin, lvl); 673 | #else 674 | _InterlockedIncrement16 (&latch->pin); 675 | _InterlockedOr16 (&latch->pin, lvl); 676 | #endif 677 | bt_spinreleasewrite(bt->table[hashidx].latch); 678 | return latch; 679 | } 680 | 681 | // see if there are any unused pool entries 682 | #ifdef unix 683 | slot = __sync_fetch_and_add (&bt->latchmgr->latchdeployed, 1) + 1; 684 | #else 685 | slot = _InterlockedIncrement (&bt->latchmgr->latchdeployed); 686 | #endif 687 | 688 | if( slot < bt->latchmgr->latchtotal ) { 689 | latch = bt->latchsets + slot; 690 | if( bt_latchlink (bt, hashidx, slot, page_no) ) 691 | return NULL; 692 | bt_spinreleasewrite (bt->table[hashidx].latch); 693 | return latch; 694 | } 695 | 696 | #ifdef unix 697 | __sync_fetch_and_add (&bt->latchmgr->latchdeployed, -1); 698 | #else 699 | _InterlockedDecrement (&bt->latchmgr->latchdeployed); 700 | #endif 701 | // find and reuse previous 
entry on victim 702 | 703 | while( 1 ) { 704 | #ifdef unix 705 | slot = __sync_fetch_and_add(&bt->latchmgr->latchvictim, 1); 706 | #else 707 | slot = _InterlockedIncrement (&bt->latchmgr->latchvictim) - 1; 708 | #endif 709 | // try to get write lock on hash chain 710 | // skip entry if not obtained 711 | // or has outstanding pins 712 | 713 | slot %= bt->latchmgr->latchtotal; 714 | 715 | // on slot wraparound, check census 716 | // count and increment safe level 717 | 718 | cnt = bt->latchmgr->cache[bt->latchmgr->safelevel]; 719 | 720 | if( !slot ) { 721 | if( cnt < bt->latchmgr->latchtotal / 10 ) 722 | #ifdef unix 723 | __sync_fetch_and_add(&bt->latchmgr->safelevel, 1); 724 | #else 725 | _InterlockedIncrement (&bt->latchmgr->safelevel); 726 | #endif 727 | continue; 728 | } 729 | 730 | latch = bt->latchsets + slot; 731 | idx = latch->page_no % bt->latchmgr->latchhash; 732 | lvl = (latch->pin & LVL_mask) >> LVL_shift; 733 | 734 | // see if we are evicting this level yet 735 | // or if we are on same chain as hashidx 736 | 737 | if( idx == hashidx || lvl > bt->latchmgr->safelevel ) 738 | continue; 739 | 740 | if( !bt_spinwritetry (bt->table[idx].latch) ) 741 | continue; 742 | 743 | if( latch->pin & ~LVL_mask ) { 744 | if( latch->pin & CLOCK_mask ) 745 | #ifdef unix 746 | __sync_fetch_and_add(&latch->pin, -CLOCK_unit); 747 | #else 748 | _InterlockedExchangeAdd16 (&latch->pin, -CLOCK_unit); 749 | #endif 750 | bt_spinreleasewrite (bt->table[idx].latch); 751 | continue; 752 | } 753 | 754 | // update permanent page area in btree 755 | 756 | page = (BtPage)((uid)slot * bt->page_size + bt->pagepool); 757 | #ifdef unix 758 | posix_fadvise (bt->idx, page_no << bt->page_bits, bt->page_size, POSIX_FADV_WILLNEED); 759 | __sync_fetch_and_add (&bt->latchmgr->cache[page->lvl], -1); 760 | #else 761 | _InterlockedExchangeAdd(&bt->latchmgr->cache[page->lvl], -1); 762 | #endif 763 | if( page->dirty ) 764 | if( bt_writepage (bt, page, latch->page_no) ) 765 | return NULL; 766 | 767 | // 
unlink our available slot from its hash chain 768 | 769 | if( latch->prev ) 770 | bt->latchsets[latch->prev].next = latch->next; 771 | else 772 | bt->table[idx].slot = latch->next; 773 | 774 | if( latch->next ) 775 | bt->latchsets[latch->next].prev = latch->prev; 776 | 777 | bt_spinreleasewrite (bt->table[idx].latch); 778 | 779 | if( bt_latchlink (bt, hashidx, slot, page_no) ) 780 | return NULL; 781 | 782 | bt_spinreleasewrite (bt->table[hashidx].latch); 783 | return latch; 784 | } 785 | } 786 | 787 | // close and release memory 788 | 789 | void bt_close (BtDb *bt) 790 | { 791 | #ifdef unix 792 | munmap (bt->table, bt->latchmgr->nlatchpage * bt->page_size); 793 | munmap (bt->latchmgr, bt->page_size); 794 | #else 795 | FlushViewOfFile(bt->latchmgr, 0); 796 | UnmapViewOfFile(bt->latchmgr); 797 | CloseHandle(bt->halloc); 798 | #endif 799 | #ifdef unix 800 | if( bt->mem ) 801 | free (bt->mem); 802 | close (bt->idx); 803 | free (bt); 804 | #else 805 | if( bt->mem) 806 | VirtualFree (bt->mem, 0, MEM_RELEASE); 807 | FlushFileBuffers(bt->idx); 808 | CloseHandle(bt->idx); 809 | GlobalFree (bt); 810 | #endif 811 | } 812 | // open/create new btree 813 | 814 | // call with file_name, BT_openmode, bits in page size (e.g. 16), 815 | // size of mapped page pool (e.g. 
8192) 816 | 817 | BtDb *bt_open (char *name, uint mode, uint bits, uint nodemax) 818 | { 819 | uint lvl, attr, last, slot, idx; 820 | uint nlatchpage, latchhash; 821 | BtLatchMgr *latchmgr; 822 | off64_t size, off; 823 | uint amt[1]; 824 | BtKey key; 825 | BtDb* bt; 826 | int flag; 827 | 828 | #ifndef unix 829 | OVERLAPPED ovl[1]; 830 | #else 831 | struct flock lock[1]; 832 | #endif 833 | 834 | // determine sanity of page size and buffer pool 835 | 836 | if( bits > BT_maxbits ) 837 | bits = BT_maxbits; 838 | else if( bits < BT_minbits ) 839 | bits = BT_minbits; 840 | 841 | if( mode == BT_ro ) { 842 | fprintf(stderr, "ReadOnly mode not supported: %s\n", name); 843 | return NULL; 844 | } 845 | #ifdef unix 846 | bt = calloc (1, sizeof(BtDb)); 847 | 848 | bt->idx = open ((char*)name, O_RDWR | O_CREAT, 0666); 849 | posix_fadvise( bt->idx, 0, 0, POSIX_FADV_RANDOM); 850 | 851 | if( bt->idx == -1 ) { 852 | fprintf(stderr, "unable to open %s\n", name); 853 | return free(bt), NULL; 854 | } 855 | #else 856 | bt = GlobalAlloc (GMEM_FIXED|GMEM_ZEROINIT, sizeof(BtDb)); 857 | attr = FILE_ATTRIBUTE_NORMAL; 858 | bt->idx = CreateFile(name, GENERIC_READ| GENERIC_WRITE, FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_ALWAYS, attr, NULL); 859 | 860 | if( bt->idx == INVALID_HANDLE_VALUE ) { 861 | fprintf(stderr, "unable to open %s\n", name); 862 | return GlobalFree(bt), NULL; 863 | } 864 | #endif 865 | #ifdef unix 866 | memset (lock, 0, sizeof(lock)); 867 | lock->l_len = sizeof(struct BtPage_); 868 | lock->l_type = F_WRLCK; 869 | 870 | if( fcntl (bt->idx, F_SETLKW, lock) < 0 ) { 871 | fprintf(stderr, "unable to lock record zero %s\n", name); 872 | return bt_close (bt), NULL; 873 | } 874 | #else 875 | memset (ovl, 0, sizeof(ovl)); 876 | 877 | // use large offsets to 878 | // simulate advisory locking 879 | 880 | ovl->OffsetHigh |= 0x80000000; 881 | 882 | if( !LockFileEx (bt->idx, LOCKFILE_EXCLUSIVE_LOCK, 0, sizeof(struct BtPage_), 0L, ovl) ) { 883 | fprintf(stderr, "unable to lock record 
zero %s, GetLastError = %d\n", name, GetLastError()); 884 | return bt_close (bt), NULL; 885 | } 886 | #endif 887 | 888 | #ifdef unix 889 | latchmgr = valloc (BT_maxpage); 890 | *amt = 0; 891 | 892 | // read minimum page size to get root info 893 | 894 | if( size = lseek (bt->idx, 0L, 2) ) { 895 | if( pread(bt->idx, latchmgr, BT_minpage, 0) == BT_minpage ) 896 | bits = latchmgr->alloc->bits; 897 | else { 898 | fprintf(stderr, "Unable to read page zero\n"); 899 | return free(bt), free(latchmgr), NULL; 900 | } 901 | } 902 | #else 903 | latchmgr = VirtualAlloc(NULL, BT_maxpage, MEM_COMMIT, PAGE_READWRITE); 904 | size = GetFileSize(bt->idx, amt); 905 | 906 | if( size || *amt ) { 907 | if( !ReadFile(bt->idx, (char *)latchmgr, BT_minpage, amt, NULL) ) { 908 | fprintf(stderr, "Unable to read page zero\n"); 909 | return bt_close (bt), NULL; 910 | } else 911 | bits = latchmgr->alloc->bits; 912 | } 913 | #endif 914 | 915 | bt->page_size = 1 << bits; 916 | bt->page_bits = bits; 917 | 918 | bt->mode = mode; 919 | 920 | if( size || *amt ) { 921 | nlatchpage = latchmgr->nlatchpage; 922 | goto btlatch; 923 | } 924 | 925 | if( nodemax < 16 ) { 926 | fprintf(stderr, "Buffer pool too small: %d\n", nodemax); 927 | return bt_close(bt), NULL; 928 | } 929 | 930 | // initialize an empty b-tree with latch page, root page, page of leaves 931 | // and page(s) of latches and page pool cache 932 | 933 | memset (latchmgr, 0, 1 << bits); 934 | latchmgr->alloc->bits = bt->page_bits; 935 | 936 | // calculate number of latch hash table entries 937 | 938 | nlatchpage = (nodemax/16 * sizeof(BtHashEntry) + bt->page_size - 1) / bt->page_size; 939 | latchhash = nlatchpage * bt->page_size / sizeof(BtHashEntry); 940 | 941 | nlatchpage += nodemax; // size of the buffer pool in pages 942 | nlatchpage += (sizeof(BtLatchSet) * nodemax + bt->page_size - 1)/bt->page_size; 943 | 944 | bt_putid(latchmgr->alloc->right, MIN_lvl+1+nlatchpage); 945 | latchmgr->nlatchpage = nlatchpage; 946 | latchmgr->latchtotal = 
nodemax; 947 | latchmgr->latchhash = latchhash; 948 | 949 | if( bt_writepage (bt, latchmgr->alloc, 0) ) { 950 | fprintf (stderr, "Unable to create btree page zero\n"); 951 | return bt_close (bt), NULL; 952 | } 953 | 954 | memset (latchmgr, 0, 1 << bits); 955 | latchmgr->alloc->bits = bt->page_bits; 956 | 957 | for( lvl=MIN_lvl; lvl--; ) { 958 | last = MIN_lvl - lvl; // page number 959 | slotptr(latchmgr->alloc, 1)->off = bt->page_size - 3; 960 | bt_putid(slotptr(latchmgr->alloc, 1)->id, lvl ? last + 1 : 0); 961 | key = keyptr(latchmgr->alloc, 1); 962 | key->len = 2; // create stopper key 963 | key->key[0] = 0xff; 964 | key->key[1] = 0xff; 965 | 966 | latchmgr->alloc->min = bt->page_size - 3; 967 | latchmgr->alloc->lvl = lvl; 968 | latchmgr->alloc->cnt = 1; 969 | latchmgr->alloc->act = 1; 970 | 971 | if( bt_writepage (bt, latchmgr->alloc, last) ) { 972 | fprintf (stderr, "Unable to create btree page %.8x\n", last); 973 | return bt_close (bt), NULL; 974 | } 975 | } 976 | 977 | // clear out buffer pool pages 978 | 979 | memset(latchmgr, 0, bt->page_size); 980 | last = MIN_lvl + nlatchpage; 981 | 982 | if( bt_writepage (bt, latchmgr->alloc, last) ) { 983 | fprintf (stderr, "Unable to write buffer pool page %.8x\n", last); 984 | return bt_close (bt), NULL; 985 | } 986 | 987 | #ifdef unix 988 | free (latchmgr); 989 | #else 990 | VirtualFree (latchmgr, 0, MEM_RELEASE); 991 | #endif 992 | 993 | btlatch: 994 | #ifdef unix 995 | lock->l_type = F_UNLCK; 996 | if( fcntl (bt->idx, F_SETLK, lock) < 0 ) { 997 | fprintf (stderr, "Unable to unlock page zero\n"); 998 | return bt_close (bt), NULL; 999 | } 1000 | #else 1001 | if( !UnlockFileEx (bt->idx, 0, sizeof(struct BtPage_), 0, ovl) ) { 1002 | fprintf (stderr, "Unable to unlock page zero, GetLastError = %d\n", GetLastError()); 1003 | return bt_close (bt), NULL; 1004 | } 1005 | #endif 1006 | #ifdef unix 1007 | flag = PROT_READ | PROT_WRITE; 1008 | bt->latchmgr = mmap (0, bt->page_size, flag, MAP_SHARED, bt->idx, ALLOC_page * 
bt->page_size); 1009 | if( bt->latchmgr == MAP_FAILED ) { 1010 | fprintf (stderr, "Unable to mmap page zero, errno = %d", errno); 1011 | return bt_close (bt), NULL; 1012 | } 1013 | bt->table = (void *)mmap (0, (uid)nlatchpage * bt->page_size, flag, MAP_SHARED, bt->idx, LATCH_page * bt->page_size); 1014 | if( bt->table == MAP_FAILED ) { 1015 | fprintf (stderr, "Unable to mmap buffer pool, errno = %d", errno); 1016 | return bt_close (bt), NULL; 1017 | } 1018 | madvise (bt->table, (uid)nlatchpage << bt->page_bits, MADV_RANDOM | MADV_WILLNEED); 1019 | #else 1020 | flag = PAGE_READWRITE; 1021 | bt->halloc = CreateFileMapping(bt->idx, NULL, flag, 0, ((uid)nlatchpage + LATCH_page) * bt->page_size, NULL); 1022 | if( !bt->halloc ) { 1023 | fprintf (stderr, "Unable to create file mapping for buffer pool mgr, GetLastError = %d\n", GetLastError()); 1024 | return bt_close (bt), NULL; 1025 | } 1026 | 1027 | flag = FILE_MAP_WRITE; 1028 | bt->latchmgr = MapViewOfFile(bt->halloc, flag, 0, 0, ((uid)nlatchpage + LATCH_page) * bt->page_size); 1029 | if( !bt->latchmgr ) { 1030 | fprintf (stderr, "Unable to map buffer pool, GetLastError = %d\n", GetLastError()); 1031 | return bt_close (bt), NULL; 1032 | } 1033 | 1034 | bt->table = (void *)((char *)bt->latchmgr + LATCH_page * bt->page_size); 1035 | #endif 1036 | bt->pagepool = (unsigned char *)bt->table + (uid)(nlatchpage - bt->latchmgr->latchtotal) * bt->page_size; 1037 | bt->latchsets = (BtLatchSet *)(bt->pagepool - (uid)bt->latchmgr->latchtotal * sizeof(BtLatchSet)); 1038 | 1039 | #ifdef unix 1040 | bt->mem = valloc (2 * bt->page_size); 1041 | #else 1042 | bt->mem = VirtualAlloc(NULL, 2 * bt->page_size, MEM_COMMIT, PAGE_READWRITE); 1043 | #endif 1044 | bt->frame = (BtPage)bt->mem; 1045 | bt->cursor = (BtPage)(bt->mem + bt->page_size); 1046 | return bt; 1047 | } 1048 | 1049 | // place write, read, or parent lock on requested page_no. 
1050 | 1051 | void bt_lockpage(BtLock mode, BtLatchSet *latch) 1052 | { 1053 | switch( mode ) { 1054 | case BtLockRead: 1055 | ReadLock (latch->readwr); 1056 | break; 1057 | case BtLockWrite: 1058 | WriteLock (latch->readwr); 1059 | break; 1060 | case BtLockAccess: 1061 | ReadLock (latch->access); 1062 | break; 1063 | case BtLockDelete: 1064 | WriteLock (latch->access); 1065 | break; 1066 | case BtLockParent: 1067 | WriteLock (latch->parent); 1068 | break; 1069 | } 1070 | } 1071 | 1072 | // remove write, read, or parent lock on requested page 1073 | 1074 | void bt_unlockpage(BtLock mode, BtLatchSet *latch) 1075 | { 1076 | switch( mode ) { 1077 | case BtLockRead: 1078 | ReadRelease (latch->readwr); 1079 | break; 1080 | case BtLockWrite: 1081 | WriteRelease (latch->readwr); 1082 | break; 1083 | case BtLockAccess: 1084 | ReadRelease (latch->access); 1085 | break; 1086 | case BtLockDelete: 1087 | WriteRelease (latch->access); 1088 | break; 1089 | case BtLockParent: 1090 | WriteRelease (latch->parent); 1091 | break; 1092 | } 1093 | } 1094 | 1095 | // allocate a new page and write page into it 1096 | 1097 | uid bt_newpage(BtDb *bt, BtPage page) 1098 | { 1099 | BtLatchSet *latch; 1100 | uid new_page; 1101 | BtPage temp; 1102 | 1103 | // lock allocation page 1104 | 1105 | bt_spinwritelock(bt->latchmgr->lock); 1106 | 1107 | // use empty chain first 1108 | // else allocate empty page 1109 | 1110 | if( new_page = bt_getid(bt->latchmgr->alloc[1].right) ) { 1111 | if( latch = bt_pinlatch (bt, new_page) ) 1112 | temp = bt_mappage (bt, latch); 1113 | else 1114 | return 0; 1115 | 1116 | bt_putid(bt->latchmgr->alloc[1].right, bt_getid(temp->right)); 1117 | bt_spinreleasewrite(bt->latchmgr->lock); 1118 | memcpy (temp, page, bt->page_size); 1119 | 1120 | bt_update (bt, temp); 1121 | bt_unpinlatch (latch); 1122 | return new_page; 1123 | } else { 1124 | new_page = bt_getid(bt->latchmgr->alloc->right); 1125 | bt_putid(bt->latchmgr->alloc->right, new_page+1); 1126 | 
bt_spinreleasewrite(bt->latchmgr->lock); 1127 | 1128 | if( bt_writepage (bt, page, new_page) ) 1129 | return 0; 1130 | } 1131 | 1132 | bt_update (bt, bt->latchmgr->alloc); 1133 | return new_page; 1134 | } 1135 | 1136 | // compare two keys, returning > 0, = 0, or < 0 1137 | // as the comparison value 1138 | 1139 | int keycmp (BtKey key1, unsigned char *key2, uint len2) 1140 | { 1141 | uint len1 = key1->len; 1142 | int ans; 1143 | 1144 | if( ans = memcmp (key1->key, key2, len1 > len2 ? len2 : len1) ) 1145 | return ans; 1146 | 1147 | if( len1 > len2 ) 1148 | return 1; 1149 | if( len1 < len2 ) 1150 | return -1; 1151 | 1152 | return 0; 1153 | } 1154 | 1155 | // Update current page of btree by 1156 | // flushing mapped area to disk backing of cache pool. 1157 | // mark page as dirty for rewrite to permanent location 1158 | 1159 | void bt_update (BtDb *bt, BtPage page) 1160 | { 1161 | #ifdef unix 1162 | msync (page, bt->page_size, MS_ASYNC); 1163 | #else 1164 | // FlushViewOfFile (page, bt->page_size); 1165 | #endif 1166 | page->dirty = 1; 1167 | } 1168 | 1169 | // map the btree cached page onto current page 1170 | 1171 | BtPage bt_mappage (BtDb *bt, BtLatchSet *latch) 1172 | { 1173 | return (BtPage)((uid)(latch - bt->latchsets) * bt->page_size + bt->pagepool); 1174 | } 1175 | 1176 | // deallocate a deleted page 1177 | // place on free chain out of allocator page 1178 | // call with page latched for Writing and Deleting 1179 | 1180 | BTERR bt_freepage(BtDb *bt, uid page_no, BtLatchSet *latch) 1181 | { 1182 | BtPage page = bt_mappage (bt, latch); 1183 | 1184 | // lock allocation page 1185 | 1186 | bt_spinwritelock (bt->latchmgr->lock); 1187 | 1188 | // store chain in second right 1189 | bt_putid(page->right, bt_getid(bt->latchmgr->alloc[1].right)); 1190 | bt_putid(bt->latchmgr->alloc[1].right, page_no); 1191 | 1192 | page->free = 1; 1193 | bt_update(bt, page); 1194 | 1195 | // unlock released page 1196 | 1197 | bt_unlockpage (BtLockDelete, latch); 1198 | bt_unlockpage 
(BtLockWrite, latch); 1199 | bt_unpinlatch (latch); 1200 | 1201 | // unlock allocation page 1202 | 1203 | bt_spinreleasewrite (bt->latchmgr->lock); 1204 | bt_update (bt, bt->latchmgr->alloc); 1205 | return 0; 1206 | } 1207 | 1208 | // find slot in page for given key at a given level 1209 | 1210 | int bt_findslot (BtDb *bt, unsigned char *key, uint len) 1211 | { 1212 | uint diff, higher = bt->page->cnt, low = 1, slot; 1213 | uint good = 0; 1214 | 1215 | // make stopper key an infinite fence value 1216 | 1217 | if( bt_getid (bt->page->right) ) 1218 | higher++; 1219 | else 1220 | good++; 1221 | 1222 | // low is the lowest candidate, higher is already 1223 | // tested as .ge. the given key, loop ends when they meet 1224 | 1225 | while( diff = higher - low ) { 1226 | slot = low + ( diff >> 1 ); 1227 | if( keycmp (keyptr(bt->page, slot), key, len) < 0 ) 1228 | low = slot + 1; 1229 | else 1230 | higher = slot, good++; 1231 | } 1232 | 1233 | // return zero if key is on right link page 1234 | 1235 | return good ? higher : 0; 1236 | } 1237 | 1238 | // find and load page at given level for given key 1239 | // leave page rd or wr locked as requested 1240 | 1241 | int bt_loadpage (BtDb *bt, unsigned char *key, uint len, uint lvl, uint lock) 1242 | { 1243 | uid page_no = ROOT_page, prevpage = 0; 1244 | uint drill = 0xff, slot; 1245 | BtLatchSet *prevlatch; 1246 | uint mode, prevmode; 1247 | 1248 | // start at root of btree and drill down 1249 | 1250 | do { 1251 | // determine lock mode of drill level 1252 | mode = (lock == BtLockWrite) && (drill == lvl) ? 
BtLockWrite : BtLockRead; 1253 | 1254 | if( bt->latch = bt_pinlatch(bt, page_no) ) 1255 | bt->page_no = page_no; 1256 | else 1257 | return 0; 1258 | 1259 | // obtain access lock using lock chaining 1260 | 1261 | if( page_no > ROOT_page ) 1262 | bt_lockpage(BtLockAccess, bt->latch); 1263 | 1264 | if( prevpage ) { 1265 | bt_unlockpage(prevmode, prevlatch); 1266 | bt_unpinlatch(prevlatch); 1267 | prevpage = 0; 1268 | } 1269 | 1270 | // obtain read lock using lock chaining 1271 | 1272 | bt_lockpage(mode, bt->latch); 1273 | 1274 | if( page_no > ROOT_page ) 1275 | bt_unlockpage(BtLockAccess, bt->latch); 1276 | 1277 | // map/obtain page contents 1278 | 1279 | bt->page = bt_mappage (bt, bt->latch); 1280 | 1281 | // re-read and re-lock root after determining actual level of root 1282 | 1283 | if( bt->page->lvl != drill) { 1284 | if( bt->page_no != ROOT_page ) 1285 | return bt->err = BTERR_struct, 0; 1286 | 1287 | drill = bt->page->lvl; 1288 | 1289 | if( lock != BtLockRead && drill == lvl ) { 1290 | bt_unlockpage(mode, bt->latch); 1291 | bt_unpinlatch(bt->latch); 1292 | continue; 1293 | } 1294 | } 1295 | 1296 | prevpage = bt->page_no; 1297 | prevlatch = bt->latch; 1298 | prevmode = mode; 1299 | 1300 | // find key on page at this level 1301 | // and descend to requested level 1302 | 1303 | if( !bt->page->kill ) 1304 | if( slot = bt_findslot (bt, key, len) ) { 1305 | if( drill == lvl ) 1306 | return slot; 1307 | 1308 | while( slotptr(bt->page, slot)->dead ) 1309 | if( slot++ < bt->page->cnt ) 1310 | continue; 1311 | else 1312 | goto slideright; 1313 | 1314 | page_no = bt_getid(slotptr(bt->page, slot)->id); 1315 | drill--; 1316 | continue; 1317 | } 1318 | 1319 | // or slide right into next page 1320 | 1321 | slideright: 1322 | page_no = bt_getid(bt->page->right); 1323 | 1324 | } while( page_no ); 1325 | 1326 | // return error on end of right chain 1327 | 1328 | bt->err = BTERR_eof; 1329 | return 0; // return error 1330 | } 1331 | 1332 | // a fence key was deleted from a page 
1333 | // push new fence value upwards 1334 | 1335 | BTERR bt_fixfence (BtDb *bt, uid page_no, uint lvl) 1336 | { 1337 | unsigned char leftkey[256], rightkey[256]; 1338 | BtLatchSet *latch = bt->latch; 1339 | BtKey ptr; 1340 | 1341 | // remove deleted key, the old fence value 1342 | 1343 | ptr = keyptr(bt->page, bt->page->cnt); 1344 | memcpy(rightkey, ptr, ptr->len + 1); 1345 | 1346 | memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot)); 1347 | bt->page->clean = 1; 1348 | 1349 | ptr = keyptr(bt->page, bt->page->cnt); 1350 | memcpy(leftkey, ptr, ptr->len + 1); 1351 | 1352 | bt_update (bt, bt->page); 1353 | bt_lockpage (BtLockParent, latch); 1354 | bt_unlockpage (BtLockWrite, latch); 1355 | 1356 | // insert new (now smaller) fence key 1357 | 1358 | if( bt_insertkey (bt, leftkey+1, *leftkey, lvl + 1, page_no, time(NULL)) ) 1359 | return bt->err; 1360 | 1361 | // remove old (larger) fence key 1362 | 1363 | if( bt_deletekey (bt, rightkey+1, *rightkey, lvl + 1) ) 1364 | return bt->err; 1365 | 1366 | bt_unlockpage (BtLockParent, latch); 1367 | bt_unpinlatch (latch); 1368 | return 0; 1369 | } 1370 | 1371 | // root has a single child 1372 | // collapse a level from the btree 1373 | // call with root locked in bt->page 1374 | 1375 | BTERR bt_collapseroot (BtDb *bt, BtPage root) 1376 | { 1377 | BtLatchSet *latch; 1378 | BtPage temp; 1379 | uid child; 1380 | uint idx; 1381 | 1382 | // find the child entry 1383 | // and promote to new root 1384 | 1385 | do { 1386 | for( idx = 0; idx++ < root->cnt; ) 1387 | if( !slotptr(root, idx)->dead ) 1388 | break; 1389 | 1390 | child = bt_getid (slotptr(root, idx)->id); 1391 | if( latch = bt_pinlatch (bt, child) ) 1392 | temp = bt_mappage (bt, latch); 1393 | else 1394 | return bt->err; 1395 | 1396 | bt_lockpage (BtLockDelete, latch); 1397 | bt_lockpage (BtLockWrite, latch); 1398 | memcpy (root, temp, bt->page_size); 1399 | 1400 | bt_update (bt, root); 1401 | 1402 | if( bt_freepage (bt, child, latch) ) 1403 | return bt->err; 1404 | 
1405 | } while( root->lvl > 1 && root->act == 1 ); 1406 | 1407 | bt_unlockpage (BtLockWrite, bt->latch); 1408 | bt_unpinlatch (bt->latch); 1409 | return 0; 1410 | } 1411 | 1412 | // find and delete key on page by marking delete flag bit 1413 | // when page becomes empty, delete it 1414 | 1415 | BTERR bt_deletekey (BtDb *bt, unsigned char *key, uint len, uint lvl) 1416 | { 1417 | unsigned char lowerkey[256], higherkey[256]; 1418 | uint slot, dirty = 0, idx, fence, found; 1419 | BtLatchSet *latch, *rlatch; 1420 | uid page_no, right; 1421 | BtPage temp; 1422 | BtKey ptr; 1423 | 1424 | if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) ) 1425 | ptr = keyptr(bt->page, slot); 1426 | else 1427 | return bt->err; 1428 | 1429 | // are we deleting a fence slot? 1430 | 1431 | fence = slot == bt->page->cnt; 1432 | 1433 | // if key is found delete it, otherwise ignore request 1434 | 1435 | if( found = !keycmp (ptr, key, len) ) 1436 | if( found = slotptr(bt->page, slot)->dead == 0 ) { 1437 | dirty = slotptr(bt->page,slot)->dead = 1; 1438 | bt->page->clean = 1; 1439 | bt->page->act--; 1440 | 1441 | // collapse empty slots 1442 | 1443 | while( idx = bt->page->cnt - 1 ) 1444 | if( slotptr(bt->page, idx)->dead ) { 1445 | *slotptr(bt->page, idx) = *slotptr(bt->page, idx + 1); 1446 | memset (slotptr(bt->page, bt->page->cnt--), 0, sizeof(BtSlot)); 1447 | } else 1448 | break; 1449 | } 1450 | 1451 | right = bt_getid(bt->page->right); 1452 | page_no = bt->page_no; 1453 | latch = bt->latch; 1454 | 1455 | if( !dirty ) { 1456 | if( lvl ) 1457 | return bt_abort (bt, bt->page, page_no, BTERR_notfound); 1458 | bt_unlockpage(BtLockWrite, latch); 1459 | bt_unpinlatch (latch); 1460 | return bt->found = found, 0; 1461 | } 1462 | 1463 | // did we delete a fence key in an upper level? 1464 | 1465 | if( lvl && bt->page->act && fence ) 1466 | if( bt_fixfence (bt, page_no, lvl) ) 1467 | return bt->err; 1468 | else 1469 | return bt->found = found, 0; 1470 | 1471 | // is this a collapsed root? 
1472 | 1473 | if( lvl > 1 && page_no == ROOT_page && bt->page->act == 1 ) 1474 | if( bt_collapseroot (bt, bt->page) ) 1475 | return bt->err; 1476 | else 1477 | return bt->found = found, 0; 1478 | 1479 | // return if page is not empty 1480 | 1481 | if( bt->page->act ) { 1482 | bt_update(bt, bt->page); 1483 | bt_unlockpage(BtLockWrite, latch); 1484 | bt_unpinlatch (latch); 1485 | return bt->found = found, 0; 1486 | } 1487 | 1488 | // cache copy of fence key 1489 | // in order to find parent 1490 | 1491 | ptr = keyptr(bt->page, bt->page->cnt); 1492 | memcpy(lowerkey, ptr, ptr->len + 1); 1493 | 1494 | // obtain lock on right page 1495 | 1496 | if( rlatch = bt_pinlatch (bt, right) ) 1497 | temp = bt_mappage (bt, rlatch); 1498 | else 1499 | return bt->err; 1500 | 1501 | bt_lockpage(BtLockWrite, rlatch); 1502 | 1503 | if( temp->kill ) { 1504 | bt_abort(bt, temp, right, 0); 1505 | return bt_abort(bt, bt->page, bt->page_no, BTERR_kill); 1506 | } 1507 | 1508 | // pull contents of next page into current empty page 1509 | 1510 | memcpy (bt->page, temp, bt->page_size); 1511 | 1512 | // cache copy of key to update 1513 | 1514 | ptr = keyptr(temp, temp->cnt); 1515 | memcpy(higherkey, ptr, ptr->len + 1); 1516 | 1517 | // Mark right page as deleted and point it to left page 1518 | // until we can post updates at higher level. 
1519 | 1520 | bt_putid(temp->right, page_no); 1521 | temp->kill = 1; 1522 | 1523 | bt_update(bt, bt->page); 1524 | bt_update(bt, temp); 1525 | 1526 | bt_lockpage(BtLockParent, latch); 1527 | bt_unlockpage(BtLockWrite, latch); 1528 | 1529 | bt_lockpage(BtLockParent, rlatch); 1530 | bt_unlockpage(BtLockWrite, rlatch); 1531 | 1532 | // redirect higher key directly to consolidated node 1533 | 1534 | if( bt_insertkey (bt, higherkey+1, *higherkey, lvl+1, page_no, time(NULL)) ) 1535 | return bt->err; 1536 | 1537 | // delete old lower key to consolidated node 1538 | 1539 | if( bt_deletekey (bt, lowerkey + 1, *lowerkey, lvl + 1) ) 1540 | return bt->err; 1541 | 1542 | // obtain write & delete lock on deleted node 1543 | // add right block to free chain 1544 | 1545 | bt_lockpage(BtLockDelete, rlatch); 1546 | bt_lockpage(BtLockWrite, rlatch); 1547 | bt_unlockpage(BtLockParent, rlatch); 1548 | 1549 | if( bt_freepage (bt, right, rlatch) ) 1550 | return bt->err; 1551 | 1552 | bt_unlockpage(BtLockParent, latch); 1553 | bt_unpinlatch(latch); 1554 | return 0; 1555 | } 1556 | 1557 | // find key in leaf level and return row-id 1558 | 1559 | uid bt_findkey (BtDb *bt, unsigned char *key, uint len) 1560 | { 1561 | uint slot; 1562 | BtKey ptr; 1563 | uid id; 1564 | 1565 | if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) ) 1566 | ptr = keyptr(bt->page, slot); 1567 | else 1568 | return 0; 1569 | 1570 | // if key exists, return row-id 1571 | // otherwise return 0 1572 | 1573 | if( ptr->len == len && !memcmp (ptr->key, key, len) ) 1574 | id = bt_getid(slotptr(bt->page,slot)->id); 1575 | else 1576 | id = 0; 1577 | 1578 | bt_unlockpage (BtLockRead, bt->latch); 1579 | bt_unpinlatch (bt->latch); 1580 | return id; 1581 | } 1582 | 1583 | // check page for space available, 1584 | // clean if necessary and return 1585 | // 0 - page needs splitting 1586 | // >0 - go ahead with new slot 1587 | 1588 | uint bt_cleanpage(BtDb *bt, uint amt, uint slot) 1589 | { 1590 | uint nxt = bt->page_size; 1591 | 
BtPage page = bt->page; 1592 | uint cnt = 0, idx = 0; 1593 | uint max = page->cnt; 1594 | uint newslot = slot; 1595 | BtKey key; 1596 | int ret; 1597 | 1598 | if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 ) 1599 | return slot; 1600 | 1601 | // skip cleanup if nothing to reclaim 1602 | 1603 | if( !page->clean ) 1604 | return 0; 1605 | 1606 | memcpy (bt->frame, page, bt->page_size); 1607 | 1608 | // skip page info and set rest of page to zero 1609 | 1610 | memset (page+1, 0, bt->page_size - sizeof(*page)); 1611 | page->act = 0; 1612 | 1613 | while( cnt++ < max ) { 1614 | if( cnt == slot ) 1615 | newslot = idx + 1; 1616 | // always leave fence key in list 1617 | if( cnt < max && slotptr(bt->frame,cnt)->dead ) 1618 | continue; 1619 | 1620 | // copy key 1621 | key = keyptr(bt->frame, cnt); 1622 | nxt -= key->len + 1; 1623 | memcpy ((unsigned char *)page + nxt, key, key->len + 1); 1624 | 1625 | // copy slot 1626 | memcpy(slotptr(page, ++idx)->id, slotptr(bt->frame, cnt)->id, BtId); 1627 | if( !(slotptr(page, idx)->dead = slotptr(bt->frame, cnt)->dead) ) 1628 | page->act++; 1629 | #ifdef USETOD 1630 | slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod; 1631 | #endif 1632 | slotptr(page, idx)->off = nxt; 1633 | } 1634 | 1635 | page->min = nxt; 1636 | page->cnt = idx; 1637 | 1638 | if( page->min >= (max+1) * sizeof(BtSlot) + sizeof(*page) + amt + 1 ) 1639 | return newslot; 1640 | 1641 | return 0; 1642 | } 1643 | 1644 | // split the root and raise the height of the btree 1645 | 1646 | BTERR bt_splitroot(BtDb *bt, unsigned char *leftkey, uid page_no2) 1647 | { 1648 | uint nxt = bt->page_size; 1649 | BtPage root = bt->page; 1650 | uid right; 1651 | 1652 | // Obtain an empty page to use, and copy the current 1653 | // root contents into it 1654 | 1655 | if( !(right = bt_newpage(bt, root)) ) 1656 | return bt->err; 1657 | 1658 | // preserve the page info at the bottom 1659 | // and set rest to zero 1660 | 1661 | memset(root+1, 0, bt->page_size - 
sizeof(*root)); 1662 | 1663 | // insert first key on newroot page 1664 | 1665 | nxt -= *leftkey + 1; 1666 | memcpy ((unsigned char *)root + nxt, leftkey, *leftkey + 1); 1667 | bt_putid(slotptr(root, 1)->id, right); 1668 | slotptr(root, 1)->off = nxt; 1669 | 1670 | // insert second key on newroot page 1671 | // and increase the root height 1672 | 1673 | nxt -= 3; 1674 | ((unsigned char *)root)[nxt] = 2; 1675 | ((unsigned char *)root)[nxt+1] = 0xff; 1676 | ((unsigned char *)root)[nxt+2] = 0xff; 1677 | bt_putid(slotptr(root, 2)->id, page_no2); 1678 | slotptr(root, 2)->off = nxt; 1679 | 1680 | bt_putid(root->right, 0); 1681 | root->min = nxt; // reset lowest used offset and key count 1682 | root->cnt = 2; 1683 | root->act = 2; 1684 | root->lvl++; 1685 | 1686 | // update and release root (bt->page) 1687 | 1688 | bt_update(bt, root); 1689 | 1690 | bt_unlockpage(BtLockWrite, bt->latch); 1691 | bt_unpinlatch(bt->latch); 1692 | return 0; 1693 | } 1694 | 1695 | // split already locked full node 1696 | // return unlocked. 
1697 | 1698 | BTERR bt_splitpage (BtDb *bt) 1699 | { 1700 | uint cnt = 0, idx = 0, max, nxt = bt->page_size; 1701 | unsigned char fencekey[256], rightkey[256]; 1702 | uid page_no = bt->page_no, right; 1703 | BtLatchSet *latch, *rlatch; 1704 | BtPage page = bt->page; 1705 | uint lvl = page->lvl; 1706 | BtKey key; 1707 | 1708 | latch = bt->latch; 1709 | 1710 | // split higher half of keys to bt->frame 1711 | // the last key (fence key) might be dead 1712 | 1713 | memset (bt->frame, 0, bt->page_size); 1714 | max = page->cnt; 1715 | cnt = max / 2; 1716 | idx = 0; 1717 | 1718 | while( cnt++ < max ) { 1719 | key = keyptr(page, cnt); 1720 | nxt -= key->len + 1; 1721 | memcpy ((unsigned char *)bt->frame + nxt, key, key->len + 1); 1722 | memcpy(slotptr(bt->frame,++idx)->id, slotptr(page,cnt)->id, BtId); 1723 | if( !(slotptr(bt->frame, idx)->dead = slotptr(page, cnt)->dead) ) 1724 | bt->frame->act++; 1725 | #ifdef USETOD 1726 | slotptr(bt->frame, idx)->tod = slotptr(page, cnt)->tod; 1727 | #endif 1728 | slotptr(bt->frame, idx)->off = nxt; 1729 | } 1730 | 1731 | // remember fence key for new right page 1732 | 1733 | memcpy (rightkey, key, key->len + 1); 1734 | 1735 | bt->frame->bits = bt->page_bits; 1736 | bt->frame->min = nxt; 1737 | bt->frame->cnt = idx; 1738 | bt->frame->lvl = lvl; 1739 | 1740 | // link right node 1741 | 1742 | if( page_no > ROOT_page ) 1743 | memcpy (bt->frame->right, page->right, BtId); 1744 | 1745 | // get new free page and write frame to it. 
1746 | 1747 | if( !(right = bt_newpage(bt, bt->frame)) ) 1748 | return bt->err; 1749 | 1750 | // update lower keys to continue in old page 1751 | 1752 | memcpy (bt->frame, page, bt->page_size); 1753 | memset (page+1, 0, bt->page_size - sizeof(*page)); 1754 | nxt = bt->page_size; 1755 | page->clean = 0; 1756 | page->act = 0; 1757 | cnt = 0; 1758 | idx = 0; 1759 | 1760 | // assemble page of smaller keys 1761 | // (they're all active keys) 1762 | 1763 | while( cnt++ < max / 2 ) { 1764 | key = keyptr(bt->frame, cnt); 1765 | nxt -= key->len + 1; 1766 | memcpy ((unsigned char *)page + nxt, key, key->len + 1); 1767 | memcpy(slotptr(page,++idx)->id, slotptr(bt->frame,cnt)->id, BtId); 1768 | #ifdef USETOD 1769 | slotptr(page, idx)->tod = slotptr(bt->frame, cnt)->tod; 1770 | #endif 1771 | slotptr(page, idx)->off = nxt; 1772 | page->act++; 1773 | } 1774 | 1775 | // remember fence key for smaller page 1776 | 1777 | memcpy (fencekey, key, key->len + 1); 1778 | 1779 | bt_putid(page->right, right); 1780 | page->min = nxt; 1781 | page->cnt = idx; 1782 | 1783 | // if current page is the root page, split it 1784 | 1785 | if( page_no == ROOT_page ) 1786 | return bt_splitroot (bt, fencekey, right); 1787 | 1788 | // lock right page 1789 | 1790 | if( rlatch = bt_pinlatch (bt, right) ) 1791 | bt_lockpage (BtLockParent, rlatch); 1792 | else 1793 | return bt->err; 1794 | 1795 | // update left (containing) node 1796 | 1797 | bt_update(bt, page); 1798 | 1799 | bt_lockpage (BtLockParent, latch); 1800 | bt_unlockpage (BtLockWrite, latch); 1801 | 1802 | // insert new fence for reformulated left block 1803 | 1804 | if( bt_insertkey (bt, fencekey+1, *fencekey, lvl+1, page_no, time(NULL)) ) 1805 | return bt->err; 1806 | 1807 | // switch fence for right block of larger keys to new right page 1808 | 1809 | if( bt_insertkey (bt, rightkey+1, *rightkey, lvl+1, right, time(NULL)) ) 1810 | return bt->err; 1811 | 1812 | bt_unlockpage (BtLockParent, latch); 1813 | bt_unlockpage (BtLockParent, rlatch); 1814 
| 1815 | bt_unpinlatch (rlatch); 1816 | bt_unpinlatch (latch); 1817 | return 0; 1818 | } 1819 | 1820 | // Insert new key into the btree at requested level. 1821 | // Pages are unlocked at exit. 1822 | 1823 | BTERR bt_insertkey (BtDb *bt, unsigned char *key, uint len, uint lvl, uid id, uint tod) 1824 | { 1825 | uint slot, idx; 1826 | BtPage page; 1827 | BtKey ptr; 1828 | 1829 | while( 1 ) { 1830 | if( slot = bt_loadpage (bt, key, len, lvl, BtLockWrite) ) 1831 | ptr = keyptr(bt->page, slot); 1832 | else 1833 | { 1834 | if( !bt->err ) 1835 | bt->err = BTERR_ovflw; 1836 | return bt->err; 1837 | } 1838 | 1839 | // if key already exists, update id and return 1840 | 1841 | page = bt->page; 1842 | 1843 | if( !keycmp (ptr, key, len) ) { 1844 | if( slotptr(page, slot)->dead ) 1845 | page->act++; 1846 | slotptr(page, slot)->dead = 0; 1847 | #ifdef USETOD 1848 | slotptr(page, slot)->tod = tod; 1849 | #endif 1850 | bt_putid(slotptr(page,slot)->id, id); 1851 | bt_update(bt, bt->page); 1852 | bt_unlockpage(BtLockWrite, bt->latch); 1853 | bt_unpinlatch (bt->latch); 1854 | return 0; 1855 | } 1856 | 1857 | // check if page has enough space 1858 | 1859 | if( slot = bt_cleanpage (bt, len, slot) ) 1860 | break; 1861 | 1862 | if( bt_splitpage (bt) ) 1863 | return bt->err; 1864 | } 1865 | 1866 | // calculate next available slot and copy key into page 1867 | 1868 | page->min -= len + 1; // reset lowest used offset 1869 | ((unsigned char *)page)[page->min] = len; 1870 | memcpy ((unsigned char *)page + page->min +1, key, len ); 1871 | 1872 | for( idx = slot; idx < page->cnt; idx++ ) 1873 | if( slotptr(page, idx)->dead ) 1874 | break; 1875 | 1876 | // now insert key into array before slot 1877 | // preserving the fence slot 1878 | 1879 | if( idx == page->cnt ) 1880 | idx++, page->cnt++; 1881 | 1882 | page->act++; 1883 | 1884 | while( idx > slot ) 1885 | *slotptr(page, idx) = *slotptr(page, idx -1), idx--; 1886 | 1887 | bt_putid(slotptr(page,slot)->id, id); 1888 | slotptr(page, slot)->off = 
page->min; 1889 | #ifdef USETOD 1890 | slotptr(page, slot)->tod = tod; 1891 | #endif 1892 | slotptr(page, slot)->dead = 0; 1893 | 1894 | bt_update(bt, bt->page); 1895 | 1896 | bt_unlockpage(BtLockWrite, bt->latch); 1897 | bt_unpinlatch(bt->latch); 1898 | return 0; 1899 | } 1900 | 1901 | // cache page of keys into cursor and return starting slot for given key 1902 | 1903 | uint bt_startkey (BtDb *bt, unsigned char *key, uint len) 1904 | { 1905 | uint slot; 1906 | 1907 | // cache page for retrieval 1908 | 1909 | if( slot = bt_loadpage (bt, key, len, 0, BtLockRead) ) 1910 | memcpy (bt->cursor, bt->page, bt->page_size); 1911 | else 1912 | return 0; 1913 | 1914 | bt_unlockpage(BtLockRead, bt->latch); 1915 | bt->cursor_page = bt->page_no; 1916 | bt_unpinlatch (bt->latch); 1917 | return slot; 1918 | } 1919 | 1920 | // return next slot for cursor page 1921 | // or slide cursor right into next page 1922 | 1923 | uint bt_nextkey (BtDb *bt, uint slot) 1924 | { 1925 | BtLatchSet *latch; 1926 | off64_t right; 1927 | 1928 | do { 1929 | right = bt_getid(bt->cursor->right); 1930 | 1931 | while( slot++ < bt->cursor->cnt ) 1932 | if( slotptr(bt->cursor,slot)->dead ) 1933 | continue; 1934 | else if( right || (slot < bt->cursor->cnt)) 1935 | return slot; 1936 | else 1937 | break; 1938 | 1939 | if( !right ) 1940 | break; 1941 | 1942 | bt->cursor_page = right; 1943 | 1944 | if( latch = bt_pinlatch (bt, right) ) 1945 | bt_lockpage(BtLockRead, latch); 1946 | else 1947 | return 0; 1948 | 1949 | bt->page = bt_mappage (bt, latch); 1950 | memcpy (bt->cursor, bt->page, bt->page_size); 1951 | bt_unlockpage(BtLockRead, latch); 1952 | bt_unpinlatch (latch); 1953 | slot = 0; 1954 | } while( 1 ); 1955 | 1956 | return bt->err = 0; 1957 | } 1958 | 1959 | BtKey bt_key(BtDb *bt, uint slot) 1960 | { 1961 | return keyptr(bt->cursor, slot); 1962 | } 1963 | 1964 | uid bt_uid(BtDb *bt, uint slot) 1965 | { 1966 | return bt_getid(slotptr(bt->cursor,slot)->id); 1967 | } 1968 | 1969 | #ifdef USETOD 1970 | uint 
bt_tod(BtDb *bt, uint slot) 1971 | { 1972 | return slotptr(bt->cursor,slot)->tod; 1973 | } 1974 | #endif 1975 | 1976 | #ifdef STANDALONE 1977 | 1978 | uint bt_audit (BtDb *bt) 1979 | { 1980 | uint idx, hashidx; 1981 | uid next, page_no; 1982 | BtLatchSet *latch; 1983 | uint blks[64]; 1984 | uint cnt = 0; 1985 | BtPage page; 1986 | uint amt[1]; 1987 | BtKey ptr; 1988 | 1989 | #ifdef unix 1990 | posix_fadvise( bt->idx, 0, 0, POSIX_FADV_SEQUENTIAL); 1991 | #endif 1992 | if( *(ushort *)(bt->latchmgr->lock) ) 1993 | fprintf(stderr, "Alloc page locked\n"); 1994 | *(ushort *)(bt->latchmgr->lock) = 0; 1995 | 1996 | memset (blks, 0, sizeof(blks)); 1997 | 1998 | for( idx = 1; idx <= bt->latchmgr->latchdeployed; idx++ ) { 1999 | latch = bt->latchsets + idx; 2000 | if( *(ushort *)latch->readwr ) 2001 | fprintf(stderr, "latchset %d rwlocked for page %.8x\n", idx, latch->page_no); 2002 | *(ushort *)latch->readwr = 0; 2003 | 2004 | if( *(ushort *)latch->access ) 2005 | fprintf(stderr, "latchset %d accesslocked for page %.8x\n", idx, latch->page_no); 2006 | *(ushort *)latch->access = 0; 2007 | 2008 | if( *(ushort *)latch->parent ) 2009 | fprintf(stderr, "latchset %d parentlocked for page %.8x\n", idx, latch->page_no); 2010 | *(ushort *)latch->parent = 0; 2011 | 2012 | if( latch->pin & PIN_mask ) { 2013 | fprintf(stderr, "latchset %d pinned for page %.8x\n", idx, latch->page_no); 2014 | latch->pin = 0; 2015 | } 2016 | page = (BtPage)((uid)idx * bt->page_size + bt->pagepool); 2017 | blks[page->lvl]++; 2018 | 2019 | if( page->dirty ) 2020 | if( bt_writepage (bt, page, latch->page_no) ) 2021 | fprintf(stderr, "Page %.8x Write Error\n", latch->page_no); 2022 | } 2023 | 2024 | for( idx = 0; blks[idx]; idx++ ) 2025 | fprintf(stderr, "cache: %d lvl %d blocks\n", blks[idx], idx); 2026 | 2027 | for( hashidx = 0; hashidx < bt->latchmgr->latchhash; hashidx++ ) { 2028 | if( *(ushort *)(bt->table[hashidx].latch) ) 2029 | fprintf(stderr, "hash entry %d locked\n", hashidx); 2030 | 2031 | *(ushort 
*)(bt->table[hashidx].latch) = 0; 2032 | } 2033 | 2034 | memset (blks, 0, sizeof(blks)); 2035 | 2036 | next = bt->latchmgr->nlatchpage + LATCH_page; 2037 | page_no = LEAF_page; 2038 | 2039 | while( page_no < bt_getid(bt->latchmgr->alloc->right) ) { 2040 | if( bt_readpage (bt, bt->frame, page_no) ) 2041 | fprintf(stderr, "page %.8x unreadable\n", page_no); 2042 | if( !bt->frame->free ) { 2043 | for( idx = 0; idx++ < bt->frame->cnt - 1; ) { 2044 | ptr = keyptr(bt->frame, idx+1); 2045 | if( keycmp (keyptr(bt->frame, idx), ptr->key, ptr->len) >= 0 ) 2046 | fprintf(stderr, "page %.8x idx %.2x out of order\n", page_no, idx); 2047 | } 2048 | if( !bt->frame->lvl ) 2049 | cnt += bt->frame->act; 2050 | blks[bt->frame->lvl]++; 2051 | } 2052 | 2053 | if( page_no > LEAF_page ) 2054 | next = page_no + 1; 2055 | page_no = next; 2056 | } 2057 | 2058 | for( idx = 0; blks[idx]; idx++ ) 2059 | fprintf(stderr, "btree: %d lvl %d blocks\n", blks[idx], idx); 2060 | 2061 | return cnt - 1; 2062 | } 2063 | 2064 | #ifndef unix 2065 | double getCpuTime(int type) 2066 | { 2067 | FILETIME crtime[1]; 2068 | FILETIME xittime[1]; 2069 | FILETIME systime[1]; 2070 | FILETIME usrtime[1]; 2071 | SYSTEMTIME timeconv[1]; 2072 | double ans = 0; 2073 | 2074 | memset (timeconv, 0, sizeof(SYSTEMTIME)); 2075 | 2076 | switch( type ) { 2077 | case 0: 2078 | GetSystemTimeAsFileTime (xittime); 2079 | FileTimeToSystemTime (xittime, timeconv); 2080 | ans = (double)timeconv->wDayOfWeek * 3600 * 24; 2081 | break; 2082 | case 1: 2083 | GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime); 2084 | FileTimeToSystemTime (usrtime, timeconv); 2085 | break; 2086 | case 2: 2087 | GetProcessTimes (GetCurrentProcess(), crtime, xittime, systime, usrtime); 2088 | FileTimeToSystemTime (systime, timeconv); 2089 | break; 2090 | } 2091 | 2092 | ans += (double)timeconv->wHour * 3600; 2093 | ans += (double)timeconv->wMinute * 60; 2094 | ans += (double)timeconv->wSecond; 2095 | ans += (double)timeconv->wMilliseconds 
/ 1000; 2096 | return ans; 2097 | } 2098 | #else 2099 | #include <sys/time.h> 2100 | #include <sys/resource.h> 2101 | 2102 | double getCpuTime(int type) 2103 | { 2104 | struct rusage used[1]; 2105 | struct timeval tv[1]; 2106 | 2107 | switch( type ) { 2108 | case 0: 2109 | gettimeofday(tv, NULL); 2110 | return (double)tv->tv_sec + (double)tv->tv_usec / 1000000; 2111 | 2112 | case 1: 2113 | getrusage(RUSAGE_SELF, used); 2114 | return (double)used->ru_utime.tv_sec + (double)used->ru_utime.tv_usec / 1000000; 2115 | 2116 | case 2: 2117 | getrusage(RUSAGE_SELF, used); 2118 | return (double)used->ru_stime.tv_sec + (double)used->ru_stime.tv_usec / 1000000; 2119 | } 2120 | 2121 | return 0; 2122 | } 2123 | #endif 2124 | 2125 | // standalone program to index file of keys 2126 | // then list them onto std-out 2127 | 2128 | int main (int argc, char **argv) 2129 | { 2130 | uint slot, line = 0, off = 0, found = 0; 2131 | int ch, cnt = 0, bits = 12, idx; 2132 | unsigned char key[256]; 2133 | double done, start; 2134 | uid next, page_no; 2135 | BtLatchSet *latch; 2136 | float elapsed; 2137 | time_t tod[1]; 2138 | uint scan = 0; 2139 | uint len = 0; 2140 | uint map = 0; 2141 | BtPage page; 2142 | BtKey ptr; 2143 | BtDb *bt; 2144 | FILE *in; 2145 | 2146 | #ifdef WIN32 2147 | _setmode (1, _O_BINARY); 2148 | #endif 2149 | if( argc < 4 ) { 2150 | fprintf (stderr, "Usage: %s idx_file src_file Read/Write/Scan/Delete/Find/Count [page_bits mapped_pool_pages start_line_number]\n", argv[0]); 2151 | fprintf (stderr, " page_bits: size of btree page in bits\n"); 2152 | fprintf (stderr, " mapped_pool_pages: number of pages in buffer pool\n"); 2153 | exit(0); 2154 | } 2155 | 2156 | start = getCpuTime(0); 2157 | time(tod); 2158 | 2159 | if( argc > 4 ) 2160 | bits = atoi(argv[4]); 2161 | 2162 | if( argc > 5 ) 2163 | map = atoi(argv[5]); 2164 | 2165 | if( argc > 6 ) 2166 | off = atoi(argv[6]); 2167 | 2168 | bt = bt_open ((argv[1]), BT_rw, bits, map); 2169 | 2170 | if( !bt ) { 2171 | fprintf(stderr, "Index Open Error %s\n", 
argv[1]); 2172 | exit (1); 2173 | } 2174 | 2175 | switch(argv[3][0]| 0x20) 2176 | { 2177 | case 'p': // display page 2178 | if( latch = bt_pinlatch (bt, off) ) 2179 | page = bt_mappage (bt, latch); 2180 | else 2181 | fprintf(stderr, "unable to read page %.8x\n", off); 2182 | 2183 | write (1, page, bt->page_size); 2184 | break; 2185 | 2186 | case 'a': // buffer pool audit 2187 | fprintf(stderr, "started audit for %s\n", argv[1]); 2188 | cnt = bt_audit (bt); 2189 | fprintf(stderr, "finished audit for %s, %d keys\n", argv[1], cnt); 2190 | break; 2191 | 2192 | case 'w': // write keys 2193 | fprintf(stderr, "started indexing for %s\n", argv[2]); 2194 | if( argc > 2 && (in = fopen (argv[2], "rb")) ) { 2195 | #ifdef unix 2196 | posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE); 2197 | #endif 2198 | while( ch = getc(in), ch != EOF ) 2199 | if( ch == '\n' ) 2200 | { 2201 | if( off ) 2202 | sprintf((char *)key+len, "%.9d", line + off), len += 9; 2203 | 2204 | if( bt_insertkey (bt, key, len, 0, ++line, *tod) ) 2205 | fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0); 2206 | len = 0; 2207 | } 2208 | else if( len < 245 ) 2209 | key[len++] = ch; 2210 | } 2211 | fprintf(stderr, "finished adding keys for %s, %d \n", argv[2], line); 2212 | break; 2213 | 2214 | case 'd': // delete keys 2215 | fprintf(stderr, "started deleting keys for %s\n", argv[2]); 2216 | if( argc > 2 && (in = fopen (argv[2], "rb")) ) { 2217 | #ifdef unix 2218 | posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE); 2219 | #endif 2220 | while( ch = getc(in), ch != EOF ) 2221 | if( ch == '\n' ) 2222 | { 2223 | if( off ) 2224 | sprintf((char *)key+len, "%.9d", line + off), len += 9; 2225 | line++; 2226 | if( bt_deletekey (bt, key, len, 0) ) 2227 | fprintf(stderr, "Error %d Line: %d\n", bt->err, line), exit(0); 2228 | len = 0; 2229 | } 2230 | else if( len < 245 ) 2231 | key[len++] = ch; 2232 | } 2233 | fprintf(stderr, "finished deleting keys for %s, %d \n", argv[2], line); 2234 | break; 2235 | 2236 | 
	case 'f':	// find keys
		fprintf(stderr, "started finding keys for %s\n", argv[2]);
		if( argc > 2 && (in = fopen (argv[2], "rb")) ) {
#ifdef unix
			posix_fadvise( fileno(in), 0, 0, POSIX_FADV_NOREUSE);
#endif
			// same line-by-line key assembly as the 'w' case
			while( ch = getc(in), ch != EOF )
				if( ch == '\n' )
				{
					if( off )
						sprintf((char *)key+len, "%.9d", line + off), len += 9;
					line++;
					if( bt_findkey (bt, key, len) )
						found++;
					else if( bt->err )
						fprintf(stderr, "Error %d Syserr %d Line: %d\n", bt->err, errno, line), exit(0);
					len = 0;
				}
				else if( len < 245 )
					key[len++] = ch;
		}
		fprintf(stderr, "finished search of %d keys for %s, found %d\n", line, argv[2], found);
		break;

	case 's':	// scan and print keys
		fprintf(stderr, "started scanning\n");
		cnt = len = key[0] = 0;

		// start from the empty key, then step back one slot so that
		// the first bt_nextkey call returns the first key
		if( slot = bt_startkey (bt, key, len) )
			slot--;
		else
			fprintf(stderr, "Error %d in StartKey. Syserror: %d\n", bt->err, errno), exit(0);

		while( slot = bt_nextkey (bt, slot) ) {
			ptr = bt_key(bt, slot);
			fwrite (ptr->key, ptr->len, 1, stdout);
			fputc ('\n', stdout);
			cnt++;
		}

		fprintf(stderr, " Total keys read %d\n", cnt - 1);
		break;

	case 'c':	// count keys
		fprintf(stderr, "started counting\n");
		cnt = 0;

		// visit LEAF_page first, then every page after the latch pages
		next = bt->latchmgr->nlatchpage + LATCH_page;
		page_no = LEAF_page;

		while( page_no < bt_getid(bt->latchmgr->alloc->right) ) {
			// stop rather than dereference an unmapped page or unpin a null latch
			if( !(latch = bt_pinlatch (bt, page_no)) ) {
				fprintf(stderr, "unable to read page %llu\n", (unsigned long long)page_no);
				break;
			}
			page = bt_mappage (bt, latch);
			if( !page->free && !page->lvl )
				cnt += page->act;
			if( page_no > LEAF_page )
				next = page_no + 1;
			if( scan )
				for( idx = 0; idx++ < page->cnt; ) {
					if( slotptr(page, idx)->dead )
						continue;
					ptr = keyptr(page, idx);
					if( idx != page->cnt && bt_getid (page->right) ) {
						fwrite (ptr->key, ptr->len, 1, stdout);
						fputc ('\n', stdout);
					}
				}
			bt_unpinlatch (latch);
			page_no = next;
		}

		cnt--;	// remove stopper key
		fprintf(stderr, " Total keys read %d\n", cnt);
		break;
	}

	// report real, user, and system time in minutes and seconds
	done = getCpuTime(0);
	elapsed = (float)(done - start);
	fprintf(stderr, " real %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	elapsed = getCpuTime(1);
	fprintf(stderr, " user %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	elapsed = getCpuTime(2);
	fprintf(stderr, " sys %dm%.3fs\n", (int)(elapsed/60), elapsed - (int)(elapsed/60)*60);
	return 0;
}

#endif	//STANDALONE

--------------------------------------------------------------------------------